|Home | About | Journals | Submit | Contact Us | Français|
Cathepsin B is a potential target for the development of drugs to treat several important human diseases. A number of inhibitors targeting this protein have been developed in the past several years. Recently, a group of small molecules were identified to have inhibitory activity against cathepsin B through high throughput screening (HTS) tests. In this study, traditional continuous and binary QSAR models were built to classify the biological activities of previously identified compounds and to distinguish active compounds from inactive compounds for drug development based on the calculated molecular and physicochemical properties. Strong correlations were obtained for the continuous QSAR models with regression correlation coefficients (r2) and cross-validated correlation coefficients (q2) of 0.77 and 0.61 for all compounds, and 0.82 and 0.68 for the compound set excluding 3 outliers, respectively. The models were further validated through the leave-one-out (LOO) method and the training-test set method. The binary models demonstrated a strong level of predictability in distinguishing the active compounds from inactive compounds with accuracies of 0.89 and 0.94 for active and inactive compounds, respectively, in non-cross-validated models. Similar results were obtained for the cross-validated models. Collectively, these results demonstrate the models’ ability to discriminate between active and inactive compounds, suggesting that the models may be used to pre-screen compounds to facilitate compound optimization and to design novel inhibitors for drug development.
As members of the papain superfamily,1 cathepsins are involved in many biological processes related to human diseases and disorders.1–4 Previously identified cathepsins include cathepsin B, D, H, K, L, and S. Several of these proteins have been selected as biological targets to develop therapeutic treatments, and a number of inhibitors have been identified and developed for many of these enzymes. Cathepsin B inhibitors are highly sought after chemical agents since many diseases, such as neurodegenerative disorder, cardiovascular disease, cancer, inflammation, rheumatoid arthritis, and Alzheimer’s disease5–12, have been connected with unusual levels or abnormal function of cathepsin B. As an ubiquitous lysosomal cysteine proteases, cathepsin B has been found to be responsible for intracellular as well as extracellular proteolysis in mammalian cells and can facilitate cell migration by dissolving the extracellular barriers, which result in tumor metastasis and angiogenesis.13–15 The biological activity and function of cathepsin B is also important during viral infection and replication for several viruses, such as Ebola, SARS (Severe Acute Respiratory Syndrome) in human cells.16, 17 Due to its important biological functions, which are directly related to several important human diseases, cathepsin B has been chosen as a drug development target in many efforts.
A number of compounds have been found to inhibit cathepsin B activity, and some of these compounds have been tested and are effective in animal experiments.5–12, 18–20 Most of these cathepsin inhibitors disable the biological activity of cathepsin B through forming irreversible covalent chemical bond in the catalytic site of the enzyme. These irreversible inhibitors include dipeptidyl nitriles21, vinyl sulfones, expoxysuccinates, acyloxymethyl ketones, fluoromethyl ketones, hydrazides, and bis-α-amidoketones.22 Structural studies have provided detailed insights into the biological mechanisms of these inhibitors. Inhibitors bind to the catalytic active site of cathepsin B, and form an irreversible covalent bond with the protein in the active site.21, 23 Meanwhile, computational studies, such as docking and virtual screening, were also used to explore the binding and inhibition mechanism of these inhibitors.24–26 These computational approaches have proved efficient and essential for optimizing and designing chemical agents with improved biological activities.27–29
In an effort to identify chemical probes through high throughput screening technology (HTS), a number of chemicals have been found to be active against cathepsin B in the screening campaign for inhibitor identification, with the NIH Molecular Libraries Program (MLP). The screening results were deposited into the PubChem (http://pubchem.ncbi.nlm.nih.gov), a database for molecular structures and associated biological activities which is available to general public. The database contains the 2D structures and the biological activity (IC50) derived from dose-response HTS experiments for the active compounds tested.
In a previous study,30, 31 docking simulation was used to model the binding structures of the active compounds identified by the high throughput screening (HTS) tests to the binding site of the protein. A relative binding affinity was calculated for each compound studied based on the respective modeled bound structure using the linear response molecular mechanics Poisson Boltzmann-surface area method (LR-MM-PBSA). Strong correlations between the calculated binding affinities and experimental biological activities were obtained.30, 31 Three-dimension (3D) Comparative Molecular Field Analysis (CoMFA) quantitative structure-activity relationships (QSAR) models were also established based on the multi-conformation method with high correlation coefficients for these active compounds (to be published). Given the importance of the enzyme in disease treatment and the broad research interest to screen and identify potent inhibitors, efforts have been taken further to build the QSAR models to correlate the molecular and physicochemical properties with the biological activities of the compounds tested in the Cathepsin B inhibition screening experiments. Efforts were also undertaken to build statistical models to classify inhibition activities against Cathepsin B for specific compounds. This work reports the results of the continuous (continuous in bioactivity space) and binary (two discrete bioactivity status, e.g. ‘active’ vs ‘inactive’) QSAR models in order to obtain insight into the relationship of chemical structure and physicochemical properties/biological activity. Additionally, the results of this work may provide means to pre-screen compounds to facilitate in silico design of de novo cathepsin B inhibitors and the development of drugs for the treatment of varies human diseases using some other molecular targets.
The 3D chemical structures of all small molecule compounds were first converted from the 2D molecular structures obtained from the PubChem database (PubChem bioAssay accession: 820 and 523),32 which were subsequently subjected to energy minimization calculation until the root mean square deviation (RMSD) of potential energy was smaller than 0.001, using the Optimized Potentials for Liquid Simulations (OPLS) force field33–36 by the Molecular Operating Environment (MOE) program (Version 2007.09, developed by Chemical Computing Group, Montreal, Canada). In Bioassay AID (Pubchem bioassay accession) 820, 75 compounds were tested by a dose response confirmatory screening experiment, whereas 37 compounds were confirmed to have inhibitory activity, 35 compounds were confirmed to have no inhibitory activity, and the bioactivity outcome for the remaining 3 compounds were reported as “inconclusive”. By further examining these active compounds using different experimental protocol, assay depositors suggested that some of them were likely artifacts. Further investigation using several independent bioassay tests (different biological experiments), which may be affected by similar causes of artifact, suggested that five compounds bear greater chance as being false positives, and they were removed from the current analysis. So, 32 active (37–5) and 35 inactive compounds from bioassay 820 were included in the work. Bioassay 523 employed a similar experimental protocol to the Bioassay 820, where 27 compounds were tested, 10 compounds were confirmed having inhibitory activity, 16 compounds were confirmed having no inhibitory activity, and one (1) compound was reported as “inconclusive”. Four out of the 10 active compounds are unique structures and were added to the active compound list. Total 36 active compounds (32 from bioassay 820 and 4 from bioassay 523) and 51 inactive compounds (35 from bioassay 820 and 16 from bioassay 523) were used in the modeling. The average IC50 reported in the bioassay depositions was used for the bioactivity of the active compounds. The PubChem compound accession (CID), molecular formula, and biological activity (IC50) for the 86 compounds are listed in Table 1 (one active compound was not included in QSAR modeling as no 3D conformer was obtained in docking simulation). Molecular structure and other information can be obtained from PubChem website based on CIDs (http://pubchem.ncbi.nlm.nih.gov) and the bioassay protocol information can be obtained from PubChem website by BioAssay ID (AID) 820 and 523 (http://pubchem.ncbi.nlm.nih.gov/assay/assay.cgi?aid=820 and http://pubchem.ncbi.nlm.nih.gov/assay/assay.cgi?aid=523).32 The compounds tested in the confirmatory assays were cherry picked based on the screening results for over 60,000 compounds in a single dose HTS assay for potential cathepsin B inhibition activity (PubChem AID:453) by measuring the release of the fluorophore aminomethyl coumarin (AMC) from the hydrolysis of an AMC-labeled dipeptide, The large number of inactive compounds from the primary HTS assay were not included in the model since the statistical methods typically do not perform well on unbalanced data sets. Therefore claiming all compounds as inactive will yield an accuracy of 99.94% as pointed out by Weis and coworkers.37
The three dimensional structure coordinate of cathepsin protein with bound Dipeptidyl Nitrile (DPN) was obtained from Protein Databank (PDB code: 1GMY21). The PDB file for the crystallographic structure complex of cathepsin B contains three chains A, B and C. The first two chains form a dimer and were used in docking simulation (Figure 1). Hydrogen atoms and partial charges were added to the protein and then short steps of minimization were performed to relax the newly-added hydrogen atoms and the potential steric contacts in the original PDB coordinates. The minimizations were performed using an OPLS force field following the standard protocol of the Glide program (version 8 in FirstDiscovery suite).38, 39 The minimized protein structure was then used to generate a docking grid which was used to docked all active compounds into the DPN binding site. A docking box (exterior box) of 16 × 18 × 22 Å for the placement of all ligand atoms and a restraining box (interior box) of 8 × 10 × 14 Å for the placement of ligand geometric center were used to generate the docking grid. The boxes were centered at the geometric center of DPN ligand. Default values were used for all other settings within Glide (version 8).
All active compounds were flexibly docked into the active site of the cathepsin protein based on the generated docking grid. All docking simulations were carried out using Glide program. Details of the docking simulation can be obtained from previously reported work.30, 31, 40 Docking simulation was used to determine the “active” conformation of each compound according to the structural and physicochemical properties of the binding site. The 3D conformation for each inactive compound was modeled by flexibly aligning it to the 3D conformer of DPN, the ligand in the crystal complex. The alignment was carried out based on molecular shape (volume) and pharmacophore elements using the flexible alignment module implemented in MOE, Default parameters were used for the flexible alignment. The conformation for each active compound was also obtained using the same alignment for the comparison with the conformation obtained from docking simulations in QSAR modeling.
Partial Least-Squares analysis (PLS) method was used to conduct statistical analysis and to derive a continuous QSAR model based on the descriptors calculated for each inhibitor. The PLS method has been proved to be a priority method for such statistical analysis especially in the case where the number of descriptors is large. All PLS calculations were carried out with the module implemented in MOE. The log10IC50 value was used as a dependent variable and the calculated molecular and physicochemical properties were used as independent variables for the analysis. Over 50 descriptors were calculated and preliminary factor analysis was conducted to obtain the optimum correlation between each individual descriptor and the biological activity (logIC50). The nine descriptors listed in Table 2 were selected and used to build QSAR models. An optimal number of components (9 components for continuous models and 8 for binary models) were used to build final models.
Model validation was carried out by the Leave-one-out (LOO) cross-validation procedure and training-test set approach in which the compounds were split into training sets for model building and test sets for which activities were predicted by the built model. Both continuous and binary models were validated by LOO and three (3) training-test experiments in which nine compounds were randomly selected as test compounds and all other compounds were used as training set for QSAR model building. The LOO technique provides a good way to quantitatively evaluate the predictive ability and robustness of a model by predicting each compound s activity using a QSAR model built based on information of the remaining compounds, which avoids the effect of a compound on its own activity prediction. This approach tries to eliminate the over-fitting problem normally existing in conventional regression method which includes all compounds in model construction. A cross-validated correlation coefficient (q2) was used to measure the predictability of a model and the conventional correlation coefficient (r2) was used to measure the quality of a model.
In order to calculate the physicochemical and molecular property-based descriptors to build QSAR models, 3D confirmation for each studied compound was needed. The 3D structures of these compounds were first converted from their 2D structures. As stated in the method section, totally 36 active compounds and 51 inactive compounds with unique structures were obtained from two confirmatory bioassay data entries (PubChem AID 820 and 523). Docking simulations and molecular alignment methods were applied to obtain active 3D confirmers for the 36 active compounds. In the docking simulations, the active compounds were docked into the active site of cathepsin B using the same methodology and confirmation determination procedure as described previously.30 Reliable docked poses were obtained for all but one compound. Detail results and discussion on conformer selection and comparison for most compounds can be found in previous report.30 As a result, the 35 compounds with docking results were included in the QSAR modeling. The docked cluster of the active compounds is depicted in Figure 2. It is seen that all these compounds were docked in the binding site with comparable binding conformers and roughly the similar locations.
In the molecular alignment, all 86 (active and inactive) compounds were superposed onto DPN using flexible molecular alignment method. The best fitted conformation of each compound was selected. The fitting was scored based on molecular shape (occupied volume) and pharmacophore matching which includes hydrogen donor, hydrogen acceptor, hydrophobic atoms, polar hydrogen atoms, and aromaticity of the two molecules. The 3D conformation for each compound was selected based on the best fit from MOE. Two separate sets of molecular structures were prepared, one based on docked structures for active compounds and molecular alignment for inactive compounds and the other based on the molecular alignment for all compounds. The two sets of 3D structures were used to build QSAR models for comparison and they yielded the very similar results. The information, including PubChem compound accession, e.g. CID, molecular formula, SMILES (Simplified Molecular Input Line Entry Specification) string, molecular weight (MW), and the activity (IC50), as obtained from PubChem BioAssay database (AID: 820 and 523) of all 86 compounds used in the work are listed in Table 1. To build a binary QSAR model, the activities of the active and the inactive compounds were assigned as of 1 and 0, respectively.
Over 250 descriptors are available with MOE QSAR module. Initial selection for the descriptors was attempted empirically based on the nature of the descriptors, the features of the studied molecular set, and the problem studied. Those which are not relevant to the modeling, such as the total energy, heat of formation, the numbers of bromine, boron, fluorine, or phosphorus atoms, were eliminated from further consideration in the work. Over 50 descriptors of physicochemical molecular properties were selected and calculated for all studied compounds based on the constructed 3D conformations using the methods implemented in MOE (version 2007.09). Factor analyses were performed to examine the correlation between each individual descriptor and the biological activity (logIC50) of these compounds. The nine descriptors that demonstrated an apparent correlation with the observed biological activity were chosen to build QSAR models, which are listed in Table 2. Among these descriptors, Nacc+don is the sum of the counts of hydrogen acceptor and donor. Apol, MR, and Dipole are the sum of atomic polarizabilities,41 molar refractivity, and the dipole moment of a molecule calculated from the partial charges of the molecule, respectively. SA and SApol are the water accessible surface areas (WASA) of whole molecule and the hydrophilic part of WASA calculated using a radius of 1.4 Å for the probe, respectively. Vol, Esol, and logP(o/w) are the molecular volume, the empirical solvation energy calculated based on OPLS force field, and the log value of octanol/water partition coefficient of a compound, respectively. Most of these descriptors are related to the molecular solvation properties to some degree. Calculation of molecular solvation or partition of a compound between hydrophobic and hydrophilic compartments still remains challenging. Combination of various computational methods would be a better choice to estimate the chemical s partition property which is well regarded to have direct and large effect on its biological activities.
Conventional QSAR models were constructed for the 35 active compounds based on the measured biological activities (logIC50) using the PLS regression method. A model for the entire compound set was first built to examine model coherence of all compounds and to identify the potential outliers. The model was then validated using the LOO method and three (3) training-test sets to examine the predictability and robustness of the QSAR models. To extend the prediction spectrum to include inactive compounds, seven (7) inactive compounds were randomly chosen and included in the models. The inhibitory activity (reading) of these compounds at the max tested concentration (50 μM) was close to the reading of the control experiments. It was assumed that their IC50 (if exists) was much high than 50 μM. To enable the numerical analysis, a value of −3.00 was assigned as the IC50 of these five inactive compounds in the QSAR analysis.
The predicted results and prediction errors for the model based on all 42 compounds (All Model) are listed in Table 3. Two thresholds, one and half units, respectively were used to validate the calculated results. The former corresponds to a ± 10 fold variation ranges or with a range of a single order of magnitude on each direction from the experimental value of IC50 and the latter corresponds to ± 3.33 fold variation ranges or within a window of a single order of magnitude on IC50. The non-validated model based on 9 descriptors (properties) showed strong ability to predict the biological activities with errors (difference between the predicted value and experimental value) of smaller than one unit for all 42 compounds. Twenty six compounds (62%) have predicted errors smaller than or equal to 0.5 unit by the model. Model evaluation using the LOO validation method generated satisfactory results. The observed activities vs. the predicted activities by the non-validated fitting model and cross-validated model are plotted in Figure 3A. By further checking the predicted results of the cross-validation model, it was noted that the three compounds (compounds 24, 26 and 34) had predicted errors close to one unit. When these three compounds were treated as outliers and excluded from the model construction, the new model (3 Outlier Model) consisting of the remaining 39 compounds demonstrated an improved quality. The predicted activities for these 39 compounds are listed in Table 4. All compounds have predicted errors smaller than one unit in the non-validated model and 37 compounds have predicted errors smaller than on unit in the cross-validated model. Twenty seven and more than half compounds have predicted errors smaller than half (0.5) unit according to the non-validated and cross-validated results, respectively. These cross-validation results demonstrated the strong predictability of the constructed models for the application of cathepsin B inhibition activity prediction.
To further evaluate the model s predictability and robustness, three additional models were built using the training-test-set method, in which the compounds were randomly split into a training set and test set. In this work, six compounds were randomly selected for the test set and this process was repeated three times. The predicted results for the three models are listed in Table 5–7. The plots of the observed activities vs. the predicted activities for the training compounds and testing compounds are shown in Figure 3B--2D.2D. It is seen that the predicted results for the testing compounds are close to their experimental activities with reasonable deviations. All three models based on the training set of compounds demonstrated similar predictability results.
As a summary, the non-validated regression coefficient, cross-validated coefficient, and root mean square error (RMSE) for the All Model are 0.771, 0.608, and 0.483, respectively (Table 8). The non-validated, cross-validated coefficients, and RMSE for the 3 Outlier Model are 0.821, 0.683, and 0.424, respectively (Table 8). The statistics results for the three training-test models are also listed in Table 8. All three models based on the training set of compounds have similar statistical properties as the All Model described above, with non-validated correlation coefficients of 0.804–0.807 and cross-validated coefficients of 0.637–0.675. The predicted errors for all test compounds but one (compound 34) in the three training sets are smaller than one unit. The Model 1 and 3 (Table 5 and and7)7) has two test compounds predicted with errors larger than half unit and the Model 2 has the most compounds predicted with errors larger than half unit. These results show that the predictability of these models is acceptable. Considering the fact that the compound structures and bioactivities studied are reasonably diverse, it is reasonable to conclude that the models demonstrated relatively strong prediction power for estimating the activity of the “unknown” compounds based on the known” compounds.
The availability of confirmed active and inactive compounds provides an opportunity to develop binary models for classifying active cathepsin B inhibitors from inactive compounds. Although modern HTS technology allows rapid screening of tens of thousands of compounds in a matter of a few days, such experiments are expensive, and can hardly cover all chemical space. In silico screening helps to eliminate potentially inactive compounds and to build compound libraries with enriched promising lead candidates. Thus in silico screening is a cost efficient strategy to facilitate the identification of potential drug candidates. Such prescreening provides a way to eliminate potentially inactive molecules in novel compound design for synthesis and make the limited experimental resources available for screening only the potentially active molecules.
The same nine descriptors used in the continuous models were also used to build binary QSAR models relying on the method developed by the Chemical Computing Group which was implemented as a module in MOE.42, 43 The binary QSAR model was developed as an economic QSAR model to be used distinguish active vs. inactive compounds in HTS experiments to assist model drug development. All binary models were built using the binary method module of MOE.42, 43 Similarly to the process of continuous model construction, the “All Binary Model” was first built with all 86 compounds to examine the predictability of the model for the “inside” compounds, the compounds used to build a model, and followed by validations using the LOO method and three training-test sets to examine the predictability and robustness of the QSAR models for “outside” compounds. The predicted results for the All Binary Model are listed in Table 9 and the statistical results are listed in Table 13. All 86 compounds, except for four active and three inactive ones, were predicted correctly with this non-validated model (Table 9). The prediction accuracies (Table 13) for all compounds are 0.919, and for active and inactive compounds are 0.886 and 0.941 respectively. This demonstrates the strong classification power of this model for discriminating inactive compounds from active compounds. In addition, four cross-validated models were built, one from LOO and three from the training-test-set method. In the latter three models, the compounds were randomly split into training set (80 compounds) and test set (6 compounds). The four models were designated as “Binary Model”, “Bin. Train-test 1”, “Bin. Train-test 2”, and “Bin. Train-test 3”, and the predicted results from these four cross-validated models are listed in Tables 9–12. The three training-test models produced consistent results that are similar to those from the LOO-based Binary Model. For the 80 training set compounds, the non-validated prediction accuracies for all compounds are 0.906–9.25, and for active and inactive compounds are 0.906 and 0.938, respectively. The cross-validated prediction accuracies for all compounds are 0.900, active and inactive compounds are 0.844 and 0.938, respectively. For the six test compounds, five compounds were predicted correctly, the prediction accuracies for the randomly selected test compounds in all three training-test models are 0.833 (Table 13). These validated models clearly demonstrated their abilities to distinguish active vs. inactive compounds with high efficiency and accuracy, which indicates their strong potential for in silico screening of small molecules.
It is noted that the prediction accuracies for active compounds were slightly, but consistently, lower than those for the inactive compounds. Such phenomenon could arise from the unbalanced nature of the data set regarding to bioactivity classes (active and inactive in this work). The larger population of inactive compounds over that of the active ones shifts the splitting point used to determine the active vs. inactive to the active side. Such shift results in more active compounds at the border line of being classified as “inactive”. To solve the problem, one could design a weighting schema to “overestimate” the type of compounds that are underrepresented in the data set. In many cases, however, such unbalance would not cause a severe effect on the application of the method when the focus is to select a molecule with a given property, such as eliminating inactive molecules to further develop active compounds in designing cathepsin B inhibitors.
The availability of molecular structures and their inhibition activity against cathepsin B identified through HTS experiments provided an opportunity to conduct a structure-activity relationship study to facilitate chemical probe development, and potential drug development for related disease treatment. The “active” conformers of active compounds were modeled using docking simulations, and the conformations of the inactive compounds were determined based on flexible alignment of each compound to the 3D conformer of the original ligand (DPN) in the crystal structure complex. Conventional continuous models were built for predicting cathepsin B inhibition activity of active small molecules. Binary models were constructed for the “active” vs. “inactive” classification. Nine (9) molecular and physicochemical properties were calculated based on 3D conformational information, and were selected and used as descriptors to build the QSAR models.
The continuous models demonstrated reasonable correlations between the predicted and the observed activities with non-validated (r2) and cross-validated (q2) regression correlation coefficients of 0.771 and 0.608 for all compounds, and 0.821 and 0.683 for the compounds excluding 3 outliers, respectively. The model showed strong predictability with reasonable predicted error rates over the entire test set and across several models. Almost all the 42 compounds used in the model have predicted errors smaller than one unit, while the majority of the compounds have predicted errors less than a half unit from both the non-validated and cross-validated results. The predictability of the models were further validated using the three training-test methods with the results that the predicted errors for all test compounds are smaller than one unit, while a significant fraction of them are less than a half unit in all three models together.
The consistent performance of the four cross-validated models demonstrated that the models are robust in predicting the inhibition activities of small molecules for the biological system of cathepsin B. The success of the QSAR models proves that the modeling approach based on docking conformation determination and the nine physicochemical properties may be a promising method to predict biological activity of a small molecule for drug development based on the knowledge derived from the identified cathepsin B inhibitors.
Furthermore, binary models were built for classifying cathepsin B inhibition activities, e.g. active vs. inactive categories. The statistical results showed that the models have demonstrated excellent performance with the prediction accuracies for bioactivity classes of 0.919, 0.886, and 0.941 for all 86 tested, 35 active, and 51 inactive compounds in the non-validated binary models, respectively. All compounds, except for four active and three inactive compounds, were correctly predicted by the non-validated fitting model. Similar prediction results were obtained with the cross-validated model. With the three training-test models, the bioactivity categories for the five out of six tested compounds were predicted correctly, with a prediction accuracy of 0.833. The factor that all of these models yielded high accuracies upon the prediction of cathepsin B inhibition activity demonstrates that such models enable the classification of active vs. inactive inhibitory compounds with high accuracies. The robustness and feasibility of this binary method indicates its potential in several applications to facilitate molecular design and drug development. In particular, this approach may prove highly efficient for pre-screening in silico designed compounds to optimize the biological activities based on the information of known bioactive compounds. Such in silico pre-screening may be performed to filter out the ones that are unlikely to have the desired biological activities, thus to allow the limited efforts to be focused on the highly promising candidates. With such advantages, we anticipate the utilization of the combination of the continuous and binary classification models will potentially benefit modern drug development for therapeutic purposes.
This research was supported by the Intramural Research Program of the NIH, NLM. The authors would like to thank the NIH Fellows Editorial Board (FEB) for reviewing the manuscript and the developers of Pymol software for sharing the program to prepare the molecular figures used in this paper.