Search tips
Search criteria 


Logo of nihpaAbout Author manuscriptsSubmit a manuscriptHHS Public Access; Author Manuscript; Accepted for publication in peer reviewed journal;
J Mol Graph Model. Author manuscript; available in PMC 2012 September 1.
Published in final edited form as:
PMCID: PMC3167229

Multi-conformation 3D QSAR study of benzenesulfonyl-pyrazol-ester compounds and their analogs as cathepsin B inhibitors


Cathepsin B has been found being responsible for many human diseases. Inhibitors of cathepsin B, a ubiquitous lysosomal cysteine protease, have been developed as a promising treatment for human diseases resulting from malfunction and over-expression of this enzyme. Through a high throughput screening assay, a set of compounds were found able to inhibit the enzymatic activity of cathepsin B. The binding structures of these active compounds were modeled through docking simulation. Three-dimensional (3D) quantitative structure-activity relationship (QSAR) models were constructed using comparative molecular field analysis (CoMFA) and comparative molecular similarity indices analysis (CoMSIA) based on the docked structures of the compounds. Strong correlations were obtained for both CoMFA and CoMSIA models with cross-validated correlation coefficients (q2) of 0.605 and 0.605 and the regression correlation coefficients (r2) of 0.999 and 0.997, respectively. The robustness of these models was further validated using leave-one-out (LOO) method and training-test set method. The activities of eight (8) randomly selected compounds were predicted using models built from training set of compounds with prediction errors of less than 1 unit for most compounds in CoMFA and CoMSIA models. Structural features for compounds with improved activity are suggested based on the analysis of the CoMFA and CoMSIA contour maps and the property map of the protein ligand binding site. These results may help to provide better understanding of the structure-activity relationship of cathepsin B inhibitors and to facilitate lead optimization and novel inhibitor design. The multi-conformation method to build 3D QSAR is very effective approach to obtain satisfactory models with high correlation with experimental results and high prediction power for unknown compounds.

Keywords: multi-conformation QSAR, 4D QSAR, Docking, CoMFA, CoMSIA, cathepsin B, inhibitors, regression, partial least squares, PubChem


Cathepsins, as lysosomal peptidases, are responsible for intracellular as well as extracelluar proteolysis in mammalian cells. Several cathepsins (such as cathepsin B, D, H, K, L, and S) have been found to be involved in the biological processes related to a variety of diseases and disorders.[14] cathepsin B, a member of the papain superfamily,[3] is one of the ubiquitous lysosomal cysteine proteases. Researchers have found that over expression of this enzyme is related to several important human diseases, including neurodegenerative disorder, cardiovascular disease, cancer, inflammation, rheumatoid arthritis, and Alzheimer’s disease.[512] Cathepsin B has also been shown to play an important role in tumor metastasis and angiogenesis by facilitating cell migration and dissolving the extracellular barriers.[1315] Furthermore, it is an essential factor to viral entry and replication of several viruses, such as Ebola and SARS in human cells through its proteolysis functions.[1618] Since cathepsin B has been shown to play an important role in various human diseases, this enzyme was selected as a therapeutic molecular target for drug development. A number of inhibitors have been developed for the disease treatment,[1921] including several inhibitors that were effective against rheumatoid arthritis in animal models.[510, 12, 1921]

Biological studies have demonstrated that many cathepsin inhibitors bind to the enzyme irreversibly.[22, 23] The structural biology studies revealed that such inhibitors bind to the active site by forming a covalent bond with the catalytic thiol group (Cys30) of the enzyme.[22, 23] The reported irreversible cathepsin inhibitors include dipeptidyl nitriles,[23] vinyl sulfones, expoxysuccinates, acyloxymethyl ketones, fluoromethyl ketones, hydrazides, and bis-α-amidoketones.[24] However, some other inhibitors have been reported to form reversible covalent bond with the enzyme, such as α-ketoamides and aldehydes.[24] The experimental three dimensional structure shows that inhibitor dipeptidyl nitrile (DPN) binds at the active site, formed by amino acids Gln24, Cys27-Trp31, Gly69, Asp70, Asn73-Pro77, Ala174, Gly199-Ala201, and Glu245, close to the interface of the protein dimer. The inhibitor forms a covalent sulfur bond with protein amino acid Cys30.[22, 23]

In addition to experimental studies, computational methods have been used to study the binding structures of active inhibitors with the protein with the purpose to characterize interaction features and to interpret inhibition mechanism of small molecules for several members of the cathepsin class[2527] including cathepsin L[28, 29] and cathepsin S.[29] The authors also performed a computational study to establish the interaction and activity mode between these inhibitors and cathepsin B.[30, 31].

CoMFA and CoMSIA 3D QSAR analysis have been applied in numerous molecular systems to extract useful information from compounds with bio-activity for lead optimization and design. The 3D modeling process requires a predetermined conformation for each compound and an overall molecular alignment among the compounds. Several strategies for building compound alignments have been suggested and successfully applied to 3D QSAR research. Yet, the determination of “active” conformation still remains challenging. Therefore, docking simulation has been applied to determinate the “active” conformation and to perform compound alignment. However, multi-conformations were often obtained for a single compound, thus usually a conformation had to be picked based on simplified rules or assumptions. In the light of the emergence of the 4D QSAR concept,[32] an approach involving multi-conformations QSAR was developed and evaluated in the work for the construction of 3D QSAR model. This approach utilized multi-conformations for each small molecule in QSAR model construction and optimization with the purpose of eliminating manual conformation selection and obtaining best correlations.

Materials and Methods

Molecular structures and Bioactivity

The three dimensional structure coordinate of cathepsin protein with bound Dipeptidyl Nitrile (DPN) was obtained from the protein databank (PDB code: 1GMY[23]). The PDB file for the crystallographic structure complex contains three chains (A, B, and C) of cathepsin B and 3 copies of the small molecule ligand. Chains A and B form a dimer and the coordinates of this dimer were used in this study for the 3 dimension (3D) quantitative structure-activity relationship (QSAR) model construction.

The two dimension (2D) chemical structures and the biological activities of the small molecules were taken from the PubChem database website ( The biological activity of these compounds was measured based on a dose response confirmatory screening experiment (PubChem BioAssay ID: 820 and 523 at),[33] where the inhibitory activities for a total of 75 and 27 compounds were tested in assay 820 and 523, respectively, by the Center for Molecular Discovery of University of Pennsylvania. In the confirmatory biological screenings, compounds were reported as active if they have inhibitory activity against the catalytic functionality of the human liver cathepsin B in a multiple-dose response experiment by measuring the release of the fluorophore aminomethyl coumarin (AMC) from the hydrolysis of an AMC-labeled dipeptide. Detailed information for the assay protocol and materials is available on the PubChem website (,523). These tested compounds were identified in a HTS assay for the inhibition activity of over 60,000 compounds against cathepsin B where a single dose of compound concentration was tested. Thirty seven (37) were identified as active compounds, 35 compounds were reported as inactive compounds, while 3 compounds were suggested as “inconclusive” in the results of assay 820. Meanwhile, the numbers of active, inactive, and “inconclusive” compounds reported in assay 523 are 10, 16, and 1, respectively. Six (6) active compounds were reported in both assays. So the number of unique compounds reported as active are 41. Through further examining these active compounds with different protocols, the experimental authors suggested than some of them could be artifacts. To eliminating the potential artifacts, the authors examined these active compounds using several independent bioassay tests (different biological experiments) that were affected by the similar causes of artifacts. The examination resulted in the elimination of five active compounds from our active compound list as they bear greater chance as being false positives. The final 36 active compounds were included in the work. To expend the activity range of the modeled compounds, 7 inactive compounds with the similar sizes and structures were added in the modeling. In training-test experiments, 8 compounds were randomly selected as test set of compounds while the rest 34 compounds were used as training set of compounds to build models.

The 3D structures of these compounds were constructed from their 2D structures using the Molecular Operating Environment (MOE) program (Version 2007.09, developed by Chemical Computing Group, Montreal, Canada). The structure of each organic molecule was minimized until the root mean square (RMS) deviation of energy is smaller than 0.01 using optimized potential for liquid simulation (OPLS) force field.[3437] The structure and biological activity of these active compounds are listed in Table 1.

Table 1
The molecular structure, property, and biological activity of 35 active and 7 inactive compounds.*

Small molecule alignment by Docking Simulation

Docking simulation was used to determine the “active” conformation of each compound and align the small molecules according to the structural and physicochemical properties of the binding site.

The cathepsin B structure was prepared following the standard protocol of GLIDE docking program (version 8 in FirstDiscovery suite)[38, 39] by adding hydrogen atoms and assigning partial charges using the OPLS-AA force field.[3437] A coarse energy minimization with a small number of steps was performed to the complex structure to relax side chain amino acids and the added hydrogen atoms. A docking grid was generated based on the refined protein structure with a geometric center at the center of the original ligand and the dimension of the docking grid was defined to enclose the whole binding site.

The ligand compounds were docked into the binding pocket of the protein receptor using the GLIDE program. Conformational flexibility of the ligand was considered by generating rotamers from explicit rotation of single-bond torsion angles. All rotamers for a given ligand were docked and oriented in the site, and grid-based refinement was then executed for 5000 poses. The 800 top refined poses for each ligand were subjected to 400 steps of conjugate gradient energy minimization. Poses of a given compound were saved if they differed from those obtained previously by more than 0.5 Å RMS deviations in heavy atom coordinates, or involved atomic displacements by greater than 1 Å. The flexibility of the receptor was not included in GLIDE docking. The docking poses were ranked by Gscore (GLIDE score) and the top five poses were selected and used for the QSAR model construction. The docking simulation followed the similar dockings reported previously in detail.[30, 40, 41]

Alignment and multi-conformation selection

Docking simulation was performed to determine the “active” conformation of each ligand and to align the compounds together according to their and the receptor’s structural and physicochemical properties and the top five poses were selected for each ligand from the docking results, which were included in the initial dataset for the QSAR model construction. A cluster of 5 conformers (multi-conformation) were initially included in the QSAR modeling. The selection of the final conformation for each compound was done by identifying the conformation that correlates the best with other compounds, a process that involving multiple-steps of iterations. In each step, the conformation with the lowest correlation coefficient with all other compounds was eliminated from the list. Through the iteration, the conformation with the highest correlation coefficient was retained at the end, one for each compound. The last one conformation of each compound was then used to build the final QSAR model.

QSAR modeling by statistical analysis

Tripos Sybyl 7.0 program was used to conduct CoMFA and CoMSIA QSAR modeling. The Partial Least-Squares analysis (PLS) method was used to conduct statistical analysis and to derive a QSAR model based on the field values of the CoMFA and CoMSIA descriptors. The CoMFA standard scaling and column filtering of 5.0 were used in PLS analysis.

Leave-one-out (LOO) cross-validation in PLS was used to obtain the optimum number of components (ONC) for building the regression models, where ONC is defined as the number of components used to construct model with the smallest cross-validated standard error (or the number giving the largest value of q2, as they are consistent most of the time). The LOO technique predicts the activity of each compound using a QSAR model built by excluding that particular compound, providing a good way to quantitatively evaluate the internal predictive ability of a model. By literarily calculating the activity for all compounds in a data set, the LOO attempts to evaluate the robustness of a model and lower the over-fitting effect that could exist in conventional regression analysis due to the inclusion of the compound information in its prediction. The quality of a model is expressed as the cross-validated correlation coefficient q2.

The optimum number of components obtained by cross-validated regression was then used in all conventional fitting regression to derive the final QSAR models including all of the compounds (without cross-validation). Four methods including LOO, cross-validation, Bootstrapping validation, and training-test sets were used for model cross-validations. The Bootstrapping method performs a statistical simulation by randomly generating many new data sets from the original one and calculating statistical parameters for the original set and each of the new sets (see Tripos Sybyl manual on CD for details). The cross-validated correlation coefficient (q2) calculated from these cross-validations were used to evaluate the predictability and robustness of a model and the conventional correlation coefficient (r2) is used to measure the quality of the model.

Results and Discussion

Docking validation

The docking protocol was first validated by reproducing the binding structure of the DPN ligand to active site of cathepsin B protein in the experimental crystal complex (PDB code: 1GMY[23]). The dimer coordinates were extracted from the PDB file and used for the docking simulations. In the original crystal complex structure, the DPN ligand is covalently linked to the thiol group of amino acid Cys30 of cathepsin B.[23] Prior to the docking grid preparation, the S-C bond between Cys30 and DPN was first broken to separate the ligand from the protein, and the N-C group of the ligand was modified into linear nitrile, a pre-reaction format. A docking grid was then generated based on the protein dimer using GLIDE after DPN ligands were removed from the active sites. To validate the docking protocol, the DPN ligand was then docked back into the active site of the first protein of the dimer. As depicted in Figure 1, the docked pose from the dominant cluster[40] superimposes very well with the original ligand except that the methyl-phenyl ring close to G198 rotated for 180 degrees, and as a result, the methyl group turns to the opposite direction. It seems that it is difficult for GLIDE to prioritize the original conformation, either due to lack of a structural element to distinguish the two, or because the experimental conformer does not provide enough binding elements to be recognized by the force field employed in GLIDE. Nevertheless, the overlap RMS deviations between the docked poses and the original pose are ~1.2Å. If the 6 carbons of the methyl-phenyl ring are not distinguished, the RMS deviations will be as low as 0.4 Å. In the docked structure, it is noticed that the nitrile group has potential to form hydrogen bond with H199 amino acid. The hydrogen bonding helps to stabilize the position of the nitrile group that C29 amino acid attacks to form covalent bond. The results show that the docking simulation protocol is able to reproduce the ligand binding structure as observed in the experimental crystal complex. The same docking protocol was applied to determine the “bound” conformations for the cathepsin B inhibitors, as well as, to align these compounds together according to the receptor features for the QSAR study.

Figure 1
The docked pose (stick-ball mode in cyan) of DPN compared with the experimental structure of the inhibitor (stick-ball mode in grey) in the crystal complex (PDB code: 1GMY). The key interacting residues are shown in stick mode. The red dot lines indicate ...

Conformation determination and alignment construction

When applying the described docking protocol to this set of compounds, binding structures were obtained from the docking simulation for 43 of them. The docking simulation did not output any pose for one compound. The results of these 42 compounds were included in the QSAR study. The molecular formula, name, PubChem compound ID (CID) and the activity (IC50) of these compounds as obtained from PubChem BioAssay database (AID: 820 and 523), are listed in Table 1.

It was noted that multiple docking poses were obtained for all these compounds especially for smaller compounds that can be docked into the binding pocket with different modes and only slight differences on docking score (Gscore). As shown in Figure 2A, two poses of compound 24(CID:326798) obtained from the docking simulation have different binding structures, where the whole molecule flip over with the triazole ring directs to opposite directions, but with the two different orientations, the molecule fits well into the binding site and align well with the experimental bound structure of DPN. b. Interestingly, with the respective binding modes reflected by the two docking poses, the triazole ring either had direct hydrogen bond/ionic interaction with amino acid E122 of the first protein or with amino acid S175 of the second protein. Since both binding modes have strong interactions with one part or the other of the protein dimer, it was difficult to prioritize one over the other merely based on docking score or general knowledge. The similar phenomena have been observed in other docked compounds. This uncertainty is sometimes common in docking simulations, and selection among the multi docked poses for one compound is the key to the success of QSAR and other computational studies. In the docking simulations, similar binding positions and orientations have been observed among the poses of the similar compounds. As exampled in Figure 2B, the docked poses of compound 16 (CID: 660829) and 24 were superposed very well. Also their docked structures are well aligned with DPN. How to determine which theoretically derived binding mode is the closest to the status in the biological environment is a challenge. 4D QSAR was suggested to tackle the complication/difficulty on the conformer selection for each compound by including multi-conformations of each small molecule into QSAR model construction. In the light of this method, a multi-conformation 3D QSAR approach is introduced in the work to model the 3D QSAR models of cathepsin inhibitors. The conformers of each compound were generated from docking simulations and multi-conformers were included for each compound in QSAR constructions. Multi-step iteration was adopted to eliminate the abundant conformers to get the final models. To balance the accuracy, complexity, and computer time consumption, we included five top poses for each compound based on the Gscore in this work to evaluate the multi-conformation QSAR concept.

Figure 2
The superposition of the docked poses of the compounds with the experimental structure of the DPN inhibitor (in grey) in the crystal complex (PDB code: 1GMY). The ligands are depicted in stick-ball mode and key interaction residues are in stick mode. ...

QSAR model construction

The five selected poses of each compound were initially included in model constructions. CoMFA and CoMSIA descriptors were calculated separately to build models. Based on the calculated field values, the correlation of each sample (incident, row) with all the rest were analyzed and evaluated using PLS statistical methods, and the least correlated pose for each compound was eliminated from the dataset. The PLS method has been used in numerous applications in correlating the activity with various physicochemical properties, especially when the number of descriptors is larger than the number of samples (incidents, row) in which a traditional multi-regression is not suitable. The PLS regression tries to build a correlation between a dependent variable (normally an activity) and several independent variables (property descriptors).

The same statistical analysis process was iterated to eliminate extra conformations for each compound step by step. In each step, the docked conformation with least correlation to other compounds was removed for each compound until a single pose was left for each of the compounds in the optimized dataset. The final optimized set with single conformer for each compound was used to construct the final models. The superposition of the selected docking pose for the 36 active compounds is shown in Figure 3. It was seen that all these compounds aligned well against each other based on their molecular similarity and fit properly into the binding sites considering the structural feature (shape) and the physicochemical properties (pharmacophore) of the binding site. In this approach, the binding site information of the receptor was integrated into molecular alignment of small molecule ligands and was further integrated into 3D QSAR construction. The alignment based on the binding site focuses on the overall molecular fit and pharmacophore overlay, which resulted in a different image of compound superposition comparing to that derived by atom fitting as shown in Figure 3, in which certain common molecular fragments used for alignment were over emphasized in fitting. Studies have showed that the docking alignment based on this approach resulted in better results than atom fit alignment method.[42]

Figure 3
The cluster of all compounds docked in the binding site of cathepsin B. The inhibitors are represented as stick mode in different colors. The original DPN ligand is shown in stick-ball mode with grey color. The proteins of the dimer are depicted as ribbon-cartoon ...

The statistical analyses for QSAR model construction were carried out based on the clusters of molecules selected as described earlier. An optimum numbers of components were first determined separately by cross-validation regression for both the CoMFA and CoMSIA models, which were then applied in the final analysis in the subsequent study. Both the CoMFA and CoMSIA models demonstrated good statistical results as shown in Table 2 (Whole set). The regression coefficient (r2) and the cross-validated coefficient (q2) of the QSAR model constructed by CoMFA are 0.999 and 0.605, respectively. The corresponding coefficients for the CoMSIA model are 0.997 and 0.605, respectively. There is no apparent difference in the quality of the two models, although results from the CoMSIA model show slightly higher cross-validated correlation coefficients than those from the CoMFA model, which indicates moderately enhanced predictability of the CoMSIA model. The standard errors of estimate for the two models are comparable as 0.012 and 0.054, respectively. In the CoMFA model, the steric descriptors make marginally larger contribution to the model than the electrostatic descriptors (0.568 vs. 0.432). In the CoMSIA model, the hydrophobic descriptors are the largest contributor and the electrostatic descriptors are the second followed by hydrogen acceptor and steric descriptors. No apparent correlation was noticed between these descriptors and experimental activity (logIC50), so hydrogen donor descriptors were not included in the model construction. The result that the hydrophobic descriptor is the largest contributor for the CoMFA QSAR models is consistent with the observation that more than half of the binding site area is occupied by hydrophobic groups (will be discussed in following section as shown in Figure 8).

Figure 8
The surface properties of the binding site represented by colors: Magenta – hydrophilic, green – hydrophobic, red – exposed. The proteins are depicted with cartoon mode, the primary protein of the dimer on left is colored in red ...
Table 2
Summary of statistics and field contributions calculated using different methods for the best CoMFA mode.

The predicted activities and residues (the difference between the predicted and the observed activities) by different statistical methods are summarized in the Table 3 and and4.4. Four statistical methods, fitting regression (non-validated), LOO validation, cross-validation, and Bootstrapping validation were applied based on both CoMFA and CoMSIA models.

Table 3
The predicted bioactivity by CoMFA models vs. the observed bioactivity.
Table 4
The predicted bioactivity by CoMSIA models vs. the observed bioactivity.

The results for the CoMFA models (Table 3) show that the calculated descriptors were fitted precisely with the experimental activity (logIC50) with the residues between the experimental data and the predicted value less than 0.05 for all compounds (Figure 5). Most compounds (39/42, 93%) had residue values less than 0.03. The prediction errors of the cross-validation method are slightly higher than those of the fitting regression method. Except three outliers, all other 39 compounds have predicted errors less than 1.0 in the cross-validated results. The Bootstrapping validation produced similar results as the fitting regression. The overall low prediction error rates from the four methods indicate that the models are robust and reliable with strong predictability when applied to this set of compounds. The scattered plots (Figure 4A) of the predicted activities vs. the observed activities (logIC50) based on the CoMFA models show that the fitting regression produced nearly perfect results as the data dots for the cross-validated results are somewhat scatted away from linear trend line (perfect case).

Figure 4
The plot of predicted vs. observed bioactivity (logIC50) for compounds predicted from the models. A: Predicted activities by CoMFA modes. B: Predicted activities by CoMSIA models. C: Predicted activities using Bootstrapping methods based on CoMFA and ...
Figure 5
The plot of residues of the predicted activity compared to observed activity (logIC50) for compounds based on the models. The blue bars are the results of CoMFA results and the red bars are results of CoMSIA modes. A: regression (fitting) mode. B: cross-validated. ...

The predicted results for the CoMSIA models are listed in Table 4. The calculated CoMSIA descriptors were fitted well with the observed activity (logIC50) in the fitting regression model. The overall prediction errors for all compounds are less than 0.2 (Figure 5). Except for two compounds, the prediction errors for all other compounds are less than 0.1. The Bootstrapping validation also produced competent results with error less than 0.2 for all compounds, while the cross-validated results produced errors less than half (0.5) unit for 30 (71%) compounds. The plots of the predicted vs. observed activities for the CoMFA and CoMSIA models are depicted in Figure 4A and B, which show that the predicted results of the CoMSIA models are similar to those of the CoMFA models. The predictions of the Bootstrapping validation for the respective CoMFA and CoMSIA models are compared and described in Figure 4C. It is seen that the Bootstrapping regression produced a very similar results as conventional regression that yielded very accurate predicted activity compared with experimental results in the work. All these validations demonstrated that both the CoMFA and CoMSIA model have strong predictability for the experimental activity of these 36 compounds.

To further evaluate and validate the models, a training-test set approach was selected to construct QSAR models. The data set was randomly split into training set and test set. The list of compounds in the training set (28 compounds) and the test set (8 compounds) is marked in Table 1. Regression model were constructed using the training set, which was then applied to activity prediction for the compounds contained in the test set. The statistical results for the training set are listed in Table 2, which agree well with the corresponding results of the whole set for both CoMFA and CoMSIA models, which indicates that the training set has very similar statistical properties to the original whole set of compounds.

The predicted activity (logIC50) for the compounds in training set and test set for the CoMFA models are listed in Table 5. It is seen that the calculated data for the training set fits very well to the experimental results with errors less than 0.05 for all compounds. The model constructed from training set of compounds also provide accurate predictions for the biological activities of the test set with the errors between the predicted values and the experimental data less than 1.0 for all of the 8 compounds in that set and less than 0.5 for half of the compounds.

Table 5
The predicted bioactivity of 8 randomly selected compounds in test set based on CoMFA model that was constructed with the rest 34 compounds in training set.

Similar cross-validation analysis was conducted for the CoMSIA models, which again demonstrates strong prediction power of these models. The calculated data (Table 6) for compounds in training set fit well with experimental results with errors less than 0.1 for all compounds and the prediction errors for the all but one tested compounds are less than 1.0.

Table 6
The predicted bioactivity of 8 randomly selected compounds in test set based on CoMSIA model that was constructed with the rest 34 compounds in training set.

All results from the three validation experiments and the training-test experiments suggest that the CoMFA and CoMSIA models provide a reliable computational approach for the study of binding activities of the small molecules as cathepsin B inhibitors. The consistent performances and suggestions of these different validation experiments prove the robustness and the prediction power of the QSAR models, thus allow one to utilize the information gained from the models and look into to the interaction mechanisms between the identified inhibitors and their protein receptor.

Analysis of ligand receptor interaction

Based on the docked structures (Figure 1, ,2,2, and and3),3), apparently several amino acid residues including G27, C29, Y75, G198, H199, and W211 interact directly with the ligands. These interactions (polar and non-polar) between these residues and ligand presumably play important roles in the ligand binding affinity and binding mode recognitions. To further explore the hypothetical interaction feature of a ligand with its receptor, the contributions and distributions of the steric, electrostatic, and hydrophobic factors are analyzed. Steric and electrostatic contour maps of the CoMFA models are shown in Figure 6, and the steric, electrostatic, and hydrophobic contour maps of the CoMSIA models are shown in Figure 7A–7C. Compound 26 (CID:3241895, logIC50 = −6.362) is shown along with the contour map to facilitate the analysis. Considering the steric contours of the CoMFA (Figure 6A) and CoMSIA (Figure 7A) models first, it is seen that two models gave a very similar predictions for the steric interactions. Both models suggest that the areas around thiofuran ring were disfavored regions (yellow). In addition, the areas on phenyl ring facing out of paper were also predicted as disfavored regions (yellow) by both models, which indicate that addition of bulky groups in these regions would not favor activity increase. The areas on the other side of the phenyl ring were predicted as favored regions (green), however, where an addition of bulky groups may increase the compound’s activity. By checking the docked structure (Figure 8) of the compound in cathepsin B protein, it was noted that the favored region predicted from QSAR around the phenyl ring has space for extra substituent to fit in. On the other hand, the disfavored regions on the other side of the phenyl ring are already occupied by amino acid residues of the receptor, and there is no room for extra chemical group. Thus the suggestions from QSAR analysis of these two regions are consistent with those based on the docked structure. On the other hand, while the area around the thiofuran ring was predicted as disfavored regions, the docked structure shows open space in the surrounding area.

Figure 6
The contour maps of CoMFA results. A panel: the steric map, the favored regions of binding are colored in green and the disfavored regions are colored in yellow. B panel: the electrostatic map, the favored regions are colored in blue and the disfavored ...
Figure 7
The contour maps of CoMSIA results. A panel: the steric map, the favored regions are colored in green and the disfavored regions are colored in yellow. B panel: the electrostatic map, the favored regions are colored in blue and the disfavored regions ...

When checking the electrostatic maps of the CoMFA (Figure 6B) and CoMSIA models (Figure 7B), it is noticed that both models provided similar predictions for two favored regions- one large area located around thiofuran ring, and the other area close to the sulfone group and phenyl ring. The CoMFA model predicted a disfavored area near the sulfone group, which has a small portion of overlap with the favored areas predicted by CoMSIA model. It is reasonable to have disfavored areas around phenyl ring, while the prediction of disfavored areas around sulfone group seems contradictory to common understanding that the interaction of the polar group (sulfone) with polar amino acids increases ligand binding affinity. The hydrophobic property map (Figure 7C) from the CoMSIA model suggested two favored regions (cyan)-one area around the oxygen atom of ester and the other area located on one side of the phenyl ring nearby the triazole ring. Four disfavored regions (magenta) were identified as well by this analysis, which include areas located on the thiofuran ring, triazole ring, and the two sides of phenyl ring. Comparing the protein surface properties at the binding site (Figure 8), the locations of three predicted hydrophobic-favored areas match the hydrophobic areas on the protein surface. Also, some hydrophilic areas are found on the protein surface map close to the corresponding locations of the disfavored hydrophobic areas on contour map of the CoMSIA model. Generally speaking, suggestions from the contour map agree with the general feature of the protein surface properties around the protein binding site. All these predictions demonstrated that both the CoMFA and CoMSIA model provide valuable insights into the factors critical for determining the ligand binding and binding affinity against cathepsin B.

In the docking simulation, shape fitting and electrostatic matching are the two most important forces to determine a preferable binding structure of a ligand. If studied ligands have similar electrostatic patterns and are able to fit into the binding site, they will bind in a similar position with the same orientation. If one ligand has an opposite electrostatic pattern comparing to the other ligands, it could ideally be docked in with an orientation which is turned 180 degrees. However, a ligand in such a case may be aligned with the same orientation as the template ligand with the template based atom fitting method. The alignment from docking simulation, however, seems to have an edge over the template-based atom-fitting alignment and produced nearly ideal results for such a situation. However, the effect of such ligands on the entire 3D QSAR model needs to be checked manually in case they may need to be treated as outliers.

To build CoMFA-like 3D QSAR, determination of “active” conformation and alignment are the two most important elements to determine the quality and success of a QSAR model. When applied to determining the conformation and alignment, docking simulations normally produces multi-conformations for a compound. A preferable conformation for a compound could be determined based on the knowledge of known bound structures, docking score, etc. in some case. But in other cases, it may be difficult to select a preferable conformation. In general, how to select the docked structure for a compound is tricky, challenging, and difficult, it sometimes means whether a successful model can be constructed. In this work, we introduced multi-conformation QSAR approach that utilizes multi-poses from docking simulation in QSAR model construction. The current approach allows gradual and step by step elimination of docking poses during the QSAR building process. It combines receptor-based docking simulation with ligand-based 3D QSAR (CoMFA and CoMSIA) by introducing the binding site properties into the ligand-based models, thus enhancing the prediction power of conventional QSAR models. The results of the statistical models demonstrated the power of using the current approach to determine conformation and alignment. These results provided insights into crucial structure features affecting ligand-receptor interactions and their binding affinities, thus demonstrating the advantage of the proposed approach for 3D QSAR modeling in drug design.


By combining docking simulations with CoMFA and CoMSIA analysis, the binding structures and quantitative structure-activity relationship (QSAR) of cathepsin B inhibitors have been studied. Robust 3D QSAR models were constructed with cross-validated correlation coefficients (q2) and regression correlation coefficients (r2) of 0.605 and 0.999 for CoMFA models, and 0.605 and 0.997 for CoMSIA models, respectively. Various methods were used to validate these studies, including LOO, cross-validation, Bootstrapping, and training-test set methods. In the last experiment, the activity of the tested compounds predicted by the models built from the training compounds was shown to be well in line with their experimental activity with a maximum error of 1 for CoMFA and CoMSIA models. All validations show that the models exhibited strong prediction for evaluating the inhibitory activity (logIC50) of the compounds as cathepsin B inhibitors. The strong correlation between the calculated activity and the experimental activity demonstrated the robustness of the models, and the feasibility and advantage of the computational approach in this study.

The contour maps predicted from CoMFA and CoMSIA analyses were discussed and compared with the surface property maps of the protein binding site. The areas on phenyl ring facing out of paper (Figure 6A and and7A)7A) are disfavored regions for bulky groups, while the areas on the other side of the phenyl ring are favored regions for bulky groups. It is also predicted that the area located around thiofuran ring and the area close to the sulfone group are favored electrostatic regions. The two analyses provided similar suggestions for favored and disfavored regions around the ligands when considering steric factors. These suggestions generally match the amino acids allocation and distribution around docked ligand at the binding site. The major features of the electrostatic maps from the two analyses are also consistent. The hydrophobic map for favored and unfavored areas derived from the CoMSIA analysis roughly matches the hydrophobic and hydrophilic areas of the protein surface of the binding site. Such analysis based on the CoMFA and CoMSIA results may provide valuable information for understanding the interactions between small molecules and cathepsin B, and thus can provide guidance for modifying the identified small molecules, or in designing novel ones for optimized potency and target specificity.

One remaining challenge for the traditional 3D QSAR study is binding conformation determination. When experimental structure data is unavailable, docking simulation is often chosen as an alternative approach to obtain “active” conformation for studied compounds. However, many studies have shown that selecting a conformation for a compound from multi-docked poses is tricky and difficult, and is often the leading cause of failure to achieve good QSAR models. In this work, we introduced a multi-conformation QSAR approach in an attempt to tackle this challenge. Multi-docked poses were introduced into statistical analysis for QSAR model construction and the final conformation for a compound is identified by an iterative procedure that optimizes the conformation selection for the best correlation. The final model is constructed based on the optimized conformer data set. This approach for conformation selection contributed substantially to the successful construction of the suggested models, which may significantly increase the chance to derive a model with satisfactory results in general.

In a traditional QSAR model study, no receptor information is included, therefore suggestions from such model cannot guarantee to agree with the features of receptor binding site, which eventually determine the possibility and strength of a ligand binding and its biological consequence (activity). This work utilized receptor structure-based docking simulation to determine the “active” conformations of each ligand compound and the compound alignments. The success of the QSAR models demonstrated that the approach, by combining both receptor and ligand features, is superior to the conventional template atom-fitting QSAR approach. The models built in this study provide detailed and deep insights for understanding the individual physical chemical elements affecting ligand-receptor interactions and provide valuable suggestions for novel inhibitor design and further chemical optimization. This would help to generate a library of cathepsin inhibitors and identify lead compounds with improved inhibition potency, target selectivity and specificity. This work will doubtlessly help the chemical optimization for this set of compounds, as well as facilitate the design and development of novo cathepsin B inhibitors as drugs to cure human diseases where the members of the papain superfamilies are known to play important roles.


This research was supported by the Intramural Research Program of the NIH, NLM. The authors thank NIH Fells Editorial Board (FEB) for assistance on the manuscript reviewing and the developers of Pymol software for sharing the program to prepare the molecular figures used in the paper.


Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.


1. Barrett AJ, Kirschke H. Cathepsin B, Cathepsin H, and cathepsin L. Methods Enzymol. 1981;80(Pt C):535–61. [PubMed]
2. Chapman HA, Jr, Munger JS, Shi GP. The role of thiol proteases in tissue injury remodeling. Am J Respir Crit Care Med. 1994;150:S155–9. [PubMed]
3. Rooprai HK, McCormick D. Proteases and their inhibitors in human brain tumours: a review. Anticancer Res. 1997;17:4151–62. [PubMed]
4. Roberts LR, Adjei PN, Gores GJ. Cathepsins as effector proteases in hepatocyte apoptosis. Cell Biochem Biophys. 1999;30:71–88. [PubMed]
5. Giusti I, D’Ascenzo S, Millimaggi D, Taraboletti G, Carta G, Franceschini N, et al. Cathepsin B mediates the pH-dependent proinvasive activity of tumor-shed microvesicles. Neoplasia. 2008;10:481–8. [PMC free article] [PubMed]
6. Gounaris E, Tung CH, Restaino C, Maehr R, Kohler R, Joyce JA, et al. Live imaging of cysteine-cathepsin activity reveals dynamics of focal inflammation, angiogenesis, and polyp growth. PLoS ONE. 2008;3:e2916. [PMC free article] [PubMed]
7. Ha SD, Martins A, Khazaie K, Han J, Chan BM, Kim SO. Cathepsin B is involved in the trafficking of TNF-alpha-containing vesicles to the plasma membrane in macrophages. J Immunol. 2008;181:690–7. [PubMed]
8. Haque A, Banik NL, Ray SK. New insights into the roles of endolysosomal cathepsins in the pathogenesis of Alzheimer’s disease: cathepsin inhibitors as potential therapeutics. CNS Neurol Disord Drug Targets. 2008;7:270–7. [PubMed]
9. Sandes E, Lodillinsky C, Cwirenbaum R, Arguelles C, Casabe A, Eijan AM. Cathepsin B is involved in the apoptosis intrinsic pathway induced by Bacillus Calmette-Guerin in transitional cancer cell lines. Int J Mol Med. 2007;20:823–8. [PubMed]
10. Lutgens SP, Cleutjens KB, Daemen MJ, Heeneman S. Cathepsin cysteine proteases in cardiovascular disease. Faseb J. 2007;21:3029–41. [PubMed]
11. Hook V, Kindy M, Hook G. Cysteine protease inhibitors effectively reduce in vivo levels of brain beta-amyloid related to Alzheimer’s disease. Biol Chem. 2007;388:247–52. [PubMed]
12. Downs LS, Jr, Lima PH, Bliss RL, Blomquist CH. Cathepsins B and D activity and activity ratios in normal ovaries, benign ovarian neoplasms, and epithelial ovarian cancer. J Soc Gynecol Investig. 2005;12:539–44. [PubMed]
13. Vasiljeva O, Korovin M, Gajda M, Brodoefel H, Bojic L, Kruger A, et al. Reduced tumour cell proliferation and delayed development of high-grade mammary carcinomas in cathepsin B-deficient mice. Oncogene. 2008;27:4191–9. [PubMed]
14. Vasiljeva O, Papazoglou A, Kruger A, Brodoefel H, Korovin M, Deussing J, et al. Tumor cell-derived and macrophage-derived cathepsin B promotes progression and lung metastasis of mammary cancer. Cancer Res. 2006;66:5242–50. [PubMed]
15. Jane DT, Morvay L, Dasilva L, Cavallo-Medved D, Sloane BF, Dufresne MJ. Cathepsin B localizes to plasma membrane caveolae of differentiating myoblasts and is secreted in an active form at physiological pH. Biol Chem. 2006;387:223–34. [PubMed]
16. Akache B, Grimm D, Shen X, Fuess S, Yant SR, Glazer DS, et al. A two-hybrid screen identifies cathepsins B and L as uncoating factors for adeno-associated virus 2 and 8. Mol Ther. 2007;15:330–9. [PubMed]
17. Chandran K, Sullivan NJ, Felbor U, Whelan SP, Cunningham JM. Endosomal proteolysis of the ebola virus glycoprotein is necessary for infection science. Science. 2005;308:1643–5. [PubMed]
18. Simmons G, Gosalia DN, Rennekamp AJ, Reeves JD, Diamond SL, Bates P. Inhibitors of cathepsin L prevent severe acute respiratory syndrome coronavirus entry. Proc Natl Acad Sci USA. 2005;102:11876–81. [PubMed]
19. Hook VY, Kindy M, Hook G. Inhibitors of cathepsin B improve memory and reduce beta-amyloid in transgenic Alzheimer disease mice expressing the wild-type, but not the Swedish mutant, beta-secretase site of the amyloid precursor protein. J Biol Chem. 2008;283:7745–53. [PubMed]
20. Hosokawa M, Kashiwaya K, Eguchi H, Ohigashi H, Ishikawa O, Furihata M, et al. Over-expression of cysteine proteinase inhibitor cystatin 6 promotes pancreatic cancer growth. Cancer Sci. 2008;99:1626–32. [PubMed]
21. Parker BS, Ciocca DR, Bidwell BN, Gago FE, Fanelli MA, George J, et al. Primary tumour expression of the cysteine cathepsin inhibitor Stefin A inhibits distant metastasis in breast cancer. J Pathol. 2008;214:337–46. [PubMed]
22. Yamamoto A, Tomoo K, Hara T, Murata M, Kitamura K, Ishida T. Substrate specificity of bovine cathepsin B and its inhibition by CA074, based on crystal structure refinement of the complex. J Biochem. 2000;127:635–43. [PubMed]
23. Greenspan PD, Clark KL, Tommasi RA, Cowen SD, McQuire LW, Farley DL, et al. Identification of dipeptidyl nitriles as potent and selective inhibitors of cathepsin B through structure-based drug design. J Med Chem. 2001;44:4524–34. [PubMed]
24. Otto HH, Schirmeister T. Cysteine Proteases and Their Inhibitors. 1997;97:133–72. [PubMed]
25. Mladenovic M, Ansorg K, Fink RF, Thiel W, Schirmeister T, Engels B. Atomistic Insights into the Inhibition of Cysteine Proteases: First QM/MM Calculations Clarifying the Stereoselectivity of Epoxide-Based Inhibitors. J Phys Chem B. 2008;112:11798–808. [PubMed]
26. Redzynia I, Ljunggren A, Abrahamson M, Mort JS, Krupa JC, Jaskolski M, et al. Displacement of the occluding loop by the parasite protein, chagasin, results in efficient inhibition of human cathepsin B. J Biol Chem. 2008;283:22815–25. [PubMed]
27. Watanabe D, Yamamoto A, Tomoo K, Matsumoto K, Murata M, Kitamura K, et al. Quantitative evaluation of each catalytic subsite of cathepsin B for inhibitory activity based on inhibitory activity-binding mode relationship of epoxysuccinyl inhibitors by X-ray crystal structure analyses of complexes. J Mol Biol. 2006;362:979–93. [PubMed]
28. Markt P, McGoohan C, Walker B, Kirchmair J, Feldmann C, Martino GD, et al. Discovery of Novel Cathepsin S Inhibitors by Pharmacophore-Based Virtual High-Throughput Screening. J Chem Inf Model. 2008;48:1693–705. [PubMed]
29. Beavers MP, Myers MC, Shah PP, Purvis JE, Diamond SL, Cooperman BS, et al. Molecular docking of cathepsin L inhibitors in the binding site of papain. J Chem Inf Model. 2008;48:1464–72. [PMC free article] [PubMed]
30. Zhou Z, Wang Y, Bryant SH. Computational Analysis of the Cathepsin B Inhibitors Activities Through LR-MMPBSA Binding Affinity Calculation Based on Docked Complex. Journal of Computational Chemistry. 2009;30:2165–75. [PMC free article] [PubMed]
31. Zhou Z, Wang Y, Bryant SH. QSAR models for predicting cathepsin B inhibition by small molecules—Continuous and binary QSAR models to classify cathepsin B inhibition activities of small molecules. Journal of Molecular Graphics and Modelling. 2010;28:4. [PMC free article] [PubMed]
32. Hopfinger AJ, Wang S, Tokarski JS, Jin B, Albuquerque M, Madhav PJ, et al. Construction of 3D-QSAR models using the 4D-QSAR analysis formalism. J Am Chem Soc. 1997;119:10509–24.
33. Myers MC, Napper AD, Motlekar N, Shah PP, Chiu CH, Beavers MP, et al. Identification and characterization of 3-substituted pyrazolyl esters as alternate substrates for cathepsin B: the confounding effects of DTT and cysteine in biological assays. Bioorg Med Chem Lett. 2007;17:4761–6. [PMC free article] [PubMed]
34. Jorgensen WL, Tirado-Rives J. The OPLS Potential function for proteins Energy minimizations for crystals of cyclic peptides and crambin. J Am Chem Soc. 1988;110:1657–66.
35. Damm W, Frontera A, Tirado-Rives J, Jorgensen WL. OPLS all-atom force field for carbohydrates. Journal of Computational Chemistry. 1997;18:1955–70.
36. Rizzo RC, Jorgensen WL. OPLS All-Atom Model for Amines: Resolution of the Amine Hydration Problem. Journal of the American Chemical Society. 1999;121:4827–36.
37. Jorgensen WL, Maxwell DS, Tirado-Rives J. Development and Testing of the OPLS All-Atom Force Field on Conformational Energetics and Properties of Organic Liquids. Journal of the American Chemical Society. 1996;118:11225–36.
38. Friesner RA, Banks JL, Murphy RB, Halgren TA, Klicic JJ, Mainz DT, et al. Glide: A new approach for rapid, accurate docking and scoring. 1. method and assessment of docking accuracy. Journal of Medicinal Chemistry. 2004;47:1739–49. [PubMed]
39. Halgren TA, Murphy RB, Friesner RA, Beard HS, Frye LL, Pollard WT, et al. Glide: A new approach for rapid, accurate docking and scoring. 2. Enrichment factors in database screening. Journal of Medicinal Chemistry. 2004;47:1750–9. [PubMed]
40. Zhou Z, Bates M, Madura JD. Structure modeling, ligand binding, and binding affinity calculation (LR-MM-PBSA) of human heparanase for inhibition and drug design. Proteins: Structure, Function, and Bioinformatics. 2006;65:580–92. [PubMed]
41. Zhou Z, Khaliq M, Suk JE, Patkar C, Li L, Kuhn RJ, et al. Antiviral Compounds Discovered by Virtual Screening of Small Molecule Libraries against Dengue Virus E Protein. ACS Chemical Biology. 2008;3:765–75. [PMC free article] [PubMed]
42. Zhou Z, Madura JD. CoMFA 3D-QSAR Analysis of HIV-1 RT Nonnucleoside Inhibitors, TIBO Derivatives Based on Docking Conformation and Alignment. Journal of Chemical Information and Computer Sciences. 2004;44:2167–78. [PubMed]