|Home | About | Journals | Submit | Contact Us | Français|
Geranylgeranylation is critical to the function of several proteins including Rho, Rap1, Rac, Cdc42, and G-protein gamma subunits. Geranylgeranyltransferase type I (GGTase-I) inhibitors (GGTIs) have therapeutic potential to treat inflammation, multiple sclerosis, atherosclerosis, and many other diseases. Following our standard QSAR modeling workflow, we have developed and rigorously validated Quantitative Structure Activity Relationship (QSAR) models for 48 GGTIs using variable selection k nearest neighbor (kNN), automated lazy learning (ALL), and partial least square (PLS) methods. The QSAR models were employed for virtual screening of 9.5 million commercially available chemicals yielding 47 diverse computational hits. Seven of these compounds with novel scaffolds and high predicted GGTase-I inhibitory activities were tested in vitro, and all were found to be bona fide and selective micromolar inhibitors. Notably, these novel hits could not be identified using traditional similarity search. These data demonstrate that rigorously developed QSAR models can serve as reliable virtual screening tools.
The proper functioning of proteins often relies on post-translational modification of the polypeptide leading to changes in chemical characteristics. Found at the extreme carboxyl terminus of the protein, one post-translational “program” utilized for over 140 proteins is the so called ‘CaaX box’, where ‘C’ is a cysteine, ‘aa’ is any aliphatic dipeptide, and X is the terminal residue that directs which of two prenyl groups is added1,2. The protein prenylation cascade begins with the addition of either a 15-carbon isoprene farnesyl lipid when X residues are Ser, Met, Gln, Cys, and Ala; or a 20-carbon geranylgeranyl lipid is added when the X residue is Leu3. The CaaX prenyltranferases include protein farnesyltransferase (FTase) that adds the 15-carbon farnesyl group to proteins like Ras GTPases, nuclear lamins, several protein kinases and phosphatases, as well as other regulatory proteins4. Protein geranylgeranyltransferase type I (GGTase-I) transfers the 20-carbon geranylgeranyl group to proteins including critical signaling molecules from many classes, e.g., the Ras superfamily (including K-Ras, Rho, Rap, Cdc42 and Rac), several G-protein gamma subunits, protein kinases (rhodopsin kinase, phosphorylase kinase, and GRK7), and protein phosphatases5,4.
CaaX protein lipidation is obligate for the protein to be further modified by a protease termed Rce1, which removes the three terminal ‘aaX’ residues. The resulting isoprenylcysteine carboxylic acid is then methylated by isoprenylcysteine carboxymethyltranferase (Icmt) to create a protein terminus with a now mature (and very hydrophobic) isoprenylcysteine carboxymethylester6. Protein prenylation is important in the localization, interactions, and activity of modified proteins. Many of the prenylated proteins are found at the cytoplasmic face of cell membranes, where cell signaling is concentrated. Additionally, protein prenylation is required for cellular transformation by oncogenic Ras, providing the initial evidence that prenylation-dependent localization of proteins is critical in the Ras function7.
The first prenyltranferase inhibitors were farnesyltransferase inhibitors (FTIs), that were rapidly developed from early CaaX peptide mimics8 into the small organic ligands. The first peptidomimetic protein prenyltransferase inhibitors were mixed inhibitors, but highly selective inhibitors were rapidly developed. Using the example of one of the canonical oncogenes H-Ras, rational application of FTIs have shown efficacy in leukemias, gliomas, and breast cancers, providing impetus for targeting GGTase-I in cancers driven by geranylgeranylated oncogenes9;10. Moreover, some Ras-dependent tumors are resistant to FTIs. This departure from prediction is likely due to so-called cross-prenylation by GGTase-I. During FTIs treatment some proteins, most notably K-Ras, that are typically farnesylated by FTase, are found geranylgeranylated, which restores at least a portion of the activity11. Dual FTase/GGTase inhibitors have received little attention and this type of treatment would impact a large number of proteins which make result interpretations complicated.
Several GGTIs have been developed that inhibit C20 lipid modification of GGTase-I substrates. GGTIs have been primarily developed for use as cancer therapeutics, particularly in cancers that have high levels, or activating mutations of geranylgeranylated proteins3,5. GGTIs are now receiving broad interest for clinical use. Besides the continuing development as anti-cancer agents, GGTIs’ are now postulated to have a potential in treating a wide array of other diseases including inflammation, multiple sclerosis, atherosclerosis, viral infection (HepC/HIV), apoptosis, angiogenesis, rheumatoid arthritis, psoriasis, glaucoma, and diabetic retinopathy1,12. In addition, GGTase function is prerequisite in the normal functioning of many parasites and fungi, which has led to discovery programs to develop and use non-human selective GGTIs as antifungals and antiparasitics13;14.
A wide variety of GGTIs have been reported in various publications in the relatively short time (~12 years) when the enzyme has been studied. Many of these have been designed rationally based on the substrates of GGTase-I: geranylgeranyl diphosphate (GGpp) or the CaaX peptide. There are also a number of natural compounds that were identified in a screen for inhibition of GGTase-I from Candida sp. A comprehensive review of known GGTIs was published recently12. Unfortunately, many of the known GGTIs’ binding mode(s) were never characterized and IC50 data for the same compounds are often in disagreement when measured by different laboratories. These observations make the large portion of GGTIs less than optimal for QSAR model building. However, there are two known scaffolds that have been well characterized with respect to their binding to the peptide pocket and using similar estimates for IC50 values. These include a number of CaaL peptidomimetics including aminobenzoic acid derivatives such as GGTI-298 and GGTI-215415,16 and benzoyleneurea-based compounds17. More recently three newer classes of GGTIs have been published including one based on a piperazin-2-one backbone18, dihydropyrole/tetrahydropyridine based small molecules19, and allenoate compounds20.
At the molecular level the principal effect of GGTIs is to block interactions (either with the membrane, or decreased interaction with protein binding partners), leading to mislocalization of signaling molecules. There are ~70 protein targets for GGTase-I, however their susceptibility is not equal. Many GGTase targets are rapidly inhibited and since the modification is post-translational this suggests high turnover rates of the modified target21. However, some geranylgeranylated proteins (Gγ subunits in particular) appear to have very long half-lives making them resistant to GGTIs22. At the cellular level inhibition of GGTase-I leads to cell cycle arrest at G0/G1 at low dose23,24 and complete blockade typically leads to apoptosis in both normal25,26 and transformed cell lines27.
While there are several known chemical scaffolds for selective inhibition of GGTase-I, the list of potential uses for GGTI’s highlights the need for chemical diversity of inhibitors targeting the enzyme. As an example, the therapeutic targeting of glaucoma might benefit from compound characteristics that are quite different than those for treating multiple sclerosis (MS). For instance, positive characteristics for a potential topical treatment of disorders like glaucoma or psoriasis would entail limited systemic bioavailability with perhaps a short half-life, while a MS drug would need to penetrate the blood-brain barrier and would benefit from an extended half-life. Additionally, there is a major therapeutic potential for creating species-selective GGTIs for use as antipathogens. The potential to manipulate these characteristics benefits more from having the flexibility of multiple scaffolds.
The development of small molecule inhibitors for clinical use is a multi-step process with many potential dead ends. In the preclinical setting drug development typically begins with designating a target protein/enzyme whose inhibition may lead to a desired physiologic response. Generally, the next step is to use carefully designed in vitro assay that allows screening of small molecule libraries. The goal of this screening process is to identify active molecules as defined by the particular activity assay.
Drug discovery and development can take many forms. It is often the case that a primary aim is to increase the affinity of a drug to its target. However, in some situations it eventually becomes clear (and often quite late in the development) that the actual drug scaffold has problems, particularly with bioavailability and metabolism, which cannot be solved though traditional lead optimization. It would be of great advantage to take the knowledge gained from the drug development process to more efficiently train models and search for novel scaffolds. Novel scaffolds are also desirable means of circumventing ADME problems that are often encountered at the later stages of the drug discovery process.
Quantitative Structure Activity Relationship (QSAR) modeling has been used extensively as a major computational tool for rationalizing the experimental data on binding or inhibitory activity of chemical compounds. QSAR is typically performed in two distinct modes, frequently referred to as 2D vs. 3D QSAR. In 2D QSAR, chemical descriptors are calculated from chemical graphs and no information about three-dimensional configuration of molecules is utilized. In 3D QSAR methods such as still popular Comparative Molecular Field Analysis (CoMFA)28, conformational analysis and global 3D structure alignment should take place before descriptors are calculated. 2D QSAR has an inherent advantage of being independent of drug conformation, although it has a disadvantage of being much less robust in terms of model interpretation. However, because of its compound conformation and 3D alignment independence, 2D QSAR affords much higher computational efficiency and degree of automation when models are applied for virtual screening of large external libraries29.
The use of QSAR models for virtual screening has not been viewed historically as its mainstream application; on the contrary, QSAR modeling approach has been typically considered as a lead optimization technology. Nevertheless, our group has been advocating for and advancing the use of validated, externally predictive QSAR models for virtual screening for a number of years starting as early as 200130. We have published several earlier papers demonstrating the possibility of discovering novel bioactive compounds by the means of rigorous QSAR modeling coupled with virtual screening (e.g.,30–34; see for a recent review35). Critical to this approach is an extensive model validation, in which known compounds are divided into groups that are used to build the model and a separate “external” set of known compounds that is used to test if the model is capable of accurately predicting activities of external compounds36. An ensemble of robust and validated models can then be used to virtually screen a chemical database for compounds with potential target receptor activity30;37. The use of QSAR models as virtual screening tools and for that matter, the methodology of QSAR modeling itself remains the area of active investigation, and the choice of methodology, such as the model building algorithms and the types of chemical descriptors, can dramatically influence the success and applicability of the approach38. Our current approach that we term combinatorial QSAR modeling38;39 relies on the concurrent use of several QSAR modeling techniques for data analysis. Our aim is to identify models (or a combination of models) that afford the highest prediction accuracy and therefore could be expected to be successful in identifying novel bioactive molecules by the means of virtual screening.
There were few computational studies on GGTIs, including QSAR, reported in the literature. The only one searchable is done by Polley et al in 200440, using bayesian regularized artificial neural network on a GGTI dataset of 446 compounds. They had one division for training and test sets, thus only one single model was generated. The statistics of the model are optimal, with R 2 of 0.893 for training set, and q 2 of 0.778 for test set. It should be pointed out that there was no cross-validation during model building, and they did not apply models to virtual screening of chemical libraries to identify novel hits.
In this study, we have employed the combinatorial QSAR modeling strategy using three different approaches (described in detail below) to develop rigorous and validated models of 44 GGTIs with two chemical scaffolds. One scaffold (the GGTI-DUx series) was identified through initial random screening with extensive iterative follow-up medicinal chemistry22. The second set (GGTI-x) was initially developed following a rational peptidomimetic approach41. The workflow of our study is shown in Figure 1. The best models were applied to virtual screening of a large collection of ca. 9,500,000 compounds compiled from publicly available chemical databases. These searches resulted in only 47 consensus hits42 (i.e., predicted active by all models), none of which were present in the original dataset or have ever been characterized as GGTIs. Seven of these hits were validated in vitro and all were found to be active at micromolar level. Notably, three compounds incorporated novel scaffolds that were never reported before as potential GGTIs. For comparison, the traditional fingerprint based similarity search using all training set compounds as queries was also employed and the resulting hits were found to have little overlap with 47 QSAR/VS hits. Furthermore, none of experimentally confirmed QSAR/VS hits could be identified by the similarity search. These results support the notion that the combined application of rigorous QSAR modeling and virtual screening could serve as a powerful general modeling approach towards the discovery of novel drug candidates.
The pharmacological data for 48 GGTIs used in this study were generated as part of an iterative drug discovery program that led to GGTI-DU4022. The details of the medicinal chemistry effort that resulted in this compound will be reported separately (J.P. Strachen et al, in preparation). The synthesis work was conducted in Pharmaceutical Product Development, Inc. (PPD, Research Triangle Park). All 48 GGTIs were confirmed to be of greater than 95% purity by the means of LCMS, and the detailed spectra are with the company. The structure of GGTI-DU40 can be discussed in the context of the CaaL peptide framework. There is a free amide group, a spacer domain relating to the dialiphatic motif, and critical sulfur as found in the requisite cysteine residue of GGTase-I’s substrates. In even simpler terms, the structure can be described as a hydrophobic head linked to a hydrophilic tail. Four additional GGTIs included in the data set were peptidomimetics as well including GGTI-28741, GGTI-29743, and GGTI-215444. These four compounds were developed as CaaL peptidomimetics and are reasonably similar to each other but quite dissimilar to the GGTI-DU40 series (cf. Chart S1 of Supporting Information). Chemical structures of all inhibitors used in QSAR modeling and their associated IC50 values are given in Chart S1. The pIC50 values for all compounds ranged from 3.8 to 7.6 with a near Gaussian distribution (cf. Figure 2). Importantly, the combination of data sets including compounds with different chemical scaffolds of the wide distribution of pairwise chemical similarities within the entire dataset (Figure S1 of Supporting Information), which in theory (and as we have established in this study, in practice) should have enabled the identification of chemically diverse virtual hits from virtual screening.
All chemical structures were generated using ACD/ChemSketch software before converting them to SMILES. MolconnZ software version 4.09 (MZ4.09)45 was used to generate the molecular topological index descriptors46. MZ4.09 calculates more than 700 different descriptors, however, many are used for accounting purposes and several more have either zero values or zero variance. Once non-redundant descriptors were removed, a set of 274 chemically relevant descriptors remained. In order to prevent unequal weighting, descriptors were linearly normalized to fall within the range of zero to one based on the minimum and maximum values of each descriptor (i.e., range-scaled)47. The use of range-scaling avoids giving descriptors with significantly broader ranges a disproportional weight upon distance calculations in multidimensional MZ4.09 descriptor space. We then follow our standard protocols to subdivide the whole dataset into multiple training/test set pairs using the Sphere Exclusion method48 implemented in our laboratory. The number of compounds in the test set was varied to achieve the largest possible size of the test set, while ensuring that the training set models were still able to predict the biological activities of the test set compounds accurately.
The k Nearest Neighbor (kNN) QSAR method used in this study employs the kNN pattern recognition principle 49 and variable selection. In short, a subset of variables (descriptors) is selected randomly in the beginning as a Hypothetical Descriptor Pharmacophore (HDP)50. The HDP is validated by LOO-CV, where each compound is eliminated from the training set and its GGTase-I inhibition activity is predicted as the weighted average of the activity\ies of the k most similar molecules (k varies from 1 to 5). The weighted molecular similarity is represented by the modified Euclidean distance between compounds in HDP multidimensional space as shown in Equation 1 and Equation 2. Essentially, the neighbor with the smaller distance from a compound is given a higher weight in calculating the predicted activity:
where di is the Euclidean distance between the compound i and its kth nearest neighbors; wi is the weight for the kth nearest neighbor; yi is the experimentally measured activity value for the kth nearest neighbor; and y˜ is the predicted activity value.
Simulated annealing and Metropolis-like acceptance criteria were used to optimize the selecetion of variables. Details of the kNN method implementation including the description of the simulated annealing procedure used for stochastic sampling of the descriptor space, are given elsewhere47. The statistical significance of the models were estimated by the LOO-CV q2 in the training set, a coefficient of determination R02 (Equation 3) and linear fit R2 values for both internal and external test sets.
Here yk and ỹk are the observed and predicted activities of a compound k, respectively, and is the average activity of all compounds. Model acceptability cutoffs were q2 > 0.60 for training set and correlation coefficient R2 > 0.60 for internal test set51. Models that satisfied both criteria were applied to external validation sets.
We also employed two other methods, i.e. Automated Lazy Learning QSAR (ALL-QSAR) and Partial Least Square (PLS) QSAR, in this study. The ALL-QSAR was developed in our group and is ideal for a large or diverse dataset52. The PLS QSAR is arguably the most traditional and less sophisticated QSAR approach among those explored in this study. The modeling procedures were similar to those described in our previous studies39;52.
Formally, a QSAR model can predict the target property for any compound for which chemical descriptors can be calculated. However, since the training set models are developed in kNN QSAR modeling by interpolating activities of the nearest neighbor compounds, a special applicability domain (i.e., similarity threshold) should be introduced to avoid making predictions for compounds that differ substantially from the training set molecules. In brief, the distribution of distances (pairwise similarities) of compounds in our training set is computed to produce an applicability domain threshold, DT, calculated as follows:
Here, is the average Euclidean distance of the k nearest neighbors of each compound within the training set, σ is the standard deviation of these Euclidean distances, and Z is an arbitrary parameter to control the significance level. Based on previous studies, we set the default value of this parameter as 0.5, which formally places the boundary for the applicability domain at one-half of the standard deviation (assuming a Boltzmann distribution of distances between each compound and its k nearest neighbors in the training set). Thus, if the distance of the external compound from at least one of its nearest neighbors in the training set exceeds this threshold, the prediction is considered unreliable. Additional details can be found in our previous publications36;39.
Y-randomization test is a widely used validation technique to ensure the robustness of a QSAR model53. In this test, the dependent-variable vector, Y-vector, is randomly shuffled and new QSAR models are developed using the original independent-variable matrix. This process is repeated several (typically, 10) times. It is expected that the resulting QSAR models should generally have low LOO q2 and test set R2 values. It is likely that sometimes, though infrequently, high q2 values may be obtained due to a chance correlation or structural redundancy of the training set. If all QSAR models obtained in the Y-randomization test have relatively high R2 and LOO q2, it implies that an acceptable QSAR model cannot be obtained for the given dataset by the current modeling method. Y-randomization test was applied to all QSAR methods considered in this study.
Although we have employed three different QSAR methods for model building, kNN produced the most predictive and robust models (cf. Table 1). It was then selected for primary use in virtual screening. The screening was performed on our Molecular Modeling Laboratory (MML) in-house collection of 9,500,000 compounds, including the ZINC7.0 database of ca. 6,500,000 compounds54, the Maybridge database (2008.03) of ca. 56,000 compounds55, the World Drug Index (WDI) database of ca. 59,000 compounds56, the ASINEX Synergy libraries of ca. 11,000 compounds, the Chemizon Progenitor databases (2006 v1.1) of ca. 3,300 compounds57 and several other commercial databases. None of the compounds found in the training set were present in the mining databases.
As illustrated in the workflow of Figure 1, the rigorously validated QSAR models were employed for virtual screening. A global applicability domain (calculated using all descriptors) was applied first in order to filter out compounds that differed globally in their structure from the modeling set compounds. All 48 known GGTIs were used as probes in the calculations. During the consensus prediction, the results were accepted only when the compound was found within the applicability domains of more than 50% of all models used in consensus prediction and the standard deviation of estimated means across all models was small. Furthermore, we restricted ourselves to the most conservative applicability domain for each model using the cutoff (cf. Equation 4) Z = 0.5.
All the modeling and virtual screening calculations were done at a 352-processor Beowulf Linux cluster of the ITS Research Computing Division of the University of North Carolina at Chapel Hill. The compute nodes are Intel Xeon IBM BladeCenter of Dual Intel Xeon 2.8GHz, with 2.5GB RAM on each node. The cluster runs the Red Hat Enterprise Linux 4.0 (32-bit) and the nodes communicate via a Gigabit Ethernet network. The processing speed of QSAR-based screening is relatively high, ca. 100K compounds per minute. As could be expected, the processing speed was found to scale linearly with the size of the screening library.
The chemical similarity search was conducted with the MOE2006.08 package using the standard protocols. The MACCS structural keys were utilized with the Tanimoto Coefficient (Tc) as the similarity metric. The search was carried out independently for each of the 48 compound of the GGTIs modeling dataset. In the case that the hits from individual searches were the same, a special Scientific Vector Language (SVL) script was employed to remove one of them based on the chemical topology.
The kNN QSAR model building employed 274 MZ4.09 descriptors derived from 48 GGTIs as the independent variables. During the calculations, the conservative value of 0.5 was used for Zcutoff to define the applicability domain (cf. Eq. 4). In total, 6720 models were generated and only 104 models were accepted using the cutoff for both leave-one-out cross-validated q2 values for training sets and predictive R2 for test sets greater than 0.60. As shown in Figure 3a and Table 1, the kNN QSAR method afforded the best models with q2/R2 values as high as 0.82/0.85 for this GGTIs dataset (R02 = 0.83). These results suggest that the intrinsic inhibition activity relationships exist for GGTIs that can be described reasonably well by kNN models using MZ4.09 descriptor sets.
As part of our combinatorial QSAR strategy, PLS and ALL QSAR were employed to analyze the same dataset, using the same descriptors and the same training/test set divisions generated by Sphere Exclusion. These two additional QSAR approaches were expected to increase the chances of successful modeling of GGTIs so that only best models are selected for virtual screening. Multiple predictive models by ALL QSAR method were obtained with the highest R2 of 0.81 for 7 compounds in the test set, as can be seen in Figure 3b and Table 2. Additional model parameters for the same test set were R02 of 0.91 and the RMSE of 0.21. The models produced by the PLS QSAR method were less satisfactory. As shown in Figure 3c and Table 3, all five PLS QSAR models had superior values for cross-validated q2 of training set, ranging from 0.92 to 0.97. However, their predictive R2 for test sets were around 0.20 ~ 0.30, except for model #2 (R2 = 0.87, RMSE = 0.92). Thus, only this latter model could be used for consensus prediction.
To further evaluate the robustness of kNN QSAR models, the whole model building process was repeated but using randomized IC50 values in place of the actual measured IC50 values. Figure S2 of Supporting Information shows the distinctive distribution of all models for actual vs. Y-randomized data in term of q2/R2 values. As can be observed, the q2 ranged from 0.40 to 0.90 for actual models while from −0.30 to 0.80 for randomized ones. It should be pointed out that no models exceeded the 0.60 cutoff for both q2 and R2 when the activity values were randomized. The standard one-tail hypothesis test was conducted to evaluate the statistical significance of QSAR models derived from the actual data set, in comparison to the models from the random data set. The Z score that is calculated from the q2 value is 4.22, much higher than the tabular value of Zc, which corresponds to the level of significance α = 0.01. This suggests that kNN does not have the ability to correlate descriptors to random activities for GGTIs dataset, thus the QSAR models obtained with the real data are robust.
Three QSAR methods were used in this study, including kNN regression QSAR, ALL QSAR, and PLS QSAR. Each method was combined with MZ4.09 descriptor set and applied to the same training/test sets splits of GGTIs dataset making it possible to compare the performances of different algorithms. Overall, all three QSAR methods afforded predictive models that met the statistical thresholds (q2, R2 > 0.60) though the number of acceptable models varied (104 for kNN QSAR, 7 for ALL QSAR, and 1 for PLS QSAR). All of these acceptable models were used for virtual screening of chemical libraries and consensus prediction downstream of our modeling workflow (cf. Figure 1). Among the three, the kNN QSAR method afforded the largest number of acceptable models and the highest statistics with q2/R2 values of 0.82/0.85. The ALL QSAR yielded the best R2 value as high as 0.91 but the number of predictive models was limited. Similarly, the PLS QSAR method generated only one good model (q2/R2 of 0.92/0.87) and the results were dependent on the splits of training/test sets. It should be noted that there was no external validation dataset available in addition to the test sets, because of the small size of GGTIs dataset used in this study and the limited source of literature data. Thus, the external predictive ability of all acceptable models had to be validated by the screening hits in this particular case.
As the first step of our QSAR-based virtual screening, the preliminary filtering of the 9.5 million compounds in our screening library yielded 79 initial hits. This was done by using the global applicability domain of all 48 GGTIs in the modeling set. After consensus predictions by 104 validated kNN models, their predicted activities (pIC50) were found ranging from 4.51 to 5.96. Only 47 hits, including two pairs of stereoisomers, showed high predicted activity (pIC50 > 5.50) as well as high model coverage and were designated as the final hits. Concurrently, ALL and PLS QSAR models were employed to re-evaluate those 79 hits in order to identify the consensus hits among all three methods. In the end, seven compounds were prioritized for experimental validation based on high predicted activity, uniqueness of structure, and availability. The 2D chemical structures of the 47 compounds with predicted high inhibition activities are shown in the Chart S2 of Supporting Information. A large portion of the screening hits contained the pyridine-pyrazole-phenyl (6-5-6) ring structure which is prevalent in the training set. It was expected considering the empirical nature of QSAR modeling and the very conservative applicability domain used in the study.
Using purified recombinant GGTase-I as an enzyme source and GGpp and Ras-CVLL as substrates, seven hit compounds were tested in vitro as a matter of the experimental validation. The selection was based on high predicted activity, availability and structural uniqueness. All tested compounds showed inhibition of GGTase-I with the pIC50 ranging from 3.63 to 5.44 (cf. Figure 4 and Table 4). The comparison between predicted and experimental data is shown in Table 4. Using pIC50 > 4.00 as the threshold to define the actives, it is shown that kNN QSAR predicted correctly most hit compounds as active, except for GGTI-DU.Sig342. The R2 for the prediction is 0.45 in this case. PLS QSAR also identified the same six compounds as actives, but had a large error on GGTI-DU.As142 (absolute error of 2.85). ALL QSAR had the worst performance, however, predicting only 2 of 7 hits to be active. Thus, kNN based predictions were better than other methods in this case.
The unexpected result was to have several predicted actives that did not have this common ring feature in their structure. In fact, seven highly-ranked hits had no apparent relationship to any of the training set molecules. They had furan, triazole, tetrazole, and pyridine cores in their scaffolds while all non-peptidomimetic compounds of the training set were based on a pyrazole core. Therefore, the seven hit compounds without the 6-5–6 rings that were found in most non-peptidic GGTIs appear to be the structurally novel hits. Figure 4(b) list the chemical structures of three representative confirmed hits, GGTI-DU.Sig3, GGTI-DU.As2 and GGTI-DU.En2. The novel scaffolds (highlighted) among the three can be traced back to the general formulas of substructures found in the 47 screening GGTIs hits (cf. Figure S6 of Supporting Information). For example, GGTI-DU.Sig3 contains the novel scaffold defined by Formula IV while the new structural element in GGTI-DU.As2 belongs to Formula II. Again, these four structural formulas cannot be found in any compounds in the GGTIs training set. This observation lends additional support to the hypothesis that QSAR-based virtual screening is capable of ‘scaffold hopping’.
Although these QSAR/VS derived GGTIs hits had GGTase inhibition activity, it is possible that these effects are nonspecific. The ability of the compounds to inhibit GGTase-I in vitro is not the only requirement for their potential as therapeutics. Another major hurdle in the development of GGTIs is their selectivity towards GGTase versus FTase. These two proteins share ~35% sequence identity and have been known to have cross-reacting substrates, particularly K-Ras. In fact, it was the discovery of cross-prenylation in the presence of highly selective FTIs that led to increased interest in the development of selective GGTIs. We therefore tested four representative hit compounds for inhibitory activities against this highly related FTase. Impressively, all these four compounds that inhibited GGTase showed little to no activity in the FTase assay (cf. the examples in Figure S4 of Supporting Information). This indicates that QSAR-based VS hits proved to be target specific.
As expected, many of the 47 QSAR VS hits exhibit high degrees of similarities to the modeling set (cf. Chart S1). It is therefore more interesting to further analyze the 7 confirmed hits which have novel scaffolds. All 7 hits were compared to the GGTIs modeling set using the MACCS structural keys and the result is shown in Table 5. Notably, none of these confirmed hits had Tc > 0.80 when compared to any of the 48 GGTIs. In fact, the similarity of screening hits 89 and 9242 to most similar compounds in the modeling set had Tc < 0.70. Thus, these 7 hits are highly dissimilar to the modeling set as measured by MACCS structural key and the associated Tc metric.
An intriguing question now emerges as to what kind of hits would a MACCS based similarity search find using compounds in the same dataset as probes and how those hits would compare to hits identified with QSAR-based VS. To create a complete picture of the differences between QSAR and fingerprint based VS, we applied MACCS based similarity search to the same virtual screening library of ~9.5 million compounds. The search generated 8,132 hits with Tc > 0.80, 724 hits with Tc > 0.85 and only 22 hits with Tc > 0.90; among those 22 hits there were two pairs of isomers. Notably, there were few overlaps between the hits from QSAR VS and fingerprint based similarity search. Among the 724 hits at Tc = 0.85 (the default similarity cutoff in MOE2006.08 package), only 20 can also be found within the 79 preliminary hits of QSAR based VS. In other words, the remaining 59 QSAR/VS hits were dissimilar to the GGTIs dataset in term of global similarity defined by MACCS structural keys. When the threshold was set as high as of Tc = 0.90, there was only one compound 107 (PubChem CID: 3942219) of the QSAR/VS hits that was found among the 22 hits from the similarity search. The resulting MACCS VS hits, as expected, are highly similar to the GGTIs dataset and can be divided into those that are similar to the GGTI-DUx series of pyrazoles (16 compounds) and those belonging to the GGTI-X series of peptidomimetics (6 compounds). All pyrazole-like MACCS VS hits contain the 6-5–6 ring system, a hydrophobic tail and an amine(s) linker that connects the two. The peptiomimetic MACCS VS hits were visually less similar to the modeling dataset compounds. However, close examination indicates that their entire backbones are in fact highly similar. The primary reason for the confusion is the phenyl rings found at both termini of the MACCS VS hits.
To further validate the MACCS VS hits, five of the peptidomimetic hits were tested for the GGTase assay. Notably, none of these compounds exhibited inhibition activity in the GGTase-I assay (data not shown). These results suggested that the fingerprint-based similarity search was not effective in identifying novel biological active compounds effectively.
In order to correlate the biological activities to the relevant chemical features, variable selection QSAR methods search for the optimal subsets of descriptors using different algorithms. In current studies, both kNN and PLS methods identified the most relevant descriptors and many of them were found to be the same (cf. Figure S3 of Supporting Information). The descriptors were ranked based on their frequencies of use in models included in the consensus QSAR model. Among the frequent descriptors, the binary nHBint10 descriptor indicates the presence of potential internal hydrogen bonds within the structure (see compounds 15 and 48 22). The SssO descriptor is an integer and represents the sum of the electrotopological state indices for oxygen atoms. Its mean value is 0.935 for 25 out of 48 GGTIs in the modeling set. The Ncarboxylicacid is the group based descriptor, which indicates the presents of carboxylic acid functional groups. The functional groups that are encoded by frequent descriptors could be interpreted as GGTIs’ pharmacophoric elements.
Drug discovery paradigms have been changing rapidly due to advances in high-throughput screening technologies, combinatorial chemistry and computer-aided modeling methodologies. Often, drug candidates were abandoned after a single (or groups of similar) compounds had been found to be of use-limiting bioavailabilities or toxicities. The frequent possibility that a target-specific bioactive compound could have undesired ADME/Tox properties implies that chemically diverse hits should be ideally generated in the beginning of the drug discovery cycle. The state-of-the-art QSAR methodologies that rely on variable selection and extensive model validation have become increasingly more powerful in the areas of drug lead identification and optimization. As we have demonstrated in this study, variable selection QSAR modeling followed by virtual screening could be successfully used to enable the discovery of structurally novel hits. The identification of structurally novel GGTIs will bring us closer to the goal of making a selective GGTI that could also have plausible ADME properties. Fingerprint-based similarity search is another example of a technique that is able to find “remotely-similar” compounds58. However, typical implementations of this approach do not use variable selection (unlike many QSAR methods) to make the results more focused towards target-specific biological activity.
Despite a great interest in GGTIs only a limited number of lead scaffolds have emerged from traditional medicinal chemistry approaches. In this study, we have enabled the discovery of GGTIs with novel scaffolds by building robust QSAR models of training set compounds and then using these models for virtual screening of large chemical libraries. As we have shown in this report, using variable selection kNN QSAR method, we were able to generate more than a hundred of statistically robust models for a dataset including GGTIs of two types of scaffold. Alternative methods used in this study, i.e., ALL QSAR and PLS QSAR methods afforded acceptable models (values greater than 0.60 for both q2 and R2) but kNN produced more models with higher prognostic power.
Mining of the 9.5 million compound screening library for GGTIs using validated kNN, ALL, and PLS QSAR models, resulted in 47 hits with moderate to high predicted activity. The 7 compounds chosen for the highest predicted activity and greatest dissimilarity from the training set showed activity towards GGTase, indicating an apparent 100% success rate. None of the models afforded highly accurate quantitative prediction of the activity of experimentally confirmed hits but kNN models correctly predicted the order of activities. Several of these hits were also shown experimentally to be not only active but highly selective towards GGTase I. Thus, 2D-QSAR modeling was proven to be very efficient for enabling virtual screening of millions of compounds in a rapid fashion and selection of only a very small number of computational hits for the experimental validation. Notably, these novel QSAR hits cannot be obtained by traditional fingerprint based similarity search. The latter was conducted as the control but only yielded highly similar hits to the GGTIs dataset.
Most screening hits shared the 6-5-6 ring scaffold found in most of the training set GGTIs. These were expected as the QSAR/VS is designed to find chemically similar entities. However, several compounds lacking this scaffold were predicted to be GGTIs and were confirmed active experimentally. These results demonstrate that QSAR models can serve as reliable virtual screening tools capable of identifying novel biologically active scaffolds. The modeling strategy described in this report can be applied to many chemical biological systems for which experimental biological testing data for a series of chemicals is available.
Farnesyl diphosphate (Fpp) and geranylgeranyl diphosphate (GGpp) were purchased from Biomol, Inc. (Plymouth Meeting, PA). 3H-GGpp and 3H-Fpp were purchased from PerkinElmer (Boston, MA). The FTI L-744–832 was purchased from Sigma (Saint Louis, MO). GGTI-DU40 was synthesized by the Duke Small Molecule Synthesis Facility.
Protein GGTase-I or FTase activities were determined by following the incorporation of radiolabeled isoprenoid from 3H-GGpp or 3H-Fpp into Ras proteins as described previously59. Briefly, purified mammalian GGTase-I or FTase (50 ng, expressed in Sf9 cells)60 were used to initiate reactions containing 0.5 µM GGpp or Fpp, respectively, and 1 µM of the appropriate purified His-tagged Ras substrates (Ras-CVLL for GGTase-I; H-Ras for FTase). Final DMSO concentration was 2% for all samples. Reactions were carried out for 10 min at 30°C before precipitation and product determination. Nonspecific binding was defined by boiled enzyme and was identical to maximal inhibition by GGTI-DU40 for GGTase-I, and the well-characterized FTI L-744-832 for FTase. The data manipulation and curve fitting were performed using Prism (GraphPad, San Diego CA).
We thank Mihir Shah, Alexander Golbraikh, Carolyn Weinbaum, and Missy Infante for technical support. We also thank Rainbo Hultman for critical evaluation of the manuscript. We acknowledge the access to the computing facilities at the ITS Research Computing Division of the University of North Carolina at Chapel Hill. This work was supported in part by National Institutes of Health research grant F32-GM073420 (Y.K.P), GM46372 (P.J.C.), GM066940 (A.T.), the RoadMap Center planning grant P20-HG003898 (A.T.), and the UNC-CH University Research Council Research Grant A3-12988 (X.S.W.).
The heatmap of self-similarity matrix for GGTIs modeling set, distributions of models for Y-randomization tests, experimental data of GGTIs screening hits FTase activities, chemical structures and pIC50 values for GGTIs modeling dataset and screening hits, purity data for target compounds, and others supplementary data indicated in the text. This material is available free of charge via the Internet at http://pubs.acs.org.