|Home | About | Journals | Submit | Contact Us | Français|
4D-QSAR and 3D-pharmacophore models were built and investigated for the cytotoxicity using a training set of 25 lamellarins against human hormone dependent T47D breast cancer cells. Receptor-independent (RI) 4D-QSAR models were first constructed from the exploration of eight possible receptor binding alignments for the entire training set. Since the training set is small (25 compounds), the generality of the 4D-QSAR paradigm was then exploited to devise a strategy to maximize the extraction of binding information from the training set, and to also permit virtual screening of diverse lamellarin chemistry. 4D-QSAR models were sought for only six of the most potent lamellarins of the training set as well as another subset composed of lamellarins with constrained ranges in molecular weight and lipophilicty. This overall modeling strategy has permitted maximizing 3D-pharmacophore information from this small set of structurally complex lamellarins that can be used to drive future analog synthesis and the selection of alternate scaffolds. Overall, it was found that formation of an intermolecular hydrogen bond and hydrophobic interactions for substituents on the E ring most modulate the cytotoxicity against T47D breast cancer cells. Hydrophobic substitutions on the F-ring can also enhance cytotoxic potency. A complementary high throughput virtual screen to the 3D-pharmacophore models, a 4D-fingerprint QSAR model, was constructed using absolute molecular similarity. This 4D-fingerprint virtual high throughput screen permits a larger range of chemistry diversity to be assayed than the 4D-QSAR models. The optimized 4D-QSAR 3D-pharmacophore model has a LOO cross-correlation value of xv-r2 = 0.947, while the optimized 4D-fingerprint virtual screening model has a value of xv-r2 = 0.719. This work reveals that it is possible to develop significant QSAR, 3D-pharmacophore and virtual screening models for a small set of lamellarins showing cytotoxic behavior in breast cancer screens that can guide future drug development based upon lamellarin chemistry.
In 1985 the first four lamellarins, A-D were isolated from the marine prosobranch mollusk, Lamellaria sp. and their structures determined by an X-ray crystallographic and 1H-NMR study.1 A family of more than 30 lamellarins which consist of three structural groups such as an unsaturated D-ring fused (Figure 1(a)), a saturated D-ring fused (Figure 1(b)), and an unfused central pyrrole ring group (Figure 1(c)) have been isolated and investigated in terms of their biological activity profiles.2,3 These compounds, especially the fused central pyrrole ring lamellarins, have been found to be cytotoxic to a wide range of cancer cell lines. Lamellarins C and U (Tables 1 and and2)2) demonstrate potent cytotoxicity against 10 human tumor cell lines (A549, HCT-116, LOX IMVI, MALME-3M, MCF-7, MOLT-4, OVCAR-3, PC-3, SF-295, UO-31)3 lamellarin D shows potent cytotoxic activity against human prostate cancer cells (DU-145, LNCaP) and leukemia cells (K562),3 and lamellarins I, K, and L exhibit significant cytotoxicity against P388 and A549 cultured cancer cell lines.3 Additionally, lamellarin I and D have an effective cytotoxic activity against multidrug resistant reversal (MDR) cell lines by inhibiting P-glycoprotein (P-GP) mediated drug efflux.4,5 Some lamellarins have also been demonstrated to act on cancer cell mitochondria to induce apoptosis.6,7 Moreover, lamellarin D is an effective stabilizer of human topoisomerase I-DNA covalent complexes, and, thus, capable of stimulating DNA cleavage.2,8,9 Based on its biological actions, lamellarin D was identified as a novel lead candidate by Bailly and coworkers.2,3,6–11
The first reported study on structure activity relationship (SAR) of lamellarin was done by Ishibashi et al. in 2002.12 It was reported that the hydroxyl groups at position C-8 and C-20 were important for cytotoxicity against a HeLa cell line, while the hydroxyl group at C-14 and two methoxy groups at C-13 and C-21 were not essential for activity. The C5-C6 double bond in the D-ring, or planarity of the chromophore, is necessary for activity.5,8,12 In more recent findings, Chittchang et al. not only substantiated the significant contributions of the C5-C6 olefin moiety, as well as the hydroxyl groups at C8 and C20, but also demonstrated the importance of the C7-hydroxy group for the first time.13 These findings were also substantiated by carrying out three-dimensional quantitative structure-activity relationship (3D-QSAR) analyses this past year.14
Treatment of the lamellarins data set is representative of a class of real-world problems in drug discovery; namely how to optimize the extraction of SAR information for, in turn, optimizing lead development efforts from a small number of structurally complex, hard to synthesize compounds that have been tested and observed to exhibit a wide-range of endpoint activity. Modeling such small data sets can be criticized on the basis of too little data to generate reliable, and useful, results to drive lead optimization. Yet doing nothing with the information resident in such data sets obviously contributes nothing to streamlining lead development efforts. A key to resolving this dilemma may reside in the type and level of sophistication of the modeling employed. A high-level modeling approach wherein detailed structural, thermodynamic and electronic information about each complex compound of the data set may negate some of the drawbacks to the small size of the data set. In a sense, ‘quality’ is used to compensate for ‘quantity’.
Multiple complementary applications of the 4D-QSAR paradigm15 may be a good way to extend our knowledge and understanding the structure-activity relationships of lamellarins using this ‘quality’ for ‘quantity’ argument. The fourth ‘dimension’ of the 4D-QSAR paradigm is ensemble sampling the spatial features of the members of a training set.15 This sampling process, in turn, allows the construction of optimized dynamic spatial QSAR models, in the form of 3D pharmacophores, which are dependent on conformation, alignment, and pharmacophore-grouping.16 This method has been proven both useful and reliable for the construction of quantitative 3D pharmacophore models, especially for sets of flexible ligand analogues when the geometry of the corresponding receptor is not known.15–17
Complementary to building 4D-QSAR models that embed 3D-pharmacophores is the construction of high-throughput 4D-fingerprint models for virtual screening. The 4D-fingerprints can be derived independent of any molecular alignment, and are based upon an inductive approach to establish 4D molecular similarity measures across any collection of chemical compounds.18 The 4D-QSAR paradigm has been successfully applied for a variety of chemical classes and biological endpoints including glucose analogs, flavonoid analogs, propofol analogs, the AHPBA and THP inhibitors of HIV-1 protease,18 human serum albumin (HSA),19,20 a local lymph node assay (LLNA) data base,21,22 skin penetration enhancers,23 and HIV-1 integrase inhibitors.24
The search for structure-activity relationships and/or pharmacophores for natural, and synthetic, lamellarins screened for cytotoxic activity against 11 cancer cell lines is on-going. However, the prominent activities observed in a few of the lamellarins screened against human hormone-dependent T47D breast cancer cells seemed to us to be best explored, and the corresponding SAR delineated and exploited, by using the 4D-QSAR methodology for the reasons cited above.
Twenty six lamellarins were analyzed in this work. The chemical structures of all 26 lamellarins are given in Table 1 and and2.2. These compounds were synthesized and purified by Ploypradith et al.25 The cytotoxic activity (−logIC50) against human hormone-dependent T47D breast cancer cells have been recently reported and are included as parts of Tables 1 and and22.13
Since the geometry of the receptor is not available in this study, the receptor-independent form of 4D-QSAR analysis, referred to as RI-4D-QSAR, has been employed. The ten operational steps in RI-4D-QSAR have been presented in detail previously,15 and also given in the 4D-QSAR software version 3.0 User Guide.26 Therefore, these 10 steps of RI-4D-QSAR analysis are only summarized here as follows:
An initial 3D structure of each lamellarin was constructed in the neutral form using the HyperChem 7.5 software.27 Partial atomic charges were computed using the semiempirical AM1 method. Each structure was then minimized with no geometric constraint. These energy-minimized structures were used as the initial structures in the conformational ensemble sampling of step 3.
Atoms of each molecule were classified into seven types of interaction pharmacophore elements (IPEs). Each type is represented by different number code from 0 to 6 as defined in Table 3.
Molecular dynamics simulations (MDS) was used to sample the conformational states available to each analogue, and to generate its corresponding conformational ensemble profile (CEP). The MDSs were done using the MOLSIM package28 and MM2 force field.29,30 The temperature for the MDS is set at 300 K with a simulation sampling time of 40 ps with intervals of 0.001 ps for a total sampling of 40000 conformations of each lamellarin compound. The atomic coordinates of each conformation and its intramolecular energy sampled during the MDS were recorded every 0.02 ps for a total of 2000 “frames”, or steps, in constructing the CEP of each compound.
The set of three-ordered atoms in trial alignments are defined in Table 4. In this study eight alignments were explored across the overall lamellarin core structure.
Each conformation of a compound from its CEP was aligned in the grid cell lattice using the invariant coordinates of the three-ordered atom alignment. In this study, the size of the cubic grid cells of the lattice are 1 Å3, and the overall grid cell lattice size was chosen to fully enclose each compound of the training set. The normalized occupancy of each grid cell by each IPE atom type over the CEP for a given alignment forms a unique set of QSAR descriptors referred to as grid cell occupancy descriptors, GCODs. The GCOD descriptors were computed, and used as the trial descriptor pool in 4D-QSAR analysis. Non-GCOD descriptors of the training set compounds can also be included in the trial basis set (descriptor pool). In this particular study the logarithm of the 1-octanol/water partition coefficient (log P) and the compound’s molecular weight (MW) were selectively added to the trial basis set descriptors in some of the model building studies. The log P and MW values the training set compounds are reported in Table 1.
A 4D-QSAR analysis generates an enormous number of trial QSAR descriptors, GCODs, because of the large number of grid cells and the seven IPEs. Partial least squares (PLS) regression analysis31 is used to perform a data reduction analysis between the observed dependent variable measures and the corresponding set of GCOD values.
The most highly weighted PLS GCOD descriptors (currently the top 200), generated in step 6, are used to form the trial descriptor pool for genetic algorithm (GA) model optimization. The specific GA currently used in the 4D-QSAR software is modification of the genetic function approximation (GFA).32 The GFA optimization is initiated using N (currently 300) randomly generated 4D-QSAR models. Mutation probability over the crossover optimization cycle is set at 10%. The smoothing factor, a GFA operations variable, controls the number of independent variables in the QSAR models, is varied in order to determine the optimal number of descriptors for the 4D-QSAR models. The diagnostic measures used to analyze the resultant 4D-QSAR models generated by the GFA include (i) descriptor usage as a function of crossover operation, (ii) linear cross correlation among descriptors and/or dependent variables (biological activity measures), (iii) number of significant and independent 4D-QSAR models, and (iv) indices of model significance including the correlation coefficient, r2, leave one-out, LOO, cross-validation correlation coefficient, xv-r2, and Friedman’s lack of fit (LOF).33 In this particular 4D-QSAR application, the alignment similarity comparisons were limited to models having same number GCODs.
Steps 4–7 are repeated until all trial alignments are included in the 4D-QSAR analyses.
The inspection and evaluation of the population of models are obtained from the set of trial alignments in this step. The goal of this step is to identify the best and distinct set of 4D-QSAR models which is referred to as the manifold model of the analysis.
The “active” conformation of each compound is hypothesized at this step. This conformer is achieved by identifying all conformer states sampled for each compound that are within ΔE of the global minimum energy conformation of the CEP. Currently, ΔE is set at 2 kcal/mol. Each member of the resultant set of energy-filtered conformations is then individually evaluated in the best 4D-QSAR model. The conformation within 2 kcal/mol of the apparent global minimum that predicts the highest activity in the best 4D-QSAR model is defined as the active conformation.
The theory and corresponding methodology of the universal 4D-fingerprints for constructing the main distance-dependent matrix (MDDM) and computing corresponding eigenvalues for each matrix, using 4D molecular similarity (MS), have been presented in detail in previous work.18,34 The types of atoms composing a molecule are currently defined as the IPEs shown in Table 3. A unique MDDM is constructed for each of the eight distinct and identical IPE pairs. The elements of the MDDM are defined as following:
The “universal constant (ν)” in eq. 1, which is equal to 0.25,34 has been selected such that the difference in the sum of eigenvalues for any two arbitrary compounds with the same number, n, of a particular IPE type, m, is maximized. The term dij is average distance between the atom pair ij of IPE type u and v.
where p(k) refer to the thermodynamic probability of the kth conformer state sampled in the assessment of conformational flexibility, and dij(k) is the corresponding distance between atom pair i and j of IPE types u and v for the kth conformer state. Then, similarity eigenvalues are derived by the diagonalization of the MDDM. For same-term IPE pairs, such as u = v, the MDDM are square upper/lower triangular. These matrices can be directly diagonalized. The resulting eigenvalues determined from the MDDM are normalized and ranked in numerically descending order in their eigenvector representation. The nth normalized eigenvalue for IPE type m of a compound α,mn(α), can be obtained by scaling the non-normalized eigenvalue mn′ (α) relative to the rank of its MDDM.
Determination of eigenvalues of the MDDM for u ≠ v, the so-called cross-terms for IPE pairs that are not the same, requires a different strategy since these matrices may, or may not, be square. In the case of rectangular MDDM (u ≠ v), the following square MDDM are constructed
For MDDM(u,u) and MDDM(v,v) have the same rank and trace, both have the same set of eigenvalues. Hence, for each pair of IPE (u ≠ v)
According to all possible combinations of the eight IPE types, there are 36 possible molecular similarity eigenvectors from the MDDM for each compound α. The similarity eigenvectors have been calculated for the set of compounds, the estimation of molecular similarity for a pair of compounds α and β begins with a definition for molecular dissimilarity, given by
where i = ith eigenvalue in the corresponding eigenvetor of a specific IPE pair. Molecular similarity is then defined as
where ϕ = |rank(α) − rank(β)|/(rank(α) + rank(β). The rank of the matrices is essentially the number of atoms of a specific IPE type present. The ϕ term in eq. 8 serves to reincorporate molecular size information. Similar to the measure for dissimilarity, the similarity measure is a value between 1 and 0, where a value closer to 1 refers to compounds that are more similar, and closer to 0 refers to compounds that are more dissimilar.
The descriptor set for α consists of all of the eigenvalues of all of the eigenvectors derived from all of the MDDM for compound α. In this work, a threshold cutoff value which equal to 0.002 is applied, and those normalized eigenvalues below the threshold value are disregarded.
The maximum number of significant eigenvalues specific to that data set for a particular compound and a particular IPE type, m, is determined, m,max. All the eigenvectors for IPE type, m, for each molecule across lamellarin data set are then assigned m,max eigenvalues for IPE type m. Eigenvectors that otherwise contain less than m,max elements have the “missing” eigenvalues set to zero.
The total set of descriptors, total, for a compound in the data set will be the sum of the 36 eigenvalues of m,max length which can be a large number for the data set in this work.
Finally, the sets of 4D-fingerprints across each of the molecules of the training set form the trial descriptor pool to build the 4D fingerprint virtual screens. The building procedure of these virtual screens is identical to that employed in constructing the RI-4D-QSAR models. That is, steps 6 through 9 given above for the RI-4D-QSAR methodology are used.
Optimized RI-4D-QSAR models were constructed for each of the eight trial alignments listed in Table 4. Alignments 1, 2, 4, and 7 contain atoms from two rings (A and B), (B and C), (C and D), and (C and F), respectively. Alignment 5 and 6 only contain atoms from ring E and ring F, respectively. Only two alignments, 3 and 8, distribute the three-ordered atoms across three rings namely rings A, B, and C for alignment 3 and rings A, E, and F for alignment 8. The r2 and xv-r2 values from the best corresponding five-term RI-4D-QSAR models of each alignment are given in Table 4. Five terms in a model corresponds to the largest model that can be built by allowing at least 5 observations [compounds] per model-term for the training set. The optimized 5-term model represents an initial upper-bound exploration of the type, and corresponding quality, of an RI-4D-QSAR model that can be expected from the structure-activity data set. Alignment 1 yields the poorest fits with r2 = 0.964 and xv-r2 = 0.929. The differences among r2 and xv-r2 of the remaining alignments are quite small, or the alignment of lamellarin is not significant to the 4D-QSAR model. However, based on the greater r2 (0.999) and xv-r2 (0.998), alignment 3 appears to be the best alignment for 4D-QSAR analysis of lamellarin data set.
The optimum number of descriptors in a model is determined by monitoring when xv-r2 becomes effectively constant, or decreases, with increasing model size. Figure 2 is a plot of the number of descriptor terms in an optimized alignment 3 model versus the corresponding r2 and xv-r2. An inspection of Figure 2 reveals that the maximum number of descriptor terms in the RI-4D-QSAR model providing additional fit to the training set data is three. There is no meaningfully enhanced model fitting by including more than three descriptor terms. Thus, the optimized RI-4D-QSAR model for the 25 lamellarins generated from alignment 3 is given by eq. 9. Among top-ten 4D-QSAR models obtained from alignment 3, eq 9 (or model 3) is the best 4D-QSAR model since it has the highest xv-r2, and all other top-ten models are basically the same as model 3. This commonality to model 3 by the other top-ten models can be inferred from Table 5 by the high cross-correlations of their residuals of fit to those of eq. 9;
GCi (x, y, z, X) is the ith GCOD descriptor term located at (x, y, z) in the reference grid cell and alignment space, and having the X type IPE as defined in Table 3. Figure 3 is a plot of the predicted, using eq. 9, versus actual −logIC50 values. All of the predicted −logIC50 values are within ± 1 log unit of the corresponding observed values, and there are no outliers.
Two GCODS (GC1 and GC3) of eq. 9 correspond to pharmacophore sites of nonpolar atom occupancy, both of which increase potency. These two GCODS both have positive regression coefficients with values of 16.90 and 64.62, respectively. GCOD GC2, having an ‘any’ IPE type, has a negative regression coefficient with value of −56.33. Consequently occupancy of the GC2 site by any type of atom will lead to a decrease in the potency of anti-breast cancer activity of the corresponding lamellarin. From an analysis of eq. 9 it is found that the any IPE type at (−3, 4, −5) has about three times more of a negative effect upon −logIC50 than the positive effect of the nonpolar IPE type at (−5, 6, 2), and about the same, but opposite effect on −logIC50 as the nonpolar IPE type at (−1, 5, 0). None of the best models from GFA model optimization contain GCOD descriptors which deal with specific atom-atom interactions like hydrogen bonding.
In order to further search for pharmacophore sites which are specifically associated with lamellarins exhibiting high cytotoxic activity, an additional RI-4D-QSAR analysis was carried out. The training set of this study was limited to the six lamellarins (D, M, N, X, ε, and Dehydrolamellarin J of Table 2) that have the highest −logIC50 values, and are not redundant in their structural features. The RI-4D-QSAR models were constructed and optimized by using the same methodology and alignment used to build eq. 9. The best RI-4D-QSAR model from this small high activity data set of lamellarins is given by eq. 10.
The regression coefficients of the descriptors of eq. 10 suggest placing any type of atom at (−1, 1, −6) has about 30 times more negative effect on −logIC50 than the positive gain by locating a nonpolar atom or group at (−1, 4, −6). Certainly eq. 10 is, or borders upon, being an over-fit model. However, eq. 10 and its 3D-pharmacophore are only used as adjuncts to eq. 9 and its 3D-pharmacophore. That is, eq. 10 is being used to provide a higher-resolution view of the SAR features most characteristic of the high activity lamellarins of the training set. Equation 9 and its 3D-pharmacophore are used outside that context.
The 3D-pharmacophores defined by eqs. 9 and 10 are shown in Figures 4(a) and 4(b), respectively. The reference structure superimposed on each of the 3D-pharmacophores in these two figures is the predicted active conformation of the most active compound (lamellarin D) using eq. 9. The red spheres in Figures 4(a) and 4(b) represent those GCOD descriptor terms which have negative regression coefficients. Correspondingly, the blue spheres delineate GCOD descriptors having positive regression coefficients in the corresponding best RI-4D-QSAR equation. From an inspection of Figure 4(a), a red sphere near the R2 and R3 groups specifies a pharmacophore site where occupancy by any type of atom, or group, decreases potency since the corresponding regression coefficient −56.33. Two blue spheres are found near R4 and R5 suggesting that substitution of nonpolar groups to occupy one, or both, sites is conducive to increasing the cytotoxic activity of the lamellarins.
The 3D-pharmacophore of the high activity model, eq. 10, is represented by one red sphere (GCOD) located around R1 and R2, and a blue sphere (GCOD) positioned near R2 and R3. The most active compounds of the potent lamellarins seemingly achieve most of their additional −logIC50 potency, as compared to the less potent lamellarins, by not having any atoms or groups at (−1, 1, −6) in contrast to increasing occupancy by nonpolar atoms or groups at the GCOD located at (−1, 4, −6). The 30:1 ratio of not occupying the GCOD at (−1, 1, −6) as compared to having a nonpolar atom or group at (−1, 4, −6) is consistent with the relative binding energy contributions of an intermolecular hydrogen bond involving the OH near (−1, 1, −6) as compared to a hydrophobic binding effect due to the methyl of the methoxy group near (−1, 4, −6) as is shown in Figure 4(b).
Overall, the highly active compounds are seemingly distinguished from one another in eq. 10 by their ability to form an intermolecular hydrogen bond where the hydrogen bond acceptor atom in the receptor is expected to be near (−1, 1, −6). Some additional increase in −logIC50 can also be realized by having a hydrophobic substituent group of the ligand occupying the (−1, 4, −6) site. The two GCODs of eq. 10 may be a higher resolution representation of the single GC2 (−3, 4, −5, any) GCOD found in eq. 9.
In order to evaluate the possible roles of ligand molecular weight, MW on cytotoxic potency, −logIC50, this property were included as part of the trial basis set of descriptors in a GFA model optimization study. Unfortunately, no GFA model optimization could be realized. An inspection of the MW value of the training set compounds revealed that three lamellarins (lam K-triacetate, lam χ-triacetate, and lam U-diacetate) have very high MWs relative to the other training set compounds. These three lamellarins were removed to form a revised training set. Two lamellarins (lam F and K) were defined as a test set. GFA model building and optimization repeated for this 21 compound training set in the same manner as employed in developing eqs. 9 and 10. Ten best models were determined from the GFA optimization, and the residuals of fit cross-correlations between each pair of these models are given in Table 6. All pairs of the top-ten models have residuals of fit highly correlated to one another, with a value of at least 0.70, indicating these 10 models are all very nearly the same model. Therefore, the best of the ten models was selected as the preferred RI-4D-QSAR model for this training set, and is given by eq. 11.
Figure 5 is a plot of the observed versus the predicted −logIC50 values determined from using eq. 11. The 3D-pharmacophore embedded in the RI-4D-QSAR model given by eq. 11 is shown in Figure 6 with lamellarin D again the reference compound. All three GCOD descriptors of eq. 11 correspond to pharmacophore sites where an increasing occupancy decreases activity. One pharmacophore site, (−3, 4, 5, any) from eq. 11, is identical to a site from eq. 9, while the pharmacophore site at (−2, 1,−6, np) from eq. 11 is very close to the pharmacophore site of eq. 10 located at (−1, 1,−6, any) as can be seen by comparing Figure 6 to Figures 4(a) and 4(b). The third pharmacophore site of eq. 11 located at (3, 3, 2), which predicts the occupany of nonpolar groups to decrease −logIC50, is unique to this model as compared to eqs. 9 and 10. This new GCOD descriptor term of eq. 11 and the decrease in r2 and xv-r2 may be an indication of a significant pharmacophore-site dependence on one, or more, of the four lamellarins eliminated from the training set used to build eq. 11 and its corresponding 3D-pharmacophore.
An attempt was made to further explore if log P plays a role in the structure-activity relationship of the lamellarin training set by forcing an overfitting in the GFA model building and optimization process. The log P descriptor was the only non-GCOD descriptor added to the trial basis set (descriptor pool) at step 5 of 4D-QSAR methodology. Overfit RI-4D-QSAR models were permitted under the same methodology, same alignment, and for all lamellarins in training set as used to develop eq. 9. None of the 10 most significant overfit 4-term or 5-term RI-4D-QSAR models contained a log P descriptor term. Therefore, it was concluded that molecular lipophilicity is not a major contributing factor in the specification of the cytotoxic activity for the lamellarins studied in this analysis.
The predicted −logIC50 of lamellarin F calculated by using eq. 11 is 5.74. This value are very close to actual −logIC50 value of 5.34. The RI-4D-QSAR model obtained by removed out high MW compounds showed a good predict the activity of lamellarin F. Lamellarin K was synthesized and tested after the 4D-QSAR models reported in this paper were constructed. However, lamellarin K has an unexpected high activity [−logIC50 = 7.04] for the saturated D-ring series of compounds. Hence, it was thought important to see if this high activity could be predicted by the 4D-QSAR models, or if this saturated D-ring analog had features outside those captured by the models. The predicted −logIC50 values of lamellarin K obtained from eqs. 9, 10 and 11 are 5.33, 9.77 and 6.00, respectively. Thus, the 4D-QSAR models developed in this study cannot well-predict the activity of lamellarin K, but their composite set of predicted activities bracket around the observed activity. Moreover, while the individual models did not adequately predict the experimental endpoint, it is to be noted that the average of these three predicted values is 7.03 which is a value very close to the experimental −logIC50 value of 7.04.
Lamellarin K has a unique three hydroxyl substituent pattern at R1, R4 and R7. However, other analogs in Table 1 have three hydroxyl substituents, and some analogs without hydroxyl substutuents are more active than those with three hydroxyls, for example, compare lamellarin χ triacetate (5.54) to lamellarin E (5.28) in Table 1. All of the best 4D-QSAR models, eqs. 9, 10 and 11 are rich in GCOD terms involving nonpolar IPE types. Polar and/or hydrogen bonding capabilities from hydroxyl groups are not explicitly present in the descriptor terms of the 4D-QSAR models. All of these observations, in composite, suggest that the unique hydroxyl substituent pattern of lamellarin K make it the ‘magic bullet’ in terms of high inhibition potency relative to the other saturated D-ring analogs of Table 1.
4D-fingerprint virtual high throughput screens permit a larger range of chemistry diversity to be assayed more quickly than do RI-4D-QSAR models. In this study 4D-fingerprints models were generated using all 25 of the lamellarins in the training set. Lamellarin K was used as a modest means to validate 4D-fingerprints model as well as RI-4D-QSAR models. Two types of 4D-fingerprints can be constructed: those 4D-fingerprints explicitly dependent upon a particular alignment, and absolute 4D-fingerprints which are alignment independent.35 Absolute 4D-fingerprints were used in this analysis to maximize the range of lamellarin chemical diversity that could be reasonably screened. That is, a 4D-fingerprint screening model built independent of alignment is more general than its corresponding alignment-dependent screen, but at the cost of being somewhat less significant in its fit to the training set data.
The absolute 4D-fingerprints were derived for each of the 25 training set lamellarins using the modeling methodology given above in the Methods section. These 4D-fingerprints formed the trial basis set for model building. No non-4D-fingerprints were added to this trial descriptor pool. Model building and optimization in deriving the 4D-fingerprint QSAR equations, which are the high-throughput virtual screens, was carried in the identical fashion used to build the RI-4D-QSAR models.
Figure 7 is a plot of number of descriptor terms in a 4D-fingerprint model versus r2 and xv-r2. The xv-r2 of the 4D-fingerprints of the 4- and 5-term models are very nearly the same, and xv-r2 behaves in something of an erratic fashion for models having 5, or more, descriptor terms. The optimized 4-descriptor term virtual screening model appears, on the basis of xv-r2, to capture maximum fitting to the training set data without overfitting. Thus, the 4-term QSAR model given by eq. 12 was selected as the preferred absolute 4D-fingerprint virtual screen. Equation 12 is the best 4-term model from the top-ten 4-term models derived in the GFA optimization. Table 7 shows the linear cross-correlation matrix of the residual of fit for the top-ten 4-term models. This table reveals that all pairs of models have highly correlated residuals of fit, greater than 0.85, to one another. Thus, eq. 12 represents the best and only distinct fit to the training set data using absolute 4D-fingerprints.
For reference in defining the 4D-fingerprints, 7(any,np) represents the seventh largest eigenvalue from the MDDM of the IPEs u = (any) and v = (np) molecular similarity vector capturing all pairs of atoms in each lamellarin assigned IPEs of any and nonpolar, respectively.
The relative significance and weight of each 4D-fingerprint descriptor term in eq. 12 was measured in terms of its frequency of use in the GFA model optimization process. The idea in monitoring frequency of use is that the more significant is a descriptor to establishing a fit to the training set data, the more often it will be used in the repetitive GFA optimization process. The frequencies of descriptor usage during GFA optimization are shown in Table 8. An inspection of Table 8 indicates that 11(any,hs) and 7(any,np) are the first and second important features governing the SAR of lamellarin cytotoxicity potency, respectively. Increased potency of the lamellarins arises from increasing the values of 11(any,hs) and/or 3(p+,aro), while a decrease in lamellarin cytotoxicity accompanies an increase in the values of the 7(any,np) and 2(np,hs) 4D-fingerprints. Figure 8 is a plot of −logIC50 values predicted using eq. 12 versus the corresponding observed −logIC50 values.
The predicted activity of lamellarin K, the test compound, using eq. 12 is 7.33 which differs from the observed activity of 7.04 by only 0.29 log unit. Additional −logIC50 predictions using eq. 12 were made for a small virtual library of eight lamellarin derivatives, see Table 9, generated by making substitutent changes at R1-R5. These results indicate that the 4D-fingerprint model is responsive to predicting −logIC50 values over a wide −logIC50 potency range from nearly inactive values for ML6 to very potent activities for lamellarins ML4, ML5, and ML7. Equation 12 also has captured the SAR that both the number and positioning of –OH on the E-ring is a critical factor to potency. In general, more hydroxyls on the ring are better. But the importance of hydroxyl positioning, particularly at R3 is dramatically shown for ML5, the most active analog in Table 10 [10.05], as compared ML6, the least active analog [3.03] which differs only from ML5 by having no hydroxyl at R3.
The RI-4D-QSAR model given by eq. 9 with three descriptor terms is a more significant fit to the training set data (xv-r2 = 0.947 and r2 = 0.971) than the four descriptor 4D-fingerprint model given by eq. 12 (xv-r2 = 0.719 and r2 = 0.831). Presumably the inclusion of alignment information in eq. 9 provides this boost in the overall fitting quality of this model as compared to eq. 12. But eq. 12 in not being dependent on alignment correspondingly permits a wider range of variations lamellarin chemistry to be considered. Table 10 is the linear correlation matrix of the residuals of fit of eq. 9, the RI-4D-QSAR, to eq. 12, the absolute 4D-fingerprint virtual screen, as well as correlations of both models to the observed −logIC50 cytotoxicity values. The correlation coefficient of 0.797 between the residuals of fit for eqs. 9 and 12 indicates that these two models are basically the same, but eq. 9, owing to inclusion of alignment, fits the training set better, overall, than eq. 12.
Comparison of the predicted inhibition potencies from the 4D-fingerprints QSAR virtual screening model to the (RI)-4D-QSAR models was also investigated using ML5 and ML6, the most and the least potent compounds given in Table 9. The predicted −logIC50 values of ML5 obtained from the (RI)-4D-QSAR models by eqs. 9 and 11 are 5.39 and 9.20, respectively, and the −logIC50 values of ML6 obtained from the two equations are 7.44 and 10.23, respectively. It was found that there is an agreement in prediction only for ML5 between the (RI)-4D-QSAR model (9.20 by eq. 11) and the 4D-fingerprint QSAR model (10.05 by eq. 12).
This work puts forth a ‘quality in place of quantity’ strategy to handle small data sets composed of structurally complex, hard to synthesize compounds that can exhibit a wide-range in endpoint activity. A high-level modeling approach providing detailed structural, thermodynamic and electronic information about each complex compound of the data set is used to negate the lack-of-data drawbacks to the small size of the data set. In this study the flexibility, yet high-level of modeling sophistication of the 4D-QSAR paradigm is used to explore different subpopulations of the data set in extracting the maximum SAR information from the data set in terms of a pseudo consensus RI-4D-QSAR model and its corresponding 3D-pharmacophore. The consensus aspect to the RI-4D-QSAR modeling arises from the fact that the same methodology and parameters, including alignment, can be used in any manner across any subpopulations of the data set. As such, all resulting models are not only directly comparable, but to an appreciable extent can be combined to elucidate a high-resolution 3D-pharmacophore. In addition, the 4D-fingerprint formulation of the 4D-QSAR paradigm permits alternate model generation, particularly useful in virtual screening. Still, the 4D-fingerprint models are once again directly comparable to the RI-4D-QSAR models so as to exact additional information from the data set, as well as to evaluate the self-consistency across all the models constructed.
The consensus set of 4D-QSAR models expressed by eqs. 9–12, suggest that the ability to form a ligand-receptor intermolecular hydrogen bond and hydrophobic interactions for substituents on the E ring most modulate the cytotoxicity against T47D breast cancer cells. The optimization of this intermolecular hydrogen bond, and, to a lesser extent, the hydrophobic interactions, are coupled to the alignment freedom of a lamellarin owing, in turn, to other possible substitutions across the molecule and their possible interactions with sites on the receptor.
Hydrophobic substitutions on the F-ring can also enhance cytotoxic potency, but given that the 3D-pharmacophore sites for these interactions arise for the entire data set, and not the restricted high activity data subset, would indicate these are likely minor binding pharmacophore sites. Attempts to force the lipophilicity of the entire lamellarin into a 4D-QSAR model were unsuccessful. Thus, the finding of 3D-pharmacophore sites, where occupancy by nonpolar atoms and/or groups can modulate activity, likely reflect specific interactions at these sites, and not global lipophilic features of the lamellarins.
Lamellarin K, synthesized and tested after the modeling studies reported here were carried out, likely has its very high activity relative to other saturated D-ring analogs because of its unique three hydroxyl group substituent pattern. The average predicted −logIC50 value developed in this study sufficient predicts the activity of lamellarin K. This suggests that in order to get a high-resolution 4D-QSAR model to distinguish some substituent patterns from others for the saturated D-ring lamellarins analogs, more lamellarins data set is required.
The 4D-fingerprint virtual screening model, eq. 12, is highly consistent with the general RI-4D-QSAR model given by eq. 9. Consequently, eq. 12 can be used to rapidly screen prospective compounds without concern for alignment, but with the expectation that the 3D-pharmacophore of eq. 9 will be relevant to helping understand findings from virtual screenings. A good test to evaluate how much SAR information is actually captured in eq. 12 as a virtual screening tool, given it is based upon this relatively small training set of lamellarins would be to make and test ML5 and ML 6 of Table 9. These two compounds are predicted to differ by seven orders of magnitude in −logIC50 values, yet they differ by at their respective R3 substituents. A large difference in measured −logIC50 values would help to validate eq. 12, while a small difference would suggest that eq. 12 has very limited resolution in correctly explaining small structural differences in the lamellarins.
The Thailand Research Fund (RTA5080005) and the Commission on Higher Education (CHE) are gratefully acknowledged for research grant. We are grateful to Faculty of Science and the Graduate School of Kasetsart University, Phetchaburi Rajabhat University, the Chulabhorn Research Institute (CRI), the Chulabhorn Graduate Institute (CGI), and the College of Pharmacy, University of New Mexico for research supporting. The National Center of Excellent in Petroleum, Petrochemical Technology and Advanced Materials under Ministry of Education, KU Research and Development Institute, Laboratory for Computational and Applied Chemistry, NANOTEC Center of Excellence at KU, and the Large Scale Research Laboratory of the National Electronics and Computer Technology are also acknowledged for computing and research facilities. This work was also supported by the National Institutes of Health, USA through the NIH Roadmap for Medical Research, Grant 1 R21 GM075775. Information on Novel Preclinical Tools for Predictive ADME-Toxicology can be found at http://grants.nih.gov/grants/guide/rfa-files/RFA-RM-04-023.html. Links to nine initiatives are found here http://nihroadmap.nih.gov/initiatives.asp. This work was also supported, in part by The Proter & Gamble Company. Resources of the Laboratory of Molecular Modeling and Design at UNM, and at The Chem21 Group, Inc., were used in performing these studies. We also thank the reviewers for the valuable comments.