2.1. Atom pairs and their one-way ANOVA selection
While a variety of molecular parameters can be used in the computational methods for (Q)SAR analysis [18
], some of these parameters are complex physicochemical or geometrical 3D descriptors whose calculation is associated with difficulties conditioned by molecular flexibility and adequate sampling of conformational space. Conversely, topological indices, or 2D descriptors, obtainable from the structural formula of a compound are very attractive because of their simplicity. A reasonable compromise between ease of interpretation and ease of computation was reported by Carhart et al. [20
], who introduced atom pair descriptors as features of the environments of all atoms in the 2D representation of a chemical structure. This approach has been widely used in the context of fragment-based similarity searches and database mining [21
]. Here, we applied atom pair descriptors to represent the selected molecules. The use of an atom type naming scheme from MM+ force field, as implemented in HyperChem software, is a significant feature of our approach, as it represents more sophisticated atom typing than used in previous studies [20
]. This scheme assigns specific names to a given atom, depending on its surroundings in a molecule.
The 2D structures of 71 compounds (training set) that included 10 FPR1-specific agonists, 36 FPR2-specific agonists, and 25 non-active compounds (, see Materials and Methods
) represented as a set of HIN files were used by our CHAIN program to generate a table of atom pairs. This program, which finds all possible paths between labeled atoms in a molecular structure, identified 726 unique atom pairs among the 71 compounds. Hence, a matrix consisting of 71 lines and 726 columns was generated, with each line containing the number of times a given atom pair was present in each molecule. Since this number of columns is too large for SAR analysis, we performed a step-by step selection of atom pairs to reduce the matrix size. This sequence of steps included one-way ANOVA, linear discriminant analysis, and binary classification tree analysis. These methodologies have a high applicability for variable pre-screening in SAR [26
] and were also used successfully in our previous studies [22
Structure and receptor specificity of compounds under investigation
For the first selection step, we applied one-way analysis of variance (ANOVA) [29
] to select descriptors with significant differences between total and within-class variances. As a result of ANOVA selection, 565 descriptors were filtered out, while the remaining 161 significant atom pairs were retained for further analysis. It is reasonable to compare the distribution of initial and ANOVA-selected descriptors in terms of bond separation (i.e., the number of bonds between atoms in a given atom pair). It should be noted that the relative distribution of initial and ANOVA-selected descriptors, in terms of bond separation, was similar (). Although a greater proportion of the original atom pairs was retained after ANOVA selection for pairs with the largest bond separation (i.e., 17–18 bonds), these atom pairs are rare in the data set and have negligible statistical impact on the total SAR analysis. Since the relative distribution of atom pairs by bond separation was not changed after ANOVA, it appears that molecular shape peculiarities do not have a major influence on biological activity of these FPR agonists. In contrast, the dumb-bell shape common to inducers of macrophage tumor necrosis factor-α production led to a significantly higher fraction of “longer” atom pairs among ANOVA-selected descriptors [23
Figure 1 Comparison of initial and ANOVA-selected atom pairs in compounds 1-71. The numbers are shown for each of the indicated bond separations initially generated for 71 compounds from (light bars). Atom pairs subsequently selected by ANOVA as having (more ...)
2.2. Cluster and linear discriminant analyses
The second step of variable selection consisted of finding clusters of highly correlated descriptors. Subsequently a single variable from one cluster can be regarded as independent, whereas the other dependent descriptors of such a cluster can be excluded from further calculations. Using the 161×161 matrix of correlation coefficients for atom pairs selected by ANOVA, we chose 28 clusters of variables (). Each variable is highly correlated (r≥0.9) with at least one variable from the same cluster. Descriptors with longer bond separations were taken as representative variables (shown in bold italic in ). Hence, instead of 102 atom pairs included in , only 28 atom pairs were retained. These variables were combined with the 59 remaining atom pairs that were not highly correlated, resulting in a set of 87 atom pair descriptors selected for further calculations.
Clusters of highly correlated atom pairs (r≥0.9)
The high correlation coefficient between values of descriptors implies that these atom pairs are simultaneously present in most compounds in the data set. This usually occurs when atom pairs are produced by certain molecular features or scaffolds. The features may be very simple. For example, Cluster 14 descriptor BR_6_CO corresponds to the presence of a bromine atom and carbonyl carbon separated by 6 bonds. The same substructure also contains another atom pair from this cluster (BR_7_O1) corresponding to a carbonyl oxygen and bromine separated by 7 chemical bonds. More populated sets of correlated descriptors are produced by multi-atomic scaffolds. For instance, 11 descriptors of Cluster 2 originate from the 2-(benzimidazol-2-ylsulfanyl)-N-phenylacetamide scaffold common to compound AG-09/25 and its analogs. Thus, clustering atom pairs according to their mutual correlation not only decreases the number of variables but also provides a rational way to interpret SAR results in terms of chemical features and building blocks, which are much more complex than the atom pairs themselves. It should be noted that lowering the correlation coefficient threshold from 0.9 to 0.8 gave rather heterogeneous clusters of correlated descriptors usually not associated with distinct chemical substructures. Although the composition of several clusters remained the same, some were condensed to larger clusters at r≥0.8 by inclusion of additional atom pair descriptors (). On the other hand, the adopted threshold of 0.9 for correlation between a given variable and at least one variable from the same cluster provides high mutual correlation of all variables in this cluster. For example, each pair of descriptors among the 13 variables of Cluster 1 () is characterized by an r value greater than 0.85.
Figure 2 Schematic representation of clusters obtained at different correlation coefficient thresholds. Values in black circles correspond to the enumeration of clusters at r≥0.9 (). Red circles show clusters obtained at r≥0.8. The number (more ...)
The 87 atom pairs selected after the two steps described above were used as an input variable set for linear discriminant analysis (LDA). The LDA procedure was applied with the option of “forward stepwise” inclusion of variables, as implemented in STATISTICA 6.0 software. The descriptors were added to the model if their inclusion led to a significant improvement in classification (p<0.05). We found that 17 of the 87 atom pairs were sufficient for good LDA classification of agonists, with 68 of the 71 compounds (95.8%) classified correctly as FPR1, FPR2, or NA ().
Figure 3 Classification results of linear discriminant analysis (LDA) (Panel A) and binary classification tree analysis (Panel B) versus experimental classes of compounds investigated. The LDA was based on either 17 or 9 atom pairs from the best subset, and binary (more ...)
The LDA model with 17 atom pairs derived on the third step of variable selection was further simplified after an additional run of LDA with the “best subset search” option. The number of atom pair descriptors was decreased from 17 to 9 without loss of quality of the model (accuracy was the same using either 17 or 9 descriptors). This relatively simple LDA model obtained on the fourth step of variable selection can be expressed by the following three classification functions:
The number of corresponding atom pairs in a given compound should be used as values of descriptors for calculation of functions (1–3). One of the classes (FPR1, FPR2, or NA) is then attributed to the compound according to the maximum value among these functions.
contains calculated classes and results of leave-one-out (LOO) prediction for the entire series of compounds. All three compounds with incorrectly calculated classes (AG-09/41, AG-09/95, and AG-09/102) were inactive, while the LDA model (1–3) and LOO prediction classified them as having FPR2 activity. LOO cross-validation correctly classified 63 of 71 compounds (88.7% accuracy). This can be considered as good quality of prediction, taking into account that the model with 9 descriptors was derived based on 71 molecules in the training set. For the subset of FPR1 agonists, the fraction of correct LOO predictions was expectedly lower (70%) because of the relatively small number of compounds with FPR1 activity included in the series under investigation. This was conditioned by the low number of non-peptide FPR1-specific agonists reported in the literature.
Experimentally determined, SAR-calculated, and LOO-predicted classes of FPR1/FPR2 agonist activity for FPR1/FPR2 agonists and non-active compounds (training set) and their atom pairs used in binary classification tree analysis
2.3. Classification tree analysis
Although the LDA model is good in terms of fitting and prediction, its use in practice is based on calculation of functions (1
) and is difficult to interpret. It would be much better to find simple SAR rules which are intuitively understandable and expressed in natural “chemical” language. In recent studies, we exploited a binary classification tree approach to build logical SAR algorithms based on atom pairs for low-molecular weight inhibitors of human neutrophil elastase [22
] and non-peptide inducers of TNF-α production [23
]. In the present paper, we report the first application of the classification tree methodology based on atom pairs to SAR analysis of compounds with varying scaffolds. Indeed, a multi-scaffold training set would produce a SAR model that is more useful for subsequent virtual screening of potential FPR agonists.
The 9 descriptors involved in the best LDA model (Functions 1
) were used as the starting variable set for the classification tree algorithm [30
] implemented in STATISTICA 6.0. Deriving an optimal tree with cross-validation criteria represented the final stage of our step-by-step variable selection and resulted in a six-branched tree (). Thus, 720 of the initial 726 descriptors were filtered out during the selection steps, and the remaining six atom pairs formed a basis for the formulation of simple, logical SAR. Despite the small number of retained variables, the model is characterized by good fitting (i.e., it correctly classified most of the compounds from the training set). shows that 67 of the 71 compounds (94.4%) were accurately recognized by the tree with respect to their activity classes. Three non-active compounds were misclassified as FPR2, while one FPR2 agonist was misclassified as non-active. Detailed information regarding the retained descriptors and terminal nodes of the tree responsible for classification of each compound are shown in . The low cross-validation cost (0.141) obtained for the model demonstrates the powerful predictive ability of the classification tree (see also the results below for applying this model to an external data set).
Figure 4 Optimal classification tree for splitting compounds into activity classes. The number of compounds that entered each node is indicated. Terminal nodes correspond to the three activity classes: FPR1-specific agonists, FPR2-specific agonists, or non-active (more ...)
Chemical substructures associated with branches in the classification tree are illustrated in . The first branch evaluates molecules for the presence of two or more CA_12_O2 atom pairs. These atom pairs occur when bicoordinated ether oxygen atoms and a benzene ring are separated by 10 or 11 chemical bonds (see examples in ). Molecules containing two or more of these atom pairs move to the right branch of the tree (24 of the 71 compounds evaluated), whereas molecules with less than two CA_12_O2 atom pairs moved to the left branch (47 of the 71 compounds) (see ).
Figure 5 Examples of atom pair descriptors and their occurrences in structures of selected compounds under investigation. The indicated atom pairs are highlighted in red. Notation of atom types: CA – aromatic carbon; C3 – olefin-type or imino carbon; (more ...)
The first branch on the left evaluates molecules for the presence of BR_7_O1 atom pairs. BR_7_O1 and BR_6_CO are mutually correlated because they are found in compounds with a bromine atom and a carbonyl oxygen separated by 7 bonds (i.e., the topological distance of 6 chemical bonds falls between bromine and the carbonyl carbon atom) (see examples in ). At this branch, molecules with one or more of these atom pairs are designated as FPR2 agonists. Otherwise, they are sent to the next branch associated with the C3_9_CA descriptor. According to the split condition, a compound is designated as an FPR2 agonist in terminal node 9 if it contains five or more C3_9_CA atom pairs. This occurs when several C3-type carbons (MM+ force field notation for sp2-hybridized carbon of non-benzene character) are present with simultaneous presence of an aromatic ring on the opposite side of a molecule (see examples in ). In some molecules, the number of C3_9_CA atom pairs does not exceed 4, despite the occurrence of atoms with C3 and CA types (e.g., see C-14x in ).
The final branch on the left evaluates remaining molecules for the presence of the N2_3_O1 atom pair, as well as the correlated CO_2_N2 atom pair, which falls within the N2_3_O1 structure (see ). This branch is important for correct classification of FPR2 agonists AG-09/5 and AG-09/8 from the training set, since these compounds contain two N2_3_O1 atom pairs and pass to terminal node 13, while the other compounds with less than two N2_3_O1 atom pairs are classified as non-active in terminal node 12 (see ). Although it might be suggested that this split is not important and could be removed from the tree without significant loss of classification quality, its removal caused a noticeable increase in cross-validation cost from 0.141 to 0.254 for the truncated tree. Thus, such a simplification of the model is not statistically warranted. Moreover, chemical features associated with the N2_3_O1 atom pair were also important for correct classification of several compounds from the test set (see below).
Molecules with two or more CA_12_O2 atom pairs moved to the right branch of the classification tree. The first node of the right branch evaluates molecules for the presence of the C4_6_C4 descriptor, which corresponds to tetrahedral sp3-carbons separated by 6 bonds. Atom pairs of this type can be found in compounds with various scaffolds and is originated by two saturated hydrocarbon moieties located moderate distances from each other (examples are shown in ). The final node in this branch evaluates molecules for the presence of the BR_6_C4 atom pair. This substructure is present in five non-active compounds, three of which contain an m-bromophenyl-acetamide fragment (C-14b, AG-09/25, and AG-09/72; see ) and two others containing an N-alkyl-p-bromoaniline substructure (C-14r and C-23). These features can be used in the formulation of “chemical” rules for SAR analysis. For example, movement of bromine from the meta to the para position of the aromatic ring in a bromo-substituted phenyl-acetamide moiety transformed the non-active C-14b into the FPR1 agonist C-17b.
Atom pairs from the clusters of correlated variables (, ) did not dominate at the nodes of the classification tree, and only N2_3_O1 and BR_7_O1 were involved in the split rules. Additionally, large clusters produced by entire scaffolds did not participate at all in the classification tree. Thus, the classification process does not appear to be biased by large chemical substructures and, therefore, would be useful for evaluation of molecules with various types of chemical scaffolds.
The best approach to validate SAR and QSAR models is to apply them to an independent series of compounds. For this purpose, we evaluated a test set consisting of 17 FPR2-specific or mixed FPR1/FPR2 agonists (). A matrix of atom pairs was generated using CHAIN program, and six columns of the matrix which correspond to the descriptors important for SAR analysis were taken into account. Values of the 6 descriptors important for SAR analysis descriptors used in the classification tree are shown in along with the classification results obtained using the binary tree and algorithm from Scheme 1. FPR2-specifc agonists B-25
, and B-42
were correctly predicted as having FPR2 activity, while most of the mixed-type compounds were classified as either FPR1 (AG-09/9
, and C-14n
) or FPR2 (AG-22
, fMLF, and WKYMVm) agonists. Two members of test set (AG-09/10
) were misclassified as non-active. Note, however, that FPR1 agonist 1910-5441
has relatively lower activity (EC50
~20 μM) [8
] than the other agonists used in our computational SAR analyses. Although oligopeptides were not included in the training set, the peptides fMLF and WKYMVm from the test set were classified correctly as active compounds. Note that these two peptides possess common fragments, e.g. benzyl and 2-methylthioethyl groups. The recognition of molecules by FPRs can also be strongly determined by configuration of chiral centers; however, our atom pair approach does not currently account for molecular chirality and would require introduction of these variables as additional descriptors.
Experimentally determined and predicted classes of FPR1/FPR2 agonists from the test set and their atom pairs used in binary classification tree analysis
Our simplified SAR model is based on agonists of non-mixed (i.e., “pure” FPR1 or FPR2 agonists), while the test set contained mostly mixed-type compounds. Such a situation was conditioned by the absence of a substantial number of small molecule receptor-specific agonists with relatively high activity reported in the literature. Thus, the aim of this test set was primarily in evaluation of the model for its ability to distinguish active and inactive compounds. Obviously, further discovery of novel specific FPR1 and FPR2 agonists will allow us to expand both training and test sets in order to derive a model with enhanced predictive ability based on atom pair descriptors. On the other hand, we can predict scaffold “specific affinity” of mixed agonists for either FPR1 or FPR2 using this model. For example, mixed-type agonists C-14a
, and C-14n
] and AG-09/17
, and AG-09/22
] were classified by the model as FPR1 agonists, indicating that 4-benzylpyridazin-3-one and 2-(benzimidazol-2-ylthio)-N-phenylacetamide scaffolds have higher affinity for FPR1. By comparison, the pyrazolone scaffold (agonists B-25
, and B-43
) had higher affinity for FPR2. Note, however, that this feature requires the presence of specific agonists with such scaffolds in the training set.