Eur J Med Chem. Author manuscript; available in PMC 2011 November 1.
Published in final edited form as:
PMCID: PMC2953788

Computational Structure-activity Relationship Analysis of Small-Molecule Agonists for Human Formyl Peptide Receptors


N-formyl peptide receptors (FPR) are important in host defense. Because of the potential for FPRs as therapeutic targets, recent efforts have focused on identification of non-peptide agonists for two FPR subtypes, FPR1 and FPR2. Given that a number of specific small molecule agonists have recently been identified, we hypothesized that computational structure-activity relationship (SAR) analysis of these molecules could provide new information regarding molecular features required for activity. We used a training set of 71 compounds, including 10 FPR1-specific agonists, 36 FPR2-specific agonists, and 25 non-active analogs. A sequence of (1) one-way analysis of variance selection, (2) cluster analysis, (3) linear discriminant analysis, and (4) classification tree analysis led to the derivation of SAR rules with high (95.8%) accuracy for correct classification of compounds. These SAR rules revealed key features distinguishing FPR1 versus FPR2 agonists. To verify predictive ability, we evaluated a test set of 17 additional FPR agonists, and found that the majority of these agonists (>94%) were classified correctly as agonists. This study represents the first successful application of classification tree methodology based on atom pairs to SAR analysis of FPR agonists. Importantly, these SAR rules represent a relatively simple classification approach for virtual screening of FPR1/FPR2 agonists.

Keywords: Formyl peptide receptor (FPR), FPR agonists, Atom pairs, Molecular descriptors, Structure-activity relationship analysis

1. Introduction

N-formyl peptides activate phagocytes through G protein-coupled receptors known as formyl peptide receptors (FPR) [1]. FPR1 was the first FPR cloned and encodes a high-affinity receptor for fMLF [2]. Subsequently, it was found that two additional FPRs exist in humans, and these were originally designated as FPR-like 1 (FPRL1; 69% identity to FPR1) and FPR-like 2 (FPRL2; 56% identity to FPR1) [3–5]. Recently, the FPR nomenclature has been revised such that FPRL1 and FPRL2 are now designated as FPR2 and FPR3, respectively [6].

Compared to FPR1, FPR2 exhibits a high level of ligand promiscuity and is activated by numerous chemically unrelated ligands, including synthetic peptides, pathogen-derived peptides, host-derived peptides, and lipids (reviewed in [6]). In addition to natural peptides and endogenous arachidonic acid metabolites, novel synthetic peptides and several small-molecule non-peptide agonists of FPR1 and FPR2 have recently been reported [7–11]. Indeed, the identification and development of small-molecule ligands represents an ideal approach to analyze FPR structure and function, since such molecules are well defined and can be easily modified for structure-activity relationship (SAR) analysis. Small-molecule agonists can also have advantages over peptides or proteins as potential therapeutics, and they can provide a basis for construction of useful pharmacophore models of FPR1/FPR2 agonists.

Our analysis of novel small-molecule agonists of FPRs showed that individual ring substituents had a significant impact on FPR1/FPR2 agonist activity [12], suggesting that further structure–activity analysis of known FPR agonists could lead to optimization of these lead compounds and identification of improved agonists. In fact, SAR and quantitative SAR (QSAR) models have been instrumental in understanding the molecular mechanism of action of receptor agonists and antagonists, directing their design, and facilitating virtual screening [13–15].

To date, non-computational SAR analysis has been performed for FPR2 agonists with a benzimidazole scaffold [16], FPR1/FPR2 agonists with a pyridazin-3(2H)-one scaffold [17], and pyrazolone-derived FPR2 agonists [9]; however, there are currently no reported computational SAR models for non-peptide FPR1/FPR2 agonists.

Here, we performed computational SAR analysis of a large group of FPR1/FPR2 agonists and their non-active analogs. The SAR rules obtained from classification tree analysis, which was based on only six atom pair descriptors, revealed key features that distinguished FPR1 and FPR2 agonists. These studies provide a basis for further virtual screening of FPR1/FPR2 agonists and also provide clues to the molecular features required for agonist activity. This is the first application of an atom pair-based approach to a set of FPR agonists with various scaffolds and their analogs.

2. Results and discussion

2.1. Atom pairs and their one-way ANOVA selection

While a variety of molecular parameters can be used in computational methods for (Q)SAR analysis [18,19], some of these parameters are complex physicochemical or geometrical 3D descriptors whose calculation is complicated by molecular flexibility and the need for adequate sampling of conformational space. Conversely, topological indices, or 2D descriptors, obtainable from the structural formula of a compound are very attractive because of their simplicity. A reasonable compromise between ease of interpretation and ease of computation was reported by Carhart et al. [20], who introduced atom pair descriptors as features of the environments of all atoms in the 2D representation of a chemical structure. This approach has been widely used in the context of fragment-based similarity searches and database mining [21–25]. Here, we applied atom pair descriptors to represent the selected molecules. The use of an atom type naming scheme from the MM+ force field, as implemented in HyperChem software, is a significant feature of our approach, as it represents more sophisticated atom typing than that used in previous studies [20]. This scheme assigns a specific type name to each atom, depending on its surroundings in the molecule.

The 2D structures of 71 compounds (training set) that included 10 FPR1-specific agonists, 36 FPR2-specific agonists, and 25 non-active compounds (Table 1, see Materials and Methods), represented as a set of HIN files, were used by our CHAIN program to generate a table of atom pairs. This program, which finds all possible paths between labeled atoms in a molecular structure, identified 726 unique atom pairs among the 71 compounds. Hence, a matrix consisting of 71 rows and 726 columns was generated, with each row containing the number of times each atom pair was present in a given molecule. Since this number of columns is too large for SAR analysis, we performed a step-by-step selection of atom pairs to reduce the matrix size. This sequence of steps included one-way ANOVA, cluster analysis, linear discriminant analysis, and binary classification tree analysis. These methodologies are well suited to variable pre-screening in SAR [26–28] and were also used successfully in our previous studies [22,23].

Table 1
Structure and receptor specificity of compounds under investigation

For the first selection step, we applied one-way analysis of variance (ANOVA) [29] to select descriptors with significant differences between total and within-class variances. As a result of ANOVA selection, 565 descriptors were filtered out, while the remaining 161 significant atom pairs were retained for further analysis. It is reasonable to compare the distribution of initial and ANOVA-selected descriptors in terms of bond separation (i.e., the number of bonds between atoms in a given atom pair). It should be noted that the relative distribution of initial and ANOVA-selected descriptors, in terms of bond separation, was similar (Figure 1). Although a greater proportion of the original atom pairs was retained after ANOVA selection for pairs with the largest bond separation (i.e., 17–18 bonds), these atom pairs are rare in the data set and have negligible statistical impact on the total SAR analysis. Since the relative distribution of atom pairs by bond separation was not changed after ANOVA, it appears that molecular shape peculiarities do not have a major influence on biological activity of these FPR agonists. In contrast, the dumb-bell shape common to inducers of macrophage tumor necrosis factor-α production led to a significantly higher fraction of “longer” atom pairs among ANOVA-selected descriptors [23].
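The ANOVA filtering step can be illustrated with a minimal Python sketch. This is not the STATISTICA procedure used in the study: the function names, the toy count matrix, and the critical value are assumptions made for illustration, and the filter compares the F statistic against a tabulated critical value instead of computing a p-value.

```python
def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of groups of values."""
    all_values = [v for g in groups for v in g]
    n_total = len(all_values)
    k = len(groups)
    grand_mean = sum(all_values) / n_total

    # Between-class and within-class sums of squares.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    if ms_within == 0:
        # Degenerate column: no spread inside any class.
        return float("inf") if ms_between > 0 else 0.0
    return ms_between / ms_within


def select_descriptors(matrix, classes, f_crit):
    """Keep the descriptor columns whose F statistic exceeds f_crit.

    matrix:  dict mapping descriptor name -> list of counts, one per compound
    classes: list of class labels, one per compound
    f_crit:  critical F value for the chosen significance level (assumed input)
    """
    selected = []
    for name, column in matrix.items():
        groups = {}
        for value, label in zip(column, classes):
            groups.setdefault(label, []).append(value)
        if one_way_anova_f(list(groups.values())) > f_crit:
            selected.append(name)
    return selected


# Toy data (invented for illustration): 9 compounds, 3 per class.
classes = ["FPR1"] * 3 + ["FPR2"] * 3 + ["NA"] * 3
matrix = {
    "CA_12_O2": [2, 3, 2, 0, 0, 1, 0, 1, 0],  # differs between classes
    "C4_1_C4": [1, 2, 3, 1, 2, 3, 1, 2, 3],   # same distribution in each class
}
# 5.14 is the tabulated F critical value for p = 0.05 with (2, 6) degrees of freedom.
selected = select_descriptors(matrix, classes, f_crit=5.14)
```

In this toy example only the first column passes the filter. For the real 71×726 matrix the degrees of freedom would be (2, 68), so a different critical value applies.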

Figure 1
Comparison of initial and ANOVA-selected atom pairs in compounds 1-71. The numbers are shown for each of the indicated bond separations initially generated for 71 compounds from Table 1 (light bars). Atom pairs subsequently selected by ANOVA as having ...

2.2. Cluster and linear discriminant analyses

The second step of variable selection consisted of finding clusters of highly correlated descriptors. Subsequently a single variable from one cluster can be regarded as independent, whereas the other dependent descriptors of such a cluster can be excluded from further calculations. Using the 161×161 matrix of correlation coefficients for atom pairs selected by ANOVA, we chose 28 clusters of variables (Table 2). Each variable is highly correlated (r≥0.9) with at least one variable from the same cluster. Descriptors with longer bond separations were taken as representative variables (shown in bold italic in Table 2). Hence, instead of 102 atom pairs included in Table 2, only 28 atom pairs were retained. These variables were combined with the 59 remaining atom pairs that were not highly correlated, resulting in a set of 87 atom pair descriptors selected for further calculations.

Table 2
Clusters of highly correlated atom pairs (r≥0.9)

The high correlation coefficient between values of descriptors implies that these atom pairs are simultaneously present in most compounds in the data set. This usually occurs when atom pairs are produced by certain molecular features or scaffolds. The features may be very simple. For example, Cluster 14 descriptor BR_6_CO corresponds to the presence of a bromine atom and carbonyl carbon separated by 6 bonds. The same substructure also contains another atom pair from this cluster (BR_7_O1) corresponding to a carbonyl oxygen and bromine separated by 7 chemical bonds. More populated sets of correlated descriptors are produced by multi-atomic scaffolds. For instance, 11 descriptors of Cluster 2 originate from the 2-(benzimidazol-2-ylsulfanyl)-N-phenylacetamide scaffold common to compound AG-09/25 and its analogs. Thus, clustering atom pairs according to their mutual correlation not only decreases the number of variables but also provides a rational way to interpret SAR results in terms of chemical features and building blocks, which are much more complex than the atom pairs themselves. It should be noted that lowering the correlation coefficient threshold from 0.9 to 0.8 gave rather heterogeneous clusters of correlated descriptors usually not associated with distinct chemical substructures. Although the composition of several clusters remained the same, some were condensed to larger clusters at r≥0.8 by inclusion of additional atom pair descriptors (Figure 2). On the other hand, the adopted threshold of 0.9 for correlation between a given variable and at least one variable from the same cluster provides high mutual correlation of all variables in this cluster. For example, each pair of descriptors among the 13 variables of Cluster 1 (Table 2) is characterized by an r value greater than 0.85.
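The clustering criterion described above (each variable correlates at r≥0.9 with at least one other member of its cluster) corresponds to finding connected components in a graph of highly correlated descriptors. The following Python sketch shows one way this could be done; the function names and the toy columns are assumptions made for illustration, not the procedure or data used in the study.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no correlation information
    return sxy / sqrt(sxx * syy)


def correlation_clusters(columns, threshold=0.9):
    """Group descriptors linked, directly or through a chain, by r >= threshold."""
    names = list(columns)
    parent = {n: n for n in names}

    def find(n):
        # Union-find lookup with path compression.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if pearson(columns[a], columns[b]) >= threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    # Descriptors that are not highly correlated with anything stay unclustered.
    return [members for members in clusters.values() if len(members) > 1]


def representative(cluster):
    """Pick the member with the longest bond separation (the D in T1_D_T2)."""
    return max(cluster, key=lambda name: int(name.split("_")[1]))


# Toy columns (invented): occurrence counts in five hypothetical compounds.
columns = {
    "BR_6_CO": [1, 0, 0, 1, 0],
    "BR_7_O1": [1, 0, 0, 1, 0],
    "C4_6_C4": [0, 2, 1, 0, 3],
}
clusters = correlation_clusters(columns)
kept = [representative(c) for c in clusters]
```

Because membership requires a high correlation with only one other member, two descriptors in the same cluster can have a pairwise r below the threshold, which matches the observation that pairs within Cluster 1 have r values above 0.85 but not necessarily above 0.9.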

Figure 2
Schematic representation of clusters obtained at different correlation coefficient thresholds. Values in black circles correspond to the enumeration of clusters at r≥0.9 (Table 2). Red circles show clusters obtained at r≥0.8. The number ...

The 87 atom pairs selected after the two steps described above were used as an input variable set for linear discriminant analysis (LDA). The LDA procedure was applied with the option of “forward stepwise” inclusion of variables, as implemented in STATISTICA 6.0 software. The descriptors were added to the model if their inclusion led to a significant improvement in classification (p<0.05). We found that 17 of the 87 atom pairs were sufficient for good LDA classification of agonists, with 68 of the 71 compounds (95.8%) classified correctly as FPR1, FPR2, or NA (Figure 3A).

Figure 3
Classification results of linear discriminant analysis (LDA) (Panel A) and binary classification tree analysis (Panel B) versus experimental classes of compounds investigated. The LDA was based on either 17 or 9 atom pairs from the best subset, and binary ...

The LDA model with 17 atom pairs derived on the third step of variable selection was further simplified after an additional run of LDA with the “best subset search” option. The number of atom pair descriptors was decreased from 17 to 9 without loss of quality of the model (accuracy was the same using either 17 or 9 descriptors). This relatively simple LDA model obtained on the fourth step of variable selection can be expressed by the following three classification functions:




The number of corresponding atom pairs in a given compound should be used as values of descriptors for calculation of functions (1–3). One of the classes (FPR1, FPR2, or NA) is then attributed to the compound according to the maximum value among these functions.
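The assignment rule (evaluate one linear function per class and take the class with the largest value) can be sketched in Python as follows. The intercepts and weights below are PLACEHOLDERS invented for illustration; they are not the coefficients of the classification functions derived in this study, and only two of the nine descriptors are shown.

```python
# PLACEHOLDER coefficients, invented for illustration only.
PLACEHOLDER_FUNCTIONS = {
    "FPR1": {"intercept": -2.0, "weights": {"CA_12_O2": 1.5, "BR_7_O1": -1.0}},
    "FPR2": {"intercept": -1.0, "weights": {"CA_12_O2": 0.2, "BR_7_O1": 2.0}},
    "NA": {"intercept": 0.0, "weights": {"CA_12_O2": 0.1, "BR_7_O1": 0.1}},
}


def classify_by_lda(counts, functions=PLACEHOLDER_FUNCTIONS):
    """Return the class whose linear classification function is largest.

    counts: dict mapping atom pair name -> number of occurrences in the compound
    """
    scores = {}
    for cls, f in functions.items():
        # Linear function: intercept plus weighted atom pair counts.
        scores[cls] = f["intercept"] + sum(
            w * counts.get(pair, 0) for pair, w in f["weights"].items()
        )
    best = max(scores, key=scores.get)
    return best, scores


label, scores = classify_by_lda({"CA_12_O2": 3})
```

With real coefficients, the same function would reproduce the class assignment described above for any compound whose atom pair counts are known.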

Table 3 contains calculated classes and results of leave-one-out (LOO) prediction for the entire series of compounds. All three compounds with incorrectly calculated classes (AG-09/41, AG-09/95, and AG-09/102) were inactive, while the LDA model (1–3) and LOO prediction classified them as having FPR2 activity. LOO cross-validation correctly classified 63 of 71 compounds (88.7% accuracy). This can be considered good predictive quality, taking into account that the model with 9 descriptors was derived from 71 molecules in the training set. For the subset of FPR1 agonists, the fraction of correct LOO predictions was expectedly lower (70%) because of the relatively small number of compounds with FPR1 activity included in the series under investigation. This reflects the low number of non-peptide FPR1-specific agonists reported in the literature.
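The leave-one-out procedure can be sketched generically in Python. For brevity this sketch swaps in a nearest-centroid classifier in place of LDA, so it illustrates the cross-validation loop only; the function names and the toy data are assumptions made for illustration.

```python
def fit_centroids(rows, labels):
    """Compute the mean descriptor vector of each class."""
    sums, counts = {}, {}
    for row, lab in zip(rows, labels):
        acc = sums.setdefault(lab, [0.0] * len(row))
        for k, v in enumerate(row):
            acc[k] += v
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def predict(centroids, row):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    def sqdist(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(centroids, key=lambda lab: sqdist(centroids[lab]))


def leave_one_out_accuracy(rows, labels):
    """Refit the model n times, each time predicting the one held-out compound."""
    correct = 0
    for i in range(len(rows)):
        train_rows = rows[:i] + rows[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        model = fit_centroids(train_rows, train_labels)
        if predict(model, rows[i]) == labels[i]:
            correct += 1
    return correct / len(rows)


# Toy data (invented): two descriptor counts per compound.
rows = [[3, 0], [4, 0], [0, 1], [0, 2], [0, 0], [1, 0]]
labels = ["FPR1", "FPR1", "FPR2", "FPR2", "NA", "NA"]
toy_accuracy = leave_one_out_accuracy(rows, labels)

# The accuracy reported in the text follows from the same ratio: 63 of 71.
reported_loo_percent = round(100 * 63 / 71, 1)
```

Because each compound is predicted by a model that never saw it, the LOO accuracy (88.7%) is expected to be lower than the fitting accuracy (95.8%).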

Table 3
Experimentally determined, SAR-calculated, and LOO-predicted classes of FPR1/FPR2 agonist activity for FPR1/FPR2 agonists and non-active compounds (training set) and their atom pairs used in binary classification tree analysis

2.3. Classification tree analysis

Although the LDA model is good in terms of fitting and prediction, its practical use requires calculation of functions (1–3), and the model is difficult to interpret. It would be preferable to find simple SAR rules that are intuitively understandable and expressed in natural “chemical” language. In recent studies, we exploited a binary classification tree approach to build logical SAR algorithms based on atom pairs for low-molecular weight inhibitors of human neutrophil elastase [22] and non-peptide inducers of TNF-α production [23]. In the present paper, we report the first application of the classification tree methodology based on atom pairs to SAR analysis of compounds with varying scaffolds. Indeed, a multi-scaffold training set would produce a SAR model that is more useful for subsequent virtual screening of potential FPR agonists.

The 9 descriptors involved in the best LDA model (Functions 1–3) were used as the starting variable set for the classification tree algorithm [30] implemented in STATISTICA 6.0. Deriving an optimal tree with cross-validation criteria represented the final stage of our step-by-step variable selection and resulted in a six-branched tree (Figure 4). Thus, 720 of the initial 726 descriptors were filtered out during the selection steps, and the remaining six atom pairs formed a basis for the formulation of simple, logical SAR. Despite the small number of retained variables, the model is characterized by good fitting (i.e., it correctly classified most of the compounds from the training set). Figure 3B shows that 67 of the 71 compounds (94.4%) were accurately recognized by the tree with respect to their activity classes. Three non-active compounds were misclassified as FPR2, while one FPR2 agonist was misclassified as non-active. Detailed information regarding the retained descriptors and terminal nodes of the tree responsible for classification of each compound are shown in Table 3. The low cross-validation cost (0.141) obtained for the model demonstrates the powerful predictive ability of the classification tree (see also the results below for applying this model to an external data set).

Figure 4
Optimal classification tree for splitting compounds into activity classes. The number of compounds that entered each node is indicated. Terminal nodes correspond to the three activity classes: FPR1-specific agonists, FPR2-specific agonists, or non-active ...

Chemical substructures associated with branches in the classification tree are illustrated in Figure 5. The first branch evaluates molecules for the presence of two or more CA_12_O2 atom pairs. These atom pairs occur when bicoordinated ether oxygen atoms and a benzene ring are separated by 10 or 11 chemical bonds (see examples in Figure 5A and 5B). Molecules containing two or more of these atom pairs move to the right branch of the tree (24 of the 71 compounds evaluated), whereas molecules with less than two CA_12_O2 atom pairs moved to the left branch (47 of the 71 compounds) (see Figure 4).

Figure 5
Examples of atom pair descriptors and their occurrences in structures of selected compounds under investigation. The indicated atom pairs are highlighted in red. Notation of atom types: CA – aromatic carbon; C3 – olefin-type or imino carbon; ...

The first branch on the left evaluates molecules for the presence of BR_7_O1 atom pairs. BR_7_O1 and BR_6_CO are mutually correlated because they are found in compounds with a bromine atom and a carbonyl oxygen separated by 7 bonds (i.e., the topological distance of 6 chemical bonds falls between bromine and the carbonyl carbon atom) (see examples in Figure 5C and 5D). At this branch, molecules with one or more of these atom pairs are designated as FPR2 agonists. Otherwise, they are sent to the next branch associated with the C3_9_CA descriptor. According to the split condition, a compound is designated as an FPR2 agonist in terminal node 9 if it contains five or more C3_9_CA atom pairs. This occurs when several C3-type carbons (MM+ force field notation for sp2-hybridized carbon of non-benzene character) are present with simultaneous presence of an aromatic ring on the opposite side of a molecule (see examples in Figure 5E and 5F). In some molecules, the number of C3_9_CA atom pairs does not exceed 4, despite the occurrence of atoms with C3 and CA types (e.g., see C-14x in Figure 5C).

The final branch on the left evaluates remaining molecules for the presence of the N2_3_O1 atom pair, as well as the correlated CO_2_N2 atom pair, which falls within the N2_3_O1 structure (see Figure 5G and 5H). This branch is important for correct classification of FPR2 agonists AG-09/5 and AG-09/8 from the training set, since these compounds contain two N2_3_O1 atom pairs and pass to terminal node 13, while the other compounds with less than two N2_3_O1 atom pairs are classified as non-active in terminal node 12 (see Figure 4). Although it might be suggested that this split is not important and could be removed from the tree without significant loss of classification quality, its removal caused a noticeable increase in cross-validation cost from 0.141 to 0.254 for the truncated tree. Thus, such a simplification of the model is not statistically warranted. Moreover, chemical features associated with the N2_3_O1 atom pair were also important for correct classification of several compounds from the test set (see below).

Molecules with two or more CA_12_O2 atom pairs moved to the right branch of the classification tree. The first node of the right branch evaluates molecules for the presence of the C4_6_C4 descriptor, which corresponds to tetrahedral sp3-carbons separated by 6 bonds. Atom pairs of this type can be found in compounds with various scaffolds and arise from two saturated hydrocarbon moieties located at a moderate distance from each other (examples are shown in Figure 5I and 5J). The final node in this branch evaluates molecules for the presence of the BR_6_C4 atom pair. This substructure is present in five non-active compounds, three of which contain an m-bromophenyl-acetamide fragment (C-14b, AG-09/25, and AG-09/72; see Figure 5K and 5L), while the other two contain an N-alkyl-p-bromoaniline substructure (C-14r and C-23). These features can be used in the formulation of “chemical” rules for SAR analysis. For example, movement of bromine from the meta to the para position of the aromatic ring in a bromo-substituted phenyl-acetamide moiety transformed the non-active C-14b into the FPR1 agonist C-17b.

Atom pairs from the clusters of correlated variables (Table 2, Figure 2) did not dominate at the nodes of the classification tree, and only N2_3_O1 and BR_7_O1 were involved in the split rules. Additionally, large clusters produced by entire scaffolds did not participate at all in the classification tree. Thus, the classification process does not appear to be biased by large chemical substructures and, therefore, would be useful for evaluation of molecules with various types of chemical scaffolds.
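The split rules described in this section can be written as a short decision function. The Python sketch below follows the thresholds stated in the text (two or more CA_12_O2, one or more BR_7_O1, five or more C3_9_CA, two or more N2_3_O1); the thresholds for C4_6_C4 and BR_6_C4 are given in the text only as "presence", so the value of one or more used here is an assumption, as is the function name.

```python
def classify_by_tree(counts):
    """Classify a compound as 'FPR1', 'FPR2', or 'NA' from six atom pair counts.

    counts: dict mapping atom pair name -> number of occurrences in the compound;
            missing atom pairs are treated as zero.
    """
    def n(pair):
        return counts.get(pair, 0)

    if n("CA_12_O2") >= 2:
        # Right branch of the tree.
        if n("C4_6_C4") >= 1:   # assumed threshold: any occurrence
            return "FPR2"
        if n("BR_6_C4") >= 1:   # assumed threshold: any occurrence
            return "NA"
        return "FPR1"

    # Left branch of the tree.
    if n("BR_7_O1") >= 1:
        return "FPR2"
    if n("C3_9_CA") >= 5:
        return "FPR2"
    if n("N2_3_O1") >= 2:
        return "FPR2"
    return "NA"


# Hypothetical compound with two CA_12_O2 pairs and none of the other descriptors.
example_class = classify_by_tree({"CA_12_O2": 2})
```

Applied to the atom pair counts of a candidate compound, such a function performs the whole classification with at most four comparisons, which is what makes the tree attractive for screening large databases.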

The best approach to validate SAR and QSAR models is to apply them to an independent series of compounds. For this purpose, we evaluated a test set consisting of 17 FPR2-specific or mixed FPR1/FPR2 agonists (Table 4). A matrix of atom pairs was generated using the CHAIN program, and the six columns of the matrix corresponding to the descriptors used in the classification tree were taken into account. Values of these six descriptors are shown in Table 4 along with the classification results obtained using the binary tree and algorithm from Scheme 1. FPR2-specific agonists B-25, B-35, and B-42 were correctly predicted as having FPR2 activity, while most of the mixed-type compounds were classified as either FPR1 (AG-09/9, AG-09/17, AG-09/20, AG-09/22, C-14a, C-14e, C-14h, and C-14n) or FPR2 (AG-22, B-25, B-35, B-42, fMLF, and WKYMVm) agonists. Two members of the test set (AG-09/10 and 1910-5441) were misclassified as non-active. Note, however, that FPR1 agonist 1910-5441 has relatively lower activity (EC50 ~20 μM) [8] than the other agonists used in our computational SAR analyses. Although oligopeptides were not included in the training set, the peptides fMLF and WKYMVm from the test set were classified correctly as active compounds. Note that these two peptides possess common fragments, e.g., benzyl and 2-methylthioethyl groups. The recognition of molecules by FPRs can also be strongly determined by configuration of chiral centers; however, our atom pair approach does not currently account for molecular chirality and would require introduction of these variables as additional descriptors.

Table 4
Experimentally determined and predicted classes of FPR1/FPR2 agonists from the test set and their atom pairs used in binary classification tree analysis

Our simplified SAR model is based on non-mixed (i.e., “pure” FPR1 or FPR2) agonists, while the test set contained mostly mixed-type compounds. This situation reflects the absence of a substantial number of small-molecule receptor-specific agonists with relatively high activity reported in the literature. Thus, the aim of this test set was primarily to evaluate the ability of the model to distinguish active and inactive compounds. Further discovery of novel specific FPR1 and FPR2 agonists will allow us to expand both training and test sets in order to derive a model with enhanced predictive ability based on atom pair descriptors. On the other hand, we can predict scaffold “specific affinity” of mixed agonists for either FPR1 or FPR2 using this model. For example, mixed-type agonists C-14a, C-14e, C-14h, and C-14n [17] and AG-09/17, AG-09/20, and AG-09/22 [12] were classified by the model as FPR1 agonists, indicating that 4-benzylpyridazin-3-one and 2-(benzimidazol-2-ylthio)-N-phenylacetamide scaffolds have higher affinity for FPR1. By comparison, the pyrazolone scaffold (agonists B-25, B-35, B-42, and B-43) had higher affinity for FPR2. Note, however, that this feature requires the presence of specific agonists with such scaffolds in the training set.

3. Conclusion

Previously, high-throughput screening was used to select unique non-peptide agonists of FPR1 and FPR2 [7–12,16,17]. In the present study, we utilized atom pair descriptors for computational SAR analysis of the most active FPR1/FPR2 agonists to further define the features of these molecules critical for agonist activity and to develop a simple but accurate SAR model for predicting biological activity in future compound screening.

A sequence of ANOVA, cluster analysis, LDA, and classification tree analysis based on the atom pair descriptors led to the derivation of simple SAR rules, despite the large number of starting variables. The SAR rules obtained from classification tree analysis, which was based on only six atom pair descriptors, revealed that the FPR1 agonists in the series investigated could be characterized by simultaneously satisfying the following conditions: a) presence of a benzene ring and two-coordinated oxygen atoms separated by 10 or 11 bonds (Condition A); b) absence of sp3-carbon atoms separated by 6 bonds; and c) absence of m-bromophenyl-acetamide and N-alkyl-p-bromoaniline substructures. A compound can be classified as an FPR2 agonist if it does not match Condition A and satisfies one of the following statements: a) it contains a bromine atom and a carbonyl oxygen separated by 7 bonds; b) it contains at least three non-benzene sp2-carbons separated by 6 to 9 bonds from benzene ring(s), with at least two of these carbons separated by 7 or 8 bonds from benzene ring(s); or c) it contains at least two N-acylhydrazine or α-aminoketone substructures. Another type of FPR2 agonist satisfies Condition A and contains sp3-carbon atoms separated by 6 bonds. To assess the predictive ability of the method, we evaluated a test set of 17 FPR agonists. Most, including the two peptides fMLF and WKYMVm, were classified by the derived rules as active agonists. Thus, we provide here the first successful application of the classification tree methodology based on atom pairs for SAR analysis of FPR agonists with various scaffolds.

Good quality and high predictive ability of the SAR model, as well as simplicity and rapidity of calculations associated with the binary tree algorithm, suggest promise in using the classification tree model for large database mining and virtual screening of FPR agonists.

4. Experimental

4.1. Data sets

Training Set

The training set of 71 compounds contained 10 FPR1-specific agonists, 36 FPR2-specific agonists, and 25 non-active compounds, including 2-(benzimidazol-2-ylthio)-N-phenylacetamides, N-phenethyl-N′-phenylureas, piperazines, acetohydrazides, 1-(2-indolylcarbonyl)-4-(1-benzimidazolyl)piperidines, 4-benzylpyridazin-3-ones, arylcarboxylic acid hydrazides, 1-(2-indolylcarbonyl)-3-(1-benzimidazolyl)pyrrolidines and related scaffolds [7–11]. All selected agonists had EC50 values in the low micromolar range for their ability to induce intracellular calcium mobilization in cells (RBL-2H3 or HL-60) transfected with human FPR1 or FPR2. All active compounds were evaluated in wild-type cells to verify that the agonists were inactive in non-transfected cells. Names of compounds and their experimentally determined classes are shown in Table 3 (1st and 2nd columns). Structures of all training set compounds are shown in Table 1.

Test Set

The test set consisted of 17 compounds, including 11 compounds with mixed FPR1/FPR2 agonist activity [AG-22, AG-09/9, AG-09/10, AG-09/17, AG-09/20, AG-09/22, C-14a, C-14e, C-14h, C-14n [12], and B-43 [31]] and 3 FPR2-specific agonists (B-25, B-35, and B-42) [9]. Also included in the test set was 1910-5441, which has been shown to be an FPR1 agonist, although activity for FPR2 was not evaluated [8]. Finally, the test set included two peptide FPR agonists: fMLF and WKYMVm.

4.2. Structure encoding by atom pairs

For the derivation of SAR models, we used an atom pair representation of molecular structures. Each atom pair was denoted as T1_D_T2, where T1 and T2 are the types of atoms in the pair and D is bond separation, i.e. number of bonds in the shortest path between these atoms in the structural formula. As described previously [22,23], T1 and T2 were defined with symbolic codes used in HyperChem, Version 7 (Hypercube, Inc., Gainesville, FL) for atom type encoding within the MM+ force field. For example, CA, CO, and C4 codes were used for sp2-hybridized aromatic, carbonyl, and tetrahedral sp3-hybridized carbon atoms, respectively. This approach allows easy generation of atom pairs from the output file containing the molecular structure (HIN file) built by HyperChem. As atom pairs T1_D_T2 and T2_D_T1 are equivalent, we used a unified definition with lexicographic order of type substrings (i.e., with T1 ≤ T2).
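The T1_D_T2 encoding can be illustrated with a minimal Python sketch. This is not the CHAIN program: the function name, the input format (a list of atom type codes plus a list of bonded index pairs instead of a HIN file), and the toy fragment are assumptions made for illustration.

```python
from collections import Counter, deque


def atom_pairs(atom_types, bonds):
    """Count T1_D_T2 atom pairs for one molecule.

    atom_types: list of type codes for the non-hydrogen atoms, e.g. ["CA", "CO"]
    bonds:      list of (i, j) index pairs, one per chemical bond
    """
    # Build an adjacency list for the molecular graph.
    neighbors = {i: [] for i in range(len(atom_types))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    counts = Counter()
    for start in range(len(atom_types)):
        # Breadth-first search yields the shortest bond path to every atom.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            a = queue.popleft()
            for b in neighbors[a]:
                if b not in dist:
                    dist[b] = dist[a] + 1
                    queue.append(b)
        for end, d in dist.items():
            if end > start:  # count each unordered pair of atoms once
                # Lexicographic order makes T1_D_T2 and T2_D_T1 the same key.
                t1, t2 = sorted((atom_types[start], atom_types[end]))
                counts[f"{t1}_{d}_{t2}"] += 1
    return counts


# Toy linear fragment Br-C(sp3)-C(=O): atom types invented for illustration.
types = ["BR", "C4", "CO", "O1"]
bonds = [(0, 1), (1, 2), (2, 3)]
pairs = atom_pairs(types, bonds)
```

Running the function over every molecule and collecting the union of keys as columns would give a count matrix of the kind described in the next paragraph, with one row per compound.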

All 726 unique atom pairs possible for non-hydrogen atoms in the 71 compounds of the training set were generated. This 71×726 data matrix was automatically built by our CHAIN program, based on HIN files created in HyperChem. A matrix element at the intersection of the ith row and jth column was equal to the jth atom pair occurrence in the ith molecule. A similar data matrix was calculated for the 17 compounds in the test set.

4.3. Data processing and derivation of SAR rules

The data matrix for the training set was used as an input for one-way ANOVA [29] implemented in STATISTICA 6.0. From 726 atom pairs, 161 descriptors were selected, which showed significant differences (p≤0.05) between total and within-class variances. These ANOVA-selected atom pairs were then clustered according to their mutual correlation. Each member of a cluster was tightly correlated (r≥0.9) with at least one descriptor from the same cluster. All but one variable in each of the 28 clusters were considered dependent and were discarded, while the remaining variable with the highest bond separation was taken as independent in a given cluster. These 28 atom pairs, together with the 59 non-clustered variables (total of 87 descriptors), formed the basis set for further analysis.

Derivation of SAR classification was then performed by the LDA method with the ‘Forward Stepwise’ option, using the corresponding module of STATISTICA 6.0. The statistical criterion for inclusion or exclusion of descriptors at each step was p≤0.05. The stepwise LDA allowed selection of 17 significant descriptors from 726 atom pairs generated initially. The LDA procedure was then repeated with the ‘Best Subset Search’ option on the basis of 17 variables selected in the first LDA run. The best subset consisted of just 9 atom pairs giving the least misclassification error in the LDA model. Starting from 9 variables of the best subset, we developed a binary classification tree model with univariate splits. The classification tree was built with STATISTICA 6.0 using estimated prior probabilities and equal misclassification costs for classes [30,32]. An exhaustive classification and regression tree-style univariate split selection method was used, as described by Breiman et al. [30].


This work was supported in part by National Institutes of Health grant P20 RR-020185, National Institutes of Health contract HHSN266200400009C, an equipment grant from the M.J. Murdock Charitable Trust, and the Montana State University Agricultural Experimental Station.



Reference List

1. Le Y, Murphy PM, Wang JM. Trends Immunol. 2002;23:541–548. [PubMed]
2. Boulay F, Tardiff M, Brouchon L, Vignais P. Biochemistry. 1990;29:11123–11133. [PubMed]
3. Ye RD, Cavanagh SL, Quehenberger O, Prossnitz ER, Cochrane CG. Biochem Biophys Res Commun. 1992;184:582–589. [PubMed]
4. Murphy PM, Ozcelik T, Kenney RT, Tiffany HL, McDermott D, Francke U. J Biol Chem. 1992;267:7637–7643. [PubMed]
5. Bao L, Gerard NP, Eddy RL, Jr, Shows TB, Gerard C. Genomics. 1992;13:437–440. [PubMed]
6. Ye RD, Boulay F, Wang JM, Dahlgren C, Gerard C, Parmentier M, Serhan CN, Murphy PM. Pharmacol Rev. 2009;61:119–161. [PMC free article] [PubMed]
7. Nanamori M, Cheng X, Mei J, Sang H, Xuan Y, Zhou C, Wang MW, Ye RD. Mol Pharmacol. 2004;66:1213–1222. [PubMed]
8. Edwards BS, Bologa C, Young SM, Balakin KV, Prossnitz ER, Savchuck NP, Sklar LA, Oprea TI. Mol Pharmacol. 2005;68:1301–1310. [PubMed]
9. Bürli RW, Xu H, Zou X, Muller K, Golden J, Frohn M, Adlam M, Plant MH, Wong M, McElvain M, Regal K, Viswanadhan VN, Tagari P, Hungate R. Bioorg Med Chem Lett. 2006;16:3713–3718. [PubMed]
10. Schepetkin IA, Kirpotina LN, Khlebnikov AI, Quinn MT. Mol Pharmacol. 2007;71:1061–1074. [PubMed]
11. Schepetkin IA, Kirpotina LN, Tian J, Khlebnikov AI, Ye RD, Quinn MT. Mol Pharmacol. 2008;74:392–402. [PMC free article] [PubMed]
12. Kirpotina LN, Khlebnikov AI, Schepetkin IA, Ye RD, Rabiet MJ, Jutila MA, Quinn MT. Mol Pharmacol. 2010;77:159–170. [PubMed]
13. Tong W, Welsh WJ, Shi L, Fang H, Perkins R. Environ Toxicol Chem. 2003;22:1680–1695. [PubMed]
14. Andricopulo AD, Montanari CA. Mini Rev Med Chem. 2005;5:585–593. [PubMed]
15. Helguera AM, Combes RD, Gonzalez MP, Cordeiro MN. Curr Top Med Chem. 2008;8:1628–1655. [PubMed]
16. Frohn M, Xu H, Zou X, Chang C, McElvaine M, Plant MH, Wong M, Tagari P, Hungate R, Bürli RW. Bioorg Med Chem Lett. 2007;17:6633–6637. [PubMed]
17. Cilibrizzi A, Quinn MT, Kirpotina LN, Schepetkin IA, Holderness J, Ye RD, Rabiet MJ, Biancalani C, Cesari N, Graziano A, Vergelli C, Pieretti S, Dal Piaz V, Giovannoni MP. J Med Chem. 2009;52:5054–5057. [PMC free article] [PubMed]
18. Buttingsrud B, Ryeng E, King RD, Alsberg BK. J Comput Aided Mol Des. 2006;20:361–373. [PubMed]
19. Gute BD, Basak SC. SAR QSAR Environ Res. 2006;17:37–51. [PubMed]
20. Carhart RE, Smith DH, Venkataraghavan R. J Chem Inf Comput Sci. 1985;25:64–73.
21. Plewczynski D, von Grotthuss M, Spieser SA, Rychlewski L, Wyrwicz LS, Ginalski K, Koch U. Comb Chem High Throughput Screen. 2007;10:189–196. [PubMed]
22. Khlebnikov AI, Schepetkin IA, Quinn MT. Bioorg Med Chem. 2008;16:2791–2802. [PMC free article] [PubMed]
23. Khlebnikov AI, Schepetkin IA, Kirpotina LN, Quinn MT. Bioorg Med Chem. 2008;16:9302–9312. [PMC free article] [PubMed]
24. Perez-Nueno VI, Rabal O, Borrell JI, Teixido J. J Chem Inf Model. 2009;49:1245–1260. [PubMed]
25. Yu N, Bakken GA. J Chem Inf Model. 2009;49:745–755. [PubMed]
26. Yan SF, King FJ, He Y, Caldwell JS, Zhou Y. J Chem Inf Model. 2006;46:2381–2395. [PubMed]
27. Li J, Liu H, Yao X, Liu M, Hu Z, Fan B. Anal Chim Acta. 2007;581:333–342. [PubMed]
28. Roncaglioni A, Piclin N, Pintore M, Benfenati E. SAR QSAR Environ Res. 2008;19:697–733. [PubMed]
29. Lindman HR. Analysis of variance in complex experimental designs. W.H. Freeman & Co; San Francisco: 1974.
30. Breiman L, Friedman JH, Olshen RA, Stone CJ. Classification and regression trees. Wadsworth & Brooks/Cole Advanced Books & Software; Monterey, CA: 1984.
31. Sogawa Y, Shimizugawa A, Ohyama T, Maeda H, Hirahara K. J Pharmacol Sci. 2009;111:317–321. [PubMed]
32. Loh WY, Shih YS. Statistica Sinica. 1997;7:815–840.