To create a new QSAR model in OCHEM the user must prepare the training and (optional) validation sets, configure the preprocessing of molecules (standardization and 3D optimization), choose and configure the molecular descriptors and the machine learning method, select the validation protocol (N-fold cross-validation or bagging) and, when the model has been calculated, review the predictive statistics and save or discard the model.
The following sections describe each of the aforementioned steps in detail.
Training and validation sets, machine learning method and validation
Training and validation datasets. One of the most important steps in model development is the preparation of input data, i.e., a training set that contains experimentally measured values of the predicted property.
The property that will be predicted by the model is identified automatically based on the contents of the training set. If the training set contains multiple properties, they will be predicted by the model simultaneously. This allows knowledge about different (but related) properties to be combined into a single model, so called multi
]. Multi-learning was shown to significantly increase the overall performance in comparison to models developed for each property separately [20
The OCHEM system allows a user to combine heterogeneous data reported in different units of measurements into a single unit set. For every property, the user must select a unit; all the input data will be automatically converted to this unit and, therefore, the final model will give predictions in this unit.
Machine learning method
. After assigning the training and validation sets (see Fig. ), the user selects a machine learning method and a validation protocol. Currently OCHEM supports linear and Kernel Ridge Regression (KRR) [21
], ASsociative Neural Networks (ASNN) [22
], Kernel Partial Least Squares (KPLS) [23
], a correction-based LogP-LIBRARY model [24
], Support Vector Machines (SVM) [25
], Fast Stagewise Multivariate Linear Regression (FSMLR) and k Nearest Neighbors (kNN) [26
The first step of model creation: selection of a training and validation set, a machine learning method and a validation protocol
Validation of models
. OCHEM offers two possibilities to validate a model: N
-fold cross-validation and bagging. We recommend always using one of the two validation options to avoid a common pitfall of model over-fitting, which results to misleading predictions [27
]. The validation procedure is also used to calibrate the estimation of prediction accuracy [29
If cross-validation is chosen, the whole process of model development, including the filtering of descriptors, is repeated N (by default 5) times with a different split of the initial set into training and validation sets. Only the respective training set is used in each step for model development.
In the case of bagging validation, the system generates N
(by default 100) training sets and builds N
models, based on these sets. The N
sets are generated from the initial training set by random sampling with replacement. The compounds not included in the training set are used to validate the performance of this model; the final prediction for each compound is the mean prediction over all the models where this compound was in the validation set [30
No matter what validation method is chosen, duplicated molecules (regardless of stereochemistry) are used either in training or validation sets but never in both simultaneously, which ensures the proper assessment of the model predictive ability.
The descriptors available in OCHEM are grouped by the software name that contributes them: ADRIANA.Code [31
], CDK descriptors [32
], Chemaxon descriptors [33
], Chirality codes [34
], Dragon descriptors [39
], E-State indices [40
], ETM descriptors [41
], GSFrag molecular fragments [43
], inductive descriptors [44
], ISIDA molecular fragments [45
], quantum chemical MOPAC 7.1 descriptors [46
], MERA and MerSy descriptors [47
], MolPrint 2D descriptors [51
], ShapeSignatures [52
] and logP and aqueous solubility calculated with ALOGPS program [24
]. The descriptor selection screen is shown in Fig. .
A screenshot of the descriptor selection and configuration panel
The following section briefly describes available descriptors. If at least one descriptor in the block requires 3D structures, the block is marked as 3D and as 2D otherwise.
(3D) comprises a unique combination of methods for calculating molecular descriptors on a sound geometric and physicochemical basis [31
]. Thus, they are all prone to an interpretation and allow the understanding of the influence of various structural and physicochemical effects on the property under investigation. ADRIANA.Code performs calculations with user-supplied 3D structures or applies built-in methods to generate 3D structures based on rapid empirical models. In addition, it contains a hierarchy of increasing levels of sophistication in representing chemical compounds from the constitution through the 3D structure to the surface of a molecule. At each level, a wide range of physicochemical effects can be included in the molecular descriptors.
descriptors (2D) predict logP [54
] and the aqueous solubility [55
] of chemicals. This program was recently top-ranked amid 18 competitors for logP prediction using >96,000 in house
molecules from Pfizer and Nycomed [24
]. It was also reported to be “the best available ‘off-the-shelf package’ for intrinsic aqueous solubility prediction” at F. Hoffmann-La Roche [56
]. ALOGPS does not have additional configuration options.
descriptors (3D) are calculated by the DescriptorsEngine, which is a part of the Chemistry Development Kit (CDK) [32
]. The CDK descriptors include 204 molecular descriptors of 5 types: topological, geometrical, constitutional, electronic and hybrid descriptors. The CDK also provides local atomic and bond-based descriptors, which will be included in OCHEM in future.
Chemaxon descriptors (also known as calculator plugins) calculate a range of physico-chemical and life-science related properties from chemical structures and are developed by ChemAxon. These calculators are usually part of the Marvin and JChem cheminformatics platforms. The descriptors are divided into 7 different groups: elemental analysis, charge, geometry, partitioning, protonation, isomers and “other” descriptors, which is a collective group for all heterogeneous descriptors that do not directly fall under any of the previous categories. For some descriptors, such as distribution coefficient (logD), the pH value is essential for calculation. By default, the descriptor value is calculated over the spectrum of pH from 0 to 14 with 1 pH unit intervals. However, it is possible to explicitly designate a specific pH value or range of pH values.
(3D) are molecular descriptors that represent chirality using a spectrum-like, fixed-length code, and include information on geometry and atomic properties. Conformation-independent chirality codes (CICC) [34
] are derived from the configuration of chiral centers, properties of the atoms in their neighborhoods, and bond lengths. Conformation-dependent chirality codes (CDCC) [35
] characterize the chirality of a 3D structure considered as a rigid set of points (atoms) with properties (atomic properties), connected by bonds. Physicochemical atomic stereo-descriptors (PAS) [36
] were implemented to represent the chirality of an atomic chiral center on the basis of empirical physicochemical properties of its ligands—the ligands are ranked according to a specific property, and the chiral center takes an “S/R-like” descriptor relative to that property. The procedure is performed for a series of properties, yielding a chirality profile. All three types of chirality descriptors can distinguish between enantiomers. Examples of applications include the prediction of chromatographic elution order [35
], the prediction of enantioselectivity in chemical reactions [34
], and the representation of metabolic reactions catalyzed by racemates and epimerases of E.C. subclass 5.1 [57
(v. 5.4) descriptors (3D) include more than 1,600 descriptors organized into 20 different sub-types that can be selected separately. DRAGON is an application for the calculation of molecular descriptors developed by the Milano Chemometrics and QSAR Research Group. The DRAGON descriptors include not only the simplest atom type, functional group and fragment counts, but also several topological and geometrical descriptors; molecular properties such as logP, molar refractivity, number of rotatable bonds, H-donors, H-acceptors, and topological surface area (TPSA) [58
] are also calculated by using well-known published models.
(2D) are separated on atom/bond type. In addition to indices it is also possible to select E-state counts, which correspond to counts of atom or bond types according to the respective indices. In some studies E-state counts were reported to produce better models than E-state indices [26
(Electronic-Topological Method [41
]) descriptors (3D) are based on the comparison of 3D structures of molecules. The molecules are represented as matrices where diagonal elements are atom charges and non-diagonal elements are distances between them. The molecules are compared with a template molecule and common fragments become ETM descriptors (i.e., 3D pharmacophores). There are usually two templates representing the most active and inactive molecules.
descriptors (2D) are the occurrence numbers of certain special fragments containing 2–10 non-hydrogen atoms; GSFRAG-L is an extension of GSFRAG that considers fragments that contain a labeled vertex, allowing one to capture the effect of heteroatoms. It was shown that the occurrence numbers of these fragments produce a unique code of a chemical structure for wide sets of compounds [43
descriptors (3D) have been derived from the LFER (Linear Free Energy Relationships)-based equations for inductive and steric substituent constants and can be computed for bound atoms, groups and molecules using intra-molecular distances, atomic electronegativities and covalent radii [44
descriptors (2D) include two types of fragments: sequences and atom centered fragments, each of which includes explicitly atoms, bonds or both [45
]. In the current version of OCHEM only sequences of atoms and bonds are used. The user can specify their minimum and maximum length.
descriptors (3D) are calculated using the non-parametrical 3D MERA algorithm and include (a) geometrical MERA descriptors (linear and quadratic geometry descriptors, descriptors related to molecular volume, proportions of a molecule, ratios of molecular sizes, quantitative characteristics of molecular symmetry, dissymmetry, chirality), (b) energy characteristics (inter- and intra-molecular Van der Waals and Coulomb energies; decomposition of intermolecular energies) and (c) physicochemical characteristics (probabilities of association, heat capacity, entropy, pKa) [47
(MERA Symmetry, 3D) descriptors are calculated using 3D representation of molecules in the framework of the MERA algorithm (see above) and include the quantitative estimations of molecular symmetry with respect to symmetry axes from C2
and the inversion-rotational axis from S1
in the space of principal rotational invariants about each orthogonal component. Additionally, the molecular chirality is quantitatively evaluated in agreement with the negative criterion of chirality (the absence of inversion-rotational axes in the molecular point group) [47
] (2D) are circular fingerprints [60
] based on Sybyl mol2 atom types. They are very efficient and can be easily calculated even for libraries comprising millions of molecules. Circular fingerprints capture a lot of information that relates molecular structure to its bioactivity. It has been shown in large-scale comparative virtual screening studies that MolPrint descriptors often outperform other fingerprinting algorithms in enrichment [61
]. Given the binary nature of MolPrint 2D fingerprints, they are ideally suited for virtual screening and clustering of molecules, as well as to the generation of numerical bioactivity models, which are able to accommodate the presence/absence nature of the descriptor.
descriptors (3D) include whole-molecule and atom-type descriptors. The latter can be used to model local (atom-dependent) properties of molecules, such as pKa or the site of metabolism [63
(3D) can be viewed as a very compact descriptor that encodes molecular shape and electrostatics in a single entity. It reduces the dimensionality of 3D molecular shape and surface charge by representing complex 3D molecules as simple histograms. These signatures lend themselves to rapid comparison with each other for virtual screening of large chemical databases. Shape Signatures can be used by itself or in conjunction with currently available computational modeling approaches commonly employed in drug discovery and predictive toxicology, such as traditional virtual screening, descriptor-based (e.g., QSAR) models, ligand-receptor docking, and structure-based drug design [52
The set of descriptors can be easily extended by incorporating new modules that could also be provided by external contributors. It is possible to use the output of previously created models as input for a new model: this option is sometimes referred to as a feature net
For the descriptors that require 3D structures of molecules, users can either rely on 3D structures generated by CORINA [68
] or retrieve molecules optimized by MOPAC and the AM1 Hamiltonian [46
] calculated by the web services implemented within the CADASTER project (http://mopac.cadaster.eu
). If additional parameters are required for the calculation of descriptors, e.g., pKa value for ChemAxon descriptors, they are specified explicitly in the interface of the corresponding descriptor block. In this case, the parameters are saved with descriptors and are then used exactly in the same form for new molecular sets during the model application.
ATOMIC descriptors (3D) are defined for a particular atom (active center) of a molecule. Atomic descriptors can be used to describe reactive centers (e.g., for pKa calculation, prediction of reactivity). These descriptors are applicable for the prediction of particular “local” properties of molecules that depend on the specified active center (currently, only macro pKa constants are supported). The currently available atomic descriptors are based on MOPAC descriptors and E-State indices.
includes descriptors that characterize ligand–protein interactions. These descriptors will allow using the information about 3D structure of proteins for modeling. For example, a number of docking derived descriptors based on Vina software [69
] were added recently and are currently in use for an ongoing study for prediction of CYP450 inhibitors [70
]. There is also a plan to extend OCHEM with other types of descriptors, e.g., those used in the COMBINE method [71
Users can export most of the descriptor values (with an exception of commercial descriptors) for offline model development. Descriptor values can be exported as an Excel file or as a simple text file in a comma-separated values (CSV) format.
Before descriptors are passed to the machine-learning method, it is possible to filter part of them out by several criteria to avoid redundancy. Currently, OCHEM supports the following filtering options for descriptors: the number of unique descriptor values, the pairwise correlation of descriptors and the variance of principal components, obtained from the principal component analysis (PCA). Thus, it is possible to exclude highly correlated descriptors or descriptors that do not pass user specified thresholds.
Configuration of the machine learning method
There are a number of configuration options that are specific for every particular machine learning method. These options are configured in separate dialog windows. Here, we briefly provide an overview of the methods and their parameters.
k Nearest Neighbors (kNN) predicts the property using the average property value of those k compounds from the training set that are nearest (in the descriptor space) to the target compound. The configurable options are: metrics type (Euclidean distance or the Pearson correlation coefficient) and the number of neighbors. By default the number of neighbors is determined automatically by the method itself.
ASsociative Neural Network (ASNN)
. This method uses the correlation between ensemble responses (each molecule is represented in the space of neural network models as a vector of predictions of neural network models) as a measure of distance amid the analyzed cases for the nearest neighbor technique [22
]. Thus ASNN performs kNN in the space of ensemble predictions. This provides an improved prediction by the bias correction of the neural network ensemble. The configurable options are: the number of neurons in the hidden layer, the number of iterations, the size of the model ensemble and the method of neural network training.
Fast Stagewise Multivariate Linear Regression (FSMLR)
is a procedure for stagewise building of linear regression models by means of greedy descriptor selection. It can be viewed as a special case of the additive regression procedure (regression boosting) specially designed to be compatible with the three-set approach based on the use of three different sets for learning: training set, internal validation set and external test set [73
]. The main configurable parameters are: (1) shrinkage—its lower values result in the large number of required iterations but may provide higher generalization performance, and (2) the relative size of the internal validation set used for stopping descriptor selection procedure.
Kernel Partial Least Squares (KPLS) and Kernel Ridge regression (KRR)
are modifications of partial least squares (PLS) and ridge regression (RR) that use a non-linear kernel (for an introduction to kernel methods see book by Schölkopf and Smola [74
]). The most important parameter for kernel-based methods is the type of kernel, as that determines non-linear relations. Available kernels are: linear, polynomial, and radial basis functions, as well as the iterative similarity optimal assignment kernel (ISOAK) [21
]. The first three kernels are used with molecular descriptors. The ISOAK kernel is defined directly on the molecular structure graph. The individual parameters for every kernel can be either specified manually or configured to be selected automatically by the method itself in an inner loop of cross-validation.
model does not require any additional configuration options. This model is based on ASNN and corrects the ALOGPS logP model [75
] using so called LIBRARY correction [76
] with the training set. The idea of this method is to adjust the LogP model to predict other properties. The success of this methodology was shown for prediction of logD of chemical compounds at pH 7.4 [76
] and it was extended to prediction of arbitrary properties in the OCHEM database.
Multiple Linear Regression Analysis (MLRA) uses step-wise variable selection. The method eliminates at each step one variable that has regression coefficients that are not significantly different from zero (according to the t test). Thus, MLRA has only one parameter which corresponds to the p value of variables to be kept for the regression.
Support Vector Machine (SVM)
uses the LibSVM program [78
]. The SVM method has two important configurable options: the SVM type (ε-SVR and μ-SVR) and the kernel type (linear, polynomial, radial basis function and sigmoid). The other options can be optimized by the method automatically using grid search.