Hydroxyl benzoic esters are preservatives widely used in food, medicine, and cosmetics. To explore the relationship between the molecular structure and antibacterial activity of these compounds and to predict the activity of compounds with similar structures, Quantitative Structure-Activity Relationship (QSAR) models of 25 hydroxyl benzoic esters are built from quantum chemical parameters and molecular connectivity indexes using a support vector machine (SVM) in the R language. The external standard deviation error of prediction (SDEPext), the fitting correlation coefficient (R2), and the leave-one-out cross-validation coefficient (Q2LOO) are used to evaluate the reliability, stability, and predictive ability of the models. The results show that R2 and Q2LOO of the 4 nonlinear models are all greater than 0.6 and that SDEPext is 0.213, 0.222, 0.189, and 0.218, respectively. Both the correlation coefficient and the standard deviation are better than those of the multiple linear regression (MLR) model (R2 = 0.421, RSD = 0.260). The reliability, stability, robustness, and external predictive ability of the models are good, particularly for the model with a linear kernel function and eps-regression type. This model can predict the antimicrobial activity of compounds with similar structures within the applicability domain.
QSAR [1, 2] is used to study the relationship between molecular structure and biological activity or physicochemical characteristics, to express that relationship quantitatively, to predict the activity of unknown compounds, and to guide the synthesis of new materials [3–5]. QSAR is considered one of the most promising technologies and is widely used at present because it compensates for missing experimental data, reduces the cost of testing, and enables high-throughput prediction and screening. Many international organizations and regulatory agencies support and promote the use of QSAR and regard it as a possible alternative to animal experiments. Health Canada, the United States Food and Drug Administration (FDA), the Environmental Protection Agency (EPA), the European Union, and the Organization for Economic Cooperation and Development (OECD) apply QSAR to identify potential health hazards and to screen and prioritize compounds. After years of development, QSAR has become a frontier topic in medicinal chemistry, environmental chemistry, life science, analytical chemistry, computational chemistry, and even pesticide research [8–11].
Hydroxyl benzoic esters are an important class of preservatives, widely used in medicine, food, cosmetics, pesticides, and other fields. At present, there are about 60 kinds of food preservatives in the world. Benzoic acid and sorbic acid are produced in China, but their use is limited by the high toxicity of benzoic acid and the high price of sorbic acid. Hydroxyl benzoic esters offer high efficiency, low toxicity, good compatibility, and other advantages; their antibacterial performance is stronger than that of benzoic acid and sorbic acid because they carry a phenolic hydroxyl group. It is therefore of great significance to study and apply the antibacterial activity of hydroxyl benzoic esters.
SVM is a machine learning algorithm based on statistical learning theory, proposed by Cortes et al. [15–17]. SVM can be used for pattern recognition, regression analysis, function fitting, and similar tasks because it has favorable mathematical properties, such as the uniqueness of the solution and independence from the dimension of the input space. The optimal solution obtained by SVM is considered superior to that of traditional learning methods. In recent years, SVM has been applied to QSAR studies. Hou et al. investigated the QSAR of the antimalarial activity of PfDHODH inhibitors by generating four computational models with multiple linear regression (MLR) and SVM on a dataset of 255 PfDHODH inhibitors. Sharma et al. used SVM and MLR to study the activity of HIV-1 capsid inhibitors and found the SVM model to be more efficient in prediction. Khuntwal et al. used MLR and SVM to develop QSAR models for a dataset of 34 tetrahydrobenzothiophene derivatives. Zhiming et al. used ridge regression (RR) and SVM to build QSAR models of bitter tasting thresholds (BTT) and cytotoxic T lymphocyte (CTL) and predicted independent test data; the fitting, LOOCV, and external prediction accuracies were superior to previously reported results. Zhang et al. took benzene compounds as the research object and combined quantitative descriptions of molecular structure with MLR or the nonlinear regression method SVM to build QSAR models of acute toxicity and mutagenicity. Comparing the linear and nonlinear QSAR models, Zhang et al. found that the stability and predictive ability of the nonlinear models were better than those of the multiple linear models.
In the literature, there are very few QSAR studies of hydroxyl benzoic esters. Jiang et al. used MLR to build a QSAR model that predicts MIC and t0.5 well within a limited range of carbon numbers on the ester chain (1–4 carbon atoms for MIC and 1–3 for t0.5). Qiu et al. optimized the molecular structures of eleven p-hydroxyl benzoic esters with the density functional theory (DFT) B3LYP method of quantum chemistry and then used stepwise multiple linear regression to select the descriptors and to generate the best prediction model relating the structural features to inhibitory activity. The QSAR results showed that the energy of the lowest unoccupied molecular orbital (ELUMO) and the increase of the dipole moment μ were the main independent factors contributing to the antifungal activity of the compounds. SVM has shown obvious advantages in QSAR research, but QSAR studies of hydroxyl benzoic esters are confined to linear models at present; there is no literature on nonlinear QSAR analysis of this system.
In this paper, we use quantum chemical parameters and molecular connectivity indexes to analyze the antibacterial activity of hydroxyl benzoic esters. The QSAR models are established with the SVM algorithm in the R software. We obtain the structure-activity relationship between the molecular structural parameters, computed for the most stable configuration, and the antibacterial activity against Escherichia coli, which provides a basis for predicting the antibacterial activity of similar compounds.
The research object of this paper is a set of 25 hydroxyl benzoate compounds: 10 o-hydroxyl benzoic esters, 2 m-hydroxyl benzoic esters, and 13 p-hydroxyl benzoic esters. Their details are shown in Table 1.
The antimicrobial half-life t1/2 (in hours) of the 25 hydroxyl benzoic esters at the minimum inhibitory concentration was collected from the literature, and its logarithm (lgt1/2) is used to express the antibacterial activity. The results are shown in Table 2.
Quantum chemical parameters and molecular connectivity indexes can explain the antibacterial activity of compounds well and correlate well with it; therefore, this paper selects parameters of these two kinds that have a clear physical meaning as the descriptors.
In this paper, the quantum chemical parameters are calculated with Gaussian 09, a quantum chemistry package from Gaussian, Inc. (United States) that supports semiempirical and ab initio calculations. Molecular structures are built directly in GaussView 5, which also creates the input files. Gaussian 09 reads the input file and automatically converts the structure into redundant internal coordinates. The results of the calculation are written to a text output file. Before each calculation, a suitable model chemistry (computational method) should be chosen for the system in order to balance computational cost and accuracy [27, 28]. The method used in this paper is DFT at the B3LYP/6-31G(d) level. All geometry optimizations converged and the frequency analysis showed no imaginary frequencies, so the molecular configurations are optimal and the data are reliable. The relevant quantum chemical parameters are extracted from the output files; their values are shown in Table 3.
Molecular connectivity indexes are constants calculated from the molecular structure; they mainly reflect the number of atoms in the molecule, its valence bonds, its branching, and similar information. Each order of index has a different meaning. Many studies show that 5XvP carries a large amount of structural information and is of great significance in explaining the influence of structure on biological activity [29, 30]. This study therefore selects 8 molecular connectivity indexes: 0XvP, 1XvP, 2XvP, 3XvP, 4XvP, 5XvP, 3XvC, and 4XvPC. The results are shown in Table 4.
The rational division of datasets is an active research topic in the field of QSAR, and a variety of methods exist. In this paper, Random Sampling (RS) is used to divide the raw data into a training set (22 compounds) and a test set (3 compounds: one o-hydroxyl benzoic ester, one m-hydroxyl benzoic ester, and one p-hydroxyl benzoic ester). The training set is used to establish the SVM nonlinear models, and the test set is used to test the external predictive ability of the models.
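The division described above can be sketched as follows. This is a minimal Python illustration, not the authors' R code. It assumes that the three test compounds were obtained by drawing one compound at random from each positional group (ortho, meta, para); the compound names, the seed, and the function `split_by_position` are hypothetical placeholders.

```python
import random

def split_by_position(compounds, seed=0):
    """Draw one test compound per positional group; the rest form the training set.

    compounds: list of (name, position) tuples, position in {"o", "m", "p"}.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    test = []
    for pos in ("o", "m", "p"):
        group = [c for c in compounds if c[1] == pos]
        test.append(rng.choice(group))
    train = [c for c in compounds if c not in test]
    return train, test

# Placeholder names standing in for the 10 ortho, 2 meta, and 13 para esters.
compounds = ([(f"o-{i}", "o") for i in range(1, 11)]
             + [(f"m-{i}", "m") for i in range(1, 3)]
             + [(f"p-{i}", "p") for i in range(1, 14)])

train, test = split_by_position(compounds)
print(len(train), len(test))
```

Drawing within each group guarantees that every substitution pattern is represented in the test set, which a plain random draw of 3 from 25 would not.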
Using a program written in R, the training set of 22 compounds is used to build the nonlinear models with the SVM algorithm based on the selected descriptors. First, we standardize the data; then we establish 4 models by combining two kernel functions (radial and linear) with two regression types (eps-regression and nu-regression).
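The two preparatory steps can be sketched in a few lines. The following Python sketch is illustrative only: it assumes z-score standardization (column mean 0, sample standard deviation 1), which the text does not specify, and the descriptor values and the helper names `fit_scaler` and `apply_scaler` are hypothetical. The kernel and type names are those given in the text; they match the `kernel` and `type` options of the `svm` function in the R package e1071, although the package used is not named here.

```python
from itertools import product
from statistics import mean, stdev

def fit_scaler(rows):
    """Return (mean, sample standard deviation) for each descriptor column."""
    return [(mean(col), stdev(col)) for col in zip(*rows)]

def apply_scaler(rows, params):
    """Center and scale every row with previously fitted column statistics."""
    return [[(v - m) / s for v, (m, s) in zip(row, params)] for row in rows]

# Hypothetical descriptor values: 4 compounds x 2 descriptors.
train_x = [[1.0, 10.0], [2.0, 14.0], [3.0, 18.0], [4.0, 30.0]]
params = fit_scaler(train_x)
train_z = apply_scaler(train_x, params)

# Two kernels crossed with two regression types give the 4 models.
configs = list(product(["radial", "linear"], ["eps-regression", "nu-regression"]))
for kernel, svm_type in configs:
    print(kernel, svm_type)
```

In this sketch the column statistics are fitted on the training set only, so the same `params` can later be applied to the test compounds without using any information from them.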
Model validation is very important for QSAR research. It consists of two aspects: internal validation, which tests the fitting ability and robustness of the models, and external validation, which tests their predictive ability. Internal and external validation are equally important.
There are many ways to estimate a model's stability, robustness, and internal predictive ability, such as the fitting correlation coefficient, cross-validation, the random model test, Y-randomization, and various residual errors (such as the root mean squared error (RMSE) and the standard residual error). In this paper, the fitting correlation coefficient (R2) between the experimental and predicted values of the training set and the leave-one-out cross-validation coefficient (Q2LOO) are used to test the reliability, robustness, and stability of the models and to check whether the models are overfitted.
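The two internal statistics can be illustrated with a short sketch. The paper does not print its formulas, so the code below assumes the common definitions R2 = 1 − SS_res/SS_tot on the fitted values and Q2LOO = 1 − PRESS/SS_tot, where PRESS sums the squared errors of predictions made for each compound by a model fitted without it. A one-descriptor least-squares line (`fit_line`) stands in for the SVM purely to keep the example self-contained; the data are invented.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares line; a simple stand-in for any fit(x, y) -> predictor."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return lambda x: slope * x + intercept

def r_squared(y_obs, y_pred):
    """1 - SS_res / SS_tot, with SS_tot taken around the mean of y_obs."""
    ybar = mean(y_obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))
    ss_tot = sum((o - ybar) ** 2 for o in y_obs)
    return 1 - ss_res / ss_tot

def q2_loo(xs, ys, fit):
    """Leave-one-out: refit without compound i, then predict compound i."""
    preds = []
    for i in range(len(xs)):
        model = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        preds.append(model(xs[i]))
    return r_squared(ys, preds)

# Invented data for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.1, 1.9, 3.2, 3.8, 5.1, 6.2]

model = fit_line(xs, ys)
r2 = r_squared(ys, [model(x) for x in xs])
q2 = q2_loo(xs, ys, fit_line)
print(round(r2, 3), round(q2, 3))
```

Because each left-out compound is predicted by a model that never saw it, Q2LOO is normally lower than R2; a large gap between the two is the usual sign of overfitting.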
A very important purpose of QSAR models is to predict the activity of new or not yet synthesized compounds, in order to guide the design and synthesis of compounds with desirable activity or to screen compounds. This requires that the model have good predictive and generalization ability. However, cross-validation can only describe the internal predictive ability of a model, and good internal predictive ability does not imply excellent external predictive ability [34–36]; that is, a good cross-validated Q2cv is a necessary but not sufficient condition for high external predictive ability. The only way to evaluate the external predictive ability of a model is to test it with new compounds (namely, an external test set that is not involved in descriptor selection or model building). The parameters used to evaluate external predictive ability include R2ext, Q2ext, and SDEPext. In this paper, the models are used to predict lgt1/2 for the test set, and their external predictive ability is evaluated by SDEPext.
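As a concrete illustration of the external statistic, the sketch below computes SDEPext under its usual definition, the square root of the mean squared difference between experimental and predicted values over the external test set. The paper does not print the formula, so this definition is an assumption, and the lgt1/2 values in the example are hypothetical rather than taken from the paper's tables.

```python
from math import sqrt

def sdep_ext(y_obs, y_pred):
    """Standard deviation error of prediction over an external test set."""
    if len(y_obs) != len(y_pred) or not y_obs:
        raise ValueError("need two non-empty sequences of equal length")
    squared = sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))
    return sqrt(squared / len(y_obs))

# Hypothetical lgt1/2 values for a 3-compound test set (illustration only).
observed = [1.20, 0.95, 1.40]
predicted = [1.05, 1.10, 1.60]
print(round(sdep_ext(observed, predicted), 3))
```

SDEPext is expressed in the same units as lgt1/2, so lower values mean that external predictions lie closer to the experimental values.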
We use principal component analysis to extract the most critical molecular descriptors of the hydroxyl benzoic esters for antibacterial half-life.
Four nonlinear SVM models based on the selected descriptors are established using the training set. The experimental values and internal prediction results of lgt1/2 are shown in Table 5, and the corresponding scatter plot is shown in Figure 1.
lgt1/2 of the test set is predicted by each of the 4 SVM models, and the results are shown in Table 7. SDEPext of the models and the residuals between the experimental and predicted values of lgt1/2 are given in Table 8. Scatter plots of the experimental values and the predictions of the 4 SVM models for lgt1/2 of all 25 compounds are shown in Figure 2.
The degree of freedom and the speed of the preservative molecule determine the effective collisions between its reactive central atom and the active groups or atoms of the microbial molecules. The antimicrobial property of a preservative is therefore essentially determined by the electronic behavior of the preservative and the microorganism, that is, by the quantum biochemical characteristics of the preservative. Studying the relationship between the structure and properties of a compound from the perspective of quantum chemistry can thus explain the effective antimicrobial groups of a preservative at a fundamental level. Jiang et al. used multiple linear regression to establish a linear model of the 25 hydroxyl benzoic esters; its parameters are shown in Table 9. R2 was only 0.421, although the equation showed a good linear relationship when the number of C atoms was less than 4. When the number of C atoms in the ester group is more than 4, the influencing factors become more complex and can no longer be described by a simple linear relationship; the relationship may be nonlinear or multifactorial. We therefore wrote a program in the R language and used the SVM algorithm to establish 4 nonlinear models for the 25 hydroxyl benzoic esters and to predict lgt1/2. The predicted results for the training set are shown in Table 5, and the scatter plot of experimental and predicted lgt1/2 was drawn with the R software. Figure 1 shows that the predicted and experimental values are in good agreement and that the linearity is obvious. According to the literature, a model is good if R2 is greater than 0.6 [35, 38] and Q2 is greater than 0.5, and excellent if both values exceed 0.9. Tropsha et al. recommend that R2 and Q2 both be greater than 0.6.
Table 6 shows that R2 and Q2LOO are both greater than 0.6 for all models and close to 0.75 for the two models with a linear kernel function. We therefore consider the stability, robustness, and internal predictive ability of the 4 models to be good, and the models are not overfitted because R2 exceeds Q2LOO by no more than 25%. The external test set, consisting of one para-, one ortho-, and one meta-compound drawn by RS from the 25 hydroxyl benzoic esters, is used to test the models, and the prediction results are shown in Table 7. Table 8 shows that the residuals of lgt1/2 for the test set range from −0.037244 to 0.322733 and that SDEPext is 0.213, 0.222, 0.189, and 0.218, respectively. The results indicate that the 4 models have high external predictive ability; in particular, the model with the linear kernel function and eps-regression type is better than the other 3 models. The scatter plots in Figure 2, which compare the experimental values with the predictions of the 4 SVM models for lgt1/2 of all 25 compounds, show that the overall prediction of the 4 models is good and that the linear relationship between predicted and experimental values is best for the model with the linear kernel function and eps-regression type.
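The internal acceptance criteria applied above can be written as a small check. This Python sketch rests on two assumptions: that "no more than 25%" means a relative gap, (R2 − Q2LOO)/R2 ≤ 0.25, and that the 0.6 threshold is applied to both statistics. The function name and the input values are hypothetical and are not taken from Table 6.

```python
def passes_internal_checks(r2, q2_loo, r2_min=0.6, q2_min=0.6, max_gap=0.25):
    """Apply the thresholds quoted in the text to one model.

    Assumption: the 25% limit is read as a relative gap,
    (R2 - Q2LOO) / R2 <= max_gap.
    """
    gap = (r2 - q2_loo) / r2
    return r2 > r2_min and q2_loo > q2_min and gap <= max_gap

# Hypothetical values for illustration only.
print(passes_internal_checks(0.75, 0.70))  # small gap between R2 and Q2LOO
print(passes_internal_checks(0.95, 0.62))  # high R2 but a large drop under LOO
```

The second call shows why the gap matters: a model can clear both individual thresholds and still be flagged when its fit is much better than its cross-validated performance.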
In Table 10, the principal component analysis shows that the first principal component accounts for 96.03% of the variance; therefore, only the first principal component is retained. Table 11 shows that the first principal component includes E (total energy), ZPE (zero-point vibrational energy), and p (polarizability). We consider E, ZPE, and p to be the key factors for the antibacterial half-life of hydroxyl benzoic esters. p is a structural parameter that describes the deformation of the molecule under the action of an external electric field. Most importantly, p is related to the volume of the molecule and contains information about molecular interactions that characterizes the molecule as an electron acceptor. The coefficients of p and ZPE are negative, which indicates that the greater the values of p and ZPE, the shorter the antibacterial half-life of the hydroxyl benzoic esters; E has the opposite effect because its coefficient is positive.
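For reference, the proportion of variance quoted above follows the usual definition in principal component analysis. The notation is introduced here and is not the authors': it assumes m standardized descriptors z_1, ..., z_m, eigenvalues λ_1 ≥ ... ≥ λ_m of their correlation matrix, and loadings a_{1j} of the first component.

```latex
\text{proportion of variance of } \mathrm{PC}_k
  = \frac{\lambda_k}{\sum_{j=1}^{m} \lambda_j},
\qquad
\mathrm{PC}_1 = \sum_{j=1}^{m} a_{1j}\, z_j
```

With a first-component share of 96.03%, all remaining components together account for 3.97% of the variance. The sign of each loading a_{1j} is what the discussion of p, ZPE, and E refers to as the sign of the coefficient.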
In summary, the nonlinear QSAR models built from quantum chemical parameters and molecular connectivity indexes predict the antibacterial activity of hydroxyl benzoic esters better than the linear model. The introduction of the SVM algorithm addresses the poor correlation of the linear QSAR model and the complex nonlinear relationship between the molecular descriptors when the formula weight is large, which provides a basis for predicting the antibacterial activity of compounds with similar structures.
Therefore, the main conclusions of this paper are as follows:
This study was supported by the Natural Science Foundation of Guangdong Province (Grant no. 2014A030313585), the Natural Sciences Funds, China (Grant no. 81473588, 2014), and Guangdong Province Science and Technology New Drug R&D Key Project (Grant no. 2013A022100041).
The authors confirm that this article's content has no conflicts of interest.