|Home | About | Journals | Submit | Contact Us | Français|
The terms bioaccumulation and bioconcentration refer to the uptake and build-up of chemicals that can occur in living organisms. Experimental measurement of bioconcentration is time-consuming and expensive, and is not feasible for a large number of chemicals of potential regulatory concern. A highly effective tool depending on a quantitative structure-property relationship (QSPR) can be utilized to describe the tendency of chemical concentration organisms represented by, the important ecotoxicological parameter, the logarithm of Bio Concentration Factor (log BCF) with molecular descriptors for a large set of non-ionic organic compounds. QSPR models were developed using multiple linear regression, partial least squares and neural networks analyses. Linear and non-linear QSPR models to predict log BCF of the compounds developed for the relevant descriptors. The results obtained offer good regression models having good prediction ability. The descriptors used in these models depend on the volume, connectivity, molar refractivity, surface tension and the presence of atoms accepting H-bonds.
Bioaccumulation is the process where the chemical concentration in an aquatic organism achieves a level exceeding that in the water, as a consequence of chemical uptake through all means of chemical exposure (e.g. dietary absorption, transport across the respiratory surface, dermal absorption, and inhalation). Bioconcentration is defined as the absorption of a chemical concentrations in an aquatic organism’s tissues that is greater than that in the water as a result of exposure of the organism to a chemical concentration in the water via non-dietary routes. The extent to which a contaminant will concentrate in an organism is articulated as BCF which is the ratio of the chemical concentration in the organism (CB) and the water (Cw)1:
Within a species, BCFs vary for different chemical compounds. The BCF of a chemical is the result of four processes: Absorption, Distribution, Metabolism and Excretion (ADME).2 Bioconcentration is usually derived under laboratory conditions where the chemical is absorbed from the water via the respiratory surface and/or the skin only. Generally, the experimental measurement of bioconcentration is time-consuming and expensive, and is not feasible for a large number of chemicals of potential regulatory concern. For this reason attention is turning to estimation of BCF values by QSPRs. BCF values can be estimated from the octanol/water partition coefficient (Kow) using QSPR models. In addition, Kow values, either experimentally determined or estimated, can be used directly to assess the potential for bioaccumulation.
QSPRs are usually observed between log BCF and log Kow. The majority of these are linear regression models3–8 which give satisfactory results only for chemicals with log Kow < 6, while for highly hydrophobic chemicals (log Kow > 6), non-linear,9,10 bilinear11,12 or polynomial13 relationships have been proposed and applied for more satisfactory BCF prediction/estimation. Zhao et al14 performed a QSPR study on a large set of 473 compounds that was created with ISIS BASE 2.5 SP2. Lu15 proposed models for the prediction of BCF using connectivity indices and comprising experimental BCF data in fish (at steady-state) for 239 non-ionic organic compounds. This approach was applied by Gramatica and Papa16,17 with the same data set previously modeled by Lu15 using a large starting set of theoretical molecular descriptors and by applying a genetic algorithm (GA) approach as the variable subset selection method to obtain multilinear regression models with few variables correlated with BCF. Fatemi et al18 reported a QSAR study on a set of 53 compounds using geometric, electronic and topological descriptors by artificial neural networks (ANN) to predict the value of log BCF for some compounds. Recently, a QSAR model for fish BCFs of 8 groups of compounds was developed employing partial least squares (PLS) regression, based on linear solvation energy relationship (LSER) theory and theoretical molecular structural descriptors.19–21
Khadikar et al22 reported QSAR study on the estimation of bioconcentraction factor of polyhalogenated biphenyls using a distanced-based index called Padmakar-Ivan index, abbreviated as PI. Incited by these results, we have used the molecular descriptors used by Lu et al15 Gramatica and Papa16,17 and Khadikar et al22 in addition to large set of other topological indices to model the BCF. It is worth to mention that, to our knowledge and to date, no attempt has been made to investigate neural network modeling using principal components for prediction of BCFs of the set of nonionic organic compounds.
Before discussing our results it is necessary to mention that both Lu et al15 and Gramatica and Papa in16,17 used the same data set of 238 compounds, while Khadikar et al22 used only 16 polyhalogenated biphenyls from the original set of 238 compounds reported by Lu et al.15 Furthermore, Gramatica and Papa16,17 did not verify the data and only deleted acrolein from the original data set reported by Lu et al.15 In this contribution, we observed that 9 compounds (compounds: 210–215, 218, 220, and 234) out of the larger set of 238 compounds need to be removed in order to obtain excellent models. The main reason for removing these 9 compounds is that the molecular descriptors used by us failed to establish structure-property relationship.
In view of the above, we used principal components— artificial neural networks (PC-ANN) and PLS methods on the data set of 229 non-ionic compounds modeled by Lu et al.15 The proposed models were checked for their internal and external predictive power using cross validation parameters.
Geometry Optimization was performed by HyperChem (Version 7.0 Hypercube, Inc) at the AM1 level. Descriptors were calculated using HyperChem and Dragon software.23 SPSS Software (version 13.0, SPSS, Inc.) was used for the simple multiple linear regression (MLR) analysis. PCA, ANN and PLS regression were performed in the MATLAB (Version 7.0.1 (R14), Mathworks, Inc.) environment.
Compounds name and their BCFs are included in Table 1. Chemical structure of these compounds was obtained from HyperChem software and optimized on AM1 semi-empirical level. An AM1 optimization was chosen since it was developed and parameterized for common organic structures. The Optimization was preceded by the Polak-Rebiere algorithm to reach 0.01 root mean square gradient. In this study, 24 molecular descriptors including topological, 2D-autocorrelation, GETAWAY, properties and functional groups descriptors were calculated using HyperChem and Dragon software, see Abbreviations.
MLR analysis using the method of maximum −R2 with stepwise selection and elimination of variables24 was employed to model the logarithm of the BCF (log BCF) relationships with different set of descriptors to select initial input models for the artificial neural networks algorithm (ANN).
Principal component analysis (PCA) and more specifically factor analysis (FA) groups together variables that are collinear to form a composite indicator capable of capturing as much of common information of those indicators as possible. Each factor reveals the set of variables having the highest association with it. The idea under this approach is to account for the highest possible variation in the indicators set using the smallest possible number of factors. Therefore, the index no longer depends upon the dimensionality of the dataset but it is rather based on the “statistical” dimensions of the data. Application of PCA on a descriptor data matrix results in a loading matrix containing factors or principal components, which are orthogonal and therefore do not correlate with each other. We used these factors as the inputs of ANN instead of the original descriptors.
In contrast to MLR, the artificial neural networks (ANN) are capable of recognizing highly nonlinear relationships. The flexibility of ANN enables it to discover more complex relationships in experimental data, when it is compared with the traditional statistical models. The PC-ANN was proposed by Gemperline25 to improve training speed and decrease the overall calibration error.
In this method, as a preliminary treatment, the input data (i.e. molecular descriptors) was normalized so as to have zero mean and unity variance, and then were subjected to principal component analysis (PCA) before being introduced into the neural network. The most significant principal components (PCs), which explain most of the variance in the original data (>95%), were selected, ranked according to decreasing Eigen-value and then used as ANN input.
It should be noted for each model obtained with MLR separate PC-ANN models were developed so that the input’s descriptors were the subsets selected by the stepwise MLR methods. In the case of each MLR model, a feed-forward neural network with back-propagation of error algorithm was constructed to model the activity structure relationships between the extracted PCs of the descriptors in one hand and the logarithm of BCF data of the non-ionic organic compounds in the other hand. More details about the model development in PC-ANN and the network architecture are explained.26–29 Over-fitting problem or poor generalization capability happens when a neural network over-learns during the training period. A too well-trained model may not perform well on unseen data set due to its lack of generalization capability. An approach to overcome this problem is the early stopping method in which the training process is concluded as soon as the overtraining signal appears. This approach requires the data set to be divided into three subsets: training, test and validation sets. The training and the validation sets are the norm in all model training processes. The test set is used to test the trend of the prediction accuracy of the model trained at some point of the training process. At later training stages, the validation error increases. This is the point when the model should cease to be trained to overcome the over-fitting problem. To achieve this purpose, the extracted PCs for each MLR model were classified into training set (60%), validation set (20%) and external test set (20%). Then, the training and validation sets were used to optimize the network performance. The regression between the network output and the activity was calculated for the three sets individually. The training function “trainscg” in MATLAB was used to train the network. To find models with lower errors, the ANN algorithm was run many times, each time run with different geometry and/or initial weights. The different ANN runs were carried out in a systematic way so that the obtained models have low training and testing rootmean-square errors (i.e. low fitness).
PLS is a method for building regression models on the latent variable (LV) decomposition relating two blocks, matrices X and Y, which contain the independent and dependent variables, respectively. These matrices can be simultaneously decomposed into a sum of LV’s. In this procedure, it is necessary to find the best number of LV’s, which is normally performed by using cross-validation, based on determination of minimum prediction error. Leave-one-out cross validation was carried out using the NIPALS algorithm. Applications of PLS have been discussed by several workers.30,31 For model validation, the dataset is required to be divided into training set for building the QSAR model and external test set for investigating its predictive ability. The most important indicators of the QSAR model superiority are the statistical parameters of the external test set. In a similar manner to our previous work,32 the data was divided into 80% training set and 20% test set. To have comparable data with that used in the ANN analysis, the outliers and test set compounds are kept the same as in the PC-ANN analysis.
Table 2 records the regression models suggested from MLR analyses and their correlation coefficients (R). The number of descriptors in these models is varied between 3 and 15. The highest correlation coefficient obtained is 0.923 for a regression model with 15 descriptors (model 15) whereas the R2CV (Q2) values shown in Figure 1 suggest that model 4 can be the optimal.
The regression equation for models 4 is:
Eq. (2) shows that log BCF can be modeled by 2D autocorrelation descriptors (MATS2 m and GATS2e), the information indices descriptors (V1MD,deg) and the functional group descriptor (nHAccc). 2D autocorrekation descriptors are molecular descriptors calculated from molecular graph by summing the products of atom weights of the terminal atoms of all the paths of the considered path length (the lag). 2D autocorrelations by Moran (MATS) and Geary (GATS) algorithms are calculated from lag 1 to lag 8 for 4 different weighting schemes. The information indices descriptors are molecular descriptors calculated as information content of molecules, based on the calculation of equivalence classes from the molecular graph. Eq. (2) shows that the most significant descriptor is the Moran autocorrelation of a topological structure −lag2/weighted by atomic mass (MATS2 m) which is among the 2D autocorrelation descriptors. Furthermore, eq. (2) shows that log BCF increases with increasing MATS2M and V1M D,deg and decreases with increasing Geary autocorrelation −lag2/weighted by atomic Sanderson electronegativities (GATS2e) as well as the number of acceptor atoms for H-bonds (nHAcc) values.
Nevertheless, the number of descriptors is small according to the rule of the thumb suggested by Tute.33 Partial least squares (PLS) and artificial neural networks algorithm (ANN) were used for further investigation of the linear and nonlinear relationships in the obtained regression models.
The inputs of the ANN were the subset of the descriptors used in different MLR models (Table 2 and S1 in the supplementary material). The correlation data matrix for these descriptors is represented in supplementary material (Table S2). As it is observed, some descriptors represent high degree of collinearity. Collinear descriptors add redundancy to the input data matrix and therefore the performances of the models obtained by using these descriptors will be degraded. Therefore, we used principal components as the inputs of ANN instead of the original descriptors.
Firstly, PCA was used to classify the molecules into training, validation and prediction sets. Performing PCA overall the data set of 229 compounds and 24 descriptors and plotting the first and second principal, shows that compounds 179 and 181 are outliers, see Figure 2. In other words, molecules 179 and 181 behave differently from other molecules with respect to both molecular structure (descriptors) and the logarithm of BCF (log BCF). Therefore, these molecules were not used in the future analysis. According to the pattern of the distribution of the data in factor spaces (Fig. 2) the training, validation and prediction molecules were selected homogenously, so that molecules in different zones of Figure 2 included to all three subsets. After removing the outliers and subjecting the data for the remaining 227 compounds to the preliminary treatment mentioned above, the classified data was used as an input for the ANN.
In this study, a three–layered feed–forward ANN model with back–propagation learning algorithm34 was employed. At the first, the nonlinear relationship between the subset of descriptors selected by stepwise selection-based MLR (Table 2) and log BCF was preceded by PC-ANN models with similar structure. The number of hidden layer’s nodes was set 7 for all models and the number of nodes in the input layer was the number of PCs extracted for each subset of descriptors. The results of PC-ANN modeling for MLR models number 3–15 are given in Table 2. This table shows that models 4 and 14 have almost the highest correlation coefficient for the external test set (0.877 and 0.848, respectively) which indicates a high predictive power. The training set correlation coefficients for models 4 and 14 are 0.915 and 0.903, respectively. The R2CV values for model 4 is 0.837 and the correlation coefficient of prediction is 0.877, which means that the four PCs selected by eigenvalue ranking procedure can explain at least 83.7% and 76.5% variance in log BCF for the calibration and prediction, respectively. Model 14 has a lower R2CV values (0.814) and correlation coefficient of prediction (Rp = 0.848) than model 4 which models 4 PCs while model 14 models 5 PCs. This suggests that the variables in model 4 are not so strictly correlated between themselves (4 variables represents by 4 PCs) and the network probably would not lose its performance due to collinearity while the variables in model 14 (14 variables) are correlated, they were represented by only 5 PCs.
Figure (3a) shows plots of predictive residual sum of squares (PRESS) against regression models number 3 –15 for training and test sets obtained from ANN analysis. As it is seen, the values of PRESS for ANN models 4 and 14 are the minima for training and test sets at the same time which make each of them a good candidate for performing feature analysis on. Hence, the number of hidden nodes for models 4 and 14 was optimized and compared.
In order to optimize the performance of the ANN models 4 and 14, we trained the ANN using different number of hidden nodes starting from 1 hidden node to 20 hidden nodes. Figure (3c) shows plots of PRESS against number of hidden nodes for training and test sets for ANN model 4. Although the minima on PRESS curves for this model occurs when using 19 and 20 hidden nodes, such large number of hidden nodes can lead to overfitted models.35 Therefore, we will consider the models with the lower number of hidden nodes (10 hidden nodes, in this case). The PRESS values for the external test set are mainly lower than that for the training set for all models.
The results for the ANN optimization for model 4 are shown in Table S3. Using 19 or 20 hidden nodes gives comparable network performance to that obtained when using 10 hidden nodes. Although the former models have better statistical parameters than those obtained when using 10 hidden nodes, the model obtained using10 hidden nodes was chosen as the optimal one to avoid the risk of overfitting that may be associated with the use of large number of hidden nodes. Using 10 hidden nodes, we obtained a high correlation coefficient for the training set (0.918) and for the prediction set (0.882). This model has a high R2 value for the cross–validation (0.841) and low prediction error (RSEP% = 0.824%).
Figure (3d) shows plots of PRESS against number of hidden nodes for training and test sets for ANN model 14. The results for the ANN optimization for model 14 are shown in Table S3. This table shows that the minimum on the PRESS curves occurs when using 16–20 hidden nodes. Following the same argument used for model 4, we consider the model with the next lower PRESS value which is in this case 14. For this model, we obtained a relatively high correlation coefficient for both the training (0.917) and the prediction sets (0.845). This model has a high R2 value for the cross–validation (0.840) and an RSEP% value of (0.946%). From a statistical point of view, models 4 and 14 (obtained using 10 and 14 hidden nodes, respectively) are similar. Since large numbers of hidden nodes often draws the attention to overfitting risk,35 model 4 obtained using 10 hidden nodes is preferred over model 14 that is obrained using 14 hidden nodes. The relative standard error of prediction (RSEP%) is an important parameter for the evaluation of the predictive ability of a multivariate calibration model. RSEP% is calculated according to eq. (3)
Model 4 with 10 hidden nodes gives lower RSEP% (0.824%) than that for model 14 with 14 hidden nodes (0.946%). For deciding on the best model is normally performed by using cross-validation, based on determination of minimum prediction error,36 thus the choice for model 4 with 10 hidden nodes as optimal one is confirmed.
Table S4 in the supplementary material shows the results for randomization test that was performed to investigate the probability of chance correlation for model 4 with 10 hidden nodes in the network. The proposed models were also checked for reliability by permutation testing (Y-scrambling). The results of this procedure (Table S4 in the supplementary material) show that all models have low correlation coefficients and PRESS values. This indicates that model 4, obtained using 10 hidden nodes, has low susceptibility towards chance correlation and gives direct evidence that the proposed models are well founded. Figure 4 demonstrates regression between observed and predicted log BCF as well as their residuals for this model.
PLS analysis with cross validation was carried out for advance investigation of the linear relationships of the obtained regression models. Model validation was achieved through leave-one-out cross-validation (LOO CV) and external validation (for a test set), and the predictive ability was statistically evaluated through the root mean square errors of calibration and validation. The calibration and prediction qualities were quantified with the correlation coefficients for the training and test sets as well as the R2CV (leave one out cross-validation on training set), select the LV when the R2CV has a high number, or determine it by computing the prediction error sum of squares (PRESS) for cross-validated models which is a standard index to measure the accuracy of a modeling method based on the cross-validation technique.
The cross-validation method employed was to eliminate only one sample at a time and then PLS calibrate the remaining standard descriptor. By using this calibration, the log BCF of the sample left out was predicted. This process was repeated until each standard had been left out once. Figure (3b) shows the associated PRESS of the training and test sets for each model. Table 2 shows that the minimum prediction error (0.163%) occurs for model 9. The cross validation coefficient of determination for this model is high (0.821).
This model has the lowest PRESS values for the training and test sets at the same time. While other models have higher R2CV values than this model, they also have higher prediction errors. Accordingly, model 9 was the best model according to PLS analysis. This model has a regression coefficient of 0.921 and 0.912 for the training and tests sets, respectively.
(Table S5) in the supplementary material shows regression and cross validation parameters for randomization test that is performed to investigate the probability of chance correlation for models 3–15 using PLS analysis. This table shows that the proposed optimal PLS model (model 9) is superior to that obtained by chance. Figure 5 shows regression between observed and predicted log BCF as well as their residuals for training and test sets of model 9 using PLS analysis.
This model contains the following nine descriptors: V1MD,deg, nHAcc, MATS2m, GATS2e, 2Xv, ST, JhetZ, Jhetv, MR (see appendix) which are represented by 5 LV’s.
The following conditions proposed by Golbraikh and Tropsha37 were applied to conclude that the QSAR model has acceptable prediction power if:
where R20 and R′20 are the coefficients of determination characterizing linear regression with Y-intercept set at zero, the first associated with observed vs. predicted values, the second related to predicted vs. observed values; k and k′ are the slopes of the regression lines forced through zero, relating observed vs. predicted and predicted vs. observed values. Alternatively, the parameter R2m, where R2m = R2* (1−(R2−R20))1/2, can be used.38 This parameter penalizes a model for large differences between observed and predicted values, was also calculated. R2m should be larger than 0.5 for a good external prediction, which is the case for model 4 from the ANN analysis (R2m = 0.754) and model 9 from the PLS analysis (R2m = 0.756). If a model shows good statistical performance for all these criteria, on both the training and the test sets, its reliability and robustness are high.
Comparing the linear (PLS) and nonlinear (PC-ANN) models shows that nonlinear relations improved the models over linear ones. Table 2 shows that compared with PLS and ANN results, MLR underestimate the regression coefficient values for small number of variables. Both ANN and MLR results show that Model 4 is the optimal model to predict the BCF values for the set of compounds in this study. However, PLS analysis improves the statistics compared with ANN and MLR. It gives models with higher correlation coefficients and R2CV values in addition to lower prediction errors. PLS analysis suggests that the optimal model is model 9. Taking into account the complexity of the neural network based models; PLS based models are better to describe the QSPR of the BCF for the data set in this investigation. The descriptors used in these models depend on the volume, connectivity, molar refractivity, surface tension and the presence of atoms accepting H-bonds.
In summary, this study utilizes theoretical molecular descriptors to estimate BCF directly from the structure of the chemical. The most important molecular descriptors to model the BCF are topological (mainly 2d autocorrelation descriptors), geometrical (mainly encoding molecular size) and generally related to chemicals’ dimensions, polarizability, surface tension, molar refreactivity and number of acceptor atoms for H-bonds.
Table 3 summarizes the results obtained in the present study as well as other QSPR studies performed on the BCF. Zhao et al14 performed a QSPR study on a large set of 473 compounds that was created with ISIS BASE 2.5 SP2. Wang et al39 obtained a QSAR model by adopting topological properties and flexibility of chemicals to predict the BCF. Gramatica and Papa16,17 have performed QSAR study on the same set of non-ionic organic compounds using GA and MLR analysis without verifying the data. Fatemi et al18 applied ANN using descriptors selected with GA but did not use PCA in the stepwise pre-selection of variables. Both Gramatica and Papa16,17 as well as Fatemi et al18 split the data randomly, while in this contribution the data is split homogenously using the space of PCs. This implies that the training, validation and test sets include the molecules in different zones of the data distribution shown in Figure 2. Chen et al19 developed a QSAR model for fish BCFs of 8 groups of compounds employing PLS regression, based on LSER theory and theoretical molecular structural descriptors. Their model showed that the molecular size plays a critical role in affecting the bioconcentration of organic pollutants in fish which agrees with the results found in this study where theoretical descriptors were used to develop the model and assist the mechanism interpretation. Nevertheless, the model found in this study has performed external validation in addition to the usage of a validation set of 46 compounds to detect overfitting as implemented in the early stopping method while no such procedure was used in the other studies mentioned above.
Wang et al obtained coefficients of determination of 0.80 and 0.79 for calibration and cross validation, respectively, while in this study we obtained higher values (0.85 and 0.82). Generally, the calibration and cross validation coefficient of determinations (R2 and R2 CV) obtained in this study (ANN model: 0.84 and 0.84/PLS model: 0.85 and 0.82) are higher than those obtained in the other studies shown in Table 3. An exception is for the R2 and R2 CV values obtained by Fatemi et al (0.88 and 0.89–0.92) and Chen et al (0.87 and 0.86). However, in this study, we used a larger number of compounds which implies that the results obtained in this study are more general than those obtained by Fatemi et al and Chen et al.
A quantitative–structural property relationship analysis has been conducted on BCF (log BCF) for 227 different non-ionic organic compounds by using PLS and the principal component–artificial neural networks modeling methods, with application of eigenvalue ranking factor selection procedure. The PLS gives improved regression models with better prediction ability compared with PC-ANN. Taking into account the complexity of the neural network based models; PLS based models are quite good to describe the QSPR of the BCF for the data set in this investigation. A 0.921 correlation coefficient was obtained using PLS with 5 LV’s. The optimal model contains the descriptors V1M D,deg, nHAcc, MATS2m, GATS2e, 2Xv, ST, JhetZ, Jhetv and MR.
Omar Deeb acknowledges Prof. Dr. Jingwen Chen, Key Laboratory of Industrial Ecology and Environmental Engineering (Ministry of Education), School of Environmental Science and Technology, Dalian University of Technology (China) for his valuable suggestions to improve the manuscript. He also acknowledges Dr. Kunal Roy, Drug Theoretics and Cheminformatics Lab, Department of Pharmaceutical Technology, Jadavpur University, Kolkata (India) for his suggestions concerning validation of the proposed models.
This manuscript has been read and approved by all authors. This paper is unique and is not under consideration by any other publication and has not been published elsewhere. The authors and peer reviewers of this paper report no conflicts of interest. The authors confirm that they have permission to reproduce any copyrighted material.