Home | About | Journals | Submit | Contact Us | Français |

**|**Environ Health Insights**|**v.4; 2010**|**PMC2918358

Formats

Article sections

Authors

Related links

Environ Health Insights. 2010; 4: 33–47.

Published online 2010 July 6.

PMCID: PMC2918358

Correspondence author email: moc.oohay@li0002beed

Copyright © 2010 the author(s), publisher and licensee Libertas Academica Ltd.

This is an open access article. Unrestricted non-commercial use is permitted provided the original work is properly cited.

The terms bioaccumulation and bioconcentration refer to the uptake and build-up of chemicals that can occur in living organisms. Experimental measurement of bioconcentration is time-consuming and expensive, and is not feasible for a large number of chemicals of potential regulatory concern. A highly effective tool depending on a quantitative structure-property relationship (QSPR) can be utilized to describe the tendency of chemical concentration organisms represented by, the important ecotoxicological parameter, the logarithm of Bio Concentration Factor (log BCF) with molecular descriptors for a large set of non-ionic organic compounds. QSPR models were developed using multiple linear regression, partial least squares and neural networks analyses. Linear and non-linear QSPR models to predict log BCF of the compounds developed for the relevant descriptors. The results obtained offer good regression models having good prediction ability. The descriptors used in these models depend on the volume, connectivity, molar refractivity, surface tension and the presence of atoms accepting H-bonds.

Bioaccumulation is the process where the chemical concentration in an aquatic organism achieves a level exceeding that in the water, as a consequence of chemical uptake through all means of chemical exposure (e.g. dietary absorption, transport across the respiratory surface, dermal absorption, and inhalation). Bioconcentration is defined as the absorption of a chemical concentrations in an aquatic organism’s tissues that is greater than that in the water as a result of exposure of the organism to a chemical concentration in the water via non-dietary routes. The extent to which a contaminant will concentrate in an organism is articulated as BCF which is the ratio of the chemical concentration in the organism (C_{B}) and the water (C_{w})^{1}:

$$\text{BCF}\hspace{0.17em}=\hspace{0.17em}{\text{C}}_{\text{B}}/{\text{C}}_{\text{W}}$$

(1)

Within a species, BCFs vary for different chemical compounds. The BCF of a chemical is the result of four processes: Absorption, Distribution, Metabolism and Excretion (ADME).^{2} Bioconcentration is usually derived under laboratory conditions where the chemical is absorbed from the water via the respiratory surface and/or the skin only. Generally, the experimental measurement of bioconcentration is time-consuming and expensive, and is not feasible for a large number of chemicals of potential regulatory concern. For this reason attention is turning to estimation of BCF values by QSPRs. BCF values can be estimated from the octanol/water partition coefficient (K_{ow}) using QSPR models. In addition, K_{ow} values, either experimentally determined or estimated, can be used directly to assess the potential for bioaccumulation.

QSPRs are usually observed between log BCF and log K_{ow}. The majority of these are linear regression models^{3}^{–}^{8} which give satisfactory results only for chemicals with log K_{ow} < 6, while for highly hydrophobic chemicals (log K_{ow} > 6), non-linear,^{9}^{,}^{10} bilinear^{11}^{,}^{12} or polynomial^{13} relationships have been proposed and applied for more satisfactory BCF prediction/estimation. Zhao et al^{14} performed a QSPR study on a large set of 473 compounds that was created with ISIS BASE 2.5 SP2. Lu^{15} proposed models for the prediction of BCF using connectivity indices and comprising experimental BCF data in fish (at steady-state) for 239 non-ionic organic compounds. This approach was applied by Gramatica and Papa^{16}^{,}^{17} with the same data set previously modeled by Lu^{15} using a large starting set of theoretical molecular descriptors and by applying a genetic algorithm (GA) approach as the variable subset selection method to obtain multilinear regression models with few variables correlated with BCF. Fatemi et al^{18} reported a QSAR study on a set of 53 compounds using geometric, electronic and topological descriptors by artificial neural networks (ANN) to predict the value of log BCF for some compounds. Recently, a QSAR model for fish BCFs of 8 groups of compounds was developed employing partial least squares (PLS) regression, based on linear solvation energy relationship (LSER) theory and theoretical molecular structural descriptors.^{19}^{–}^{21}

Khadikar et al^{22} reported QSAR study on the estimation of bioconcentraction factor of polyhalogenated biphenyls using a distanced-based index called Padmakar-Ivan index, abbreviated as PI. Incited by these results, we have used the molecular descriptors used by Lu et al^{15} Gramatica and Papa^{16}^{,}^{17} and Khadikar et al^{22} in addition to large set of other topological indices to model the BCF. It is worth to mention that, to our knowledge and to date, no attempt has been made to investigate neural network modeling using principal components for prediction of BCFs of the set of nonionic organic compounds.

Before discussing our results it is necessary to mention that both Lu et al^{15} and Gramatica and Papa in^{16}^{,}^{17} used the same data set of 238 compounds, while Khadikar et al^{22} used only 16 polyhalogenated biphenyls from the original set of 238 compounds reported by Lu et al.^{15} Furthermore, Gramatica and Papa^{16}^{,}^{17} did not verify the data and only deleted acrolein from the original data set reported by Lu et al.^{15} In this contribution, we observed that 9 compounds (compounds: **210–215, 218, 220,** and **234**) out of the larger set of 238 compounds need to be removed in order to obtain excellent models. The main reason for removing these 9 compounds is that the molecular descriptors used by us failed to establish structure-property relationship.

In view of the above, we used principal components— artificial neural networks (PC-ANN) and PLS methods on the data set of 229 non-ionic compounds modeled by Lu et al.^{15} The proposed models were checked for their internal and external predictive power using cross validation parameters.

Geometry Optimization was performed by HyperChem (Version 7.0 Hypercube, Inc) at the AM1 level. Descriptors were calculated using HyperChem and Dragon software.^{23} SPSS Software (version 13.0, SPSS, Inc.) was used for the simple multiple linear regression (MLR) analysis. PCA, ANN and PLS regression were performed in the MATLAB (Version 7.0.1 (R14), Mathworks, Inc.) environment.

Compounds name and their BCFs are included in Table 1. Chemical structure of these compounds was obtained from HyperChem software and optimized on AM1 semi-empirical level. An AM1 optimization was chosen since it was developed and parameterized for common organic structures. The Optimization was preceded by the Polak-Rebiere algorithm to reach 0.01 root mean square gradient. In this study, 24 molecular descriptors including topological, 2D-autocorrelation, GETAWAY, properties and functional groups descriptors were calculated using HyperChem and Dragon software, see Abbreviations.

MLR analysis using the method of maximum −R^{2} with stepwise selection and elimination of variables^{24} was employed to model the logarithm of the BCF (log BCF) relationships with different set of descriptors to select initial input models for the artificial neural networks algorithm (ANN).

Principal component analysis (PCA) and more specifically factor analysis (FA) groups together variables that are collinear to form a composite indicator capable of capturing as much of common information of those indicators as possible. Each factor reveals the set of variables having the highest association with it. The idea under this approach is to account for the highest possible variation in the indicators set using the smallest possible number of factors. Therefore, the index no longer depends upon the dimensionality of the dataset but it is rather based on the “statistical” dimensions of the data. Application of PCA on a descriptor data matrix results in a loading matrix containing factors or principal components, which are orthogonal and therefore do not correlate with each other. We used these factors as the inputs of ANN instead of the original descriptors.

In contrast to MLR, the artificial neural networks (ANN) are capable of recognizing highly nonlinear relationships. The flexibility of ANN enables it to discover more complex relationships in experimental data, when it is compared with the traditional statistical models. The PC-ANN was proposed by Gemperline^{25} to improve training speed and decrease the overall calibration error.

In this method, as a preliminary treatment, the input data (i.e. molecular descriptors) was normalized so as to have zero mean and unity variance, and then were subjected to principal component analysis (PCA) before being introduced into the neural network. The most significant principal components (PCs), which explain most of the variance in the original data (>95%), were selected, ranked according to decreasing Eigen-value and then used as ANN input.

It should be noted for each model obtained with MLR separate PC-ANN models were developed so that the input’s descriptors were the subsets selected by the stepwise MLR methods. In the case of each MLR model, a feed-forward neural network with back-propagation of error algorithm was constructed to model the activity structure relationships between the extracted PCs of the descriptors in one hand and the logarithm of BCF data of the non-ionic organic compounds in the other hand. More details about the model development in PC-ANN and the network architecture are explained.^{26}^{–}^{29} Over-fitting problem or poor generalization capability happens when a neural network over-learns during the training period. A too well-trained model may not perform well on unseen data set due to its lack of generalization capability. An approach to overcome this problem is the early stopping method in which the training process is concluded as soon as the overtraining signal appears. This approach requires the data set to be divided into three subsets: training, test and validation sets. The training and the validation sets are the norm in all model training processes. The test set is used to test the trend of the prediction accuracy of the model trained at some point of the training process. At later training stages, the validation error increases. This is the point when the model should cease to be trained to overcome the over-fitting problem. To achieve this purpose, the extracted PCs for each MLR model were classified into training set (60%), validation set (20%) and external test set (20%). Then, the training and validation sets were used to optimize the network performance. The regression between the network output and the activity was calculated for the three sets individually. The training function “trainscg” in MATLAB was used to train the network. To find models with lower errors, the ANN algorithm was run many times, each time run with different geometry and/or initial weights. The different ANN runs were carried out in a systematic way so that the obtained models have low training and testing rootmean-square errors (i.e. low fitness).

PLS is a method for building regression models on the latent variable (LV) decomposition relating two blocks, matrices **X** and **Y**, which contain the independent and dependent variables, respectively. These matrices can be simultaneously decomposed into a sum of LV’s. In this procedure, it is necessary to find the best number of LV’s, which is normally performed by using cross-validation, based on determination of minimum prediction error. Leave-one-out cross validation was carried out using the NIPALS algorithm. Applications of PLS have been discussed by several workers.^{30}^{,}^{31} For model validation, the dataset is required to be divided into training set for building the QSAR model and external test set for investigating its predictive ability. The most important indicators of the QSAR model superiority are the statistical parameters of the external test set. In a similar manner to our previous work,^{32} the data was divided into 80% training set and 20% test set. To have comparable data with that used in the ANN analysis, the outliers and test set compounds are kept the same as in the PC-ANN analysis.

Table 2 records the regression models suggested from MLR analyses and their correlation coefficients (R). The number of descriptors in these models is varied between 3 and 15. The highest correlation coefficient obtained is 0.923 for a regression model with 15 descriptors (model **15**) whereas the R^{2}_{CV} (Q^{2}) values shown in Figure 1 suggest that model **4** can be the optimal.

Correlation coefficient for MLR, PLS and ANN models **3–15** and cross validation parameters obtained from PLS and ANN analysis.

The regression equation for models **4** is:

$$\begin{array}{ll}\text{log}\hspace{0.17em}\text{BCF}\hspace{0.17em}=\hfill & -\hspace{0.17em}18.551\hspace{0.17em}(\pm 1.508)\hspace{0.17em}+\hspace{0.17em}1.681\hspace{0.17em}(\pm 0.075){\hspace{0.17em}}^{\text{V}}{1}_{\text{D},\text{deg}}^{\text{M}}\hfill \\ \hfill & -\hspace{0.17em}0.535\hspace{0.17em}(\pm 0.030)\hspace{0.17em}\text{nHAcc}\hspace{0.17em}+\hspace{0.17em}16.806\hfill \\ \hfill & \begin{array}{l}(\pm 1.561)\hspace{0.17em}\text{MATS2m}\\ -\hspace{0.17em}0.417\hspace{0.17em}(\pm 0.084)\hspace{0.17em}\text{GATS2e}\end{array}\hfill \end{array}$$

(2)

Eq. (2) shows that log BCF can be modeled by 2D autocorrelation descriptors (MATS2 m and GATS2e), the information indices descriptors (^{V}1^{M}_{D,deg}) and the functional group descriptor (nHAccc). 2D autocorrekation descriptors are molecular descriptors calculated from molecular graph by summing the products of atom weights of the terminal atoms of all the paths of the considered path length (the lag). 2D autocorrelations by Moran (MATS) and Geary (GATS) algorithms are calculated from lag 1 to lag 8 for 4 different weighting schemes. The information indices descriptors are molecular descriptors calculated as information content of molecules, based on the calculation of equivalence classes from the molecular graph. Eq. (2) shows that the most significant descriptor is the Moran autocorrelation of a topological structure −lag2/weighted by atomic mass (MATS2 m) which is among the 2D autocorrelation descriptors. Furthermore, eq. (2) shows that log BCF increases with increasing MATS2M and ^{V}1^{M} _{D,deg} and decreases with increasing Geary autocorrelation −lag2/weighted by atomic Sanderson electronegativities (GATS2e) as well as the number of acceptor atoms for H-bonds (nHAcc) values.

Nevertheless, the number of descriptors is small according to the rule of the thumb suggested by Tute.^{33} Partial least squares (PLS) and artificial neural networks algorithm (ANN) were used for further investigation of the linear and nonlinear relationships in the obtained regression models.

The inputs of the ANN were the subset of the descriptors used in different MLR models (Table 2 and S1 in the supplementary material). The correlation data matrix for these descriptors is represented in supplementary material (Table S2). As it is observed, some descriptors represent high degree of collinearity. Collinear descriptors add redundancy to the input data matrix and therefore the performances of the models obtained by using these descriptors will be degraded. Therefore, we used principal components as the inputs of ANN instead of the original descriptors.

Firstly, PCA was used to classify the molecules into training, validation and prediction sets. Performing PCA overall the data set of 229 compounds and 24 descriptors and plotting the first and second principal, shows that compounds **179** and **181** are outliers, see Figure 2. In other words, molecules **179** and **181** behave differently from other molecules with respect to both molecular structure (descriptors) and the logarithm of BCF (log BCF). Therefore, these molecules were not used in the future analysis. According to the pattern of the distribution of the data in factor spaces (Fig. 2) the training, validation and prediction molecules were selected homogenously, so that molecules in different zones of Figure 2 included to all three subsets. After removing the outliers and subjecting the data for the remaining 227 compounds to the preliminary treatment mentioned above, the classified data was used as an input for the ANN.

Correlation of 1st principal component with 2nd principal component for the factor spaces of the descriptors and their BCF.

In this study, a three–layered feed–forward ANN model with back–propagation learning algorithm^{34} was employed. At the first, the nonlinear relationship between the subset of descriptors selected by stepwise selection-based MLR (Table 2) and log BCF was preceded by PC-ANN models with similar structure. The number of hidden layer’s nodes was set 7 for all models and the number of nodes in the input layer was the number of PCs extracted for each subset of descriptors. The results of PC-ANN modeling for MLR models number **3**–**15** are given in Table 2. This table shows that models **4** and **14** have almost the highest correlation coefficient for the external test set (0.877 and 0.848, respectively) which indicates a high predictive power. The training set correlation coefficients for models **4** and **14** are 0.915 and 0.903, respectively. The R^{2}_{CV} values for model **4** is 0.837 and the correlation coefficient of prediction is 0.877, which means that the four PCs selected by eigenvalue ranking procedure can explain at least 83.7% and 76.5% variance in log BCF for the calibration and prediction, respectively. Model **14** has a lower R^{2}_{CV} values (0.814) and correlation coefficient of prediction (R^{p} = 0.848) than model **4** which models 4 PCs while model **14** models 5 PCs. This suggests that the variables in model **4** are not so strictly correlated between themselves (4 variables represents by 4 PCs) and the network probably would not lose its performance due to collinearity while the variables in model **14** (14 variables) are correlated, they were represented by only 5 PCs.

Figure (3a) shows plots of predictive residual sum of squares (PRESS) against regression models number **3** –**15** for training and test sets obtained from ANN analysis. As it is seen, the values of PRESS for ANN models **4** and **14** are the minima for training and test sets at the same time which make each of them a good candidate for performing feature analysis on. Hence, the number of hidden nodes for models **4** and **14** was optimized and compared.

In order to optimize the performance of the ANN models **4** and **14**, we trained the ANN using different number of hidden nodes starting from 1 hidden node to 20 hidden nodes. Figure (3c) shows plots of PRESS against number of hidden nodes for training and test sets for ANN model **4**. Although the minima on PRESS curves for this model occurs when using 19 and 20 hidden nodes, such large number of hidden nodes can lead to overfitted models.^{35} Therefore, we will consider the models with the lower number of hidden nodes (10 hidden nodes, in this case). The PRESS values for the external test set are mainly lower than that for the training set for all models.

The results for the ANN optimization for model **4** are shown in Table S3. Using 19 or 20 hidden nodes gives comparable network performance to that obtained when using 10 hidden nodes. Although the former models have better statistical parameters than those obtained when using 10 hidden nodes, the model obtained using10 hidden nodes was chosen as the optimal one to avoid the risk of overfitting that may be associated with the use of large number of hidden nodes. Using 10 hidden nodes, we obtained a high correlation coefficient for the training set (0.918) and for the prediction set (0.882). This model has a high R^{2} value for the cross–validation (0.841) and low prediction error (RSE^{P}% = 0.824%).

Figure (3d) shows plots of PRESS against number of hidden nodes for training and test sets for ANN model **14**. The results for the ANN optimization for model **14** are shown in Table S3. This table shows that the minimum on the PRESS curves occurs when using 16–20 hidden nodes. Following the same argument used for model **4**, we consider the model with the next lower PRESS value which is in this case 14. For this model, we obtained a relatively high correlation coefficient for both the training (0.917) and the prediction sets (0.845). This model has a high R^{2} value for the cross–validation (0.840) and an RSE^{P}% value of (0.946%). From a statistical point of view, models **4** and **14 (**obtained using 10 and 14 hidden nodes, respectively) are similar. Since large numbers of hidden nodes often draws the attention to overfitting risk,^{35} model **4** obtained using 10 hidden nodes is preferred over model **14** that is obrained using 14 hidden nodes. The relative standard error of prediction (RSE^{P}%) is an important parameter for the evaluation of the predictive ability of a multivariate calibration model. RSE^{P}% is calculated according to eq. (3)

$${\mathit{\text{RSE}}}^{P}\%=100\times \sqrt{\frac{\sum {(\text{log}\hspace{0.17em}{\mathit{\text{BCF}}}_{\mathit{\text{predicted}}}-\text{log}\hspace{0.17em}{\mathit{\text{BCF}}}_{\mathit{\text{observed}}})}^{2}}{\sum {(\text{log}\hspace{0.17em}{\mathit{\text{BCF}}}_{\mathit{\text{observed}}})}^{2}}}$$

(3)

Model **4** with 10 hidden nodes gives lower RSE^{P}% (0.824%) than that for model **14** with 14 hidden nodes (0.946%). For deciding on the best model is normally performed by using cross-validation, based on determination of minimum prediction error,^{36} thus the choice for model **4** with 10 hidden nodes as optimal one is confirmed.

Table S4 in the supplementary material shows the results for randomization test that was performed to investigate the probability of chance correlation for model **4** with 10 hidden nodes in the network. The proposed models were also checked for reliability by permutation testing (Y-scrambling). The results of this procedure (Table S4 in the supplementary material) show that all models have low correlation coefficients and PRESS values. This indicates that model **4**, obtained using 10 hidden nodes, has low susceptibility towards chance correlation and gives direct evidence that the proposed models are well founded. Figure 4 demonstrates regression between observed and predicted log BCF as well as their residuals for this model.

PLS analysis with cross validation was carried out for advance investigation of the linear relationships of the obtained regression models. Model validation was achieved through leave-one-out cross-validation (LOO CV) and external validation (for a test set), and the predictive ability was statistically evaluated through the root mean square errors of calibration and validation. The calibration and prediction qualities were quantified with the correlation coefficients for the training and test sets as well as the R^{2}_{CV} (leave one out cross-validation on training set), select the LV when the R^{2}_{CV} has a high number, or determine it by computing the prediction error sum of squares (PRESS) for cross-validated models which is a standard index to measure the accuracy of a modeling method based on the cross-validation technique.

The cross-validation method employed was to eliminate only one sample at a time and then PLS calibrate the remaining standard descriptor. By using this calibration, the log BCF of the sample left out was predicted. This process was repeated until each standard had been left out once. Figure (3b) shows the associated PRESS of the training and test sets for each model. Table 2 shows that the minimum prediction error (0.163%) occurs for model **9**. The cross validation coefficient of determination for this model is high (0.821).

This model has the lowest PRESS values for the training and test sets at the same time. While other models have higher R^{2}_{CV} values than this model, they also have higher prediction errors. Accordingly, model **9** was the best model according to PLS analysis. This model has a regression coefficient of 0.921 and 0.912 for the training and tests sets, respectively.

(Table S5) in the supplementary material shows regression and cross validation parameters for randomization test that is performed to investigate the probability of chance correlation for models **3–15** using PLS analysis. This table shows that the proposed optimal PLS model (model **9**) is superior to that obtained by chance. Figure 5 shows regression between observed and predicted log BCF as well as their residuals for training and test sets of model **9** using PLS analysis.

Correlation of the predicted log BCF against observed one as well as their residues for **A**) training set. **B**) external test set of model **9** obtained by PLS analysis.

This model contains the following nine descriptors: ^{V}1^{M}_{D,deg,} nHAcc, MATS2m, GATS2e, ^{2}X^{v}, ST, Jhet_{Z}, Jhet_{v}, MR (see appendix) which are represented by 5 LV’s.

The following conditions proposed by Golbraikh and Tropsha^{37} were applied to conclude that the QSAR model has acceptable prediction power if:

- Q
^{2}(R^{2}_{CV}) > 0.5 - R
^{2}> 0.6 - (R
^{2}– R^{2}_{0})/R^{2}< 0.1 and 0.85 < k < 1.15Or(R^{2}–R′^{2}_{0})/R^{2}< 0.1 and 0.85 < k′ < 1.15

where R^{2}_{0} and R′^{2}_{0} are the coefficients of determination characterizing linear regression with Y-intercept set at zero, the first associated with observed vs. predicted values, the second related to predicted vs. observed values; k and k′ are the slopes of the regression lines forced through zero, relating observed vs. predicted and predicted vs. observed values. Alternatively, the parameter R^{2}_{m}, where R^{2}_{m} = R^{2}* (1−(R^{2}−R^{2}_{0}))^{1/2}, can be used.^{38} This parameter penalizes a model for large differences between observed and predicted values, was also calculated. R^{2}_{m} should be larger than 0.5 for a good external prediction, which is the case for model **4** from the ANN analysis (R^{2}_{m} = 0.754) and model **9** from the PLS analysis (R^{2}_{m} = 0.756). If a model shows good statistical performance for all these criteria, on both the training and the test sets, its reliability and robustness are high.

Comparing the linear (PLS) and nonlinear (PC-ANN) models shows that nonlinear relations improved the models over linear ones. Table 2 shows that compared with PLS and ANN results, MLR underestimate the regression coefficient values for small number of variables. Both ANN and MLR results show that Model **4** is the optimal model to predict the BCF values for the set of compounds in this study. However, PLS analysis improves the statistics compared with ANN and MLR. It gives models with higher correlation coefficients and R^{2}_{CV} values in addition to lower prediction errors. PLS analysis suggests that the optimal model is model **9**. Taking into account the complexity of the neural network based models; PLS based models are better to describe the QSPR of the BCF for the data set in this investigation. The descriptors used in these models depend on the volume, connectivity, molar refractivity, surface tension and the presence of atoms accepting H-bonds.

In summary, this study utilizes theoretical molecular descriptors to estimate BCF directly from the structure of the chemical. The most important molecular descriptors to model the BCF are topological (mainly 2d autocorrelation descriptors), geometrical (mainly encoding molecular size) and generally related to chemicals’ dimensions, polarizability, surface tension, molar refreactivity and number of acceptor atoms for H-bonds.

Table 3 summarizes the results obtained in the present study as well as other QSPR studies performed on the BCF. Zhao et al^{14} performed a QSPR study on a large set of 473 compounds that was created with ISIS BASE 2.5 SP2. Wang et al^{39} obtained a QSAR model by adopting topological properties and flexibility of chemicals to predict the BCF. Gramatica and Papa^{16}^{,}^{17} have performed QSAR study on the same set of non-ionic organic compounds using GA and MLR analysis without verifying the data. Fatemi et al^{18} applied ANN using descriptors selected with GA but did not use PCA in the stepwise pre-selection of variables. Both Gramatica and Papa^{16}^{,}^{17} as well as Fatemi et al^{18} split the data randomly, while in this contribution the data is split homogenously using the space of PCs. This implies that the training, validation and test sets include the molecules in different zones of the data distribution shown in Figure 2. Chen et al^{19} developed a QSAR model for fish BCFs of 8 groups of compounds employing PLS regression, based on LSER theory and theoretical molecular structural descriptors. Their model showed that the molecular size plays a critical role in affecting the bioconcentration of organic pollutants in fish which agrees with the results found in this study where theoretical descriptors were used to develop the model and assist the mechanism interpretation. Nevertheless, the model found in this study has performed external validation in addition to the usage of a validation set of 46 compounds to detect overfitting as implemented in the early stopping method while no such procedure was used in the other studies mentioned above.

Wang et al obtained coefficients of determination of 0.80 and 0.79 for calibration and cross validation, respectively, while in this study we obtained higher values (0.85 and 0.82). Generally, the calibration and cross validation coefficient of determinations (R^{2} and R^{2} _{CV}) obtained in this study (ANN model: 0.84 and 0.84/PLS model: 0.85 and 0.82) are higher than those obtained in the other studies shown in Table 3. An exception is for the R^{2} and R^{2} _{CV} values obtained by Fatemi et al (0.88 and 0.89–0.92) and Chen et al (0.87 and 0.86). However, in this study, we used a larger number of compounds which implies that the results obtained in this study are more general than those obtained by Fatemi et al and Chen et al.

A quantitative–structural property relationship analysis has been conducted on BCF (log BCF) for 227 different non-ionic organic compounds by using PLS and the principal component–artificial neural networks modeling methods, with application of eigenvalue ranking factor selection procedure. The PLS gives improved regression models with better prediction ability compared with PC-ANN. Taking into account the complexity of the neural network based models; PLS based models are quite good to describe the QSPR of the BCF for the data set in this investigation. A 0.921 correlation coefficient was obtained using PLS with 5 LV’s. The optimal model contains the descriptors ^{V}1^{M} _{D,deg,} nHAcc, MATS2m, GATS2e, ^{2}X^{v}, ST, Jhet_{Z}, Jhet_{v} and MR.

Omar Deeb acknowledges Prof. Dr. Jingwen Chen, Key Laboratory of Industrial Ecology and Environmental Engineering (Ministry of Education), School of Environmental Science and Technology, Dalian University of Technology (China) for his valuable suggestions to improve the manuscript. He also acknowledges Dr. Kunal Roy, Drug Theoretics and Cheminformatics Lab, Department of Pharmaceutical Technology, Jadavpur University, Kolkata (India) for his suggestions concerning validation of the proposed models.

**Disclosure**

This manuscript has been read and approved by all authors. This paper is unique and is not under consideration by any other publication and has not been published elsewhere. The authors and peer reviewers of this paper report no conflicts of interest. The authors confirm that they have permission to reproduce any copyrighted material.

1. Mackay DA. Fraser, Bioaccumulation of persistent Organic Chemicals: mechanisms and Models. Environmental Pollution. 2000;110:375–91. [PubMed]

2. Hodgeson E, Levi PE. Introduction to Biochemical Toxicology. Appleton and Lange; Norwalk, CT, USA: 1994.

3. Neely BW, Branson DR, Blau GE. Partition Coefficient to Measure Bioconcentration Potential of Organic Chemicals in Fish. Environ Sci Technol. 1974;8:1113–5.

4. Kanazawa J. Measurement of the BCFs of Pesticides by Freshwater Fish and their Correlation with Physicochemical Properties or Acute Toxicities. Pestic Sci. 1981;12:417–24.

5. Mackay D. Correlation of BCFs. Environ Sci Technol. 1982;16:274–8. [PubMed]

6. Veith GD, Kosian P. Estimating Bioconcentration Potential from Octanol/Water Partition Coefficients. In: Mackay D, et al., editors. Physical Behaviour of PCBs in the Great Lakes. Ann Arbor Science Pub; USA: 1983. p. 269.

7. Isnard P, Lambert S. Estimating BCFs from Octanol-Water Partition Coefficient and Aqueous Solubility. Chemosphere. 1988;17:21–34.

8. Schuurmann G, Klein W. Advances in Bioconcentration Prediction. Chemosphere. 1988;17:1551–74.

9. Dimitrov SV, Mekenyan OG, Walker JD. Non-Linear Modeling of Bioconcentration Using Partition Coefficients for Narcotic Chemicals. SAR QSAR Environ Res. 2002;13:177–84. [PubMed]

10. Bintein S, Devillers J, Karchern W. Nonlinear Dependence of Fish Bioconcentration on n-Octanol/Water Partition Coefficient. SAR QSAR Environ Res. 1993;1:29–39. [PubMed]

11. Kubinyi H. Nonlinear Dependence of Biological activity on Hydrophobic Character. J Med Chem. 1977;20:625–9. [PubMed]

12. Nendza M. QSAR of Bioconcentration: Validity Assessment of log Pow/logBCF Correlations. In: Nagel R, Loskill R, editors. Bioaccumulation in Aquatic Systems. VCH; Weinheim: 1991. pp. 34–66.

13. Connell DW, Hawker DW. Use of Polynomial Expression to describe the Bioconcentration of Hydrophobic Chemicals by Fish. Environ Safety. 1988;16:242–57. [PubMed]

14. Zhao C, Boriani E, Chana A, Roncagliono A, Benfenati E. A new hybrid system of QSAR models for predicting BCFs (BCF) Chemosphere. 2008;73:1701–7. [PubMed]

15. Lu X, Tao S, Hu H, Dawson RW. Estimation of BCFs of nonionic organic compounds in fish by molecular connectivity indices and polarity correction factors. Chemosphere. 2000;41:1675–88. [PubMed]

16. Gramatica P, Papa E. QSAR modelling of BCF by the oretical molecular descriptors. QSAR Comb Sci. 2003;22:374–85.

17. Gramatica P, Papa E, Dearden JC. Linear QSAR regression models for the prediction of BCFs by physicochemical properties and structural theoretical molecular descriptors. Chemosphere. 2007;67:351–8. [PubMed]

18. Fatemi MH, Jalai-Heravi M, Konuze E. Prediction of BCF using genetic algorithm and artificial neural network. Anal Chim Acta. 2003;486:101–8.

19. Qin H, Chen JW, Wang Y, et al. Development and assessment of quantitative structure activity relationship models for BCFs of organic pollutants. Chin Sci Bull. 2009;54:628–34.

20. Chen JW, Li XH, Yu HY, et al. Progress and perspectives of quantitative structure-activity relationships used for ecological risk assessment of toxic organic compounds. Sci China Ser B-Chem. 2008;51:593–606.

21. Chen JW, Harner T, Ding GH, et al. Universal predictive models on octanol-air partition coefficients at different temperatures for persistent organic pollutants. Environ Toxico Chem. 2004;23:2309–17. [PubMed]

22. Khadikar PV, Singh S, Mandaloi D, Joshi S, Bajaj AV. QSAR study on BCF (BCF) of polyhaloginated biphenyls using PI index. Bioorg Med Chem. 2003;11:5045–50. [PubMed]

23. Todeschini R.Milano Chemometrics and QSAR Group, <http://www.disat.unimib.it/chm>.

24. Chaterjee S, Hadi AS, Price B. Regression Analysis by Examples. 3rd ed. Wiley; New York: 2000.

25. Gemperline PJ, Long JRV, Gregoriou G. Gregoriou, Nonlinear multivariate calibration using principal components regression and artificial neural networks. Anal Chem. 1991;63:2313–23.

26. Hemmateenejad B, Shamsipur M. Quantitative Structure—Electrochemistry Relationship Study of Some Organic Compounds Using PC-ANN and PCR. Internet Electron J Mol Des. 2004;4:316–34.

27. Hemmateenejad B, Safarpour M, Miri R, Nesari N. Toward an Optimal Procedure for PC-ANN Model Building: Prediction of the Carcinogenic Activity of a Large Set of Drugs. J Chem Inf Model. 2005;45:190–9. [PubMed]

28. Deeb O, Hemmateenejad B. ANN-QSAR Model of Drug-binding to Human Serum Albumin. Chem Biol Drug Des. 2007;70:19–29. [PubMed]

29. Deeb O, Goodarzi M. Predicting the solubility of pesticide compounds in water using QSPR methods. Mol Phys. 2010;108:181–92.

30. Deeb O, Hemmateenejad B, Jaber A, Garduno-Juarez R, Miri R. Effects of electronic and physicochemical parameters on the carcinogenic activity of some sulfa drugs using QSAR analysis based on genetic-MLR and genetic-PLS. Chemosphere. 2007;67L:2122–30. [PubMed]

31. Goodarzi M, Freitas MP. Predicting Boiling Points of Aliphatic Alcohols through Multivariate Image Analysis Applied to Quantitative Structure— Property Relationships. J Phys Chem A. 2008;112:11263–5. [PubMed]

32. Deeb O, Youssef K, Hemmateenejad B. QSAR of Novel Hydroxyphenylureas as Antioxidant Agents. QSAR Comb Sci. 2007;27:417–24.

33. Tute MS. History and Objectives of Quantitative Drug Design in Advances in Drug Research. In: Harter NJ, Simmord AB, editors. Vol. 6. Academic Press; London: 1971. p. 1.

34. Rumelhart DE, Hinton GE, Williams RJ. Learning representations by backpropagating errors. Nature. 1986;323:533–6.

35. Derks EPPA, Buydens LMC. Aspects of network training and validation on noisy data Part 1. Training aspects. Chemom Intell Lab Syst. 1998;41:171–84.

36. Martens H, Naes T. Multivariate Calibration. John Wiley; Chichester: 1989.

37. Golbraikh A, Tropsha A. Beware of q2! J Mol Graph Model. 2002;20:269–76. [PubMed]

38. Roy K, Roy P. Comparative chemometric modeling of cytochrome 3 A4 inhibitory activity of structurally diverse compounds using stepwise MLR, FA-MLR, PLS, GFA, G/PLS and ANN techniques. Eur J Med Chem. 2009;44:2913–22. [PubMed]

39. Wang Y, Lib Y, Dinga J, Jianga Z, Changa Y. Estimation of BCFs using molecular electro-topological state and flexibility. SAR QSAR Environ Res. 2008;19:375–95. [PubMed]

40. Schüürmann G, Ebert R, Chen JW, Wang B, Kühne R. External Validation and Prediction Employing the Predictive Squared Correlation Coefficient Test Set Activity Mean vs Training Set Activity Mean. J Chem Inf Model. 2008;48:2140–5. [PubMed]

Articles from Environmental Health Insights are provided here courtesy of **Libertas Academica**

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |