|Home | About | Journals | Submit | Contact Us | Français|
Few Quantitative Structure-Activity Relationship (QSAR) studies have successfully modeled large, diverse rodent toxicity endpoints. In this study, a comprehensive dataset of 7,385 compounds with their most conservative lethal dose (LD50) values has been compiled. A combinatorial QSAR approach has been employed to develop robust and predictive models of acute toxicity in rats caused by oral exposure to chemicals. To enable fair comparison between the predictive power of models generated in this study versus a commercial toxicity predictor, TOPKAT (Toxicity Prediction by Komputer Assisted Technology), a modeling subset of the entire dataset was selected that included all 3,472 compounds used in the TOPKAT’s training set. The remaining 3,913 compounds, which were not present in the TOPKAT training set, were used as the external validation set. QSAR models of five different types were developed for the modeling set. The prediction accuracy for the external validation set was estimated by determination coefficient R2 of linear regression between actual and predicted LD50 values. The use of the applicability domain threshold implemented in most models generally improved the external prediction accuracy but expectedly led to the decrease in chemical space coverage; depending on the applicability domain threshold, R2 ranged from 0.24 to 0.70. Ultimately, several consensus models were developed by averaging the predicted LD50 for every compound using all 5 models. The consensus models afforded higher prediction accuracy for the external validation dataset with the higher coverage as compared to individual constituent models. The validated consensus LD50 models developed in this study can be used as reliable computational predictors of in vivo acute toxicity.
Chemical toxicity can be associated with many hazardous biological effects such as gene damage, carcinogenicity, or induction of lethal rodent or human diseases. It is important to evaluate the toxicity of all commercial chemicals, especially the High Production Volume (HPV)1 compounds as well as drugs or drug candidates, since these compounds could directly affect human health. To address this need, standard experimental protocols have been established by chemical industry, pharmaceutical companies, and government agencies to test chemicals for their toxic potential. For example, a so called “Standard Battery for Genotoxicity Test” was established by the International Conference on Harmonization, U. S. Environmental Protection Administration (EPA), U. S. Food and Drug Administration (FDA) and other regulatory agencies. This test includes one bacterial reverse mutation assay (e.g. Salmonella typhimurium mutation test), one mammalian cell gene mutation assay (e.g., mouse lymphoma cell mutation test) and one in vivo micronucleus test. The test battery varies slightly for pharmaceutical compounds, industrial compounds, and pesticides. The current strategies and guidelines for toxicity testing were described in a recent review (1).
Although the experimental protocols for toxicity testing have been developed for many years and the cost of compound testing has been reduced significantly, computational chemical toxicology continues to be a viable approach to reduce both the amount of effort and the cost of experimental toxicity assessment (2). Significant savings could be achieved if accurate predictions of potential toxicity could be used to prioritize compound selection for experimental testing, especially for testing in vivo.
Many Quantitative Structure Activity Relationship (QSAR) models have been developed for different toxicity endpoints to address this challenge (3–6). The summary of several models reported in earlier publications on acute rodent toxicity are given in Table 1. There are several shortcomings of earlier toxicity QSAR models that should be pointed out. Most of these studies included a relatively small number of congeneric compounds and as a result, they had limited applicability for compounds outside of the modeling set. Very few successful QSAR models have been reported for predicting in vivo toxicity endpoints that are applicable to the diverse compounds of environmental interest (5,7,8). For instance, Enslein and coworkers (9,10) developed multi-linear regression models using large, diverse training sets (425 and 1851 chemicals, respectively) but these models had relatively poor external prediction power, yielding an R2 value of 0.33 for the large test set.
Indeed, accurate prediction of toxicity for compounds that were not used for model development is a very challenging problem. QSAR models are generally more applicable for the analysis of small datasets of similar compounds with a simple mechanism of action (e.g., congeneric molecules binding to the same receptor or inhibiting the same enzyme) and less accurate for larger dataset of compounds with complex mechanisms of action. Toxicity prediction is a hard problem because there are multiple underlying mechanisms of action, and the datasets studied in the context of a general end point (e.g., rat LD50) are large and chemically diverse. Furthermore, QSAR models are developed by interpolating the training set data and therefore they inherently have limited applicability outside of the training set. At the same time, any external prediction implies inherent, and frequently, excessive extrapolation of the training set models. Poor external predictive power of QSAR models could be due to the lack of or incorrect use of external validation during the modeling process. Each statistical method used in QSAR studies has its particular advantages, weaknesses, and practical constraints so it is important to select the most suitable QSAR methodology for a specific toxicity endpoint. Thus, the toxicity prediction challenge should be addressed very carefully using rigorous modeling approaches and extensive model validation procedures.
Our recent studies of aquatic toxicity offered potential solutions to some of the above problems (11). A combinatorial QSAR approach was applied to study an aquatic toxicity dataset containing 983 diverse organic compounds tested against Tetrahymena pyriformis (11). To explain our choice of methodology and terminology, any QSAR modeling effort requires a set of chemical descriptors and a statistical optimization approach to develop the best correlation between values of descriptors and those of biological activity. For any dataset there are several sets of descriptors that could be calculated using different available software packages. Similarly, there are multiple statistical modeling approaches that could be employed with any of the descriptor sets. In the practice of QSAR modeling, there is no standard combination of the descriptor type and model optimization approach that works best for all datasets. In addition different QSAR methods usually use different definitions of applicability domain (or in most cases do not use the applicability domain at all). Combinatorial QSAR modeling implies that for a given experimental dataset we calculate several sets of descriptors and employ several statistical modeling approaches forming all-against-all pairwise combinations of descriptor sets and modeling techniques to develop multiple types of QSAR models. We require that each model must satisfy certain validation criteria. As we demonstrated in the earlier study (11), the consensus models had the highest external prediction power as compared to any individual model used in the consensus prediction. Since the individual models can have differently defined applicability domains, the consensus method can also afford greater chemical space coverage as well.
In this paper, a similar combinatorial QSAR workflow was employed to study a much larger and more chemically diverse dataset (arguably, the largest and most diverse in vivo toxicity dataset ever reported in the public domain) containing 7,385 unique organic compounds with experimentally determined oral rat acute toxicity. We have explored various QSAR approaches in terms of their ability to develop robust and externally predictive models. The consensus prediction integrating all validated individual models was found to be the most accurate (using an external prediction set) when compared both to each individual model used in the consensus approach and to a popular commercial software, TOPKAT. The consensus models developed in this study could be used as reliable predictors of rodent acute toxicity for chemical compounds. The models will be made available through the ChemBench web portal maintained in our laboratory (http://chembench.mml.unc.edu).
The rat LD50 data were collected from difference sources (12) to form a dataset including more than 8,000 compounds. The structures of those compounds were verified using the approach discussed by Young’s group (13). The quality of the data has been extensively reviewed over the past several years. After removing inorganic and organometallic compounds, salts, and compound mixtures, the final acute toxicity dataset included 7,385 unique organic compounds. The original values of LD50 for each compound were expressed as mol/kg; these were converted to log(1/(mol/kg)) values according to standard QSAR practices. Chemical structures of all compounds and their experimental LD50 values used in this study are available from the authors upon request.
This dataset was compared with the training set used to develop the rat acute toxicity predictor available from the commercial Toxicity Prediction by Komputer Assisted Technology (TOPKAT) software. It was found that 3,472 out of 7,385 compounds were included in the TOPKAT rat LD50 training database. To enable direct comparison of external predictive power for models generated in our studies vs. TOPKAT), these 3,472 compounds were used as the modeling set and the remaining 3,913 compounds as the external validation set.
Rat LD50 models for the 3,472 modeling set compounds were developed with various types of chemical descriptors, including those from the Dragon software v5.4 (14) and a set of descriptors developed previously by Martin and coworkers at the US EPA (15). The latter set consisted of more than 800 descriptors in the following classes: E-state values and E-state counts, constitutional descriptors, topological descriptors, walk and path counts, connectivity, information content, 2D autocorrelation, Burden eigenvalues, molecular properties (such as the octanol-water partition coefficient), Kappa, hydrogen bond acceptor/donor counts, molecular distance edge, and molecular fragment counts. There were overlaps between Dragon and EPA descriptors but both included unique types of descriptors as well. The Dragon descriptors were used for the kNN and random forest methods and the EPA descriptors were used for the hierarchical clustering, FDA MDL QSAR, and nearest neighbor QSAR methods.
Initial use of Dragon yielded more than a thousand of chemical descriptors for the training set, which were processed as follows. First, we removed all descriptors that had zero values or zero variance for all modeling set compounds. Furthermore, redundant descriptors were identified by analyzing correlation coefficients between all pairs of descriptors and if the correlation coefficient between two descriptor types for all modeling set compounds was higher than 0.95, one of them was removed. As a result, the total number of Dragon descriptors used for model building was reduced to 454. The number of EPA descriptors used for model building (for the hierarchical clustering and FDA MDL QSAR methods) varied depending on the size and composition of the training set molecules that were used for model building.
The kNN QSAR method (16) employs the kNN classification principle and a variable (i.e., descriptor) selection procedure. Briefly, a subset of nvar (number of selected descriptors) descriptors is selected randomly at the onset of the calculations. The nvar is set to different values and the training set models are developed with leave-one-out cross-validation, where each compound is eliminated from the training set and its LD50 value is predicted as the average activity of k most similar molecules, where the value of k is optimized as well (k = 1 to 5). The similarity is characterized by Euclidean distance between compounds in multidimensional descriptor space. A method of simulated annealing with the Metropolis-like acceptance criteria is used to optimize the selection of descriptors. The objective of this method is to optimize nvar and k values to obtain the best leave-one-out cross-validated q2abs, i.e., q2 with the intercept set to zero, possible by optimizing the nvar and k. The additional details of the method can be found elsewhere (16).
In developing kNN QSAR models we followed our general predictive QSAR modeling workflow methodology (17), that places special emphasis on model validation. Briefly, we start by dividing the original dataset randomly into a (bigger) modeling set and a (smaller) external validation set; the latter is not used for model development at all and the former is designated as a modeling set. The modeling set compounds are divided multiple times into training/test sets using the Sphere Exclusion approach (18) that ensures that both training and test sets are chemically diverse The models are developed using training set data and their performance is characterized with the standard leave-one-out cross-validated (LOO-CV) R2 (q2) for the training sets and the conventional coefficient of determination R2 for the test sets; this coefficient is determined for a regression that is forced through the origin of the experimental vs. calculated LD50 plot. The model acceptability threshold values of the LOO-CV accuracy of the training sets and the prediction accuracy for test sets were both set at no less than 0.5. Models that did not meet both training and test set cutoff criteria were discarded. Models that passed these threshold criteria were used to predict LD50 values of the external validation set to assure their external predictive power as discussed in the Results section. The detailed discussion of the workflow used to develop validated QSAR models can be found in a recent review (19).
In machine learning, a random forest is a predictor that consists of many decision trees and outputs the prediction that combines outputs from individual trees. The algorithm for inducing a random forest was developed by Breiman and Cutler (20). In this study, the implementation of the random forest algorithm available in R.2.7.1 (21) was used. In the random forest modeling procedure, n samples are randomly drawn from the original data. These samples were used to construct n training sets and to build n trees. For each node of the tree, m descriptors were randomly chosen from the total 454 Dragon descriptors. The best data split was calculated using these m descriptors for each training set. In this study, only the defined parameters (n = 500 and m = 13) were used for the model development.
The hierarchical clustering method utilizes a variation of the Ward’s Minimum Variance Clustering Method (22) to produce a series of clusters from the initial training set. For a training set of n chemicals, initially there will be n clusters. At each step in the clustering process, two clusters are combined so that the increase in variance over all the clusters in the system is minimized. The change in variance caused by combining clusters j and k is as follows:
where nj = number of chemicals in cluster j, Cj,i is the centroid (or average value) for descriptor i for cluster j, and d is the number of descriptors in the EPA pool of descriptors (~800) (15). The process of combining clusters while minimizing variance continues until all of the chemicals are lumped into a single cluster. After the clustering is complete, each cluster is analyzed to determine if an acceptable QSAR model can be developed. A genetic algorithm technique is used to select descriptors to build a multi-linear regression model for each cluster (15). Similar to the kNN approach, each model must achieve a LOO-CV accuracy of 0.5 to be used in making predictions. The predicted value for a given test chemical is calculated using the equally weighted average of the model predictions from the closest cluster from each step in the hierarchical clustering. This method was previously shown to yield the best results for another acute toxicity endpoint, IGC50 (50% inhibitory concentration of population growth) of Tetrahymena pyriformis (15).
A QSAR methodology (denoted here as the FDA MDL QSAR method) based on the studies of Contrera et al. (23) was developed earlier (15). For each test chemical, a cluster is constructed using the thirty most similar chemicals from the training set as defined by the cosine similarity coefficient, SCi,k, which is calculated as follows
where xij is the value of the jth normalized descriptor for chemical i (normalized with respect to all the chemicals in the original training set) and xkj is the value of the jth descriptor for chemical k. The entire pool of approximately 800 EPA descriptors is used to calculate the similarity coefficient in equation 2. A multiple linear regression model is then built for the new cluster using a genetic algorithm based method and the toxicity is predicted.
The nearest neighbor method is a simplification of the variable selection kNN approach described above. In the nearest neighbor method, the toxicity is simply predicted as the average of the toxicity of the three most similar chemicals from the training set. The similarity is defined in terms of the cosine similarity coefficient (Equation 2). In the nearest neighbor method, the entire available descriptor pool is used to characterize molecular similarity (as opposed to a subset of the descriptor pool as in the descriptor selection kNN method). In order to make a prediction, each of the neighbors in the training set must exceed a minimum cosine similarity coefficient of 0.5.
A common problem for most QSAR studies is the existence of compounds that are highly dissimilar to all other compounds in the dataset. These compounds are regarded as outliers in the descriptor space and are likely to present problems in establishing SAR trends, which is critical to QSAR modeling. In this study, we have identified and excluded the structural outliers from the modeling at the beginning of the modeling procedure.
For kNN and random forest modeling procedures, we have developed a method to detect outliers that are dissimilar to other compounds of the dataset in the descriptor space. This procedure included the following steps. 1) calculation of the distance or similarity matrix based on the Dragon descriptors of compounds in the descriptor space; 2) finding the nearest neighbors for all compounds in the dataset based on a predefined similarity threshold; 3) identifying those compounds that have no nearest neighbors as outliers.
In order to measure similarity, each compound i is represented by a point in the M-dimensional descriptor space (where M is the total number of descriptors) with the coordinates Xi1, Xi2, …, XiM, where Xis (s=1 ,…,M) are the values of individual descriptors. The molecular dissimilarity between any two molecules i and j is characterized by the Euclidean distance between their representative points. The Euclidean distance dij between points i and j in M-dimensional space can be calculated as follows (Eq. 3):
Compounds with the smallest distance between them are considered to have the highest similarity. The distances (dissimilarity) of compounds in our modeling set are compiled to produce a chemical similarity threshold DT, calculated as follows (Eq.4):
Here, is the average Euclidean distance between all compounds and their k nearest neighbors (k was set to 1 in this procedure) of each compound within the modeling set, σ is the standard deviation of these Euclidean distances, and Z is an arbitrary parameter to control the threshold level and was set to 0.5 in this study. The DT threshold is used to identify outliers as follows. If the distance of a compound to its nearest neighbor in the modeling set exceeds this threshold, this compound is considered an “outlier” and excluded from the modeling set. After excluding 997 structural outliers, the remaining 2,475 modeling set compounds were compiled as a new reduced modeling set to develop kNN and random forest toxicity models.
It is important to point out that the identification and exclusion of outliers is based only on consideration of chemical similarity but not activity. Thus, the removal of structural outliers could be regarded as a pre-treatment of the modeling set using objective chemometric approaches.
For the hierarchical and FDA MDL QSAR methods, a chemical is removed from a cluster if it is both an influential data point (determined by at least two statistical tests, e.g. DFFITS, leverage, Cook’s distance, and covariance ratio) and an outlier (determined from studentized deleted residual). The details of these procedures are given elsewhere (24).
Defining model Applicability Domains (AD) is an active area of modern QSAR research (25,26). Every QSAR model can formally predict the relevant target property for any compound for which chemical descriptors can be calculated. However, since each model is developed using compounds in the training set only (that cover only a small fraction of the entire chemistry (i.e., descriptor) space) the special applicability domain for each model should always be defined. As a consequence, only a certain fraction of compounds in any external dataset is expected to fall within the AD. This fraction is therefore referred to as the dataset coverage. There are several discussions about model AD in a recent publication (27). In this study, we present a detailed discussion concerning the effect of the AD on model predictivity using much larger modeling/validation sets than any other reported in the literature including our own previous publications.
The AD of kNN and Random Forest models is calculated from the distribution of similarities between each compound and its k nearest neighbors in the training set (similarities are computed as Euclidean distances between compounds represented by their multiple chemical descriptors). Based on the previous studies, the standard cutoff value to define the applicability domain for a QSAR model places its boundary at one-half of the standard deviation calculated for the distribution of distances between each compound in the training set and its k nearest neighbors in the same set. Thus, if the distance of the test compound from any of its k nearest neighbors in the training set exceeds the threshold, the prediction is considered unreliable. The detailed description of the algorithm to define this AD is given elsewhere (18,28).
Before any cluster model can be used to make a prediction for a test chemical, it must be determined whether the test chemical falls within the AD for the model. The first constraint, the model ellipsoid constraint, checks if the test chemical is within the multidimensional ellipsoid defined by the ranges of descriptor values for the chemicals in the cluster (for the descriptors appearing the cluster model). The model ellipsoid constraint is satisfied if the leverage of the test compound (h00) is less than the maximum leverage value for all the compounds used in the model (29). The second constraint, the Rmax constraint, checks if the distance from the test chemical to the centroid of the cluster is less than the maximum distance for any chemical in the cluster to the cluster centroid. The final constraint, the fragment constraint, stipulates that the chemicals in the cluster must contain at least one example of each of the fragments that are present in the test chemical (15).
For the prediction from the cluster model to be valid, several constraints must be met. The first two constraints are the model ellipsoid and fragment constraints described above. The final constraint is that the predicted toxicity value must be within the range of experimental toxicity values for the chemicals used to build the model (15).
For a prediction from the nearest neighbor method to be made, there must be three chemicals in the training set which are sufficiently similar to the test chemical (the similarity coefficient between each chemical and the test chemical in equation 1 must exceed 0.5).
The statistical parameters of predictions for the external validation set obtained from all five QSAR models developed in this study as well as using TOPKAT are shown in Table 2. It is difficult to compare all models side by side because the underlying approaches used different definitions of AD and therefore the statistical results are shown for external datasets of different sizes. Indeed, these initial results suggest that the prediction accuracy and chemical space coverage are tightly interlinked and in general, as expected, higher accuracy is obtained for smaller external datasets within the AD of each model. Models with the most liberally defined AD (and consequently, the highest coverage), i.e., NN and FDA MDL QSAR had the lowest R2 and the highest MAE followed by TOPKAT and Hierarchical Clustering that had progressively higher R2 values (although similar MAE) and smaller coverage. Nevertheless, for these four models the absolute R2 values were relatively low, i.e., under 0.5. Only two models (kNN and RF) afforded R2 higher than 0.50 and MAE lower than 0.50 for the external validation set but the external dataset coverage of these two models is the lowest (19%) among all models. It could be argued that for this dataset (and perhaps for any large and diverse dataset) it is critical to define a rather restrictive AD in order to achieve most accurate predictions as discussed in more detail below.
All five QSAR approaches implemented method-specific AD except kNN and RF models, which used the same definition of AD. On average, the use of AD improved the performance of individual models although the improvement came at the expense of the lower chemical space coverage. The direct comparison between individual models appears difficult due to different definitions of AD and different interplay between coverage and accuracy for relevant models.
Figure 1 shows the distribution of MAE values for the prediction of external validation set for TOPKAT, five individual models and consensus model developed in this study (see additional discussion of the consensus model below) that used the AD for three compound sets: all external compounds; those located within the AD of each model; and those outside of AD. Notably, all models showed similar predictivity when applied to the entire external set but the effect of AD was indeed model-specific. Six (TOPKAT, kNN, RF, Hierarchical Clustering, FDA MDL QSAR and consensus) out of seven QSAR models that used the AD showed the improvement in the prediction accuracy for external validation set as a result of excluding those compounds outside of the AD. The result of NN practically did not change after applying the AD criteria. This is not surprising given that there were only very few compounds that were outside of the structural AD in this model.
The different predictivity of the external validation set obtained from five QSAR models does not necessarily indicate that statistical approaches or descriptors used to develop these models have greatly different predictive power for this specific toxicity endpoint. It is noticeable that the resulting predictive accuracy strongly correlates to the model coverage that is decided by the model applicability domain. Once a more restrictive AD was applied, the predictive accuracy improved significantly (Table 2). For this reason, it is interesting to study the performance of each model when the same model applicability domain was implemented. Since only a small number of compounds were out of the ADs of NN and FDA MDL QSAR models, the remaining two model applicability domains (ADs of Hierarchical Clustering and kNN/RF) and the AD of TOPKAT were used to study the prediction accuracy of each model under the same prediction coverage (Table 3).
When using the same model applicability domain, the prediction coverage of the external prediction set obtained from each individual QSAR models are almost but not exactly the same. This is because there are some compounds (less than 1% of the total external compounds) which cannot be predicted using the Hierarchical clustering method even if all the constraints are relaxed. At similar levels of prediction coverage, the individual predictions using models generated in this study are similar to each other. Interestingly, the results generated using all models are approximately the same (in terms of R2 and MAE) when using TOPKAT defined AD, with kNN method arguably showing slightly better performance. However, somewhat surprisingly, with the decrease of the chemical space coverage most of the individual models developed in this study appear increasingly superior to TOPKAT (Table 3). It may be concluded that the prediction accuracy is not sensitive to the statistical approaches employed in this paper but strongly depends on the model applicability domain. Again, as noted above, it could be concluded that the higher accuracy of prediction comes at the expense of reducing the chemical space coverage.
There are some compounds that could not be predicted accurately by any of the five individual models. Using MAE > 1.0 as criteria, there are 520 validation set compounds with large prediction errors for any of the individual models. Some specific chemical scaffolds could be identified from these 520 compounds. These scaffolds and the comparison between the average LD50 values of the associated compounds in the modeling set, external validation set and those validation set compounds that have large prediction errors are listed in Table 4. The average LD50 value of these compounds is 3.4, and it is much higher than that of the compounds in the modeling set (2.47). Therefore, the relatively small fraction of compounds with high values of acute toxicity in the modeling set is a potential reason of the low prediction accuracy for these 520 compounds.
Ten out of 17 steroid-like compounds in the validation set have large prediction errors. As shown in Table 4, the five steroids in the modeling set have lower acute toxicity (average LD50 = 2.5) than these ten compounds (average LD50 = 4.6). A similar observation is true for the esters. Compounds with the same scaffolds and high acute toxicity need to be added into the modeling set to accurately predict these types of compounds. On the contrary, all five dioxins in the validation set have much lower toxicity (average LD50 = 5.1) than those three in the modeling set (average LD50 = 8.2). Therefore, dioxins with lower acute toxicity need to be added to the modeling set to accurately predict this type of compounds. There is no clear difference between the average LD50 value of 49 thiophosphates with large prediction errors in the validation set and the 285 thiophospates in the modeling set. However, the activity range of these 49 thiophosphates in the validation set is from 1.2 to 6.3, which is much larger than the activity range of 285 thiophospates in the modeling set, which is from 1.6 to 5.4. For this reason, thiophosphates with both high and low acute toxicity values need to be added to the validation set to improve the model predictivity for this type of compounds. These results indicate the existing shortcomings of the TOPKAT LD50 modeling set. Apparently, the modeling set should be balanced not only in terms of chemical diversity of compounds but also their activity distribution to afford higher external accuracy of models.
The statistical results obtained with individual models indicate that different modeling techniques may have different advantages for predicting the rat oral LD50 of organic compounds. Although the performance of our individual models are comparable or slightly better than that of TOPKAT, it is difficult to judge which model is better than others and which model should be chosen to predict rat acute toxicity potential of new compounds. For this reason, following a strategy that was proven successful in our previous studies (11) a simple consensus model was developed that integrated all of the individual models. In this approach, the LD50 value for each compound is predicted as the arithmetic average of all LD50 values predicted by individual models taking into account the model applicability domains. Note that additional averaging schemes giving, e.g., different weights to different contributing models could be used in principle. However, there has not been sufficient research in the QSAR modeling community into looking for the most optimal scheme for the ensemble QSAR modeling. Thus, we chose the simplest approach in this study. The detailed comparison between consensus predictions and those of other models when using the same AD is listed in Table 3. The data clearly demonstrate that the predictive accuracy of consensus model is higher than that for any individual model. In addition, we used the Wilcoxon test to calculate the p-values for the differences in MAEs obtained by consensus prediction vs. individual methods. Under almost all conditions, the improvement achieved by consensus prediction, compared with any individual model, is statistically significant (p < 0.01) and the only exception is when comparing consensus prediction with RF for the 743 compounds in the applicability domain of RF models (p=0.4).
From the discussion above, it is clear that the AD is an important factor that affects the predictive accuracy of each individual model. In the consensus prediction, model applicability domain was implemented by introducing the concept of “consensus prediction fraction”. Since the consensus prediction is the average of predictions using all five models, the fraction of the prediction could be defined as the number of individual model predictions that are available to predict a new compound (due to the AD limitations). Thus, if only one model could predict a compound, the consensus prediction fraction is 20% for this compound. If all 5 models could make the prediction, the prediction fraction of the consensus model is 100%. Different cutoff values for the prediction fraction could be set to get different prediction accuracy (and different coverage) based on this threshold. Figure 2 shows the change of prediction accuracy of external set, which is indicated by R2 and MAE, obtained by consensus prediction with different fraction cutoff values (Figure 2). For comparison, the TOPKAT prediction for the same external compounds is also shown in the same Figure 2. Increasing the prediction fraction level increased the prediction accuracy but decreased the prediction coverage. Figure 3 shows the relationship between experimental and consensus-predicted LD50 values when the prediction fraction is 80%. The compounds outside of the AD in this consensus prediction are also shown (Figure 3). Obviously, the removal of outliers improves the correlation. Furthermore, it is also interesting to compare the prediction coverage and accuracy that is indicated by R2 and MAE. Figure 4 shows the inverse correlation between the coverage and R2 (or direct correlation between the coverage and MAE) for all individual models (including TOPKAT) and consensus model (including the results of different prediction fractions). It is clear that the prediction accuracy obtained by this consensus model is higher than that for any individual model under any conditions (Figure 4).
A further understanding of the predictive ability of the models used in this study can be obtained by analyzing compounds for which consensus prediction gave higher accuracy than any of the individual models. It is clear that if all five individual models make similar predictions for a compound, the value from consensus prediction will be similar to any of those generated with individual models. The possible improvement of the prediction accuracy due to the use of consensus prediction could be achieved when the individual predictions are different. Table 5 lists ten compounds, which have the most significant difference between individual predictions.
There are many external validation set compounds (such as #1, #4, #5 and #9 in Table 4) whose individual LD50 predictions include one value with a large deviation from the others, which is usually the one that has the largest prediction error. Therefore, by taking the average for consensus prediction, we could compensate for the large error of such individual result.
On the other hand, the compounds #2, #3, #6, #7, #8 and #10 show large errors for the majority of their individual predictions. The consensus model is able to make accurate prediction, such as for compound #8, or prediction with moderate error, as for the remaining compounds in the Table, because individually predicted LD50 values are both lower and higher than the experimental LD50 value so that the errors to some extent cancel each other. The differences in model predictions arise because they use different descriptors and/or different modeling methods, which could model different aspects of toxicological affects. Thus, the consensus modeling allows for these different affects to be incorporated into a single (and on average, more accurate) prediction.
Several QSAR approaches have been used to develop toxicity models of the largest available set of diverse organic compounds tested for the oral acute toxicity in rats. The resulting models (for the most part incorporating specific applicability domains) were validated by predicting the toxicity of a large external validation set. It was observed that all models showed somewhat different but comparable performance for the validation set when compared to the commercial toxicity predictor TOPKAT. Formally, the highest accuracies were achieved by kNN and RF approaches (R2 = 0.66 and 0.70, respectively) but this required a decrease in space coverage (to ca. 19%). However, when the same model applicability domain was implemented, the individual models showed similar performance as applied to the validation set. Here, the use of applicability domain improved the prediction accuracy using individual models but decreased the predictive coverage of the validation set. Notably with the decrease of the prediction coverage models developed in this study showed slightly higher prediction accuracy as compared to TOPKAT.
The most significant result of our studies is the demonstrated superior performance of the consensus modeling approach when all models are used concurrently and predictions from individual models are averaged (see Figure 1)‥ The predictive accuracy of the consensus QSAR models was shown to be superior to any individual model when predicting the same set of external compounds. By using different cutoff values for the prediction fraction, trade-offs between the accuracy and the coverage of consensus prediction results can easily be seen. The predictivity of consensus models was found to be superior to that of TOPKAT when predicting the same external compounds. Finally, these studies indicated that a well organized modeling set that covers not only a broad chemical space but also broad activity ranges of major chemical scaffolds in this chemical space is necessary to develop successful QSAR toxicity predictors. Additional studies of this dataset are ongoing and will be reported in the future. All successful models reported in this paper will be made available via the ChemBench web portal (http://chembench.mml.unc.edu). Meanwhile, interested researchers can send us any compounds of interest for LD50 prediction.
This work was supported, in part, by grants from NIH (GM076059 and ES005948) and US EPA (RD832720 and RD833825). The research described in this article has not been subjected to each funding agency's peer review and policy review and therefore does not necessarily reflect their views and no official endorsement should be inferred. This manuscript has been reviewed by the US EPA and approved for publication. Approval does not signify that the contents necessarily reflect the views and policies of the agency, nor does mention of trade names or commercial products constitute endorsement or recommendation for use.
AD, Applicability Domain;
HPV, High Production Volume;
kNN, k Nearest Neighbors;
LOO-CV, Leave-One-Out Cross-Validated;
MAE, Mean Absolute Error;
NIH, National Institutes of Health;
NIEHS, National Institute of Environmental Health Sciences;
NTP, National Toxicology Program;
QSAR, Quantitative Structure-Activity Relationship;
R2, coefficient of determination;
RF, Random Forest;
TOPKAT, Toxicity Prediction by Komputer Assisted Technology;
US EPA, U. S. Environmental Protection Agency;
US FDA, U. S. Food and Drug Administration;
The authors declare no competing financial interests.