qHTS data improve QSAR model accuracy
The cell viability qHTS assays have been extensively validated and are known to give reproducible results [e.g., half-maximal activity concentration (AC50) values] in toxicity screening studies (Inglese et al. 2006; Xia et al. 2008). These data, when converted to binary “biological” descriptors, were shown previously to improve the accuracy of conventional, chemical descriptor-based QSAR models of rodent carcinogenicity (Zhu et al. 2008). The same simple binary descriptors, however, did not improve QSAR models of the acute rodent toxicity (i.e., LD50) data set used in this report (data not shown). Unlike binary activity calls, qHTS assays contain full concentration–response information, which enables derivation of multiple “biological” descriptors using a noise-filtering algorithm.
The initial use of these novel qHTS-derived descriptors alone did not result in robust classification models of rat acute toxicity (data not shown). This observation was similar to those of our previous studies (Zhu et al. 2008) showing that “binary” biological descriptors alone, derived from these same qHTS data, did not correlate well with rodent carcinogenicity. In vitro screening, even in as many as 13 cell lines, may not capture the complex biological mechanisms of in vivo toxicity.
We then examined the relationships between the “chemical” and qHTS-derived “biological” descriptors. Following standard cheminformatics procedures, we calculated and plotted pairwise similarities between compounds, estimated by the respective Euclidean distances, using either biological or chemical descriptors (Figure 3). We found no correlation between any two sets of descriptors; that is, similarity between compounds is perceived differently by the biological versus chemical descriptors. We conclude from this analysis that both sets of descriptors may bring unique features to models when used simultaneously.
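The comparison described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the toy descriptor matrices are hypothetical, descriptors are assumed to be already scaled, and Pearson correlation is assumed as the measure of agreement between the two distance sets.

```python
import numpy as np

def pairwise_euclidean(X):
    """Return Euclidean distances for all unique compound pairs (i < j)."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    i, j = np.triu_indices(len(X), k=1)
    return dist[i, j]

def correlate_pairwise_distances(chem, bio):
    """Pearson correlation between pairwise distances in two descriptor spaces.

    Rows of `chem` and `bio` must describe the same compounds in the same order.
    """
    d_chem = pairwise_euclidean(chem)
    d_bio = pairwise_euclidean(bio)
    return float(np.corrcoef(d_chem, d_bio)[0, 1])

# Made-up descriptor matrices for 4 compounds (illustration only).
chem = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0], [0.0, 1.0]])
bio_scaled_copy = 2.0 * chem  # same geometry, so distances correlate perfectly
r = correlate_pairwise_distances(chem, bio_scaled_copy)
```

A value of `r` near 0 for real chemical and biological descriptor matrices would correspond to the lack of correlation reported here; the scaled copy is included only to show the opposite extreme.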
Figure 3 Pairwise Euclidean distances in the chemical (y-axis) and biological (x-axis) descriptor space for the qHTS LD50 data set. Dots represent compound pairs; colors reflect in vivo toxicity: blue, pairs of nontoxic compounds; red, pairs of toxic compounds.
Next, we built QSAR models of acute rat toxicity using chemical descriptors only. Based on the external validation set, mean accuracy of the models was > 75%, which supports the utility of chemical descriptor–based QSAR models for the acute rat toxicity end point. To determine whether qHTS-derived “biological” descriptors could improve model predictivity, we used hybrid chemical–biological sets of descriptors. When we used unprocessed qHTS descriptors (THR = 0%), model accuracy decreased, likely because of high noise levels (i.e., random variation) in the concentration–response profiles. However, hybrid models based on the noise-filtered qHTS data showed significantly improved external classification accuracy compared with models based on chemical descriptors alone or on hybrid descriptors with untreated qHTS data. Three hybrid models (THR = 5%, 15%, and 25%) showed similar performance, indicating that a relatively minor correction of the baseline response results in a significant improvement in model performance. In further analysis, we used THR = 15%, a value chosen arbitrarily from this range.
CCRs of 5-fold external validation for kNN and random forest models.
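For readers unfamiliar with the metric, the sketch below shows how a CCR value can be computed. It assumes that CCR denotes the correct classification rate defined as the mean of sensitivity and specificity, a definition this section does not spell out; the function name and the example labels are illustrative only.

```python
def ccr(y_true, y_pred):
    """Correct classification rate, assumed here to be
    0.5 * (sensitivity + specificity), with 1 = toxic and 0 = nontoxic."""
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = sum(1 for t in y_true if t == 0)
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present to compute CCR")
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    true_neg = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    sensitivity = true_pos / n_pos   # fraction of toxic compounds recovered
    specificity = true_neg / n_neg   # fraction of nontoxic compounds recovered
    return 0.5 * (sensitivity + specificity)

# Made-up labels: 2 toxic and 4 nontoxic compounds.
example_ccr = ccr([1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 1])
```

Because the two classes are weighted equally, this measure is less sensitive than plain accuracy to the imbalance between toxic and nontoxic compounds.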
qHTS data improve QSAR model coverage
The kNN QSAR classification method used in this study relies on an ensemble of models with a consensus scoring scheme: for each chemical, we record the average of the binary classifications (0 = “nontoxic,” 1 = “toxic”) from all individual models whose applicability domains include that chemical. The average “prediction” value can fall anywhere between 0 and 1. The external validation results reported here are based on a consensus classification using 0.5 as a threshold (i.e., an average value > 0.5 is predicted “toxic,” < 0.5 “nontoxic”). However, the stringency of the kNN classification can be adjusted by applying individual thresholds to each class (e.g., ≤ 0.3 is nontoxic, ≥ 0.7 toxic) and treating all inconsistent classifications (e.g., between 0.3 and 0.7) as inconclusive. Although the accuracy of the classification may improve when stringent thresholds are applied, the coverage of the model (i.e., the fraction of compounds that can be classified given the applicability domain limitations) decreases. To explore the relationship between the predictivity and coverage of the models based on chemical or hybrid [original or filtered (15% THR) concentration–response data] descriptors, we determined the CCR and coverage of the models with varying classification thresholds (Figure 4).
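The consensus scheme and its thresholds can be illustrated with a short sketch. It is not the authors' implementation: the function names and the vote matrix are hypothetical, NaN is assumed to mark a compound outside a model's applicability domain, and coverage is assumed to count only compounds that receive a conclusive call.

```python
import numpy as np

def consensus_scores(votes):
    """Average the 0/1 votes per compound (columns) over models (rows),
    skipping NaN entries; compounds covered by no model get NaN."""
    votes = np.asarray(votes, dtype=float)
    counts = np.sum(~np.isnan(votes), axis=0)
    sums = np.nansum(votes, axis=0)
    scores = np.full(votes.shape[1], np.nan)
    covered = counts > 0
    scores[covered] = sums[covered] / counts[covered]
    return scores

def classify(scores, low=0.3, high=0.7):
    """Return 1 (toxic), 0 (nontoxic), or -1 (inconclusive or uncovered).

    With low == high, a score exactly at the threshold is called toxic;
    the text leaves that boundary case unspecified.
    """
    labels = np.full(scores.shape, -1, dtype=int)
    with np.errstate(invalid="ignore"):  # NaN comparisons evaluate to False
        labels[scores <= low] = 0
        labels[scores >= high] = 1
    return labels

def coverage(labels):
    """Fraction of compounds that received a conclusive call."""
    return float(np.mean(labels != -1))

nan = np.nan
votes = [[1, 0, nan, 1],     # 4 hypothetical models x 4 compounds
         [1, 0, nan, 0],
         [1, 1, nan, nan],
         [nan, 0, nan, 1]]
scores = consensus_scores(votes)
strict = classify(scores, low=0.3, high=0.7)
lenient = classify(scores, low=0.5, high=0.5)
```

In this toy case the fourth compound (score of about 0.67) is inconclusive under the strict thresholds but called toxic at 0.5, so coverage rises from 0.5 to 0.75 when the thresholds are relaxed.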
Figure 4 External prediction results of kNN models using different classification criteria: distribution of the predicted values (A) and heat maps illustrating classification (B, CCR) and coverage (C, percent chemicals within the applicability domain) results.
The distribution of the consensus model predictions (Figure 4A) for the test compounds shows that the hybrid descriptor models with noise-filtered qHTS data exhibit the most favorable separation of “toxic” and “nontoxic” compounds. Importantly, when CCR (Figure 4B) and coverage (Figure 4C) are plotted as heat maps, it is evident that the hybrid descriptor models with noise-filtered qHTS data have not only high accuracy but also higher coverage at lower thresholds. For example, when fairly strict classification criteria (e.g., ≤ 0.3 for nontoxic, ≥ 0.7 for toxic) are applied, all three types of models achieve similar classification accuracy (CCR ≈ 86%), yet the coverage is considerably higher for the hybrid models (81% vs. 57%; connected dots in Figure 4). This implies that hybrid models are expected to make accurate predictions for substantially more external chemicals, an important feature when prioritizing new chemicals for in vivo testing. Furthermore, the consensus classification value correlates well with LD50 [see Supplemental Material, Figure 5 (doi:10.1289/ehp.1002476)].
Comparative analysis of hybrid QSAR
To evaluate the robustness of the classification models, we applied the y-randomization test (see “Materials and Methods”) to the representative hybrid descriptor model with noise-filtered (THR = 15%) qHTS data and to the model based on chemical descriptors only. All y-randomized models performed significantly worse (one-tailed t-test, p < 0.05) than the respective real models, with CCR values < 0.52 in all cases.
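The idea behind y-randomization can be sketched on synthetic data. Everything in this example is an assumption made for illustration: a leave-one-out 1-nearest-neighbor classifier stands in for the kNN ensemble, the data are two artificial clusters, CCR is taken as the mean of sensitivity and specificity, and the one-tailed t-test reported above is not reproduced.

```python
import numpy as np

def ccr(y_true, y_pred):
    """Mean of sensitivity and specificity (1 = toxic, 0 = nontoxic)."""
    sensitivity = np.mean(y_pred[y_true == 1] == 1)
    specificity = np.mean(y_pred[y_true == 0] == 0)
    return 0.5 * (sensitivity + specificity)

def loo_1nn_predict(X, y):
    """Leave-one-out prediction: each compound takes its nearest neighbor's label."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)  # a compound may not be its own neighbor
    return y[np.argmin(dist, axis=1)]

def y_randomization(X, y, n_rounds=20, seed=0):
    """Compare the CCR obtained with real labels against CCRs for shuffled labels."""
    rng = np.random.default_rng(seed)
    real = ccr(y, loo_1nn_predict(X, y))
    randomized = []
    for _ in range(n_rounds):
        y_perm = rng.permutation(y)  # breaks the descriptor-activity link
        randomized.append(ccr(y_perm, loo_1nn_predict(X, y_perm)))
    return real, np.array(randomized)

# Synthetic data: two well-separated clusters of 20 "compounds" each.
data_rng = np.random.default_rng(42)
X = np.vstack([data_rng.normal(0.0, 1.0, (20, 2)),
               data_rng.normal(10.0, 1.0, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
real_ccr, random_ccrs = y_randomization(X, y)
```

A model that has learned a genuine relationship should score well above the shuffled-label runs, whose CCR values are expected to scatter around 0.5.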
We also compared the performance of models developed in this study with that of the widely used commercial toxicity predictor software TOPKAT (Toxicity Prediction by Komputer Assisted Technology) (Venkatapathy et al. 2004). There were 87 molecules present both in our qHTS LD50 data set and in the previously reported external validation set (Zhu et al. 2009a) of TOPKAT. Because TOPKAT generates continuous LD50 predictions, we made binary classifications using the same criteria as applied in the case of the qHTS LD50 data (see “Materials and Methods”); of the 87 molecules, 52 were classified (11 as “toxic” and 41 as “nontoxic”), and the remaining 35 had “marginal” activity. Although the hybrid models based on the noise-filtered qHTS data gave CCR values > 0.85, both our chemical descriptor-based models and those of TOPKAT (also based on chemical descriptors only) showed lower predictivity (CCR of 0.75–0.77 and 0.69, respectively). Notably, the sensitivity (i.e., accuracy in predicting toxic compounds) of our models was dramatically higher than that of TOPKAT (73–91% vs. 43%), with only a minor drop in specificity (83–85% vs. 93%). These results further support the use of hybrid chemicobiological descriptors in QSAR modeling of chemical toxicity.
Classification results for external validation set.
Chemical and biological descriptors are both important for accurate prediction of acute rat toxicity
The QSAR modeling approaches used here allow for the analysis of individual descriptors that appear frequently in models with high classification accuracy. To this end, we further examined the hybrid descriptor-based kNN model with noise-filtered (THR = 15%) qHTS data.
In total, across the five splits of the modeling set, we generated > 7,000 individual kNN models. Figure 5A shows that, on average, each descriptor appeared in 3.3% of all models. We determined that 90 descriptors had above-average frequency, of which 21 were qHTS-derived descriptors (Figure 5B). The apparent imbalance between chemical and biological descriptors is due to a corresponding imbalance (4:1) in the total number of descriptors of each class used for modeling.
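The frequency analysis can be sketched as below. This is an illustration under stated assumptions, not the authors' procedure: each model is represented simply as the set of descriptors it uses, the mean is taken over the whole descriptor pool (including descriptors never selected), and the descriptor names are invented.

```python
from collections import Counter

def descriptor_frequencies(models, descriptor_pool):
    """Fraction of models in which each descriptor of the pool appears."""
    counts = Counter(d for model in models for d in set(model))
    n_models = len(models)
    return {d: counts[d] / n_models for d in descriptor_pool}

def above_average(frequencies):
    """Descriptors whose occurrence frequency exceeds the mean frequency."""
    mean_freq = sum(frequencies.values()) / len(frequencies)
    return sorted(d for d, f in frequencies.items() if f > mean_freq)

# Hypothetical descriptor names and models, for illustration only.
pool = ["jurkat_92uM", "nCl", "nS", "nBr"]
models = [{"jurkat_92uM", "nCl"},
          {"jurkat_92uM", "nS"},
          {"jurkat_92uM"},
          {"nCl"}]
freqs = descriptor_frequencies(models, pool)
frequent = above_average(freqs)
```

Here the mean frequency is 0.375, so only the first two descriptors are flagged; in the study, the same kind of cutoff corresponds to the mean occurrence of 3.3%.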
Figure 5 Occurrence frequencies of the descriptors in the hybrid kNN (THR = 15%) model (A) and relative frequencies of qHTS biological descriptors (B). Max, maximum. The fraction of most frequent descriptors selected by mean occurrence is marked by a dashed line.
The top descriptor overall, with an occurrence as high as 61%, was the Jurkat cell viability response at the highest concentration tested (92 μM). Similar to the observation made in our previous studies (Zhu et al. 2008), the Jurkat cell line was found to be the most significant biological descriptor for predicting in vivo toxicity, followed by the SK-N-SH cell line. Jurkat is a human tumor cell line derived from T-cell leukemia, and it grows in suspension with a relatively fast doubling time of about 22 hr. This cell line retains some metabolic capacity toward xenobiotics and is used frequently for in vitro testing (Nagai et al. 2002). We found that HepG2 and renal proximal tubule cell lines generated the least informative biological descriptors. In fact, almost all cell lines had model-informative responses over the top six concentrations tested; we derived fewer informative data from the mid to lower part of the concentration range (Figure 5B). Independent of assay hit frequency, however, the modeling success suggests that the modes of action for chemicals that cause overt toxicity in vivo may, at least in part, correspond to those operative in vitro. Interestingly, the qHTS descriptor representing response at the lowest concentration tested (0.6 nM) in the N2a cell line was indicative of nontoxic classification (of 26 compounds with nonzero response at 0.6 nM, 1 was toxic, 9 were nontoxic, and 16 were marginal). This result underscores the need for including sufficiently high and low concentrations for in vitro screening of chemicals.
The table below summarizes the most frequently selected chemical descriptors. They fall into several chemical categories: halocarbon compounds, sulfur-containing molecules (mainly thiophosphates), and aromatic structures. These chemical classes are known for their prevalent toxicity (Denison 1990; Vittozzi et al. 2001). Several of the descriptors likely serve as secondary features within classes, allowing recognition of specific subclasses of molecules that have either low or high toxicity.
Frequently used descriptors in a kNN Hybrid (THR = 15%) model.
In addition, we argue not only that there is value in better understanding which descriptors were successful at predicting activity class, but also that it is useful to analyze the “classification outliers,” that is, those chemicals that the models failed to predict accurately. Because both chemical structure–based and qHTS profile–based descriptors are available, we can determine whether certain chemical classes of the consistently correctly or incorrectly classified compounds have similar concentration–response curve fingerprints (see “Materials and Methods”), and we can identify cases where qHTS results are less reliable or less informative for model success. The table below illustrates several sample comparisons using qHTS fingerprints derived from the concentration–response curves in the 13 cell lines. For example, correctly classified polychlorinated phenols, aliphatic alcohols, and acetates (items 1–3) exhibit similar in vitro concentration–response profiles and in vivo toxicity. In contrast, a pair of benzaldehyde molecules (item 4) have markedly different qHTS profiles, with one profile indicating more potential toxicity, whereas both are considered inactive in vivo; in this case, chemical descriptors perceive the chemicals as similar in relation to toxicity. For alkyl halides and nitriles (items 5 and 6), in vitro screening failed to detect toxicity, whereas they are positive for in vivo toxicity (except for volatile bromoethane and acetonitrile). In the case of phenylenediamine derivatives and alkyl aldehydes (items 7 and 8), the agreement between in vitro and in vivo results is higher.
Classifications for similar compounds.
For some misclassified compounds (e.g., bromoethane, acetonitrile, or methyl vinyl ketone; items 5, 6, and 9 in the table above), the errors may be related to metabolism. For example, the toxicity of alkyl nitriles is known to be caused by the hydrogen cyanide metabolite (Willhite and Smith 1981). Other reasons for the failure of the model to predict accurately could include certain physical properties (e.g., volatility) and chemical uniqueness, that is, when a “structural outlier” is the only representative of a certain mechanism of toxicity. These factors may help explain incorrect classification of iodoform and methyl isocyanate, which are small volatile molecules with inactive qHTS profiles but are known to be toxic in vivo.
These results suggest that one strategy for refining hybrid models is to tailor their application according to the success or failure of the global consensus models in local regions of chemical space. For example, in regions of chemical space where pharmacokinetic factors (e.g., metabolism or absorption) complicate in vitro–in vivo comparisons, models could be trained to rely exclusively on chemical descriptors, and the generation of qHTS data would be less crucial. In other areas of chemical space, where qHTS results add significantly to model performance, generation of qHTS results would be considered a higher priority. In these cases, our results show the importance of using both short-term assays and advanced cheminformatics approaches for predicting in vivo toxicity.