SAR QSAR Environ Res. 2010 Jul–Sep; 21(5-6): 403–413.
Published online 2010 September 6. doi:  10.1080/1062936X.2010.501815
PMCID: PMC2946238

Development of an ecotoxicity QSAR model for the KAshinhou Tool for Ecotoxicity (KATE) system, March 2009 version


The KAshinhou Tool for Ecotoxicity (KATE) system, including ecotoxicity quantitative structure–activity relationship (QSAR) models, was developed by the Japanese National Institute for Environmental Studies (NIES) using the database of aquatic toxicity results gathered by the Japanese Ministry of the Environment and the US EPA fathead minnow database. In this system chemicals can be entered according to their one-dimensional structures and classified by substructure. The QSAR equations for predicting the toxicity of a chemical compound assume a linear correlation between its log P value and its aquatic toxicity. KATE uses a structural domain called C-judgement, defined by the substructures of specified functional groups in the QSAR models. Internal validation by the leave-one-out method confirms that the QSAR equations, with r2>0.7, RMSE ≤0.5, and n>5, give acceptable q2 values. External validation indicates that a group of chemicals within the domain of the KATE C-judgements exhibits a lower root mean square error (RMSE). These findings demonstrate that the KATE system has the potential to enable chemicals to be categorised as potential hazards.

Keywords: QSAR, ecotoxicity prediction, classification, chemical substances, domain, KATE

1. Introduction

Quantitative structure–activity relationships (QSARs) are potential tools for predicting the activity and properties of chemicals, including their physicochemical attributes, health effects, ecotoxicity and biological activity. QSAR models can estimate and predict such activity and can thus be used to categorise chemicals in terms of their potentially hazardous nature. A recent review has demonstrated that acute aquatic toxicity [1] can be predicted using QSAR and describes the available databases of ecotoxicity data.

Prediction of toxicity by QSAR does not require lengthy experiments, nor the use of animals, plants or cells. QSAR models have therefore been utilised for the assessment of new and existing chemicals for conformity with regulatory requirements in countries within the Organisation for Economic Co-operation and Development (OECD) [2]. In Japan, under the Chemical Substances Control Law (CSCL), the Ministry of the Environment (MoE) is responsible for evaluating the adverse effects of chemicals on ecosystems, and uses tests involving aquatic organisms such as Oryzias latipes (fish) or Daphnia magna (daphnia), in addition to algae data available from the MoE website [3]. The Japanese National Institute for Environmental Studies (NIES) was established to apply QSAR models to acute ecotoxicity, and has developed a QSAR prediction system using the MoE ecotoxicity database. This system, published in March 2009, is known as the KAshinhou Tool for Ecotoxicity (KATE) [4].

The present paper focuses on the theoretical and methodological aspects of the KATE system, and QSAR equations classified by chemical substructure are introduced. We shall then present the cross-validation (‘leave-one-out’) results, and the toxicities calculated by KATE, and by alternative systems such as TIssue MEtabolism Simulator (TIMES) [5,6] (developed by Zlatarov at the Laboratory of Mathematical Chemistry, Bourgas University, Bulgaria), and by ECOSAR™[7] (developed by the US Environmental Protection Agency (EPA)) using the same end-point data set as that in KATE. The validity of KATE will be discussed using the applicability domain, log P, and C-judgements.

2. Overview of KATE

2.1. End-point

KATE uses experimental data on chemical substances to predict aquatic toxicity. The end-points of interest are the 96-hour median lethal concentration (LC50) in fish after acute toxicity tests, and the 48-hour median effective concentration (EC50) in daphnia obtained after acute immobilisation tests. Training sets for QSAR development were derived from the results of ecotoxicity tests (Oryzias latipes LC50 and Daphnia magna EC50) obtained by the MoE [3], as well as the results of acute toxicity tests from the US EPA fathead minnow (Pimephales promelas) database [8,9]. In the KATE system, the 96-hour LC50 data for Oryzias latipes and fathead minnow were combined to reinforce the number of reference datasets. The QSAR equations in KATE for the fish and daphnia end-points were designed using 535 and 258 chemicals, respectively.

2.2. Classification of chemicals

Chemical substances can be classified according to the substructures that give rise to specific chemical properties (Appendix 1 of the supplementary material, which is available on the Supplementary Content tab of the article's online page). The rules for daphnia and fish end-points are identical, except for the following five classes: amines aromatic or phenols1, amines aromatic or phenols3, amines aromatic or phenols4, amines aromatic or phenols5, and primary amines. According to KATE, the toxicity of a chemical containing amino functional groups might differ between daphnia and fish.

Forty-four classes are proposed for each end-point of KATE QSAR models. Table 1 shows the QSAR class name, and the detailed class features are listed in Appendix 2 of the supplementary material (available online). The chemicals in the KATE unclassified class were not categorised within any of the rules in Appendix 2. Additional classification rules or fragment definitions are required in further studies to reduce the number of chemicals described as unclassified. It should be noted that the concept of unclassified within KATE does not always include reactive chemicals, and thus differs from the reactive unspecified category in the TIMES software.

Table 1.

QSARs for fish acute toxicity estimated by the equation: log(1/LC50[mM]) = a * log P + b.

2.3. Neutral organics

Neutral organics is an aggregate of the chemicals in defined classes in the KATE system. It comprises the classes: nitriles aliphatic, ketones, alcohols or ethers aliphatic, phosphates, hydrocarbons aliphatic, ethers aliphatic and ethers aromatic. In the OECD Environment Monograph [10], neutral organic compounds of minimal toxicity were divided into the groups: aliphatic alcohols, aliphatic ketones, aliphatic ethers and alkoxyethers, aliphatic halogenated hydrocarbons, saturated alkanes and halogenated benzenes. Some of the neutral organics compounds defined in the OECD monograph were categorised differently from those in KATE.

2.4. QSAR equations

The QSAR equations in the KATE model express the correlation between the octanol/water partition coefficient (log P) of a compound and its aquatic toxicity, using simple linear regression analysis. Measured log P values were used to derive the QSAR equations, except for the equations labelled C in Tables 1 and 2. In cases where experimental log P values were not available, an equation was constructed from the calculated Clog P value obtained by the Daylight toolkit [11]. The LC50 and EC50 values in the equation were expressed as the common logarithm of the inverse of the concentration in millimoles per litre (mmol L−1, or mM). The equations and the statistical information obtained are shown in Tables 1 and 2. Where there were fewer than three sets of reference data within one class, QSAR prediction could not be performed. In such cases the class name was the only information obtained from KATE, and the label NO–QSAR is indicated in Tables 1 and 2. The equation for the class named pyrethroids was not constructed, since the log P values in the reference data were clustered in a narrow, high range [6.1, 6.5].
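The per-class regression described above can be sketched in a few lines. This is a minimal illustration and not the KATE implementation: the function names are hypothetical, NumPy's ordinary least-squares fit stands in for whatever regression routine the authors used, and the unit conversion assumes that toxicity is supplied in mg/L together with a molecular weight in g/mol.

```python
import numpy as np

def fit_class_qsar(log_p, log_inv_conc):
    """Fit log(1/LC50[mM]) = a * log P + b for one class by least squares.

    Returns (a, b), or None when the class has fewer than three reference
    data, mirroring the NO-QSAR label described in the text.
    """
    x = np.asarray(log_p, dtype=float)
    y = np.asarray(log_inv_conc, dtype=float)
    if x.size < 3:
        return None
    a, b = np.polyfit(x, y, 1)  # degree-1 polynomial: slope, intercept
    return float(a), float(b)

def predict_lc50_mM(a, b, log_p):
    """Invert the equation to recover a concentration in mM."""
    return 10.0 ** -(a * log_p + b)

def mg_per_l_to_log_inv_mM(conc_mg_per_l, mol_weight):
    """Convert mg/L to log(1/[mM]); mg/L divided by g/mol gives mmol/L."""
    return -np.log10(conc_mg_per_l / mol_weight)
```

Fitting each class separately keeps the model interpretable: a class is summarised entirely by its slope a, intercept b, and the log P range of its reference data.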

Table 2.

QSARs for the daphnia acute toxicity estimated by the equation: log(1/EC50[mM]) = a * log P + b.

2.5. Domains in KATE

KATE offers two ‘judgements’ to verify whether or not a predicted chemical substance falls within the applicability domain of a QSAR class. The first is the log P judgement, based on the log P range defined by the reference chemical data of the class concerned. This has been categorised as a descriptor domain [12,13]. The interpolated log P range for each class is listed in Tables 1 and 2.

The second is the C-judgement, which is categorised as a structural domain and is defined by the substructures shown in Appendix 3 of the supplementary material (available online). The substructures are based on functional groups having similar concepts to those used by Schultz et al. [13], rather than on atom-centred fragments [12,14]. Schultz et al. applied the structural domain to one QSAR equation for aromatic compounds, and the out-of-domain revealed well-known electrophoric mechanisms in the structural space(s) [13]. In the KATE system the classification rules (described in Section 2.2) play a role in constructing such structural space(s). The definition of the applicability domain of C-judgement depends on whether all the substructures of the chemical under test are found in reference chemicals in the class, or secondly, whether all substructures in the test chemical are present in reference chemicals in either neutral organics or the class concerned. The first of these definitions is stricter than the second. The reliability of the log P and C-judgements is assessed later in Section 4 (Results and discussion).
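The two judgements can be illustrated with a short sketch. It assumes that substructures have already been identified and are represented as sets of labels; the labels and function names are hypothetical, and the actual substructure definitions are those of Appendix 3. The `strict` flag selects between the first (stricter) definition and the second, which also admits substructures found in neutral organics.

```python
def log_p_judgement(log_p, reference_log_ps):
    """Descriptor domain: is log P inside the range spanned by the class's reference data?"""
    return min(reference_log_ps) <= log_p <= max(reference_log_ps)

def c_judgement(test_substructures, class_substructures,
                neutral_substructures=frozenset(), strict=True):
    """Structural domain: are all substructures of the test chemical covered?

    strict=True  -> every substructure must occur in the class's reference chemicals.
    strict=False -> substructures may occur in the class or in neutral organics.
    """
    allowed = set(class_substructures)
    if not strict:
        allowed |= set(neutral_substructures)
    return set(test_substructures) <= allowed

# Hypothetical labels, for illustration only
class_refs = {"aromatic_ring", "nitro"}
neutral_refs = {"hydroxyl", "ether"}
query = {"aromatic_ring", "hydroxyl"}
in_strict = c_judgement(query, class_refs, neutral_refs, strict=True)
in_relaxed = c_judgement(query, class_refs, neutral_refs, strict=False)
```

In this example the query fails the strict judgement because `hydroxyl` is absent from the class, but passes the relaxed one because it appears in the neutral organics set.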

2.6. KATE system software

The KATE software was first made available to the public in January 2008. An updated version of KATE, including standalone personal computer and internet versions, was released in March 2009. The standalone version, called ‘KATE on PAS’, and the internet version, called ‘KATE on NET’, adopted the KOWWIN™[15] of the US EPA, and Clog P [11] estimated by the Daylight system, respectively, to estimate the calculated log P. Except for the treatment of calculated log P values, KATE on PAS and KATE on NET use the same classification algorithm, fragment identification by tree structure (FITS), developed by Yoshioka.

In the KATE system, the input is simplified molecular input line entry specification (SMILES) and log P (if available) for toxicity prediction, and the output is the calculated toxicity concentration (LC50 or EC50), the QSAR class found for the predicted chemical, and the domain judgements. If the measured log P of a chemical is not available, the calculated log P according to the SMILES information (KOWWIN or C log P) is adopted.
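The choice between measured and calculated log P can be sketched as follows. The class and function names are hypothetical, and `estimate_log_p` is a placeholder for an external estimator such as KOWWIN or Clog P, neither of which is called here.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

@dataclass
class QueryChemical:
    smiles: str                              # SMILES string of the chemical
    measured_log_p: Optional[float] = None   # experimental log P, if available

def select_log_p(query: QueryChemical,
                 estimate_log_p: Callable[[str], float]) -> Tuple[float, str]:
    """Prefer the measured log P; otherwise fall back to a calculated value."""
    if query.measured_log_p is not None:
        return query.measured_log_p, "measured"
    return estimate_log_p(query.smiles), "calculated"
```

Returning the source label alongside the value makes it possible to report which kind of log P a prediction rests on.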

3. Methods of QSAR validation

First, leave-one-out cross validations were examined for training sets used in the QSAR equations of KATE. Secondly, external validations were performed using test set compounds not included in the KATE training sets due to lack of measured log P values. The 287 fish 96-hour LC50 and 98 daphnia 48-hour EC50 from the Japan MoE, along with the US EPA fathead minnow database, were used for comparison of the calculated toxicity by the KATE software version published in March 2009, TIMES v. 2.25, and ECOSAR v. 0.99 h (1999).

It is worth mentioning that the end-points of the data calculated by KATE were not identical to those calculated by TIMES and ECOSAR. The end-points adopted were therefore: fish 96-hour LC50 (combining Oryzias latipes and fathead minnow acute toxicity tests) and daphnia 48-hour EC50 for KATE; Pimephales promelas 96-hour LC50 and daphnia 48-hour EC50 for TIMES; and fish 96-hour LC50 and daphnia 48-hour LC50 for ECOSAR. The inputs to KATE and ECOSAR were SMILES strings and log P values calculated by KOWWIN. In TIMES, only the lists of SMILES strings were used as input values, and quantum chemical calculations were performed using the MOPAC AM1 Hamiltonian with the ‘precise’ option, without taking other conformers into account.

4. Results and discussion

4.1. Cross validation

The QSAR equations were validated by the leave-one-out method obtained from the KATE system. The complete list of results is given in Appendix 4 of the supplementary material (available online). The statistical data are displayed in Tables 1 and 2. The criterion proposed by Hulzebos and Posthumus [16] was evaluated, in which the estimations from models should not deviate from the experimental value by a factor of 10 or above. For fish, 575 of the 628 chemicals met the acceptable criteria, and for daphnia 241 of 290 did so. (In this instance the 628 and 290 chemicals involved some degree of duplication.) Using the QSAR equations in the KATE system, more than 80% of chemicals were predicted within a factor of 10. Classes with a squared correlation coefficient below 0.7 (r2 < 0.7) and/or an RMSE above 0.5 tended to contain more chemicals in the unacceptable group. For example, the fish hydrocarbons aromatic class had 43 reference data, r2 = 0.826, RMSE = 0.368, and only one unacceptable chemical. In other words, 98% of the chemicals were classed as acceptable. On the other hand, the fish dinitrobenzene class contained 12 reference data, r2 = 0.331, RMSE = 0.669, and three unacceptable chemicals. In this case, 75% of the chemicals were thus acceptable.
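A leave-one-out check of this kind can be sketched as below. The code is illustrative only: q2 is computed with the commonly used definition 1 − PRESS/SS_tot, which is an assumption because the text does not state the formula used in KATE, and the factor-of-10 criterion is expressed as an absolute deviation of less than one log unit.

```python
import numpy as np

def loo_predictions(log_p, y):
    """Refit the class regression n times, each time predicting the left-out chemical."""
    x = np.asarray(log_p, dtype=float)
    y = np.asarray(y, dtype=float)
    preds = np.empty_like(y)
    idx = np.arange(y.size)
    for i in idx:
        keep = idx != i
        a, b = np.polyfit(x[keep], y[keep], 1)
        preds[i] = a * x[i] + b
    return preds

def q2(y, loo_pred):
    """Cross-validated q2 = 1 - PRESS / SS_tot (assumed definition)."""
    y = np.asarray(y, dtype=float)
    press = np.sum((y - np.asarray(loo_pred, dtype=float)) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return float(1.0 - press / ss_tot)

def fraction_within_factor_10(y, pred):
    """Share of chemicals predicted within a factor of 10, i.e. |difference| < 1 log unit."""
    diff = np.asarray(y, dtype=float) - np.asarray(pred, dtype=float)
    return float(np.mean(np.abs(diff) < 1.0))
```

Because each refit uses n − 1 points, a class needs at least three reference data for the loop to run, and small classes are sensitive to single outliers.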

As shown in Tables 1 and 2, each of the classes with r2 ≥ 0.7, RMSE ≤ 0.5, and n > 5, e.g., the fish hydrocarbons aromatic class, had a sufficiently high q2. Such classes showed QSAR equations similar to those of neutral organics. Thus the toxicity of such classes could be explained mainly by the narcotic effect of the chemicals. However, the daphnia amines aromatic or phenols4 and amines aromatic or phenols5 groups had a larger intercept b in the QSAR equations than neutral organics with a small log P value (see Figure 1). These classes can be explained in terms of polar narcosis or narcosis II [17]. Narcosis II is known to be more toxic than baseline toxicity, i.e., than neutral organics, non-polar narcosis, narcosis I, or less inert, as explained by Verhaar et al. [18].

Figure 1.

The correlation between log P and the measured toxicity values of chemicals used in KATE as a daphnia end-point. The dotted-dashed, dashed and bold lines are the QSAR equations of amines aromatic or phenols4, amines aromatic or phenols5, and neutral organics, ...

In some cases the q2 values were much smaller than those of r2. QSAR equations based on fewer than six reference data require a greater number of reference chemicals.

4.2. External validation

Tables 3 and 4 list the statistical data of TIMES, ECOSAR, and KATE with or without the applicability domains. The complete results are given in Appendix 5 of the supplementary material. First, we will focus on the TIMES, ECOSAR, and all the KATE results, without considering any applicability domains. In fish, the determination coefficient, r2, and RMSE using KATE (r2 = 0.868 and RMSE = 0.658) were larger and smaller, respectively, than those using TIMES (r2 = 0.751 and RMSE = 0.935) and ECOSAR (r2 = 0.790 and RMSE = 0.869). For daphnia, RMSE using KATE (0.993) was smaller than that using TIMES (1.404) and ECOSAR (1.364). However, r2 using KATE (0.662) showed no noticeable advantage over that by TIMES (0.668) or ECOSAR (0.699). Since reference data for the daphnia end-point (258 chemicals) numbered only half of those for fish (535 chemicals), the reference data for each QSAR equation for daphnia would therefore be less satisfactory for predicting toxicity. The addition of reference data and a change in the classification rules could improve these statistics. A fraction of log(1/LC50) with an underestimation of less than −1 indicated that, compared with KATE, TIMES and ECOSAR tended to underestimate the toxicities of both fish and daphnia. On the other hand, a fraction of log(1/LC50) showing an overestimation of more than 1 indicated that, compared with TIMES, ECOSAR and KATE tended to overestimate toxicity in both fish and daphnia. Considering these under- and over-estimation fractions, we find that KATE gives a higher predictive ability in acute Oryzias latipes and Daphnia magna toxicity tests than does TIMES or ECOSAR. If the alert ‘Out of domain’ in TIMES and the applicable log P range in ECOSAR are considered rigidly, the correlation between measured and calculated toxicity is improved in TIMES and ECOSAR.
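The statistics compared in this section can be computed with a short sketch such as the one below. Two assumptions are made that the text does not confirm: r2 is taken as the squared Pearson correlation between measured and calculated values, and under- and over-estimation are defined from the difference calculated minus measured in log(1/concentration) units, so that a difference below −1 counts as underestimated toxicity.

```python
import numpy as np

def external_stats(measured, calculated):
    """Summary statistics for measured vs. calculated log(1/LC50[mM]) values."""
    m = np.asarray(measured, dtype=float)
    c = np.asarray(calculated, dtype=float)
    diff = c - m                       # assumed sign convention: calculated - measured
    r = np.corrcoef(m, c)[0, 1]        # Pearson correlation coefficient
    return {
        "r2": float(r ** 2),
        "rmse": float(np.sqrt(np.mean(diff ** 2))),
        "under": float(np.mean(diff < -1.0)),  # toxicity underestimated by more than 1 log unit
        "over": float(np.mean(diff > 1.0)),    # toxicity overestimated by more than 1 log unit
    }
```

Applying the same function to the subset of chemicals that pass a domain judgement gives the in-domain rows of such a comparison.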

Table 3.

Statistical information comparing measured and calculated fish log(1/LC50[mM]) of 287 test set compounds. The complete results are shown in Appendix 5-1.

Table 4.

Statistical information between measured and calculated Daphnia log(1/EC50[mM]) for 98 test set compounds. The complete results are shown in Appendix 5-2.

Secondly, in fish, the RMSE for any of the in-domain subsets was smaller than when domains were not considered. However, the r2 for the log P in-domain subset showed no particular improvement. For daphnia, r2 and RMSE for any of the in-domain subsets were larger and smaller, respectively, than those without considering domains. In the present study, the descriptor and/or structural domains were associated with a reduction in RMSE and in the fraction of underestimated chemicals, especially if both domains were considered simultaneously. Additionally, the stricter structural domain C(1) (shown in Tables 3 and 4) demonstrated better predictive performance than the structural domain C(2). The systematic study of the domain based on the atom-centred fragment (ACF) approach by Kuhne et al. [14] showed that the ACF varied with respect to its size in terms of the path length, and the ACF match mode was specified in terms of degree of strictness. They also demonstrated a clear relationship between predictive performance and the levels of the ACF definition and match mode [14]. Even though the definitions of substructures for the domain are different, the improvement obtained by using C-judgement is similar in concept to that obtained using the ACF approach. Thus, the log P range of the equation and C-judgement are useful for assessing the applicability of the QSAR results.

5. Summary

We have reported on the KATE system, encompassing a full list of classifications of the QSAR equations and KATE validations. In the KATE system chemicals are classified by their substructure. The QSAR equations express the correlation between log P and log(1/LC50) or log(1/EC50) of a chemical by simple linear regression analyses. The classes of QSAR equations are characterised by fragments of chemicals, except for the neutral organics class. The descriptor and structure domains, log P and C-judgements, in KATE were also introduced.

The cross-validation of the KATE system showed that QSAR equations with higher r2 and lower RMSE with n > 5 gave a reliably higher q2 than the other QSAR equations in KATE, meaning they had better predictive ability. A comparison of KATE, TIMES, and ECOSAR revealed that KATE was more accurate, due to end-point dependence. The use of log P and the C-judgement improved the statistical data. Thus the KATE system is a powerful tool for predicting acute toxicity in Oryzias latipes and Daphnia magna when the log P and C-judgement can be confirmed. Also, KATE has the potential to be useful in risk assessment.

The next topics in QSAR development will be to consider the reactivity of chemicals, and to include multi-regression analysis. The quantum chemical parameters, such as partial charges, are candidates for additional descriptors. Other ways of significantly increasing the reliability of toxicity prediction will be to improve the classification of the substructures, increase the reference data in a QSAR equation, and to refine the C-judgement.


Acknowledgements

KATE was researched and developed by the Research Center for Environmental Risk at the NIES, under contract to the Japanese MoE between 2004 and 2008. We also wish to thank the US EPA for permission to use KOWWIN in KATE on PAS, the standalone version of the KATE system. We are grateful to Mr K. Hasunuma and Ms K. Sugiyama for their support and encouragement with the KATE publication.


References

1. Netzeva T.I., Pavan M., Worth A.P. Review of (quantitative) structure–activity relationships for acute aquatic toxicity. QSAR Comb. Sci. 2008;27:77–90.
2. OECD. Report on the regulatory uses and applications in OECD member countries of (quantitative) structure–activity relationship [(Q)SAR] models in the assessment of new and existing chemicals. 2006. Environment Health and Safety Publications Series on Testing and Assessment, No. 58, OECD, Paris.
3. MoE. Japan ecotoxicity tests data. Available at
4. KATE. Available at Copyright (C) 2008–2009 Ministry of the Environment, Government of Japan, all rights reserved. It is cautioned that these QSAR results may not be used as ecotoxicity test results required for MoE submissions in compliance with the CSCL.
5. Mekenyan O.G., Dimitrov S.D., Pavlov T.S., Veith G.D. A systematic approach to simulating metabolism in computational toxicology. I. The TIMES heuristic modelling framework. Curr. Pharm. Des. 2004;10:1273–1293.
6. Dimitrov S.D., Mekenyan O.G., Sinks G.D., Schultz T.W. Global modeling of narcotic chemicals: Ciliate and fish toxicity. THEOCHEM. 2003;622:63–70.
8. US EPA. fathead minnow database. Available at
9. Russom C.L., Bradbury S.P., Broderius S.J., Hammermeister D.E., Drummond R.A. Predicting modes of toxic action from chemical structure: Acute toxicity in the fathead minnow (Pimephales promelas). Environ. Toxicol. Chem. 1997;16:948–967.
10. OECD. Report of the OECD Workshop on quantitative structure activity relationships (QSARs) in aquatic effects assessment. 1992. Environment Monographs, No. 58, OECD, Paris.
11. Clog P. Daylight Chemical Information Systems, Inc. Available at The underlying program, CLOG P, is copyrighted by Pomona College and BioByte, Inc., of Claremont, CA.
12. Dimitrov S., Dimitrova G., Pavlov T., Dimitrova N., Patlewicz G., Niemela J., Mekenyan O. A stepwise approach for defining the applicability domain of SAR and QSAR models. J. Chem. Inf. Model. 2005;45:839–849.
13. Schultz T.W., Hewitt M., Netzeva T.I., Cronin M.T.D. Assessing applicability domains of toxicological QSARs: Definition, confidence in predicted values, and the role of mechanisms of action. QSAR Comb. Sci. 2007;26:238–254.
14. Kuhne R., Ebert R.U., Schuurmann G. Chemical domain of QSAR models from atom-centered fragments. J. Chem. Inf. Model. 2009;49:2660–2669.
15. US EPA. KOWWIN™. Available at
16. Hulzebos E.M., Posthumus R. (Q)SARs: Gatekeepers against risk on chemicals? SAR QSAR Environ. Res. 2003;14:285–316.
17. Veith G.D., Broderius S.J. Rules for distinguishing toxicants that cause Type-I and Type-II Narcosis syndromes. Environ. Health Perspect. 1990;87:207–211.
18. Verhaar H.J.M., Vanleeuwen C.J., Hermens J.L.M. Classifying environmental pollutants: 1. Structure–activity relationships for prediction of aquatic toxicity. Chemosphere. 1992;25:471–491.
