The advent of structure-activity relationship (SAR) and quantitative SAR (QSAR) paradigms has allowed for the prediction of toxicants and the rational design of therapeutic agents based on their similarity in chemical structure or property to previously tested compounds [1]. Moreover, the utility of QSAR approaches to investigate sets of similarly shaped chemicals with discrete mechanisms of action (e.g., ligands to specific receptors) has been well demonstrated [2-4]. However, chemicals associated with adverse human health effects such as cancer are generally not amenable to traditional QSAR modelling for two reasons. First, there is great structural diversity among the chemicals being modelled for these endpoints, because the chemicals tested for potential carcinogenic effects are, for the most part, in use or will be in use for a myriad of purposes (e.g., industrial solvents, consumer products, pesticides, and drugs). Second, there is no generalized, a priori accepted mechanism of toxicity applicable to the entire set of compounds being modelled (e.g., a specific receptor for carcinogenesis).
The Computer Automated Structure Evaluation (CASE) program was developed over 20 years ago by Rosenkranz and Klopman to address these difficulties [5,6]. This SAR expert system was one of the first developed to efficiently and rapidly analyse large numbers of structurally diverse compounds without the need for any a priori mechanism of action. The CASE program successfully used 2-dimensional (2D) structural features, called biophores, that were found among the categorized active and inactive chemicals in the program's learning set and were associated with a particular biological, pharmacological, or toxicological activity. Other methods were also developed, including John Ashby's “structural alerts” to potential carcinogenicity [7-9], TOPKAT [10], and DEREK [11]. SAR models are increasingly being used by regulatory agencies worldwide for both human health [12] and ecological endpoints [13] (e.g., Oncologic by the U.S. Environmental Protection Agency [14] and CASE by the U.S. Food and Drug Administration's Center for Drug Evaluation and Research [15]).
Previously, we reported CASE/MultiCASE SAR models of mouse [17] and rat [18] carcinogenicity based on data from the Carcinogenic Potency Database (CPDB) [16]. In these studies, the rat and mouse SAR models had concordances between experimental and SAR-predicted values of 71 and 78%, respectively [17,18]. More recently, MCASE MC4PC and MDL-QSAR models developed by the FDA produced concordances of 66 and 69%, respectively [15].
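The concordance figures quoted here can be illustrated with a minimal sketch. It assumes that concordance means the fraction of compounds whose predicted class (active or inactive) agrees with the experimental class; the function name and the example labels are hypothetical and are not taken from the cited studies.

```python
def concordance(experimental, predicted):
    """Fraction of compounds whose predicted class matches the experimental class.

    Assumption: both arguments are equal-length sequences of class labels
    (here 1 = carcinogen, 0 = non-carcinogen).
    """
    if len(experimental) != len(predicted):
        raise ValueError("experimental and predicted must have the same length")
    if len(experimental) == 0:
        raise ValueError("at least one compound is required")
    matches = sum(e == p for e, p in zip(experimental, predicted))
    return matches / len(experimental)


# Hypothetical labels for eight compounds (illustration only).
experimental = [1, 1, 0, 0, 1, 0, 1, 0]
predicted = [1, 0, 0, 0, 1, 1, 1, 0]
print(f"Concordance: {concordance(experimental, predicted):.0%}")
```

In this toy case six of the eight calls agree, so the concordance is 75%.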
Our early CASE/MultiCASE models, while being predictive, also provided some insight into the structural underpinnings of carcinogenesis and were consistent with Ashby's “structural alerts” [7-9]. Recently, using the cat-SAR expert system, we developed models of rat mammary carcinogens [19] based on CPDB data [20,21]. One set of SAR models was based on a comparison of rat mammary carcinogens to rat non-carcinogens (MC-NC), and the second compared rat mammary carcinogens to rat non-mammary carcinogens (MC-NMC). While the MC-NC model was typical of carcinogen SAR models in comparing carcinogens to non-carcinogens (albeit for a specific tumour site), the MC-NMC model was unique in that its learning set contained carcinogens in both the active (i.e., mammary carcinogens) and inactive (i.e., carcinogens at sites other than the mammary gland) categories.
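The difference between the two learning sets can be sketched as a simple partition of the bioassay records. This is an illustration only: the `Compound` record, its field names, and the site label "mammary" are hypothetical and do not reflect the actual CPDB format or the cat-SAR input files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Compound:
    name: str                 # placeholder identifier
    carcinogen: bool          # overall rat bioassay call
    tumour_sites: frozenset   # sites with tumours; empty for non-carcinogens


def build_learning_sets(compounds):
    """Split compounds into the active/inactive categories of both models."""
    mc = [c for c in compounds if c.carcinogen and "mammary" in c.tumour_sites]
    nmc = [c for c in compounds if c.carcinogen and "mammary" not in c.tumour_sites]
    nc = [c for c in compounds if not c.carcinogen]
    return {
        "MC-NC": {"active": mc, "inactive": nc},    # carcinogens vs non-carcinogens
        "MC-NMC": {"active": mc, "inactive": nmc},  # carcinogens in both categories
    }


# Placeholder records (not real data).
compounds = [
    Compound("compound_A", True, frozenset({"mammary"})),
    Compound("compound_B", True, frozenset({"liver"})),
    Compound("compound_C", False, frozenset()),
]
sets = build_learning_sets(compounds)
print({model: {k: [c.name for c in v] for k, v in cats.items()}
       for model, cats in sets.items()})
```

Both models share the same active category; only the inactive category changes, which is why the MC-NMC model addresses tumour site rather than carcinogenicity as such.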
In that study, the rat MC-NC model achieved a concordance between experimental and predicted values of 84%, and the rat MC-NMC model achieved 78%. As such, both tissue-specific models were more concordant than previous models developed for whole-animal carcinogenesis. More importantly, however, the MC-NMC model was able to distinguish between different types of carcinogens (i.e., mammary carcinogens from all other carcinogens), rather than between carcinogens and non-carcinogens. Thus, the MC-NMC model identified structural attributes that addressed the question of “why do some carcinogens induce mammary cancer?”
However, even though these models are useful for predicting organ-specific carcinogenesis, they have limited applicability for mechanistic inquiry regarding organ-specific activity. One way to overcome this shortfall, as reported by Zhu et al., is to incorporate SAR descriptors that have direct biological relevance [22]. In that study, the authors described QSAR models for rodent carcinogenicity that were developed from chemical structural descriptors and high-throughput screening cytotoxicity descriptors for the modelled compounds; the QSAR model's concordance rose from 62.3% for the chemical-descriptor-only model to 72.7% when cell viability data were included [22].
Another way to include biologically relevant descriptors in SAR models is to obtain potential chemical-protein interaction data by virtual screening techniques. Accordingly, we developed and report herein a novel SAR modelling approach that uses descriptors derived not directly from chemical structure but from whether or not the compounds are potential ligands for a wide variety of proteins. These biologically based SAR descriptors are developed by virtual screening of the compounds in the model's learning set against a large and diverse set of proteins. The end result of this approach is a set of SAR descriptors associating chemical carcinogens and non-carcinogens with potential biological targets. This is in contrast to traditional SAR descriptors, which associate chemical carcinogens with chemical structures. Hence, this novel SAR modelling paradigm bridges the gap between chemical (sub)structure and observable toxicological phenomena by directly associating carcinogenic activity with biological targets. The activity part of the model is still derived empirically from an in vitro or in vivo experiment (e.g., a cancer bioassay). However, we demonstrate that the “structure” part of the SAR equation can be populated with biologically relevant descriptors (i.e., ligand-receptor interactions) rather than the traditional chemical descriptors (e.g., molecular weight, 2D fragments, or topological indices). This new SAR modelling paradigm could be referred to as (biologically relevant) structure-activity relationship modelling.
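To make the idea of ligand-based descriptors concrete, the following is a minimal sketch under stated assumptions: virtual screening is assumed to yield one docking score per compound-protein pair, a more negative score is assumed to mean stronger predicted binding, and a single cutoff decides whether a compound counts as a potential ligand. The cutoff value, the function names, and all compound and protein identifiers are hypothetical; this introduction does not specify the scoring scheme or threshold actually used.

```python
from collections import Counter

SCORE_CUTOFF = -7.0  # hypothetical threshold; more negative = stronger predicted binding


def ligand_descriptors(docking_scores, cutoff=SCORE_CUTOFF):
    """Map each compound to the set of proteins for which it is a potential ligand."""
    return {
        compound: {protein for protein, score in scores.items() if score <= cutoff}
        for compound, scores in docking_scores.items()
    }


def target_activity_counts(descriptors, activity):
    """Count, for each protein target, how many active and inactive compounds hit it."""
    counts = {}
    for compound, targets in descriptors.items():
        label = "active" if activity[compound] else "inactive"
        for target in targets:
            counts.setdefault(target, Counter())[label] += 1
    return counts


# Placeholder docking scores and activity calls (not real data).
docking_scores = {
    "compound_A": {"protein_1": -9.1, "protein_2": -4.0},
    "compound_B": {"protein_1": -8.2, "protein_2": -7.5},
    "compound_C": {"protein_1": -3.3, "protein_2": -8.0},
}
activity = {"compound_A": True, "compound_B": True, "compound_C": False}

descriptors = ligand_descriptors(docking_scores)
print(target_activity_counts(descriptors, activity))
```

In this toy example, protein_1 is hit only by active compounds, which is the kind of target-activity association that a ligand-based descriptor is meant to capture.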
In cross-validation, the ligand MC-NC model had a concordance between experimental and predicted results of 71%, and the ligand MC-NMC model was 72% concordant. Furthermore, the development of a hybrid fragment-ligand model improved the concordances to 85 and 83% for the two models, respectively. Therefore, the hybrid model, by bringing together chemical and biological descriptors, outperformed both the fragment-based and the ligand-based models. Just as important as the increased predictivity of the hybrid model, however, is the observation in an example case that the ligand model provided a more complete SAR-based rationale for chemical carcinogenesis by identifying specific biological targets with which carcinogens may interact. The fragment and ligand models together can therefore provide a description of chemical features associated with carcinogenic activity as well as biological targets potentially affected by chemical carcinogens.
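One plausible way to picture a hybrid fragment-ligand descriptor set is to pool both descriptor types for each compound, as sketched below. This is an assumption made for illustration: how the hybrid model actually combines the two descriptor types is not described in this section, and the function name, prefixes, and identifiers are hypothetical.

```python
def hybrid_descriptors(fragment_desc, ligand_desc):
    """Pool chemical-fragment and protein-target descriptors for each compound.

    Prefixes keep the two descriptor types distinguishable after pooling.
    """
    compounds = fragment_desc.keys() | ligand_desc.keys()
    return {
        compound: (
            {f"frag:{f}" for f in fragment_desc.get(compound, ())}
            | {f"lig:{t}" for t in ligand_desc.get(compound, ())}
        )
        for compound in compounds
    }


# Placeholder descriptors (not real data).
fragment_desc = {"compound_A": {"fragment_1", "fragment_2"}}
ligand_desc = {"compound_A": {"protein_1"}, "compound_B": {"protein_2"}}
print(hybrid_descriptors(fragment_desc, ligand_desc))
```

A compound missing from one descriptor source simply contributes descriptors from the other, so the pooled set never loses information relative to either source alone.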