Poor performance of scoring functions is a well-known bottleneck in structure-based virtual screening (VS), most frequently manifested in the scoring functions’ inability to discriminate true ligands from known non-binders (hence designated as binding decoys). This deficiency leads to a large number of false positive hits resulting from virtual screening. We have hypothesized that filtering out or penalizing docking poses recognized as non-native (i.e., pose decoys) should improve the performance of virtual screening in terms of improved identification of true binders. Using several concepts from the field of cheminformatics, we have developed a novel approach to identifying pose decoys from an ensemble of poses generated by computational docking procedures. We demonstrate that the use of a target-specific pose-scoring filter in combination with a physical force field-based scoring function (MedusaScore) leads to significant improvement of hit rates in virtual screening studies for 12 of the 13 benchmark sets from the clustered version of the Directory of Useful Decoys (DUD). This new hybrid scoring function outperforms several conventional structure-based scoring functions, including XSCORE∷HMSCORE, ChemScore, PLP, and Chemgauss3, in six of the 13 data sets at an early stage of VS (up to the top 1% of the screening database). We compare our hybrid method with several novel VS methods recently reported to perform well on the same DUD data sets. We find that the ligands retrieved using our method are chemically more diverse than those retrieved by two ligand-based methods (FieldScreen and FLAP∷LBX). We also compare our method with FLAP∷RBLB, a high-performance VS method that also utilizes both the receptor and the cognate ligand structures. Interestingly, we find that the top ligands retrieved using our method are highly complementary to those retrieved using FLAP∷RBLB, hinting at effective directions for the best VS applications.
We suggest that this integrative virtual screening approach combining cheminformatics and molecular mechanics methodologies may be applied to a broad variety of protein targets to improve the outcome of structure-based drug discovery studies.
There is a critical need for improving the level of chemistry awareness in systems biology. The data and information related to modulation of genes and proteins by small molecules continue to accumulate, while simulation tools in systems biology and whole-body physiologically-based pharmacokinetics (PBPK) continue to evolve. We have called this emerging area at the interface between chemical biology and systems biology "systems chemical biology" (SCB; Oprea et al., 2007).
The overarching goal of computational SCB is to develop tools for integrated chemical-biological data acquisition, filtering and processing, by taking into account relevant information related to interactions between proteins and small molecules, possible metabolic transformations of small molecules, as well as associated information related to genes, networks, small molecules and, where applicable, mutants and variants of those proteins. There remains an unmet need to develop an integrated in silico pharmacology / systems biology continuum that embeds drug-target-clinical outcome (DTCO) triplets, a capability that is vital to the future of chemical biology, pharmacology and systems biology. Through the development of the SCB approach, scientists will be able to start addressing, in an integrated simulation environment, questions that make the best use of our ever-growing chemical and biological data repositories at the system-wide level. This chapter reviews some of the major research concepts and describes key components that constitute the emerging area of computational systems chemical biology.
Physiologically-based pharmacokinetics (PBPK); biological networks; cheminformatics; QSAR modeling; biochemical network simulations; systems biology
The curated CSAR-NRC benchmark sets provide a valuable opportunity for testing and comparing the performance of both existing and novel scoring functions. We apply two different scoring functions, both independently and in combination, to predict the binding affinity of ligands in the CSAR-NRC datasets. The first, reported here for the first time, employs multiple chemical-geometrical descriptors of the protein-ligand interface to develop Quantitative Structure–Binding Affinity Relationship (QSBAR) models; these models are then used to predict the binding affinity of ligands in the external dataset. The second is a physical force field-based scoring function, MedusaScore. We show that both individual scoring functions achieve statistically significant prediction accuracy, with squared correlation coefficients (R2) between actual and predicted binding affinity of 0.44/0.53 (Set1/Set2) for the QSBAR models and 0.34/0.47 (Set1/Set2) for MedusaScore. Importantly, we find that combining the QSBAR models and MedusaScore into a consensus scoring function affords higher prediction accuracy than either of the contributing methods, achieving R2 of 0.45/0.58 (Set1/Set2). Furthermore, we identify several chemical features and non-covalent interactions that may be responsible for the inaccurate prediction of binding affinity for several ligands by the scoring functions employed in this study.
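The consensus step described above, combining two scoring functions that report on different scales, can be sketched as follows. The z-score normalization, the sign convention for MedusaScore energies, and all numeric values are illustrative assumptions for this sketch, not the published protocol:

```python
from statistics import mean, pstdev

def zscore(values):
    """Standardize scores so scoring functions on different scales are comparable."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def consensus(preds_a, preds_b):
    """Per-ligand average of z-scored predictions from two scoring functions."""
    return [mean(pair) for pair in zip(zscore(preds_a), zscore(preds_b))]

# toy example: four ligands scored by a QSBAR-like model (predicted affinity,
# higher = better) and a MedusaScore-like energy (lower = better)
qsbar = [6.1, 7.4, 5.2, 8.0]
medusa = [-32.0, -41.5, -28.3, -45.9]
combined = consensus(qsbar, [-e for e in medusa])  # negate energies so higher = better
```

In this toy set, the ligand ranked best by both contributing functions also tops the consensus ranking; the point of the normalization is that neither function dominates purely because of its numeric scale.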
Some antipsychotic drugs are known to cause valvular heart disease by activating serotonin 5-HT2B receptors. We have developed and validated binary classification QSAR models capable of predicting potential 5-HT2B binders. The classification accuracy of the models in discriminating 5-HT2B actives from inactives was as high as 80% for the external test set. These models were used to screen in silico 59,000 compounds included in the World Drug Index, and 122 compounds were predicted as actives with high confidence. Ten of them were tested in radioligand binding assays, and nine were found active, suggesting a success rate of 90%. All validated binders were then tested in functional assays, and one compound was identified as a true 5-HT2B agonist. We suggest that the QSAR models developed in this study could be used as reliable predictors to flag drug candidates that are likely to cause valvulopathy.
Remote loading of liposomes by trans-membrane gradients is used to achieve therapeutically efficacious intra-liposome concentrations of drugs. We have developed Quantitative Structure Property Relationship (QSPR) models of remote liposome loading for a dataset including 60 drugs studied in 366 loading experiments performed in-house or reported elsewhere. Both experimental conditions and computed chemical descriptors were employed as independent variables to predict the initial drug/lipid ratio (D/L) required to achieve high loading efficiency. Both binary models (to distinguish high vs. low initial D/L) and continuous models (to predict actual D/L values) were generated using advanced machine learning approaches and five-fold external validation. The external prediction accuracy for binary models was as high as 91–96%; for continuous models, the mean determination coefficient (R2) for regression between predicted and observed values was 0.76–0.79. We conclude that QSPR models can be used to identify candidate drugs expected to have high remote loading capacity while simultaneously optimizing the design of formulation experiments.
chemical descriptors; liposome; loading conditions; loading efficiency; QSPR; remote loading
The rapidly increasing amount of public data in chemistry and biology provides new opportunities for large-scale data mining for drug discovery. Systematic integration of these heterogeneous datasets, along with algorithms to mine the integrated whole, would permit investigation of the complex mechanisms of action of drugs. In this work we integrated and annotated data from public datasets relating to drugs, chemical compounds, protein targets, diseases, side effects, and pathways, building a semantic linked network consisting of over 290,000 nodes and 720,000 edges. We developed a statistical model to assess the association of drug–target pairs based on their relations with other linked objects. Validation experiments demonstrate that the model can correctly identify known direct drug–target pairs with high precision. Indirect drug–target pairs (for example, drugs that change gene expression levels) are also identified, though not as strongly as direct pairs. We further calculated association scores for 157 drugs from 10 disease areas against 1,683 human targets and measured their similarity using a score matrix. The similarity network indicates that drugs from the same disease area tend to cluster together in ways that are not captured by structural similarity, with several potential new drug pairings being identified. This work thus provides a novel, validated alternative to existing drug–target prediction algorithms. The web service is freely available at: http://chem2bio2rdf.org/slap.
Modern drug discovery requires an understanding of chemogenomics: the complex interactions of chemical compounds and drugs with a wide variety of protein targets and genes in the body. A large amount of data pertaining to such relationships exists in publicly accessible datasets, but it is siloed and thus impossible to use in an integrated fashion. In this work we have integrated and semantically annotated a large amount of public data from a wide range of databases, covering compound–gene, drug–drug, protein–protein, and drug–side-effect relationships, among others, to create a complex network of interactions relating to compounds and protein targets. We developed a statistical algorithm called Semantic Link Association Prediction (SLAP) for predicting “missing links” in this data network: i.e., compound–target interactions for which there is no experimental data but which are statistically probable given the other relationships in the set. We present validation experiments showing that this method works with a high degree of accuracy, and we also demonstrate how it can be used to create a drug similarity network to predict new indications for existing drugs.
Motivation: Advances in the field of cheminformatics have been hindered by a lack of freely available tools. We have created Chembench, a publicly available cheminformatics portal for analyzing experimental chemical structure–activity data. Chembench provides a broad range of tools for data visualization and embeds a rigorous workflow for creating and validating predictive Quantitative Structure–Activity Relationship models and using them for virtual screening of chemical libraries to prioritize the compound selection for drug discovery and/or chemical safety assessment.
Availability: Freely accessible at: http://chembench.mml.unc.edu
Drug discovery is the process of identifying compounds that have potentially meaningful biological activity. A major challenge is that the number of compounds to search over can be quite large, sometimes numbering in the millions, making exhaustive experimental testing intractable. For this reason, computational methods are employed to filter out those compounds that do not exhibit strong biological activity. This filtering step, also called virtual screening, reduces the search space, allowing the remaining compounds to be experimentally tested.
In this paper we propose several novel approaches to the problem of virtual screening based on Canonical Correlation Analysis (CCA) and on a kernel-based extension. Spectral learning ideas motivate our proposed new method, called Indefinite Kernel CCA (IKCCA). We show the strong performance of this approach both on a toy problem and on real-world data, with dramatic improvements in the predictive accuracy of virtual screening over an existing methodology.
Kernel methods; canonical correlation analysis; indefinite kernels; drug discovery; virtual screening
Evaluation of biological effects, both desired and undesired, caused by Manufactured NanoParticles (MNPs) is of critical importance for nanotechnology. Experimental studies, especially toxicological ones, are time-consuming, costly, and often impractical, calling for the development of efficient computational approaches capable of predicting biological effects of MNPs. To this end, we have investigated the potential of cheminformatics methods such as Quantitative Structure–Activity Relationship (QSAR) modeling to establish statistically significant relationships between measured biological activity profiles of MNPs and their physical, chemical, and geometrical properties, either measured experimentally or computed from the structure of the MNPs. To reflect the context of the study, we termed our approach Quantitative Nanostructure–Activity Relationship (QNAR) modeling. We employed two representative sets of MNPs studied recently using in vitro cell-based assays: (i) 51 various MNPs with diverse metal cores (PNAS, 2008, 105, pp 7387–7392) and (ii) 109 MNPs with a similar core but diverse surface modifiers (Nat. Biotechnol., 2005, 23, pp 1418–1423). We generated QNAR models using machine learning approaches such as Support Vector Machine (SVM)-based classification and k Nearest Neighbors (kNN)-based regression; their external prediction accuracy was shown to be as high as 73% for classification modeling and R2 as high as 0.72 for regression modeling. Our results suggest that QNAR models can be employed for: (i) predicting biological activity profiles of novel nanomaterials, and (ii) prioritizing the design and manufacturing of nanomaterials towards better and safer products.
nanoparticles; QSAR; cheminformatics; nanotoxicity; modeling
Molecular modelers and cheminformaticians typically analyze experimental data generated by other scientists. Consequently, when it comes to data accuracy, cheminformaticians are always at the mercy of data providers, who may inadvertently publish (partially) erroneous data. Dataset curation is therefore crucial for any cheminformatics analysis, such as similarity searching, clustering, QSAR modeling, or virtual screening, especially now that the availability of chemical datasets in the public domain has skyrocketed. Despite the obvious importance of this preliminary step in the computational analysis of any dataset, there appears to be no commonly accepted guidance or set of procedures for chemical data curation. The main objective of this paper is to emphasize the need for a standardized chemical data curation strategy that should be followed at the onset of any molecular modeling investigation. Herein, we discuss several simple but important steps for cleaning chemical records in a database, including the removal of the fraction of the data that cannot be appropriately handled by conventional cheminformatics techniques. Such steps include the removal of inorganic and organometallic compounds, counterions, salts, and mixtures; structure validation; ring aromatization; normalization of specific chemotypes; curation of tautomeric forms; and the deletion of duplicates. To emphasize the importance of data curation as a mandatory step in data analysis, we discuss several case studies where chemical curation of the original “raw” database enabled a successful modeling study (specifically, QSAR analysis) or resulted in a significant improvement of the model's prediction accuracy. We also demonstrate that in some cases rigorously developed QSAR models could even be used to correct erroneous biological data associated with chemical compounds.
We believe that good practices for curation of chemical records outlined in this paper will be of value to all scientists working in the fields of molecular modeling, cheminformatics, and QSAR studies.
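Two of the curation steps discussed above, stripping counterions/mixtures and deleting duplicate records, can be sketched on SMILES strings. This is a deliberately minimal illustration: a real pipeline would use a cheminformatics toolkit (e.g., RDKit) for structure validation, aromatization, tautomer handling, and canonicalization, and the largest-fragment heuristic and the short counterion list below are assumptions made only for this example:

```python
# common counterion/solvent fragments that signal an inorganic leftover (toy list)
INORGANIC_HINTS = {"[Na+]", "[Cl-]", "[K+]", "[Br-]", "O"}

def strip_counterions(smiles):
    """Keep the largest fragment of a dot-disconnected SMILES (drops salts/solvents)."""
    fragments = smiles.split(".")
    return max(fragments, key=len)

def curate(records):
    """records: list of (smiles, activity). Returns a de-duplicated, curated list."""
    seen, curated = set(), []
    for smiles, activity in records:
        parent = strip_counterions(smiles)
        if parent in INORGANIC_HINTS:   # nothing organic left -> discard record
            continue
        if parent in seen:              # duplicate structure -> keep first record only
            continue
        seen.add(parent)
        curated.append((parent, activity))
    return curated

raw = [
    ("CC(=O)Oc1ccccc1C(=O)O", 4.5),        # aspirin
    ("CC(=O)Oc1ccccc1C(=O)O.[Na+]", 4.6),  # aspirin sodium salt -> same parent, duplicate
    ("[Na+].[Cl-]", 0.0),                  # inorganic salt -> removed entirely
]
clean = curate(raw)
```

Note that duplicates here are detected on raw SMILES strings only; without canonicalization, two different SMILES spellings of the same molecule would slip through, which is exactly why the paper stresses standardized curation tools.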
Adverse effects of drugs (AEDs) continue to be a major cause of drug withdrawals both in development and post-marketing. While liver-related AEDs are a major concern for drug safety, there are few in silico models for predicting human liver toxicity for drug candidates. We have applied the Quantitative Structure Activity Relationship (QSAR) approach to model liver AEDs. In this study, we aimed to construct a QSAR model capable of binary classification (active vs. inactive) of drugs for liver AEDs based on chemical structure. To build QSAR models, we have employed an FDA spontaneous reporting database of human liver AEDs (elevations in activity of serum liver enzymes), which contains data on approximately 500 approved drugs. Approximately 200 compounds with wide clinical data coverage, structural similarity and balanced (40/60) active/inactive ratio were selected for modeling and divided into multiple training/test and external validation sets. QSAR models were developed using the k nearest neighbor method and validated using external datasets. Models with high sensitivity (>73%) and specificity (>94%) for prediction of liver AEDs in external validation sets were developed. To test applicability of the models, three chemical databases (World Drug Index, Prestwick Chemical Library, and Biowisdom Liver Intelligence Module) were screened in silico and the validity of predictions was determined, where possible, by comparing model-based classification with assertions in publicly available literature. Validated QSAR models of liver AEDs based on the data from the FDA spontaneous reporting system can be employed as sensitive and specific predictors of AEDs in pre-clinical screening of drug candidates for potential hepatotoxicity in humans.
Quantitative high-throughput screening (qHTS) assays are increasingly being used to inform chemical hazard identification. Hundreds of chemicals have been tested in dozens of cell lines across extensive concentration ranges by the National Toxicology Program in collaboration with the National Institutes of Health Chemical Genomics Center.
Our goal was to test a hypothesis that dose–response data points of the qHTS assays can serve as biological descriptors of assayed chemicals and, when combined with conventional chemical descriptors, improve the accuracy of quantitative structure–activity relationship (QSAR) models applied to prediction of in vivo toxicity end points.
We obtained cell viability qHTS concentration–response data for 1,408 substances assayed in 13 cell lines from PubChem; for a subset of these compounds, rodent acute toxicity half-maximal lethal dose (LD50) data were also available. We used the k nearest neighbor classification and random forest QSAR methods to model LD50 data using chemical descriptors either alone (conventional models) or combined with biological descriptors derived from the concentration–response qHTS data (hybrid models). Critical to our approach was the use of a novel noise-filtering algorithm to treat qHTS data.
Both the external classification accuracy and coverage (i.e., fraction of compounds in the external set that fall within the applicability domain) of the hybrid QSAR models were superior to conventional models.
Concentration–response qHTS data may serve as informative biological descriptors of molecules that, when combined with conventional chemical descriptors, may considerably improve the accuracy and utility of computational approaches for predicting in vivo animal toxicity end points.
acute toxicity; animal testing; computational toxicology; quantitative high-throughput screening; QSAR
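The hybrid-descriptor idea above can be illustrated with a toy k-nearest-neighbor classifier in which each compound is represented by its chemical descriptors concatenated with qHTS-derived biological descriptors. The descriptor values, the labels, and the absence of scaling or noise filtering are all simplifying assumptions of this sketch, not the study's actual protocol:

```python
from math import dist  # Euclidean distance (Python 3.8+)

def knn_predict(train, query, k=3):
    """k-nearest-neighbor class prediction by majority vote.
    train: list of (descriptor_vector, label); query: a descriptor vector."""
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    labels = [label for _, label in neighbors]
    return max(set(labels), key=labels.count)

# hybrid descriptors = chemical descriptors + qHTS viability responses (toy values)
train = [
    ([0.2, 1.1] + [95.0, 90.0], "nontoxic"),
    ([0.3, 1.0] + [97.0, 88.0], "nontoxic"),
    ([2.5, 0.1] + [20.0, 10.0], "toxic"),
    ([2.7, 0.2] + [15.0, 12.0], "toxic"),
]
query = [2.4, 0.3] + [22.0, 14.0]   # hypothetical test compound
pred = knn_predict(train, query, k=3)
```

Because the biological descriptors carry dose-response information that chemical structure alone may miss, compounds with similar cytotoxicity profiles become neighbors even when their chemical descriptors differ, which is the intuition behind the reported accuracy gain of the hybrid models.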
Few Quantitative Structure-Activity Relationship (QSAR) studies have successfully modeled large, diverse rodent toxicity endpoints. In this study, a comprehensive dataset of 7,385 compounds with their most conservative lethal dose (LD50) values has been compiled. A combinatorial QSAR approach has been employed to develop robust and predictive models of acute toxicity in rats caused by oral exposure to chemicals. To enable a fair comparison between the predictive power of models generated in this study and a commercial toxicity predictor, TOPKAT (Toxicity Prediction by Komputer Assisted Technology), a modeling subset of the entire dataset was selected that included all 3,472 compounds used in TOPKAT's training set. The remaining 3,913 compounds, which were not present in the TOPKAT training set, were used as the external validation set. QSAR models of five different types were developed for the modeling set. The prediction accuracy for the external validation set was estimated by the determination coefficient (R2) of linear regression between actual and predicted LD50 values. The use of an applicability domain threshold implemented in most models generally improved the external prediction accuracy but, as expected, led to a decrease in chemical space coverage; depending on the applicability domain threshold, R2 ranged from 0.24 to 0.70. Ultimately, several consensus models were developed by averaging the predicted LD50 for every compound across all five models. The consensus models afforded higher prediction accuracy for the external validation dataset, with higher coverage, compared to the individual constituent models. The validated consensus LD50 models developed in this study can be used as reliable computational predictors of in vivo acute toxicity.
acute toxicity; computational toxicology; LD50; oral exposure; QSAR; rat
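Consensus prediction with applicability-domain filtering, averaging only those models that cover a given compound, can be sketched as follows. Representing an out-of-domain prediction as None and all numeric values are illustrative assumptions of this sketch:

```python
from statistics import mean

def consensus_ld50(per_model_preds):
    """Average predicted LD50 across models, skipping models that declared the
    compound outside their applicability domain (prediction is None).
    Returns None when no model covers the compound, which lowers coverage."""
    covered = [p for p in per_model_preds if p is not None]
    return mean(covered) if covered else None

# hypothetical log(LD50) predictions from 5 models for three compounds;
# None = compound fell outside that model's applicability domain
preds = [
    [2.1, 2.3, None, 2.0, 2.2],
    [None, None, None, None, None],   # no coverage -> no consensus prediction
    [3.0, None, 3.4, None, 3.2],
]
consensus = [consensus_ld50(p) for p in preds]
```

This captures the trade-off described in the abstract: tighter applicability-domain thresholds raise accuracy for covered compounds but increase the number of compounds left without a prediction, whereas pooling five models recovers coverage without sacrificing accuracy.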
Geranylgeranylation is critical to the function of several proteins including Rho, Rap1, Rac, Cdc42, and G-protein gamma subunits. Geranylgeranyltransferase type I (GGTase-I) inhibitors (GGTIs) have therapeutic potential to treat inflammation, multiple sclerosis, atherosclerosis, and many other diseases. Following our standard QSAR modeling workflow, we have developed and rigorously validated Quantitative Structure Activity Relationship (QSAR) models for 48 GGTIs using variable selection k nearest neighbor (kNN), automated lazy learning (ALL), and partial least squares (PLS) methods. The QSAR models were employed for virtual screening of 9.5 million commercially available chemicals, yielding 47 diverse computational hits. Seven of these compounds with novel scaffolds and high predicted GGTase-I inhibitory activities were tested in vitro, and all were found to be bona fide and selective micromolar inhibitors. Notably, these novel hits could not be identified using traditional similarity search. These data demonstrate that rigorously developed QSAR models can serve as reliable virtual screening tools.
Many proteins change their conformation upon ligand binding. For instance, bacterial periplasmic binding proteins (bPBPs) that transport nutrients into the cytoplasm generally consist of two globular domains connected by strands forming a hinge. During ligand binding, hinge motion changes the conformation from the open to the closed form. Both forms can be crystallized without a ligand, suggesting that the energy difference between them is small. We applied Simplicial Neighborhood Analysis of Protein Packing (SNAPP) as a method to evaluate the relative stability of the open and closed forms of bPBPs. Using a united-residue representation of amino acids, SNAPP performs Delaunay tessellation of the protein, producing an aggregate of space-filling, irregular tetrahedra with nearest-neighbor residues at the vertices. The SNAPP statistical scoring function is derived from log-likelihood scores for all possible quadruplet compositions of amino acids found in a representative subset of the Protein Data Bank, and the sum of the scores for a given protein provides the total SNAPP score. The scoring results for bPBPs suggest that in most cases the unliganded form is more stable than the liganded form, and this conclusion is corroborated by similar observations on other proteins undergoing conformational changes upon binding their ligands. These studies suggest that the SNAPP method can be used to predict the relative stability of accessible protein conformations. Furthermore, the SNAPP method allows for delineation of the role of individual residues in protein stabilization, thereby providing new testable hypotheses for rational site-directed mutagenesis in the context of protein engineering.
Delaunay tessellation; periplasmic binding proteins; conformational stability change; differential SNAPP profile analysis; ligand binding
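The core of the SNAPP scoring scheme, summing composition-dependent log-likelihood scores over Delaunay quadruplets, can be sketched as follows. The frequency tables below are toy values; in SNAPP they are compiled from a representative subset of the Protein Data Bank, and the tessellation itself (omitted here) would supply the quadruplets:

```python
from math import log

# toy frequency tables keyed by the sorted amino-acid composition of a quadruplet
OBSERVED = {("ALA", "ILE", "LEU", "VAL"): 0.004, ("ASP", "LYS", "SER", "THR"): 0.002}
EXPECTED = {("ALA", "ILE", "LEU", "VAL"): 0.001, ("ASP", "LYS", "SER", "THR"): 0.002}

def snapp_total(quadruplets):
    """Total SNAPP score: sum of per-quadruplet log-likelihood scores,
    log(observed/expected), over all Delaunay quadruplets of a structure."""
    total = 0.0
    for quad in quadruplets:
        key = tuple(sorted(quad))   # composition is order-independent
        total += log(OBSERVED[key] / EXPECTED[key])
    return total

# two quadruplets as they might come out of a Delaunay tessellation
quads = [("LEU", "ALA", "VAL", "ILE"), ("LYS", "ASP", "THR", "SER")]
total = snapp_total(quads)  # log(4) + log(1)
```

Comparing the totals for the open and closed crystal forms of the same protein then gives the relative-stability estimate described above: the form with the higher total score is predicted to be the more favorably packed one.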
The Simplicial Neighborhood Analysis of Protein Packing (SNAPP) method was used to predict the effect of mutagenesis on the enzymatic activity of the HIV-1 protease (HIVP). SNAPP relies on a four-body statistical scoring function derived from the analysis of spatially nearest-neighbor residue compositional preferences in a diverse and representative subset of protein structures from the Protein Data Bank. The method was applied to the analysis of HIVP mutants with residue substitutions in the hydrophobic core as well as at the interface between the two protease monomers. Both wild type and tethered structures were employed in the calculations. We obtained a strong correlation, with R2 as high as 0.96, between the ΔSNAPP score (i.e., the difference in SNAPP scores between wild type and mutant proteins) and the protease catalytic activity for tethered structures. A weaker but still significant correlation was obtained for non-tethered structures. Our analysis identified residues both in the hydrophobic core and at the dimeric interface (DI) that are very important for protease function. This study demonstrates the potential utility of the SNAPP method for rational design of mutagenesis studies and protein engineering.
HIV-1 Protease (HIVP); Mutation; Tethered Dimer; Protein Packing; Delaunay Tessellation; Dimeric Interface (DI); Protein Stability; Catalytic Activity
Novel geometrical chemical descriptors have been derived based on the computational geometry of protein-ligand interfaces and Pauling atomic electronegativities (EN). Delaunay tessellation has been applied to a diverse set of 517 X-ray characterized protein-ligand complexes, yielding a unique collection of interfacial nearest-neighbor atomic quadruplets for each complex. Each quadruplet composition was characterized by a single descriptor calculated as the sum of the EN values for the four participating atom types. We termed these simple descriptors, generated from atomic EN values and derived with Delaunay tessellation, the ENTess descriptors, and used them in variable selection k-Nearest Neighbor quantitative structure-binding affinity relationship (QSBR) studies of 264 diverse protein-ligand complexes with known binding constants. Twenty-four complexes with chemically dissimilar ligands were set aside as an independent validation set, and the remaining dataset of 240 complexes was divided into multiple training and test sets. The best models were characterized by a leave-one-out cross-validated correlation coefficient q2 as high as 0.66 for the training set and a correlation coefficient R2 as high as 0.83 for the test set. The high predictive power of these models was confirmed independently by applying them to the validation set of 24 complexes, yielding R2 as high as 0.85. We conclude that QSBR models built with the ENTess descriptors can be instrumental for predicting the binding affinity of receptor-ligand complexes.
Receptor-Ligand Interactions; Delaunay Tessellation; k-Nearest Neighbors; Quantitative Structure-Activity Relationships; QSAR; Binding Affinity; Geometrical Chemical Descriptors; Model Validation; Consensus Prediction
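The ENTess descriptor for a single interfacial quadruplet, the sum of Pauling electronegativities of the four participating atoms, follows directly from the definition above and can be sketched as follows; the atom typing is simplified to bare element symbols for illustration:

```python
# Pauling electronegativities for a few common atom types (standard values)
PAULING_EN = {"C": 2.55, "N": 3.04, "O": 3.44, "S": 2.58, "H": 2.20}

def entess_descriptor(quadruplet):
    """ENTess-style descriptor for one interfacial Delaunay quadruplet:
    the sum of Pauling EN values of the four participating atoms."""
    assert len(quadruplet) == 4
    return sum(PAULING_EN[a] for a in quadruplet)

def complex_descriptors(quadruplets):
    """Descriptor values for all quadruplets at a protein-ligand interface."""
    return [entess_descriptor(q) for q in quadruplets]

# two hypothetical interfacial quadruplets (protein and ligand atoms mixed)
interface = [("C", "N", "O", "C"), ("C", "C", "S", "H")]
vals = complex_descriptors(interface)
```

In the published workflow these per-quadruplet values feed the variable selection kNN QSBR models; the tessellation step that produces the quadruplets from the 3D complex is omitted here.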
We have developed a novel structure-based approach to search for Complementary Ligands Based on Receptor Information (CoLiBRI). CoLiBRI is based on the representation of both receptor binding sites and their respective ligands in a space of universal chemical descriptors. The binding site atoms involved in the interaction with ligands are identified by means of a computational geometry technique known as Delaunay tessellation, as applied to X-ray characterized ligand-receptor complexes. TAE/RECON multiple chemical descriptors are calculated independently for each ligand as well as for its active site atoms. The representation of both ligands and active sites using chemical descriptors allows the application of well-known chemometric techniques to correlate chemical similarities between active sites and their respective ligands. From these calculations, we have established a protocol to map patterns of nearest-neighbor active site vectors in a multidimensional TAE/RECON space onto those of their complementary ligands, and vice versa. This protocol affords the prediction of a virtual complementary ligand vector in the ligand chemical space from the position of a known active site vector. This prediction is followed by chemical similarity calculations between this virtual ligand vector and those calculated for molecules in a chemical database to identify real compounds most similar to the virtual ligand. Consequently, knowledge of the receptor active site structure affords straightforward and efficient identification of its complementary ligands in large databases of chemical compounds using rapid chemical similarity searches. Conversely, starting from the ligand chemical structure, one may identify possible complementary receptor cavities as well. We have applied the CoLiBRI approach to a dataset of 800 X-ray characterized ligand-receptor complexes in the PDBbind database.
Using a k nearest neighbor (kNN) pattern recognition approach and variable selection, we have shown that knowledge of the active site structure affords identification of its complementary ligand among the top 1% of a large chemical database in over 90% of all test active sites, provided that a binding site of the same protein family was present in the training set. In cases where test receptors are highly dissimilar and not present among the receptor families in the training set, the prediction accuracy decreases; however, CoLiBRI was still able to quickly eliminate 75% of the chemical database as improbable ligands. The CoLiBRI approach provides an efficient prescreening tool for large chemical databases prior to traditional, yet much more computationally intensive, three-dimensional docking approaches.
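The final similarity-search step of CoLiBRI, ranking database compounds by descriptor-space similarity to the predicted virtual ligand vector, can be sketched as follows. Cosine similarity and the three-dimensional toy vectors are illustrative stand-ins for the high-dimensional TAE/RECON descriptor space and for whatever similarity metric the published method actually employs:

```python
from math import sqrt

def cosine_similarity(u, v):
    """Cosine of the angle between two descriptor vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_similarity(virtual_ligand, database):
    """Rank database compounds by descriptor-space similarity to the
    virtual complementary-ligand vector predicted from the active site."""
    return sorted(database,
                  key=lambda item: cosine_similarity(virtual_ligand, item[1]),
                  reverse=True)

virtual = [0.8, 0.1, 0.5]   # hypothetical virtual ligand vector
db = [
    ("cmpd_A", [0.7, 0.2, 0.6]),
    ("cmpd_B", [0.1, 0.9, 0.0]),
    ("cmpd_C", [0.9, 0.0, 0.4]),
]
ranked = rank_by_similarity(virtual, db)
```

Because this ranking is a cheap vector operation, it scales to large databases and serves as the prescreening filter before the far more expensive 3D docking step.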
A combined approach of validated QSAR modeling and virtual screening was successfully applied to the discovery of novel tylophorine derivatives as anticancer agents. QSAR models were initially developed for 52 chemically diverse phenanthrene-based tylophorine derivatives (PBTs) with known experimental EC50 values, using chemical topological descriptors (calculated with the MolConnZ program) and the variable selection k nearest neighbor (kNN) method. Several validation protocols were applied to achieve robust QSAR models. The original dataset was divided into multiple training and test sets, and models were considered acceptable only if the leave-one-out cross-validated R2 (q2) values were greater than 0.5 for the training sets and the correlation coefficient R2 values were greater than 0.6 for the test sets. Furthermore, the q2 values for the actual dataset were shown to be significantly higher than those obtained for the same dataset with randomized target properties (Y-randomization test), indicating that the models were statistically significant. The ten best models were then employed to mine the commercially available ChemDiv database (ca. 500K compounds), resulting in 34 consensus hits with moderate to high predicted activities. Ten structurally diverse hits were experimentally tested, and eight were confirmed active, with the best experimental EC50 of 1.8 µM, implying an exceptionally high hit rate (80%). The same ten models were further applied to predict EC50 values for four new PBTs, and the correlation coefficient (R2) between the experimental and predicted EC50 for these compounds plus the eight active consensus hits was shown to be as high as 0.57. Our studies suggest that the approach combining validated QSAR modeling and virtual screening can be successfully used as a general tool for the discovery of novel biologically active compounds.
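The model-acceptance criterion used above relies on the leave-one-out cross-validated q2, defined as 1 minus the predictive residual sum of squares (PRESS) over the total sum of squares. A minimal sketch, where the toy predictions stand in for actual leave-one-out estimates:

```python
from statistics import mean

def q2_loo(y_actual, y_loo_pred):
    """Leave-one-out cross-validated q2 = 1 - PRESS / SS_total."""
    y_mean = mean(y_actual)
    press = sum((a - p) ** 2 for a, p in zip(y_actual, y_loo_pred))
    ss_total = sum((a - y_mean) ** 2 for a in y_actual)
    return 1 - press / ss_total

y = [1.0, 2.0, 3.0, 4.0]    # hypothetical activities
loo = [1.1, 1.9, 3.2, 3.8]  # hypothetical leave-one-out predictions
q2 = q2_loo(y, loo)         # models with q2 > 0.5 were accepted
```

In a Y-randomization test, the same statistic is recomputed after shuffling the activity values among the compounds; a genuinely significant model should yield q2 values near or below zero on the shuffled data.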
The increasing availability of data related to genes, proteins and their modulation by small molecules, paralleled by the emergence of simulation tools in systems biology, has provided a vast amount of biological information. However, there is a critical need to develop cheminformatics tools that can integrate chemical knowledge with these biological databases, with the goal of creating systems chemical biology.
Accurate prediction of in vivo toxicity from in vitro testing is a challenging problem. Large public–private consortia have been formed with the goal of improving chemical safety assessment by means of high-throughput screening.
A wealth of available biological data requires new computational approaches to link chemical structure, in vitro data, and potential adverse health effects.
Methods and results
A database containing experimental in vitro cytotoxicity values (half-maximal inhibitory concentration, IC50) and in vivo rodent median lethal doses (LD50) for more than 300 chemicals was compiled by the Zentralstelle zur Erfassung und Bewertung von Ersatz- und Ergaenzungsmethoden zum Tierversuch (ZEBET; National Center for Documentation and Evaluation of Alternative Methods to Animal Experiments). The application of conventional quantitative structure–activity relationship (QSAR) modeling approaches to predict mouse or rat acute LD50 values from chemical descriptors of the ZEBET compounds yielded no statistically significant models, and analysis of the data showed no significant overall correlation between IC50 and LD50. However, a linear IC50 versus LD50 correlation could be established for a fraction of the compounds. To capitalize on this observation, we developed a novel two-step modeling approach. As a preliminary step, all chemicals were partitioned into two groups based on the relationship between their IC50 and LD50 values: one group comprising compounds with a linear IC50 versus LD50 relationship and another comprising the remaining compounds. In the first step, we built conventional binary classification QSAR models to predict group affiliation from chemical descriptors only. In the second step, we developed k-nearest neighbor (kNN) continuous QSAR models for each subclass to predict LD50 values from chemical descriptors. All models were extensively validated using special protocols.
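A minimal sketch of the two-step scheme, assuming scikit-learn is available and using synthetic data in place of the ZEBET set (the group structure, descriptor values, and LD50 relationships below are all invented for illustration):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

rng = np.random.default_rng(1)

# Hypothetical descriptor matrix: two sub-populations with different
# descriptor profiles and different toy "LD50" structure-activity trends.
X0 = rng.normal(loc=-1.0, size=(80, 4))   # group 0: linear IC50/LD50 compounds
X1 = rng.normal(loc=+1.0, size=(80, 4))   # group 1: remaining compounds
y0 = X0[:, 0] + 3.0                       # invented LD50 values per group
y1 = -X1[:, 1] + 5.0
X = np.vstack([X0, X1])
y = np.concatenate([y0, y1])
group = np.array([0] * 80 + [1] * 80)

# Step 1: classify group affiliation from chemical descriptors only.
clf = KNeighborsClassifier(n_neighbors=5).fit(X, group)

# Step 2: a separate kNN regressor per group predicts LD50 from descriptors.
regs = {g: KNeighborsRegressor(n_neighbors=3).fit(X[group == g], y[group == g])
        for g in (0, 1)}

def predict_ld50(x):
    """Hierarchical prediction: route to the predicted group, then regress."""
    g = clf.predict(x.reshape(1, -1))[0]
    return regs[g].predict(x.reshape(1, -1))[0]

x_new = rng.normal(loc=-1.0, size=4)      # a new compound resembling group 0
pred = predict_ld50(x_new)
```

The key property this sketch mirrors is that, at prediction time, only chemical descriptors are needed: the in vivo/in vitro relationship informs model construction but never enters the external prediction.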
The novelty of this modeling approach is that it uses the relationships between in vivo and in vitro data only to inform the initial construction of the hierarchical two-step QSAR models. Models resulting from this approach employ chemical descriptors only for external prediction of acute rodent toxicity.
acute toxicity; computational toxicology; IC50; LD50; LOAEL; NOAEL; QSAR
A novel Automated Lazy Learning Quantitative Structure–Activity Relationship (ALL-QSAR) modeling approach has been developed based on lazy learning theory. The activity of a test compound is predicted from a locally weighted linear regression model that uses the chemical descriptors and biological activities of the training set compounds most chemically similar to the test compound. The weights with which training set compounds enter the regression depend on their similarity to the test compound. We have applied the ALL-QSAR method to several experimental chemical data sets, including 48 anticonvulsant agents with known ED50 values, 48 dopamine D1-receptor antagonists with known competitive binding affinities (Ki), and a Tetrahymena pyriformis data set containing 250 phenolic compounds with toxicity IGC50 values. When applied to database screening, models developed for the anticonvulsant agents identified several known anticonvulsant compounds that were not only absent from the training set but also highly chemically dissimilar to the training set compounds. This initial success indicates that ALL-QSAR can be further exploited as a general tool for accurate bioactivity prediction and database screening in drug design and discovery. Owing to its local nature, the ALL-QSAR approach appears especially well suited for developing highly predictive models from sparse or unevenly distributed data sets.
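The lazy-learning idea, fitting a locally weighted linear regression on the fly for each query compound, can be sketched as below. The Gaussian similarity kernel and bandwidth are illustrative assumptions, not the exact weighting scheme of ALL-QSAR, and the data are synthetic.

```python
import numpy as np

def lazy_predict(X, y, x_query, bandwidth=1.0):
    """Locally weighted linear regression in the lazy-learning spirit:
    no global model is stored; training compounds are weighted by their
    similarity to the query, and a linear model is fit per prediction."""
    d = np.linalg.norm(X - x_query, axis=1)
    w = np.exp(-(d / bandwidth) ** 2)            # similarity weights (assumed kernel)
    A = np.hstack([X, np.ones((len(X), 1))])     # design matrix with intercept
    W = np.diag(w)
    # Weighted least squares: (A^T W A) beta = A^T W y
    beta = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)
    return np.append(x_query, 1.0) @ beta

# Synthetic nonlinear "activity" surface: a global linear model would fail,
# but the local fit tracks the surface near each query.
rng = np.random.default_rng(2)
X = rng.uniform(-2, 2, size=(100, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1]
pred = lazy_predict(X, y, np.array([0.5, 0.5]), bandwidth=0.8)
```

Because each prediction is fit locally, accuracy degrades gracefully in sparse regions, which is the property the abstract attributes to ALL-QSAR on unevenly distributed data sets.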
To develop efficient approaches for rapid evaluation of chemical toxicity and human health risk of environmental compounds, the National Toxicology Program (NTP), in collaboration with the NIH Chemical Genomics Center, has initiated a project on high-throughput screening (HTS) of environmental chemicals. The first HTS results, for a set of 1,408 compounds tested for their effects on cell viability in six different cell lines, have recently become available via PubChem.
We have explored these data in terms of their utility for predicting adverse health effects of the environmental agents.
Methods and results
Initially, the classification k nearest neighbor (kNN) quantitative structure–activity relationship (QSAR) modeling method was applied to the HTS data alone, for a curated set of 384 compounds. The resulting models achieved prediction accuracies as high as 89% and 71% for the training and test sets (275 compounds combined) and 74% for the external validation set (109 compounds). We then asked whether HTS results could be of value in predicting rodent carcinogenicity. We identified 383 compounds for which data were available from both the Berkeley Carcinogenic Potency Database and the NTP–HTS studies. Compounds classified by HTS as “actives” in at least one cell line were likely to be rodent carcinogens (sensitivity 77%); however, HTS “inactives” were far less informative (specificity 46%). Using chemical descriptors only, kNN QSAR modeling of this data set achieved 62.3% prediction accuracy for rodent carcinogenicity. Importantly, the prediction accuracy improved significantly (to 72.7%) when the chemical descriptors were augmented by HTS data, which were treated as biological descriptors.
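The descriptor-augmentation idea, concatenating HTS profiles onto chemical descriptors as "biological descriptors", can be sketched as follows, assuming scikit-learn. The data are synthetic and deliberately constructed so that only the HTS profiles carry signal; the actual NTP–HTS and carcinogenicity data are not reproduced here.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(3)
n = 500
label = rng.integers(0, 2, size=n)       # toy carcinogenicity calls

# In this toy setup the chemical descriptors alone carry no signal...
chem = rng.normal(size=(n, 3))
# ...while the HTS cell-viability profiles separate the two classes.
hts = rng.normal(loc=np.where(label[:, None] == 1, 1.0, -1.0),
                 scale=0.5, size=(n, 3))

# Hybrid matrix: chemical descriptors augmented by biological descriptors.
hybrid = np.hstack([chem, hts])

train, test = np.arange(300), np.arange(300, n)

def accuracy(X):
    clf = KNeighborsClassifier(n_neighbors=5).fit(X[train], label[train])
    return (clf.predict(X[test]) == label[test]).mean()

acc_chem, acc_hybrid = accuracy(chem), accuracy(hybrid)
```

With informative HTS profiles, the hybrid model outperforms the chemistry-only model, mirroring the 62.3% to 72.7% improvement reported above.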
Our studies suggest that combining NTP–HTS profiles with conventional chemical descriptors could considerably improve the predictive power of computational approaches in toxicology.
carcinogenesis; computational toxicology; high-throughput screening; QSAR