Quantitative Structure-Activity Relationship modeling is one of the major computational tools employed in medicinal chemistry. However, throughout its entire history it has drawn both praise and criticism concerning its reliability, limitations, successes, and failures. In this paper, we discuss: (i) the development and evolution of QSAR; (ii) the current trends, unsolved problems, and pressing challenges; and (iii) several novel and emerging applications of QSAR modeling. Throughout this discussion, we provide guidelines for QSAR development, validation, and application, which are summarized in best practices for building rigorously validated and externally predictive QSAR models. We hope that this Perspective will help communications between computational and experimental chemists towards collaborative development and use of QSAR models. We also believe that the guidelines presented here will help journal editors and reviewers apply more stringent scientific standards to manuscripts reporting new QSAR studies, as well as encourage the use of high quality, validated QSARs for regulatory decision making.
Drug-induced cholestasis is an important form of acquired liver disease and is associated with significant morbidity and mortality. Bile acids are key signaling molecules, but they can exert toxic responses when they accumulate in hepatocytes. This review focuses on the physiological mechanisms of drug-induced cholestasis associated with altered bile acid homeostasis due to direct (e.g. bile acid transporter inhibition) or indirect (e.g. activation of nuclear receptors, altered function/expression of bile acid transporters) processes. Mechanistic information about the effects of a drug on bile acid homeostasis is important when evaluating the cholestatic potential of a compound, but experimental data often are not available. The relationship between physicochemical properties, pharmacokinetic parameters, and inhibition of the bile salt export pump (BSEP) among seventy-seven cholestatic drugs with different pathophysiological mechanisms of cholestasis (i.e. impaired formation of bile vs. physical obstruction of bile flow) was investigated. The utility of in silico models to obtain mechanistic information about the impact of compounds on bile acid homeostasis to aid in predicting the cholestatic potential of drugs is highlighted.
Drug-induced cholestasis; bile acid; transporters; physicochemical properties; pharmacokinetic parameters; in silico modeling
To develop accurate in silico predictors of Plasma Protein Binding (PPB).
Experimental PPB data were compiled for over 1,200 compounds. Two endpoints have been considered: (1) fraction bound (%PPB); and (2) the logarithm of a pseudo binding constant (lnKa) derived from %PPB. The latter metric was employed because it reflects the PPB thermodynamics and the distribution of the transformed data is closer to normal. Quantitative Structure-Activity Relationship (QSAR) models were built with Dragon descriptors and three statistical methods.
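The transformation from %PPB to a pseudo binding constant can be sketched as follows. This is a minimal illustration that assumes a simple logit form, lnKa = ln(fb / (1 - fb)) with fb = %PPB/100. The abstract does not give the exact expression, so any scaling constant is omitted, and the function names and the clipping bound `eps` are hypothetical.

```python
import math

def ppb_to_lnka(ppb_percent, eps=0.1):
    """Convert fraction bound (%PPB) to a logit-style pseudo binding constant.

    Assumption: lnKa = ln(fb / (1 - fb)); the source abstract does not state
    the exact formula. Values are clipped to [eps, 100 - eps] percent so that
    0% and 100% do not produce infinities.
    """
    clipped = min(max(ppb_percent, eps), 100.0 - eps)
    fb = clipped / 100.0
    return math.log(fb / (1.0 - fb))

def lnka_to_ppb(lnka):
    """Inverse transform: map lnKa back to %PPB."""
    return 100.0 / (1.0 + math.exp(-lnka))

# Highly bound compounds are spread out on the lnKa scale.
for ppb in (50.0, 90.0, 99.0, 99.9):
    print(f"%PPB={ppb:5.1f} -> lnKa={ppb_to_lnka(ppb):.2f}")
```

Under this assumed form, compounds at 99% and 99.9% bound differ by less than one percentage point in %PPB but by more than two units in lnKa, which illustrates why a log-type endpoint can resolve highly bound compounds better.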
A five-fold external validation procedure resulted in models with prediction accuracy (R2) of 0.67±0.04 and 0.66±0.04 and mean absolute error (MAE) of 15.3±0.2% and 13.6±0.2% for the two endpoints, respectively. Models were validated with two external datasets: 173 compounds from DrugBank, and 236 chemicals from the US EPA ToxCast project. Models built with lnKa were significantly more accurate (MAE of 6.2–10.7%) than those built with %PPB (MAE of 11.9–17.6%) for highly bound compounds both for the training and the external sets.
The pseudo binding constant (lnKa) is more appropriate for characterizing PPB than the conventional %PPB. The validated QSAR models developed herein can be applied as reliable tools in early drug development and in chemical risk assessment.
machine learning; %PPB; drug fraction bound; ADMET; pharmacokinetics
Clozapine is a particularly effective antipsychotic medication but its use is curtailed by the risk of clozapine-induced agranulocytosis/granulocytopenia (CIAG), a severe adverse drug reaction occurring in up to 1% of treated individuals. Identifying genetic risk factors for CIAG could enable safer and more widespread use of clozapine. Here we perform the largest and most comprehensive genetic study of CIAG to date by interrogating 163 cases using genome-wide genotyping and whole-exome sequencing. We find that two loci in the major histocompatibility complex are independently associated with CIAG: a single amino acid in HLA-DQB1 (126Q) (P=4.7×10−14, odds ratio, OR=0.19, 95% CI 0.12–0.29) and an amino acid change in the extracellular binding pocket of HLA-B (158T) (P=6.4×10−10, OR=3.3, 95% CI 2.3–4.9). These associations dovetail with the roles of these genes in immunogenetic phenotypes and adverse drug responses for other medications, and provide insight into the pathophysiology of CIAG.
Summary: We report on the development of the high-throughput screening (HTS) Navigator software to analyze and visualize the results of HTS of chemical libraries. The HTS Navigator processes output files from different plate readers' formats, computes the overall HTS matrix, automatically detects hits and has different types of baseline navigation and correction features. The software incorporates advanced cheminformatics capabilities such as chemical structure storage and visualization, fast similarity search and chemical neighborhood analysis for retrieved hits. The software is freely available for academic laboratories.
Availability and implementation:
Supplementary data are available at Bioinformatics online.
We introduce a simple MODelability Index (MODI) that estimates the feasibility of obtaining predictive QSAR models (Correct Classification Rate above 0.7) for a binary dataset of bioactive compounds. MODI is defined as an activity class-weighted ratio of the number of the nearest neighbor pairs of compounds with the same activity class versus the total number of pairs. The MODI values were calculated for more than 100 datasets and the threshold of 0.65 was found to separate non-modelable from the modelable datasets.
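The MODI definition above can be sketched in a few lines. This is an illustrative reading of the definition, not the authors' implementation: it assumes Euclidean distance on the descriptor vectors and uses only the first nearest neighbor of each compound; the function name `modi` and the toy data are hypothetical.

```python
import math

def modi(descriptors, labels):
    """Estimate a MODI-style score for a labeled dataset.

    For each compound, find its nearest neighbor (Euclidean distance, an
    assumption of this sketch) and check whether both share a class label.
    The per-class fractions are then averaged so each class has equal weight.
    """
    same = {}   # class -> compounds whose nearest neighbor has the same class
    total = {}  # class -> number of compounds in the class
    for i, (x, y) in enumerate(zip(descriptors, labels)):
        nearest = min(
            (j for j in range(len(descriptors)) if j != i),
            key=lambda j: math.dist(x, descriptors[j]),
        )
        total[y] = total.get(y, 0) + 1
        same[y] = same.get(y, 0) + (labels[nearest] == y)
    return sum(same[c] / total[c] for c in total) / len(total)

# Toy data: two well-separated clusters, one per activity class.
X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 1.0), (1.1, 1.0), (1.0, 1.1)]
y = [0, 0, 0, 1, 1, 1]
score = modi(X, y)
print(score, "modelable" if score >= 0.65 else "non-modelable")
```

Averaging the per-class fractions, rather than pooling all compounds, keeps a large majority class from dominating the score. The 0.65 cutoff in the last line is the threshold reported in the abstract.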
The 5-hydroxytryptamine 1A (5-HT1A) serotonin receptor
has been an attractive target for treating mood and anxiety disorders
such as schizophrenia. We have developed binary classification quantitative
structure–activity relationship (QSAR) models of 5-HT1A receptor binding activity using data retrieved from the PDSP Ki database. The prediction accuracy of these
models was estimated by external 5-fold cross-validation as well as
using an additional validation set comprising 66 structurally distinct
compounds from the World of Molecular Bioactivity database. These
validated models were then used to mine three major types of chemical
screening libraries, i.e., drug-like libraries, GPCR targeted libraries,
and diversity libraries, to identify novel computational hits. The
five best hits from each class of libraries were chosen for further
experimental testing in radioligand binding assays, and nine of the
15 hits were confirmed to be active experimentally with binding affinity
better than 10 μM. The most active compound, Lysergol, from
the diversity library showed very high binding affinity (Ki) of 2.3 nM against 5-HT1A receptor. The novel
5-HT1A actives identified with the QSAR-based virtual screening
approach could be potentially developed as novel anxiolytics or potential
Previously we have developed and statistically validated Quantitative Structure Property Relationship (QSPR) models that correlate drugs’ structural, physical and chemical properties as well as experimental conditions with the relative efficiency of remote loading of drugs into liposomes (Cern et al., Journal of Controlled Release, 160 (2012) 147–157). Herein, these models have been used to virtually screen a large drug database to identify novel candidate molecules for liposomal drug delivery. Computational hits were considered for experimental validation based on their predicted remote loading efficiency as well as additional considerations such as availability, recommended dose and relevance to the disease. Three compounds were selected for experimental testing which were confirmed to be correctly classified by our previously reported QSPR models developed with Iterative Stochastic Elimination (ISE) and k-nearest neighbors (kNN) approaches. In addition, 10 new molecules with known liposome remote loading efficiency that were not used in QSPR model development were identified in the published literature and employed as an additional model validation set. The external accuracy of the models was found to be as high as 82% or 92%, depending on the model. This study presents the first successful application of QSPR models for the computer-model-driven design of liposomal drugs.
Liposomes; Remote loading; QSPR; Virtual screening; Iterative Stochastic Elimination, k-nearest neighbors
Quantitative Structure-Activity Relationship (QSAR) modeling and toxicogenomics are used independently as predictive tools in toxicology. In this study, we evaluated the power of several statistical models for predicting drug hepatotoxicity in rats using different descriptors of drug molecules, namely their chemical descriptors and toxicogenomic profiles. The records were taken from the Toxicogenomics Project rat liver microarray database containing information on 127 drugs (http://toxico.nibio.go.jp/datalist.html). The model endpoint was hepatotoxicity in the rat following 28 days of exposure, established by liver histopathology and serum chemistry. First, we developed multiple conventional QSAR classification models using a comprehensive set of chemical descriptors and several classification methods (k nearest neighbor, support vector machines, random forests, and distance weighted discrimination). With chemical descriptors alone, external predictivity (Correct Classification Rate, CCR) from 5-fold external cross-validation was 61%. Next, the same classification methods were employed to build models using only toxicogenomic data (24h after a single exposure) treated as biological descriptors. The optimized models used only 85 selected toxicogenomic descriptors and had CCR as high as 76%. Finally, hybrid models combining both chemical descriptors and transcripts were developed; their CCRs were between 68 and 77%. Although the accuracy of hybrid models did not exceed that of the models based on toxicogenomic data alone, the use of both chemical and biological descriptors enriched the interpretation of the models. In addition to finding 85 transcripts that were predictive and highly relevant to the mechanisms of drug-induced liver injury, chemical structural alerts for hepatotoxicity were also identified. 
These results suggest that concurrent exploration of the chemical features and acute treatment-induced changes in transcript levels will both enrich the mechanistic understanding of sub-chronic liver injury and afford models capable of accurate prediction of hepatotoxicity from chemical structure and short-term assay results.
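The Correct Classification Rate (CCR) used to compare these models can be computed as sketched below. The abstract does not spell out the formula; this sketch assumes the common definition of CCR as the mean of sensitivity and specificity (balanced accuracy) for a binary endpoint. The function name and the example labels are hypothetical.

```python
def correct_classification_rate(y_true, y_pred):
    """Return CCR, assumed here to be 0.5 * (sensitivity + specificity).

    Labels are 1 (toxic) and 0 (non-toxic). Raises ValueError if either
    class is missing from y_true, because the rate is then undefined.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return 0.5 * (sensitivity + specificity)

# Hypothetical external fold: 4 toxic and 6 non-toxic compounds.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
ccr = correct_classification_rate(y_true, y_pred)
print(f"CCR = {ccr:.3f}")
```

Unlike plain accuracy, this measure weights both classes equally, so a model cannot score well by always predicting the more frequent class.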
Quantitative Structure Activity Relationship (QSAR) modeling; toxicogenomics; biological descriptors; hepatotoxicity
The ability of Mycobacterium tuberculosis (Mtb) to
survive in low oxygen environments enables the bacterium to persist in a latent
state within host tissues. In vitro studies of Mtb growth have identified
changes in isocitrate lyase (ICL) and malate synthase (MS) that enable bacterial
persistence under low oxygen and other environmentally limiting conditions.
Systems chemical biology (SCB) enables us to evaluate the effects of small
molecule inhibitors not only on the reactions catalyzed by malate synthase and
isocitrate lyase, but also on the complete tricarboxylic acid (TCA) cycle
by taking into account complex network relationships within that system.
To study the kinetic consequences of inhibition on persistent bacilli, we
implement a systems-chemical biology (SCB) platform and perform a
chemistry-centric analysis of key metabolic pathways believed to impact Mtb
latency. We explore consequences of disrupting the function of malate synthase
(MS) and isocitrate lyase (ICL) during aerobic and hypoxic non-replicating
persistence (NRP) growth by using the SCB method to identify small molecules
that inhibit the function of MS and ICL, and simulating the metabolic
consequence of the disruption.
Results indicate variations in target and non-target reaction steps,
clear differences between the normal and low-oxygen models, as well as a
dose-dependent response. Simulation results from single and combined enzyme
inhibition strategies suggest ICL may be the more effective target for
chemotherapeutic treatment against Mtb growing in a microenvironment where
oxygen is slowly depleted, which may favor persistence.
Biological networks; cheminformatics; biochemical network simulations; systems biology; chemical biology; Mycobacterium tuberculosis
Identification of Endocrine Disrupting Chemicals is one of the important goals of environmental chemical hazard screening. We report on the development of validated in silico predictors of chemicals likely to cause Estrogen Receptor (ER)-mediated endocrine disruption to facilitate their prioritization for future screening. A database of relative binding affinity of a large number of ERα and/or ERβ ligands was assembled (546 for ERα and 137 for ERβ). Both single-task learning (STL) and multi-task learning (MTL) continuous Quantitative Structure-Activity Relationships (QSAR) models were developed for predicting ligand binding affinity to ERα or ERβ. High predictive accuracy was achieved for ERα binding affinity (MTL R2=0.71, STL R2=0.73). For ERβ binding affinity, MTL models were significantly more predictive (R2=0.53, p<0.05) than STL models. In addition, docking studies were performed on a set of ER agonists/antagonists (67 agonists and 39 antagonists for ERα, 48 agonists and 32 antagonists for ERβ, supplemented by putative decoys/non-binders) using the following ER structures (in complexes with respective ligands) retrieved from the Protein Data Bank: ERα agonist (PDB ID: 1L2I), ERα antagonist (PDB ID: 3DT3), ERβ agonist (PDB ID: 2NV7), ERβ antagonist (PDB ID: 1L2J). We found that all four ER conformations discriminated their corresponding ligands from presumed non-binders. Finally, both QSAR models and ER structures were employed in parallel to virtually screen several large libraries of environmental chemicals to derive a ligand- and structure-based prioritized list of putative estrogenic compounds to be used for in vitro and in vivo experimental validation.
Endocrine Disrupting Chemicals; Estrogen Receptor; Quantitative Structure-Activity Relationships modeling; Multi-Task Learning; Docking; Virtual Screening
We report on the prediction accuracy of ligand-based (2D QSAR) and structure-based (MedusaDock) methods used both independently and in consensus for ranking the congeneric series of ligands binding to three protein targets (UK, ERK2, and CHK1) from the CSAR 2011 benchmark exercise. An ensemble of predictive QSAR models was developed using known binders of these three targets extracted from the publicly available ChEMBL database. Selected models were used to predict the binding affinity of CSAR compounds towards the corresponding targets and rank them accordingly; the overall ranking accuracy evaluated by Spearman correlation was as high as 0.78 for UK, 0.60 for ERK2, and 0.56 for CHK1, placing our predictions in the top 10% among all participants. In parallel, MedusaDock, designed to predict reliable docking poses, was also used for ranking the CSAR ligands according to their docking scores; the resulting accuracies (Spearman correlation) for UK, ERK2, and CHK1 were 0.76, 0.31, and 0.26, respectively. In addition, the performance of several consensus approaches combining MedusaDock and QSAR predicted ranks was explored; the best approach yielded Spearman correlation coefficients for UK, ERK2, and CHK1 of 0.82, 0.50, and 0.45, respectively. This study shows that (i) externally validated 2D QSAR models were capable of ranking CSAR ligands at least as accurately as more computationally intensive structure-based approaches used both by us and by other groups and (ii) ligand-based QSAR models can complement structure-based approaches by boosting prediction performance when used in consensus.
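The ranking evaluation and a simple consensus scheme can be sketched as follows. The abstract does not say which consensus approaches were tried, so averaging the two rank lists is an assumption of this sketch. All values are made-up toy data, and both score lists are assumed to be oriented so that higher means stronger predicted binding (docking energies would be negated first).

```python
def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks

def spearman(a, b):
    """Spearman correlation computed as the Pearson correlation of the ranks."""
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5

def consensus_by_mean_rank(scores_a, scores_b):
    """Combine two methods by averaging the ranks they assign to each ligand."""
    ra, rb = average_ranks(scores_a), average_ranks(scores_b)
    return [(x + y) / 2 for x, y in zip(ra, rb)]

# Toy data for five ligands (not taken from the CSAR benchmark).
experimental = [5.1, 6.3, 7.0, 7.8, 8.4]
qsar_pred = [5.5, 6.0, 7.4, 7.1, 8.0]
docking_pred = [6.0, 5.0, 7.0, 9.0, 8.0]
consensus = consensus_by_mean_rank(qsar_pred, docking_pred)
for name, scores in [("QSAR", qsar_pred), ("docking", docking_pred), ("consensus", consensus)]:
    print(f"{name}: rho = {spearman(experimental, scores):.2f}")
```

In this toy case the two methods make different ranking mistakes, so the averaged ranks agree with experiment better than either method alone. As the reported results show, that outcome is not guaranteed on real targets.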
Traditional read-across approaches typically rely on the chemical similarity principle to predict chemical toxicity; however, the accuracy of such predictions is often inadequate due to the underlying complex mechanisms of toxicity. Here we report on the development of a hazard classification and visualization method that draws upon both chemical structural similarity and comparisons of biological responses to chemicals measured in multiple short-term assays (“biological” similarity). The Chemical-Biological Read-Across (CBRA) approach infers each compound's toxicity from those of both chemical and biological analogs whose similarities are determined by the Tanimoto coefficient. Classification accuracy of CBRA was compared to that of classical RA and other methods using chemical descriptors alone, or in combination with biological data. Different types of adverse effects (hepatotoxicity, hepatocarcinogenicity, mutagenicity, and acute lethality) were classified using several biological data types (gene expression profiling and cytotoxicity screening). CBRA-based hazard classification exhibited consistently high external classification accuracy and applicability to diverse chemicals. Transparency of the CBRA approach is aided by the use of radial plots that show the relative contribution of analogous chemical and biological neighbors. Identification of both chemical and biological features that give rise to the high accuracy of CBRA-based toxicity prediction facilitates mechanistic interpretation of the models.
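A read-across prediction of this kind can be sketched as a similarity-weighted vote over chemical and biological neighbors. The weighting scheme below (pooling the k most similar chemical and the k most similar biological analogs and weighting each by its Tanimoto similarity) is an assumption made for illustration; the abstract names the Tanimoto coefficient but not the exact formula. Function names, the value of k, and the toy profiles are hypothetical.

```python
def tanimoto(x, y):
    """Tanimoto coefficient for real-valued vectors.

    For 0/1 fingerprints this equals |A and B| / |A or B|.
    """
    dot = sum(a * b for a, b in zip(x, y))
    denom = sum(a * a for a in x) + sum(b * b for b in y) - dot
    return dot / denom if denom else 0.0

def top_neighbors(query, profiles, k):
    """Return (similarity, index) pairs for the k profiles most similar to query."""
    scored = [(tanimoto(query, p), i) for i, p in enumerate(profiles)]
    return sorted(scored, reverse=True)[:k]

def read_across(chem_query, bio_query, chem_train, bio_train, activity, k=2):
    """Similarity-weighted mean activity over chemical and biological neighbors."""
    neighbors = top_neighbors(chem_query, chem_train, k)
    neighbors += top_neighbors(bio_query, bio_train, k)
    weight = sum(s for s, _ in neighbors)
    if weight == 0:
        return None  # no informative neighbors
    return sum(s * activity[i] for s, i in neighbors) / weight

# Toy training set: binary fingerprints, assay profiles, and toxicity labels.
chem_train = [[1, 1, 0, 0], [1, 0, 1, 0], [0, 0, 1, 1]]
bio_train = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.9]]
activity = [1, 0, 0]  # 1 = toxic, 0 = non-toxic

score = read_across([1, 1, 0, 1], [0.8, 0.2], chem_train, bio_train, activity)
print("toxic" if score >= 0.5 else "non-toxic", round(score, 2))
```

Because each neighbor's weight is explicit, its contribution to the final score can be listed or plotted, which is the kind of transparency the radial plots are meant to provide.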
Recent, highly anticipated structural characterizations of agonist-bound and antagonist-bound beta-2 adrenoreceptor (β2AR) by X-ray crystallography have been widely regarded as critical advances to enable more effective structure-based discovery of GPCR ligands. It appears that this very important development may have undermined many previous efforts to develop 3D theoretical models of GPCRs. To address this question directly, we have compared several historical β2AR models with the inactive-state and nanobody-stabilized active-state β2AR crystal structures in terms of their structural similarity and effectiveness in virtual screening for β2AR-specific agonists and antagonists. Theoretical models, including both homology and de novo types, were collected from five different groups who have published extensively in the field of GPCR modeling; all models were built before X-ray structures became available. In general, β2AR theoretical models differ significantly from the crystal structures in terms of TMH definition and global packing. Nevertheless, surprisingly, several models afforded hit rates in virtual screening of a large chemical library enriched with known β2AR ligands that exceeded those obtained using X-ray structures; the hit rates were notably higher for agonists. Furthermore, the screening performance of the models is associated with local structural quality, such as the RMSDs for binding pocket residues and the ability to capture accurately most if not all critical protein/ligand interactions. These results suggest that carefully built models of GPCRs can capture critical chemical and structural features of the binding pocket and thus may be even more useful for practical structure-based drug discovery than X-ray structures.
GPCRs modeling; crystallography; beta-2 adrenoreceptor; agonist-bound; antagonist-bound; molecular docking; enrichment factor
Sec14-like phosphatidylinositol transfer proteins (PITPs) integrate diverse territories of intracellular lipid metabolism with stimulated phosphatidylinositol-4-phosphate production, and are discriminating portals for interrogating phosphoinositide signaling. Yet, neither Sec14-like PITPs, nor PITPs in general, have been exploited as targets for chemical inhibition for such purposes. Herein, we validate the first small molecule inhibitors (SMIs) of the yeast PITP Sec14. These SMIs are nitrophenyl(4-(2-methoxyphenyl)piperazin-1-yl)methanones (NPPMs), and are effective inhibitors in vitro and in vivo. We further establish Sec14 is the sole essential NPPM target in yeast, that NPPMs exhibit exquisite targeting specificities for Sec14 (relative to related Sec14-like PITPs), propose a mechanism for how NPPMs exert their inhibitory effects, and demonstrate NPPMs exhibit exquisite pathway selectivity in inhibiting phosphoinositide signaling in cells. These data deliver proof-of-concept that PITP-directed SMIs offer new and generally applicable avenues for intervening with phosphoinositide signaling pathways with selectivities superior to those afforded by contemporary lipid kinase-directed strategies.
Membrane transporters mediate many biological effects of chemicals and play a major role in pharmacokinetics and drug resistance. The selection of viable drug candidates among biologically active compounds requires the assessment of their transporter interaction profiles.
Using public sources, we have assembled and curated the largest, to our knowledge, human intestinal transporter database (>5,000 interaction entries for >3,700 molecules). This data was used to develop thoroughly validated classification Quantitative Structure-Activity Relationship (QSAR) models of transport and/or inhibition of several major transporters including MDR1, BCRP, MRP1-4, PEPT1, ASBT, OATP2B1, OCT1, and MCT1.
Results & Conclusions
QSAR models have been developed with advanced machine learning techniques such as Support Vector Machines, Random Forest, and k Nearest Neighbors using Dragon and MOE chemical descriptors. These models afforded high external prediction accuracies of 71–100% estimated by 5-fold external validation, and showed hit retrieval rates with up to 20-fold enrichment in the virtual screening of DrugBank compounds. The compendium of predictive QSAR models developed in this study can be used for virtual profiling of drug candidates and/or environmental agents with the optimal transporter profiles.
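The fold enrichment quoted for virtual screening can be illustrated with a short calculation. This sketch assumes the usual definition of enrichment as the hit rate among model-selected compounds divided by the hit rate in the whole library; the function name and all counts are hypothetical and are not taken from the study.

```python
def enrichment_factor(hits_selected, n_selected, actives_total, n_total):
    """Ratio of the hit rate in the selected subset to the library-wide hit rate."""
    if n_selected == 0 or actives_total == 0 or n_total == 0:
        raise ValueError("counts must be non-zero")
    hit_rate_selected = hits_selected / n_selected
    hit_rate_library = actives_total / n_total
    return hit_rate_selected / hit_rate_library

# Hypothetical screen: 50 known actives in a 5,000-compound library (1%);
# the model selects 100 compounds, 20 of which are known actives (20%).
ef = enrichment_factor(hits_selected=20, n_selected=100,
                       actives_total=50, n_total=5000)
print(f"enrichment = {ef:.1f}-fold")
```

An enrichment of 1 means the model does no better than random selection from the library.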
membrane transport proteins; ADMET; drug transport; permeability; efflux
Quantitative structure–activity relationship (QSAR) models have been developed for a dataset of 3133 compounds defined as either active or inactive against P. falciparum. Since the dataset was strongly biased towards inactive compounds, different sampling approaches were employed to balance the ratio of actives vs. inactives, and models were rigorously validated using both internal and external validation approaches. The balanced accuracy for assessing the antimalarial activities of 70 external compounds was between 87% and 100% depending on the approach used to balance the dataset. Virtual screening of the ChemBridge database using QSAR models identified 176 putative antimalarial compounds that were submitted for experimental validation, along with 42 putative inactives as negative controls. Twenty-five (14.2%) computational hits were found to have antimalarial activities with minimal cytotoxicity to mammalian cells, while all 42 putative inactives were confirmed experimentally. Structural inspection of confirmed active hits revealed novel chemical scaffolds, which could be employed as starting points to discover novel antimalarial agents.
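One way to balance a dataset dominated by inactives is random undersampling of the majority class, sketched below. The abstract does not say which sampling approaches were used, so this is only one plausible option; the function name, the record layout, and the toy dataset are hypothetical.

```python
import random

def undersample_majority(records, label_of, seed=0):
    """Randomly down-sample larger classes to the size of the smallest class.

    records  : list of arbitrary items (e.g. (compound_id, label) tuples)
    label_of : function that returns the class label of a record
    seed     : fixed seed so the sampled subset is reproducible
    """
    rng = random.Random(seed)
    by_class = {}
    for record in records:
        by_class.setdefault(label_of(record), []).append(record)
    target = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced

# Toy dataset: 10 actives and 90 inactives.
dataset = [(f"cmpd{i}", "active" if i < 10 else "inactive") for i in range(100)]
balanced = undersample_majority(dataset, label_of=lambda r: r[1])
print(len(balanced), sum(1 for r in balanced if r[1] == "active"))
```

Undersampling discards data from the majority class, so in practice it is often repeated with different seeds and the resulting models are combined.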
Antimalarial activity; quantitative structure–activity relationships; virtual screening; experimental confirmation
The CAPRI and CASP prediction experiments have demonstrated the power of community wide tests of methodology in assessing the current state of the art and spurring progress in the very challenging areas of protein docking and structure prediction. We sought to bring the power of community wide experiments to bear on a very challenging protein design problem that provides a complementary but equally fundamental test of current understanding of protein-binding thermodynamics. We have generated a number of designed protein-protein interfaces with very favorable computed binding energies but which do not appear to be formed in experiments, suggesting there may be important physical chemistry missing in the energy calculations. 28 research groups took up the challenge of determining what is missing: we provided structures of 87 designed complexes and 120 naturally occurring complexes and asked participants to identify energetic contributions and/or structural features that distinguish between the two sets. The community found that electrostatics and solvation terms partially distinguish the designs from the natural complexes, largely due to the non-polar character of the designed interactions. Beyond this polarity difference, the community found that the designed binding surfaces were on average structurally less embedded in the designed monomers, suggesting that backbone conformational rigidity at the designed surface is important for realization of the designed function. These results can be used to improve computational design strategies, but there is still much to be learned; for example, one designed complex, which does form in experiments, was classified by all metrics as a non-binder.
Accurate prediction of the structure of protein-protein complexes in computational docking experiments remains a formidable challenge. It has been recognized that identifying native or native-like poses among multiple decoys is the major bottleneck of the current scoring functions used in docking. We have developed a novel multi-body pose-scoring function that has no theoretical limit on the number of residues contributing to the individual interaction terms. We use a coarse-grained representation of a protein-protein complex where each residue is represented by its side chain centroid. We apply a computational geometry approach called Almost-Delaunay tessellation that transforms protein-protein complexes into a residue contact network, or an undirected graph where residues are nodes connected by edges. This treatment forms a family of interfacial graphs representing a dataset of protein-protein complexes. We then employ a frequent subgraph mining approach to identify common interfacial residue patterns that appear in at least a subset of native protein-protein interfaces. The geometrical parameters and frequency of occurrence of each “native” pattern in the training set are used to develop the new SPIDER scoring function. SPIDER was validated using the standard “ZDOCK” benchmark dataset, which was not used in the development of SPIDER. We demonstrate that the SPIDER scoring function ranks native and native-like poses above geometrical decoys and that it outperforms the popular ZRANK scoring function. SPIDER was ranked among the top scoring functions in a recent round of the CAPRI (Critical Assessment of PRedicted Interactions) blind test of protein–protein docking methods.
Bioinformatics; Amino acids; Centroids; Statistical potential; Delaunay tessellation; Subgraph mining; Motifs; Coarse-grained; ZDOCK; CAPRI
We have devised a chemocentric informatics methodology for drug discovery integrating independent approaches to mining biomolecular databases. As a proof of concept, we have searched for novel putative cognition enhancers. First, we generated Quantitative Structure-Activity Relationship (QSAR) models of compounds binding to 5-hydroxytryptamine-6 receptor (5-HT6R), a known target for cognition enhancers, and employed these models for virtual screening to identify putative 5-HT6R actives. Second, we queried chemogenomics data from the Connectivity Map (http://www.broad.mit.edu/cmap/) with the gene expression profile signatures of Alzheimer’s disease patients to identify compounds putatively linked to the disease. Thirteen common hits were tested in 5-HT6R radioligand binding assays and ten were confirmed as actives. Four of them were known selective estrogen receptor modulators that were never reported as 5-HT6R ligands. Furthermore, nine of the confirmed actives were reported elsewhere to have memory-enhancing effects. The approaches discussed herein can be used broadly to identify novel drug-target-disease associations.
Quantitative structure-activity relationship (QSAR) models are widely used for in silico prediction of in vivo toxicity of drug candidates or environmental chemicals, adding value to candidate selection in drug development or in a search for less hazardous and more sustainable alternatives for chemicals in commerce. The development of traditional QSAR models is enabled by numerical descriptors representing the inherent chemical properties that can be easily defined for any number of molecules; however, traditional QSAR models often have limited predictive power due to the lack of data and complexity of in vivo endpoints. Although it has been indeed difficult to obtain experimentally derived toxicity data on a large number of chemicals in the past, the results of quantitative in vitro screening of thousands of environmental chemicals in hundreds of experimental systems are now available and continue to accumulate. In addition, publicly accessible toxicogenomics data collected on hundreds of chemicals provide another dimension of molecular information that is potentially useful for predictive toxicity modeling. These new characteristics of molecular bioactivity arising from short-term biological assays, i.e., in vitro screening and/or in vivo toxicogenomics data can now be exploited in combination with chemical structural information to generate hybrid QSAR–like quantitative models to predict human toxicity and carcinogenicity. Using several case studies, we illustrate the benefits of a hybrid modeling approach, namely improvements in the accuracy of models, enhanced interpretation of the most predictive features, and expanded applicability domain for wider chemical space coverage.
QSAR; toxicity screening; hybrid modeling