Related Articles

1.  MedusaScore: An Accurate Force-Field Based Scoring Function for Virtual Drug Screening 
Virtual screening is becoming an important tool for drug discovery. However, the application of virtual screening has been limited by the lack of accurate scoring functions. Here, we present a novel scoring function, MedusaScore, for evaluating protein-ligand binding. MedusaScore is based on models of physical interactions that include van der Waals, solvation and hydrogen bonding energies. To ensure the best transferability of the scoring function, we do not use any protein-ligand experimental data for parameter training. We then test the MedusaScore for docking decoy recognition and binding affinity prediction and find superior performance compared to other widely used scoring functions. Statistical analysis indicates that one source of inaccuracy of MedusaScore may arise from the unaccounted entropic loss upon ligand binding, which suggests avenues of approach for further MedusaScore improvement.
doi:10.1021/ci8001167
PMCID: PMC2665000  PMID: 18672869
2.  Cheminformatics Meets Molecular Mechanics: A Combined Application of Knowledge-based Pose Scoring and Physical Force Field-based Hit Scoring Functions Improves the Accuracy of Structure-Based Virtual Screening 
Poor performance of scoring functions is a well-known bottleneck in structure-based virtual screening, which is most frequently manifested in the scoring functions’ inability to discriminate between true ligands and known non-binders (therefore designated as binding decoys). This deficiency leads to a large number of false positive hits resulting from virtual screening. We have hypothesized that filtering out or penalizing docking poses recognized as non-native (i.e., pose decoys) should improve the performance of virtual screening in terms of improved identification of true binders. Using several concepts from the field of cheminformatics, we have developed a novel approach to identifying pose decoys from an ensemble of poses generated by computational docking procedures. We demonstrate that the use of a target-specific pose(-scoring) filter in combination with a physical force field-based scoring function (MedusaScore) leads to significant improvement of hit rates in virtual screening studies for 12 of the 13 benchmark sets from the clustered version of the Directory of Useful Decoys (DUD). This new hybrid scoring function outperforms several conventional structure-based scoring functions, including XSCORE::HMSCORE, ChemScore, PLP, and Chemgauss3, in six out of 13 data sets at the early stage of VS (up to 1% decoys of the screening database). We compare our hybrid method with several novel VS methods that were recently reported to have good performances on the same DUD data sets. We find that the retrieved ligands using our method are chemically more diverse in comparison with two ligand-based methods (FieldScreen and FLAP::LBX). We also compare our method with FLAP::RBLB, a high-performance VS method that also utilizes both the receptor and the cognate ligand structures. Interestingly, we find that the top ligands retrieved using our method are highly complementary to those retrieved using FLAP::RBLB, hinting at effective directions for best VS applications. 
We suggest that this integrative virtual screening approach combining cheminformatics and molecular mechanics methodologies may be applied to a broad variety of protein targets to improve the outcome of structure-based drug discovery studies.
doi:10.1021/ci2002507
PMCID: PMC3264743  PMID: 22017385
3.  Predicting binding affinity of CSAR ligands using both structure-based and ligand-based approaches 
We report on the prediction accuracy of ligand-based (2D QSAR) and structure-based (MedusaDock) methods used both independently and in consensus for ranking the congeneric series of ligands binding to three protein targets (UK, ERK2, and CHK1) from the CSAR 2011 benchmark exercise. An ensemble of predictive QSAR models was developed using known binders of these three targets extracted from the publicly available ChEMBL database. Selected models were used to predict the binding affinity of CSAR compounds towards the corresponding targets and rank them accordingly; the overall ranking accuracy evaluated by Spearman correlation was as high as 0.78 for UK, 0.60 for ERK2, and 0.56 for CHK1, placing our predictions in the top 10% among all the participants. In parallel, MedusaDock, which is designed to predict reliable docking poses, was also used for ranking the CSAR ligands according to their docking scores; the resulting accuracies (Spearman correlation) for UK, ERK2, and CHK1 were 0.76, 0.31, and 0.26, respectively. In addition, the performance of several consensus approaches combining MedusaDock and QSAR predicted ranks has been explored; the best approach yielded Spearman correlation coefficients for UK, ERK2, and CHK1 of 0.82, 0.50, and 0.45, respectively. This study shows that (i) externally validated 2D QSAR models were capable of ranking CSAR ligands at least as accurately as more computationally intensive structure-based approaches used both by us and by other groups and (ii) ligand-based QSAR models can complement structure-based approaches by boosting the prediction performances when used in consensus.
doi:10.1021/ci400216q
PMCID: PMC3779696  PMID: 23809015
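The abstract of entry 3 evaluates rankings by Spearman correlation and combines QSAR and docking ranks in consensus. The following is a minimal sketch of both ideas, not the authors' protocol: the abstract does not say which consensus schemes were tried, so plain rank averaging is an assumption here, as are all function names. Both score lists are assumed to be oriented the same way (higher value means stronger predicted binding).

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def consensus_ranks(scores_a, scores_b):
    """Hypothetical consensus: average the ranks given by two methods."""
    ranks_a, ranks_b = average_ranks(scores_a), average_ranks(scores_b)
    return [(a + b) / 2 for a, b in zip(ranks_a, ranks_b)]
```

A consensus ranking produced this way can be passed to `spearman` together with the experimental affinities to compare it against each individual method.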
4.  Automated large-scale file preparation, docking, and scoring: Evaluation of ITScore and STScore using the 2012 Community Structure-Activity Resource Benchmark 
In this study, we use the recently released 2012 Community Structure-Activity Resource (CSAR) Dataset to evaluate two knowledge-based scoring functions, ITScore and STScore, and a simple force-field-based potential (VDWScore). The CSAR Dataset contains 757 compounds, most with known affinities, and 57 crystal structures. With the help of the script files for docking preparation, we use the full CSAR Dataset to evaluate the performances of the scoring functions on binding affinity prediction and active/inactive compound discrimination. The CSAR subset that includes crystal structures is used as well, to evaluate the performances of the scoring functions on binding mode and affinity predictions. Within this structure subset, we investigate the importance of accurate ligand and protein conformational sampling and find that the binding affinity predictions are less sensitive to non-native ligand and protein conformations than the binding mode predictions. We also find the full CSAR Dataset to be more challenging in making binding mode predictions than the subset with structures. The script files used for preparing the CSAR Dataset for docking, including scripts for canonicalization of the ligand atoms, are offered freely to the academic community.
doi:10.1021/ci400045v
PMCID: PMC3755023  PMID: 23656179
CSAR; community structure-activity resource; protein-ligand docking; knowledge-based scoring functions
5.  istar: A Web Platform for Large-Scale Protein-Ligand Docking 
PLoS ONE  2014;9(1):e85678.
Protein-ligand docking is a key computational method in the design of starting points for the drug discovery process. We are motivated by the desire to automate large-scale docking using our popular docking engine idock and thus have developed a publicly accessible web platform called istar. Without tedious software installation, users can submit jobs using our website. Our istar website supports 1) filtering ligands by desired molecular properties and previewing the number of ligands to dock, 2) monitoring job progress in real time, and 3) visualizing ligand conformations and outputting free energy and ligand efficiency predicted by idock, binding affinity predicted by RF-Score, putative hydrogen bonds, and supplier information for easy purchase, three useful features commonly lacking on other online docking platforms like DOCK Blaster or iScreen. We have collected 17,224,424 ligands from the All Clean subset of the ZINC database, and revamped our docking engine idock to version 2.0, further improving docking speed and accuracy, and integrating RF-Score as an alternative rescoring function. To compare idock 2.0 with the state-of-the-art AutoDock Vina 1.1.2, we have carried out a rescoring benchmark and a redocking benchmark on the 2,897 and 343 protein-ligand complexes of PDBbind v2012 refined set and CSAR NRC HiQ Set 24Sept2010 respectively, and an execution time benchmark on 12 diverse proteins and 3,000 ligands of different molecular weight. Results show that, under various scenarios, idock achieves comparable success rates while outperforming AutoDock Vina in terms of docking speed by at least 8.69 times and at most 37.51 times. When evaluated on the PDBbind v2012 core set, our istar platform combined with RF-Score achieves a Pearson's correlation coefficient and a Spearman's correlation coefficient as high as 0.855 and 0.859 respectively between the experimental binding affinity and the predicted binding affinity of the docked conformation. 
istar is freely available at http://istar.cse.cuhk.edu.hk/idock.
doi:10.1371/journal.pone.0085678
PMCID: PMC3901662  PMID: 24475049
6.  CSAR Data Set Release 2012: Ligands, Affinities, Complexes, and Docking Decoys 
A major goal in drug design is the improvement of computational methods for docking and scoring. The Community Structure Activity Resource (CSAR) has collected several data sets from industry and added in-house data sets that may be used for this purpose (www.csardock.org). CSAR has currently obtained data from Abbott, GlaxoSmithKline, and Vertex and is working on obtaining data from several others. Combined with our in-house projects, we are providing a data set consisting of 6 protein targets, 647 compounds with biological affinities, and 82 crystal structures. Multiple congeneric series are available for several targets with a few representative crystal structures of each of the series. These series generally contain a few inactive compounds, usually not available in the literature, to provide an upper bound to the affinity range. The affinity ranges are typically 3–4 orders of magnitude per series. For our in-house projects, we have had compounds synthesized for biological testing. Affinities were measured by Thermofluor, Octet RED, and isothermal titration calorimetry for the most soluble. This allows the direct comparison of the biological affinities for those compounds, providing a measure of the variance in the experimental affinity. It appears that there can be considerable variance in the absolute value of the affinity, making the prediction of the absolute value ill-defined. However, the relative rankings within the methods are much better, and this fits with the observation that predicting relative ranking is a more tractable problem computationally. For those in-house compounds, we also have measured the following physical properties: logD, logP, thermodynamic solubility, and pKa. This data set also provides a substantial decoy set for each target consisting of diverse conformations covering the entire active site for all of the 58 CSAR-quality crystal structures. 
The CSAR data sets (CSAR-NRC HiQ and the 2012 release) provide substantial, publicly available, curated data sets for use in parametrizing and validating docking and scoring methods.
doi:10.1021/ci4000486
PMCID: PMC3753885  PMID: 23617227
7.  A Molecular Mechanics Approach to Modeling Protein-Ligand Interactions: Relative Binding Affinities in Congeneric Series 
We introduce the “Prime-ligand” method for ranking ligands in congeneric series. The method employs a single scoring function, the OPLS-AA/GBSA molecular mechanics/implicit solvent model, for all stages of sampling and scoring. We evaluate the method using 12 test sets of congeneric series for which experimental binding data is available in the literature, as well as the structure of one member of the series bound to the protein. Ligands are ‘docked’ by superimposing a common stem fragment among the compounds in the series using a crystal complex from the Protein Data Bank, and sampling the conformational space of the variable region. Our results show good correlation between our predicted rankings and experimental data for cases in which binding affinities differ by at least one order of magnitude. For 11 out of 12 cases, >90% of such ligand pairs could be correctly ranked, while for the remaining case, Factor Xa, 76% of such pairs were correctly ranked. A small number of compounds could not be docked using the current protocol due to the large size of functional groups that could not be accommodated by a rigid receptor. CPU requirements for the method, involving CPU-minutes per ligand, are modest compared with more rigorous methods that use similar force fields, such as free energy perturbation. We also benchmark the scoring function using series of ligands bound to the same protein within the CSAR data set. We demonstrate that energy minimization of ligands in the crystal structures is critical to obtain any correlation with experimentally determined binding affinities.
doi:10.1021/ci200033n
PMCID: PMC3183355  PMID: 21780805
force field based scoring function; docking; scoring; congeneric series; SAR; molecular mechanics; MM-GBSA
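Entry 7 reports the percentage of ligand pairs that are ranked correctly among pairs whose binding affinities differ by at least one order of magnitude. A minimal sketch of that pairwise metric follows. It assumes that experimental affinities are given on a log scale (for example pKd, so one order of magnitude equals 1.0) and that higher predicted scores mean stronger binding; the function name and the `min_gap` parameter are illustrative, not taken from the paper.

```python
from itertools import combinations


def pairwise_ranking_accuracy(predicted, experimental_log, min_gap=1.0):
    """Fraction of ligand pairs ranked in the experimentally observed order.

    Only pairs whose experimental values differ by at least `min_gap`
    log units are counted. Returns None when no pair qualifies.
    """
    correct = 0
    total = 0
    for i, j in combinations(range(len(predicted)), 2):
        gap = experimental_log[i] - experimental_log[j]
        if abs(gap) < min_gap:
            continue  # pair is too close experimentally to be counted
        total += 1
        # Same sign means the predicted order matches the experimental order.
        if (predicted[i] - predicted[j]) * gap > 0:
            correct += 1
    return correct / total if total else None
```

Predicted ties count as incorrect in this sketch, which is the stricter of the two obvious conventions.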
8.  Scoring and lessons learned with the CSAR benchmark using an improved iterative knowledge-based scoring function 
Based on a statistical mechanics-based iterative method, we have extracted a set of distance-dependent, all-atom pairwise potentials for protein-ligand interactions from the crystal structures of 1300 protein-ligand complexes. The iterative method circumvents the long-standing reference state problem in knowledge-based scoring functions. The resulting scoring function, referred to as ITScore 2.0, has been tested with the CSAR (Community Structure-Activity Resource, 2009 release) benchmark of 345 diverse protein-ligand complexes. ITScore 2.0 achieved a Pearson correlation of R² = 0.54 in binding affinity prediction. A comparative analysis has been done on the scoring performances of ITScore 2.0, the van der Waals (VDW) scoring function, the VDW with heavy atoms only, and the force field (FF) scoring function of DOCK, which consists of a VDW term and an electrostatic term. The results reveal several important factors that affect the scoring performances, which could be helpful for the improvement of scoring functions.
doi:10.1021/ci2000727
PMCID: PMC3190652  PMID: 21830787
scoring function; molecular docking; CSAR benchmark; ligand-protein interactions; knowledge-based
9.  CSAR Benchmark Exercise of 2010: Combined Evaluation Across All Submitted Scoring Functions 
As part of the Community Structure-Activity Resource (CSAR) center, a set of 343 high-quality protein–ligand crystal structures was assembled with experimentally determined Kd or Ki information from the literature. We encouraged the community to score the crystallographic poses of the complexes by any method of their choice. The goal of the exercise was to (1) evaluate the current ability of the field to predict activity from structure and (2) investigate the properties of the complexes and methods that appear to hinder scoring. A total of 19 different methods were submitted with numerous parameter variations for a total of 64 sets of scores from 16 participating groups. Linear regression and nonparametric tests were used to correlate scores to the experimental values. Correlation to experiment for the various methods ranged R² = 0.58–0.12, Spearman ρ = 0.74–0.37, Kendall τ = 0.55–0.25, and median unsigned error = 1.00–1.68 pKd units. All types of scoring functions (force field based, knowledge based, and empirical) had examples with high and low correlation, showing no bias/advantage for any particular approach. The data across all the participants were combined to identify 63 complexes that were poorly scored across the majority of the scoring methods and 123 complexes that were scored well across the majority. The two sets were compared using a Wilcoxon rank-sum test to assess any significant difference in the distributions of >400 physicochemical properties of the ligands and the proteins. Poorly scored complexes were found to have ligands that were the same size as those in well-scored complexes, but hydrogen bonding and torsional strain were significantly different. These comparisons point to a need for CSAR to develop data sets of congeneric series with a range of hydrogen-bonding and hydrophobic characteristics and a range of rotatable bonds.
doi:10.1021/ci200269q
PMCID: PMC3186041  PMID: 21809884
10.  Analysis of multiple compound–protein interactions reveals novel bioactive molecules 
The authors use machine learning of compound-protein interactions to explore drug polypharmacology and to efficiently identify bioactive ligands, including novel scaffold-hopping compounds for two pharmaceutically important protein families: G-protein coupled receptors and protein kinases.
We have demonstrated that machine learning of multiple compound–protein interactions is useful for efficient ligand screening and for assessing drug polypharmacology. This approach successfully identified novel scaffold-hopping compounds for two pharmaceutically important protein families: G-protein-coupled receptors and protein kinases. These bioactive compounds were not detected by existing computational ligand-screening methods in comparative studies. The results of this study indicate that data derived from chemical genomics can be highly useful for exploring chemical space, and this systems biology perspective could accelerate drug discovery processes.
The discovery of novel bioactive molecules advances our systems-level understanding of biological processes and is crucial for innovation in drug development. Perturbations of biological systems by chemical probes provide broader applications not only for analysis of complex systems but also for intentional manipulations of these systems. Nevertheless, the lack of well-characterized chemical modulators has limited their use. Recently, chemical genomics has emerged as a promising area of research applicable to the exploration of novel bioactive molecules, and researchers are currently striving toward the identification of all possible ligands for all target protein families (Wang et al, 2009). Chemical genomics studies have shown that patterns of compound–protein interactions (CPIs) are too diverse to be understood as simple one-to-one events. There is an urgent need to develop appropriate data mining methods for characterizing and visualizing the full complexity of interactions between chemical space and biological systems. However, no existing screening approach has so far succeeded in identifying novel bioactive compounds using multiple interactions among compounds and target proteins.
High-throughput screening (HTS) and computational screening have greatly aided in the identification of early lead compounds for drug discovery. However, the large number of assays required for HTS to identify drugs that target multiple proteins render this process very costly and time-consuming. Therefore, interest in using in silico strategies for screening has increased. The most common computational approaches, ligand-based virtual screening (LBVS) and structure-based virtual screening (SBVS; Oprea and Matter, 2004; Muegge and Oloff, 2006; McInnes, 2007; Figure 1A), have been used for practical drug development. LBVS aims to identify molecules that are very similar to known active molecules and generally has difficulty identifying compounds with novel structural scaffolds that differ from reference molecules. The other popular strategy, SBVS, is constrained by the number of three-dimensional crystallographic structures available. To circumvent these limitations, we have shown that a new computational screening strategy, chemical genomics-based virtual screening (CGBVS), has the potential to identify novel, scaffold-hopping compounds and assess their polypharmacology by using a machine-learning method to recognize conserved molecular patterns in comprehensive CPI data sets.
The CGBVS strategy used in this study was made up of five steps: CPI data collection, descriptor calculation, representation of interaction vectors, predictive model construction using training data sets, and predictions from test data (Figure 1A). Importantly, step 1, the construction of a data set of chemical structures and protein sequences for known CPIs, did not require the three-dimensional protein structures needed for SBVS. In step 2, compound structures and protein sequences were converted into numerical descriptors. These descriptors were used to construct chemical or biological spaces in which decreasing distance between vectors corresponded to increasing similarity of compound structures or protein sequences. In step 3, we represented multiple CPI patterns by concatenating these chemical and protein descriptors. Using these interaction vectors, we could quantify the similarity of molecular interactions for compound–protein pairs, despite the fact that the ligand and protein similarity maps differed substantially. In step 4, concatenated vectors for CPI pairs (positive samples) and non-interacting pairs (negative samples) were input into an established machine-learning method. In the final step, the classifier constructed using training sets was applied to test data.
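Steps 3 to 5 above (concatenating compound and protein descriptors into interaction vectors, training on positive and negative pairs, and predicting for test pairs) can be sketched as follows. This is a toy illustration under stated assumptions: the text only says "an established machine-learning method", so the 1-nearest-neighbour classifier below is a deliberately simple stand-in rather than the method used in the study, and all names and descriptor values are hypothetical.

```python
import math


def interaction_vector(compound_desc, protein_desc):
    """Step 3: concatenate compound and protein descriptors."""
    return list(compound_desc) + list(protein_desc)


def euclidean(u, v):
    """Distance between two interaction vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def predict_1nn(train_vectors, train_labels, query):
    """Steps 4-5 stand-in: label a query pair by its nearest training pair."""
    nearest = min(
        range(len(train_vectors)),
        key=lambda i: euclidean(train_vectors[i], query),
    )
    return train_labels[nearest]


# Toy descriptors (hypothetical values): two compounds, two proteins.
compound_a, compound_b = [1.0, 0.0], [0.0, 1.0]
protein_p, protein_q = [1.0], [0.0]

# Interacting pairs are labelled 1, non-interacting pairs 0.
train_vectors = [
    interaction_vector(compound_a, protein_p),
    interaction_vector(compound_b, protein_p),
    interaction_vector(compound_b, protein_q),
    interaction_vector(compound_a, protein_q),
]
train_labels = [1, 0, 1, 0]

# A new compound resembling compound A, queried against protein P.
query = interaction_vector([0.9, 0.1], protein_p)
prediction = predict_1nn(train_vectors, train_labels, query)
```

Because the compound and protein parts live in one vector, the same compound can receive different labels for different proteins, which is what lets a single model cover multiple targets.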
To evaluate the predictive value of CGBVS, we first compared its performance with that of LBVS by fivefold cross-validation. CGBVS performed with considerably higher accuracy (91.9%) than did LBVS (84.4%; Figure 1B). We next compared CGBVS and SBVS in a retrospective virtual screening based on the human β2-adrenergic receptor (ADRB2). Figure 1C shows that CGBVS provided higher hit rates than did SBVS. These results suggest that CGBVS is more successful than conventional approaches for prediction of CPIs.
We then evaluated the ability of the CGBVS method to predict the polypharmacology of ADRB2 by attempting to identify novel ADRB2 ligands from a group of G-protein-coupled receptor (GPCR) ligands. We ranked the prediction scores for the interactions of 826 reported GPCR ligands with ADRB2 and then analyzed the 50 highest-ranked compounds in greater detail. Of 21 commercially available compounds, 11 showed ADRB2-binding activity and were not previously reported to be ADRB2 ligands. These compounds included ligands not only for aminergic receptors but also for neuropeptide Y-type 1 receptors (NPY1R), which have low protein homology to ADRB2. Most ligands we identified were not detected by LBVS and SBVS, which suggests that only CGBVS could identify this unexpected cross-reaction for a ligand developed to target a peptidergic receptor.
The true value of CGBVS in drug discovery must be tested by assessing whether this method can identify scaffold-hopping lead compounds from a set of compounds that is structurally more diverse. To assess this ability, we analyzed 11 500 commercially available compounds to predict compounds likely to bind to two GPCRs and two protein kinases. Functional assays revealed that nine ADRB2 ligands, three NPY1R ligands, five epidermal growth factor receptor (EGFR) inhibitors, and two cyclin-dependent kinase 2 (CDK2) inhibitors were concentrated in the top-ranked compounds (hit rate=30, 15, 25, and 10%, respectively). We also evaluated the extent of scaffold hopping achieved in the identification of these novel ligands. One ADRB2 ligand, two NPY1R ligands, and one CDK2 inhibitor exhibited scaffold hopping (Figure 4), indicating that CGBVS can use this characteristic to rationally predict novel lead compounds, a crucial and very difficult step in drug discovery. This feature of CGBVS is critically different from existing predictive methods, such as LBVS, which depend on similarities between test and reference ligands, and focus on a single protein or highly homologous proteins. In particular, CGBVS is useful for targets with undefined ligands because this method can use CPIs with target proteins that exhibit lower levels of homology.
In summary, we have demonstrated that data mining of multiple CPIs is of great practical value for exploration of chemical space. As a predictive model, CGBVS could provide an important step in the discovery of such multi-target drugs by identifying the group of proteins targeted by a particular ligand, leading to innovation in pharmaceutical research.
The discovery of novel bioactive molecules advances our systems-level understanding of biological processes and is crucial for innovation in drug development. For this purpose, the emerging field of chemical genomics is currently focused on accumulating large assay data sets describing compound–protein interactions (CPIs). Although new target proteins for known drugs have recently been identified through mining of CPI databases, using these resources to identify novel ligands remains unexplored. Herein, we demonstrate that machine learning of multiple CPIs can not only assess drug polypharmacology but can also efficiently identify novel bioactive scaffold-hopping compounds. Through a machine-learning technique that uses multiple CPIs, we have successfully identified novel lead compounds for two pharmaceutically important protein families, G-protein-coupled receptors and protein kinases. These novel compounds were not identified by existing computational ligand-screening methods in comparative studies. The results of this study indicate that data derived from chemical genomics can be highly useful for exploring chemical space, and this systems biology perspective could accelerate drug discovery processes.
doi:10.1038/msb.2011.5
PMCID: PMC3094066  PMID: 21364574
chemical genomics; data mining; drug discovery; ligand screening; systems chemical biology
11.  Support Vector Regression Scoring of Receptor-Ligand Complexes for Rank-Ordering and Virtual Screening of Chemical Libraries 
The Community Structure-Activity Resource (CSAR) datasets are used to develop and test a Support Vector Machine-based scoring function in regression mode (SVR). Two scoring functions (SVR-KB and SVR-EP) are derived with the objective of reproducing the trend of the experimental binding affinities provided within the two CSAR datasets. The features used to train SVR-KB are knowledge-based pairwise potentials, while SVR-EP is based on physico-chemical properties. SVR-KB and SVR-EP were compared to seven other widely used scoring functions, including Glide, X-score, GoldScore, ChemScore, Vina, Dock and PMF. Results showed that SVR-KB trained with features obtained from three-dimensional complexes of the PDBbind dataset outperformed all other scoring functions, including the best-performing X-score, by nearly 0.1 using three correlation coefficients, namely Pearson, Spearman and Kendall. It was interesting that higher performance in rank-ordering did not translate into greater enrichment in virtual screening assessed using the 40 targets of the Directory of Useful Decoys (DUD). To remedy this situation, a variant of SVR-KB (SVR-KBD) was developed by following a target-specific tailoring strategy that we had previously employed to derive SVM-SP. SVR-KBD showed much higher enrichment, outperforming all other scoring functions tested, and was comparable in performance to our previously derived scoring function SVM-SP.
doi:10.1021/ci200078f
PMCID: PMC3209528  PMID: 21728360
12.  Evaluation of Several Two-Step Scoring Functions Based on Linear Interaction Energy, Effective Ligand Size, and Empirical Pair Potentials for Prediction of Protein-Ligand Binding Geometry and Free Energy 
Several two-step scoring approaches for molecular docking were assessed for their ability to predict binding geometries and free energies. Two new scoring functions designed for “step 2 discrimination” were proposed and compared to our CHARMM implementation of the linear interaction energy (LIE) approach using the Generalized-Born with Molecular Volume (GBMV) implicit solvation model. A scoring function S1 was proposed by considering only “interacting” ligand atoms as the “effective size” of the ligand, and extended to an empirical regression-based pair potential S2. The S1 and S2 scoring schemes were trained and five-fold cross validated on a diverse set of 259 protein-ligand complexes from the Ligand Protein Database (LPDB). The regression-based parameters for S1 and S2 also demonstrated reasonable transferability in the CSARdock 2010 benchmark using a new dataset (NRC HiQ) of diverse protein-ligand complexes. The ability of the scoring functions to accurately predict ligand geometry was evaluated by calculating the discriminative power (DP) of the scoring functions to identify native poses. The parameters for the LIE scoring function with the optimal discriminative power (DP) for geometry (step 1 discrimination) were found to be very similar to the best-fit parameters for binding free energy over a large number of protein-ligand complexes (step 2 discrimination). Reasonable performance of the scoring functions in enrichment of active compounds in four different protein target classes established that the parameters for S1 and S2 provided reasonable accuracy and transferability. Additional analysis was performed to definitively separate scoring function performance from molecular weight effects. This analysis included the prediction of ligand binding efficiencies for a subset of the CSARdock NRC HiQ dataset where the number of ligand heavy atoms ranged from 17 to 35. 
This range of ligand heavy atoms is where improved accuracy of predicted ligand efficiencies is most relevant to real-world drug design efforts.
doi:10.1021/ci1003009
PMCID: PMC3183351  PMID: 21644546
CDOCKER; CHARMM; Protein-Ligand Interactions; Docking; Scoring Functions; Distance Dependent Pair Potential; Decoys; Molecular Weight; Fragment; Kinase; p38alpha; p38MAP; Fragment-Based-Design
13.  Four-body atomic potential for modeling protein-ligand binding affinity: application to enzyme-inhibitor binding energy prediction 
BMC Structural Biology  2013;13(Suppl 1):S1.
Background
Models that are capable of reliably predicting binding affinities for protein-ligand complexes play an important role in the field of structure-guided drug design.
Methods
Here, we begin by applying the computational geometry technique of Delaunay tessellation to each set of atomic coordinates for over 1400 diverse macromolecular structures, for the purpose of deriving a four-body statistical potential that serves as a topological scoring function. Next, we identify a second, independent set of three hundred protein-ligand complexes, having both high-resolution structures and known dissociation constants. Two-thirds of these complexes are randomly selected to train a predictive model of binding affinity as follows: two tessellations are generated in each case, one for the entire complex and another strictly for the isolated protein without its bound ligand, and a topological score is computed for each tessellation with the four-body potential. Predicted protein-ligand binding affinity is then based on an empirically derived linear function of the difference between both topological scores, one that appropriately scales the value of this difference.
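The last step of the Methods above, predicting affinity from a linear function of the difference between the two topological scores, can be sketched as below. The sketch assumes an ordinary least-squares fit with an intercept and assumes the difference is taken as complex score minus protein score; the abstract specifies neither, and it does not say how the topological scores themselves are computed, so they are treated here as given numbers. All names are illustrative.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def score_difference(complex_score, protein_score):
    """Topological score of the complex minus that of the isolated protein."""
    return complex_score - protein_score


def predict_affinity(complex_score, protein_score, slope, intercept):
    """Apply the fitted linear function to a new complex."""
    diff = score_difference(complex_score, protein_score)
    return slope * diff + intercept


# Hypothetical training data: score differences and measured binding
# energies (kcal/mol) for three complexes.
train_diffs = [1.0, 2.0, 3.0]
train_energies = [-2.0, -4.0, -6.0]
slope, intercept = fit_line(train_diffs, train_energies)
```

In the study's design the line is fitted on two-thirds of the complexes and then applied unchanged to the held-out third, which is what `predict_affinity` would be used for.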
Results
A comparison between experimental and calculated binding affinity values over the two hundred complexes reveals a Pearson's correlation coefficient of r = 0.79 with a standard error of SE = 1.98 kcal/mol. To validate the method, we similarly generated two tessellations for each of the remaining protein-ligand complexes, computed their topological scores and the difference between the two scores for each complex, and applied the previously derived linear transformation of this topological score difference to predict binding affinities. For these one hundred complexes, we again observe a correlation of r = 0.79 (SE = 1.93 kcal/mol) between known and calculated binding affinities. Applying our model to an independent test set of high-resolution structures for three hundred diverse enzyme-inhibitor complexes, each with an experimentally known inhibition constant, also yields a correlation of r = 0.79 (SE = 2.39 kcal/mol) between experimental and calculated binding energies.
Conclusions
Lastly, we generate predictions with our model on a diverse test set of one hundred protein-ligand complexes previously used to benchmark 15 related methods, and our correlation of r = 0.66 between the calculated and experimental binding energies for this dataset exceeds those of the other approaches. Compared with these related prediction methods, our approach stands out based on salient features that include the reliability of our model, combined with the rapidity of the generated predictions, which are less than one second for an average sized complex.
doi:10.1186/1472-6807-13-S1-S1
PMCID: PMC3952120  PMID: 24564918
14.  Construction and test of ligand decoy sets using MDock: CSAR benchmarks for binding mode prediction 
Two sets of ligand binding decoys have been constructed for the CSAR (Community Structure-Activity Resource) benchmark by using the MDock and DOCK programs for rigid-ligand and flexible-ligand docking, respectively. The decoys generated for each complex in the benchmark thoroughly cover the binding site and also contain a certain number of near-native binding modes. A few scoring functions have been evaluated using the ligand binding decoy sets for their abilities of predicting near-native binding modes. Among them, ITScore achieved a success rate of 86.7% for the rigid-ligand decoys and 79.7% for the flexible-ligand decoys, under the common definition of a successful prediction as RMSD < 2.0 Å from the native structure if the top-scored binding mode was considered. The decoy sets may serve as benchmarks for binding mode prediction of a scoring function, which are available at the CSAR website (http://www.csardock.org/).
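The success criterion quoted above (top-scored binding mode within 2.0 Å RMSD of the native structure) can be expressed as a short calculation. The decoy data, complex names, and the convention that a lower score is better are all assumptions made for this sketch.

```python
def success_rate(top_pose_rmsds, cutoff=2.0):
    """Fraction of complexes whose top-scored pose has RMSD below `cutoff` (in angstroms)."""
    hits = sum(1 for rmsd in top_pose_rmsds if rmsd < cutoff)
    return hits / len(top_pose_rmsds)

# Hypothetical decoy sets: for each complex, a list of (score, rmsd) pairs.
# Assumption: a lower (more negative) score means a better-ranked pose.
decoys = {
    "complex_a": [(-9.1, 0.8), (-7.4, 3.5)],
    "complex_b": [(-6.0, 4.2), (-5.5, 1.1)],
    "complex_c": [(-8.3, 1.9), (-8.0, 0.5)],
    "complex_d": [(-7.7, 2.0), (-7.0, 1.0)],
}

# Keep only the RMSD of the top-scored pose for each complex.
top_rmsds = [min(poses, key=lambda p: p[0])[1] for poses in decoys.values()]
rate = success_rate(top_rmsds)
```

Note that `complex_d` counts as a failure here because the criterion is a strict inequality (RMSD < 2.0 Å).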
doi:10.1021/ci200080g
PMCID: PMC3190646  PMID: 21755952
molecular docking; scoring function; CSAR benchmark; binding mode; knowledge-based
15.  A Predictive Model of the Oxygen and Heme Regulatory Network in Yeast 
PLoS Computational Biology  2008;4(11):e1000224.
Deciphering gene regulatory mechanisms through the analysis of high-throughput expression data is a challenging computational problem. Previous computational studies have used large expression datasets in order to resolve fine patterns of coexpression, producing clusters or modules of potentially coregulated genes. These methods typically examine promoter sequence information, such as DNA motifs or transcription factor occupancy data, in a separate step after clustering. We needed an alternative and more integrative approach to study the oxygen regulatory network in Saccharomyces cerevisiae using a small dataset of perturbation experiments. Mechanisms of oxygen sensing and regulation underlie many physiological and pathological processes, and only a handful of oxygen regulators have been identified in previous studies. We used a new machine learning algorithm called MEDUSA to uncover detailed information about the oxygen regulatory network using genome-wide expression changes in response to perturbations in the levels of oxygen, heme, Hap1, and Co2+. MEDUSA integrates mRNA expression, promoter sequence, and ChIP-chip occupancy data to learn a model that accurately predicts the differential expression of target genes in held-out data. We used a novel margin-based score to extract significant condition-specific regulators and assemble a global map of the oxygen sensing and regulatory network. This network includes both known oxygen and heme regulators, such as Hap1, Mga2, Hap4, and Upc2, as well as many new candidate regulators. MEDUSA also identified many DNA motifs that are consistent with previous experimentally identified transcription factor binding sites. Because MEDUSA's regulatory program associates regulators to target genes through their promoter sequences, we directly tested the predicted regulators for OLE1, a gene specifically induced under hypoxia, by experimental analysis of the activity of its promoter. 
In each case, deletion of the candidate regulator resulted in the predicted effect on promoter activity, confirming that several novel regulators identified by MEDUSA are indeed involved in oxygen regulation. MEDUSA can reveal important information from a small dataset and generate testable hypotheses for further experimental analysis. Supplemental data are included.
Author Summary
The cell uses complex regulatory networks to modulate the expression of genes in response to changes in cellular and environmental conditions. The transcript level of a gene is directly affected by the binding of transcriptional regulators to DNA motifs in its promoter sequence. Therefore, both expression levels of transcription factors and other regulatory proteins as well as sequence information in the promoters contribute to transcriptional gene regulation. In this study, we describe a new computational strategy for learning gene regulatory programs from gene expression data based on the MEDUSA algorithm. We learn a model that predicts differential expression of target genes from the expression levels of regulators, the presence of DNA motifs in promoter sequences, and binding data for transcription factors. Unlike many previous approaches, we do not assume that genes are regulated in clusters, and we learn DNA motifs de novo from promoter sequences as an integrated part of our algorithm. We use MEDUSA to produce a global map of the yeast oxygen and heme regulatory network. To demonstrate that MEDUSA can reveal detailed information about regulatory mechanisms, we perform biochemical experiments to confirm the predicted regulators for an important hypoxia gene.
doi:10.1371/journal.pcbi.1000224
PMCID: PMC2573020  PMID: 19008939
16.  CSAR Benchmark Exercise 2011–2012: Evaluation of Results from Docking and Relative Ranking of Blinded Congeneric Series 
The Community Structure–Activity Resource (CSAR) recently held its first blinded exercise based on data provided by Abbott, Vertex, and colleagues at the University of Michigan, Ann Arbor. A total of 20 research groups submitted results for the benchmark exercise where the goal was to compare different improvements for pose prediction, enrichment, and relative ranking of congeneric series of compounds. The exercise was built around blinded high-quality experimental data from four protein targets: LpxC, Urokinase, Chk1, and Erk2. Pose prediction proved to be the most straightforward task, and most methods were able to successfully reproduce binding poses when the crystal structure employed was co-crystallized with a ligand from the same chemical series. Multiple evaluation metrics were examined, and we found that RMSD and native contact metrics together provide a robust evaluation of the predicted poses. It was notable that most scoring functions underpredicted contacts between the hetero atoms (i.e., N, O, S, etc.) of the protein and ligand. Relative ranking was found to be the most difficult area for the methods, but many of the scoring functions were able to properly identify Urokinase actives from the inactives in the series. Lastly, we found that minimizing the protein and correcting histidine tautomeric states positively trended with low RMSD for pose prediction but minimizing the ligand negatively trended. Pregenerated ligand conformations performed better than those that were generated on the fly. Optimizing docking parameters and pretraining with the native ligand had a positive effect on the docking performance as did using restraints, substructure fitting, and shape fitting. Lastly, for both sampling and ranking scoring functions, the use of the empirical scoring function appeared to trend positively with the RMSD. 
Here, by combining the results of many methods, we hope to provide a statistically relevant evaluation and elucidate specific shortcomings of docking methodology for the community.
doi:10.1021/ci400025f
PMCID: PMC3753884  PMID: 23548044
17.  CSAR Benchmark Exercise of 2010: Selection of the Protein–Ligand Complexes 
A major goal in drug design is the improvement of computational methods for docking and scoring. The Community Structure Activity Resource (CSAR) aims to collect available data from industry and academia which may be used for this purpose (www.csardock.org). Also, CSAR is charged with organizing community-wide exercises based on the collected data. The first of these exercises was aimed to gauge the overall state of docking and scoring, using a large and diverse data set of protein–ligand complexes. Participants were asked to calculate the affinity of the complexes as provided and then recalculate with changes which may improve their specific method. This first data set was selected from existing PDB entries which had binding data (Kd or Ki) in Binding MOAD, augmented with entries from PDBbind. The final data set contains 343 diverse protein–ligand complexes and spans 14 pKd. Sixteen proteins have three or more complexes in the data set, from which a user could start an inspection of congeneric series. Inherent experimental error limits the possible correlation between scores and measured affinity; R2 is limited to ∼0.9 when fitting to the data set without over parametrizing. R2 is limited to ∼0.8 when scoring the data set with a method trained on outside data. The details of how the data set was initially selected, and the process by which it matured to better fit the needs of the community are presented. Many groups generously participated in improving the data set, and this underscores the value of a supportive, collaborative effort in moving our field forward.
doi:10.1021/ci200082t
PMCID: PMC3180202  PMID: 21728306
18.  PHOENIX: A Scoring Function for Affinity Prediction Derived Using High-Resolution Crystal Structures and Calorimetry Measurements 
Binding affinity prediction is one of the most critical components to computer-aided structure-based drug design. Despite advances in first-principle methods for predicting binding affinity, empirical scoring functions that are fast and only relatively accurate are still widely used in structure-based drug design. With the increasing availability of X-ray crystallographic structures in the Protein Data Bank and continuing application of biophysical methods such as isothermal titration calorimetry to measure thermodynamic parameters contributing to binding free energy, sufficient experimental data exists that scoring functions can now be derived by separating enthalpic (ΔH) and entropic (TΔS) contributions to binding free energy (ΔG). PHOENIX, a scoring function to predict binding affinities of protein-ligand complexes, utilizes the increasing availability of experimental data to improve binding affinity predictions by the following: model training and testing using high-resolution crystallographic data to minimize structural noise, independent models of enthalpic and entropic contributions fitted to thermodynamic parameters assumed to be thermodynamically biased to calculate binding free energy, use of shape and volume descriptors to better capture entropic contributions. A set of 42 descriptors and 112 protein-ligand complexes were used to derive functions using partial least squares for change of enthalpy (ΔH) and change of entropy (TΔS) to calculate change of binding free energy (ΔG), resulting in a predictive r2 (r2pred) of 0.55 and a standard error (SE) of 1.34 kcal/mol. External validation using the 2009 version of the PDBbind “refined set” (n = 1612) resulted in a Pearson correlation coefficient (Rp) of 0.575 and a mean error (ME) of 1.41 pKd. Enthalpy and entropy predictions were of limited accuracy individually. However, their difference resulted in a relatively accurate binding free energy. 
While the development of an accurate and applicable scoring function was an objective of this study, the main focus was evaluation of the use of high-resolution X-ray crystal structures with high-quality thermodynamic parameters from isothermal titration calorimetry for scoring function development. With the increasing application of structure-based methods in molecular design, this study suggests that using high-resolution crystal structures, separating enthalpy and entropy contributions to binding free energy, and including descriptors to better capture entropic contributions may prove to be effective strategies towards rapid and accurate calculation of binding affinity.
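The relationship the abstract relies on, with binding free energy obtained as the difference between the enthalpic and entropic terms, can be written out as a small worked example. The numeric inputs are invented, and the conversion to pKd at 298.15 K is added only to connect the kcal/mol and pKd units that both appear above; it is not taken from the paper.

```python
from math import log

R_KCAL = 1.987e-3  # gas constant in kcal/(mol*K)

def binding_free_energy(delta_h, t_delta_s):
    """Delta G = Delta H - T*Delta S, with all terms in kcal/mol."""
    return delta_h - t_delta_s

def pkd_from_delta_g(delta_g, temperature=298.15):
    """Convert Delta G (kcal/mol) to pKd using Delta G = RT ln(Kd)."""
    return -delta_g / (R_KCAL * temperature * log(10))

# Assumed example values: a favorable enthalpy partly offset by an entropic penalty.
delta_g = binding_free_energy(delta_h=-12.0, t_delta_s=-4.0)
pkd = pkd_from_delta_g(delta_g)
```

Because the two predicted terms are subtracted, correlated errors in them can partly cancel, which is one way to read the observation that the difference was more accurate than either term individually.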
doi:10.1021/ci100257s
PMCID: PMC3046228  PMID: 21214225
19.  Prediction of cyclin-dependent kinase 2 inhibitor potency using the fragment molecular orbital method 
Background
The reliable and robust estimation of ligand binding affinity continues to be a challenge in drug design. Many current methods rely on molecular mechanics (MM) calculations which do not fully explain complex molecular interactions. Full quantum mechanical (QM) computation of the electronic state of protein-ligand complexes has recently become possible by the latest advances in the development of linear-scaling QM methods such as the ab initio fragment molecular orbital (FMO) method. This approximate molecular orbital method is sufficiently fast that it can be incorporated into the development cycle during structure-based drug design for the reliable estimation of ligand binding affinity. Additionally, the FMO method can be combined with approximations for entropy and solvation to make it applicable for binding affinity prediction for a broad range of target and chemotypes.
Results
We applied this method to examine the binding affinity for a series of published cyclin-dependent kinase 2 (CDK2) inhibitors. We calculated the binding affinity for 28 CDK2 inhibitors using the ab initio FMO method based on a number of X-ray crystal structures. The sum of the pair interaction energies (PIE) was calculated and used to explain the gas-phase enthalpic contribution to binding. The correlation of the ligand potencies to the protein-ligand interaction energies gained from FMO was examined and was seen to give a good correlation which outperformed three MM force field based scoring functions used to approximate the free energy of binding. Although the FMO calculation allows for the enthalpic component of binding interactions to be understood at the quantum level, as it is an in vacuo single point calculation, the entropic component and solvation terms are neglected. For this reason a more accurate and predictive estimate for binding free energy was desired. Therefore, additional terms used to describe the protein-ligand interactions were then calculated to improve the correlation of the FMO derived values to experimental free energies of binding. These terms were used to account for the polar and non-polar solvation of the molecule estimated by the Poisson-Boltzmann equation and the solvent accessible surface area (SASA), respectively, as well as a correction term for ligand entropy. A quantitative structure-activity relationship (QSAR) model obtained by Partial Least Squares projection to latent structures (PLS) analysis of the ligand potencies and the calculated terms showed a strong correlation (r2 = 0.939, q2 = 0.896) for the 14 molecule test set which had a Pearson rank order correlation of 0.97. A training set of a further 14 molecules was well predicted (r2 = 0.842), and could be used to obtain meaningful estimations of the binding free energy.
Conclusions
Our results show that binding energies calculated with the FMO method correlate well with published data. Analysis of the terms used to derive the FMO energies adds greater understanding to the binding interactions than can be gained by MM methods. Combining this information with additional terms and creating a scaled model to describe the data results in more accurate predictions of ligand potencies than the absolute values obtained by FMO alone.
doi:10.1186/1758-2946-3-2
PMCID: PMC3032746  PMID: 21219630
20.  A knowledge-guided strategy for improving the accuracy of scoring functions in binding affinity prediction 
BMC Bioinformatics  2010;11:193.
Background
Current scoring functions are not very successful in protein-ligand binding affinity prediction despite their popularity in structure-based drug design. Here, we propose a general knowledge-guided scoring (KGS) strategy to tackle this problem. Our KGS strategy computes the binding constant of a given protein-ligand complex based on the known binding constant of an appropriate reference complex. A good training set that includes a sufficient number of protein-ligand complexes with known binding data needs to be supplied for finding the reference complex. The reference complex is required to share a similar pattern of key protein-ligand interactions to that of the complex of interest. Thus, some uncertain factors in protein-ligand binding may cancel out, resulting in a more accurate prediction of absolute binding constants.
Results
In our study, an automatic algorithm was developed for summarizing key protein-ligand interactions as a pharmacophore model and identifying the reference complex with a maximal similarity to the query complex. Our KGS strategy was evaluated in combination with two scoring functions (X-Score and PLP) on three test sets, containing 112 HIV protease complexes, 44 carbonic anhydrase complexes, and 73 trypsin complexes, respectively. Our results obtained on crystal structures as well as computer-generated docking poses indicated that application of the KGS strategy produced more accurate predictions especially when X-Score or PLP alone did not perform well.
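One plausible reading of the reference-complex idea is sketched below: pick the training complex whose interaction pattern is most similar to the query, then shift its known binding constant by the difference in computed scores. Both the additive correction and the use of Jaccard similarity on sets of interaction labels are assumptions made for illustration; the abstract does not state the exact formula or similarity measure, and all names and numbers are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of interaction labels."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def pick_reference(query_features, training_set):
    """Return the training complex with maximal similarity to the query."""
    return max(training_set, key=lambda c: jaccard(query_features, c["features"]))

def kgs_predict(query_score, reference):
    """Assumed form: known reference pKd plus the score difference to the query."""
    return reference["pkd"] + (query_score - reference["score"])

# Hypothetical training complexes with known binding data (pKd) and computed scores.
training_set = [
    {"name": "ref1", "features": {"hbond_asp25", "hydrophobic_s1", "hbond_ile50"},
     "score": 6.2, "pkd": 8.0},
    {"name": "ref2", "features": {"zn_coordination", "hbond_thr199"},
     "score": 5.0, "pkd": 7.1},
]

query_features = {"hbond_asp25", "hydrophobic_s1", "hbond_gly27"}
query_score = 5.7  # score of the query complex from the same scoring function

reference = pick_reference(query_features, training_set)
predicted_pkd = kgs_predict(query_score, reference)
```

The point of the design is that systematic errors shared by the query and reference scores drop out of the difference, so only the reference's experimental value sets the absolute scale.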
Conclusions
Compared to other targeted scoring functions, our KGS strategy does not require any re-parameterization or modification on current scoring methods, and its application is not tied to certain systems. The effectiveness of our KGS strategy is in theory proportional to the ever-increasing knowledge of experimental protein-ligand binding data. Our KGS strategy may serve as a more practical remedy for current scoring functions to improve their accuracy in binding affinity prediction.
doi:10.1186/1471-2105-11-193
PMCID: PMC2868011  PMID: 20398404
21.  Hybrid Scoring and Classification Approaches to Predict Human Pregnane X Receptor Activators 
Pharmaceutical research  2008;26(4):1001-1011.
Purpose
The human pregnane X receptor (PXR) is a transcriptional regulator of many genes involved in xenobiotic metabolism and excretion. Reliable prediction of high affinity binders with this receptor would be valuable for pharmaceutical drug discovery to predict potential toxicological responses.
Materials and Methods
Computational models were developed and validated for a dataset consisting of human PXR (PXR) activators and non-activators. We used support vector machine (SVM) algorithms with molecular descriptors derived from two sources, Shape Signatures and the Molecular Operating Environment (MOE) application software. We also employed the molecular docking program GOLD in which the GoldScore method was supplemented with other scoring functions to improve docking results.
Results
The overall test set prediction accuracy for PXR activators with SVM was 72% to 81%. This indicates that molecular shape descriptors are useful in classification of compounds binding to this receptor. The best docking prediction accuracy (61%) was obtained using 1D Shape Signature descriptors as a weighting factor to the GoldScore. By pooling the available human PXR data sets we revealed those molecular features that are associated with human PXR activators.
Conclusions
These combined computational approaches using molecular shape information may assist scientists to more confidently identify PXR activators.
doi:10.1007/s11095-008-9809-7
PMCID: PMC2836910  PMID: 19115096
docking; hybrid methods; machine learning; pregnane X receptor; shape signatures descriptors; support vector machine
22.  BDDCS Class Prediction for New Molecular Entities 
Molecular Pharmaceutics  2012;9(3):570-580.
The Biopharmaceutics Drug Disposition Classification System (BDDCS) was successfully employed for predicting drug-drug interactions (DDIs) with respect to drug metabolizing enzymes (DMEs), drug transporters and their interplay. The major assumption of BDDCS is that the extent of metabolism (EoM) predicts high versus low intestinal permeability rate, and vice versa, at least when uptake transporters or paracellular transport are not involved. We recently published a collection of over 900 marketed drugs classified for BDDCS. We suggest that a reliable model for predicting BDDCS class, integrated with in vitro assays, could anticipate disposition and potential DDIs of new molecular entities (NMEs). Here we describe a computational procedure for predicting BDDCS class from molecular structures. The model was trained on a set of 300 oral drugs, and validated on an external set of 379 oral drugs, using 17 descriptors calculated or derived from the VolSurf+ software. For each molecule, a probability of BDDCS class membership was given, based on predicted EoM, FDA solubility (FDAS) and their confidence scores. The accuracy in predicting FDAS was 78% in training and 77% in validation, while for EoM prediction the accuracy was 82% in training and 79% in external validation. The actual BDDCS class corresponded to the highest ranked calculated class for 55% of the validation molecules, and it was within the top two ranked classes more than 92% of the time. The unbalanced stratification of the dataset didn’t affect the prediction, which showed highest accuracy in predicting classes 2 and 3 with respect to the most populated class 1. For class 4 drugs a general lack of predictability was observed. A linear discriminant analysis (LDA) confirmed the degree of accuracy for the prediction of the different BDDCS classes is tied to the structure of the dataset. 
This model could routinely be used in early drug discovery to prioritize in vitro tests for NMEs (e.g., affinity to transporters, intestinal metabolism, intestinal absorption and plasma protein binding). We further applied the BDDCS prediction model on a large set of medicinal chemistry compounds (over 30,000 chemicals). Based on this application, we suggest that solubility, and not permeability, is the major difference between NMEs and drugs. We anticipate that the forecast of BDDCS categories in early drug discovery may lead to a significant R&D cost reduction.
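The ranking of BDDCS classes from predicted extent of metabolism and solubility can be illustrated with a small sketch. The class definitions (1: extensive metabolism, high solubility; 2: extensive metabolism, low solubility; 3: poor metabolism, high solubility; 4: poor metabolism, low solubility) follow the standard BDDCS scheme. Combining the two predictions by multiplying their probabilities, as if they were independent, is an assumption of this sketch and not necessarily how the published model derives its class probabilities; the input values are invented.

```python
def bddcs_class_probabilities(p_extensive_metabolism, p_high_solubility):
    """Class membership probabilities, assuming the two predictions are independent."""
    pm, ps = p_extensive_metabolism, p_high_solubility
    return {
        1: pm * ps,              # extensive metabolism, high solubility
        2: pm * (1 - ps),        # extensive metabolism, low solubility
        3: (1 - pm) * ps,        # poor metabolism, high solubility
        4: (1 - pm) * (1 - ps),  # poor metabolism, low solubility
    }

# Assumed model outputs for one molecule.
probs = bddcs_class_probabilities(p_extensive_metabolism=0.8, p_high_solubility=0.3)

# Rank classes from most to least probable; the first two entries are the
# "top two ranked" classes referred to in the validation statistics.
ranked = sorted(probs, key=probs.get, reverse=True)
```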
doi:10.1021/mp2004302
PMCID: PMC3295927  PMID: 22224483
BDDCS; ADMET; GRID; MIF; Drug Disposition; Drug-Drug Interactions; VolSurf+; FDA solubility; machine learning
23.  Better estimation of protein-DNA interaction parameters improve prediction of functional sites 
BMC Biotechnology  2008;8:94.
Background
Characterizing transcription factor binding motifs is a common bioinformatics task. For transcription factors with variable binding sites, we need to get many suboptimal binding sites in our training dataset to get accurate estimates of free energy penalties for deviating from the consensus DNA sequence. One procedure to do that involves a modified SELEX (Systematic Evolution of Ligands by Exponential Enrichment) method designed to produce many such sequences.
Results
We analyzed low stringency SELEX data for E. coli Catabolic Activator Protein (CAP), and we show here that appropriate quantitative analysis improves our ability to predict in vitro affinity. To obtain the large number of sequences required for this analysis we used a SELEX SAGE protocol developed by Roulet et al. The sequences obtained from here were subjected to bioinformatic analysis. The resulting bioinformatic model characterizes the sequence specificity of the protein more accurately than those sequence specificities predicted from previous analysis just by using a few known binding sites available in the literature. The consequences of this increase in accuracy for prediction of in vivo binding sites (and especially functional ones) in the E. coli genome are also discussed. We measured the dissociation constants of several putative CAP binding sites by EMSA (Electrophoretic Mobility Shift Assay) and compared the affinities to the bioinformatics scores provided by methods like the weight matrix method and QPMEME (Quadratic Programming Method of Energy Matrix Estimation) trained on known binding sites as well as on the new sites from SELEX SAGE data. We also checked predicted genome sites for conservation in the related species S. typhimurium. We found that bioinformatics scores based on SELEX SAGE data do better in terms of prediction of physical binding energies as well as in detecting functional sites.
Conclusion
We think that training binding site detection algorithms on datasets from binding assays leads to better prediction. The improvements in accuracy came from the unbiased nature of the SELEX dataset rather than from the number of sites available. We believe that with progress in short-read sequencing technology, one could use SELEX methods to characterize binding affinities of many low specificity transcription factors.
doi:10.1186/1472-6750-8-94
PMCID: PMC2654563  PMID: 19105805
24.  Lessons Learned in Empirical Scoring with smina from the CSAR 2011 Benchmarking Exercise 
We describe a general methodology for designing an empirical scoring function and provide smina, a version of AutoDock Vina specially optimized to support high-throughput scoring and user-specified custom scoring functions. Using our general method, the unique capabilities of smina, a set of default interaction terms from AutoDock Vina, and the CSAR (Community Structure-Activity Resource) 2010 dataset, we created a custom scoring function and evaluated it in the context of the CSAR 2011 benchmarking exercise. We find that our custom scoring function does a better job sampling low RMSD poses when crossdocking compared to the default AutoDock Vina scoring function. The design and application of our method and scoring function reveal several insights into possible improvements and the remaining challenges when scoring and ranking putative ligands.
doi:10.1021/ci300604z
PMCID: PMC3726561  PMID: 23379370
25.  CoMSIA and Docking Study of Rhenium Based Estrogen Receptor Ligand Analogs 
Steroids  2007;72(3):247-260.
OPLS all atom force field parameters were developed in order to model a diverse set of novel rhenium based estrogen receptor ligands whose relative binding affinities (RBA) to the estrogen receptor alpha isoform (ERα) with respect to 17β-Estradiol were available. The binding properties of these novel rhenium based organometallic complexes were studied with a combination of Comparative Molecular Similarity Indices Analysis (CoMSIA) and docking. A total of 29 estrogen receptor ligands consisting of 11 rhenium complexes and 18 organic ligands were docked inside the ligand-binding domain (LBD) of ERα utilizing the program Gold. The top ranked pose was used to construct CoMSIA models from a training set of 22 of the estrogen receptor ligands which were selected at random. In addition scoring functions from the docking runs and the polar volume (PV) were also studied to investigate their ability to predict RBA ERα. A partial least-squares analysis consisting of the CoMSIA steric, electrostatic and hydrophobic indices together with the polar volume proved sufficiently predictive having a correlation coefficient, r2, of 0.94 and a cross-validated correlation coefficient, q2, utilizing the leave one out method of 0.68. Analysis of the scoring functions from Gold showed particularly poor correlation to RBA ERα which did not improve when the rhenium complexes were extracted to leave the organic ligands. The combined CoMSIA and polar volume model ranked correctly the ligands in order of increasing RBA ERα, illustrating the utility of this method as a prescreening tool in the development of novel rhenium based estrogen receptor ligands.
doi:10.1016/j.steroids.2006.11.011
PMCID: PMC1964785  PMID: 17280694
steroid; docking; estrogen receptor; rhenium; CoMSIA
