1.  Hsp90 Inhibitors, Part 2: Combining Ligand-Based and Structure-Based Approaches for Virtual Screening Application 
Hsp90 continues to be an important target for pharmaceutical discovery. In this project, virtual screening (VS) for novel Hsp90 inhibitors was performed using a combination of Autodock and Surflex-Sim (LB) scoring functions with the predictive ability of 3-D QSAR models, previously generated with the 3-D QSAutogrid/R procedure. Extensive validation of both structure-based (SB) and ligand-based (LB) approaches, through realignments and cross-alignments, allowed the definition of LB and SB alignment rules. The mixed LB/SB protocol was applied to virtually screen potential Hsp90 inhibitors from the NCI Diversity Set, composed of 1785 compounds. A selected ensemble of 80 compounds was biologically tested. Among these molecules, preliminary data yielded four derivatives exhibiting IC50 values ranging between 18 and 63 μM as hits for a subsequent medicinal chemistry optimization procedure.
PMCID: PMC3985681  PMID: 24555544
2.  Structure Based Design, Synthesis, Pharmacophore Modeling, Virtual Screening, and Molecular Docking Studies for Identification of Novel Cyclophilin D Inhibitors 
Cyclophilin D (CypD) is a peptidyl prolyl isomerase F that resides in the mitochondrial matrix and associates with the inner mitochondrial membrane during the mitochondrial membrane permeability transition. CypD plays a central role in opening the mitochondrial membrane permeability transition pore (mPTP) leading to cell death and has been linked to Alzheimer’s disease (AD). Because CypD interacts with amyloid beta (Aβ) to exacerbate mitochondrial and neuronal stress, it is a potential target for drugs to treat AD. Since appropriately designed small organic molecules might bind to CypD and block its interaction with Aβ, 20 trial compounds were designed using known procedures that started with fundamental pyrimidine and sulfonamide scaffolds known to have useful therapeutic effects. Two-dimensional (2D) quantitative structure–activity relationship (QSAR) methods were applied to 40 compounds with known IC50 values. These formed a training set and were followed by a trial set of 20 designed compounds. A correlation analysis was carried out comparing the statistics of the measured IC50 with predicted values for both sets. Selectivity-determining descriptors were interpreted graphically in terms of principal component analyses. These descriptors can be very useful for predicting activity enhancement for lead compounds. A 3D pharmacophore model was also created. Molecular dynamics simulations were carried out for the 20 trial compounds with known IC50 values, and molecular descriptors were determined by 2D QSAR studies using the Lipinski rule-of-five. Fifteen of the 20 molecules satisfied all 5 Lipinski rules, and the remaining 5 satisfied 4 of the 5 Lipinski criteria and nearly satisfied the fifth. Our previous use of 2D QSAR, 3D pharmacophore models, and molecular docking experiments to successfully predict activity indicates that this can be a very powerful technique for screening large numbers of new compounds as active drug candidates.
These studies will hopefully provide a basis for efficiently designing and screening large numbers of more potent and selective inhibitors for CypD treatment of AD.
PMCID: PMC3985759  PMID: 24555519
3.  QSAR Modeling of Imbalanced High-Throughput Screening Data in PubChem 
Many of the structures in PubChem are annotated with activities determined in high-throughput screening (HTS) assays. Because of the nature of these assays, the activity data are typically strongly imbalanced, with a small number of active compounds contrasting with a very large number of inactive compounds. We have used several such imbalanced PubChem HTS assays to test and develop strategies to efficiently build robust QSAR models from imbalanced data sets. Different descriptor types [Quantitative Neighborhoods of Atoms (QNA) and “biological” descriptors] were used to generate a variety of QSAR models in the program GUSAR. The models obtained were compared using external test and validation sets. We also report on our efforts to incorporate the most predictive of our models in the publicly available NCI/CADD Group Web services (
PMCID: PMC3985743  PMID: 24524735
4.  Identification of Novel Serotonin Transporter Compounds by Virtual Screening 
The serotonin (5-hydroxytryptamine, 5-HT) transporter (SERT) plays an essential role in the termination of serotonergic neurotransmission by removing 5-HT from the synaptic cleft into the presynaptic neuron. It is also of pharmacological importance being targeted by antidepressants and psychostimulant drugs. Here, five commercial databases containing approximately 3.24 million drug-like compounds have been screened using a combination of two-dimensional (2D) fingerprint-based and three-dimensional (3D) pharmacophore-based screening and flexible docking into multiple conformations of the binding pocket detected in an outward-open SERT homology model. Following virtual screening (VS), selected compounds were evaluated using in vitro screening and full binding assays and an in silico hit-to-lead (H2L) screening was performed to obtain analogues of the identified compounds. Using this multistep VS/H2L approach, 74 active compounds, 46 of which had Ki values of ≤1000 nM, belonging to 16 structural classes, have been identified, and multiple compounds share no structural resemblance with known SERT binders.
PMCID: PMC3982395  PMID: 24521202
5.  Dataset Modelability by QSAR 
We introduce a simple MODelability Index (MODI) that estimates the feasibility of obtaining predictive QSAR models (Correct Classification Rate above 0.7) for a binary dataset of bioactive compounds. MODI is defined as an activity class-weighted ratio of the number of the nearest neighbor pairs of compounds with the same activity class versus the total number of pairs. The MODI values were calculated for more than 100 datasets and the threshold of 0.65 was found to separate non-modelable from the modelable datasets.
PMCID: PMC3984298  PMID: 24251851
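The MODI definition in entry 5 can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes Euclidean distance on numeric descriptor vectors, takes only the first nearest neighbor of each compound, and averages the per-class fractions with equal weight. The function name `modi` and the toy data are hypothetical.

```python
import math

def modi(descriptors, labels):
    """Class-averaged fraction of compounds whose nearest neighbor shares their class.

    descriptors: list of equal-length numeric tuples (assumed descriptor vectors)
    labels: list of activity classes, one per compound
    """
    same = {}   # per class: compounds whose nearest neighbor has the same class
    total = {}  # per class: number of compounds
    n = len(descriptors)
    for i in range(n):
        # Nearest neighbor by Euclidean distance, excluding the compound itself.
        j = min((k for k in range(n) if k != i),
                key=lambda k: math.dist(descriptors[i], descriptors[k]))
        y = labels[i]
        total[y] = total.get(y, 0) + 1
        same[y] = same.get(y, 0) + (1 if labels[j] == y else 0)
    # Equal weight per class, so a large inactive class cannot dominate.
    return sum(same[c] / total[c] for c in total) / len(total)

# Toy one-dimensional descriptors; one compound of each class sits next to
# a compound of the other class.
toy_x = [(0.0,), (0.1,), (0.2,), (1.0,), (1.1,), (0.25,)]
toy_y = [0, 0, 0, 1, 1, 1]
print(round(modi(toy_x, toy_y), 3))
```

A value computed this way would then be compared against the 0.65 threshold reported in the abstract.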
6.  Exploiting conformational dynamics in drug discovery: design of C-terminal inhibitors of Hsp90 with improved activities 
The interaction that occurs between molecules is a dynamic process that impacts both structural and conformational properties of the ligand and the ligand binding site. Herein, we investigate the dynamic cross-talk between a protein and the ligand as a source for new opportunities in ligand design. Analysis of the formation/disappearance of protein pockets produced in response to a first-generation inhibitor assisted in the identification of functional groups that could be introduced onto scaffolds to facilitate optimal binding, which allowed for increased binding with previously uncharacterized regions. MD simulations were used to elucidate primary changes that occur in the Hsp90 C-terminal binding pocket in the presence of first-generation ligands. This data was then used to design ligands that adapt to these receptor conformations, which provides access to an energy landscape that is not visible in a static model. The newly synthesized compounds demonstrated anti-proliferative activity at ~150 nanomolar concentration. The method identified herein may be used to design chemical probes that provide additional information on structural variations of Hsp90 C-terminal binding site.
PMCID: PMC4123794  PMID: 24397468
Drug-Design; Flexibility; Allostery; MD simulations; Dynamics-Based Design; Hsp90
7.  SCISSORS: Practical Considerations 
Molecular similarity has been effectively applied to many problems in cheminformatics and computational drug discovery, but modern methods can be prohibitively expensive for large-scale applications. The SCISSORS method rapidly approximates measures of pairwise molecular similarity such as ROCS and LINGO Tanimotos, acting as a filter to quickly reduce the size of a problem. We report an in-depth analysis of SCISSORS performance, including a mapping of the SCISSORS error distribution, benchmarking, and investigation of several algorithmic modifications. We show that SCISSORS can accurately predict multiconformer similarity, and suggest a method for estimating optimal SCISSORS parameters in a dataset-specific manner. These results are a useful resource for researchers seeking to incorporate SCISSORS into molecular similarity applications.
PMCID: PMC4207653  PMID: 24289274
8.  Pathway Analysis for Drug Repositioning Based on Public Database Mining 
Sixteen FDA-approved drugs were investigated to elucidate their mechanisms of action (MOAs) and clinical functions by pathway analysis based on retrieved drug targets interacting with or affected by the investigated drugs. Protein and gene targets and associated pathways were obtained by data-mining of public databases including the MMDB, PubChem BioAssay, GEO DataSets, and the BioSystems databases. Entrez E-Utilities were applied, and in-house Ruby scripts were developed for data retrieval and pathway analysis to identify and evaluate relevant pathways common to the retrieved drug targets. Pathways pertinent to clinical uses or MOAs were obtained for most drugs. Interestingly, some drugs identified pathways responsible for other diseases than their current therapeutic uses, and these pathways were verified retrospectively by in vitro tests, in vivo tests, or clinical trials. The pathway enrichment analysis based on drug target information from public databases could provide a novel approach for elucidating drug MOAs and repositioning, therefore benefiting the discovery of new therapeutic treatments for diseases.
PMCID: PMC3956470  PMID: 24460210
9.  A New Approach to Radial Basis Function Approximation and Its Application to QSAR 
We describe a novel approach to RBF approximation, which combines two new elements: (1) linear radial basis functions and (2) weighting the model by each descriptor’s contribution. Linear radial basis functions allow one to achieve more accurate predictions for diverse data sets. Taking into account the contribution of each descriptor produces more accurate similarity values used for model development. The method was validated on 14 public data sets comprising nine physicochemical properties and five toxicity endpoints. We also compared the new method with five different QSAR methods implemented in the EPA T.E.S.T. program. Our approach, implemented in the program GUSAR, showed a reasonable accuracy of prediction and high coverage for all external test sets, providing more accurate prediction results than the comparison methods and even the consensus of these methods. Using our new method, we have created models for physicochemical and toxicity endpoints, which we have made freely available in the form of an online service at
PMCID: PMC3985791  PMID: 24451033
10.  LIBSA – A Method for the Determination of Ligand-Binding Preference to Allosteric Sites on Receptor Ensembles 
Incorporation of receptor flexibility into computational drug discovery through the relaxed complex scheme is well suited for screening against a single binding site. In the absence of a known pocket or if there are multiple potential binding sites, it may be necessary to do docking against the entire surface of the target (global docking). However, no suitable and easy-to-use tool is currently available to rank global docking results based on the preference of a ligand for a given binding site. We have developed a protocol, termed LIBSA for LIgand Binding Specificity Analysis, that analyzes multiple docked poses against a single receptor conformation or an ensemble of receptor conformations and returns a metric for the relative binding to a specific region of interest. By using novel filtering algorithms and the signal-to-noise ratio (SNR), the relative ligand-binding frequency at different pockets can be calculated and compared quantitatively. Ligands can then be triaged by their tendency to bind to a site instead of ranking by affinity alone. The method thus facilitates screening libraries of ligand cores against a large library of receptor conformations without prior knowledge of specific pockets, which is especially useful to search for hits that selectively target a particular site. We demonstrate the utility of LIBSA by showing that it correctly identifies known ligand binding sites and predicts the relative preference of a set of related ligands for different pockets on the same receptor.
PMCID: PMC3985772  PMID: 24437606
11.  Quality Matters: Extension of Clusters of Residues with Good Hydrophobic Contacts Stabilize (Hyper)Thermophilic Proteins 
Identifying determinant(s) of protein thermostability is key for rational and data-driven protein engineering. By analyzing more than 130 pairs of mesophilic/(hyper)thermophilic proteins, we identified the quality (residue-wise energy) of hydrophobic interactions as a key factor for protein thermostability. This distinguishes our study from previous ones that investigated predominantly structural determinants. Considering this key factor, we successfully discriminated between pairs of mesophilic/(hyper)thermophilic proteins (discrimination accuracy: ∼80%) and searched for structural weak spots in E. coli dihydrofolate reductase (classification accuracy: 70%).
PMCID: PMC3985445  PMID: 24437522
12.  Implementation of the Hungarian Algorithm to Account for Ligand Symmetry and Similarity in Structure-Based Design 
False negative docking outcomes for highly symmetric molecules are a barrier to the accurate evaluation of docking programs, scoring functions, and protocols. This work describes an implementation of a symmetry-corrected root-mean-square deviation (RMSD) method into the program DOCK based on the Hungarian algorithm for solving the minimum assignment problem, which dynamically assigns atom correspondence in molecules with symmetry. The algorithm adds only a trivial amount of computation time to the RMSD calculations and is shown to increase the reported overall docking success rate by approximately 5% when tested over 1043 receptor–ligand systems. For some families of protein systems the results are even more dramatic, with success rate increases up to 16.7%. Several additional applications of the method are also presented including as a pairwise similarity metric to compare molecules during de novo design, as a scoring function to rank-order virtual screening results, and for the analysis of trajectories from molecular dynamics simulation. The new method, including source code, is available to registered users of DOCK6 (
PMCID: PMC3958141  PMID: 24410429
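The symmetry-corrected RMSD idea in entry 12 can be illustrated as follows. This is a sketch under stated assumptions rather than the DOCK6 code: atoms are allowed to match only when their element symbols agree (the real implementation may use finer atom types), the cost of a match is the squared distance, no superposition is performed because docked poses share the receptor frame, and `hungarian` is a compact textbook O(n^3) variant of the algorithm. All function names are hypothetical.

```python
import math

def hungarian(cost):
    """Minimum-cost assignment for a square matrix; returns the column chosen for each row."""
    n = len(cost)
    inf = float("inf")
    u = [0.0] * (n + 1)   # row potentials
    v = [0.0] * (n + 1)   # column potentials
    p = [0] * (n + 1)     # p[j]: row currently matched to column j (1-based, 0 = none)
    way = [0] * (n + 1)   # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (n + 1)
        used = [False] * (n + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(n + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:  # flip the matching along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    match = [0] * n
    for j in range(1, n + 1):
        match[p[j] - 1] = j - 1
    return match

def symmetry_corrected_rmsd(elems_a, xyz_a, elems_b, xyz_b, penalty=1e6):
    """RMSD after reassigning atom correspondence among atoms of the same element."""
    n = len(xyz_a)
    cost = [[sum((a - b) ** 2 for a, b in zip(xyz_a[i], xyz_b[j]))
             if elems_a[i] == elems_b[j] else penalty  # large cost forbids mismatches
             for j in range(n)] for i in range(n)]
    match = hungarian(cost)
    return math.sqrt(sum(cost[i][match[i]] for i in range(n)) / n)

# Carboxylate-like fragment: the two oxygens are listed in swapped order in pose B,
# so a fixed atom-order RMSD would be sqrt(8/3) even though the poses coincide.
elems = ["C", "O", "O"]
pose_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)]
pose_b = [(0.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
print(symmetry_corrected_rmsd(elems, pose_a, elems, pose_b))
```

Matching on element alone can pair chemically inequivalent atoms; that simplification is part of the sketch, not a claim about the published method.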
13.  Application of Quantitative Structure–Activity Relationship Models of 5-HT1A Receptor Binding to Virtual Screening Identifies Novel and Potent 5-HT1A Ligands 
The 5-hydroxytryptamine 1A (5-HT1A) serotonin receptor has been an attractive target for treating mood and anxiety disorders such as schizophrenia. We have developed binary classification quantitative structure–activity relationship (QSAR) models of 5-HT1A receptor binding activity using data retrieved from the PDSP Ki database. The prediction accuracy of these models was estimated by external 5-fold cross-validation as well as using an additional validation set comprising 66 structurally distinct compounds from the World of Molecular Bioactivity database. These validated models were then used to mine three major types of chemical screening libraries, i.e., drug-like libraries, GPCR targeted libraries, and diversity libraries, to identify novel computational hits. The five best hits from each class of libraries were chosen for further experimental testing in radioligand binding assays, and nine of the 15 hits were confirmed to be active experimentally with binding affinity better than 10 μM. The most active compound, Lysergol, from the diversity library showed very high binding affinity (Ki) of 2.3 nM against 5-HT1A receptor. The novel 5-HT1A actives identified with the QSAR-based virtual screening approach could be potentially developed as novel anxiolytics or potential antischizophrenic drugs.
PMCID: PMC3985444  PMID: 24410373
14.  Protein Structure Refinement of CASP Target Proteins Using GNEIMO Torsional Dynamics Method 
A longstanding challenge in using computational methods for protein structure prediction is the refinement of low-resolution structural models derived from comparative modeling methods into highly accurate atomistic models useful for detailed structural studies. Previously, we have developed and demonstrated the utility of the internal coordinate molecular dynamics (MD) technique, generalized Newton–Euler inverse mass operator (GNEIMO), for refinement of small proteins. Using GNEIMO, the high-frequency degrees of freedom are frozen and the protein is modeled as a collection of rigid clusters connected by torsional hinges. This physical model allows larger integration time steps and focuses the conformational search in the low frequency torsional degrees of freedom. Here, we have applied GNEIMO with temperature replica exchange to refine low-resolution protein models of 30 proteins taken from the Critical Assessment of Structure Prediction (CASP) competition. We have shown that the GNEIMO torsional MD method leads to refinement of up to 1.3 Å in the root-mean-square deviation in coordinates for 30 CASP target proteins without using any experimental data as restraints in performing the GNEIMO simulations. This is in contrast with the unconstrained all-atom Cartesian MD method performed under the same conditions, where refinement requires the use of restraints during the simulations.
PMCID: PMC3985798  PMID: 24397429
15.  Small-molecule 3D Structure Prediction Using Open Crystallography Data 
Predicting the 3D structures of small molecules is a common problem in chemoinformatics. Even the best methods are inaccurate for complex molecules, and there is a large gap in accuracy between proprietary and free algorithms. Previous work presented COSMOS, a novel, data-driven algorithm that uses knowledge of known structures from the Cambridge Structural Database, and demonstrated performance that was competitive with proprietary algorithms. However, dependence on the Cambridge Structural Database prevented its widespread use. Here we present an updated version of the COSMOS structure predictor, complete with a free structure library derived from open data sources. We demonstrate that COSMOS performs better than other freely-available methods, with a mean RMSD of 1.16 Å and 1.68 Å for organic and metal-organic structures, and a mean prediction time of 60 ms per molecule. This is a 17% and 20% reduction in RMSD compared to the free predictor provided by Open Babel, and ten times faster. The ChemDB webportal provides a COSMOS prediction webserver, as well as downloadable copies of the COSMOS executable and the library of molecular substructures.
PMCID: PMC3918487  PMID: 24261562
16.  Inclusion of multiple fragment types in the Site Identification by Ligand Competitive Saturation (SILCS) approach 
The Site Identification by Ligand Competitive Saturation (SILCS) method identifies the location and approximate affinities of small molecular fragments on a target macromolecular surface by performing Molecular Dynamics (MD) simulations of the target in an aqueous solution of small molecules representative of different chemical functional groups. In this study, we introduce a set of small molecules to map potential interactions made by neutral hydrogen bond donors and acceptors, and charged donor and acceptor fragments in addition to nonpolar fragments. The affinity pattern is obtained in the form of discretized probability or, equivalently, free energy maps, called FragMaps, which can be visualized with the target surface. We performed SILCS simulations for four proteins for which structural and thermodynamic data is available for multiple, diverse ligands. Good overlap is shown between high affinity regions identified by the FragMaps and the crystallographic positions of ligand functional groups with similar chemical functionality, thus demonstrating the validity of the qualitative information obtained from the simulations. To test the ability of FragMaps in providing quantitative predictions, we calculate the previously introduced Ligand Grid Free Energy (LGFE) metric and observe its correspondence with experimentally measured binding affinity. LGFE is computed for different conformational ensembles and improvement in prediction is shown with increasing ligand conformational sampling. Ensemble generation includes a Monte Carlo sampling approach that uses the GFE FragMaps directly as the energy function. The results show some, but not all experimental trends are predicted, and warrant improvements in the scoring methodology. In addition, the potential utility of atom-based free energy contributions to the LGFE scores and the use of multiple ligands in SILCS to identify displaceable water molecules during ligand design are discussed.
PMCID: PMC3947602  PMID: 24245913
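The relationship between probability maps and free energy maps mentioned in entry 16 can be sketched as a Boltzmann inversion, with the ligand score taken as a sum over atoms. This is an illustration under assumptions, not the SILCS code: the voxel indexing, the fragment type names, the per-type bulk occupancies, and the absence of any energy cap or handling of empty voxels are all simplifications introduced here.

```python
import math

R_KCAL = 0.0019872  # gas constant in kcal/(mol*K)

def grid_free_energy(occupancy, bulk_occupancy, temperature=298.0):
    """Free energy (kcal/mol) of a voxel from its occupancy relative to bulk."""
    # Voxels with zero occupancy are not handled in this sketch.
    return -R_KCAL * temperature * math.log(occupancy / bulk_occupancy)

def ligand_grid_free_energy(atoms, fragmaps, bulk, temperature=298.0):
    """Sum voxel free energies over ligand atoms.

    atoms:    list of (fragment_type, voxel_index) pairs, one per classified atom
    fragmaps: {fragment_type: {voxel_index: occupancy}}
    bulk:     {fragment_type: bulk occupancy}
    """
    return sum(grid_free_energy(fragmaps[t][v], bulk[t], temperature)
               for t, v in atoms)

# Hypothetical maps: an apolar hot spot at 4x bulk and a donor site at 2x bulk.
fragmaps = {"apolar": {(0, 0, 0): 4.0, (1, 0, 0): 1.0},
            "donor": {(2, 0, 0): 2.0}}
bulk = {"apolar": 1.0, "donor": 1.0}
ligand = [("apolar", (0, 0, 0)), ("donor", (2, 0, 0))]
print(round(ligand_grid_free_energy(ligand, fragmaps, bulk), 3))
```

Occupancies above bulk give negative (favorable) contributions, which is why overlap of ligand atoms with high-probability regions lowers the summed score.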
17.  How Does Catalase Release Nitric Oxide? A Computational Structure Activity Relationship Study 
Hydroxyurea (HU) is the only FDA-approved medication for treating sickle cell disease in adults. The primary mechanism of action is pharmacological elevation of nitric oxide (NO) levels, which induces propagation of fetal hemoglobin. HU is known to undergo redox reactions with heme-based enzymes like hemoglobin and catalase to produce NO. However, specific details about the HU-based NO release remain unknown. Experimental studies indicate that interaction of HU with human catalase compound I produces NO. Presently, we combine flexible receptor-flexible substrate induced fit docking (IFD) with energy decomposition analyses to examine the atomic level details of a possible key step in the clinical conversion of HU to NO. Substrate binding modes of nine HU analogs with catalase compound I were investigated to determine the essential properties necessary for effective NO release. Three major binding orientations were found that provide insight into the possible reaction mechanisms for producing NO. Further results show that anion/radical intermediates produced as part of these mechanisms would be stabilized by hydrogen bonding interactions from distal residues His75, Asn148, Gln168, and oxoferryl-heme. These details will ideally both contribute to a clearer mechanistic picture and provide insights for future structure-based drug design efforts.
PMCID: PMC3893047  PMID: 24087936
18.  Fusing Dual-Event Datasets for Mycobacterium Tuberculosis Machine Learning Models and their Evaluation 
The search for new tuberculosis treatments continues as we need to find molecules that can act more quickly, be accommodated in multi-drug regimens, and overcome ever increasing levels of drug resistance. Multiple large scale phenotypic high-throughput screens against Mycobacterium tuberculosis (Mtb) have generated dose response data, enabling the generation of machine learning models. These models also incorporated cytotoxicity data and were recently validated with a large external dataset.
A cheminformatics data-fusion approach followed by Bayesian machine learning, Support Vector Machine or Recursive Partitioning model development (based on publicly available Mtb screening data) was used to compare individual datasets and subsequent combined models. A set of 1924 commercially available molecules with promising antitubercular activity (and lack of relative cytotoxicity to Vero cells) was used to evaluate the predictive nature of the models. We demonstrate that combining three datasets incorporating antitubercular and cytotoxicity data in Vero cells from our previous screens results in an external validation receiver operating characteristic (ROC) score of 0.83 (Bayesian or RP Forest). Models that do not have the highest five-fold cross validation ROC scores can outperform other models in a test set dependent manner.
We demonstrate with predictions for a recently published set of Mtb leads from GlaxoSmithKline that no single machine learning model may be enough to identify compounds of interest. Dataset fusion represents a further useful strategy for machine learning construction as illustrated with Mtb. Coverage of chemistry and Mtb target spaces may also be limiting factors for the whole-cell screening data generated to date.
PMCID: PMC3910492  PMID: 24144044
Bayesian models; Collaborative Drug Discovery Tuberculosis database; Dual-event models; Function class fingerprints; Lead optimization; Mycobacterium tuberculosis; Recursive partitioning; Support vector machine; Tuberculosis
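The ROC scores quoted in entry 18 are areas under the ROC curve. A minimal sketch of that quantity follows, computed as the probability that a randomly chosen active is scored above a randomly chosen inactive, with ties counted as one half. The function name `roc_auc` and the example scores are hypothetical, and this pairwise form is meant for illustration on small sets rather than large screening libraries.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of actives and inactives."""
    actives = [s for s, y in zip(scores, labels) if y]
    inactives = [s for s, y in zip(scores, labels) if not y]
    wins = 0.0
    for a in actives:
        for b in inactives:
            if a > b:
                wins += 1.0   # active ranked above inactive
            elif a == b:
                wins += 0.5   # tie counts as half
    return wins / (len(actives) * len(inactives))

# Hypothetical model scores for six compounds (1 = active, 0 = inactive).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
print(round(roc_auc(scores, labels), 3))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives some context for the reported 0.83.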
19.  Correlating protein hot spot surface analysis using ProBiS with simulated free energies of protein-protein interfacial residues 
A protocol was developed for the computational determination of the contribution of interfacial amino acid residues to the free energy of protein-protein binding. Thermodynamic integration, based on molecular dynamics simulation in CHARMM, was used to determine the free energy associated with single point mutations to glycine in a protein-protein interface. The hot spot amino acids found in this way were then correlated to structural similarity scores detected by the ProBiS algorithm for local structural alignment. We find that amino acids with high structural similarity scores contribute on average −3.19 kcal/mol to the free energy of protein-protein binding and are thus correlated with hot spot residues, while residues with low similarity scores contribute on average only −0.43 kcal/mol. This suggests that the local structural alignment method provides a good approximation of the contribution of a residue to the free energy of binding and is particularly useful for detection of hot spots in proteins with known structures but undetermined protein-protein complexes.
PMCID: PMC4219562  PMID: 23009716
hot spot prediction; protein-protein binding; thermodynamic integration
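The thermodynamic integration step in entry 19 amounts to integrating the ensemble-averaged derivative of the potential energy with respect to the coupling parameter λ from 0 to 1. The sketch below uses the trapezoidal rule on precomputed window averages. It is not the authors' CHARMM protocol: the λ schedule, the numbers, and the sign convention for the binding difference (complex minus unbound) are assumptions made for illustration.

```python
def ti_free_energy(lambdas, dudl_means):
    """Trapezoidal estimate of the integral of <dU/dlambda> over lambda."""
    return sum(0.5 * (dudl_means[i] + dudl_means[i + 1]) * (lambdas[i + 1] - lambdas[i])
               for i in range(len(lambdas) - 1))

def mutation_ddg(lambdas, dudl_complex, dudl_unbound):
    """Free energy change of a mutation in the complex minus that in the unbound protein."""
    return ti_free_energy(lambdas, dudl_complex) - ti_free_energy(lambdas, dudl_unbound)

# Hypothetical window averages in kcal/mol for a residue-to-glycine transformation.
lambdas = [0.0, 0.25, 0.5, 0.75, 1.0]
dudl_complex = [10.0, 8.0, 5.0, 3.0, 2.0]
dudl_unbound = [6.0, 5.0, 3.0, 2.0, 1.0]
print(ti_free_energy(lambdas, dudl_complex))
print(mutation_ddg(lambdas, dudl_complex, dudl_unbound))
```

In practice more λ windows and an error estimate per window would be needed; the five-point schedule here only keeps the arithmetic easy to follow.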
20.  Molecular Recognition in a Diverse Set of Protein-Ligand Interactions Studied with Molecular Dynamics Simulations and End-Point Free Energy Calculations 
End-point free energy calculations using MM-GBSA and MM-PBSA provide a detailed understanding of molecular recognition in protein-ligand interactions. The binding free energy can be used to rank-order protein-ligand structures in virtual screening for compound or target identification. Here, we carry out free energy calculations for a diverse set of 11 proteins bound to 14 small molecules using extensive explicit-solvent MD simulations. The structure of these complexes was previously solved by crystallography and their binding studied with isothermal titration calorimetry (ITC) data enabling direct comparison to the MM-GBSA and MM-PBSA calculations. Four MM-GBSA and three MM-PBSA calculations reproduced the ITC free energy within 1 kcal•mol−1 highlighting the challenges in reproducing the absolute free energy from end-point free energy calculations. MM-GBSA exhibited better rank-ordering with a Spearman ρ of 0.68 compared to 0.40 for MM-PBSA with dielectric constant (ε = 1). An increase in ε resulted in significantly better rank-ordering for MM-PBSA (ρ = 0.91 for ε = 10). But larger ε significantly reduced the contributions of electrostatics, suggesting that the improvement is due to the non-polar and entropy components, rather than a better representation of the electrostatics. SVRKB scoring function applied to MD snapshots resulted in excellent rank-ordering (ρ = 0.81). Calculations of the configurational entropy using normal mode analysis led to free energies that correlated significantly better to the ITC free energy than the MD-based quasi-harmonic approach, but the computed entropies showed no correlation with the ITC entropy. When the adaptation energy is taken into consideration by running separate simulations for complex, apo and ligand (MM-PBSAADAPT), there is less agreement with the ITC data for the individual free energies, but remarkably good rank-ordering is observed (ρ = 0.89). 
Interestingly, filtering MD snapshots by pre-scoring protein-ligand complexes with a machine learning-based approach (SVMSP) resulted in a significant improvement in the MM-PBSA results (ε = 1) from ρ = 0.40 to ρ = 0.81. Finally, the non-polar components of MM-GBSA and MM-PBSA, but not the electrostatic components, showed strong correlation to the ITC free energy; the computed entropies did not correlate with the ITC entropy.
PMCID: PMC4058328  PMID: 24032517
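The rank-ordering comparisons in entry 20 are reported as Spearman ρ values. A small sketch of that statistic is given below: it ranks both series (averaging ranks for ties) and takes the Pearson correlation of the ranks. The function names and the free energy values are hypothetical and are not data from the study.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i..j, converted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Pearson correlation of the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical experimental and computed binding free energies (kcal/mol).
experimental = [-9.1, -7.4, -8.2, -6.0, -10.3]
computed = [-35.0, -28.0, -25.0, -20.0, -41.0]
print(round(spearman_rho(experimental, computed), 3))
```

Because only ranks enter the statistic, a method can rank-order well even when its absolute free energies are far from experiment, which is the distinction the abstract draws.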
21.  Conditional Probabilistic Analysis for Prediction of the Activity Landscape and Relative Compound Activities 
Journal of chemical information and modeling  2013;53(10):10.1021/ci400243e.
Structure-property relationships and structure-activity relationships play an important role in many research areas, such as medicinal chemistry and drug discovery. Such methods, however, have focused on providing post-hoc descriptions of such relationships based on known data. The ability for these descriptions to remain relevant when considering compounds of unknown activity, and thus the prediction of activity and property landscapes using existing data, remain little explored. In this study, we present a novel method of evaluating the ability of a compound comparison methodology to provide accurate information about a set of unknown compounds, and also explore the ability of these predicted activity landscapes to prioritize active compounds over inactive. These methods are applied to three distinct and diverse sets of compounds, each with activity data for multiple targets, for a total of eight target-compound set pairs. Six methodologically distinct compound comparison methods were evaluated. We show that overall, all compound comparison methods provided an improvement in structural-activity relationship prediction over random and were able to prioritize compounds in a superior manner to random sampling, but the degree of success and therefore applicability varied markedly.
PMCID: PMC3850180  PMID: 23971977
conditional probability; molecular representation; property landscapes; structure-activity relationships
22.  In silico enzymatic synthesis of a 400,000 compound biochemical database for non-targeted metabolomics 
Journal of chemical information and modeling  2013;53(9):10.1021/ci400368v.
Current methods of structure identification in mass spectrometry based non-targeted metabolomics rely on matching experimentally determined features of an unknown compound to those of candidate compounds contained in biochemical databases. A major limitation of this approach is the relatively small number of compounds currently included in these databases. If the correct structure is not present in a database it cannot be identified, and if it cannot be identified it cannot be included in a database. Thus, there is an urgent need to augment metabolomics databases with rationally designed biochemical structures using alternative means. In this study, we present a database of in silico enzymatically synthesized metabolites (IIMDB) to partially address this problem. The database, which is available from, includes ~23,000 known compounds (mammalian metabolites, drugs, secondary plant metabolites and glycerophospholipids) collected from existing biochemical databases plus more than 400,000 computationally generated human phase I and phase II metabolites of these known compounds. The IIMDB database features a user-friendly web interface and a programmer-friendly RESTful web service. Ninety-five percent of the computationally generated metabolites in IIMDB were not found in any existing database. However, 21,640 were identical to compounds already listed in PubChem, HMDB, KEGG or HumanCyc. Furthermore, a vast majority of these in silico metabolites were scored as biological using BioSM, a software program that identifies biochemical structures in chemical structure space. These results suggest that in silico biochemical synthesis represents a viable approach for significantly augmenting biochemical databases for non-targeted metabolomics applications.
PMCID: PMC3819714  PMID: 23991755
metabolomics; mass spectrometry; in silico structure generation; biochemical databases
23.  Ligand Binding Site Detection by Local Structure Alignment and Its Performance Complementarity 
Journal of chemical information and modeling  2013;53(9):10.1021/ci4003602.
Accurate determination of potential ligand binding sites (BS) is a key step for protein function characterization and structure-based drug design. Despite promising results of template-based BS prediction methods using global structure alignment (GSA), there is room to improve performance by properly incorporating local structure alignment (LSA), because BS are local structures and are often similar for proteins with dissimilar global folds. We present a template-based ligand BS prediction method using G-LoSA, our LSA tool. A large benchmark set validation shows that G-LoSA predicts drug-like ligands’ positions in single-chain protein targets more precisely than TM-align, a GSA-based method, while the overall success rate of TM-align is better. G-LoSA is particularly efficient for accurate detection of local structures conserved across proteins with diverse global topologies. Recognizing the performance complementarity of G-LoSA to TM-align and a non-template geometry-based method, fpocket, we developed a robust consensus scoring method, CMCS-BSP (Complementary Methods and Consensus Scoring for ligand Binding Site Prediction), which improves prediction accuracy. The G-LoSA source code is freely available at
PMCID: PMC3821077  PMID: 23957286
template-based method; G-LoSA; global structure alignment; pocket shape; computer-aided drug design
24.  Molecular Simulations of Aromatase Reveal New Insights Into the Mechanism of Ligand Binding 
CYP19A1, also known as aromatase or estrogen synthetase, is the rate-limiting enzyme in the biosynthesis of estrogens from their corresponding androgens. Several clinically used breast cancer therapies target aromatase. In this work, explicitly solvated all-atom molecular dynamics simulations of aromatase with a model of the lipid bilayer and the transmembrane helix are performed. The dynamics of aromatase and the role of titration of an important amino acid residue involved in aromatization of androgens are investigated via two 250-ns simulations. One simulation treats the protonated form of the catalytic aspartate 309, which appears more consistent with crystallographic data for the active site, while the simulation of the deprotonated form shows some notable conformational shifts. Ensemble-based computational solvent mapping experiments indicate possible novel druggable binding sites that could be utilized by next-generation inhibitors. In addition, the effects of protonation on the ligand positioning and channel dynamics are investigated using geometrical models that estimate the opening width of critical channels. The protonated and deprotonated trajectories exhibit significant differences in channel dynamics, suggesting that the mechanism for substrate and product entry and the aromatization process may be coupled to a “locking” mechanism and channel opening. Our results may be particularly relevant to the design of novel drugs that may be useful as treatments for cancers such as those of the breast and prostate.
PMCID: PMC3787069  PMID: 23927370
25.  Automated large-scale file preparation, docking, and scoring: Evaluation of ITScore and STScore using the 2012 Community Structure-Activity Resource Benchmark 
In this study, we use the recently released 2012 Community Structure-Activity Resource (CSAR) Dataset to evaluate two knowledge-based scoring functions, ITScore and STScore, and a simple force-field-based potential (VDWScore). The CSAR Dataset contains 757 compounds, most with known affinities, and 57 crystal structures. With the help of the script files for docking preparation, we use the full CSAR Dataset to evaluate the performances of the scoring functions on binding affinity prediction and active/inactive compound discrimination. The CSAR subset that includes crystal structures is used as well, to evaluate the performances of the scoring functions on binding mode and affinity predictions. Within this structure subset, we investigate the importance of accurate ligand and protein conformational sampling and find that the binding affinity predictions are less sensitive to non-native ligand and protein conformations than the binding mode predictions. We also find the full CSAR Dataset to be more challenging in making binding mode predictions than the subset with structures. The script files used for preparing the CSAR Dataset for docking, including scripts for canonicalization of the ligand atoms, are offered freely to the academic community.
PMCID: PMC3755023  PMID: 23656179
CSAR; community structure-activity resource; protein-ligand docking; knowledge-based scoring functions
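The two evaluation tasks named in entry 25, binding affinity prediction and active/inactive discrimination, are commonly summarized with a correlation coefficient and a ROC AUC. The sketch below illustrates those two metrics only; it is not the authors' code, and it assumes (as an illustrative convention, not one stated in the abstract) that a lower score means stronger predicted binding. The function names and the toy numbers are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def roc_auc(scores, is_active):
    """ROC AUC via pairwise comparison of actives against inactives.

    Assumption: lower score = stronger predicted binding, so an active
    "wins" a pair when its score is lower. Ties count as half a win.
    """
    actives = [s for s, a in zip(scores, is_active) if a]
    inactives = [s for s, a in zip(scores, is_active) if not a]
    if not actives or not inactives:
        raise ValueError("need at least one active and one inactive")
    wins = 0.0
    for a in actives:
        for i in inactives:
            if a < i:
                wins += 1.0
            elif a == i:
                wins += 0.5
    return wins / (len(actives) * len(inactives))


if __name__ == "__main__":
    # Hypothetical docking scores and measured affinities (pKd), for illustration only.
    toy_scores = [-9.1, -7.4, -8.0, -5.2, -6.3]
    toy_pkd = [8.5, 6.9, 7.2, 4.8, 6.0]
    toy_active = [True, True, True, False, False]
    print("Pearson r (score vs. pKd):", round(pearson(toy_scores, toy_pkd), 3))
    print("ROC AUC (active vs. inactive):", roc_auc(toy_scores, toy_active))
```

Under the lower-is-better assumption, a useful scoring function gives a strongly negative correlation with pKd and an AUC close to 1; an AUC near 0.5 indicates no discrimination.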
