Few environments have been developed or deployed to widely share biomolecular simulation data or to enable collaborative networks to facilitate data exploration and reuse. As the amount and complexity of data generated by these simulations are dramatically increasing and the methods are being more widely applied, the need for new tools to manage and share this data has become obvious. In this paper we present the results of a process aimed at assessing the community's needs for data representation standards to guide the implementation of future repositories for biomolecular simulations.
We introduce a list of common data elements, inspired by previous work, and updated according to feedback from the community collected through a survey and personal interviews. These data elements integrate the concepts for multiple types of computational methods, including quantum chemistry and molecular dynamics. The identified core data elements were organized into a logical model to guide the design of new databases and application programming interfaces. Finally, a set of dictionaries was implemented for use via SQL queries or locally via a Java API built upon the Apache Lucene text-search engine.
The model and its associated dictionaries provide a simple yet rich representation of the concepts related to biomolecular simulations, which should guide future developments of repositories and more complex terminologies and ontologies. The model remains extensible through the decomposition of virtual experiments into tasks and parameter sets, and via the use of extended attributes. The benefits of a common logical model for biomolecular simulations were illustrated through various use cases, including data storage, indexing, and presentation. All the models and dictionaries introduced in this paper are available for download at http://ibiomes.chpc.utah.edu/mediawiki/index.php/Downloads.
Biomolecular simulations; Molecular dynamics; Computational chemistry; Data model; Repository; XML; UML
The Chemistry Development Kit (CDK) is an open source Java library for manipulating and processing chemical information. A key aspect of handling chemical structures is the determination of the chemical rings. The rings of a structure are used in areas including descriptors, stereochemistry, similarity, screening and atom typing. The CDK includes multiple algorithms for determining the rings of a structure on demand. Non-unique descriptions of rings have often been used due to the slower performance of the unique alternatives.
Efficient algorithms for handling chemical ring perception have been implemented and optimised in the CDK. The algorithms provide much faster computation of new and existing types of rings. Several optimisation and implementation considerations are discussed which improve real case usage. The performance is measured on several publicly available data sets and in several cases the new implementations were found to be more than an order of magnitude faster.
Algorithmic improvements allow much larger datasets to be handled in reasonable time. Faster computation allows more appropriate rings to be utilised in procedures such as aromaticity perception. Several areas that require ring perception have also seen a noticeable improvement. The time taken to compute the unique rings is now comparable to that of the non-unique alternatives, allowing correct usage throughout the toolkit. All source code is open source and freely available.
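The classical starting point for ring perception is a cycle basis built from a spanning tree: every edge that closes against the tree yields one fundamental ring. The CDK's optimised algorithms (e.g. for relevant and unique rings) go well beyond this, but the basic idea can be sketched in a few lines of pure Python:

```python
def cycle_basis(adjacency):
    """Return a cycle basis of an undirected molecular graph (atoms as
    nodes, bonds as edges). Each non-tree edge found during a DFS closes
    one fundamental ring, recovered by walking both endpoints of the
    edge up the tree to their common ancestor."""
    parent, depth, seen_edges, rings = {}, {}, set(), []
    for root in adjacency:
        if root in parent:
            continue
        parent[root], depth[root] = None, 0
        stack = [root]
        while stack:
            u = stack.pop()
            for v in adjacency[u]:
                edge = frozenset((u, v))
                if edge in seen_edges:
                    continue
                seen_edges.add(edge)
                if v not in parent:              # tree edge
                    parent[v], depth[v] = u, depth[u] + 1
                    stack.append(v)
                else:                            # non-tree edge closes a ring
                    path_u, path_v = [u], [v]
                    a, b = u, v
                    while a != b:                # climb to the common ancestor
                        if depth[a] >= depth[b]:
                            a = parent[a]
                            path_u.append(a)
                        else:
                            b = parent[b]
                            path_v.append(b)
                    rings.append(path_u + path_v[::-1][1:])
    return rings

# benzene as a plain adjacency list: one six-membered ring
benzene = {0: [1, 5], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4, 0]}
rings = cycle_basis(benzene)
```

Production ring perception must additionally make the basis unique and canonical, which is exactly where the performance work described above pays off.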
Rings; Cycles; CDK
With the rapid development of high-throughput genomic technologies and the accumulation of genome-wide datasets for gene expression profiling and biological networks, the impact of diseases and drugs on gene expression can be comprehensively characterized. Drug repositioning offers the possibility of reduced risk in the drug discovery process; it is thus an essential step in drug development.
Computational prediction of drug-disease interactions using gene expression profiling datasets and biological networks is a new direction in drug repositioning that has gained increasing interest. We developed a computational framework to build disease-drug networks using drug- and disease-specific subnetworks. The framework incorporates protein networks to refine drug and disease associated genes and prioritize genes in disease and drug specific networks. For each drug and disease we built multiple networks using gene expression profiling and text mining. Finally a logistic regression model was used to build functional associations between drugs and diseases.
We found that representing drugs and diseases by genes with high centrality degree in gene networks is the most promising representation of drug or disease subnetworks.
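The centrality-based representation amounts to ranking genes by their number of interaction partners and keeping the top-ranked ones as the drug or disease subnetwork. A minimal sketch (the gene names and edges below are illustrative, not from the study's networks):

```python
from collections import Counter

def degree_centrality(edges):
    """Count interaction partners per gene in a protein network; the
    genes with the highest degree are kept as the representative
    subnetwork, as in the centrality-based representation above."""
    deg = Counter()
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return deg

# toy protein-interaction edges (illustrative gene symbols)
edges = [("TP53", "MDM2"), ("TP53", "EGFR"), ("TP53", "AKT1"), ("EGFR", "AKT1")]
deg = degree_centrality(edges)
top_genes = [g for g, _ in deg.most_common(2)]  # highest-degree genes
```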
Disease; Drug; Gene; Protein networks
In order to exploit the vast body of currently inaccessible chemical information held in Electronic Laboratory Notebooks (ELNs) it is necessary not only to make it available but also to develop protocols for discovery, access and ultimately automatic processing. An aim of the Dial-a-Molecule Grand Challenge Network is to be able to draw on the body of accumulated chemical knowledge in order to predict or optimize the outcome of reactions. Accordingly, the Network convened a working group comprising informaticians, software developers and stakeholders from industry and academia to develop protocols and mechanisms to access and process ELN records. The work presented here constitutes the first stage of this process by proposing a tiered metadata system of knowledge, information and processing, where each tier in turn addresses a) discovery, indexing and citation; b) context and access to additional information; and c) content access and manipulation. A compact set of metadata terms, called the elnItemManifest, has been derived and caters for the knowledge layer of this model. The elnItemManifest has been encoded as an XML schema and some use cases are presented to demonstrate the potential of this approach.
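The knowledge layer can be illustrated with a hypothetical manifest instance. The element names below are placeholders standing in for discovery and citation fields; they are not the actual metadata terms of the elnItemManifest schema:

```python
import xml.etree.ElementTree as ET

# Hypothetical elnItemManifest-style instance; the element names are
# illustrative placeholders, not the schema's actual metadata terms.
manifest_xml = """\
<elnItemManifest>
  <identifier>eln-entry-0001</identifier>
  <creator>A. Chemist</creator>
  <title>Suzuki coupling, batch 3</title>
  <created>2013-05-14</created>
</elnItemManifest>"""

root = ET.fromstring(manifest_xml)
# flatten the knowledge-layer fields for indexing and citation
record = {child.tag: child.text for child in root}
```

A consumer at the knowledge layer would harvest only such records for discovery and indexing, following links to the information and processing layers on demand.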
Cardiovascular disease (CVD) is the leading cause of death and is associated with multiple risk factors. Herbal medicines have long been used to treat CVD in China, and several natural products or their derivatives (e.g., aspirin and reserpine) are among the most common drugs all over the world. The objective of this work was to construct a systematic database for drug discovery based on natural products isolated from CVD-related medicinal herbs and to investigate the mechanisms of action of herbal medicines.
The cardiovascular disease herbal database (CVDHD) was designed to be a comprehensive resource for virtual screening and drug discovery from natural products isolated from medicinal herbs for cardiovascular-related diseases. CVDHD comprises 35230 distinct molecules and their identification information (chemical name, CAS registry number, molecular formula, molecular weight, international chemical identifier (InChI) and SMILES), calculated molecular properties (AlogP, numbers of hydrogen bond acceptors and donors, etc.), docking results between all molecules and 2395 target proteins, cardiovascular-related diseases, pathways and clinical biomarkers. All 3D structures were optimized in the MMFF94 force field and can be freely accessed.
CVDHD integrates medicinal herbs, natural products, CVD-related target proteins, docking results, diseases and clinical biomarkers. By using the methods of virtual screening and network pharmacology, CVDHD will provide a platform to streamline drug/lead discovery from natural products and explore the mechanisms of action of medicinal herbs. CVDHD is freely available at http://pkuxxj.pku.edu.cn/CVDHD.
Cardiovascular disease; Drug discovery; Network pharmacology; Molecular docking; Virtual screening; Herbal formula; Natural products; Medicinal herbs; Traditional Chinese medicine
‘Phylogenetic trees’ are commonly used for the analysis of chemogenomics datasets and to relate protein targets to each other, based on the (shared) bioactivities of their ligands. However, no real assessment as to the suitability of this representation has been performed yet in this area. We aimed to address this shortcoming in the current work, as exemplified by a kinase data set, given the importance of kinases in many diseases as well as the availability of large-scale datasets for analysis. In this work, we analyzed a dataset comprising 157 compounds, which have been tested at concentrations of 1 μM and 10 μM against a panel of 225 human protein kinases in full-matrix experiments, aiming to explain kinase promiscuity and selectivity against inhibitors. Compounds were described by chemical features, which were used to represent kinases (i.e. each kinase had an active set of features and an inactive set).
Using this representation, a bioactivity-based classification was made of the kinome, which partially resembles previous sequence-based classifications, in which kinases from the TK, CDK, CLK and AGC branches in particular cluster together. However, we were also able to show that in approximately 57% of cases, on average 6 kinase inhibitors exhibit activity against kinases which are located at a large distance in the sequence-based classification (at a relative distance of 0.6 – 0.8 on a scale from 0 to 1), but are correctly located closer to each other in our bioactivity-based tree (distance 0 – 0.4). Despite this improvement over the sequence-based classification, the bioactivity-based classification also needed further attention: for approximately 80% of all analyzed kinases, kinases classified as neighbors according to the bioactivity-based classification also show high SAR similarity (i.e. a high fraction of shared active compounds and therefore, interaction with similar inhibitors). However, in the remaining ~20% of cases a clear relationship between kinase bioactivity profile similarity and shared active compounds could not be established, which is in agreement with previously published atypical SAR (such as for LCK, FGFR1, AKT2, DAPK1, TGFR1, MK12 and AKT1).
In this work we were hence able to show that (1) it is difficult to establish neighborhood relationships for targets (here kinases) with few shared activities, and (2) phylogenetic tree representations make implicit assumptions (i.e. that neighboring kinases exhibit similar interaction profiles with inhibitors) that are not always suitable for analyses of bioactivity space. While both points have been implicitly alluded to before, this is, to the best of the authors' knowledge, the first study that explores both points on a comprehensive basis. Excluding kinases with few shared activities improved the situation greatly (the percentage of kinases for which no neighborhood relationship could be established dropped from 20% to only 4%). We conclude that all of the above findings need to be taken into account when performing chemogenomics analyses, also for other target classes.
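The SAR similarity used above — the fraction of shared active compounds between two kinases — is a Tanimoto coefficient over active-compound sets. A minimal sketch (the kinase and compound identifiers below are illustrative):

```python
def sar_similarity(actives_a, actives_b):
    """Tanimoto similarity between the active-compound sets of two
    kinases: the fraction of shared active compounds used to check
    whether bioactivity-tree neighbours bind similar inhibitors."""
    a, b = set(actives_a), set(actives_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

# toy active-compound sets for two hypothetical kinases
kinase_1 = {"cmpd01", "cmpd02", "cmpd03", "cmpd04"}
kinase_2 = {"cmpd02", "cmpd03", "cmpd04", "cmpd05"}
shared = sar_similarity(kinase_1, kinase_2)  # 3 shared of 5 total
```

Kinases with very few actives make this ratio unstable, which is exactly why excluding them improved the neighborhood analysis above.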
Kinase inhibitor; Selectivity; Phylogenetics; Chemogenomics; Polypharmacology
Research in organic chemistry generates samples of novel chemicals together with their properties and other related data. The involved scientists must be able to store this data and search it by chemical structure. There are commercial solutions for common needs like chemical registration systems or electronic lab notebooks. However, for the specific requirements of in-house databases and processes no such solutions exist. Another issue is that commercial solutions carry the risk of vendor lock-in and may require an expensive license for a proprietary relational database management system. To speed up and simplify the development of applications that require chemical structure search capabilities, I have developed Molecule Database Framework. The framework abstracts the storing and searching of chemical structures into method calls. Therefore software developers do not require extensive knowledge about chemistry and the underlying database cartridge. This decreases application development time.
Molecule Database Framework is written in Java and I created it by integrating existing free and open-source tools and frameworks. The core functionality includes:
• Support for multi-component compounds (mixtures)
• Import and export of SD-files
• Optional security (authorization)
For chemical structure searching, Molecule Database Framework leverages the capabilities of the Bingo Cartridge for PostgreSQL and provides type-safe searching, caching, transactions and optional method-level security.
Furthermore, the design of the entity classes and the reasoning behind it are explained. By means of a simple web application, I describe how the framework can be used. I then benchmarked this example application to establish basic performance expectations for chemical structure searches and for import and export of SD-files.
Using a simple web application, it was shown that Molecule Database Framework successfully abstracts chemical structure searches and SD-file import and export into simple method calls. The framework offers good search performance on a standard laptop without any database tuning, in part because chemical structure searches are paged and cached. Molecule Database Framework is available for download on the project's page on Bitbucket: https://bitbucket.org/kienerj/moleculedatabaseframework.
Chemical structure search; Database; Framework; Open-source
A consensus model combining partial least squares (PLS) and similarity-based k-nearest neighbors (KNN), utilizing 3D-SDAR (three-dimensional spectral data-activity relationship) fingerprint descriptors, was developed for prediction of the log(1/EC50) values of a dataset of 94 aryl hydrocarbon receptor binders. Multiple validation techniques (Y-scrambling, complete training/test set randomization, determination of the dependence of R2test on the number of randomization cycles, etc.) aimed at improving the reliability of the modeling process were utilized and their effect on the statistical parameters of the models was evaluated. The consensus model was constructed from a PLS model utilizing 10 ppm × 10 ppm × 0.5 Å bins and 7 latent variables (R2test of 0.617), and a KNN model using 2 ppm × 2 ppm × 0.5 Å bins and 6 neighbors (R2test of 0.622). Compared to the individual models, an improvement in predictive performance of approximately 10.5% (R2test of 0.685) was observed. Further experiments indicated that this improvement is likely an outcome of the complementarity of the information contained in 3D-SDAR matrices of different granularity. For similarly sized data sets of aryl hydrocarbon receptor (AhR) binders, the consensus KNN and PLS models compare favorably to earlier reports. The ability of 3D-QSDAR (three-dimensional quantitative spectral data-activity relationship) to provide structural interpretation was illustrated by a projection of the most frequently occurring bins onto the standard coordinate space, thus allowing identification of structural features related to toxicity.
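In its simplest form, a consensus prediction is the average of the member models' outputs. A sketch with toy values (the published consensus may weight or combine its PLS and KNN members differently):

```python
def consensus_prediction(pls_pred, knn_pred):
    """Unweighted average of the two member models' predictions — the
    simplest way to build a PLS/KNN consensus; complementary errors of
    the members tend to cancel, which is the effect described above."""
    return [(p + k) / 2.0 for p, k in zip(pls_pred, knn_pred)]

# toy log(1/EC50) predictions from the two member models
pls = [6.1, 5.4, 7.0]
knn = [6.5, 5.0, 7.2]
cons = consensus_prediction(pls, knn)
```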
QSAR; Molecular descriptors; Quantitative spectral data-activity relationship (3D-QSDAR); Estrogen receptor binding; Molecular modeling
To study the chemical determinants of small molecule transport inside cells, it is crucial to visualize relationships between the chemical structure of small molecules and their associated subcellular distribution patterns. For this purpose, we experimented with cells incubated with a synthetic combinatorial library of fluorescent, membrane-permeant small molecule chemical agents. With an automated high content screening instrument, the intracellular distribution patterns of these chemical agents were microscopically captured in image data sets, and analyzed off-line with machine vision and cheminformatics algorithms. Nevertheless, it remained challenging to interpret correlations linking the structure and properties of chemical agents to their subcellular localization patterns in large numbers of cells, captured across large number of images.
To address this challenge, we constructed a Multidimensional Online Virtual Image Display (MOVID) visualization platform using off-the-shelf hardware and software components. For analysis, the image data set acquired from cells incubated with a combinatorial library of fluorescent molecular probes was sorted based on quantitative relationships between the chemical structures, physicochemical properties or predicted subcellular distribution patterns. MOVID enabled visual inspection of the sorted, multidimensional image arrays: Using a multipanel desktop liquid crystal display (LCD) and an avatar as a graphical user interface, the resolution of the images was automatically adjusted to the avatar’s distance, allowing the viewer to rapidly navigate through high resolution image arrays, zooming in and out of the images to inspect and annotate individual cells exhibiting interesting staining patterns. In this manner, MOVID facilitated visualization and interpretation of quantitative structure-localization relationship studies. MOVID also facilitated direct, intuitive exploration of the relationship between the chemical structures of the probes and their microscopic, subcellular staining patterns.
MOVID can provide a practical, graphical user interface and computer-assisted image data visualization platform to facilitate bioimage data mining and cheminformatics analysis of high content, phenotypic screening experiments.
Machine vision; Cheminformatics; Virtual reality; Data mining; Optical probes; Multivariate analysis; Human-computer interaction; Graphical user interface
Fingerprint similarity is a common method for comparing chemical structures. Similarity is an appealing approach because, with many fingerprint types, it provides intuitive results: a chemist looking at two molecules can understand why they have been determined to be similar. This transparency is partially lost with the fuzzier similarity methods that are often used for scaffold hopping and tends to vanish completely when molecular fingerprints are used as inputs to machine-learning (ML) models. Here we present similarity maps, a straightforward and general strategy to visualize the atomic contributions to the similarity between two molecules or the predicted probability of a ML model. We show the application of similarity maps to a set of dopamine D3 receptor ligands using atom-pair and circular fingerprints as well as two popular ML methods: random forests and naïve Bayes. An open-source implementation of the method is provided.
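The atomic weights behind a similarity map can be computed by measuring how much the similarity to a reference molecule drops when one atom's fingerprint bits are removed. A toy sketch of that idea (the bit assignments below are invented, not real atom-pair or circular fingerprint bits):

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of fingerprint bits."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

def atomic_contributions(ref_bits, atom_bits):
    """Weight each probe atom by the drop in similarity to the
    reference when that atom's bits are removed — the core idea
    behind similarity maps. Positive weights mark atoms that drive
    the similarity; negative weights mark atoms that hurt it."""
    probe_bits = set().union(*atom_bits.values())
    base = tanimoto(ref_bits, probe_bits)
    weights = {}
    for atom in atom_bits:
        remaining = set().union(
            *(bits for a, bits in atom_bits.items() if a != atom))
        weights[atom] = base - tanimoto(ref_bits, remaining)
    return weights

# toy data: reference fingerprint and per-atom bits of a probe molecule
ref = {1, 2, 3, 4}
atoms = {0: {1, 2}, 1: {3}, 2: {9}}
w = atomic_contributions(ref, atoms)
```

In the published method these weights are then drawn as a colored contour over the 2D depiction; the same leave-one-atom-out scheme extends to ML models by replacing the similarity with the predicted probability.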
Visualization; Machine-learning; Similarity; Fingerprints
While a large body of work exists on comparing and benchmarking of descriptors of molecular structures, a similar comparison of protein descriptor sets is lacking. Hence, in the current work a total of 13 different protein descriptor sets have been compared with respect to their behavior in perceiving similarities between amino acids. The descriptor sets included in the study are Z-scales (3 variants), VHSE, T-scales, ST-scales, MS-WHIM, FASGAI and BLOSUM, and a novel protein descriptor set termed ProtFP (4 variants). We investigate to which extent descriptor sets show collinear as well as orthogonal behavior via principal component analysis (PCA).
In describing amino acid similarities, MS-WHIM, T-scales and ST-scales show related behavior, as do the VHSE, FASGAI, and ProtFP (PCA3) descriptor sets. Conversely, the ProtFP (PCA5), ProtFP (PCA8), Z-scales (Binned), and BLOSUM descriptor sets show behavior that is distinct from one another as well as from both of the clusters above. Generally, the use of more principal components (>3 per amino acid, per descriptor) leads to significant differences in the way amino acids are described, even though the later principal components capture less variation per component of the original input data.
In this work a comparison is provided of how similarly (and differently) currently available amino acid descriptor sets behave when converting structure to property space. The results obtained enable molecular modelers to select suitable amino acid descriptor sets for structure-activity analyses, e.g. those showing complementary behavior.
GPCR; HIV; QSAM; Peptides; Amino acid index; Protein descriptor; Polypharmacology
2D diagrams are widely used in the scientific literature to represent interactions between ligands and biomacromolecules. Such schematic diagrams are very helpful to better understand the chemical interactions and biological processes in which ligands are involved. Here, a new tool for automatic and interactive generation of 2D diagrams for biomacromolecule/ligand interactions is presented. LeView (Ligand-Environment Viewer) produces customised and high-quality figures, with a good compromise between a faithful representation of the 3D data (structures and interactions) and aesthetic criteria. LeView can be freely downloaded at http://www.pegase-biosciences.com/tools/leview/.
Ligand; Interaction; Tool; Display; 2D diagram
Working with small-molecule datasets is a routine task for cheminformaticians and chemists. The analysis and comparison of vendor catalogues and the compilation of promising candidates as starting points for screening campaigns are but a few very common applications. The workflows applied for this purpose usually consist of multiple basic cheminformatics tasks such as checking for duplicates or filtering by physico-chemical properties. Pipelining tools make it possible to create and change such workflows without much effort, but usually do not support interventions once the pipeline has been started. In many contexts, however, the best suited workflow is not known in advance, making it necessary to take the results of the previous steps into consideration before proceeding.
To support intuition-driven processing of compound collections, we developed MONA, an interactive tool designed to prepare and visualize large small-molecule datasets. Using an SQL database, common cheminformatics tasks such as analysis and filtering can be performed interactively with various methods for visual support. Great care was taken to create a simple, intuitive user interface which can be used instantly without any setup steps. MONA combines the interactivity of molecule database systems with the simplicity of pipelining tools, thus enabling the case-to-case application of chemistry expert knowledge. The current version is available free of charge for academic use and can be downloaded at http://www.zbh.uni-hamburg.de/mona.
In the last decade the standard Naive Bayes (SNB) algorithm has been widely employed in multi-class classification problems in cheminformatics. This popularity is mainly due to the fact that the algorithm is simple to implement and in many cases yields respectable classification results. Using clever heuristic arguments "anchored" by insightful cheminformatics knowledge, Xia et al. simplified the SNB algorithm further and termed it the Laplacian Corrected Modified Naive Bayes (LCMNB) approach, which has been widely used in cheminformatics since its publication.
In this note we mathematically illustrate the conditions under which Xia et al.’s simplification holds. It is our hope that this clarification could help Naive Bayes practitioners in deciding when it is appropriate to employ the LCMNB algorithm to classify large chemical datasets.
A general formulation that subsumes the simplified Naive Bayes version is presented. Unlike the widely used NB method, the Standard Naive Bayes description presented in this work is discriminative (not generative) in nature, which may lead to further applications of the SNB method.
Starting from a standard Naive Bayes (SNB) algorithm, we have derived mathematically the relationship between Xia et al.’s ingenious, but heuristic algorithm, and the SNB approach. We have also demonstrated the conditions under which Xia et al.’s crucial assumptions hold. We therefore hope that the new insight and recommendations provided can be found useful by the cheminformatics community.
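The family of algorithms under discussion can be made concrete with a Laplacian-smoothed Naive Bayes over binary fingerprint bits. This is a generic sketch of the SNB setting, not the specific feature weighting of the LCMNB variant; the training data are invented:

```python
import math
from collections import defaultdict

def train_nb(samples):
    """Fit a Bernoulli Naive Bayes over binary fingerprint bits.
    `samples` is a list of (set_of_on_bits, class_label) pairs."""
    class_count = defaultdict(int)
    bit_count = defaultdict(lambda: defaultdict(int))
    bits_seen = set()
    for bits, label in samples:
        class_count[label] += 1
        for b in bits:
            bit_count[label][b] += 1
            bits_seen.add(b)
    return class_count, bit_count, bits_seen

def predict(model, bits):
    """Pick the class with the highest smoothed log-posterior."""
    class_count, bit_count, bits_seen = model
    total = sum(class_count.values())
    best, best_score = None, float("-inf")
    for label, n in class_count.items():
        score = math.log(n / total)           # log prior
        for b in bits_seen:
            # Laplacian correction: +1 / +2 keeps unseen bits non-zero
            p = (bit_count[label][b] + 1.0) / (n + 2.0)
            score += math.log(p if b in bits else 1.0 - p)
        if score > best_score:
            best, best_score = label, score
    return best

# toy fingerprints: two "active" and two "inactive" compounds
samples = [({1, 2}, "active"), ({1, 3}, "active"),
           ({7}, "inactive"), ({8}, "inactive")]
model = train_nb(samples)
```

The LCMNB simplification discussed above effectively drops some of these terms under conditions the note makes precise.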
Naive Bayes; Laplacian Corrected Modified Naive Bayes; Classifications; Cheminformatics
Channels and pores in biomacromolecules (proteins, nucleic acids and their complexes) play significant biological roles, e.g., in molecular recognition and enzyme substrate specificity.
We present an advanced software tool entitled MOLE 2.0, which has been designed to analyze molecular channels and pores. Benchmark tests against other available software tools showed that MOLE 2.0 is by comparison quicker, more robust and more versatile. As a new feature, MOLE 2.0 estimates physicochemical properties of the identified channels, i.e., hydropathy, hydrophobicity, polarity, charge, and mutability. We also assessed the variability in physicochemical properties of eighty X-ray structures of two members of the cytochrome P450 superfamily.
Estimated physicochemical properties of the identified channels in the selected biomacromolecules corresponded well with the known functions of the respective channels. Thus, the predicted physicochemical properties may provide useful information about the potential functions of identified channels. The MOLE 2.0 software is available at http://mole.chemi.muni.cz.
Channels; Tunnels; Pores; Protein structures; Cytochrome P450; CAM; BM3
Many Protein Data Bank (PDB) users assume that the deposited structural models are of high quality but forget that these models are derived from the interpretation of experimental data. The accuracy of atom coordinates is not homogeneous between models or throughout the same model. To avoid basing a research project on a flawed model, we present a tool for assessing the quality of ligands and binding sites in crystallographic models from the PDB.
The Validation HElper for LIgands and Binding Sites (VHELIBS) is software that aims to ease the validation of binding site and ligand coordinates for non-crystallographers (i.e., users with little or no crystallography knowledge). Using a convenient graphical user interface, it allows one to check how ligand and binding site coordinates fit to the electron density map. VHELIBS can use models from either the PDB or the PDB_REDO databank of re-refined and re-built crystallographic models. The user can specify threshold values for a series of properties related to the fit of coordinates to electron density (Real Space R, Real Space Correlation Coefficient and average occupancy are used by default). VHELIBS will automatically classify residues and ligands as Good, Dubious or Bad based on the specified limits. The user is also able to visually check the quality of the fit of residues and ligands to the electron density map and reclassify them if needed.
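The Good/Dubious/Bad logic amounts to checking each residue against user-specified thresholds on fit-to-density metrics. A sketch in the spirit of VHELIBS (the default threshold values below are illustrative, not the tool's actual defaults):

```python
def classify_residue(rsr, rscc, occupancy,
                     rsr_max=0.24, rscc_min=0.9, occ_min=1.0):
    """Classify a residue or ligand from its fit-to-density metrics:
    Good if every check passes, Bad if every check fails, Dubious
    otherwise. Thresholds here are illustrative defaults only."""
    checks = [rsr <= rsr_max,        # Real Space R: lower is better
              rscc >= rscc_min,      # Real Space Correlation Coefficient
              occupancy >= occ_min]  # average occupancy
    if all(checks):
        return "Good"
    if not any(checks):
        return "Bad"
    return "Dubious"
```

In the tool itself the user can reclassify any residue after visually inspecting its fit to the electron density map.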
VHELIBS allows inexperienced users to examine the binding site and the ligand coordinates in relation to the experimental data. This is an important step to evaluate models for their fitness for drug discovery purposes such as structure-based pharmacophore development and protein-ligand docking experiments.
Electron density map; Binding site structure validation; Ligand structure validation; Protein structure validation; PDB; PDB_REDO
The number and diversity of wrappers for chemoinformatic toolkits suggests the diverse needs of the chemoinformatic community. While existing chemoinformatics libraries provide a broad range of utilities, many chemoinformaticians find compiled-language libraries intimidating, time-consuming, arcane, and verbose. Although high-level language wrappers have been implemented, more can be done to leverage the intuitiveness of object-orientation, the paradigms of high-level languages, and the extensibility of languages such as Ruby. We introduce Rubabel, an intuitive, object-oriented suite of functionality that substantially increases the accessibility of the tools in the Open Babel chemoinformatics library.
Rubabel requires fewer lines of code than any other actively developed wrapper, providing better object organization and navigation, and more intuitive object behavior than extant solutions. Moreover, Rubabel provides a convenient interface to the many extensions currently available in Ruby, greatly streamlining otherwise onerous tasks such as creating web applications that serve up Rubabel functionality.
Rubabel is powerful, intuitive, concise, freely available, cross-platform, and easy to install. We expect it to be a platform of choice for new users, Ruby users, and some users of current solutions.
Chemoinformatics; Open Babel; Ruby
The rapid access to intrinsic physicochemical properties of molecules is highly desired for large scale chemical data mining explorations such as mass spectrum prediction in metabolomics, toxicity risk assessment and drug discovery. Large volumes of data are being produced by quantum chemistry calculations, which provide increasingly accurate estimates of several properties, e.g. by Density Functional Theory (DFT), but are still too computationally expensive for those large scale uses. This work explores the possibility of using large amounts of data generated by DFT methods for thousands of molecular structures, extracting relevant molecular properties and applying machine learning (ML) algorithms to learn from the data. Once trained, these ML models can be applied to new structures to produce ultra-fast predictions. An approach is presented for homolytic bond dissociation energy (BDE).
Machine learning models were trained with a data set of >12,000 BDEs calculated by B3LYP/6-311++G(d,p)//DFTB. Descriptors were designed to encode atom types and connectivity in the 2D topological environment of the bonds. The best model, an Associative Neural Network (ASNN) based on 85 bond descriptors, was able to predict the BDE of 887 bonds in an independent test set (covering a range of 17.67–202.30 kcal/mol) with RMSD of 5.29 kcal/mol, mean absolute deviation of 3.35 kcal/mol, and R2 = 0.953. The predictions were compared with semi-empirical PM6 calculations, and were found to be superior for all types of bonds in the data set, except for O-H, N-H, and N-N bonds. The B3LYP/6-311++G(d,p)//DFTB calculations can approach the higher-level calculations B3LYP/6-311++G(3df,2p)//B3LYP/6-31G(d,p) with an RMSD of 3.04 kcal/mol, which is less than the RMSD of ASNN (against both DFT methods). An experimental web service for on-line prediction of BDEs is available at http://joao.airesdesousa.com/bde.
Knowledge could be automatically extracted by machine learning techniques from a data set of calculated BDEs, providing ultra-fast access to accurate estimations of DFT-calculated BDEs. This demonstrates how value can be extracted from the large volumes of data currently being produced by quantum chemistry calculations at increasing speed, mostly without human intervention. In this way, high-level theoretical quantum calculations can be used in large-scale applications that could otherwise not afford the intrinsic computational cost.
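Descriptors that encode atom types and connectivity in the 2D topological environment of a bond can be sketched as counts of element types at each graph distance from the bond. This is a simplified stand-in for the 85 published bond descriptors, shown on a toy heavy-atom graph of ethanol:

```python
from collections import Counter, deque

def bond_descriptor(adjacency, elements, bond, radius=2):
    """Count element types at each topological distance from a bond:
    a simple stand-in for 2D bond-environment descriptors (the
    published descriptor set encodes considerably more detail)."""
    u, v = bond
    dist = {u: 0, v: 0}          # both bond atoms are at distance 0
    queue = deque([u, v])
    while queue:                 # breadth-first expansion up to `radius`
        a = queue.popleft()
        if dist[a] >= radius:
            continue
        for b in adjacency[a]:
            if b not in dist:
                dist[b] = dist[a] + 1
                queue.append(b)
    return dict(Counter((d, elements[a]) for a, d in dist.items()))

# ethanol heavy atoms C-C-O; describe the environment of the C-O bond
adjacency = {0: [1], 1: [0, 2], 2: [1]}
elements = {0: "C", 1: "C", 2: "O"}
desc = bond_descriptor(adjacency, elements, (1, 2))
```

Vectors of such counts, one per bond, are what an ML model like the ASNN above can be trained on.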
BDE; Bond dissociation energy; Neural network; Random forest; Machine learning; Chemoinformatics; DFT; DFTB; Big data
The World Anti-Doping Agency (WADA) publishes the Prohibited List, a manually compiled international standard of substances and methods prohibited in-competition, out-of-competition and in particular sports. It would be ideal to be able to identify all substances that have one or more performance-enhancing pharmacological actions in an automated, fast and cost effective way. Here, we use experimental data derived from the ChEMBL database (~7,000,000 activity records for 1,300,000 compounds) to build a database model that takes into account both structure and experimental information, and use this database to predict both on-target and off-target interactions between these molecules and targets relevant to doping in sport.
The ChEMBL database was screened and eight well populated categories of activities (Ki, Kd, EC50, ED50, activity, potency, inhibition and IC50) were used for a rule-based filtering process to define the labels “active” or “inactive”. The “active” compounds for each of the ChEMBL families were thereby defined and these populated our bioactivity-based filtered families. A structure-based clustering step was subsequently performed in order to split families with more than one distinct chemical scaffold. This produced refined families, whose members share both a common chemical scaffold and bioactivity against a common target in ChEMBL.
We have used the Parzen-Rosenblatt machine-learning approach to test whether compounds in ChEMBL can be correctly predicted to belong to their appropriate refined families. Validation tests using the refined families gave a significant increase in predictivity compared with the filtered or the original families. Out of 61,660 queries in our Monte Carlo cross-validation, belonging to 19,639 refined families, 41,300 (66.98%) had the parent family as the top prediction and 53,797 (87.25%) had the parent family within the top four hits. Having thus validated our approach, we used it to identify the protein targets associated with the WADA prohibited classes. For compounds without experimental data, we use their computed patterns of interaction with protein targets to make predictions of bioactivity. We hope that other groups will test these predictions experimentally in the future.
Protein target prediction; Polypharmacology; Machine learning; Side effects; Multi-label prediction; Drugs in sport; Drug repurposing
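The Parzen-Rosenblatt step above amounts to a kernel density classifier: each refined family is scored by averaging a Gaussian kernel over its members' descriptor vectors, and the query compound is assigned to the highest-scoring family. A minimal, stdlib-only sketch under that assumption (the toy descriptor vectors and bandwidth `h` are illustrative, not derived from ChEMBL):

```python
import math

def gaussian_kernel(x, m, h):
    # Gaussian kernel between query x and family member m, bandwidth h
    d2 = sum((xi - mi) ** 2 for xi, mi in zip(x, m))
    return math.exp(-d2 / (2.0 * h * h))

def parzen_score(x, members, h):
    # Parzen-Rosenblatt density estimate of x under one family's members
    return sum(gaussian_kernel(x, m, h) for m in members) / len(members)

def rank_families(x, families, h=0.5):
    # Return family labels sorted by descending density score
    scores = {name: parzen_score(x, members, h) for name, members in families.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy 2-D descriptors standing in for fingerprint-derived features
families = {
    "family_A": [(0.1, 0.2), (0.2, 0.1), (0.15, 0.15)],
    "family_B": [(0.9, 0.8), (0.8, 0.9)],
}
ranking = rank_families((0.12, 0.18), families)
```

In this sketch the top-ranked entry of `ranking` plays the role of the "parent family as the top prediction" statistic reported in the cross-validation.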
Existing computational methods for drug repositioning rely either solely on the gene expression response of cell lines after treatment, or on drug-to-disease relationships, merging several levels of information. However, the noisy nature of gene expression data and the scarcity of genomic data for many diseases are important limitations of such approaches. Here we focused on a drug-centered approach by predicting the therapeutic class of FDA-approved compounds, without considering data about the diseases. We propose a novel computational approach to predict drug repositioning based on state-of-the-art machine-learning algorithms. We have integrated multiple layers of information: i) distances between drugs based on the similarity of their chemical structures, ii) the proximity of their targets within the protein-protein interaction network, and iii) the correlation of gene expression patterns after treatment. Our classifier reaches a high accuracy level (78%), allowing us to re-interpret the top misclassifications as re-classifications after rigorous statistical evaluation. Efficient drug repurposing has the potential to significantly impact the whole field of drug development. The results presented here could significantly accelerate the translation of known compounds into the clinic for novel therapeutic uses.
Drug repositioning; Connectivity map; CMap; ATC code; Mode of action; Machine learning; SVM; Integrative genomics; SMILES; Anthelmintics; Antineoplastic; Oxamniquine; Niclosamide
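The integration of the three information layers can be illustrated by fusing per-layer distances into one combined distance and classifying with a simple nearest-neighbour rule. Note this is a sketch only: the paper uses an SVM classifier, and the layer weights and toy distances below are illustrative assumptions:

```python
def combined_distance(a, b, layers, weights):
    # layers: layer name -> {(drug, drug): distance in [0, 1]}
    # Weighted average over chemical-structure, PPI-target and expression layers
    total = sum(weights.values())
    return sum(w * layers[name][(a, b)] for name, w in weights.items()) / total

def predict_nearest(query, labelled, layers, weights):
    # Assign the label of the drug closest to `query` under the fused distance
    return min(labelled, key=lambda d: combined_distance(query, d, layers, weights))

# Toy distances between a query drug "q" and two labelled drugs
layers = {
    "chemical":   {("q", "d1"): 0.2, ("q", "d2"): 0.9},
    "ppi":        {("q", "d1"): 0.3, ("q", "d2"): 0.8},
    "expression": {("q", "d1"): 0.4, ("q", "d2"): 0.7},
}
weights = {"chemical": 1.0, "ppi": 1.0, "expression": 1.0}
nearest = predict_nearest("q", ["d1", "d2"], layers, weights)
```

The equal weights are a neutral default; in practice the relative contribution of each layer would be tuned during model validation.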
Herbal medicine has long been viewed as a valuable source for potential new drug discovery, and the metabolites of herbal ingredients, especially in vivo metabolites, often show better pharmacological, pharmacokinetic and even safety profiles than their parent compounds. However, this herbal metabolite information is still scattered and waiting to be collected.
The HIM database provides the most comprehensive manually curated collection to date of available in vivo metabolism information for herbal active ingredients, as well as their corresponding bioactivity, organ and/or tissue distribution, toxicity, ADME and clinical research profiles. Currently HIM contains 361 ingredients and 1104 corresponding in vivo metabolites from 673 reputable herbs. Tools for structural similarity, substructure search and Lipinski's Rule of Five are also provided. Links are made to PubChem, PubMed, TCM-ID (Traditional Chinese Medicine Information Database) and HIT (Herbal Ingredients' Targets database).
The curated database HIM stores in vivo metabolite information for the active ingredients of Chinese herbs, together with their corresponding bioactivity, toxicity and ADME profiles. HIM is freely accessible to academic researchers at http://www.bioinformatics.org.cn/.
TCM; In vivo; Metabolism; Metabolite; Biotransformation; Structure search
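Of the tools HIM provides, Lipinski's Rule of Five is the simplest to state precisely: a compound is flagged when its molecular weight exceeds 500 Da, its logP exceeds 5, it has more than 5 hydrogen-bond donors, or more than 10 hydrogen-bond acceptors. A minimal filter of this kind can be sketched as follows (the property values are supplied by the caller, e.g. from a database record):

```python
def lipinski_violations(mol_weight, logp, h_donors, h_acceptors):
    # Count violations of Lipinski's Rule of Five:
    # MW <= 500 Da, logP <= 5, H-bond donors <= 5, H-bond acceptors <= 10
    rules = [mol_weight <= 500, logp <= 5, h_donors <= 5, h_acceptors <= 10]
    return sum(1 for ok in rules if not ok)

def passes_rule_of_five(mol_weight, logp, h_donors, h_acceptors):
    # By convention a compound "passes" with at most one violation
    return lipinski_violations(mol_weight, logp, h_donors, h_acceptors) <= 1
```

For example, aspirin (MW ~180.2, logP ~1.2, 1 donor, 4 acceptors) yields zero violations and passes.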
With the growing popularity of QSAR predictions for regulatory purposes, such predictive models are now required to be strictly validated, an essential feature of which is a clearly defined Applicability Domain (AD). Although several different approaches have been proposed in recent years to address this goal, no optimal approach to defining a model's AD has yet been recognized.
This study proposes a novel descriptor-based AD method that accounts for the data distribution and exploits the k-Nearest Neighbours (kNN) principle to derive a heuristic decision rule. The proposed method is a three-stage procedure addressing several key aspects relevant to judging the reliability of QSAR predictions. Inspired by the adaptive kernel method for probability density function estimation, the first stage defines a pattern of thresholds corresponding to the various training samples; these thresholds are later used to derive the decision rule. The criterion deciding whether a given test sample is retained within the AD is defined in the second stage. Finally, the last stage reflects the reliability of the derived results, taking model statistics and prediction error into account.
The proposed approach presents a novel strategy that integrates the kNN principle to define the AD of QSAR models. Relevant features that characterize the proposed AD approach include: a) adaptability to the local density of samples, useful when the underlying multivariate distribution is asymmetric, with wide regions of low data density; b) unlike several kernel density estimators (KDE), effectiveness also in high-dimensional spaces; c) low sensitivity to the smoothing parameter k; and d) versatility in implementing various distance measures. The results derived on a case study provide a clear understanding of how the approach works and defines the model's AD for reliable predictions.
QSAR; Applicability domain; kNN; Nearest neighbour; Model validation
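The first two stages of a kNN-based AD scheme can be sketched in a few lines. In this simplified reading (an assumption, not the paper's exact rule), each training sample receives an adaptive threshold equal to its mean distance to its k nearest training neighbours, and a test sample is retained in the AD if it falls within the threshold of at least one training sample:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def training_thresholds(train, k):
    # Stage 1: per-sample threshold = mean distance to its k nearest neighbours,
    # so thresholds adapt to the local density of the training data
    thresholds = []
    for i, s in enumerate(train):
        dists = sorted(euclidean(s, t) for j, t in enumerate(train) if j != i)
        thresholds.append(sum(dists[:k]) / k)
    return thresholds

def inside_ad(x, train, thresholds):
    # Stage 2: retain x in the AD if it falls within the threshold
    # of at least one training sample
    return any(euclidean(x, s) <= t for s, t in zip(train, thresholds))

# Toy descriptor space: four training samples at the corners of a unit square
train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
th = training_thresholds(train, k=2)
```

A test point at the centre of the square falls inside the AD, while a point far outside it is rejected; the third stage (prediction-error weighting) would refine this binary decision.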
Similarity-search methods using molecular fingerprints are an important tool for ligand-based virtual screening. A huge variety of fingerprints exist, and their performance, usually assessed in retrospective benchmarking studies using data sets with known actives and known or assumed inactives, depends largely on the validation data sets and the similarity measure used. Comparing new methods to existing ones in any systematic way is difficult due to the lack of standard data sets and evaluation procedures. Here, we present a standard platform for benchmarking 2D fingerprints. The open-source platform contains all source code, structural data for the actives and inactives used (drawn from three publicly available collections of data sets), and lists of randomly selected query molecules to be used for statistically valid comparisons of methods. This allows the exact reproduction and comparison of results in future studies. The results for 12 standard fingerprints and two simple baseline fingerprints, assessed by seven evaluation methods, are shown together with the correlations between methods. High correlations were found between the 12 fingerprints, and a careful statistical analysis showed that only the two baseline fingerprints differed from the others in a statistically significant way. High correlations were also found between six of the seven evaluation methods, indicating that despite their seeming differences, many of these methods are similar to each other.
Virtual screening; Benchmark; Similarity; Fingerprints
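The core operation being benchmarked here is a similarity search: rank a database of fingerprints by their Tanimoto (Jaccard) coefficient against a query fingerprint. A stdlib-only sketch, representing each 2D fingerprint as the set of its on-bit positions (the toy fingerprints are illustrative, not drawn from the benchmark data sets):

```python
def tanimoto(fp_a, fp_b):
    # Tanimoto coefficient between two fingerprints given as sets of on-bits:
    # |intersection| / |union|
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def rank_by_similarity(query, database):
    # Rank database molecules by descending Tanimoto similarity to the query
    return sorted(database, key=lambda name: tanimoto(query, database[name]), reverse=True)

query = {1, 4, 7, 9}
database = {
    "active_1": {1, 4, 7, 8},   # shares 3 of 5 union bits -> 0.60
    "decoy_1":  {2, 3, 5},      # no shared bits -> 0.00
    "active_2": {1, 4, 9},      # shares 3 of 4 union bits -> 0.75
}
ranking = rank_by_similarity(query, database)
```

Evaluation methods such as those in the platform then score how highly the known actives appear in rankings like this one across many queries.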