Search tips
Search criteria

Results 1-25 (544)

Clipboard (0)

Select a Filter Below

Year of Publication
1.  Bridging experiment and theory: a template for unifying NMR data and electronic structure calculations 
The testing of theoretical models with experimental data is an integral part of the scientific method, and a logical place to search for new ways of stimulating scientific productivity. Often experiment/theory comparisons may be viewed as a workflow comprised of well-defined, rote operations distributed over several distinct computers, as exemplified by the way in which predictions from electronic structure theories are evaluated with results from spectroscopic experiments. For workflows such as this, which may be laborious and time consuming to perform manually, software that could orchestrate the operations and transfer results between computers in a seamless and automated fashion would offer major efficiency gains. Such tools also promise to alter how researchers interact with data outside their field of specialization by, e.g., making raw experimental results more accessible to theorists, and the outputs of theoretical calculations more readily comprehended by experimentalists.
An implementation of an automated workflow has been developed for the integrated analysis of data from nuclear magnetic resonance (NMR) experiments and electronic structure calculations. Kepler (Altintas et al. 2004) open source software was used to coordinate the processing and transfer of data at each step of the workflow. This workflow incorporated several open source software components, including electronic structure code to compute NMR parameters, a program to simulate NMR signals, NMR data processing programs, and others. The Kepler software was found to be sufficiently flexible to address several minor implementation challenges without recourse to other software solutions. The automated workflow was demonstrated with data from a \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$^{17}\hbox {O}$$\end{document}17O NMR study of uranyl salts described previously (Cho et al. in J Chem Phys 132:084501, 2010).
The functional implementation of an automated process linking NMR data with electronic structure predictions demonstrates that modern software tools such as Kepler can be used to construct programs that comprehensively manage complex, multi-step scientific workflows spanning several different computers. Automation of the workflow can greatly accelerate the pace of discovery, and allows researchers to focus on the fundamental scientific questions rather than mastery of specialized software and data processing techniques. Future developments that would expand the scope and power of this approach include tools to standardize data and associated metadata formats, and the creation of interactive user interfaces to allow real-time exploration of the effects of program inputs on calculated outputs.
PMCID: PMC4748447  PMID: 26865863
Scientific workflow; NMR spectroscopy; Electronic structure theory
2.  Selectivity profiling of BCRP versus P-gp inhibition: from automated collection of polypharmacology data to multi-label learning 
The human ATP binding cassette transporters Breast Cancer Resistance Protein (BCRP) and Multidrug Resistance Protein 1 (P-gp) are co-expressed in many tissues and barriers, especially at the blood–brain barrier and at the hepatocyte canalicular membrane. Understanding their interplay in affecting the pharmacokinetics of drugs is of prime interest. In silico tools to predict inhibition and substrate profiles towards BCRP and P-gp might serve as early filters in the drug discovery and development process. However, to build such models, pharmacological data must be collected for both targets, which is a tedious task, often involving manual and poorly reproducible steps.
Compounds with inhibitory activity measured against BCRP and/or P-gp were retrieved by combining Open Data and manually curated data from literature using a KNIME workflow. After determination of compound overlap, machine learning approaches were used to establish multi-label classification models for BCRP/P-gp. Different ways of addressing multi-label problems are explored and compared: label-powerset, binary relevance and classifiers chain. Label-powerset revealed important molecular features for selective or polyspecific inhibitory activity. In our dataset, only two descriptors (the numbers of hydrophobic and aromatic atoms) were sufficient to separate selective BCRP inhibitors from selective P-gp inhibitors. Also, dual inhibitors share properties with both groups of selective inhibitors. Binary relevance and classifiers chain allow improving the predictivity of the models.
The KNIME workflow proved a useful tool to merge data from diverse sources. It could be used for building multi-label datasets of any set of pharmacological targets for which there is data available either in the open domain or in-house. By applying various multi-label learning algorithms, important molecular features driving transporter selectivity could be retrieved. Finally, using the dataset with missing annotations, predictive models can be derived in cases where no accurate dense dataset is available (not enough data overlap or no well balanced class distribution).Graphical abstract.
Electronic supplementary material
The online version of this article (doi:10.1186/s13321-016-0121-y) contains supplementary material, which is available to authorized users.
PMCID: PMC4743411  PMID: 26855674
BCRP; P-glycoprotein; Open Data; Multi-label classification; Binary relevance; Classifiers chain; Selective inhibition; Polyspecific inhibition; KNIME; Open PHACTS
3.  Fragmentation trees reloaded 
Untargeted metabolomics commonly uses liquid chromatography mass spectrometry to measure abundances of metabolites; subsequent tandem mass spectrometry is used to derive information about individual compounds. One of the bottlenecks in this experimental setup is the interpretation of fragmentation spectra to accurately and efficiently identify compounds. Fragmentation trees have become a powerful tool for the interpretation of tandem mass spectrometry data of small molecules. These trees are determined from the data using combinatorial optimization, and aim at explaining the experimental data via fragmentation cascades. Fragmentation tree computation does not require spectral or structural databases. To obtain biochemically meaningful trees, one needs an elaborate optimization function (scoring).
We present a new scoring for computing fragmentation trees, transforming the combinatorial optimization into a Maximum A Posteriori estimator. We demonstrate the superiority of the new scoring for two tasks: both for the de novo identification of molecular formulas of unknown compounds, and for searching a database for structurally similar compounds, our method SIRIUS 3, performs significantly better than the previous version of our method, as well as other methods for this task.
SIRIUS 3 can be a part of an untargeted metabolomics workflow, allowing researchers to investigate unknowns using automated computational methods.Graphical abstractWe present a new scoring for computing fragmentation trees from tandem mass spectrometry data based on Bayesian statistics. The best scoring fragmentation tree most likely explains the molecular formula of the measured parent ion
PMCID: PMC4736045  PMID: 26839597
Mass spectrometry; Metabolites; Natural products ; Computational methods; Fragmentation trees
4.  ADMET evaluation in drug discovery: 15. Accurate prediction of rat oral acute toxicity using relevance vector machine and consensus modeling 
Determination of acute toxicity, expressed as median lethal dose (LD50), is one of the most important steps in drug discovery pipeline. Because in vivo assays for oral acute toxicity in mammals are time-consuming and costly, there is thus an urgent need to develop in silico prediction models of oral acute toxicity.
In this study, based on a comprehensive data set containing 7314 diverse chemicals with rat oral LD50 values, relevance vector machine (RVM) technique was employed to build the regression models for the prediction of oral acute toxicity in rate, which were compared with those built using other six machine learning approaches, including k-nearest-neighbor regression, random forest (RF), support vector machine, local approximate Gaussian process, multilayer perceptron ensemble, and eXtreme gradient boosting. A subset of the original molecular descriptors and structural fingerprints (PubChem or SubFP) was chosen by the Chi squared statistics. The prediction capabilities of individual QSAR models, measured by qext2 for the test set containing 2376 molecules, ranged from 0.572 to 0.659.
Considering the overall prediction accuracy for the test set, RVM with Laplacian kernel and RF were recommended to build in silico models with better predictivity for rat oral acute toxicity. By combining the predictions from individual models, four consensus models were developed, yielding better prediction capabilities for the test set (qext2 = 0.669–0.689). Finally, some essential descriptors and substructures relevant to oral acute toxicity were identified and analyzed, and they may be served as property or substructure alerts to avoid toxicity. We believe that the best consensus model with high prediction accuracy can be used as a reliable virtual screening tool to filter out compounds with high rat oral acute toxicity. Graphical abstractWorkflow of combinatorial QSAR modelling to predict rat oral acute toxicity
PMCID: PMC4736633  PMID: 26839598
5.  MetFrag relaunched: incorporating strategies beyond in silico fragmentation 
The in silico fragmenter MetFrag, launched in 2010, was one of the first approaches combining compound database searching and fragmentation prediction for small molecule identification from tandem mass spectrometry data. Since then many new approaches have evolved, as has MetFrag itself. This article details the latest developments to MetFrag and its use in small molecule identification since the original publication.
MetFrag has gone through algorithmic and scoring refinements. New features include the retrieval of reference, data source and patent information via ChemSpider and PubChem web services, as well as InChIKey filtering to reduce candidate redundancy due to stereoisomerism. Candidates can be filtered or scored differently based on criteria like occurence of certain elements and/or substructures prior to fragmentation, or presence in so-called “suspect lists”. Retention time information can now be calculated either within MetFrag with a sufficient amount of user-provided retention times, or incorporated separately as “user-defined scores” to be included in candidate ranking. The changes to MetFrag were evaluated on the original dataset as well as a dataset of 473 merged high resolution tandem mass spectra (HR-MS/MS) and compared with another open source in silico fragmenter, CFM-ID. Using HR-MS/MS information only, MetFrag2.2 and CFM-ID had 30 and 43 Top 1 ranks, respectively, using PubChem as a database. Including reference and retention information in MetFrag2.2 improved this to 420 and 336 Top 1 ranks with ChemSpider and PubChem (89 and 71 %), respectively, and even up to 343 Top 1 ranks (PubChem) when combining with CFM-ID. The optimal parameters and weights were verified using three additional datasets of 824 merged HR-MS/MS spectra in total. Further examples are given to demonstrate flexibility of the enhanced features.
In many cases additional information is available from the experimental context to add to small molecule identification, which is especially useful where the mass spectrum alone is not sufficient for candidate selection from a large number of candidates. The results achieved with MetFrag2.2 clearly show the benefit of considering this additional information. The new functions greatly enhance the chance of identification success and have been incorporated into a command line interface in a flexible way designed to be integrated into high throughput workflows. Feedback on the command line version of MetFrag2.2 available at is welcome.
Electronic supplementary material
The online version of this article (doi:10.1186/s13321-016-0115-9) contains supplementary material, which is available to authorized users.
PMCID: PMC4732001  PMID: 26834843
Compound identification; In silico fragmentation; High resolution mass spectrometry; Metabolomics; Structure elucidation
6.  How frequently do clusters occur in hierarchical clustering analysis? A graph theoretical approach to studying ties in proximity 
Hierarchical cluster analysis (HCA) is a widely used classificatory technique in many areas of scientific knowledge. Applications usually yield a dendrogram from an HCA run over a given data set, using a grouping algorithm and a similarity measure. However, even when such parameters are fixed, ties in proximity (i.e. two equidistant clusters from a third one) may produce several different dendrograms, having different possible clustering patterns (different classifications). This situation is usually disregarded and conclusions are based on a single result, leading to questions concerning the permanence of clusters in all the resulting dendrograms; this happens, for example, when using HCA for grouping molecular descriptors to select that less similar ones in QSAR studies.
Representing dendrograms in graph theoretical terms allowed us to introduce four measures of cluster frequency in a canonical way, and use them to calculate cluster frequencies over the set of all possible dendrograms, taking all ties in proximity into account. A toy example of well separated clusters was used, as well as a set of 1666 molecular descriptors calculated for a group of molecules having hepatotoxic activity to show how our functions may be used for studying the effect of ties in HCA analysis. Such functions were not restricted to the tie case; the possibility of using them to derive cluster stability measurements on arbitrary sets of dendrograms having the same leaves is discussed, e.g. dendrograms from variations of HCA parameters. It was found that ties occurred frequently, some yielding tens of thousands of dendrograms, even for small data sets.
Our approach was able to detect trends in clustering patterns by offering a simple way of measuring their frequency, which is often very low. This would imply, that inferences and models based on descriptor classifications (e.g. QSAR) are likely to be biased, thereby requiring an assessment of their reliability. Moreover, any classification of molecular descriptors is likely to be far from unique. Our results highlight the need for evaluating the effect of ties on clustering patterns before classification results can be used accurately.Graphical abstractFour cluster contrast functions identifying statistically sound clusters within dendrograms considering ties in proximity
PMCID: PMC4727313  PMID: 26816532
Ties in proximity; Cluster stability; Hierarchical cluster analysis (HCA); Dendrogram; Cluster frequency; Molecular descriptor
7.  The development of models to predict melting and pyrolysis point data associated with several hundred thousand compounds mined from PATENTS 
Melting point (MP) is an important property in regards to the solubility of chemical compounds. Its prediction from chemical structure remains a highly challenging task for quantitative structure–activity relationship studies. Success in this area of research critically depends on the availability of high quality MP data as well as accurate chemical structure representations in order to develop models. Currently, available datasets for MP predictions have been limited to around 50k molecules while lots more data are routinely generated following the synthesis of novel materials. Significant amounts of MP data are freely available within the patent literature and, if it were available in the appropriate form, could potentially be used to develop predictive models.
We have developed a pipeline for the automated extraction and annotation of chemical data from published PATENTS. Almost 300,000 data points have been collected and used to develop models to predict melting and pyrolysis (decomposition) points using tools available on the OCHEM modeling platform ( A number of technical challenges were simultaneously solved to develop models based on these data. These included the handing of sparse data matrices with >200,000,000,000 entries and parallel calculations using 32 × 6 cores per task using 13 descriptor sets totaling more than 700,000 descriptors. We showed that models developed using data collected from PATENTS had similar or better prediction accuracy compared to the highly curated data used in previous publications. The separation of data for chemicals that decomposed rather than melting, from compounds that did undergo a normal melting transition, was performed and models for both pyrolysis and MPs were developed. The accuracy of the consensus MP models for molecules from the drug-like region of chemical space was similar to their estimated experimental accuracy, 32 °C. Last but not least, important structural features related to the pyrolysis of chemicals were identified, and a model to predict whether a compound will decompose instead of melting was developed.
We have shown that automated tools for the analysis of chemical information have reached a mature stage allowing for the extraction and collection of high quality data to enable the development of structure–activity relationship models. The developed models and data are publicly available at
Electronic supplementary material
The online version of this article (doi:10.1186/s13321-016-0113-y) contains supplementary material, which is available to authorized users.
PMCID: PMC4724158  PMID: 26807157
8.  vSDC: a method to improve early recognition in virtual screening when limited experimental resources are available 
In drug design, one may be confronted to the problem of finding hits for targets for which no small inhibiting molecules are known and only low-throughput experiments are available (like ITC or NMR studies), two common difficulties encountered in a typical academic setting. Using a virtual screening strategy like docking can alleviate some of the problems and save a considerable amount of time by selecting only top-ranking molecules, but only if the method is very efficient, i.e. when a good proportion of actives are found in the 1–10 % best ranked molecules.
The use of several programs (in our study, Gold, Surflex, FlexX and Glide were considered) shows a divergence of the results, which presents a difficulty in guiding the experiments. To overcome this divergence and increase the yield of the virtual screening, we created the standard deviation consensus (SDC) and variable SDC (vSDC) methods, consisting of the intersection of molecule sets from several virtual screening programs, based on the standard deviations of their ranking distributions.
SDC allowed us to find hits for two new protein targets by testing only 9 and 11 small molecules from a chemical library of circa 15,000 compounds. Furthermore, vSDC, when applied to the 102 proteins of the DUD-E benchmarking database, succeeded in finding more hits than any of the four isolated programs for 13–60 % of the targets. In addition, when only 10 molecules of each of the 102 chemical libraries were considered, vSDC performed better in the number of hits found, with an improvement of 6–24 % over the 10 best-ranked molecules given by the individual docking programs.Graphical abstractIn drug design, for a given target and a given chemical library, the results obtained with different virtual screening programs are divergent. So how to rationally guide the experimental tests, especially when only a few number of experiments can be made? The variable Standard Deviation Consensus (vSDC) method was developed to answer this issue. Left panel the vSDC principle consists of intersecting molecule sets, chosen on the basis of the standard deviations of their ranking distributions, obtained from various virtual screening programs. In this study Glide, Gold, FlexX and Surflex were used and tested on the 102 targets of the DUD-E database. Right panel Comparison of the average percentage of hits found with vSDC and each of the four programs, when only 10 molecules from each of the 102 chemical libraries of the DUD-E database were considered. On average, vSDC was capable of finding 38 % of the findable hits, against 34 % for Glide, 32 % for Gold, 16 % for FlexX and 14 % for Surflex, showing that with vSDC, it was possible to overcome the unpredictability of the virtual screening results and to improve them
Electronic supplementary material
The online version of this article (doi:10.1186/s13321-016-0112-z) contains supplementary material, which is available to authorized users.
PMCID: PMC4722699  PMID: 26807156
Virtual screening; Standard deviation consensus; vSDC; Gold; Glide; Surflex; FlexX; DUD-E; Calcineurin; Cdk2
9.  Smiles2Monomers: a link between chemical and biological structures for polymers 
The monomeric composition of polymers is powerful for structure comparison and synthetic biology, among others. Many databases give access to the atomic structure of compounds but the monomeric structure of polymers is often lacking. We have designed a smart algorithm, implemented in the tool Smiles2Monomers (s2m), to infer efficiently and accurately the monomeric structure of a polymer from its chemical structure.
Our strategy is divided into two steps: first, monomers are mapped on the atomic structure by an efficient subgraph-isomorphism algorithm ; second, the best tiling is computed so that non-overlapping monomers cover all the structure of the target polymer. The mapping is based on a Markovian index built by a dynamic programming algorithm. The index enables s2m to search quickly all the given monomers on a target polymer. After, a greedy algorithm combines the mapped monomers into a consistent monomeric structure. Finally, a local branch and cut algorithm refines the structure. We tested this method on two manually annotated databases of polymers and reconstructed the structures de novo with a sensitivity over 90 %. The average computation time per polymer is 2 s.
s2m automatically creates de novo monomeric annotations for polymers, efficiently in terms of time computation and sensitivity. s2m allowed us to detect annotation errors in the tested databases and to easily find the accurate structures. So, s2m could be integrated into the curation process of databases of small compounds to verify the current entries and accelerate the annotation of new polymers. The full method can be downloaded or accessed via a website for peptide-like polymers at abstract:.
Electronic supplementary material
The online version of this article (doi:10.1186/s13321-015-0111-5) contains supplementary material, which is available to authorized users.
PMCID: PMC4693424  PMID: 26715946
Polymer; Monomer; Peptide; Chemical structure ; Compound search; Algorithm; Graph
10.  Accurate and efficient target prediction using a potency-sensitive influence-relevance voter 
A number of algorithms have been proposed to predict the biological targets of diverse molecules. Some are structure-based, but the most common are ligand-based and use chemical fingerprints and the notion of chemical similarity. These methods tend to be computationally faster than others, making them particularly attractive tools as the amount of available data grows.
Using a ChEMBL-derived database covering 490,760 molecule-protein interactions and 3236 protein targets, we conduct a large-scale assessment of the performance of several target-prediction algorithms at predicting drug-target activity. We assess algorithm performance using three validation procedures: standard tenfold cross-validation, tenfold cross-validation in a simulated screen that includes random inactive molecules, and validation on an external test set composed of molecules not present in our database.
We present two improvements over current practice. First, using a modified version of the influence-relevance voter (IRV), we show that using molecule potency data can improve target prediction. Second, we demonstrate that random inactive molecules added during training can boost the accuracy of several algorithms in realistic target-prediction experiments. Our potency-sensitive version of the IRV (PS-IRV) obtains the best results on large test sets in most of the experiments. Models and software are publicly accessible through the chemoinformatics portal at
Electronic supplementary material
The online version of this article (doi:10.1186/s13321-015-0110-6) contains supplementary material, which is available to authorized users.
PMCID: PMC4696267  PMID: 26719774
Target-prediction; Large-scale; Fingerprints; Molecular potency; Random inactive molecules; Influence-relevance voter
11.  A reliable computational workflow for the selection of optimal screening libraries 
The experimental screening of compound collections is a common starting point in many drug discovery projects. Successes of such screening campaigns critically depend on the quality of the screened library. Many libraries are currently available from different vendors yet the selection of the optimal screening library for a specific project is challenging. We have devised a novel workflow for the rational selection of project-specific screening libraries.
The workflow accepts as input a set of virtual candidate libraries and applies the following steps to each library: (1) data curation; (2) assessment of ADME/T profile; (3) assessment of the number of promiscuous binders/frequent HTS hitters; (4) assessment of internal diversity; (5) assessment of similarity to known active compound(s) (optional); (6) assessment of similarity to in-house or otherwise accessible compound collections (optional). For ADME/T profiling, Lipinski’s and Veber’s rule-based filters were implemented and a new blood brain barrier permeation model was developed and validated (85 and 74 % success rate for training set and test set, respectively). Diversity and similarity descriptors which demonstrated best performances in terms of their ability to select either diverse or focused sets of compounds from three databases (Drug Bank, CMC and CHEMBL) were identified and used for diversity and similarity assessments. The workflow was used to analyze nine common screening libraries available from six vendors. The results of this analysis are reported for each library providing an assessment of its quality. Furthermore, a consensus approach was developed to combine the results of these analyses into a single score for selecting the optimal library under different scenarios.
We have devised and tested a new workflow for the rational selection of screening libraries under different scenarios. The current workflow was implemented using the Pipeline Pilot software yet due to the usage of generic components, it can be easily adapted and reproduced by computational groups interested in rational selection of screening libraries. Furthermore, the workflow could be readily modified to include additional components. This workflow has been routinely used in our laboratory for the selection of libraries in multiple projects and consistently selects libraries which are well balanced across multiple parameters.Graphical abstract.
Electronic supplementary material
The online version of this article (doi:10.1186/s13321-015-0108-0) contains supplementary material, which is available to authorized users.
PMCID: PMC4676138  PMID: 26692904
Diversity; Fingerprints; QSAR; Screening libraries; Similarity; Library selection
12.  ChemDes: an integrated web-based platform for molecular descriptor and fingerprint computation 
Molecular descriptors and fingerprints have been routinely used in QSAR/SAR analysis, virtual drug screening, compound search/ranking, drug ADME/T prediction and other drug discovery processes. Since the calculation of such quantitative representations of molecules may require substantial computational skills and efforts, several tools have been previously developed to make an attempt to ease the process. However, there are still several hurdles for users to overcome to fully harness the power of these tools. First, most of the tools are distributed as standalone software or packages that require necessary configuration or programming efforts of users. Second, many of the tools can only calculate a subset of molecular descriptors, and the results from multiple tools need to be manually merged to generate a comprehensive set of descriptors. Third, some packages only provide application programming interfaces and are implemented in different computer languages, which pose additional challenges to the integration of these tools.
A freely available web-based platform, named ChemDes, is developed in this study. It integrates multiple state-of-the-art packages (i.e., Pybel, CDK, RDKit, BlueDesc, Chemopy, PaDEL and jCompoundMapper) for computing molecular descriptors and fingerprints. ChemDes not only provides friendly web interfaces to relieve users from burdensome programming work, but also offers three useful and convenient auxiliary tools for format converting, MOPAC optimization and fingerprint similarity calculation. Currently, ChemDes has the capability of computing 3679 molecular descriptors and 59 types of molecular fingerprints.
ChemDes provides users an integrated and friendly tool to calculate various molecular descriptors and fingerprints. It is freely available at The source code of the project is also available as a supplementary file.
An overview of ChemDes. A platform for computing various molecular descriptors and fingerprints
PMCID: PMC4674923  PMID: 26664458
Molecular descriptors; Molecular fingerprints; Online descriptor calculation; QSAR/QSPR; Molecular representation; Chemoinformatics
13.  High-quality and universal empirical atomic charges for chemoinformatics applications 
Partial atomic charges describe the distribution of electron density in a molecule and therefore provide clues to the chemical behaviour of molecules. Recently, these charges have become popular in chemoinformatics, as they are informative descriptors that can be utilised in pharmacophore design, virtual screening, similarity searches etc. Especially conformationally-dependent charges perform very successfully. In particular, their fast and accurate calculation via the Electronegativity Equalization Method (EEM) seems very promising for chemoinformatics applications. Unfortunately, published EEM parameter sets include only parameters for basic atom types and they often miss parameters for halogens, phosphorus, sulphur, triple bonded carbon etc. Therefore their applicability for drug-like molecules is limited.
We have prepared six EEM parameter sets which enable the user to calculate EEM charges in a quality comparable to quantum mechanics (QM) charges based on the most common charge calculation schemes (i.e., MPA, NPA and AIM) and a robust QM approach (HF/6-311G, B3LYP/6-311G). The calculated EEM parameters exhibited very good quality on a training set (\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$R^2 > 0.9$$\end{document}R2>0.9) and also on a test set (\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$R^2 > 0.93$$\end{document}R2>0.93). They are applicable for at least 95 % of molecules in key drug databases (DrugBank, ChEMBL, Pubchem and ZINC) compared to less than 60 % of the molecules from these databases for which currently used EEM parameters are applicable.
We developed EEM parameters enabling the fast calculation of high-quality partial atomic charges for almost all drug-like molecules. In parallel, we provide a software solution for their easy computation ( It enables the direct application of EEM in chemoinformatics.
Electronic supplementary material
The online version of this article (doi:10.1186/s13321-015-0107-1) contains supplementary material, which is available to authorized users.
PMCID: PMC4667495  PMID: 26633997
Partial atomic charges; Electronegativity Equalization Method; EEM; Quantum mechanics; QM; Drug-like molecules
14.  A note on utilising binary features as ligand descriptors 
It is common in cheminformatics to represent the properties of a ligand as a string of 1’s and 0’s, with the intention of elucidating, inter alia, the relationship between the chemical structure of a ligand and its bioactivity. In this commentary we note that, where relevant but non-redundant features are binary, they inevitably lead to a classifier capable of capturing only a linear relationship between structural features and activity. If, instead, we were to use relevant but non-redundant real-valued features, the resulting predictive model would be capable of describing a non-linear structure-activity relationship. Hence, we suggest that real-valued features, where available, are to be preferred in this scenario.
PMCID: PMC4665894  PMID: 26628925
Binary descriptors; Ligand chemical structure; Linear relationship; Bernoulli distribution
15.  Accuracy enhancement in the estimation of molecular hydration free energies by implementing the intramolecular hydrogen bond effects 
The formation of intramolecular hydrogen bonds (IHBs) may induce the remarkable changes in molecular physicochemical properties. Within the framework of the extended solvent-contact model, we investigate the effect of implementing the IHB interactions on the accuracy in estimating the molecular hydration free energies.
The performances of hydration free energy functions including and excluding the IHB parameters are compared using the molecules distributed for SAMPL4 blind prediction challenge and those in Free Solvation Database (FSD). The calculated hydration free energies with IHB effects are found to be in considerably better agreement with the experimental data than those without them. For example, the root mean square error of the estimation decreases from 2.56 to 1.66 and from 1.73 to 1.54 kcal/mol for SAMPL4 and FSD molecules, respectively, due to the extension of atomic parameter space to cope with IHBs.
These improvements are made possible by reducing the overestimation of attractive interactions between water and the solute molecules involving IHBs. The modified hydration free energy function is thus anticipated to be useful for estimating the desolvation cost for various organic molecules.
Electronic supplementary material
The online version of this article (doi:10.1186/s13321-015-0106-2) contains supplementary material, which is available to authorized users.
PMCID: PMC4660792  PMID: 26613005
Hydration free energy; Solvent-contact model; Genetic algorithm; Intramolecular hydrogen bond
16.  BCL::Conf: small molecule conformational sampling using a knowledge based rotamer library 
The interaction of a small molecule with a protein target depends on its ability to adopt a three-dimensional structure that is complementary. Therefore, complete and rapid prediction of the conformational space a small molecule can sample is critical for both structure- and ligand-based drug discovery algorithms such as small molecule docking or three-dimensional quantitative structure–activity relationships. Here we have derived a database of small molecule fragments frequently sampled in experimental structures within the Cambridge Structure Database and the Protein Data Bank. Likely conformations of these fragments are stored as ‘rotamers’ in analogy to amino acid side chain rotamer libraries used for rapid sampling of protein conformational space. Explicit fragments take into account correlations between multiple torsion bonds and effect of substituents on torsional profiles. A conformational ensemble for small molecules can then be generated by recombining fragment rotamers with a Monte Carlo search strategy. BCL::Conf was benchmarked against other conformer generator methods including Confgen, Moe, Omega and RDKit in its ability to recover experimentally determined protein bound conformations of small molecules, diversity of conformational ensembles, and sampling rate. BCL::Conf recovers at least one conformation with a root mean square deviation of 2 Å or better to the experimental structure for 99 % of the small molecules in the Vernalis benchmark dataset. The ‘rotamer’ approach will allow integration of BCL::Conf into respective computational biology programs such as Rosetta.Graphical abstract:Conformation sampling is carried out using explicit fragment conformations derived from crystallographic structure databases. Molecules from the database are decomposed into fragments and most likely conformations/rotamers are used to sample correspondng sub-structure of a molecule of interest.
Electronic supplementary material
The online version of this article (doi:10.1186/s13321-015-0095-1) contains supplementary material, which is available to authorized users.
PMCID: PMC4607025  PMID: 26473018
Conformation sampling; Knowledge-based; Fragment-based; Rotamer-library
17.  CHEMDNER: The drugs and chemical names extraction challenge 
Journal of Cheminformatics  2015;7(Suppl 1):S1.
Natural language processing (NLP) and text mining technologies for the chemical domain (ChemNLP or chemical text mining) are key to improve the access and integration of information from unstructured data such as patents or the scientific literature. Therefore, the BioCreative organizers posed the CHEMDNER (chemical compound and drug name recognition) community challenge, which promoted the development of novel, competitive and accessible chemical text mining systems. This task allowed a comparative assessment of the performance of various methodologies using a carefully prepared collection of manually labeled text prepared by specially trained chemists as Gold Standard data. We evaluated two important aspects: one covered the indexing of documents with chemicals (chemical document indexing - CDI task), and the other was concerned with finding the exact mentions of chemicals in text (chemical entity mention recognition - CEM task). 27 teams (23 academic and 4 commercial, a total of 87 researchers) returned results for the CHEMDNER tasks: 26 teams for CEM and 23 for the CDI task. Top scoring teams obtained an F-score of 87.39% for the CEM task and 88.20% for the CDI task, a very promising result when compared to the agreement between human annotators (91%). The strategies used to detect chemicals included machine learning methods (e.g. conditional random fields) using a variety of features, chemistry and drug lexica, and domain-specific rules. We expect that the tools and resources resulting from this effort will have an impact in future developments of chemical text mining applications and will form the basis to find related chemical information for the detected entities, such as toxicological or pharmacogenomic properties.
PMCID: PMC4331685  PMID: 25810766
named entity recognition; BioCreative; text mining; chemical entity recognition; machine learning; chemical indexing; ChemNLP
18.  Recognition of chemical entities: combining dictionary-based and grammar-based approaches 
Journal of Cheminformatics  2015;7(Suppl 1):S10.
The past decade has seen an upsurge in the number of publications in chemistry. The ever-swelling volume of available documents makes it increasingly hard to extract relevant new information from such unstructured texts. The BioCreative CHEMDNER challenge invites the development of systems for the automatic recognition of chemicals in text (CEM task) and for ranking the recognized compounds at the document level (CDI task). We investigated an ensemble approach where dictionary-based named entity recognition is used along with grammar-based recognizers to extract compounds from text. We assessed the performance of ten different commercial and publicly available lexical resources using an open source indexing system (Peregrine), in combination with three different chemical compound recognizers and a set of regular expressions to recognize chemical database identifiers. The effect of different stop-word lists, case-sensitivity matching, and use of chunking information was also investigated. We focused on lexical resources that provide chemical structure information. To rank the different compounds found in a text, we used a term confidence score based on the normalized ratio of the term frequencies in chemical and non-chemical journals.
The use of stop-word lists greatly improved the performance of the dictionary-based recognition, but there was no additional benefit from using chunking information. A combination of ChEBI and HMDB as lexical resources, the LeadMine tool for grammar-based recognition, and the regular expressions, outperformed any of the individual systems. On the test set, the F-scores were 77.8% (recall 71.2%, precision 85.8%) for the CEM task and 77.6% (recall 71.7%, precision 84.6%) for the CDI task. Missed terms were mainly due to tokenization issues, poor recognition of formulas, and term conjunctions.
We developed an ensemble system that combines dictionary-based and grammar-based approaches for chemical named entity recognition, outperforming any of the individual systems that we considered. The system is able to provide structure information for most of the compounds that are found. Improved tokenization and better recognition of specific entity types is likely to further improve system performance.
PMCID: PMC4331686  PMID: 25810767
CHEMDNER; BioCreative; Named entity recognition; Chemical dictionaries; Chemical structure; MOL files; Chemical databases; Chemical identifiers; Chemical compounds; Drugs
19.  A CRF-based system for recognizing chemical entity mentions (CEMs) in biomedical literature 
Journal of Cheminformatics  2015;7(Suppl 1):S11.
In order to improve information access on chemical compounds and drugs (chemical entities) described in text repositories, it is very crucial to be able to identify chemical entity mentions (CEMs) automatically within text. The CHEMDNER challenge in BioCreative IV was specially designed to promote the implementation of corresponding systems that are able to detect mentions of chemical compounds and drugs, which has two subtasks: CDI (Chemical Document Indexing) and CEM.
Our system processing pipeline consists of three major components: pre-processing (sentence detection, tokenization), recognition (CRF-based approach), and post-processing (rule-based approach and format conversion). In our post-challenge system, the cost parameter in CRF model was optimized by 10-fold cross validation with grid search, and word representations feature induced by Brown clustering method was introduced. For the CEM subtask, our official runs were ranked in top position by obtaining maximum 88.79% precision, 69.08% recall and 77.70% balanced F-measure, which were improved further to 88.43% precision, 76.48% recall and 82.02% balanced F-measure in our post-challenge system.
In our system, instead of extracting a CEM as a whole, we regarded it as a sequence labeling problem. Though our current system has much room for improvement, our system is valuable in showing that the performance in term of balanced F-measure can be improved largely by utilizing large amounts of relatively inexpensive un-annotated PubMed abstracts and optimizing the cost parameter in CRF model. From our practice and lessons, if one directly utilizes some open-source natural language processing (NLP) toolkits, such as OpenNLP, Standford CoreNLP, false positive (FP) rate may be very high. It is better to develop some additional rules to minimize the FP rate if one does not want to re-train the related models. Our CEM recognition system is available at:
PMCID: PMC4331687  PMID: 25810768
Chemical Compound and Drug Name Recognition; Conditional Random Fields; Brown Clustering; Natural Language Processing; Word Representations
20.  Chemical entity extraction using CRF and an ensemble of extractors 
Journal of Cheminformatics  2015;7(Suppl 1):S12.
As we are witnessing a great interest in identifying and extracting chemical entities in academic articles, many approaches have been proposed to solve this problem. In this work we describe a probabilistic framework that allows for the output of multiple information extraction systems to be combined in a systematic way. The identified entities are assigned a probability score that reflects the extractors' confidence, without the need for each individual extractor to generate a probability score. We quantitively compared the performance of multiple chemical tokenizers to measure the effect of tokenization on extraction accuracy. Later, a single Conditional Random Fields (CRF) extractor that utilizes the best performing tokenizer is built using a unique collection of features such as word embeddings and Soundex codes, which, to the best of our knowledge, has not been explored in this context before,
The ensemble of multiple extractors outperforms each extractor's individual performance during the CHEMDNER challenge. When the runs were optimized to favor recall, the ensemble approach achieved the second highest recall on unseen entities. As for the single CRF model with novel features, the extractor achieves an F1 score of 83.3% on the test set, without any post processing or abbreviation matching.
Ensemble information extraction is effective when multiple stand alone extractors are to be used, and produces higher performance than individual off the shelf extractors. The novel features introduced in the single CRF model are sufficient to achieve very competitive F1 score using a simple standalone extractor.
PMCID: PMC4331688  PMID: 25810769
Information Extraction; Chemical Entity Extraction; Ensemble Learning; Conditional Random Field
21.  Improving chemical entity recognition through h-index based semantic similarity 
Journal of Cheminformatics  2015;7(Suppl 1):S13.
Our approach to the BioCreative IV challenge of recognition and classification of drug names (CHEMDNER task) aimed at achieving high levels of precision by applying semantic similarity validation techniques to Chemical Entities of Biological Interest (ChEBI) mappings. Our assumption is that the chemical entities mentioned in the same fragment of text should share some semantic relation. This validation method was further improved by adapting the semantic similarity measure to take into account the h-index of each ancestor. We applied this method in two measures, simUI and simGIC, and validated the results obtained for the competition, comparing each adapted measure to its original version.
For the competition, we trained a Random Forest classifier that uses various scores provided by our system, including semantic similarity, which improved the F-measure obtained with the Conditional Random Fields classifiers by 4.6%. Using a notion of concept relevance based on the h-index measure, we were able to enhance our validation process so that for a fixed recall, we increased precision by excluding from the results a higher amount of false positives. We plotted precision and recall values for a range of validation thresholds using different similarity measures, obtaining higher precision values for the same recall with the measures based on the h-index.
The semantic similarity measure we introduced was more efficient at validating text mining results from machine learning classifiers than other measures. We improved the results we obtained for the CHEMDNER task by maintaining high precision values while improving the recall and F-measure.
PMCID: PMC4331689  PMID: 25810770
Ontologies; Semantic Similarity; Named Entity Recognition; ChEBI
22.  Enhancing of chemical compound and drug name recognition using representative tag scheme and fine-grained tokenization 
Journal of Cheminformatics  2015;7(Suppl 1):S14.
The functions of chemical compounds and drugs that affect biological processes and their particular effect on the onset and treatment of diseases have attracted increasing interest with the advancement of research in the life sciences. To extract knowledge from the extensive literatures on such compounds and drugs, the organizers of BioCreative IV administered the CHEMical Compound and Drug Named Entity Recognition (CHEMDNER) task to establish a standard dataset for evaluating state-of-the-art chemical entity recognition methods.
This study introduces the approach of our CHEMDNER system. Instead of emphasizing the development of novel feature sets for machine learning, this study investigates the effect of various tag schemes on the recognition of the names of chemicals and drugs by using conditional random fields. Experiments were conducted using combinations of different tokenization strategies and tag schemes to investigate the effects of tag set selection and tokenization method on the CHEMDNER task.
This study presents the performance of CHEMDNER of three more representative tag schemes-IOBE, IOBES, and IOB12E-when applied to a widely utilized IOB tag set and combined with the coarse-/fine-grained tokenization methods. The experimental results thus reveal that the fine-grained tokenization strategy performance best in terms of precision, recall and F-scores when the IOBES tag set was utilized. The IOBES model with fine-grained tokenization yielded the best-F-scores in the six chemical entity categories other than the "Multiple" entity category. Nonetheless, no significant improvement was observed when a more representative tag schemes was used with the coarse or fine-grained tokenization rules. The best F-scores that were achieved using the developed system on the test dataset of the CHEMDNER task were 0.833 and 0.815 for the chemical documents indexing and the chemical entity mention recognition tasks, respectively.
The results herein highlight the importance of tag set selection and the use of different tokenization strategies. Fine-grained tokenization combined with the tag set IOBES most effectively recognizes chemical and drug names. To the best of the authors' knowledge, this investigation is the first comprehensive investigation use of various tag set schemes combined with different tokenization strategies for the recognition of chemical entities.
PMCID: PMC4331690  PMID: 25810771
23.  CheNER: a tool for the identification of chemical entities and their classes in biomedical literature 
Journal of Cheminformatics  2015;7(Suppl 1):S15.
Small chemical molecules regulate biological processes at the molecular level. Those molecules are often involved in causing or treating pathological states. Automatically identifying such molecules in biomedical text is difficult due to both, the diverse morphology of chemical names and the alternative types of nomenclature that are simultaneously used to describe them. To address these issues, the last BioCreAtIvE challenge proposed a CHEMDNER task, which is a Named Entity Recognition (NER) challenge that aims at labelling different types of chemical names in biomedical text.
To address this challenge we tested various approaches to recognizing chemical entities in biomedical documents. These approaches range from linear Conditional Random Fields (CRFs) to a combination of CRFs with regular expression and dictionary matching, followed by a post-processing step to tag those chemical names in a corpus of Medline abstracts. We named our best performing systems CheNER.
We evaluate the performance of the various approaches using the F-score statistics. Higher F-scores indicate better performance. The highest F-score we obtain in identifying unique chemical entities is 72.88%. The highest F-score we obtain in identifying all chemical entities is 73.07%. We also evaluate the F-Score of combining our system with ChemSpot, and find an increase from 72.88% to 73.83%.
CheNER presents a valid alternative for automated annotation of chemical entities in biomedical documents. In addition, CheNER may be used to derive new features to train newer methods for tagging chemical entities. CheNER can be downloaded from and included in text annotation pipelines.
PMCID: PMC4331691  PMID: 25810772
24.  The CHEMDNER corpus of chemicals and drugs and its annotation principles 
Journal of Cheminformatics  2015;7(Suppl 1):S2.
The automatic extraction of chemical information from text requires the recognition of chemical entity mentions as one of its key steps. When developing supervised named entity recognition (NER) systems, the availability of a large, manually annotated text corpus is desirable. Furthermore, large corpora permit the robust evaluation and comparison of different approaches that detect chemicals in documents. We present the CHEMDNER corpus, a collection of 10,000 PubMed abstracts that contain a total of 84,355 chemical entity mentions labeled manually by expert chemistry literature curators, following annotation guidelines specifically defined for this task. The abstracts of the CHEMDNER corpus were selected to be representative for all major chemical disciplines. Each of the chemical entity mentions was manually labeled according to its structure-associated chemical entity mention (SACEM) class: abbreviation, family, formula, identifier, multiple, systematic and trivial. The difficulty and consistency of tagging chemicals in text was measured using an agreement study between annotators, obtaining a percentage agreement of 91. For a subset of the CHEMDNER corpus (the test set of 3,000 abstracts) we provide not only the Gold Standard manual annotations, but also mentions automatically detected by the 26 teams that participated in the BioCreative IV CHEMDNER chemical mention recognition task. In addition, we release the CHEMDNER silver standard corpus of automatically extracted mentions from 17,000 randomly selected PubMed abstracts. A version of the CHEMDNER corpus in the BioC format has been generated as well. We propose a standard for required minimum information about entity annotations for the construction of domain specific corpora on chemical and drug entities. The CHEMDNER corpus and annotation guidelines are available at:
PMCID: PMC4331692  PMID: 25810773
named entity recognition; BioCreative; text mining; chemical entity recognition; machine learning; chemical indexing; ChemNLP
25.  tmChem: a high performance approach for chemical named entity recognition and normalization 
Journal of Cheminformatics  2015;7(Suppl 1):S3.
Chemical compounds and drugs are an important class of entities in biomedical research with great potential in a wide range of applications, including clinical medicine. Locating chemical named entities in the literature is a useful step in chemical text mining pipelines for identifying the chemical mentions, their properties, and their relationships as discussed in the literature.
We introduce the tmChem system, a chemical named entity recognizer created by combining two independent machine learning models in an ensemble. We use the corpus released as part of the recent CHEMDNER task to develop and evaluate tmChem, achieving a micro-averaged f-measure of 0.8739 on the CEM subtask (mention-level evaluation) and 0.8745 f-measure on the CDI subtask (abstract-level evaluation). We also report a high-recall combination (0.9212 for CEM and 0.9224 for CDI). tmChem achieved the highest f-measure reported in the CHEMDNER task for the CEM subtask, and the high recall variant achieved the highest recall on both the CEM and CDI tasks.
We report that tmChem is a state-of-the-art tool for chemical named entity recognition and that performance for chemical named entity recognition has now tied (or exceeded) the performance previously reported for genes and diseases. Future research should focus on tighter integration between the named entity recognition and normalization steps for improved performance.
The source code and a trained model for both models of tmChem is available at: The results of running tmChem (Model 2) on PubMed are available in PubTator:
PMCID: PMC4331693  PMID: 25810774

Results 1-25 (544)