1.  CheS-Mapper 2.0 for visual validation of (Q)SAR models 
Background
Sound statistical validation is important to evaluate and compare the overall performance of (Q)SAR models. However, classical validation does not support the user in better understanding the properties of the model or the underlying data. Even though a number of visualization tools for analyzing (Q)SAR information in small molecule datasets exist, integrated visualization methods that allow the investigation of model validation results are still lacking.
Results
We propose visual validation as an approach for the graphical inspection of (Q)SAR model validation results. The approach applies the 3D viewer CheS-Mapper, an open-source application for the exploration of small molecules in virtual 3D space. The present work describes the new functionalities in CheS-Mapper 2.0 that facilitate the analysis of (Q)SAR information and allow the visual validation of (Q)SAR models. The tool enables the comparison of model predictions to the actual activity in feature space. The approach is generic: it is model-independent and can handle physico-chemical and structural input features as well as quantitative and qualitative endpoints.
Conclusions
Visual validation with CheS-Mapper enables analyzing (Q)SAR information in the data and indicates how this information is employed by the (Q)SAR model. It reveals whether the endpoint is modeled too specifically or too generically and highlights common properties of misclassified compounds. Moreover, the researcher can use CheS-Mapper to inspect how the (Q)SAR model predicts activity cliffs. The CheS-Mapper software is freely available at http://ches-mapper.org.
Graphical abstract
Comparing actual and predicted activity values with CheS-Mapper.
Electronic supplementary material
The online version of this article (doi:10.1186/s13321-014-0041-7) contains supplementary material, which is available to authorized users.
doi:10.1186/s13321-014-0041-7
PMCID: PMC4186979
Visualization; Validation; (Q)SAR; 3D space
2.  InCHlib – interactive cluster heatmap for web applications
Background
Hierarchical clustering is an exploratory data analysis method that reveals the groups (clusters) of similar objects. The result of the hierarchical clustering is a tree structure called a dendrogram that shows the arrangement of individual clusters. To investigate the row/column hierarchical cluster structure of a data matrix, a visualization tool called ‘cluster heatmap’ is commonly employed. In the cluster heatmap, the data matrix is displayed as a heatmap, a 2-dimensional array in which the colour of each element corresponds to its value. The rows/columns of the matrix are ordered such that similar rows/columns are near each other. The ordering is given by the dendrogram, which is displayed on the side of the heatmap.
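As an illustration of the row and column ordering that underlies a cluster heatmap, the short sketch below uses SciPy rather than InCHlib itself; the data matrix is a random placeholder.

    # Row/column ordering of a cluster heatmap with SciPy (illustrative, not InCHlib).
    import numpy as np
    from scipy.cluster.hierarchy import linkage, leaves_list
    from scipy.spatial.distance import pdist

    data = np.random.rand(20, 8)                   # rows = objects, columns = features
    row_order = leaves_list(linkage(pdist(data), method="average"))
    col_order = leaves_list(linkage(pdist(data.T), method="average"))

    heatmap = data[np.ix_(row_order, col_order)]   # matrix reordered by the dendrograms
    print(row_order, col_order)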
Results
We developed InCHlib (Interactive Cluster Heatmap Library), a highly interactive and lightweight JavaScript library for cluster heatmap visualization and exploration. InCHlib enables the user to select individual or clustered heatmap rows, to zoom in and out of clusters or to flexibly modify heatmap appearance. The cluster heatmap can be augmented with additional metadata displayed in a different colour scale. In addition, to further enhance the visualization, the cluster heatmap can be interconnected with external data sources or analysis tools. Data clustering and the preparation of the input file for InCHlib are facilitated by the Python utility script inchlib_clust.
Conclusions
The cluster heatmap is one of the most popular visualizations of large chemical and biomedical data sets originating, e.g., in high-throughput screening, genomics or transcriptomics experiments. The presented JavaScript library InCHlib is a client-side solution for cluster heatmap exploration. InCHlib can be easily deployed into any modern web application and configured to cooperate with external tools and data sources. Though InCHlib is primarily intended for the analysis of chemical or biological data, it is a versatile tool whose application domain is not limited to the life sciences.
Electronic supplementary material
The online version of this article (doi:10.1186/s13321-014-0044-4) contains supplementary material, which is available to authorized users.
doi:10.1186/s13321-014-0044-4
PMCID: PMC4173117  PMID: 25264459
Data clustering; Cluster heatmap; Scientific visualization; Web integration; Client-side scripting; JavaScript library; Big data; Exploration
3.  Quantitative estimation of pesticide-likeness for agrochemical discovery
Background
The design of chemical libraries, an early step in agrochemical discovery programs, is frequently addressed by means of qualitative physicochemical and/or topological rule-based methods. The aim of this study is to develop quantitative estimates of herbicide- (QEH), insecticide- (QEI), fungicide- (QEF), and, finally, pesticide-likeness (QEP).
In the assessment of these definitions, we relied on the concept of desirability functions.
Results
We found a simple function, shared by the three classes of pesticides, parameterized for six easy-to-compute, independent and interpretable molecular properties: molecular weight, logP, number of hydrogen bond acceptors, number of hydrogen bond donors, number of rotatable bonds and number of aromatic rings. Subsequently, we describe the scoring of each pesticide class by the corresponding quantitative estimate. In a comparative study, we assessed the performance of the scoring functions using extensive datasets of patented pesticides.
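A minimal sketch of such a quantitative estimate, computed as the geometric mean of desirability functions over the six listed properties; the Gaussian desirability parameters below are hypothetical placeholders rather than the values fitted in the study, and the properties are calculated with RDKit.

    # Hedged sketch of a desirability-based pesticide-likeness score.
    # PARAMS holds (target value, width) per property and is purely illustrative.
    import math
    from rdkit import Chem
    from rdkit.Chem import Crippen, Descriptors, Lipinski, rdMolDescriptors

    PARAMS = {"MW": (300.0, 120.0), "logP": (3.0, 1.5), "HBA": (3.0, 2.0),
              "HBD": (1.0, 1.0), "RotB": (4.0, 3.0), "AromRings": (1.0, 1.0)}

    def properties(mol):
        return {"MW": Descriptors.MolWt(mol),
                "logP": Crippen.MolLogP(mol),
                "HBA": Lipinski.NumHAcceptors(mol),
                "HBD": Lipinski.NumHDonors(mol),
                "RotB": Lipinski.NumRotatableBonds(mol),
                "AromRings": rdMolDescriptors.CalcNumAromaticRings(mol)}

    def desirability(value, center, width):
        return math.exp(-((value - center) / width) ** 2)   # Gaussian desirability in (0, 1]

    def quantitative_estimate(smiles):
        props = properties(Chem.MolFromSmiles(smiles))
        d = [desirability(props[k], *PARAMS[k]) for k in PARAMS]
        return math.exp(sum(math.log(x) for x in d) / len(d))  # geometric mean

    print(quantitative_estimate("CCOP(=S)(OCC)Oc1ccc(cc1)[N+](=O)[O-]"))  # parathion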
Conclusions
The hereby-established quantitative assessment can rank compounds whether or not they fail well-established pesticide-likeness rules, and offers an efficient way to prioritize (class-specific) pesticides. These findings are valuable for the efficient estimation of pesticide-likeness of vast chemical libraries in the field of agrochemical discovery.
Graphical abstract
Quantitative models for pesticide-likeness were derived using the concept of desirability functions parameterized for six easy-to-compute, independent and interpretable molecular properties: molecular weight, logP, number of hydrogen bond acceptors, number of hydrogen bond donors, number of rotatable bonds and number of aromatic rings.
Electronic supplementary material
The online version of this article (doi:10.1186/s13321-014-0042-6) contains supplementary material, which is available to authorized users.
doi:10.1186/s13321-014-0042-6
PMCID: PMC4173135  PMID: 25264458
Herbicide; Insecticide; Fungicide; Pesticide; Agrochemicals; SAR databases
4.  UniChem: extension of InChI-based compound mapping to salt, connectivity and stereochemistry layers
UniChem is a low-maintenance, fast and freely available compound identifier mapping service, recently made available on the Internet. Until now, the criterion of molecular equivalence within UniChem has been complete identity between Standard InChIs. However, a limitation of this approach is that stereoisomers, isotopes and salts of otherwise identical molecules are not considered as related. Here, we describe how we have exploited the layered structural representation of the Standard InChI to create new functionality within UniChem that integrates these related molecular forms. The service, called ‘Connectivity Search’, allows molecules to be first matched on the basis of complete identity between the connectivity layer of their corresponding Standard InChIs, and the remaining layers are then compared to highlight stereochemical and isotopic differences. Parsing of Standard InChI sub-layers permits mixtures and salts to also be included in this integration process. Implementation of these enhancements required simple modifications to the schema, loader and web application, none of which changed the original UniChem functionality or services. The scope of queries may be varied using a variety of easily configurable options, and the output is annotated to assist the user to filter, sort and understand the difference between query and retrieved structures. A RESTful web service output may be easily processed programmatically to allow developers to present the data in whatever form they believe their users will require, or to define their own level of molecular equivalence for their resource, albeit within the constraint of identical connectivity.
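A small sketch of the layer-matching idea, assuming plain string handling of Standard InChIs (the UniChem schema and REST service themselves are not reproduced here): two molecules are treated as connectivity matches if their formula and /c layers agree, and the remaining layers point to stereochemical or isotopic differences.

    # Compare two Standard InChIs on the connectivity layer only (illustrative).
    def inchi_layers(inchi):
        parts = inchi.split("=", 1)[1].split("/")   # drop "InChI=", split into layers
        layers = {"formula": parts[1]}
        for p in parts[2:]:
            layers[p[0]] = p[1:]                    # key = layer prefix (c, h, t, m, s, i, ...)
        return layers

    def connectivity_match(inchi_a, inchi_b):
        a, b = inchi_layers(inchi_a), inchi_layers(inchi_b)
        same_skeleton = a["formula"] == b["formula"] and a.get("c") == b.get("c")
        differing = sorted(k for k in set(a) | set(b)
                           if k not in ("formula", "c") and a.get(k) != b.get(k))
        return same_skeleton, differing             # differing layers, e.g. ['m'] for enantiomers

    # Two alanine enantiomers: identical connectivity, different stereo layers.
    ala_1 = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
    ala_2 = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"
    print(connectivity_match(ala_1, ala_2))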
doi:10.1186/s13321-014-0043-5
PMCID: PMC4158273  PMID: 25221628
UniChem; Standard InChI; InChIKey; Chemical databases; Data integration; Connectivity search
5.  A document classifier for medicinal chemistry publications trained on the ChEMBL corpus
Background
The large increase in the number of scientific publications has fuelled a need for semi- and fully automated text mining approaches in order to assist in the triage process, both for individual scientists and also for larger-scale data extraction and curation into public databases. Here, we introduce a document classifier, which is able to successfully distinguish between publications that are ‘ChEMBL-like’ (i.e. related to small molecule drug discovery and likely to contain quantitative bioactivity data) and those that are not. The unprecedented size of the medicinal chemistry literature collection, coupled with the advantage of manual curation and mapping to chemistry and biology, makes the ChEMBL corpus a unique resource for text mining.
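The published models are distributed as Pipeline Pilot and KNIME workflows; purely as an illustration of the underlying idea, a bag-of-words document classifier can be sketched with scikit-learn, with placeholder training texts and labels standing in for the ChEMBL corpus.

    # Toy "ChEMBL-like vs. not" abstract classifier (illustrative data only).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    train_texts = [
        "IC50 values of novel kinase inhibitors measured in a binding assay",
        "Synthesis and SAR of potent antagonists with nanomolar Ki",
        "A survey of medieval manuscripts in northern Europe",
        "Weather-driven variation in bird migration routes",
    ]
    train_labels = [1, 1, 0, 0]   # 1 = ChEMBL-like, 0 = not

    clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                        LogisticRegression(max_iter=1000))
    clf.fit(train_texts, train_labels)
    print(clf.predict_proba(["Binding affinities of new GPCR ligands"])[0, 1])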
Results
The method has been implemented as a data protocol/workflow for both Pipeline Pilot (version 8.5) and KNIME (version 2.9). Both workflows and models are freely available at: ftp://ftp.ebi.ac.uk/pub/databases/chembl/text-mining. These can be readily modified to include additional keyword constraints to further focus searches.
Conclusions
Large-scale machine learning document classification was shown to be very robust and flexible for this particular application, as illustrated in four distinct text-mining-based use cases. The models are readily available on two data workflow platforms, which we believe will allow the majority of the scientific community to apply them to their own data.
Graphical abstract
Multidimensional scaling analysis applied to document vectors derived from titles and abstracts in different corpora. Notably, there is large overlap between the documents in the different ChEMBL versions and BindingDB, while the background MEDLINE set is largely divergent.
Electronic supplementary material
The online version of this article (doi:10.1186/s13321-014-0040-8) contains supplementary material, which is available to authorized users.
doi:10.1186/s13321-014-0040-8
PMCID: PMC4158272  PMID: 25221627
Machine learning; Triage; Curation; Document classification
6.  New target prediction and visualization tools incorporating open source molecular fingerprints for TB Mobile 2.0
Background
We recently developed a freely available mobile app (TB Mobile) for both iOS and Android platforms that displays Mycobacterium tuberculosis (Mtb) active molecule structures and their targets with links to associated data. The app was developed to make target information available to as large an audience as possible.
Results
We now report a major update of the iOS version of the app. This includes enhancements that use an implementation of ECFP_6 fingerprints that we have made open source. Using these fingerprints, the user can propose compounds with possible anti-TB activity, and view the compounds within a cluster landscape. Proposed compounds can also be compared to existing target data, using a naïve Bayesian scoring system to rank probable targets. We have curated an additional 60 new compounds and their targets for Mtb and added these to the original set of 745 compounds. We have also curated 20 further compounds (many without targets in TB Mobile) to evaluate this version of the app with 805 compounds and associated targets.
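ECFP_6-style fingerprints correspond to Morgan fingerprints of radius 3 in RDKit; as a rough sketch of the clustering step (not the app's own open-source implementation), compounds can be grouped by Tanimoto distance with the Butina algorithm, using placeholder SMILES.

    # Fingerprint-based clustering sketch with RDKit (illustrative compounds).
    from rdkit import Chem, DataStructs
    from rdkit.Chem import AllChem
    from rdkit.ML.Cluster import Butina

    smiles = ["CCO", "CCCO", "c1ccccc1", "c1ccccc1O", "CCOC"]
    fps = [AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), 3, nBits=1024)
           for s in smiles]

    # Flat lower-triangle distance matrix (1 - Tanimoto) expected by Butina.
    dists = []
    for i in range(1, len(fps)):
        sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
        dists.extend(1.0 - s for s in sims)

    clusters = Butina.ClusterData(dists, len(fps), 0.6, isDistData=True)
    print(clusters)   # tuples of compound indices; the first index is the cluster centroid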
Conclusions
TB Mobile can now manage a small collection of compounds that can be imported from external sources, or exported by various means such as email or app-to-app inter-process communication. This means that TB Mobile can be used as a node within a growing ecosystem of mobile apps for cheminformatics. It can also cluster compounds and use internal algorithms to help identify potential targets based on molecular similarity. TB Mobile represents a valuable dataset, data-visualization aid and target prediction tool.
doi:10.1186/s13321-014-0038-2
PMCID: PMC4190048  PMID: 25302078
Mobile app; Mycobacterium tuberculosis; TB mobile; Tuberculosis; Target prediction
7.  Bringing the MMFF force field to the RDKit: implementation and validation
A general purpose force field such as MMFF94/MMFF94s, which can properly deal with a wide range of diverse structures, is very valuable in the context of a cheminformatics toolkit. Herein we present an open-source implementation of this force field within the RDKit. The new MMFF functionality can be accessed through a C++/C#/Python/Java application programming interface (API) developed along the lines of the one already available for UFF in the RDKit. Our implementation was fully validated against the official validation suite provided by the MMFF authors. All energies and gradients were correctly computed; moreover, atom types and force constants were correctly assigned for 3D molecules built from SMILES strings. To provide full flexibility, the available API provides direct access to include/exclude individual terms from the MMFF energy expression and to carry out constrained geometry optimizations. The availability of an MMFF-capable molecular mechanics engine coupled with the rest of the RDKit functionality and covered by the BSD license is appealing to researchers operating in both academia and industry.
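A minimal sketch of how this functionality is typically reached from the RDKit Python API (aspirin serves as an arbitrary example molecule):

    # MMFF94/MMFF94s through the RDKit Python API.
    from rdkit import Chem
    from rdkit.Chem import AllChem

    mol = Chem.AddHs(Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O"))   # aspirin
    AllChem.EmbedMolecule(mol, randomSeed=42)                       # initial 3D coordinates

    # One-call geometry optimization with MMFF94 (returns 0 on convergence).
    print(AllChem.MMFFOptimizeMolecule(mol, mmffVariant="MMFF94"))

    # Lower-level access: a force-field object for energies or constrained runs.
    props = AllChem.MMFFGetMoleculeProperties(mol, mmffVariant="MMFF94s")
    ff = AllChem.MMFFGetMoleculeForceField(mol, props)
    print(ff.CalcEnergy())                                          # energy in kcal/mol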
doi:10.1186/s13321-014-0037-3
PMCID: PMC4116604
Molecular mechanics; Force field; MMFF; RDKit
8.  Proteochemometric modeling in a Bayesian framework
Proteochemometrics (PCM) is an approach for bioactivity predictive modeling which models the relationship between protein and chemical information. Gaussian Processes (GP), based on Bayesian inference, provide the most objective estimation of the uncertainty of the predictions, thus permitting the evaluation of the applicability domain (AD) of the model. Furthermore, the experimental error on bioactivity measurements can be used as input for this probabilistic model.
In this study, we apply GP implemented with a panel of kernels to three diverse (and multispecies) PCM datasets. The first dataset consisted of information from 8 human and rat adenosine receptors with 10,999 small molecule ligands and their binding affinity. The second consisted of the catalytic activity of four dengue virus NS3 proteases on 56 small peptides. Finally, we have gathered bioactivity information of small molecule ligands on 91 aminergic GPCRs from 9 different species, leading to a dataset of 24,593 datapoints with a matrix completeness of only 2.43%.
GP models trained on these datasets are statistically sound, at the same level of statistical significance as Support Vector Machines (SVM), with R₀² values on the external dataset ranging from 0.68 to 0.92, and RMSEP values close to the experimental error. Furthermore, the best GP models obtained with the normalized polynomial and radial kernels provide confidence intervals for the predictions in agreement with the cumulative Gaussian distribution. GP models were also interpreted on the basis of individual targets and of ligand descriptors. In the dengue dataset, the model interpretation in terms of the amino-acid positions in the tetra-peptide ligands gave biologically meaningful results.
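As a sketch of the central idea, predictive uncertainty from a Gaussian process used to judge the applicability domain, a scikit-learn regressor can stand in for the kernels used in the study; the descriptors and activities below are random placeholders.

    # Gaussian-process regression with predictive uncertainty (illustrative data).
    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, WhiteKernel

    X = np.random.rand(100, 16)                                 # e.g. ligand + protein descriptors
    y = X @ np.random.rand(16) + 0.1 * np.random.randn(100)     # synthetic "bioactivity"

    kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.01)  # noise ~ experimental error
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)

    mean, std = gp.predict(np.random.rand(5, 16), return_std=True)
    print(mean, 1.96 * std)   # wide intervals flag compounds outside the applicability domain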
doi:10.1186/1758-2946-6-35
PMCID: PMC4083135  PMID: 25045403
Proteochemometrics; Bayesian inference; Gaussian process; Chemogenomics; GPCRs; Adenosine receptors; Applicability domain
9.  MORT: a powerful foundational library for computational biology and CADD
Background
A foundational library called MORT (Molecular Objects and Relevant Templates) for the development of new software packages and tools employed in computational biology and computer-aided drug design (CADD) is described here.
Results
MORT offers several advantages compared with other libraries. Firstly, MORT is written in C++ and natively supports the paradigm of object-oriented design, so it can be understood and extended easily. Secondly, MORT employs the relational model to represent a molecule, which is more convenient and flexible than the traditional hierarchical model employed by many other libraries. Thirdly, many functions are included in this library, and a molecule can be manipulated easily at different levels. For example, it can parse a variety of popular molecular formats (MOL/SDF, MOL2, PDB/ENT, SMILES/SMARTS, etc.), create the topology and coordinate files for the simulations supported by AMBER, calculate the energy of a specific molecule based on the AMBER force fields, etc.
Conclusions
We believe that MORT can be used as a foundational library for programmers to develop new programs and applications for computational biology and CADD. Source code of MORT is available at http://cadd.suda.edu.cn/MORT/index.htm.
doi:10.1186/1758-2946-6-36
PMCID: PMC4085231
Relational model; MORT; AMBER; Antechamber; Foundational library; CADD
10.  Using beta binomials to estimate classification uncertainty for ensemble models
Background
Quantitative structure-activity relationship (QSAR) models have enormous potential for reducing drug discovery and development costs as well as the need for animal testing. Great strides have been made in estimating their overall reliability, but to fully realize that potential, researchers and regulators need to know how confident they can be in individual predictions.
Results
Submodels in an ensemble model which have been trained on different subsets of a shared training pool represent multiple samples of the model space, and the degree of agreement among them contains information on the reliability of ensemble predictions. For artificial neural network ensembles (ANNEs) using two different methods for determining ensemble classification – one using vote tallies and the other averaging individual network outputs – we have found that the distribution of predictions across positive vote tallies can be reasonably well-modeled as a beta binomial distribution, as can the distribution of errors. Together, these two distributions can be used to estimate the probability that a given predictive classification will be in error. Large data sets comprising logP, Ames mutagenicity, and CYP2D6 inhibition data are used to illustrate and validate the method. The distributions of predictions and errors for the training pool accurately predicted the distribution of predictions and errors for large external validation sets, even when the numbers of positive and negative examples in the training pool were not balanced. Moreover, the likelihood of a given compound being prospectively misclassified as a function of the degree of consensus between networks in the ensemble could in most cases be estimated accurately from the fitted beta binomial distributions for the training pool.
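A sketch of the resulting error estimate, assuming two beta-binomial distributions have already been fitted (the counts and parameters below are placeholders, not values from the study): the probability of error at a given positive vote tally is approximated by the ratio of expected errors to expected predictions at that tally.

    # Error probability per vote tally from two beta-binomial fits (illustrative).
    from scipy.stats import betabinom

    n_nets = 10                       # number of networks voting in the ensemble
    n_total, n_errors = 5000, 400     # training-pool prediction and error counts (hypothetical)

    all_dist = betabinom(n_nets, 0.6, 0.7)   # tallies of all predictions: strong consensus is common
    err_dist = betabinom(n_nets, 3.0, 3.0)   # tallies of the errors: concentrated where votes split

    def p_error_given_votes(k):
        # P(error | k positive votes) ~ expected errors at k / expected predictions at k
        return min(1.0, n_errors * err_dist.pmf(k) / (n_total * all_dist.pmf(k)))

    for k in range(n_nets + 1):
        print(k, round(p_error_given_votes(k), 3))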
Conclusions
Confidence in an individual predictive classification by an ensemble model can be accurately assessed by examining the distributions of predictions and errors as a function of the degree of agreement among the constituent submodels. Further, ensemble uncertainty estimation can often be improved by adjusting the voting or classification threshold based on the parameters of the error distribution. Finally, the profiles for models whose predictive uncertainty estimates are not reliable provide clues to that effect without the need for comparison to an external test set.
doi:10.1186/1758-2946-6-34
PMCID: PMC4076254  PMID: 24987464
Artificial neural network ensemble; ANNE; Classification; Confidence; Error estimation; Predictive value; QSAR; Uncertainty
11.  In Silico target fishing: addressing a “Big Data” problem by ligand-based similarity rankings with data fusion
Background
Ligand-based in silico target fishing can be used to identify the potential interacting target of bioactive ligands, which is useful for understanding the polypharmacology and safety profile of existing drugs. The underlying principle of the approach is that known bioactive ligands can be used as reference to predict the targets for a new compound.
Results
We tested a pipeline enabling large-scale target fishing and drug repositioning, based on simple fingerprint similarity rankings with data fusion. A large library containing 533 drug relevant targets with 179,807 active ligands was compiled, where each target was defined by its ligand set. For a given query molecule, its target profile is generated by similarity searching against the ligand sets assigned to each target; the individual searches utilizing multiple reference structures are then fused into a single ranking list representing the potential target interaction profile of the query compound. The proposed approach was validated by 10-fold cross validation and two external tests using data from DrugBank and the Therapeutic Target Database (TTD). The use of the approach was further demonstrated with examples concerning drug repositioning and drug side-effect prediction. The promising results suggest that the proposed method is useful not only for finding new uses for promiscuous drugs, but also for predicting some important toxic liabilities.
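A toy sketch of the similarity search with MAX group fusion, using RDKit Morgan fingerprints and two placeholder ligand sets in place of the 533-target library:

    # Ligand-based target fishing with MAX-similarity fusion (illustrative data).
    from rdkit import Chem, DataStructs
    from rdkit.Chem import AllChem

    def fp(smiles):
        return AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smiles), 2, nBits=2048)

    target_ligands = {   # target name -> reference ligand SMILES (toy examples)
        "example target A": ["CC(=O)Oc1ccccc1C(=O)O", "CC(C)Cc1ccc(cc1)C(C)C(=O)O"],
        "example target B": ["CCN(CC)CC", "CCCCCCCC"],
    }

    def target_profile(query_smiles):
        q = fp(query_smiles)
        scores = {}
        for target, ligands in target_ligands.items():
            sims = DataStructs.BulkTanimotoSimilarity(q, [fp(s) for s in ligands])
            scores[target] = max(sims)      # MAX fusion over the reference structures
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

    print(target_profile("CC(=O)Nc1ccc(O)cc1"))   # paracetamol as a toy query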
Conclusions
With the rapidly increasing volume and diversity of data concerning drug related targets and their ligands, the simple ligand-based target fishing approach would play an important role in assisting future drug design and discovery.
doi:10.1186/1758-2946-6-33
PMCID: PMC4068908  PMID: 24976868
Target fishing; Big data; Molecular fingerprints; Data fusion; Similarity searching
12.  The influence of negative training set size on machine learning-based virtual screening
Background
The paper presents a thorough analysis of the influence of the number of negative training examples on the performance of machine learning methods.
Results
The impact of this rather neglected aspect of machine learning methods application was examined for sets containing a fixed number of positive and a varying number of negative examples randomly selected from the ZINC database. An increase in the ratio of positive to negative training instances was found to greatly influence most of the investigated evaluating parameters of ML methods in simulated virtual screening experiments. In a majority of cases, substantial increases in precision and MCC were observed in conjunction with some decreases in hit recall. The analysis of the dynamics of those variations led us to recommend an optimal composition of training data. The study was performed on several protein targets, 5 machine learning algorithms (SMO, Naïve Bayes, IBk, J48 and Random Forest) and 2 types of molecular fingerprints (MACCS and CDK FP). The most effective classification was provided by the combination of CDK FP with SMO or Random Forest algorithms. The Naïve Bayes models appeared to be hardly sensitive to changes in the number of negative instances in the training set.
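A sketch of the experimental setup, with synthetic feature vectors standing in for fingerprints of actives and ZINC decoys: the positives stay fixed, the number of negatives is varied, and precision, recall and MCC are tracked.

    # Effect of negative training set size on a classifier (synthetic placeholder data).
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import matthews_corrcoef, precision_score, recall_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    pos = rng.normal(1.0, 1.0, size=(200, 64))          # "active" compounds
    neg_pool = rng.normal(0.0, 1.0, size=(20000, 64))   # decoy pool

    for n_neg in (200, 1000, 5000):
        X = np.vstack([pos, neg_pool[:n_neg]])
        y = np.array([1] * len(pos) + [0] * n_neg)
        Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)
        pred = RandomForestClassifier(n_estimators=200, random_state=0).fit(Xtr, ytr).predict(Xte)
        print(n_neg, round(precision_score(yte, pred), 2),
              round(recall_score(yte, pred), 2), round(matthews_corrcoef(yte, pred), 2))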
Conclusions
In conclusion, the ratio of positive to negative training instances should be taken into account during the preparation of machine learning experiments, as it might significantly influence the performance of a particular classifier. What is more, the optimization of negative training set size can be applied as a boosting-like approach in machine learning-based virtual screening.
doi:10.1186/1758-2946-6-32
PMCID: PMC4061540  PMID: 24976867
13.  Efficient enumeration of monocyclic chemical graphs with given path frequencies
Background
The enumeration of chemical graphs (molecular graphs) satisfying given constraints is one of the fundamental problems in chemoinformatics and bioinformatics because it leads to a variety of useful applications including structure determination and development of novel chemical compounds.
Results
We consider the problem of enumerating chemical graphs with monocyclic structure (a graph structure that contains exactly one cycle) from a given set of feature vectors, where a feature vector represents the frequency of the prescribed paths in a chemical compound to be constructed and the set is specified by a pair of upper and lower feature vectors. To enumerate all tree-like (acyclic) chemical graphs from a given set of feature vectors, Shimizu et al. and Suzuki et al. proposed efficient branch-and-bound algorithms based on a fast tree enumeration algorithm. In this study, we devise a novel method for extending these algorithms to the enumeration of chemical graphs with monocyclic structure by designing a fast algorithm for testing uniqueness. The results of computational experiments reveal that the computational efficiency of the new algorithm is as good as that of the algorithms for enumeration of tree-like chemical compounds.
Conclusions
We have succeeded in expanding the class of chemical graphs that can be enumerated efficiently.
doi:10.1186/1758-2946-6-31
PMCID: PMC4049473  PMID: 24955135
Chemical graphs; Enumeration; Monocyclic structure; Feature vector
14.  Estimation of diffusion coefficients from voltammetric signals by support vector and Gaussian process regression
Background
Support vector regression (SVR) and Gaussian process regression (GPR) were used for the analysis of electroanalytical experimental data to estimate diffusion coefficients.
Results
For simulated cyclic voltammograms based on the EC, Eqr, and EqrC mechanisms these regression algorithms in combination with nonlinear kernel/covariance functions yielded diffusion coefficients with higher accuracy as compared to the standard approach of calculating diffusion coefficients relying on the Nicholson-Shain equation. The level of accuracy achieved by SVR and GPR is virtually independent of the rate constants governing the respective reaction steps. Further, the reduction of high-dimensional voltammetric signals by manual selection of typical voltammetric peak features decreased the performance of both regression algorithms compared to a reduction by downsampling or principal component analysis. After training on simulated data sets, diffusion coefficients were estimated by the regression algorithms for experimental data comprising voltammetric signals for three organometallic complexes.
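As an illustration of this regression setup (synthetic curves standing in for simulated voltammograms), a scikit-learn pipeline can combine PCA-based signal reduction with a nonlinear-kernel SVR; a Gaussian process regressor would slot in analogously.

    # PCA + kernel SVR on high-dimensional signals (illustrative placeholder data).
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVR

    rng = np.random.default_rng(1)
    signals = rng.normal(size=(300, 500))            # 300 simulated curves, 500 points each
    diff_coeffs = rng.uniform(1e-6, 1e-5, size=300)  # synthetic target values

    model = make_pipeline(StandardScaler(), PCA(n_components=10), SVR(kernel="rbf", C=10.0))
    model.fit(signals, diff_coeffs)
    print(model.predict(signals[:3]))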
Conclusions
Estimated diffusion coefficients closely matched the values determined by the parameter fitting method, but reduced the required computational time considerably for one of the reaction mechanisms. The automated processing of voltammograms according to the regression algorithms yields better results than the conventional analysis of peak-related data.
doi:10.1186/1758-2946-6-30
PMCID: PMC4074154  PMID: 24987463
Support vector regression; Gaussian process regression; Diffusion coefficient; Principal component analysis; Voltammetry; Reaction mechanism
15.  Cytochrome P450 site of metabolism prediction from 2D topological fingerprints using GPU accelerated probabilistic classifiers
Background
The prediction of sites and products of metabolism in xenobiotic compounds is key to the development of new chemical entities, where screening potential metabolites for toxicity or unwanted side-effects is of crucial importance. In this work 2D topological fingerprints are used to encode atomic sites and three probabilistic machine learning methods are applied: Parzen-Rosenblatt Window (PRW), Naive Bayesian (NB) and a novel approach called RASCAL (Random Attribute Subsampling Classification ALgorithm). These are implemented by randomly subsampling descriptor space to alleviate the problem, often suffered by data mining methods, of having to exactly match fingerprints, and in the case of PRW by measuring a distance between feature vectors rather than requiring exact matching. The classifiers have been implemented in CUDA/C++ to exploit the parallel architecture of graphical processing units (GPUs) and are freely available in a public repository.
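A minimal sketch of the Parzen-Rosenblatt window idea combined with random descriptor subsampling, using synthetic site descriptors; this illustrates the general technique only, not the CUDA/C++ implementation described here.

    # Parzen window classifier over a random subsample of descriptor columns (toy data).
    import numpy as np

    rng = np.random.default_rng(0)
    X_som = rng.normal(1.0, 1.0, size=(50, 32))      # descriptors of known sites of metabolism
    X_other = rng.normal(0.0, 1.0, size=(200, 32))   # descriptors of non-SoM atoms
    cols = rng.choice(32, size=16, replace=False)    # random attribute (descriptor) subsample

    def parzen_density(x, train, h=1.0):
        d2 = ((train[:, cols] - x[cols]) ** 2).sum(axis=1)
        return np.exp(-d2 / (2 * h * h)).mean()      # mean of Gaussian kernels

    def is_som(x):
        return parzen_density(x, X_som) > parzen_density(x, X_other)

    print(is_som(rng.normal(1.0, 1.0, size=32)))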
Results
It is shown that for PRW a SoM (Site of Metabolism) is identified in the top two predictions for 85%, 91% and 88% of the CYP 3A4, 2D6 and 2C9 data sets respectively, with RASCAL giving similar performance of 83%, 91% and 88%, respectively. These results put PRW and RASCAL performance ahead of NB which gave a much lower classification performance of 51%, 73% and 74%, respectively.
Conclusions
2D topological fingerprints calculated to a bond depth of 4-6 contain sufficient information to allow the identification of SoMs using classifiers based on relatively small data sets. Thus, the machine learning methods outlined in this paper are conceptually simpler and more efficient than other methods tested, and the use of simple topological descriptors derived from 2D structure gives results competitive with other approaches using more expensive quantum chemical descriptors. The descriptor space subsampling approach and ensemble methodology allow the methods to be applied to molecules more distant from the training data, where data mining would be more likely to fail due to the lack of common fingerprints. The RASCAL algorithm is shown to give equivalent classification performance to PRW but at lower computational expense, allowing it to be applied more efficiently in the ensemble scheme.
doi:10.1186/1758-2946-6-29
PMCID: PMC4047555  PMID: 24959208
Cytochrome P450; Metabolism; Probabilistic; Classification; GPU; CUDA; 2D
16.  iDrug: a web-accessible and interactive drug discovery and design platform
Background
Progress in computer-aided drug design (CADD) approaches over the past decades has accelerated early-stage pharmaceutical research. Many powerful standalone tools for CADD have been developed in academia. As these programs are developed by various research groups, a consistent, user-friendly online graphical working environment that combines computational techniques such as pharmacophore mapping, similarity calculation, scoring, and target identification is needed.
Results
We present a versatile, user-friendly, and efficient online tool for computer-aided drug design based on pharmacophore and 3D molecular similarity searching. The web interface enables binding site detection, virtual screening hit identification, and drug target prediction in an interactive manner through a seamless interface to all adapted packages (e.g., Cavity, PocketV.2, PharmMapper, SHAFTS). Several commercially available compound databases for hit identification and a well-annotated pharmacophore database for drug target prediction were integrated in iDrug as well. The web interface provides tools for real-time molecular building/editing, converting, displaying, and analyzing. All the customized configurations of the functional modules can be accessed through the session files provided, which can be saved to the local disk and uploaded to resume or update previous work.
Conclusions
iDrug is easy to use, and provides a novel, fast and reliable tool for conducting drug design experiments. By using iDrug, various molecular design processing tasks can be submitted and visualized simply in one browser without locally installing any standalone modeling software. iDrug is accessible free of charge at http://lilab.ecust.edu.cn/idrug.
doi:10.1186/1758-2946-6-28
PMCID: PMC4046018  PMID: 24955134
Online drug design platform; Cavity detection; Pharmacophore search; 3D similarity calculation; Target prediction
17.  Expanding the fragrance chemical space for virtual screening
The properties of fragrance molecules in the public databases SuperScent and Flavornet were analyzed to define a “fragrance-like” (FL) property range (Heavy Atom Count ≤ 21, only C, H, O, S, (O + S) ≤ 3, Hydrogen Bond Donor ≤ 1) and the corresponding chemical space including FL molecules from PubChem (NIH repository of molecules), ChEMBL (bioactive molecules), ZINC (drug-like molecules), and GDB-13 (all possible organic molecules up to 13 atoms of C, N, O, S, Cl). The FL subsets of these databases were classified by MQN (Molecular Quantum Numbers, a set of 42 integer value descriptors of molecular structure) and formatted for fast MQN-similarity searching and interactive exploration of color-coded principal component maps in the form of the FL-mapplet and FL-browser applications, freely available at http://www.gdb.unibe.ch. MQN-similarity is shown to efficiently recover 15 different fragrance molecule families from the different FL subsets, demonstrating the relevance of the MQN-based tool to explore the fragrance chemical space.
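The FL property range quoted above translates directly into a simple filter; a sketch with RDKit:

    # "Fragrance-like" filter: <= 21 heavy atoms, only C/H/O/S, (O + S) <= 3, HBD <= 1.
    from rdkit import Chem
    from rdkit.Chem import Lipinski

    def is_fragrance_like(smiles):
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            return False
        symbols = [a.GetSymbol() for a in mol.GetAtoms()]   # heavy atoms (H are implicit)
        return (mol.GetNumHeavyAtoms() <= 21
                and set(symbols) <= {"C", "O", "S"}
                and sum(s in ("O", "S") for s in symbols) <= 3
                and Lipinski.NumHDonors(mol) <= 1)

    print(is_fragrance_like("CC(=O)OCC(C)C"))      # isobutyl acetate -> True
    print(is_fragrance_like("NCCc1ccc(O)c(O)c1"))  # dopamine (contains N) -> False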
Electronic supplementary material
The online version of this article (doi:10.1186/1758-2946-6-27) contains supplementary material, which is available to authorized users.
doi:10.1186/1758-2946-6-27
PMCID: PMC4037718  PMID: 24876890
18.  Estimation of acute oral toxicity in rat using local lazy learning
Background
Acute toxicity refers to the ability of a substance to cause adverse effects within a short period following dosing or exposure, the assessment of which is usually the first step in the toxicological investigation of unknown substances. The median lethal dose, LD50, is frequently used as a general indicator of a substance’s acute toxicity, and there is a high demand for non-animal-based prediction of LD50. Unfortunately, it is difficult to accurately predict compound LD50 using a single QSAR model, because acute toxicity may involve complex mechanisms and multiple biochemical processes.
Results
In this study, we report the use of local lazy learning (LLL) methods, which can capture subtle local structure-toxicity relationships around each query compound, to develop LD50 prediction models: (a) local lazy regression (LLR): a linear regression model built using k neighbors; (b) SA: the arithmetic mean of the activities of the k nearest neighbors; (c) SR: the weighted mean of the activities of the k nearest neighbors; (d) GP: the projection point of the compound on the line defined by its two nearest neighbors. We defined the applicability domain (AD) to decide to what extent and under what circumstances the prediction is reliable. In the end, we developed a consensus model based on the predicted values of individual LLL models, yielding a correlation coefficient R² of 0.712 on a test set containing 2,896 compounds.
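A sketch of the two simplest local estimates, SA and SR, with neighbours selected by fingerprint Tanimoto similarity; the training pairs and the query below are placeholders, not data from the study.

    # Local lazy estimates: arithmetic (SA) and similarity-weighted (SR) neighbour means.
    from rdkit import Chem, DataStructs
    from rdkit.Chem import AllChem

    train = [("CCO", 2.1), ("CCCO", 2.4), ("CCCCO", 2.8), ("c1ccccc1", 3.9)]  # (SMILES, toy activity)

    def fp(smiles):
        return AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smiles), 2, nBits=1024)

    def local_estimates(query_smiles, k=3):
        q = fp(query_smiles)
        nbrs = sorted(((DataStructs.TanimotoSimilarity(q, fp(s)), y) for s, y in train),
                      reverse=True)[:k]
        sa = sum(y for _, y in nbrs) / k                                     # arithmetic mean
        sr = sum(sim * y for sim, y in nbrs) / sum(sim for sim, _ in nbrs)   # weighted mean
        return sa, sr

    print(local_estimates("CCCCCO"))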
Conclusion
Encouraged by the promising results, we expect that our consensus LLL model of LD50 would become a useful tool for predicting acute toxicity. All models developed in this study are available via http://www.dddc.ac.cn/admetus.
doi:10.1186/1758-2946-6-26
PMCID: PMC4047767  PMID: 24959207
Acute toxicity; Local lazy learning; Applicability domain; Consensus model
19.  QSAR DataBank - an approach for the digital organization and archiving of QSAR model information
Background
Research efforts in the field of descriptive and predictive Quantitative Structure-Activity Relationships or Quantitative Structure–Property Relationships produce around one thousand scientific publications annually. All the materials and results are mainly communicated using printed media. Printed media in their present form have obvious limitations when it comes to effectively representing mathematical models, including complex and non-linear ones, and large bodies of associated numerical chemical data. They do not support secondary information extraction or reuse efforts, while in silico studies pose additional requirements for the accessibility, transparency and reproducibility of the research. This gap can and should be bridged by introducing domain-specific digital data exchange standards and tools. The current publication presents a formal specification of the quantitative structure-activity relationship data organization and archival format called the QSAR DataBank (QsarDB for shorter, or QDB for shortest).
Results
The article describes the QsarDB data schema, which formalizes QSAR concepts (objects and relationships between them), and the QsarDB data format, which formalizes their presentation for computer systems. The utility and benefits of QsarDB have been thoroughly tested by solving everyday QSAR and predictive modeling problems, with examples in the field of predictive toxicology, and the format can be applied to a wide variety of other endpoints. The work is accompanied by an open source reference implementation and tools.
Conclusions
The proposed open data, open source, and open standards design is open to public and proprietary extensions on many levels. Selected use cases exemplify the benefits of the proposed QsarDB data format. General ideas for future development are discussed.
doi:10.1186/1758-2946-6-25
PMCID: PMC4047268  PMID: 24910716
Data format; Data interoperability; Open science; QSAR; QSPR
20.  The BioDICE Taverna plugin for clustering and visualization of biological data: a workflow for molecular compounds exploration
Background
In many experimental pipelines, clustering of multidimensional biological datasets is used to detect hidden structures in unlabelled input data. Taverna is a popular workflow management system that is used to design and execute scientific workflows and aid in silico experimentation. The availability of fast unsupervised methods for clustering and visualization in the Taverna platform is important to support a data-driven scientific discovery in complex and explorative bioinformatics applications.
Results
This work presents a Taverna plugin, the Biological Data Interactive Clustering Explorer (BioDICE), that performs clustering of high-dimensional biological data and provides a nonlinear, topology preserving projection for the visualization of the input data and their similarities. The core algorithm in the BioDICE plugin is Fast Learning Self Organizing Map (FLSOM), which is an improved variant of the Self Organizing Map (SOM) algorithm. The plugin generates an interactive 2D map that allows the visual exploration of multidimensional data and the identification of groups of similar objects. The effectiveness of the plugin is demonstrated on a case study related to chemical compounds.
Conclusions
The number and variety of available tools, together with its extensibility, have made Taverna a popular choice for the development of scientific data workflows. This work presents a novel plugin, BioDICE, which adds a data-driven knowledge discovery component to Taverna. BioDICE provides an effective and powerful clustering tool, which can be adopted for the explorative analysis of biological datasets.
doi:10.1186/1758-2946-6-24
PMCID: PMC4036106
Molecular compounds; Self organizing map; Clustering; Visualization; Taverna
21.  Using cheminformatics to predict cross reactivity of “designer drugs” to their currently available immunoassays
Background
A challenge for drug of abuse testing is presented by ‘designer drugs’, compounds typically derived by modification of existing clinical drug classes such as amphetamines and cannabinoids. Drug of abuse screening immunoassays directed at amphetamine or methamphetamine only detect a small subset of designer amphetamine-like drugs, and those immunoassays designed for tetrahydrocannabinol metabolites generally do not cross-react with synthetic cannabinoids lacking the classic cannabinoid chemical backbone. This complicates detecting and identifying whether a patient has taken a molecule of one class or another, impacting clinical care.
Methods
Cross-reactivity data from immunoassays specifically targeting designer amphetamine-like and synthetic cannabinoid drugs was collected from multiple published sources, and virtual chemical libraries for molecular similarity analysis were built. The virtual library for synthetic cannabinoid analysis contained a total of 169 structures, while the virtual library for amphetamine-type stimulants contained 288 compounds. Two-dimensional (2D) similarity for each test compound was compared to the target molecule of the immunoassay undergoing analysis.
Results
2D similarity differentiated between cross-reactive and non-cross-reactive compounds for immunoassays targeting mephedrone/methcathinone, 3,4-methylenedioxypyrovalerone, benzylpiperazine, mephentermine, and synthetic cannabinoids.
Conclusions
In this study, we applied 2D molecular similarity analysis to the designer amphetamine-type stimulants and synthetic cannabinoids. Similarity calculations can be used to more efficiently decide which drugs and metabolites should be tested in cross-reactivity studies, as well as to design experiments and potentially predict antigens that would lead to immunoassays with cross reactivity for a broader array of designer drugs.
Electronic supplementary material
The online version of this article (doi:10.1186/1758-2946-6-22) contains supplementary material, which is available to authorized users.
doi:10.1186/1758-2946-6-22
PMCID: PMC4029917  PMID: 24851137
Amphetamines; Cannabinoids; Molecular models; Similarity; Toxicology
