1.  Proteochemometric modeling in a Bayesian framework 
Proteochemometrics (PCM) is an approach to predictive bioactivity modeling that models the relationship between protein and chemical information. Gaussian Processes (GP), based on Bayesian inference, provide the most objective estimation of the uncertainty of the predictions, thus permitting evaluation of the applicability domain (AD) of the model. Furthermore, the experimental error on bioactivity measurements can be used as input to this probabilistic model.
In this study, we applied GP with a panel of kernels to three diverse (and multispecies) PCM datasets. The first dataset comprised 8 human and rat adenosine receptors with 10,999 small-molecule ligands and their binding affinities. The second comprised the catalytic activity of four dengue virus NS3 proteases on 56 small peptides. Finally, we gathered bioactivity information for small-molecule ligands on 91 aminergic GPCRs from 9 different species, leading to a dataset of 24,593 datapoints with a matrix completeness of only 2.43%.
GP models trained on these datasets are statistically sound, at the same level of statistical significance as Support Vector Machines (SVM), with R₀² values on the external dataset ranging from 0.68 to 0.92 and RMSEP values close to the experimental error. Furthermore, the best GP models, obtained with the normalized polynomial and radial kernels, provide confidence intervals for the predictions in agreement with the cumulative Gaussian distribution. GP models were also interpreted on the basis of individual targets and of ligand descriptors. In the dengue dataset, interpreting the model in terms of the amino-acid positions in the tetrapeptide ligands gave biologically meaningful results.
doi:10.1186/1758-2946-6-35
PMCID: PMC4083135  PMID: 25045403
Proteochemometrics; Bayesian inference; Gaussian process; Chemogenomics; GPCRs; Adenosine receptors; Applicability domain
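The Gaussian-process idea behind this abstract can be sketched in a few lines. This is not the authors' implementation; it uses scikit-learn with synthetic placeholder descriptors, and the WhiteKernel term plays the role of the experimental error mentioned above:

```python
# Sketch: GP bioactivity regression with predictive uncertainty,
# in the spirit of the paper (not the authors' code). X would hold
# concatenated protein + ligand descriptors; y, binding affinities.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))            # placeholder PCM descriptors
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=200)

# WhiteKernel absorbs the noise (experimental error) on the measurements.
kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)

X_new = rng.normal(size=(5, 16))
mean, std = gp.predict(X_new, return_std=True)
# A large std flags a query outside the applicability domain.
for m, s in zip(mean, std):
    print(f"predicted affinity = {m:.2f} +/- {1.96 * s:.2f} (95% interval)")
```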
2.  MORT: a powerful foundational library for computational biology and CADD 
Background
A foundational library called MORT (Molecular Objects and Relevant Templates) for the development of new software packages and tools employed in computational biology and computer-aided drug design (CADD) is described here.
Results
MORT offers several advantages over comparable libraries. First, MORT is written in C++ and natively supports the object-oriented design paradigm, so it can be understood and extended easily. Second, MORT employs a relational model to represent a molecule, which is more convenient and flexible than the traditional hierarchical model employed by many other libraries. Third, the library includes a large set of functions, so a molecule can be manipulated easily at different levels. For example, it can parse a variety of popular molecular formats (MOL/SDF, MOL2, PDB/ENT, SMILES/SMARTS, etc.), create the topology and coordinate files for simulations supported by AMBER, calculate the energy of a specific molecule based on the AMBER force fields, and so on.
Conclusions
We believe that MORT can be used as a foundational library for programmers to develop new programs and applications for computational biology and CADD. Source code of MORT is available at http://cadd.suda.edu.cn/MORT/index.htm.
doi:10.1186/1758-2946-6-36
PMCID: PMC4085231
Relational model; MORT; AMBER; Antechamber; Foundational library; CADD
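The relational model the abstract contrasts with the hierarchical one can be illustrated in miniature: atoms and bonds live in flat tables linked by IDs rather than being nested inside residues and chains. A hypothetical Python sketch only; MORT itself is C++ and its actual API differs:

```python
# Sketch of the relational idea behind MORT: a molecule as flat tables
# (atoms, bonds) joined by integer IDs. Illustration, not MORT's API.
from dataclasses import dataclass

@dataclass
class Atom:
    atom_id: int
    element: str
    charge: float

@dataclass
class Bond:
    bond_id: int
    atom1: int   # foreign key into the atom table
    atom2: int
    order: int

# Ethanol (heavy atoms only): C-C-O
atoms = [Atom(0, "C", 0.0), Atom(1, "C", 0.0), Atom(2, "O", 0.0)]
bonds = [Bond(0, 0, 1, 1), Bond(1, 1, 2, 1)]

# Relational queries stay simple: e.g. all neighbours of atom 1.
neighbours = [b.atom2 if b.atom1 == 1 else b.atom1
              for b in bonds if 1 in (b.atom1, b.atom2)]
print(neighbours)   # [0, 2]
```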
3.  Using beta binomials to estimate classification uncertainty for ensemble models 
Background
Quantitative structure-activity relationship (QSAR) models have enormous potential for reducing drug discovery and development costs as well as the need for animal testing. Great strides have been made in estimating their overall reliability, but to fully realize that potential, researchers and regulators need to know how confident they can be in individual predictions.
Results
Submodels in an ensemble model that have been trained on different subsets of a shared training pool represent multiple samples of the model space, and the degree of agreement among them contains information on the reliability of ensemble predictions. For artificial neural network ensembles (ANNEs) using two different methods for determining ensemble classification – one using vote tallies and the other averaging individual network outputs – we have found that the distribution of predictions across positive vote tallies can be reasonably well modeled as a beta binomial distribution, as can the distribution of errors. Together, these two distributions can be used to estimate the probability that a given predictive classification will be in error. Large data sets composed of logP, Ames mutagenicity, and CYP2D6 inhibition data are used to illustrate and validate the method. The distributions of predictions and errors for the training pool accurately predicted the distributions of predictions and errors for large external validation sets, even when the numbers of positive and negative examples in the training pool were not balanced. Moreover, the likelihood of a given compound being prospectively misclassified as a function of the degree of consensus between networks in the ensemble could in most cases be estimated accurately from the fitted beta binomial distributions for the training pool.
Conclusions
Confidence in an individual predictive classification by an ensemble model can be accurately assessed by examining the distributions of predictions and errors as a function of the degree of agreement among the constituent submodels. Further, ensemble uncertainty estimation can often be improved by adjusting the voting or classification threshold based on the parameters of the error distribution. Finally, the profiles for models whose predictive uncertainty estimates are not reliable provide clues to that effect without the need for comparison to an external test set.
doi:10.1186/1758-2946-6-34
PMCID: PMC4076254  PMID: 24987464
Artificial neural network ensemble; ANNE; Classification; Confidence; Error estimation; Predictive value; QSAR; Uncertainty
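A minimal sketch of the central calculation, assuming maximum-likelihood fitting with scipy (the paper's exact fitting procedure may differ; the tallies and errors below are synthetic stand-ins):

```python
# Sketch: fit beta-binomials to ensemble vote tallies and to error
# counts, then estimate P(misclassification | tally). Illustrative only.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import betabinom

def fit_betabinom(counts, n):
    """Maximum-likelihood alpha, beta for tallies out of n votes."""
    def nll(log_params):
        a, b = np.exp(log_params)          # keep parameters positive
        return -betabinom.logpmf(counts, n, a, b).sum()
    res = minimize(nll, x0=[0.0, 0.0], method="Nelder-Mead")
    return np.exp(res.x)

n_nets = 20                                        # ensemble size
rng = np.random.default_rng(1)
tallies = rng.binomial(n_nets, 0.7, size=1000)     # placeholder votes
errors = tallies[rng.random(1000) < 0.1]           # placeholder errors

a_all, b_all = fit_betabinom(tallies, n_nets)
a_err, b_err = fit_betabinom(errors, n_nets)

# P(error | k positive votes) ~ error_rate * pmf_err(k) / pmf_all(k)
k = np.arange(n_nets + 1)
p_err = (len(errors) / len(tallies)
         * betabinom.pmf(k, n_nets, a_err, b_err)
         / betabinom.pmf(k, n_nets, a_all, b_all))
print(np.round(np.clip(p_err, 0.0, 1.0), 3))
```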
4.  In Silico target fishing: addressing a “Big Data” problem by ligand-based similarity rankings with data fusion 
Background
Ligand-based in silico target fishing can be used to identify the potential interacting targets of bioactive ligands, which is useful for understanding the polypharmacology and safety profiles of existing drugs. The underlying principle of the approach is that known bioactive ligands can be used as references to predict the targets for a new compound.
Results
We tested a pipeline enabling large-scale target fishing and drug repositioning, based on simple fingerprint similarity rankings with data fusion. A large library containing 533 drug-relevant targets with 179,807 active ligands was compiled, where each target was defined by its ligand set. For a given query molecule, a target profile is generated by similarity searching against the ligand sets assigned to each target; the individual searches utilizing multiple reference structures are then fused into a single ranking list representing the potential target interaction profile of the query compound. The proposed approach was validated by 10-fold cross-validation and two external tests using data from DrugBank and the Therapeutic Target Database (TTD). Its use was further demonstrated with examples concerning drug repositioning and drug side-effect prediction. The promising results suggest that the proposed method is useful not only for finding new uses for promiscuous drugs, but also for predicting some important toxic liabilities.
Conclusions
With the rapidly increasing volume and diversity of data concerning drug-related targets and their ligands, this simple ligand-based target fishing approach can play an important role in assisting future drug design and discovery.
doi:10.1186/1758-2946-6-33
PMCID: PMC4068908  PMID: 24976868
Target fishing; Big data; Molecular fingerprints; Data fusion; Similarity searching
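The underlying principle can be sketched in a few lines of RDKit: score each target by the similarity of the query to that target's known ligands, then fuse. Here MAX group fusion stands in for the paper's rank-based data fusion, and the target names and SMILES are illustrative only:

```python
# Sketch: ligand-based target fishing by fingerprint similarity with
# MAX-rule group fusion (toy data, not the paper's 533-target library).
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def fp(smiles):
    return AllChem.GetMorganFingerprintAsBitVect(
        Chem.MolFromSmiles(smiles), 2, nBits=2048)

# Each target is defined by its set of known active ligands.
targets = {
    "COX":    ["CC(=O)Oc1ccccc1C(=O)O", "CC(C)Cc1ccc(cc1)C(C)C(=O)O"],
    "nAChR":  ["CN1CCC[C@H]1c1cccnc1"],
}
query = fp("CC(C)Cc1ccc(C(C)C(=O)O)cc1")   # ibuprofen as the query

scores = {}
for target, ligands in targets.items():
    sims = [DataStructs.TanimotoSimilarity(query, fp(s)) for s in ligands]
    scores[target] = max(sims)              # MAX group fusion

for target, score in sorted(scores.items(), key=lambda x: -x[1]):
    print(f"{target}: {score:.2f}")         # ranked target profile
```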
5.  The influence of negative training set size on machine learning-based virtual screening 
Background
The paper presents a thorough analysis of the influence of the number of negative training examples on the performance of machine learning methods.
Results
The impact of this rather neglected aspect of applying machine learning methods was examined for sets containing a fixed number of positive and a varying number of negative examples randomly selected from the ZINC database. An increase in the ratio of positive to negative training instances was found to greatly influence most of the investigated evaluation parameters of ML methods in simulated virtual screening experiments. In a majority of cases, substantial increases in precision and MCC were observed in conjunction with some decrease in hit recall. Analysis of the dynamics of these variations allowed us to recommend an optimal composition of training data. The study was performed on several protein targets, 5 machine learning algorithms (SMO, Naïve Bayes, IBk, J48 and Random Forest) and 2 types of molecular fingerprints (MACCS and CDK FP). The most effective classification was provided by the combination of CDK FP with the SMO or Random Forest algorithms. The Naïve Bayes models appeared to be hardly sensitive to changes in the number of negative instances in the training set.
Conclusions
In conclusion, the ratio of positive to negative training instances should be taken into account when preparing machine learning experiments, as it can significantly influence the performance of a particular classifier. Moreover, optimizing the negative training set size can be applied as a boosting-like approach in machine learning-based virtual screening.
doi:10.1186/1758-2946-6-32
PMCID: PMC4061540  PMID: 24976867
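The spirit of the experiment is easy to reproduce; a sketch with synthetic descriptors in place of ZINC decoys, using a Random Forest as one of the classifiers studied:

```python
# Sketch: probe how the number of random negatives affects precision
# and MCC for a fixed positive set (illustrative data, not ZINC).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import matthews_corrcoef, precision_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
pos = rng.normal(loc=1.0, size=(300, 64))       # fixed set of actives
all_neg = rng.normal(loc=0.0, size=(6000, 64))  # pool of decoys

for ratio in (1, 2, 5, 10):
    neg = all_neg[: 300 * ratio]                # grow the negative set
    X = np.vstack([pos, neg])
    y = np.array([1] * len(pos) + [0] * len(neg))
    Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=0)
    pred = RandomForestClassifier(random_state=0).fit(Xtr, ytr).predict(Xte)
    print(f"1:{ratio}  precision={precision_score(yte, pred):.2f}"
          f"  MCC={matthews_corrcoef(yte, pred):.2f}")
```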
6.  Efficient enumeration of monocyclic chemical graphs with given path frequencies 
Background
The enumeration of chemical graphs (molecular graphs) satisfying given constraints is one of the fundamental problems in chemoinformatics and bioinformatics because it leads to a variety of useful applications including structure determination and development of novel chemical compounds.
Results
We consider the problem of enumerating chemical graphs with a monocyclic structure (a graph structure that contains exactly one cycle) from a given set of feature vectors, where a feature vector represents the frequencies of prescribed paths in a chemical compound to be constructed and the set is specified by a pair of upper and lower feature vectors. To enumerate all tree-like (acyclic) chemical graphs from a given set of feature vectors, Shimizu et al. and Suzuki et al. proposed efficient branch-and-bound algorithms based on a fast tree enumeration algorithm. In this study, we devise a novel method for extending these algorithms to the enumeration of chemical graphs with a monocyclic structure by designing a fast algorithm for testing uniqueness. The results of computational experiments reveal that the computational efficiency of the new algorithm is as good as that of the algorithms for enumerating tree-like chemical compounds.
Conclusions
We succeed in expanding the class of chemical graphs that can be enumerated efficiently.
doi:10.1186/1758-2946-6-31
PMCID: PMC4049473  PMID: 24955135
Chemical graphs; Enumeration; Monocyclic structure; Feature vector
7.  Estimation of diffusion coefficients from voltammetric signals by support vector and gaussian process regression 
Background
Support vector regression (SVR) and Gaussian process regression (GPR) were used for the analysis of electroanalytical experimental data to estimate diffusion coefficients.
Results
For simulated cyclic voltammograms based on the EC, Eqr, and EqrC mechanisms, these regression algorithms in combination with nonlinear kernel/covariance functions yielded diffusion coefficients with higher accuracy than the standard approach of calculating diffusion coefficients from the Nicholson-Shain equation. The level of accuracy achieved by SVR and GPR is virtually independent of the rate constants governing the respective reaction steps. Further, reducing the high-dimensional voltammetric signals by manual selection of typical voltammetric peak features decreased the performance of both regression algorithms compared to reduction by downsampling or principal component analysis. After training on simulated data sets, the regression algorithms were used to estimate diffusion coefficients for experimental data comprising voltammetric signals for three organometallic complexes.
Conclusions
Estimated diffusion coefficients closely matched the values determined by the parameter fitting method, but the required computational time was reduced considerably for one of the reaction mechanisms. Automated processing of voltammograms with the regression algorithms yields better results than conventional analysis of peak-related data.
doi:10.1186/1758-2946-6-30
PMCID: PMC4074154  PMID: 24987463
Support vector regression; Gaussian process regression; Diffusion coefficient; Principal component analysis; Voltammetry; Reaction mechanism
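A sketch of the regression setup with scikit-learn, using a toy signal generator in place of the paper's electrochemical simulator; downsampling stands in for the dimensionality reduction step:

```python
# Sketch: regress diffusion coefficients from downsampled signals with
# SVR (synthetic stand-in traces, not simulated voltammograms).
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(3)
D = rng.uniform(1e-6, 1e-5, size=400)           # "true" coefficients
t = np.linspace(0, 1, 1000)
signals = np.sqrt(D[:, None]) * np.sin(8 * t) + rng.normal(
    scale=1e-4, size=(400, 1000))               # toy current traces

X = signals[:, ::20]                            # downsample 1000 -> 50
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0))
model.fit(X[:300], np.log10(D[:300]))           # train on 300 traces
pred = model.predict(X[300:])                   # estimate the rest
print("mean abs log10 error:", np.abs(pred - np.log10(D[300:])).mean())
```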
8.  Cytochrome P450 site of metabolism prediction from 2D topological fingerprints using GPU accelerated probabilistic classifiers 
Background
The prediction of sites and products of metabolism in xenobiotic compounds is key to the development of new chemical entities, where screening potential metabolites for toxicity or unwanted side-effects is of crucial importance. In this work, 2D topological fingerprints are used to encode atomic sites and three probabilistic machine learning methods are applied: Parzen-Rosenblatt Window (PRW), Naive Bayesian (NB) and a novel approach called RASCAL (Random Attribute Subsampling Classification ALgorithm). These are implemented by randomly subsampling descriptor space to alleviate the problem, often suffered by data mining methods, of having to exactly match fingerprints, and in the case of PRW by measuring a distance between feature vectors rather than requiring exact matches. The classifiers have been implemented in CUDA/C++ to exploit the parallel architecture of graphical processing units (GPUs) and are freely available in a public repository.
Results
It is shown that for PRW a SoM (Site of Metabolism) is identified in the top two predictions for 85%, 91% and 88% of the CYP 3A4, 2D6 and 2C9 data sets respectively, with RASCAL giving similar performance of 83%, 91% and 88%, respectively. These results put PRW and RASCAL performance ahead of NB which gave a much lower classification performance of 51%, 73% and 74%, respectively.
Conclusions
2D topological fingerprints calculated to a bond depth of 4-6 contain sufficient information to allow the identification of SoMs using classifiers based on relatively small data sets. Thus, the machine learning methods outlined in this paper are conceptually simpler and more efficient than the other methods tested, and the use of simple topological descriptors derived from 2D structure gives results competitive with approaches using more expensive quantum chemical descriptors. The descriptor-space subsampling approach and ensemble methodology allow the methods to be applied to molecules more distant from the training data, where data mining would be more likely to fail due to the lack of common fingerprints. The RASCAL algorithm is shown to give classification performance equivalent to PRW but at lower computational expense, allowing it to be applied more efficiently in the ensemble scheme.
doi:10.1186/1758-2946-6-29
PMCID: PMC4047555  PMID: 24959208
Cytochrome P450; Metabolism; Probabilistic; Classification; GPU; CUDA; 2D
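The atomic-site encoding can be approximated with RDKit's atom-environment utilities; this is a stand-in for the paper's 2D topological fingerprints, not the published descriptor:

```python
# Sketch: encode each atom by its circular topological environment,
# roughly analogous to the paper's per-atom 2D fingerprints.
from rdkit import Chem

def atom_env_signature(mol, atom_idx, radius=2):
    """Canonical SMILES of the bond environment around one atom."""
    bond_ids = Chem.FindAtomEnvironmentOfRadiusN(mol, radius, atom_idx)
    return Chem.MolToSmiles(Chem.PathToSubmol(mol, bond_ids))

mol = Chem.MolFromSmiles("CC(=O)Nc1ccc(O)cc1")   # paracetamol
for atom in mol.GetAtoms():                      # one signature per site
    print(atom.GetIdx(), atom.GetSymbol(),
          atom_env_signature(mol, atom.GetIdx()))
```

A classifier trained on such per-atom signatures, labelled with known sites of metabolism, is the general shape of the approach the abstract describes.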
9.  iDrug: a web-accessible and interactive drug discovery and design platform 
Background
Progress in computer-aided drug design (CADD) approaches over the past decades has accelerated early-stage pharmaceutical research, and many powerful standalone CADD tools have been developed in academia. Because these programs come from various research groups, a consistent, user-friendly online graphical working environment combining computational techniques such as pharmacophore mapping, similarity calculation, scoring, and target identification is needed.
Results
We present a versatile, user-friendly, and efficient online tool for computer-aided drug design based on pharmacophore and 3D molecular similarity searching. The web interface enables binding-site detection, virtual-screening hit identification, and drug-target prediction in an interactive manner through a seamless interface to all of the adapted packages (e.g., Cavity, PocketV.2, PharmMapper, SHAFTS). Several commercially available compound databases for hit identification and a well-annotated pharmacophore database for drug-target prediction were integrated into iDrug as well. The web interface provides tools for real-time molecular building/editing, converting, displaying, and analyzing. All customized configurations of the functional modules can be accessed through the session files provided, which can be saved to local disk and uploaded to resume or update previous work.
Conclusions
iDrug is easy to use and provides a novel, fast and reliable tool for conducting drug design experiments. Using iDrug, various molecular design tasks can be submitted and visualized simply in one browser without locally installing any standalone modeling software. iDrug is accessible free of charge at http://lilab.ecust.edu.cn/idrug.
doi:10.1186/1758-2946-6-28
PMCID: PMC4046018  PMID: 24955134
Online drug design platform; Cavity detection; Pharmacophore search; 3D similarity calculation; Target prediction
10.  Expanding the fragrance chemical space for virtual screening 
The properties of fragrance molecules in the public databases SuperScent and Flavornet were analyzed to define a “fragrance-like” (FL) property range (Heavy Atom Count ≤ 21, only C, H, O, S, (O + S) ≤ 3, Hydrogen Bond Donor ≤ 1) and the corresponding chemical space including FL molecules from PubChem (NIH repository of molecules), ChEMBL (bioactive molecules), ZINC (drug-like molecules), and GDB-13 (all possible organic molecules up to 13 atoms of C, N, O, S, Cl). The FL subsets of these databases were classified by MQN (Molecular Quantum Numbers, a set of 42 integer-value descriptors of molecular structure) and formatted for fast MQN-similarity searching and interactive exploration of color-coded principal component maps in the form of the FL-mapplet and FL-browser applications, freely available at http://www.gdb.unibe.ch. MQN-similarity is shown to efficiently recover 15 different fragrance molecule families from the different FL subsets, demonstrating the relevance of the MQN-based tools for exploring the fragrance chemical space.
doi:10.1186/1758-2946-6-27
PMCID: PMC4037718  PMID: 24876890
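The FL property range quoted above translates directly into a filter; a sketch with RDKit (the thresholds come from the abstract, the code itself is ours):

```python
# Sketch: the "fragrance-like" (FL) filter from the abstract, in RDKit.
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors

ALLOWED = {"C", "H", "O", "S"}

def is_fragrance_like(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False
    symbols = {a.GetSymbol() for a in mol.GetAtoms()}
    n_os = sum(a.GetSymbol() in ("O", "S") for a in mol.GetAtoms())
    return (mol.GetNumHeavyAtoms() <= 21          # HAC <= 21
            and symbols <= ALLOWED                # only C, H, O, S
            and n_os <= 3                         # (O + S) <= 3
            and rdMolDescriptors.CalcNumHBD(mol) <= 1)   # HBD <= 1

print(is_fragrance_like("CC(C)=CCCC(C)=CC=O"))   # citral -> True
print(is_fragrance_like("c1ccncc1"))             # pyridine (N) -> False
```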
11.  Estimation of acute oral toxicity in rat using local lazy learning
Background
Acute toxicity refers to the ability of a substance to cause adverse effects within a short period following dosing or exposure; assessing it is usually the first step in the toxicological investigation of an unknown substance. The median lethal dose, LD50, is frequently used as a general indicator of a substance's acute toxicity, and there is high demand for non-animal-based prediction of LD50. Unfortunately, it is difficult to accurately predict compound LD50 using a single QSAR model, because acute toxicity may involve complex mechanisms and multiple biochemical processes.
Results
In this study, we report the use of local lazy learning (LLL) methods, which can capture subtle local structure-toxicity relationships around each query compound, to develop LD50 prediction models: (a) local lazy regression (LLR): a linear regression model built using k neighbors; (b) SA: the arithmetic mean of the activities of the k nearest neighbors; (c) SR: the weighted mean of the activities of the k nearest neighbors; (d) GP: the projection point of the compound on the line defined by its two nearest neighbors. We defined the applicability domain (AD) to decide to what extent and under what circumstances a prediction is reliable. Finally, we developed a consensus model based on the predicted values of the individual LLL models, yielding a correlation coefficient R² of 0.712 on a test set of 2,896 compounds.
Conclusion
Encouraged by the promising results, we expect that our consensus LLL model of LD50 would become a useful tool for predicting acute toxicity. All models developed in this study are available via http://www.dddc.ac.cn/admetus.
doi:10.1186/1758-2946-6-26
PMCID: PMC4047767  PMID: 24959207
Acute toxicity; Local lazy learning; Applicability domain; Consensus model
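A minimal sketch of the LLR variant, method (a) above: fit a linear model on the k nearest neighbours of each query. The descriptors and activities below are synthetic placeholders, not the study's data:

```python
# Sketch of local lazy regression (LLR): per-query linear fit on the
# k nearest neighbours only (illustrative, not the authors' code).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import NearestNeighbors

def llr_predict(X_train, y_train, x_query, k=10):
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    idx = nn.kneighbors(x_query.reshape(1, -1), return_distance=False)[0]
    local = LinearRegression().fit(X_train[idx], y_train[idx])
    return local.predict(x_query.reshape(1, -1))[0]

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 8))                      # placeholder descriptors
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=500)   # placeholder activities
print(llr_predict(X, y, X[0], k=15))
```

The distance to the nearest neighbours doubles as a natural applicability-domain check: queries far from all training compounds get no trustworthy local model.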
12.  QSAR DataBank - an approach for the digital organization and archiving of QSAR model information
Background
Research efforts in the field of descriptive and predictive Quantitative Structure-Activity Relationships or Quantitative Structure-Property Relationships produce around one thousand scientific publications annually, and the materials and results are mainly communicated in print. Printed media in its present form has obvious limitations when it comes to effectively representing mathematical models (including complex, non-linear ones) and large bodies of associated numerical chemical data. It does not support secondary information extraction or reuse, while in silico studies pose additional requirements for accessibility, transparency and reproducibility of the research. This gap can and should be bridged by introducing domain-specific digital data exchange standards and tools. The current publication presents a formal specification of the quantitative structure-activity relationship data organization and archival format called the QSAR DataBank (QsarDB for shorter, or QDB for shortest).
Results
The article describes the QsarDB data schema, which formalizes QSAR concepts (objects and the relationships between them), and the QsarDB data format, which formalizes their presentation for computer systems. The utility and benefits of QsarDB have been thoroughly tested by solving everyday QSAR and predictive modeling problems, with examples in the field of predictive toxicology, and the format can be applied to a wide variety of other endpoints. The work is accompanied by an open-source reference implementation and tools.
Conclusions
The proposed open data, open source, and open standards design is open to public and proprietary extensions on many levels. Selected use cases exemplify the benefits of the proposed QsarDB data format. General ideas for future development are discussed.
doi:10.1186/1758-2946-6-25
PMCID: PMC4047268  PMID: 24910716
Data format; Data interoperability; Open science; QSAR; QSPR
13.  The BioDICE Taverna plugin for clustering and visualization of biological data: a workflow for molecular compounds exploration
Background
In many experimental pipelines, clustering of multidimensional biological datasets is used to detect hidden structures in unlabelled input data. Taverna is a popular workflow management system used to design and execute scientific workflows and aid in silico experimentation. The availability of fast unsupervised methods for clustering and visualization in the Taverna platform is important for supporting data-driven scientific discovery in complex and explorative bioinformatics applications.
Results
This work presents a Taverna plugin, the Biological Data Interactive Clustering Explorer (BioDICE), that performs clustering of high-dimensional biological data and provides a nonlinear, topology preserving projection for the visualization of the input data and their similarities. The core algorithm in the BioDICE plugin is Fast Learning Self Organizing Map (FLSOM), which is an improved variant of the Self Organizing Map (SOM) algorithm. The plugin generates an interactive 2D map that allows the visual exploration of multidimensional data and the identification of groups of similar objects. The effectiveness of the plugin is demonstrated on a case study related to chemical compounds.
Conclusions
The number and variety of available tools, together with its extensibility, have made Taverna a popular choice for the development of scientific data workflows. This work presents a novel plugin, BioDICE, which adds a data-driven knowledge discovery component to Taverna. BioDICE provides an effective and powerful clustering tool that can be adopted for the explorative analysis of biological datasets.
doi:10.1186/1758-2946-6-24
PMCID: PMC4036106
Molecular compounds; Self organizing map; Clustering; Visualization; Taverna
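The kind of SOM analysis BioDICE wraps can be sketched with the third-party MiniSom package standing in for FLSOM (illustrative data; FLSOM itself adds the fast-learning refinements described in the paper):

```python
# Sketch: self-organizing-map projection of descriptor vectors onto a
# 2D grid, the style of analysis BioDICE exposes (MiniSom, not FLSOM).
import numpy as np
from minisom import MiniSom

rng = np.random.default_rng(5)
data = rng.normal(size=(300, 16))        # placeholder descriptors

som = MiniSom(10, 10, 16, sigma=1.5, learning_rate=0.5, random_seed=0)
som.train_random(data, 1000)

# Each sample maps to its best-matching unit; nearby cells on the grid
# hold similar objects, giving the topology-preserving 2D map.
cells = [som.winner(x) for x in data]
print(cells[:5])
```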
14.  Using cheminformatics to predict cross reactivity of “designer drugs” to their currently available immunoassays
Background
“Designer drugs”, compounds typically created by modifying members of existing clinical drug classes such as the amphetamines and cannabinoids, present a challenge for drug-of-abuse testing. Screening immunoassays directed at amphetamine or methamphetamine detect only a small subset of designer amphetamine-like drugs, and immunoassays designed for tetrahydrocannabinol metabolites generally do not cross-react with synthetic cannabinoids lacking the classic cannabinoid chemical backbone. Understanding how to detect and identify whether a patient has taken a molecule of one class or another is therefore complex, with direct impact on clinical care.
Methods
Cross-reactivity data from immunoassays specifically targeting designer amphetamine-like and synthetic cannabinoid drugs was collected from multiple published sources, and virtual chemical libraries for molecular similarity analysis were built. The virtual library for synthetic cannabinoid analysis contained a total of 169 structures, while the virtual library for amphetamine-type stimulants contained 288 compounds. Two-dimensional (2D) similarity for each test compound was compared to the target molecule of the immunoassay undergoing analysis.
Results
2D similarity differentiated between cross-reactive and non-cross-reactive compounds for immunoassays targeting mephedrone/methcathinone, 3,4-methylenedioxypyrovalerone, benzylpiperazine, mephentermine, and synthetic cannabinoids.
Conclusions
In this study, we applied 2D molecular similarity analysis to the designer amphetamine-type stimulants and synthetic cannabinoids. Similarity calculations can be used to more efficiently decide which drugs and metabolites should be tested in cross-reactivity studies, as well as to design experiments and potentially predict antigens that would lead to immunoassays with cross reactivity for a broader array of designer drugs.
doi:10.1186/1758-2946-6-22
PMCID: PMC4029917  PMID: 24851137
Amphetamines; Cannabinoids; Molecular models; Similarity; Toxicology
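The 2D similarity analysis reduces to fingerprint comparisons; a sketch with RDKit, using illustrative compounds rather than the study's panels:

```python
# Sketch: rank candidate designer drugs by 2D Tanimoto similarity to an
# immunoassay's target compound; higher similarity suggests possible
# cross-reactivity (compound choices here are illustrative).
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def fp(smiles):
    return AllChem.GetMorganFingerprintAsBitVect(
        Chem.MolFromSmiles(smiles), 2, nBits=2048)

target = fp("CC(N)Cc1ccccc1")                 # amphetamine (assay target)
candidates = {
    "methamphetamine": "CC(Cc1ccccc1)NC",
    "MDMA":            "CC(Cc1ccc2OCOc2c1)NC",
    "caffeine":        "Cn1cnc2c1c(=O)n(C)c(=O)n2C",
}
for name, smi in candidates.items():
    print(name, round(DataStructs.TanimotoSimilarity(target, fp(smi)), 2))
```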
15.  A rotation-translation invariant molecular descriptor of partial charges and its use in ligand-based virtual screening
Background
Measures of similarity for chemical molecules have been developed since the dawn of chemoinformatics. Molecular similarity has been measured by a variety of methods including molecular descriptor based similarity, common molecular fragments, graph matching and 3D methods such as shape matching. Similarity measures are widespread in practice and have proven to be useful in drug discovery. Because of our interest in electrostatics and high throughput ligand-based virtual screening, we sought to exploit the information contained in atomic coordinates and partial charges of a molecule.
Results
A new molecular descriptor based on partial charges is proposed. It uses the autocorrelation function and linear binning to encode all atoms of a molecule into two rotation-translation invariant vectors. Combined with a scoring function, the descriptor allows a database of compounds to be rank-ordered against a query molecule. The implementation, called ACPC (AutoCorrelation of Partial Charges), is released as open source. Extensive retrospective ligand-based virtual screening experiments were performed, and the method was compared with others, in order to validate it and the associated protocol.
Conclusions
While it is a simple method, it performed remarkably well in experiments. At an average speed of 1649 molecules per second, it reached an average median area under the curve of 0.81 on 40 different targets, thus validating the proposed protocol and implementation.
doi:10.1186/1758-2946-6-23
PMCID: PMC4030740  PMID: 24887178
RTI molecular descriptor; Partial charges; Ligand-based virtual screening; Spatial auto-correlation; Cross-correlation; Linear binning; ACPC
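A simplified sketch of a charge-autocorrelation descriptor in the spirit of ACPC, using Gasteiger charges and plain histogram binning rather than the paper's linear binning; see the released ACPC code for the real protocol:

```python
# Sketch: sum q_i * q_j over atom pairs, binned by interatomic distance.
# Distances are invariant to rotation and translation, so the vector is too.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

def charge_autocorrelation(smiles, n_bins=40, max_dist=20.0):
    mol = Chem.AddHs(Chem.MolFromSmiles(smiles))
    AllChem.EmbedMolecule(mol, randomSeed=0)      # one 3D conformer
    AllChem.ComputeGasteigerCharges(mol)
    pos = mol.GetConformer().GetPositions()       # (n_atoms, 3) coordinates
    q = np.array([a.GetDoubleProp("_GasteigerCharge")
                  for a in mol.GetAtoms()])
    ac = np.zeros(n_bins)
    for i in range(len(q)):
        for j in range(i + 1, len(q)):
            b = int(np.linalg.norm(pos[i] - pos[j]) / max_dist * n_bins)
            if b < n_bins:                 # pairs beyond max_dist dropped
                ac[b] += q[i] * q[j]
    return ac

print(np.round(charge_autocorrelation("CC(=O)Oc1ccccc1C(=O)O"), 3))
```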
16.  Self organising hypothesis networks: a new approach for representing and structuring SAR knowledge
Background
Combining different sources of knowledge to build improved structure-activity relationship models is not easy, owing to the variety of knowledge formats and the absence of a common framework for interoperating between learning techniques. Most current approaches address this problem with consensus models that operate at the prediction level. We explore the possibility of directly combining these sources at the knowledge level, with the aim of harvesting potentially increased synergy at an earlier stage. Our goal is to design a general methodology that facilitates knowledge discovery and produces accurate and interpretable models.
Results
To combine models at the knowledge level, we propose to decouple the learning phase from the knowledge application phase using a pivot representation (lingua franca) based on the concept of a hypothesis. A hypothesis is a simple and interpretable knowledge unit. Regardless of its origin, knowledge is broken down into a collection of hypotheses, which are subsequently organised into a hierarchical network. This unification permits different sources of knowledge to be combined in a common formalised framework. The approach allows us to create a synergistic system between different forms of knowledge, and new algorithms can be applied to leverage this unified model. This first article focuses on the general principle of the Self Organising Hypothesis Network (SOHN) approach in the context of binary classification problems, along with an illustrative application to the prediction of mutagenicity.
Conclusion
It is possible to represent knowledge in the unified form of a hypothesis network allowing interpretable predictions with performances comparable to mainstream machine learning techniques. This new approach offers the potential to combine knowledge from different sources into a common framework in which high level reasoning and meta-learning can be applied; these latter perspectives will be explored in future work.
doi:10.1186/1758-2946-6-21
PMCID: PMC4048587  PMID: 24959206
Machine learning; Knowledge discovery; Data mining; SAR; QSAR; SOHN; Interpretable model; Confidence metric; Hypothesis Network
17.  Supervised extensions of chemography approaches: case studies of chemical liabilities assessment
Chemical liabilities, such as adverse effects and toxicity, play a significant role in the modern drug discovery process. In silico assessment of chemical liabilities is an important step aimed at reducing costs and animal testing by complementing or replacing in vitro and in vivo experiments. Herein, we propose an approach combining several classification and chemography methods to predict chemical liabilities and to interpret the results in terms of the impact of structural changes of compounds on their pharmacological profile. To our knowledge for the first time, a supervised extension of Generative Topographic Mapping is proposed as an effective new chemography method, and a new approach is proposed for mapping new data using supervised Isomap without rebuilding the models from scratch. Two approaches for estimating a model's applicability domain are used in our study, also, to our knowledge, for the first time in chemoinformatics. Structural alerts responsible for the negative characteristics of the pharmacological profiles of chemical compounds were found as a result of model interpretation.
doi:10.1186/1758-2946-6-20
PMCID: PMC4018504  PMID: 24868246
Cheminformatics; Chemography; Applicability domain; Generative topographic mapping; Dimensionality reduction; Supervised generative topographic mapping; Isomap; Supervised Isomap
18.  Condorcet and Borda count fusion method for ligand-based virtual screening
Background
It is known that no individual similarity measure will always give the best recall of active molecule structures across all types of activity classes. The effectiveness of ligand-based virtual screening approaches can, however, be enhanced by data fusion, which can be implemented in two different ways: group fusion and similarity fusion. Similarity fusion involves searching with multiple similarity measures; the similarity scores, or rankings, from each measure are combined to obtain the final ranking of the compounds in the database.
Results
The Condorcet fusion method was examined. This approach combines the outputs of similarity searches from eleven association and distance similarity coefficients, and the winning measure for each class of molecules, based on Condorcet fusion, was chosen as the best search method. The recall of retrieved active molecules in the top 5% and significance tests were used to evaluate the proposed method. The MDL Drug Data Report (MDDR), Maximum Unbiased Validation (MUV) and Directory of Useful Decoys (DUD) data sets, represented by 2D fingerprints, were used for the experiments.
Conclusions
Simulated virtual screening experiments with the standard data sets show that Condorcet fusion provides a very simple way of improving ligand-based virtual screening, especially when the active molecules being sought have a low degree of structural heterogeneity. The improvement was only slight when structurally diverse sets of actives were being sought.
doi:10.1186/1758-2946-6-19
PMCID: PMC4026830  PMID: 24883114
Similarity searching; Virtual screening; Similarity coefficients; Data fusion
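Borda-count fusion, one of the two methods in the title, is easy to state in code (toy rankings below; Condorcet fusion additionally runs pairwise majority contests between compounds):

```python
# Sketch: Borda-count fusion of rankings produced by several similarity
# coefficients. Each ranking lists the same compounds, best first.
def borda_fuse(rankings):
    n = len(rankings[0])
    scores = {}
    for ranking in rankings:
        for rank, mol in enumerate(ranking):
            scores[mol] = scores.get(mol, 0) + (n - rank)  # Borda points
    return sorted(scores, key=scores.get, reverse=True)

tanimoto = ["m3", "m1", "m2", "m4"]   # ranking from coefficient 1
cosine   = ["m1", "m3", "m4", "m2"]   # ranking from coefficient 2
dice     = ["m3", "m2", "m1", "m4"]   # ranking from coefficient 3
print(borda_fuse([tanimoto, cosine, dice]))   # fused ranking, best first
```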
19.  Chemical named entities recognition: a review on approaches and applications
The rapid increase in the flow of published digital information in all disciplines has created a pressing need for techniques that simplify the use of this information. The chemistry literature is very rich in information about chemical entities. Extracting molecules and their related properties and activities from the scientific literature, text mining the extracted data, and determining contextual relationships helps research scientists, particularly those in drug development. One of the most important challenges in chemical text mining is the recognition of chemical entities mentioned in texts. In this review, we briefly introduce the fundamental concepts of chemical literature mining, the textual contents of chemical documents, and the methods of naming chemicals in documents. We sketch out dictionary-based, rule-based, machine learning, and hybrid chemical named entity recognition approaches together with their applied solutions. We end with an outlook on the pros and cons of these approaches and the types of chemical entities extracted.
doi:10.1186/1758-2946-6-17
PMCID: PMC4022577  PMID: 24834132
Chemical entities; Information extraction; Chemical names
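Of the surveyed approaches, the dictionary-based one is the simplest to sketch (a toy lexicon; real systems combine large curated dictionaries with rules or machine learning):

```python
# Sketch: a toy dictionary-based chemical named-entity recogniser of
# the kind the review surveys. LEXICON is a tiny illustrative stand-in.
import re

LEXICON = {"aspirin", "ibuprofen", "caffeine", "ethanol"}

def find_chemicals(text):
    """Return (start, end, surface form) for each lexicon hit."""
    hits = []
    for m in re.finditer(r"[A-Za-z][A-Za-z0-9\-]+", text):
        if m.group().lower() in LEXICON:
            hits.append((m.start(), m.end(), m.group()))
    return hits

text = "Aspirin and caffeine were assayed; ethanol served as solvent."
print(find_chemicals(text))
```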
