High-throughput screening (HTS) is one of the main strategies to identify novel entry points for the development of small molecule chemical probes and drugs and is now commonly accessible to public sector research. Large amounts of data generated in HTS campaigns are submitted to public repositories such as PubChem, which is growing at an exponential rate. The diversity and quantity of available HTS assays and screening results pose enormous challenges to organizing, standardizing, integrating, and analyzing the datasets and thus to maximize the scientific and ultimately the public health impact of the huge investments made to implement public sector HTS capabilities. Novel approaches to organize, standardize and access HTS data are required to address these challenges.
We developed the first ontology to describe HTS experiments and screening results using expressive description logic. The BioAssay Ontology (BAO) serves as a foundation for the standardization of HTS assays and data and as a semantic knowledge model. In this paper we show important examples of formalizing HTS domain knowledge and we point out the advantages of this approach. The ontology is available online at the NCBO bioportal http://bioportal.bioontology.org/ontologies/44531.
After a large manual curation effort, we loaded BAO-mapped data triples into a RDF database store and used a reasoner in several case studies to demonstrate the benefits of formalized domain knowledge representation in BAO. The examples illustrate semantic querying capabilities where BAO enables the retrieval of inferred search results that are relevant to a given query, but are not explicitly defined. BAO thus opens new functionality for annotating, querying, and analyzing HTS datasets and the potential for discovering new knowledge by means of inference.
Summary: PubChem is a public repository of chemical structures and associated biological activities. The PubChem BioAssay database contains assay descriptions, conditions and readouts and biological screening results that have been submitted by the biomedical research community. The PubChem web site and Power User Gateway (PUG) web service allow users to interact with the data and raw files are available via FTP.
These resources are helpful to many but there can also be great benefit by using a software API to manipulate the data. Here, we describe a Java API with entity objects mapped to the PubChem Schema and with wrapper functions for calling the NCBI eUtilities and PubChem PUG web services. PubChem BioAssays and associated chemical compounds can then be queried and manipulated in a local relational database. Features include chemical structure searching and generation and display of curve fits from stored dose–response experiments, something that is not yet available within PubChem itself. The aim is to provide researchers with a fast, consistent, queryable local resource from which to manipulate PubChem BioAssays in a database agnostic manner. It is not intended as an end user tool but to provide a platform for further automation and tools development.
PubChem (http://pubchem.ncbi.nlm.nih.gov) is a public repository for biological properties of small molecules hosted by the US National Institutes of Health (NIH). PubChem BioAssay database currently contains biological test results for more than 700 000 compounds. The goal of PubChem is to make this information easily accessible to biomedical researchers. In this work, we present a set of web servers to facilitate and optimize the utility of biological activity information within PubChem. These web-based services provide tools for rapid data retrieval, integration and comparison of biological screening results, exploratory structure–activity analysis, and target selectivity examination. This article reviews these bioactivity analysis tools and discusses their uses. Most of the tools described in this work can be directly accessed at http://pubchem.ncbi.nlm.nih.gov/assay/. URLs for accessing other tools described in this work are specified individually.
PubChem (http://pubchem.ncbi.nlm.nih.gov) is a public repository for biological activity data of small molecules and RNAi reagents. The mission of PubChem is to deliver free and easy access to all deposited data, and to provide intuitive data analysis tools. The PubChem BioAssay database currently contains 500 000 descriptions of assay protocols, covering 5000 protein targets, 30 000 gene targets and providing over 130 million bioactivity outcomes. PubChem's bioassay data are integrated into the NCBI Entrez information retrieval system, thus making PubChem data searchable and accessible by Entrez queries. Also, as a repository, PubChem constantly optimizes and develops its deposition system answering many demands of both high- and low-volume depositors. The PubChem information platform allows users to search, review and download bioassay description and data. The PubChem platform also enables researchers to collect, compare and analyze biological test results through web-based and programmatic tools. In this work, we provide an update for the PubChem BioAssay resource, including information content growth, data model extension and new developments of data submission, retrieval, analysis and download tools.
Huge amounts of high-throughput screening (HTS) data for probe and drug development projects are being generated in the pharmaceutical industry and more recently in the public sector. The resulting experimental datasets are increasingly being disseminated via publically accessible repositories. However, existing repositories lack sufficient metadata to describe the experiments and are often difficult to navigate by non-experts. The lack of standardized descriptions and semantics of biological assays and screening results hinder targeted data retrieval, integration, aggregation, and analyses across different HTS datasets, for example to infer mechanisms of action of small molecule perturbagens. To address these limitations, we created the BioAssay Ontology (BAO). BAO has been developed with a focus on data integration and analysis enabling the classification of assays and screening results by concepts that relate to format, assay design, technology, target, and endpoint. Previously, we reported on the higher-level design of BAO and on the semantic querying capabilities offered by the ontology-indexed triple store of HTS data. Here, we report on our detailed design, annotation pipeline, substantially enlarged annotation knowledgebase, and analysis results. We used BAO to annotate assays from the largest public HTS data repository, PubChem, and demonstrate its utility to categorize and analyze diverse HTS results from numerous experiments. BAO is publically available from the NCBO BioPortal at http://bioportal.bioontology.org/ontologies/1533. BAO provides controlled terminology and uniform scope to report probe and drug discovery screening assays and results. BAO leverages description logic to formalize the domain knowledge and facilitate the semantic integration with diverse other resources. As a consequence, BAO offers the potential to infer new knowledge from a corpus of assay results, for example molecular mechanisms of action of perturbagens.
The PubChem BioAssay database (http://pubchem.ncbi.nlm.nih.gov) is a public repository for biological activities of small molecules and small interfering RNAs (siRNAs) hosted by the US National Institutes of Health (NIH). It archives experimental descriptions of assays and biological test results and makes the information freely accessible to the public. A PubChem BioAssay data entry includes an assay description, a summary and detailed test results. Each assay record is linked to the molecular target, whenever possible, and is cross-referenced to other National Center for Biotechnology Information (NCBI) database records. ‘Related BioAssays’ are identified by examining the assay target relationship and activity profile of commonly tested compounds. A key goal of PubChem BioAssay is to make the biological activity information easily accessible through the NCBI information retrieval system-Entrez, and various web-based PubChem services. An integrated suite of data analysis tools are available to optimize the utility of the chemical structure and biological activity information within PubChem, enabling researchers to aggregate, compare and analyze biological test results contributed by multiple organizations. In this work, we describe the PubChem BioAssay database, including data model, bioassay deposition and utilities that PubChem provides for searching, downloading and analyzing the biological activity information contained therein.
In recent years, the number of High Throughput Screening (HTS) assays deposited in PubChem has grown quickly. As a result, the volume of both the structured information (i.e. molecular structure, bioactivities) and the unstructured information (such as descriptions of bioassay experiments), has been increasing exponentially. As a result, it has become even more demanding and challenging to efficiently assemble the bioactivity data by mining the huge amount of information to identify and interpret the relationships among the diversified bioassay experiments. In this work, we propose a text-mining based approach for bioassay neighboring analysis from the unstructured text descriptions contained in the PubChem BioAssay database.
The neighboring analysis is achieved by evaluating the cosine scores of each bioassay pair and fraction of overlaps among the human-curated neighbors. Our results from the cosine score distribution analysis and assay neighbor clustering analysis on all PubChem bioassays suggest that strong correlations among the bioassays can be identified from their conceptual relevance. A comparison with other existing assay neighboring methods suggests that the text-mining based bioassay neighboring approach provides meaningful linkages among the PubChem bioassays, and complements the existing methods by identifying additional relationships among the bioassay entries.
The text-mining based bioassay neighboring analysis is efficient for correlating bioassays and studying different aspects of a biological process, which are otherwise difficult to achieve by existing neighboring procedures due to the lack of specific annotations and structured information. It is suggested that the text-mining based bioassay neighboring analysis can be used as a standalone or as a complementary tool for the PubChem bioassay neighboring process to enable efficient integration of assay results and generate hypotheses for the discovery of bioactivities of the tested reagents.
The importance of bioluminescence in enabling a broad range of high-throughput screening (HTS) assay formats is evidenced by widespread use in industry and academia. Therefore, understanding the mechanisms by which reporter enzyme activity can be modulated by small molecules is critical to the interpretation of HTS data. In this Perspective, we provide evidence for stabilization of luciferase by inhibitors in cell-based luciferase reporter-gene assays resulting in the counterintuitive phenomenon of signal activation. These data were derived from our analysis of luciferase inhibitor compound structures and their prevalence in the Molecular Libraries Small Molecule Repository using 100 HTS experiments available in PubChem. Accordingly, we found an enrichment of luciferase inhibitors in luciferase reporter-gene activation assays but not in assays using other reporters. In addition, for several luciferase inhibitor chemotypes, we measured reporter stabilization and signal activation in cells that paralleled the inhibition determined using purified luciferase to provide further experimental support for these contrasting effects.
Motivation: The comprehensive information of small molecules and their biological activities in PubChem brings great opportunities for academic researchers. However, mining high-throughput screening (HTS) assay data remains a great challenge given the very large data volume and the highly imbalanced nature with only small number of active compounds compared to inactive compounds. Therefore, there is currently a need for better strategies to work with HTS assay data. Moreover, as luciferase-based HTS technology is frequently exploited in the assays deposited in PubChem, constructing a computational model to distinguish and filter out potential interference compounds for these assays is another motivation.
Results: We used the granular support vector machines (SVMs) repetitive under sampling method (GSVM-RU) to construct an SVM from luciferase inhibition bioassay data that the imbalance ratio of active/inactive is high (1/377). The best model recognized the active and inactive compounds at the accuracies of 86.60% and 88.89 with a total accuracy of 87.74%, by cross-validation test and blind test. These results demonstrate the robustness of the model in handling the intrinsic imbalance problem in HTS data and it can be used as a virtual screening tool to identify potential interference compounds in luciferase-based HTS experiments. Additionally, this method has also proved computationally efficient by greatly reducing the computational cost and can be easily adopted in the analysis of HTS data for other biological systems.
Availability: Data are publicly available in PubChem with AIDs of 773, 1006 and 1379.
Contact: firstname.lastname@example.org; email@example.com
Supplementary information: Supplementary data are available at Bioinformatics online.
Firefly luciferase (FLuc) is frequently used as a reporter in high-throughput screening assays owing to the exceptional sensitivity, dynamic range, and rapid measurement that bioluminescence affords. However, interaction of small molecules with FLuc has, to some extent, confounded its use in chemical biology and drug discovery. To identify and characterize chemotypes interacting with FLuc, we determined potency values for 360,864 compounds, found in the NIH Molecular Libraries Small Molecule Repository, available in PubChem. FLuc inhibitory activity was observed for 12% of this library with discernible SAR. Characterization of 151 inhibitors demonstrated a variety of inhibition modes including FLuc-catalyzed formation of multisubstrate-adduct enzyme inhibitor complexes. As in some cell-based FLuc reporter assays compounds acting as FLuc inhibitors yield paradoxical luminescence increases, data on compounds acquired from FLuc-dependent assays requires careful analysis as described in this report.
profiling; PubChem; luciferase; quantitative high-throughput screening; qHTS; firefly luciferase; reporter-gene assays; adenylate forming enzymes
Recent years have seen an explosion in the amount of publicly available chemical and related biological information. A significant step has been the emergence of PubChem, which contains property information for millions of chemical structures, and acts as a repository of compounds and bioassay screening data for the NIH Roadmap. There is a strong need for tools designed for scientists that permit easy download and use of these data. We present one such tool, PubChemSR.
PubChemSR (Search and Retrieve) is a freely available desktop application written for Windows using Microsoft .NET that is designed to assist scientists in search, retrieval and organization of chemical and biological data from the PubChem database. It employs SOAP web services made available by NCBI for extraction of information from PubChem.
Results and Discussion
The program supports a wide range of searching techniques, including queries based on assay or compound keywords and chemical substructures. Results can be examined individually or downloaded and exported in batch for use in other programs such as Microsoft Excel. We believe that PubChemSR makes it straightforward for researchers to utilize the chemical, biological and screening data available in PubChem. We present several examples of how it can be used.
Recent advances in high-throughput screening (HTS) techniques and readily available compound libraries generated using combinatorial chemistry or derived from natural products enable the testing of millions of compounds in a matter of days. Due to the amount of information produced by HTS assays, it is a very challenging task to mine the HTS data for potential interest in drug development research. Computational approaches for the analysis of HTS results face great challenges due to the large quantity of information and significant amounts of erroneous data produced.
In this study, Decision Trees (DT) based models were developed to discriminate compound bioactivities by using their chemical structure fingerprints provided in the PubChem system . The DT models were examined for filtering biological activity data contained in four assays deposited in the PubChem Bioassay Database including assays tested for 5HT1a agonists, antagonists, and HIV-1 RT-RNase H inhibitors. The 10-fold Cross Validation (CV) sensitivity, specificity and Matthews Correlation Coefficient (MCC) for the models are 57.2~80.5%, 97.3~99.0%, 0.4~0.5 respectively. A further evaluation was also performed for DT models built for two independent bioassays, where inhibitors for the same HIV RNase target were screened using different compound libraries, this experiment yields enrichment factor of 4.4 and 9.7.
Our results suggest that the designed DT models can be used as a virtual screening technique as well as a complement to traditional approaches for hits selection.
Importance of the field
PubChem is a public molecular information repository, a scientific showcase of the NIH Roadmap Initiative. The PubChem database holds over 27 million records of unique chemical structures of compounds (CID) derived from nearly 70 million substance depositions (SID), and contains more than 449,000 bioassay records with over thousands of in vitro biochemical and cell-based screening bioassays established, with targeting more than 7000 proteins and genes linking to over 1.8 million of substances.
Areas covered in this review
This review builds on recent PubChem-related computational chemistry research reported by other authors while providing readers with an overview of the PubChem database, focusing on its increasing role in cheminformatics, virtual screening and toxicity prediction modeling.
What the reader will gain
These publicly available datasets in PubChem provide great opportunities for scientists to perform cheminformatics and virtual screening research for computer-aided drug design. However, the high volume and complexity of the datasets, in particular the bioassay-associated false positives/negatives and highly imbalanced datasets in PubChem, also creates major challenges. Several approaches regarding the modeling of PubChem datasets and development of virtual screening models for bioactivity and toxicity predictions are also reviewed.
Take home message
Novel data-mining cheminformatics tools and virtual screening algorithms are being developed and used to retrieve, annotate and analyze the large-scale and highly complex PubChem biological screening data for drug design.
PubChem; cheminformatics; data-mining; virtual screening; toxicity; polypharmacology
Cathepsin B, a ubiquitous lysosomal cysteine protease, is involved in many biological processes related to several human diseases. Inhibitors targeting the enzyme have been investigated as possible diseases treatments. A set of 37 compounds were recently found active in a high throughput screening assay to inhibit the catalytic activity of Cathepsin B, with chemical structures and biological test results available to the public in the PubChem BioAssay Database (AID 820). In the present study, we compare these experimental activities to the results of theoretical predictions from binding affinity calculation with a LR-MM-PNSA approach based on docked complexes. Strong correlations (r2 = 0.919 and q2 = 0.887 for the best) are observed between the theoretical predictions and experimental biological activity. The models are cross-validated by four independent predictive experiments with randomly split compounds into training and test sets. Our results also show that the results based on protein dimer show better correlations with experimental activity when compared to results based on monomer in the in silico calculations.
Cathepsin B protein; free energy of binding; virtual screening; modeling; drug design; regression; PubChem
To improve the utility of PubChem, a public repository containing biological activities of small molecules, the PubChem3D project adds computationally-derived three-dimensional (3-D) descriptions to the small-molecule records contained in the PubChem Compound database and provides various search and analysis tools that exploit 3-D molecular similarity. Therefore, the efficient use of PubChem3D resources requires an understanding of the statistical and biological meaning of computed 3-D molecular similarity scores between molecules.
The present study investigated effects of employing multiple conformers per compound upon the 3-D similarity scores between ten thousand randomly selected biologically-tested compounds (10-K set) and between non-inactive compounds in a given biological assay (156-K set). When the “best-conformer-pair” approach, in which a 3-D similarity score between two compounds is represented by the greatest similarity score among all possible conformer pairs arising from a compound pair, was employed with ten diverse conformers per compound, the average 3-D similarity scores for the 10-K set increased by 0.11, 0.09, 0.15, 0.16, 0.07, and 0.18 for STST-opt, CTST-opt, ComboTST-opt, STCT-opt, CTCT-opt, and ComboTCT-opt, respectively, relative to the corresponding averages computed using a single conformer per compound. Interestingly, the best-conformer-pair approach also increased the average 3-D similarity scores for the non-inactive–non-inactive (NN) pairs for a given assay, by comparable amounts to those for the random compound pairs, although some assays showed a pronounced increase in the per-assay NN-pair 3-D similarity scores, compared to the average increase for the random compound pairs.
These results suggest that the use of ten diverse conformers per compound in PubChem bioassay data analysis using 3-D molecular similarity is not expected to increase the separation of non-inactive from random and inactive spaces “on average”, although some assays show a noticeable separation between the non-inactive and random spaces when multiple conformers are used for each compound. The present study is a critical next step to understand effects of conformational diversity of the molecules upon the 3-D molecular similarity and its application to biological activity data analysis in PubChem. The results of this study may be helpful to build search and analysis tools that exploit 3-D molecular similarity between compounds archived in PubChem and other molecular libraries in a more efficient way.
There are three main problems associated with the virtual screening of bioassay data. The first is access to freely-available curated data, the second is the number of false positives that occur in the physical primary screening process, and finally the data is highly-imbalanced with a low ratio of Active compounds to Inactive compounds. This paper first discusses these three problems and then a selection of Weka cost-sensitive classifiers (Naive Bayes, SVM, C4.5 and Random Forest) are applied to a variety of bioassay datasets.
Pharmaceutical bioassay data is not readily available to the academic community. The data held at PubChem is not curated and there is a lack of detailed cross-referencing between Primary and Confirmatory screening assays. With regard to the number of false positives that occur in the primary screening process, the analysis carried out has been shallow due to the lack of cross-referencing mentioned above. In six cases found, the average percentage of false positives from the High-Throughput Primary screen is quite high at 64%. For the cost-sensitive classification, Weka's implementations of the Support Vector Machine and C4.5 decision tree learner have performed relatively well. It was also found, that the setting of the Weka cost matrix is dependent on the base classifier used and not solely on the ratio of class imbalance.
Understandably, pharmaceutical data is hard to obtain. However, it would be beneficial to both the pharmaceutical industry and to academics for curated primary screening and corresponding confirmatory data to be provided. Two benefits could be gained by employing virtual screening techniques to bioassay data. First, by reducing the search space of compounds to be screened and secondly, by analysing the false positives that occur in the primary screening process, the technology may be improved. The number of false positives arising from primary screening leads to the issue of whether this type of data should be used for virtual screening. Care when using Weka's cost-sensitive classifiers is needed - across the board misclassification costs based on class ratios should not be used when comparing differing classifiers for the same dataset.
Summary: Promiscuity counts allow for a better understanding of a compound's assay activity profile and drug potential. Although PubChem contains a vast amount of compound and assay data, it currently does not have a convenient or efficient method to obtain in-depth promiscuity counts for compounds. PubChem promiscuity fills this gap. It is a Java servlet that uses NCBI Entrez (eUtils) web services to interact with PubChem and provide promiscuity counts in a variety of categories along with compound descriptors, including PAINS-based functional group detection.
With the rapidly increasing availability of High-Throughput Screening (HTS) data in the public domain, such as the PubChem database, methods for ligand-based computer-aided drug discovery (LB-CADD) have the potential to accelerate and reduce the cost of probe development and drug discovery efforts in academia. We assemble nine data sets from realistic HTS campaigns representing major families of drug target proteins for benchmarking LB-CADD methods. Each data set is public domain through PubChem and carefully collated through confirmation screens validating active compounds. These data sets provide the foundation for benchmarking a new cheminformatics framework BCL::ChemInfo, which is freely available for non-commercial use. Quantitative structure activity relationship (QSAR) models are built using Artificial Neural Networks (ANNs), Support Vector Machines (SVMs), Decision Trees (DTs), and Kohonen networks (KNs). Problem-specific descriptor optimization protocols are assessed including Sequential Feature Forward Selection (SFFS) and various information content measures. Measures of predictive power and confidence are evaluated through cross-validation, and a consensus prediction scheme is tested that combines orthogonal machine learning algorithms into a single predictor. Enrichments ranging from 15 to 101 for a TPR cutoff of 25% are observed.
virtual screening; machine learning; quantitative structure-activity relations (QSAR); high-throughput screening (HTS); cheminformatics; PubChem; BCL
Adverse drug reactions (ADRs) can have severe consequences, such that the ability to predict ADRs prior to market introduction is desirable. Computational approaches applied to pre-clinical data might be one way to inform drug labeling and marketing with respect to potential ADRs.
Based on the premise that some of the molecular actors of ADRs involve interactions detectable in large, and increasingly public, compound screening campaigns, we generated logistic regression models that correlate post-marketing ADRs with screening data from the PubChem BioAssay database. These models analyze ADRs at the level of organ systems, the System Organ Classes (SOCs). Nine of the 19 SOCs under consideration were found to be significantly correlated with pre-clinical screening data. For 6 of the 8 established drugs for which we could retropredict SOC-specific adversities, prior knowledge was found that support these predictions. We conclude by predicting SOC-specific adversities for three unapproved or recently introduced drugs.
Adverse drug reactions; prediction; machine learning; compound screening; pharmacovigilance
The use of 3-D similarity techniques in the analysis of biological data and virtual screening is pervasive, but what is a biologically meaningful 3-D similarity value? Can one find statistically significant separation between "active/active" and "active/inactive" spaces? These questions are explored using 734,486 biologically tested chemical structures, 1,389 biological assay data sets, and six different 3-D similarity types utilized by PubChem analysis tools.
The similarity value distributions of 269.7 billion unique conformer pairs from 734,486 biologically tested compounds (all-against-all) from PubChem were utilized to help work towards an answer to the question: what is a biologically meaningful 3-D similarity score? The average and standard deviation for the six similarity measures STST-opt, CTST-opt, ComboTST-opt, STCT-opt, CTCT-opt, and ComboTCT-opt were 0.54 ± 0.10, 0.07 ± 0.05, 0.62 ± 0.13, 0.41 ± 0.11, 0.18 ± 0.06, and 0.59 ± 0.14, respectively. Considering that this random distribution of biologically tested compounds was constructed using a single theoretical conformer per compound (the "default" conformer provided by PubChem), further study may be necessary using multiple diverse conformers per compound; however, given the breadth of the compound set, the single conformer per compound results may still apply to the case of multi-conformer per compound 3-D similarity value distributions. As such, this work is a critical step, covering a very wide corpus of chemical structures and biological assays, creating a statistical framework to build upon.
The second part of this study explored the question of whether it was possible to realize a statistically meaningful 3-D similarity value separation between reputed biological assay "inactives" and "actives". Using the terminology of noninactive-noninactive (NN) pairs and the noninactive-inactive (NI) pairs to represent comparison of the "active/active" and "active/inactive" spaces, respectively, each of the 1,389 biological assays was examined by their 3-D similarity score differences between the NN and NI pairs and analyzed across all assays and by assay category types. While a consistent trend of separation was observed, this result was not statistically unambiguous after considering the respective standard deviations. While not all "actives" in a biological assay are amenable to this type of analysis, e.g., due to different mechanisms of action or binding configurations, the ambiguous separation may also be due to employing a single conformer per compound in this study. With that said, there were a subset of biological assays where a clear separation between the NN and NI pairs found. In addition, use of combo Tanimoto (ComboT) alone, independent of superposition optimization type, appears to be the most efficient 3-D score type in identifying these cases.
This study provides a statistical guideline for analyzing biological assay data in terms of 3-D similarity and PubChem structure-activity analysis tools. When using a single conformer per compound, a relatively small number of assays appear to be able to separate "active/active" space from "active/inactive" space.
Motivation: Most of the previous data mining studies based on the NCI-60 dataset, due to its intrinsic cell-based nature, can hardly provide insights into the molecular targets for screened compounds. On the other hand, the abundant information of the compound–target associations in PubChem can offer extensive experimental evidence of molecular targets for tested compounds. Therefore, by taking advantages of the data from both public repositories, one may investigate the correlations between the bioactivity profiles of small molecules from the NCI-60 dataset (cellular level) and their patterns of interactions with relevant protein targets from PubChem (molecular level) simultaneously.
Results: We investigated a set of 37 small molecules by providing links among their bioactivity profiles, protein targets and chemical structures. Hierarchical clustering of compounds was carried out based on their bioactivity profiles. We found that compounds were clustered into groups with similar mode of actions, which strongly correlated with chemical structures. Furthermore, we observed that compounds similar in bioactivity profiles also shared similar patterns of interactions with relevant protein targets, especially when chemical structures were related. The current work presents a new strategy for combining and data mining the NCI-60 dataset and PubChem. This analysis shows that bioactivity profile comparison can provide insights into the mode of actions at the molecular level, thus will facilitate the knowledge-based discovery of novel compounds with desired pharmacological properties.
Availability: The bioactivity profiling data and the target annotation information are publicly available in the PubChem BioAssay database (ftp://ftp.ncbi.nlm.nih.gov/pubchem/Bioassay/).
Contact: firstname.lastname@example.org; email@example.com
Supplementary information: Supplementary data are available at Bioinformatics online.
PubChem is an open repository for small molecules and their experimental biological activity. PubChem integrates and provides search, retrieval, visualization, analysis, and programmatic access tools in an effort to maximize the utility of contributed information. There are many diverse chemical structures with similar biological efficacies against targets available in PubChem that are difficult to interrelate using traditional 2-D similarity methods. A new layer called PubChem3D is added to PubChem to assist in this analysis.
PubChem generates a 3-D conformer model description for 92.3% of all records in the PubChem Compound database (when considering the parent compound of salts). Each of these conformer models is sampled to remove redundancy, guaranteeing a minimum (non-hydrogen atom pair-wise) RMSD between conformers. A diverse conformer ordering gives a maximal description of the conformational diversity of a molecule when only a subset of available conformers is used. A pre-computed search per compound record gives immediate access to a set of 3-D similar compounds (called "Similar Conformers") in PubChem and their respective superpositions. Systematic augmentation of PubChem resources to include a 3-D layer provides users with new capabilities to search, subset, visualize, analyze, and download data.
A series of retrospective studies help to demonstrate important connections between chemical structures and their biological function that are not obvious using 2-D similarity but are readily apparent by 3-D similarity.
The addition of PubChem3D to the existing contents of PubChem is a considerable achievement, given the scope, scale, and the fact that the resource is publicly accessible and free. With the ability to uncover latent structure-activity relationships of chemical structures, while complementing 2-D similarity analysis approaches, PubChem3D represents a new resource for scientists to exploit when exploring the biological annotations in PubChem.
Natural products, as major resources for drug discovery historically, are gaining more attentions recently due to the advancement in genomic sequencing and other technologies, which makes them attractive and amenable to drug candidate screening. Collecting and mining the bioactivity information of natural products are extremely important for accelerating drug development process by reducing cost. Lately, a number of publicly accessible databases have been established to facilitate the access to the chemical biology data for small molecules including natural products. Thus, it is imperative for scientists in related fields to exploit these resources in order to expedite their researches on natural products as drug leads/candidates for disease treatment. PubChem, as a public database, contains large amounts of natural products associated with bioactivity data. In this review, we introduce the information system provided at PubChem, and systematically describe the applications for a set of PubChem web services for rapid data retrieval, analysis, and downloading of natural products. We hope this work can serve as a starting point for the researchers to perform data mining on natural products using PubChem.
natural products; drug discovery; PubChem; public database; data mining
Despite intense investment growth and technology development, there is an observed bottleneck in drug discovery and development over the past decade. NIH started the Molecular Libraries Initiative (MLI) in 2003 to enlarge the pool for potential drug targets, especially from the “undruggable” part of human genome, and potential drug candidates from much broader types of drug-like small molecules. All results are being made publicly available in a web portal called PubChem.
In this paper we construct a network from bioassay data in PubChem, apply network biology concepts to characterize this bioassay network, integrate information from multiple biological databases (e.g. DrugBank, OMIM, and UniHI), and systematically analyze the potential of bioassay targets being new drug targets in the context of complex biological networks. We propose a model to quantitatively prioritize this druggability of bioassay targets, and literature evidence was found to confirm our prioritization of bioassay targets at a roughly 70% accuracy.
Our analysis provide some measures of the value of the MLI data as a resource for both basic chemical biology research and future therapeutic discovery.
To develop efficient approaches for rapid evaluation of chemical toxicity and human health risk of environmental compounds, the National Toxicology Program (NTP) in collaboration with the National Center for Chemical Genomics has initiated a project on high-throughput screening (HTS) of environmental chemicals. The first HTS results for a set of 1,408 compounds tested for their effects on cell viability in six different cell lines have recently become available via PubChem.
We have explored these data in terms of their utility for predicting adverse health effects of the environmental agents.
Methods and results
Initially, the classification k nearest neighbor (kNN) quantitative structure–activity relationship (QSAR) modeling method was applied to the HTS data only, for a curated data set of 384 compounds. The resulting models had prediction accuracies for training, test (containing 275 compounds together), and external validation (109 compounds) sets as high as 89%, 71%, and 74%, respectively. We then asked if HTS results could be of value in predicting rodent carcinogenicity. We identified 383 compounds for which data were available from both the Berkeley Carcinogenic Potency Database and NTP–HTS studies. We found that compounds classified by HTS as “actives” in at least one cell line were likely to be rodent carcinogens (sensitivity 77%); however, HTS “inactives” were far less informative (specificity 46%). Using chemical descriptors only, kNN QSAR modeling resulted in 62.3% prediction accuracy for rodent carcinogenicity applied to this data set. Importantly, the prediction accuracy of the model was significantly improved (72.7%) when chemical descriptors were augmented by HTS data, which were regarded as biological descriptors.
Our studies suggest that combining NTP–HTS profiles with conventional chemical descriptors could considerably improve the predictive power of computational approaches in toxicology.
carcinogenesis; computational toxicology; high-throughput screening; QSAR