Progress in high-throughput screening (HTS) techniques is changing the chemical data landscape by producing massive amounts of biological data for tested compounds. Public data repositories (e.g., PubChem) receive HTS data from various institutes, and this data pool is updated daily. The goal of these data-sharing efforts is to let users quickly obtain the biological data of target compounds. In the absence of a universal chemical identifier, the repositories provide users with various methods to query and retrieve chemical properties and biological data using several different chemical identifiers (e.g., SMILES, InChIKey, IUPAC name). The major challenge for most users, especially computational modelers, is obtaining the biological data for a large dataset of compounds (e.g., thousands of drug molecules) rather than a single compound. This chapter introduces the steps to access public data repositories for target compounds, with specific emphasis on automatic data downloading for large datasets.
In the past decade, the use of high-throughput screening (HTS) in drug discovery and chemical toxicity evaluation has greatly advanced cheminformatics studies. The massive data generated from HTS studies give scientists a new view of the biological effects induced by the tested compounds. Consequently, data-sharing efforts to make these data readily available to the research community have been undertaken over the same period. Data repositories such as PubChem, ChEMBL, BindingDB, and the Comparative Toxicogenomics Database (CTD) have become everyday tools for many scientists. With a chemical identifier (e.g., InChIKey), the biological data for a target compound can be obtained with a single click in the web-based search portals of these repositories.
The PubChem project (https://pubchem.ncbi.nlm.nih.gov), initiated and hosted by the National Center for Biotechnology Information (NCBI) of the National Institutes of Health (NIH), is the largest public chemical data source. The goal of PubChem is to provide information on the biological activities of small molecules, and as of September 2015, it had received over 60 million unique chemical structures and 1 million biological assays from over 350 contributors. The PubChem data repository consists of three primary databases: Substance, Compound, and BioAssay. The PubChem Substance database, indexed by the PubChem substance identifier (SID), contains chemical structures, synonyms, registration IDs, descriptions, related links, database cross-reference links to PubMed, protein 3D structures, and biological screening results. The PubChem BioAssay database contains experimental testing results for the chemical substances described in the PubChem Substance database. Each HTS assay has a unique assay identifier (AID) in this database. The PubChem Compound database contains validated chemical depiction information that describes the substances in the PubChem Substance database. Its chemical identifier, the PubChem compound ID (CID), records unique structural information and, in turn, allows target compounds to be queried within the PubChem Compound database. Compound records are supplemented with textual structural identifiers such as the Simplified Molecular-Input Line-Entry System (SMILES), the IUPAC International Chemical Identifier (InChI), and a more compact version of the latter, the InChIKey [2–5].
PubChem HTS data were obtained from various sources, including university, industry, and government laboratories. One of the initial missions of PubChem was to serve as the repository for data generated by the HTS projects supported by the NIH’s Molecular Libraries Program. A full listing of data sources can be found at http://pubchem.ncbi.nlm.nih.gov/sources#assay. The types of HTS data from these sources include results obtained from binding assays, functional cell assays, and Absorption, Distribution, Metabolism, Excretion, and Toxicity (ADMET) assays. The HTS data points can be qualitative (e.g., active or inactive), quantitative (e.g., the half-maximal effective concentration of a drug), or both. PubChem stores the HTS data of a compound in two fields: the activity outcome and the active concentration. The activity outcome either identifies the relevant compound as a chemical probe (i.e., a positive control of an HTS assay) or qualitatively transforms the experimental data into one of the following categories: active, inactive, unspecified/inconclusive, or untested. The active concentration, on the other hand, stores the HTS data quantitatively as a concentration value in µM for a well-defined biological endpoint, such as a half-maximal activity response (e.g., IC50 or EC50).
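The two fields described above can be illustrated with a small sketch that tallies activity outcomes and collects active concentrations from a CSV export. The sample rows are invented for illustration only, and the column names ("Activity Outcome", "Activity Value [uM]") are assumptions; they should be checked against the header of an actual PubChem download before use.

```python
import csv
import io
from collections import Counter

# Made-up rows for illustration; NOT real PubChem results.
# The column names are assumptions and may differ in a real export.
SAMPLE_CSV = """AID,CID,Activity Outcome,Activity Value [uM]
1,2244,inactive,
2,2244,active,12.5
3,2244,inconclusive,
4,2244,active,3.2
"""


def summarize_outcomes(csv_text,
                       outcome_col="Activity Outcome",
                       value_col="Activity Value [uM]"):
    """Count qualitative outcomes and collect quantitative concentrations (in uM)."""
    outcomes = Counter()
    concentrations = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Qualitative field: normalize the label, treat a blank as unspecified.
        outcome = (row.get(outcome_col) or "").strip().lower() or "unspecified"
        outcomes[outcome] += 1
        # Quantitative field: often empty, so parse it only when present.
        value = (row.get(value_col) or "").strip()
        if value:
            try:
                concentrations.append(float(value))
            except ValueError:
                pass  # skip non-numeric entries rather than guess
    return outcomes, concentrations


outcomes, concentrations = summarize_outcomes(SAMPLE_CSV)
print(dict(outcomes))
print(concentrations)
```

Keeping the qualitative and quantitative fields separate mirrors the way the data are stored: many records carry an outcome without any concentration value.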
There are two methods to obtain HTS data by accessing the data sharing repositories, such as PubChem. The data can be obtained by querying manually with individual compounds’ textual chemical identifiers. However, if the goal is to download all relevant HTS data for a large set of compounds, automatic data extraction is needed. This chapter will use PubChem as an example to show how to obtain HTS data for target compounds, especially for a large set of compounds.
To extract the HTS data for target compounds via public data repositories (e.g., PubChem), the following software needs to be downloaded/installed on the computer:
Like popular internet search portals (e.g., Google®), PubChem provides a manual search function in which queries can be made using various chemical identifiers. Each unique compound in the PubChem Compound database has an individual page listing standardized chemical information and properties, including a list of all submitted biological testing results. For a target compound that exists in the PubChem database, its biological testing data can be exported and downloaded as a comma-separated values (CSV) file and managed using Microsoft Excel®. Figure 1 shows a screenshot of the plain text file containing the biological data for aspirin (PubChem CID 2244; downloaded from PubChem on February 15, 2016). The biological data of a compound are summarized with not only the bioassay identifier (AID) and the associated testing results, but also detailed information about the bioassays and the definitions of the activities. This file can be obtained by entering any of aspirin’s identifiers under the appropriate search option. Figure 2 shows a screenshot of the homepage of the PubChem search function. More information on the appropriate search option for a given identifier can be found in the Notes section. The biological data of a single target compound can be accessed by the following steps:
The resulting bioassay information for that compound will be automatically retrieved as a plain text file.
If the goal is to download the HTS data for a large dataset (e.g., one consisting of more than 1,000 compounds), automatic querying by means of a script is needed. To this end, PubChem offers specialized data retrieval services through a programmatic interface: the PubChem Power User Gateway (PUG). PUG provides quick access to PubChem data retrieval functions. Information on all the available PUG services can be found in the reference as well as within the PubChem portal (https://pubchem.ncbi.nlm.nih.gov/pug/pughelp.html). The most broadly applicable service for retrieving HTS data for large chemical datasets is PUG-REST. PUG-REST, which uses a Representational State Transfer (REST)-style interface, allows users to construct Uniform Resource Locators (URLs) to retrieve data from PubChem. In this way, PUG-REST is easily integrated with any programming or scripting language that can send URL requests (e.g., Java, Python, Perl, C#). Using PUG-REST, multiple records can be accessed automatically to retrieve HTS data for large datasets. More information on all the features available through PUG-REST can also be found on the PubChem portal (https://pubchem.ncbi.nlm.nih.gov/pug_rest/PUG_REST.html).
To construct a URL for PUG-REST data retrieval, a text URL string needs to be created and it contains four parts: base, input, operation, and output. The construction of this kind of URL is shown in Figure 3. The input section of the URL describes the target database (BioAssay, Compound, or Substance) to be queried, the category of the identifier, along with the identifier information. The operation section designates the information to be retrieved (“assaysummary” in this case to retrieve HTS data). The output section specifies the format of the output file. Figure 3 also shows several examples of URLs used to retrieve HTS data through PUG-REST.
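The four-part structure can be sketched in Python as follows. The base address and the path layout shown here follow PubChem's public PUG-REST documentation as best understood and should be verified there; the function name and its parameters are hypothetical and introduced only for illustration.

```python
from urllib.parse import quote

# Base part of the URL (assumed from the public PUG-REST documentation).
PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def build_pug_rest_url(identifier, namespace="cid", domain="compound",
                       operation="assaysummary", output="CSV"):
    """Assemble base + input (domain/namespace/identifier) + operation + output."""
    # Percent-encode the identifier so that spaces or symbols in names stay valid.
    encoded = quote(str(identifier), safe="")
    return "/".join([PUG_REST_BASE, domain, namespace, encoded, operation, output])


# Aspirin by PubChem CID, HTS summary as CSV.
print(build_pug_rest_url(2244))
# Aspirin by name, HTS summary as JSON.
print(build_pug_rest_url("aspirin", namespace="name", output="JSON"))
```

Structural identifiers that contain slashes or other special characters (some SMILES strings, for example) may not be safe to embed directly in the URL path, so this sketch is limited to CID and name lookups.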
Inputting a constructed URL into a web browser will result in the display of bioassay information for this target compound in the desired format. Within a programming script, individual URLs can be constructed for each compound in a large dataset in an automated fashion. In this case, the HTS data for all compounds in this dataset will be retrieved in the preferred output format.
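A minimal sketch of such a script is given below, assuming the compounds are identified by CID and the CSV output format is wanted. The function names, the pause between requests, and the error handling are illustrative choices, not a prescribed procedure; PubChem's current usage policy should be consulted for the permitted request rate.

```python
import time
import urllib.error
import urllib.request

# Assumed PUG-REST base address; verify against the PubChem documentation.
BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def fetch_text(url, timeout=30):
    """Download one URL and return its body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def download_assay_summaries(cids, fetch=fetch_text, delay=0.25):
    """Retrieve the assay summary (CSV text) for each CID in turn.

    Returns two dicts: CID -> CSV text, and CID -> error message.
    """
    results, failures = {}, {}
    for cid in cids:
        url = f"{BASE}/compound/cid/{int(cid)}/assaysummary/CSV"
        try:
            results[cid] = fetch(url)
        except (urllib.error.URLError, OSError) as exc:
            # Record the failure and continue with the remaining compounds.
            failures[cid] = str(exc)
        time.sleep(delay)  # pause between requests to avoid overloading the server
    return results, failures


# Example call (performs a network request, so it is left commented out):
# data, errors = download_assay_summaries([2244])
```

Passing the download function as a parameter keeps the loop easy to check offline with a stand-in function, and recording failures instead of stopping lets a long run over thousands of compounds finish and be retried selectively.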
For Windows users: open a Windows Explorer window by clicking the “Start” button and then clicking “Computer”. In the address bar, type ftp://ftp.ncbi.nlm.nih.gov/pubchem/Bioassay/.
For Mac users: Click “Go” then click “Connect to Server…”. In the “Server Address” field type: ftp://ftp.ncbi.nlm.nih.gov/pubchem/Bioassay/. When asked to enter name and password, choose to connect as “Guest”.
For Linux users: Open any file manager window. Click “File” and then click “Connect to Server…”. In the “Server” field type ftp://ftp.ncbi.nlm.nih.gov/pubchem/Bioassay/.
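The same FTP location can also be reached from a script. The sketch below uses Python's standard ftplib module with an anonymous login to list the contents of the BioAssay directory; the helper names are hypothetical, and whether the server accepts plain FTP connections from a given network is an assumption to check.

```python
from ftplib import FTP

HOST = "ftp.ncbi.nlm.nih.gov"
DIRECTORY = "/pubchem/Bioassay"


def list_directory(ftp, directory=DIRECTORY):
    """Change to the directory on an open FTP connection and return sorted entry names."""
    ftp.cwd(directory)
    return sorted(ftp.nlst())


def list_bioassay_directory(host=HOST):
    """Connect anonymously (the scripted equivalent of 'Guest') and list the directory."""
    with FTP(host, timeout=60) as ftp:
        ftp.login()  # no arguments means anonymous login
        return list_directory(ftp)


# Example call (requires network access, so it is left commented out):
# for name in list_bioassay_directory():
#     print(name)
```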
It is important to realize that PubChem updates its databases frequently.
The PubChem search portal accepts a variety of search options. For example, exact queries (under the “Name/Text” tab) can be performed using general identifiers, such as CID or IUPAC name. Alternatively, queries can be performed based on chemical structure information. Under the “Identity/Similarity” search option, structural identifiers (e.g., SMILES or InChI) can be used. Commonly used chemical structure file formats, such as the Structure Data File (SDF) format, are also accepted under this search tab. If no identifier is available for a compound, there is also an option to draw the chemical structure using the drawing tool.
HTS data for target compounds can be accessed through public repositories. Since several methods exist to obtain HTS data, individual researchers can choose the one that suits their actual needs (e.g., the size of the target dataset). The HTS data retrieval procedure also requires some computer skills and background knowledge of the data sources, such as those described for PubChem data acquisition in this chapter.