The PubChem BioAssay model is designed to allow unambiguous representation of data produced by various experimental procedures, to support the retrieval of individual information components and to track the biological target and the respective bioactivity outcome. An assay record, represented by a unique PubChem BioAssay accession AID, is organized in two parts, the assay description and the assay results. The assay description includes a name, data source, purpose, experimental protocol, tested reagent category (e.g. small molecule versus siRNA), comment and result/readout descriptions. The PubChem BioAssay archive format provides numerous ways for contributing organizations to annotate a given assay. These annotations include: textual descriptions; target information, including cross-references to GenBank (18) records, name, description, molecule type (e.g. protein versus nucleotide) and taxonomy; qualified cross-references to PubMed citations, three-dimensional protein structures, biosystems and diseases; and URLs back to the depositor’s website. Each BioAssay record can contain as many comments and as much descriptive text as needed to provide the overall background of the assay, such as the biological system tested in the assay or the relationship between a disease and the selected therapeutic target. The assay protocol, similar to the method section of a journal publication, helps to explain the actual methodology of the assay.
Multiple test result fields may be specified per assay, each with a unique test identifier (TID), name, description, data type, data unit and annotation for cross-references. The PubChem-assigned TID indicates a particular test result or readout when reporting results for a given substance. The number of test readouts is limited only by their potential usability.
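The relationship between an assay description and its test result fields can be sketched as a small data structure. This is only an illustration under stated assumptions: the class names, attribute names, data type labels and the sequential TID assignment are hypothetical and do not reproduce the PubChem archive format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class ResultField:
    """One test result (readout) definition; attribute names are illustrative."""
    tid: int                      # test identifier, unique within the assay
    name: str
    description: str = ""
    data_type: str = "float"      # hypothetical labels, e.g. "float", "string"
    unit: Optional[str] = None    # e.g. "uM" or "percent"


@dataclass
class AssayDescription:
    """A subset of the description part of an assay record."""
    aid: int
    name: str
    source: str
    protocol: str = ""
    fields: List[ResultField] = field(default_factory=list)

    def add_field(self, name: str, **kwargs) -> ResultField:
        # Assumption: TIDs are handed out sequentially so they stay unique.
        new = ResultField(tid=len(self.fields) + 1, name=name, **kwargs)
        self.fields.append(new)
        return new


# AID 0 and the depositor name are placeholders, not real records.
assay = AssayDescription(aid=0, name="Example inhibition assay",
                         source="Example depositor")
ic50 = assay.add_field("IC50", unit="uM")
pct = assay.add_field("Inhibition at 10 uM", unit="percent")
```

Keeping the field definitions separate from the per-substance values mirrors the two-part organization of an assay record into description and results.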
Many biological assays employ a dose–response scheme, with a primary endpoint [e.g. IC50 (http://en.wikipedia.org/wiki/IC50)]. PubChem requires this key readout, denoted as an ‘active concentration summary’, to have micromolar units and requires the experimental concentrations for the corresponding dose–response readouts (also in micromolar units and referred to as ‘tested concentrations’) to be designated as an attribute of the respective test result fields. These specialized readouts allow PubChem users to classify and rank hits of a screening test and search bioassay results with specific values or ranges of primary outcomes.
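Because the active concentration summary is always expressed in micromolar units, ranking and range searches reduce to simple numeric comparisons. The following minimal sketch assumes a hypothetical mapping from substance identifiers to active concentrations; the identifiers and values are invented for illustration and are not PubChem data.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical input: substance identifier -> active concentration in micromolar.
# None marks a substance for which no active concentration was reported.
active_conc_um: Dict[int, Optional[float]] = {
    101: 0.25,
    102: 12.0,
    103: None,
    104: 1.5,
}


def rank_hits(results: Dict[int, Optional[float]],
              max_um: float) -> List[Tuple[int, float]]:
    """Return (sid, value) pairs with value <= max_um, most potent first."""
    hits = [(sid, value) for sid, value in results.items()
            if value is not None and value <= max_um]
    return sorted(hits, key=lambda pair: pair[1])


ranked = rank_hits(active_conc_um, max_um=10.0)
# ranked == [(101, 0.25), (104, 1.5)]
```

A shared unit is what makes such a comparison meaningful across assays from different depositors; without it, each assay would need its own conversion step.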
Biological screening data submitted to PubChem are diverse and assay specific. As such, there are no specific requirements on the presence of particular test readouts; however, PubChem requires a summary result for each tested chemical sample. The summary result has two parts: a bioactivity outcome and a bioactivity score. The ‘bioactivity outcome’ partitions results into five categories: chemical probe, active, inactive, inconclusive and unspecified. The criteria and rationale used by the testing organization for summary results, as well as descriptions of possible sources of artifacts, are often provided in the assay comment section, aiding the user’s interpretation and use of the biological data.
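The two-part summary result can be sketched as follows. The five outcome categories come from the text; the class names, the integer type of the score and the example values are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable, List


class BioactivityOutcome(Enum):
    CHEMICAL_PROBE = "chemical probe"
    ACTIVE = "active"
    INACTIVE = "inactive"
    INCONCLUSIVE = "inconclusive"
    UNSPECIFIED = "unspecified"


@dataclass(frozen=True)
class SummaryResult:
    sid: int                      # tested substance (hypothetical identifier)
    outcome: BioactivityOutcome
    score: int                    # bioactivity score; integer type is assumed


def partition(results: Iterable[SummaryResult]) -> Dict[BioactivityOutcome, List[int]]:
    """Group substance identifiers by their bioactivity outcome."""
    groups: Dict[BioactivityOutcome, List[int]] = {o: [] for o in BioactivityOutcome}
    for result in results:
        groups[result.outcome].append(result.sid)
    return groups


summary = [
    SummaryResult(101, BioactivityOutcome.ACTIVE, 90),
    SummaryResult(102, BioactivityOutcome.INACTIVE, 5),
    SummaryResult(103, BioactivityOutcome.ACTIVE, 60),
]
groups = partition(summary)
```

Here `groups[BioactivityOutcome.ACTIVE]` holds the substances 101 and 103, while categories with no members map to empty lists.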
The assay result section includes the results for all tested substances. Results reported per substance can include both assay readouts and annotations, including target description, comment on the individual biological test result, cross-links to other NCBI resources and URLs to the depositor’s website. Assay data are provided in a tabular format, with one tested substance per row and one assay test readout or annotation per column. A substance need not have results reported for all defined test readouts. There is no limit on the number of substance test results in an assay record.
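The tabular layout, with one substance per row, one readout or annotation per column and empty cells where nothing was reported, can be illustrated with a short sketch. The column names, substance identifiers and values below are invented, and the CSV rendering is an assumption chosen for readability rather than the PubChem deposition format.

```python
import csv
import io
from typing import Dict

# Hypothetical column definitions: TID -> column name.
columns: Dict[int, str] = {1: "IC50 (uM)", 2: "Inhibition (%)", 3: "Comment"}

# Hypothetical sparse results: substance identifier -> {TID: value}.
rows: Dict[int, Dict[int, object]] = {
    101: {1: 0.25, 2: 98.0},
    102: {2: 12.5, 3: "precipitate observed"},
}


def to_csv(columns: Dict[int, str], rows: Dict[int, Dict[int, object]]) -> str:
    """Render one row per substance; unreported readouts become empty cells."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    tids = sorted(columns)
    writer.writerow(["SID"] + [columns[tid] for tid in tids])
    for sid in sorted(rows):
        writer.writerow([sid] + [rows[sid].get(tid, "") for tid in tids])
    return buffer.getvalue()


table = to_csv(columns, rows)
```

Storing each row as a mapping keyed by TID keeps the representation compact when many substances report only a few of the defined readouts.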
The stage of the biological experiments in PubChem varies. Each assay is classified by the contributing organization according to the stage of the assay project, which is described as the ‘activity outcome method’ in PubChem. These methods include: ‘screening’ assay, usually a primary high-throughput assay where the activity outcome is based on percentage inhibition from a single dose; ‘confirmatory’ assay, typically a low-throughput assay where the activity outcome is based on a dose–response relationship with multiple tested concentrations; ‘summary’ assay, for validated chemical probes or small molecule leads, summarizing information from multiple assays; and ‘other’, for assays that do not fit the previous categories. For MLPCN projects, a summary assay is required for each biological screening project to describe the identified chemical probes, report the screening steps that advanced the project and communicate the bottom line of the screening campaign to the scientific community. A summary assay, therefore, consists of a list of verified chemical probes (if any were identified), a comprehensive text description of the screening campaign and links from the summary assay AID to all associated screening assays deposited in PubChem, to the targeted genes and proteins in GenBank and to the scientific publications describing the screening experiments available in PubMed. To make it easier for the MLP screening centers to create and update summary assays, the PubChem deposition system allows one to create a simple summary template at an early stage and to update the summary assay with new information and experimental results as the screening project progresses.
It is essential to specify and track assay target information and to precisely group and annotate the biological tests based on the respective molecular target. PubChem BioAssay provides several models to do so. The traditional assay model allows for the specification of a single target for the entire assay record, along with associated annotations such as links to the respective gene, taxonomy and biological pathway information. In this model, the bioactivity outcomes provided in the entire assay dataset apply solely to the specified target, for example, to describe the biological effect of the small molecules on the functionality of one enzyme.
PubChem also supports the presentation and annotation of multiple highly related bioactivity outcomes, such as a profiling assay against a panel of molecular targets, in a single assay. Such a panel-type PubChem BioAssay record can contain multiple test readouts and respective bioactivity outcome annotations for each individual target, as well as for an individual cell line or species defined within the ‘panel’. Each such target, cell line or species is regarded as a ‘panel component’ in the data model, and a unique panel component identifier (PID) is assigned to each. A description of the experiments, including a component name, general goal, specific experimental protocol and information about the assay target, can be provided for each individual panel component. Each panel component may have multiple test results, which can be designated as a ‘bioactivity outcome’ or ‘active concentration’ if necessary, or otherwise treated as regular readouts.
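The panel model can be sketched by tagging each test result field with the PID of the component it belongs to and with its designated role. The class names, role labels and component names below are hypothetical placeholders, not PubChem identifiers.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class PanelComponent:
    pid: int                       # panel component identifier
    name: str
    target: Optional[str] = None   # free-text target label (placeholder)


@dataclass(frozen=True)
class PanelField:
    tid: int
    pid: int                       # component this readout belongs to
    name: str
    role: str = "readout"          # assumed labels: "outcome",
                                   # "active_concentration" or "readout"


def fields_by_component(fields: Iterable[PanelField]) -> Dict[int, List[PanelField]]:
    """Group test result fields by the panel component they describe."""
    grouped: Dict[int, List[PanelField]] = {}
    for f in fields:
        grouped.setdefault(f.pid, []).append(f)
    return grouped


components = [PanelComponent(1, "Kinase A"), PanelComponent(2, "Kinase B")]
fields = [
    PanelField(tid=1, pid=1, name="Kinase A outcome", role="outcome"),
    PanelField(tid=2, pid=1, name="Kinase A IC50", role="active_concentration"),
    PanelField(tid=3, pid=2, name="Kinase B outcome", role="outcome"),
    PanelField(tid=4, pid=2, name="Kinase B % inhibition"),
]
grouped = fields_by_component(fields)
outcome_tids = {pid: [f.tid for f in fs if f.role == "outcome"]
                for pid, fs in grouped.items()}
```

With this layout a single record carries one bioactivity outcome per component, which is what allows profiling results for several targets to be compared under one AID.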
Panel assay results are complex. This expansion of the PubChem BioAssay data model allows for the description of a compound profiling screening test and enables PubChem to record and annotate multiple related bioactivity outcomes under a single AID, which facilitates comparison and evaluation of compound bioactivities using the PubChem data analysis profiling tools. To see a panel assay example, one may examine the kinase-profiling assay for AID 1433 (http://pubchem.ncbi.nlm.nih.gov/assay/assay.cgi?aid=1433).
A third bioassay data model allows one to specify a target for each individual tested sample. In this case, a particular test result is defined as containing the assay target. For example, one test result definition may be defined as containing the target identifier, a GenBank Protein GI number, while another test result definition may be defined as containing a short name of the target. This model was originally introduced to accommodate siRNA screening results, where an entire genome may be screened with tens of thousands of siRNA reagents designed for thousands of gene targets of the genome, with one or several siRNA reagents corresponding to each of the targeted genes. Thus, in this situation, the nucleotide or gene target annotation needs to be siRNA specific and associated with each tested sample. To see an example, one may look at AID 1622, a viability screen of human kinase and cell cycle genes (http://pubchem.ncbi.nlm.nih.gov/assay/assay.cgi?aid=1622). This data model can also be employed to encode bioactivity information for multiple targets where substance data points across targets are sparsely populated, such as the data contributed by PDBbind in AID 1811 (http://pubchem.ncbi.nlm.nih.gov
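The per-sample target model can be illustrated by treating one test result field as the carrier of the target annotation and grouping samples on it. In this sketch the TID constants, sample identifiers, gene names and viability values are all hypothetical; a short target name is used in place of a GI number to avoid inventing real identifiers.

```python
from collections import defaultdict
from typing import Dict, List

TARGET_NAME_TID = 1   # hypothetical field flagged as holding the target name
VIABILITY_TID = 2     # hypothetical readout field

# One row per tested siRNA sample; several reagents may share a gene target.
rows = [
    {"sid": 201, TARGET_NAME_TID: "GENE_A", VIABILITY_TID: 35.0},
    {"sid": 202, TARGET_NAME_TID: "GENE_A", VIABILITY_TID: 45.0},
    {"sid": 203, TARGET_NAME_TID: "GENE_B", VIABILITY_TID: 90.0},
]


def readouts_by_target(rows: List[dict], target_tid: int,
                       value_tid: int) -> Dict[str, List[float]]:
    """Collect readout values per target, using the target stored in each row."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for row in rows:
        grouped[row[target_tid]].append(row[value_tid])
    return dict(grouped)


by_target = readouts_by_target(rows, TARGET_NAME_TID, VIABILITY_TID)
mean_viability = {gene: sum(v) / len(v) for gene, v in by_target.items()}
```

Because the target travels with each row rather than with the assay as a whole, the same structure also suits sparse multi-target data, where each substance has values for only a few targets.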
In all, the PubChem BioAssay data model supports a comprehensive description of screening experiments and test results by providing flexible schemes to encode bioassay information. Such schemes facilitate discovery by public users; for example, the ‘bioactivity score’ provides a relative activity rank so that more interesting results can be shown first, while the bioactivity outcome summary provides a shortcut from each assay to the list of ‘hits’ discovered in each screen, enabling the PubChem user to rapidly identify and partition biological assay results of interest. More importantly, the required standard bioassay metrics and annotation of assay targets allow PubChem to provide powerful tools for users to promptly classify and compare results across disparate assays and targets for a given set of chemical samples.