The field of high throughput screening (HTS) is rapidly advancing through the development of sophisticated robotics and liquid handling systems, sensitive and versatile detection technologies, and powerful informatics systems that enable miniaturization and increased throughput.1
Furthermore, HTS is being used to interrogate increasingly complex biological systems and processes, driven by advancements in molecular and cellular biology in combination with innovative assay designs.
In an effort to find novel entry points for drug discovery programs, countless HTS campaigns comprising large commercial and proprietary compound libraries have produced massive data sets, primarily in pharmaceutical companies. The NIH Molecular Libraries Roadmap Initiative2
and the availability of more affordable “out of the box” screening systems and reagents have facilitated a dissemination of HTS capabilities into academic institutes and universities, where they are now relatively common and available to researchers.
HTS datasets, which consist of experimental results and assay metadata, are typically stored in data warehouses using relational database schemas.3;4
The fast pace of innovation in assay designs and detection technologies, together with the increasing complexity of the biological targets under investigation, makes it difficult for “static” database schemas to capture and manage the diversity of screening experiments and their outcomes. To optimize the value of HTS efforts beyond any individual HTS campaign and to facilitate more informed decision-making as compounds progress in the value chain, systematic knowledge management is receiving increased attention from informatics organizations.5
In this context, a formal, well-structured, knowledge-based, and extensible description of biological assays is required. Expert biocuration to organize and annotate existing data is also a critical component of any HTS knowledge management solution.
PubChem is a public repository of HTS assay descriptions, small molecule compounds, and HTS results (which we refer to as endpoints).6;7
Originally put in place as part of the Molecular Libraries Program (MLP), it serves to host data generated at the MLP centers as well as that from other NIH-funded projects. As of September 2010, there were over 2,100 bioassays from the MLP deposited in PubChem. In addition to PubChem, there are several other publicly available sources of screening data, including ChEMBL,8
which contains structure-activity relationship (SAR) data curated from the medicinal chemistry literature, and the Psychoactive Drug Screening Program (PDSP).9;10
In addition, private resources, such as Collaborative Drug Discovery (CDD),13;14
also make large screening data sets publicly accessible.
Despite recommendations from industry and government work groups, there is currently no agreed-upon standard for the representation of HTS assay data. Such a representation is vital for researchers to meaningfully interpret and compare diverse assay results.15
Because HTS data repositories lack detailed annotations using standardized terms, seemingly trivial queries such as “list the biochemical vs. cell-based assays”, or “list assays that use a luciferase reporter construct” are not possible. In addition, the lack of a formal description of biological assays hinders the integration of HTS data from different sources as well as with other life science databases (e.g. biological pathways).
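To illustrate why standardized annotations matter for such queries, the following minimal Python sketch shows how both example queries reduce to simple filters once every assay carries terms from a controlled vocabulary. The record layout, field names, term values, and assay identifiers are hypothetical and chosen only for illustration; they do not reflect the schema of PubChem or of any existing repository.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class AssayAnnotation:
    """Hypothetical annotation record; all field names are illustrative assumptions."""
    assay_id: str                   # made-up identifier, not a real PubChem AID
    assay_format: str               # controlled term, e.g. "biochemical" or "cell-based"
    detection_method: str           # controlled term, e.g. "fluorescence"
    reporter: Optional[str] = None  # controlled term, e.g. "luciferase"


# Toy data set standing in for an annotated repository.
ASSAYS = [
    AssayAnnotation("assay-001", "biochemical", "fluorescence"),
    AssayAnnotation("assay-002", "cell-based", "luminescence", "luciferase"),
    AssayAnnotation("assay-003", "cell-based", "fluorescence", "GFP"),
    AssayAnnotation("assay-004", "cell-based", "luminescence", "luciferase"),
]


def assays_where(annotations: List[AssayAnnotation], **criteria: str) -> List[str]:
    """Return the IDs of assays whose annotations match every given term exactly."""
    return [
        a.assay_id
        for a in annotations
        if all(getattr(a, field) == term for field, term in criteria.items())
    ]


biochemical = assays_where(ASSAYS, assay_format="biochemical")
cell_based = assays_where(ASSAYS, assay_format="cell-based")
luciferase = assays_where(ASSAYS, reporter="luciferase")
```

Exact matching works here only because every record uses the same term for the same concept; with free-text assay descriptions, the same filters would require error-prone text mining.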
PubChem’s already large and diverse set of deposited assay results along with several other accessible screening data repositories form a large corpus of data that can serve as a starting point to develop a systematic categorization of HTS assays. The exponential growth of public data repositories indicates that we are only beginning to explore the space of possible assay designs. The development of a clearly structured and standardized formal description of concepts that are relevant to interpreting HTS results is therefore very timely.
In this report we demonstrate how such a formalized terminology can facilitate analyses across multiple diverse assays to identify promiscuous compounds. These compounds are traditionally problematic for HTS, and it is desirable to identify them as early as possible in a campaign. Compound promiscuity can be related to an assay technology, a detection method, or interaction with biological targets, and often the specific mechanisms of action are not fully understood. There have been attempts to identify compound classes that can interfere with specific assay technologies, but these studies have usually focused on a small number of biological assays and have not made use of the large number of data sets currently available.16; 17
Here we attempt for the first time to identify promiscuous behavior on a large scale, using a curated data set that allowed us to interrogate compound behavior across certain assay categories and sub-categories.