Many bioactive compounds of current interest were discovered by phenotypic screening,(1,2) in which most assays are functional in nature and analyze compound-induced effects in cells, tissues, and model organisms. These assays, however, rarely provide immediate target information for the tested compounds, which poses a major challenge for follow-up target identification in drug discovery.(3−5) Recent findings that many drugs act on multiple physiological targets to exert therapeutic effects and/or side effects have attracted intense interest in the promiscuity and polypharmacology of drugs,(6,7) for which identifying compound-target associations is a prerequisite.
Experimentally, two major classes of techniques are used for target identification.(3) Direct techniques, such as affinity chromatography(8,9) and protein microarrays,(10) detect the binding of a compound to its target. Their application is often hampered by the need to label a compound without affecting its functionality. Indirect techniques infer targets from compound-induced cellular or physiological patterns through genomics,(11,12) proteomics,(13) metabolite profiling,(14) and other technologies. However, genome-wide or proteome-wide data can be very difficult and expensive to obtain.
Moreover, wet-lab experiments for target identification are often slow, and computational approaches can complement them efficiently.(15) For example, molecular modeling studies have been reported that predict targets by virtually docking a compound of interest to a list of potential targets with known three-dimensional (3D) structures.(16,17) The primary limitation of this method is its need for high-resolution 3D structures of the targets as well as accurate docking/scoring algorithms.(18,19) Statistical models for target prediction have also been built with various machine learning methods, including Bayesian analysis(20,21) and Support Vector Machines.(22) A common drawback of these models is that their predictive power outside the training space cannot always be guaranteed. In addition, the similarity principle,(23,24) despite its exceptions,(25) has been the basis for target identification using similarity metrics such as ligand chemical similarity(5,7,26) and drug side-effect similarity.(4) On the other hand, with the rapid growth of public biological databases, such as the Protein Data Bank(27) (PDB), PubChem,(28) ChEMBL (http://www.ebi.ac.uk/chembl), DrugBank,(29,30) and the Therapeutic Targets Database(31,32) (TTD), abundant bioactivity data on small molecules and their targets are now available to the entire research community. It is thus becoming critical to develop in silico methods that identify compound-target associations and infer targets for drugs and bioactive compounds by aggregating and integrating target information from multiple resources.
End points of bioactivity data obtained from a panel of assays (i.e., a bioactivity profile) may provide distinct insight into the biological function of compounds and their targets. For example, the COMPARE algorithm,(33) developed at the Developmental Therapeutics Program (DTP) of the US National Cancer Institute (NCI), can be used to suggest a possible mechanism of action for a compound from related compounds, or to identify novel compounds that act by a similar mechanism of interest.(34−36) This tool compares bioactivity patterns derived from anticancer drug screening data across 60 human tumor cell lines (commonly known as the NCI-60 data set). By incorporating additional gene expression data, target information may be inferred.(34)
The NCI-60 data set was also used in our previous work,(37) where we observed in a few model systems that the target networks of small molecules were well correlated with their bioactivity profiles. Here, given the rapid growth of compound-target annotations in several public databases, we further investigated whether such correlations could be used on a larger scale to identify new targets for drugs and bioactive compounds. To this end, we first constructed a database of bioactivity profiles for 4296 compounds tested in the NCI-60 data set. Second, we used each compound as a query to search the entire bioactivity profile database for neighbor compounds with similar bioactivity profiles. Third, we collected target information from four public databases (DrugBank, TTD, ChEMBL, and PubChem) for both the query compounds and their neighbor compounds to evaluate our approach for predicting compound-target associations. The underlying assumption is that compounds with similar bioactivity profiles may share common targets. We were able to verify a substantial portion of our predictions retrospectively.
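
The three steps above can be illustrated with a minimal sketch. It is not the implementation used in this work: the similarity measure (Pearson correlation over the cell lines shared by two profiles), the 0.8 cutoff, the function names, and the toy profiles and target annotations are all assumptions chosen for illustration.

```python
from collections import Counter
from math import sqrt


def pearson(x, y, min_overlap=3):
    """Pearson correlation over positions where both profiles have a value.

    Missing measurements are given as None. Returns None when fewer than
    min_overlap shared positions exist or when either profile is constant.
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < min_overlap:
        return None
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    if var_a == 0 or var_b == 0:
        return None
    return cov / sqrt(var_a * var_b)


def neighbors(query_id, profiles, threshold=0.8):
    """Step 2: compounds whose profile correlates with the query's profile."""
    hits = []
    for cid, profile in profiles.items():
        if cid == query_id:
            continue
        r = pearson(profiles[query_id], profile)
        if r is not None and r >= threshold:
            hits.append((cid, r))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


def predict_targets(query_id, profiles, targets, threshold=0.8):
    """Step 3: count how many neighbors are annotated with each target."""
    votes = Counter()
    for cid, _ in neighbors(query_id, profiles, threshold):
        votes.update(targets.get(cid, set()))
    return votes.most_common()


# Step 1: a toy profile database (hypothetical values, five cell lines).
profiles = {
    "A": [5.0, 6.0, 7.0, 8.0, None],
    "B": [5.1, 6.2, 6.9, 8.1, 4.0],
    "C": [8.0, 7.0, 6.0, 5.0, 4.0],
    "D": [5.0, 6.1, 7.2, 7.9, 6.0],
}
# Hypothetical target annotations, standing in for database records.
targets = {"B": {"TOP1"}, "C": {"TUBB"}, "D": {"TOP1", "TOP2A"}}

print(predict_targets("A", profiles, targets))
```

In this toy example, compounds B and D have profiles similar to that of query A, so their annotated targets become candidate targets for A, ranked by the number of supporting neighbors. Compound C is anticorrelated with A and contributes nothing.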