Molecular target identification is of central importance to drug discovery. Here, we developed a computational approach, named bioactivity profile similarity search (BASS), for associating targets to small molecules by using the known target annotations of related compounds from public databases. To evaluate BASS, a bioactivity profile database was constructed using 4296 compounds that were commonly tested in the US National Cancer Institute 60 human tumor cell line anticancer drug screen (NCI-60). Each compound was used as a query to search against the entire bioactivity profile database, and reference compounds with similar bioactivity profiles above a threshold of 0.75 were considered as neighbor compounds of the query. Potential targets were subsequently linked to the identified neighbor compounds by using the known targets of the query compound. About 45% of the predicted compound-target associations were successfully verified retrospectively, suggesting the possible application of BASS in identifying the targets of uncharacterized compounds and thus providing insight into the study of promiscuity and polypharmacology. Furthermore, BASS identified a significant fraction of structurally diverse compounds with similar bioactivities, indicating its feasibility of “scaffold hopping” in searching novel molecules against the target of interest.
A number of bioactive compounds of current interest are discovered by phenotypic screening,1,2 most of which are functional in nature through analyzing the compound-induced effects in cells, tissues, and model organisms. These assays, however, can hardly provide immediate target information for tested compounds, imposing grand challenges on follow-up target identification for drug discovery.3−5 The recent findings that many drugs act on multiple physiological targets to exert therapeutic effects and/or side effects have attracted intensive interest in exploring the promiscuity and polypharmacology of drugs,6,7 in which identifying compound-target associations is a premise.
Experimentally, two major techniques are used for target identification.(3) Direct techniques, such as affinity chromatography8,9 and protein microarray,(10) detect the binding of a compound to its target. Their applications are often hampered by the need to label a compound without affecting its functionality. Indirect techniques infer targets from the compound-induced cellular or physiological patterns through genomics,11,12 proteomics,(13) metabolite profiling,(14) and other technologies. However, genome-wide or proteome-wide data could be very difficult and expensive to obtain.
Moreover, wet-lab experiments for target identification are often slow, whereas computational approaches can be efficient complements.(15) For example, molecular modeling studies have been reported for target prediction by virtually docking a compound of interest to a list of potential targets with known three-dimensional (3D) structures.16,17 The primary limitation of this method is the need for high-resolution 3D structures of targets as well as accurate docking/scoring algorithms.18,19 Statistical models have also been built for target prediction using various machine learning methods, including Bayesian analysis20,21 and Support Vector Machines.(22) The common drawback of these models is that their predictive power outside the training space cannot always be guaranteed. In addition, the similarity principle,23,24 despite its exceptions,(25) has been the basis for target identification using similarity metrics such as ligand chemical similarity5,7,26 and drug side effect similarity.(4) On the other hand, with the rapid growth of public biological databases, such as the Protein Data Bank(27) (PDB), PubChem,(28) ChEMBL (http://www.ebi.ac.uk/chembl), DrugBank,29,30 and Therapeutic Targets Database31,32 (TTD), abundant bioactivity data of small molecules and their targets are now available to the entire research community. It is thus becoming critical to develop in silico methods that identify compound-target associations and infer targets for drugs and bioactive compounds by aggregating and integrating target information from multiple resources.
End points of bioactivity data obtained from a panel of assays (i.e., bioactivity profile) may provide distinct insight to the biological function of compounds and their targets. For example, the COMPARE algorithm,(33) by the Developmental Therapeutics Program (DTP) of the US National Cancer Institute (NCI), could be used to suggest possible mechanism of action for a respective compound from related compounds or identify novel compounds that act by a similar mechanism of interest.34−36 This tool compares the bioactivity patterns derived from the anticancer drug screening data across 60 human tumor cell lines (commonly known as the NCI-60 data set). By incorporating additional gene expression data, target information may be inferred.(34)
The NCI-60 data set was also used in our previous work,(37) where we observed in a few model systems that the target networks of small molecules were well-correlated with their bioactivity profiles. Here, given the rapid growth in available compound-target annotations in several public databases, we further investigated whether such correlations could be utilized to benefit the identification of new targets for drugs and bioactive compounds on a larger scale. To this end, we first constructed a database of bioactivity profiles for 4296 compounds tested in the NCI-60 data set. Second, we used each compound as a query to search against the entire bioactivity profile database to identify neighbor compounds with similar bioactivity profiles. Third, we collected target information from four public databases (DrugBank, TTD, ChEMBL and PubChem) for both query compounds and their neighbor compounds to evaluate our approach for predicting compound-target associations. The underlying assumption is that compounds with similar bioactivity profiles may share common targets. About 45% of the predictions that could be checked against existing annotations were verified retrospectively.
The NCI-60 data set contains anticancer screening results for more than 40,000 compounds. It is publicly available in the PubChem BioAssay database(38) as 73 bioassays with the name of “NCI human tumor cell line growth inhibition assay” under the “DTP/NCI” data source. In this work, only the top 60 bioassays (hereafter referred to as NCI-60) with the largest number of tested compounds were selected (Supporting Information, Table S1). Relevant bioactivity data were downloaded from the PubChem FTP site (ftp://ftp.ncbi.nlm.nih.gov/pubchem/Bioassay, accessed on December 9, 2010). A total of 5083 compounds were found to be commonly tested in all of the 60 bioassays. The bioactivity profile of each compound was derived by extracting the log(GI50) values obtained from the NCI-60 cell lines, where GI50 is the concentration required for 50% growth inhibition of tumor cells. A total of 631 compounds with a missing log(GI50) value in one or more of the NCI-60 cell lines were discarded. A further 156 compounds were discarded because they exhibited identical bioactivity in all NCI-60 cell lines, which made them less informative and unsuitable for bioactivity profile similarity calculation (see below). As a result, 4296 compounds were collected and used for constructing the bioactivity profile database. The original bioactivity profile data for these compounds are available in Supporting Information, Table S2. Additional data set characteristics are summarized in Supporting Information, Figure S1 with respect to six physicochemical properties: molecular weight, octanol–water partition coefficient,(23) number of hydrogen bond donors, number of hydrogen bond acceptors, number of rotatable bonds, and topological polar surface area.
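The two filtering steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_profile_db`, the dictionary layout, and the use of `None` for a missing log(GI50) value are hypothetical and not taken from the original work.

```python
def build_profile_db(raw_profiles, n_cell_lines=60):
    """Keep only compounds with a complete, non-constant log(GI50) profile.

    raw_profiles: dict mapping a compound ID to a list of log(GI50) values,
    with None marking a missing measurement (hypothetical input layout).
    """
    kept = {}
    for cid, values in raw_profiles.items():
        # Discard profiles that lack a value for any cell line.
        if len(values) != n_cell_lines or any(v is None for v in values):
            continue
        # Discard flat profiles: with zero variance, a correlation-based
        # similarity cannot be computed.
        if max(values) == min(values):
            continue
        kept[cid] = list(values)
    return kept


# Toy example with 3 cell lines instead of 60.
raw = {
    "A": [-6.0, -5.5, -4.0],  # complete and informative: kept
    "B": [-4.0, None, -4.0],  # missing value: discarded
    "C": [-4.0, -4.0, -4.0],  # identical activity everywhere: discarded
}
db = build_profile_db(raw, n_cell_lines=3)
print(sorted(db))

# Bookkeeping reported in the text: 5083 - 631 - 156 compounds remain.
print(5083 - 631 - 156)
```

The final line simply re-derives the database size from the counts given in the text.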
The BASS approach consists of three major steps (Figure 1). First, for a given query compound in the NCI-60 data set, we searched against the entire bioactivity profile database and calculated the pairwise bioactivity profile similarity between the query compound and each reference compound. Second, a reference compound was identified as a neighbor compound if its bioactivity profile similarity was above a selected threshold. Finally, the known targets of the query compound were predicted as the potential targets of its neighbor compounds, or vice versa. A critical step of BASS is to identify the neighbor compounds for a given query compound based on the similarity of bioactivity profiles (Simbio), which is defined as the Pearson correlation coefficient (Rp)
$$\mathrm{Sim}_{bio} = R_p = \frac{N\sum_{i=1}^{N} x_i y_i - \sum_{i=1}^{N} x_i \sum_{i=1}^{N} y_i}{\sqrt{N\sum_{i=1}^{N} x_i^2 - \left(\sum_{i=1}^{N} x_i\right)^2}\,\sqrt{N\sum_{i=1}^{N} y_i^2 - \left(\sum_{i=1}^{N} y_i\right)^2}}$$
where N equals 60, and x_i and y_i are the log(GI50) values of the ith NCI-60 cell line for compound Q and compound S, respectively. In this work, S is considered a neighbor compound of Q when Simbio is above 0.75. This similarity threshold was chosen based on a statistical test, which was carried out by randomly selecting two compounds from the entire bioactivity profile database 100,000 times and recording the bioactivity profile similarity each time. The probability (p-value) of obtaining a bioactivity profile similarity above a given threshold was then calculated. For the similarity threshold of 0.75 (p-value = 2.28e-3), we found a good balance between prediction accuracy and the number of predictions.
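The similarity search and the random-pair test described above can be sketched as follows. This is a minimal illustration, not the original implementation: the function names, the dictionary-based database, and the fixed random seed are assumptions, and the sampling draws two distinct compounds per trial, a detail the text does not specify.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def find_neighbors(query_id, db, threshold=0.75):
    """Return {compound_id: similarity} for profiles above the threshold."""
    query = db[query_id]
    hits = {}
    for cid, profile in db.items():
        if cid == query_id:
            continue  # a compound is not its own neighbor
        sim = pearson(query, profile)
        if sim > threshold:
            hits[cid] = sim
    return hits


def empirical_p_value(db, threshold, n_trials=100_000, seed=0):
    """Fraction of random compound pairs whose similarity exceeds the threshold."""
    rng = random.Random(seed)
    ids = list(db)
    above = 0
    for _ in range(n_trials):
        a, b = rng.sample(ids, 2)  # two distinct compounds (an assumption)
        if pearson(db[a], db[b]) > threshold:
            above += 1
    return above / n_trials


# Toy database with 4 cell lines instead of 60.
db = {
    "Q": [-6.0, -5.0, -7.0, -4.5],
    "S1": [-6.1, -5.2, -6.8, -4.4],  # nearly the same pattern as Q
    "S2": [-4.5, -7.0, -5.0, -6.0],  # a different pattern
}
neighbors = find_neighbors("Q", db)
print(neighbors)
```

In this toy database only `S1` passes the 0.75 threshold. On a real profile database, `empirical_p_value(db, 0.75)` would estimate how often a random pair reaches the threshold by chance, which is the quantity reported as the p-value in the text.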
Target annotations for all the compounds in the bioactivity profile database were primarily collected from four public databases: DrugBank, TTD, ChEMBL, and PubChem. For DrugBank (http://www.drugbank.ca) and TTD (http://bidd.nus.edu.sg/group/cjttd/TTD_HOME.asp), compound-target associations were downloaded from the original Web sites (both accessed on December 9, 2010). For ChEMBL, the mirrored version of ChEMBL_08 in PubChem was used (http://pubchem.ncbi.nlm.nih.gov, accessed on December 9, 2010), and we considered a compound-target association to exist when a compound exhibited an effective activity concentration ≤1 μM against its directly assigned target. For PubChem, the bioactivity outcome specifications from the original bioassay depositors were adopted to establish compound-target associations. Additionally, we manually collected target annotations for a number of compounds from the literature using the ‘Literature Keyword Mining Tool’ provided at PubChem. From the list of MeSH terms (http://www.ncbi.nlm.nih.gov/mesh) returned by this tool, we examined the ones most relevant to the compound and/or target of interest, followed the links to the full-text literature, and extracted evidence therein whenever possible. All protein targets were uniformly stored as UniProtKB identifiers (http://www.uniprot.org, accessed on February 4, 2011). Other molecular targets, such as DNA and RNA, were stored as target names. As a result, 237 compounds with known target annotations in one or more of the above four databases were identified (Table 1).
Using the above 237 compounds with known target annotations as queries, BASS predicted a total of 4693 compound-target associations for neighbor compounds, i.e., the known targets of a respective query compound were considered as the potential targets of its neighbor compounds. In this work, a prediction was counted as successful if at least one potential target was also annotated for the neighbor compound in any of the above four databases. It should be noted that a prediction could be evaluated only when both the query compound and the neighbor compound had target annotations available. Of the 4693 compound-target associations, 634 turned out to be verifiable. For a systematic evaluation of the predicted associations, a stringent criterion was first used, requiring an identical target for the query compound and its neighbor compound. As a result, a success rate of 44.8% (284 successful predictions out of 634) was achieved, which accounted for 103 out of the 237 query compounds. When the targets were proteins and no exact match existed between the targets of a compound and those of its neighbor compound, a less stringent criterion was applied: the prediction was also counted as successful if the protein target sequences were significantly related. In this work, two protein targets that showed an E-value <1e-12 in the BLAST(39) protein–protein sequence alignment were considered as biologically related. Under this criterion, the success rate increased to 48.6% (308 predictions in total), which accounted for 108 out of the 237 query compounds. The above evaluation suggested that BASS, when combined with target information from public databases, may be used to identify targets for neighbor compounds with bioactivity profiles similar to that of a query compound. Detailed results are described for the following examples, with the complete results provided in Supporting Information, Table S3.
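The retrospective evaluation described above can be sketched as a small scoring routine. This is an illustrative sketch only: the function name `evaluate`, the data layout, and the toy target labels are hypothetical, and the set of related target pairs is assumed to have been computed beforehand (for example, from a sequence-alignment E-value cutoff) rather than derived here.

```python
def evaluate(predictions, annotations, related=frozenset()):
    """Score verifiable predictions under strict and relaxed target matching.

    predictions: list of (neighbor_id, set of predicted targets).
    annotations: dict mapping a compound ID to its set of known targets.
    related: set of frozenset({t1, t2}) target pairs judged related
             beforehand (assumed precomputed, not calculated here).
    Returns (verifiable, strict_hits, relaxed_hits).
    """
    verifiable = strict = relaxed = 0
    for neighbor_id, predicted in predictions:
        known = annotations.get(neighbor_id)
        if not known:
            continue  # neighbor has no annotation, so it cannot be checked
        verifiable += 1
        if predicted & known:
            # Strict criterion: at least one identical target.
            strict += 1
            relaxed += 1
        elif any(frozenset((p, k)) in related for p in predicted for k in known):
            # Relaxed criterion: a predicted target is related to a known one.
            relaxed += 1
    return verifiable, strict, relaxed


# Toy data with hypothetical compound IDs and target labels.
annotations = {"n1": {"TUBB"}, "n2": {"DHFR-TS"}, "n3": {"TOP2A"}}
predictions = [
    ("n1", {"TUBB"}),  # exact match
    ("n2", {"DHFR"}),  # related target only
    ("n3", {"HNMT"}),  # annotated but no match
    ("n4", {"TUBB"}),  # neighbor not annotated: not verifiable
]
related = {frozenset(("DHFR", "DHFR-TS"))}
result = evaluate(predictions, annotations, related)
print(result)

# Success rates reported in the text, recomputed from the stated counts.
strict_rate = round(100 * 284 / 634, 1)
relaxed_rate = round(100 * 308 / 634, 1)
print(strict_rate, relaxed_rate)
```

The last lines recompute the two percentages from the counts given in the text (284 and 308 successful predictions out of 634 verifiable ones).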
Microtubules are composed of α- and β-tubulin heterodimers. They are cytoskeletal elements involved in many cellular processes, such as mitosis, cytokinesis, and vesicular transport.40−42 Small molecules that bind to tubulin can interfere with microtubule dynamics, resulting in microtubule stabilization or destabilization, which induces cell cycle arrest and ultimately leads to apoptosis. Out of the 15 new molecular entities approved by the FDA in 2010, two target microtubules.(43) Considering their key roles in mitosis and cell division, microtubules continue to be a very important chemotherapeutic target of anticancer drugs.(44)
According to DrugBank (primary accession number, PAN: DB01229), Paclitaxel (PubChem Compound identifier, CID: 36314) is an FDA-approved drug to treat various cancers, including ovarian cancer and breast cancer. It promotes the assembly of microtubules from tubulin dimers and stabilizes them by preventing depolymerization. In this work, using Paclitaxel as a query for BASS retrieved seven neighbor compounds (Figure 2A). These included five closely related analogues of Paclitaxel, showing an average two-dimensional chemical similarity (Simchem) of 0.924 as characterized by PubChem fingerprint (ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.txt) and Tanimoto score.(45) This is consistent with previous observations that structurally similar compounds may exhibit comparable bioactivities.46,47 However, due to the limited target annotations available to us at the time, we were not able to verify tubulin as a target for these structural analogues.
On the other hand, tubulin was verified as a target for one neighbor compound, Vinblastine (CID: 241902; Simbio = 0.785; p-value = 1.31e-3), which was structurally unrelated to Paclitaxel (Simchem = 0.560, Figure 2A). Vinblastine is an approved anticancer drug (PAN: DB00570) which is thought to play a key role in mitosis inhibition at metaphase via its interaction with tubulin. The crystal structure of the Vinblastine-tubulin complex reveals that Vinblastine binds at the interface between two tubulin heterodimers,(40) in contrast to Paclitaxel, which binds at the taxol site of β-tubulin.(42) Furthermore, using Vinblastine as a query, BASS identified a number of neighbor compounds that were common to those of Paclitaxel. Interestingly, this second search identified two additional neighbor compounds which were previously reported as tubulin inhibitors (CID: 249332(48,49) and 347381;(50) Simbio = 0.753 and 0.756; p-value = 2.25e-3 and 2.12e-3; Simchem = 0.984 and 0.526, respectively). In addition, BASS identified another non-Paclitaxel neighbor compound, NSC355256 (CID: 434718; Simbio = 0.789; p-value = 1.14e-3; Simchem = 0.671), using Paclitaxel as a query (Figure 2A). Due to the limited target annotations available to us, we were unable to verify tubulin as a target for this compound. However, we noticed that it shared the chemical scaffold of an approved drug, Colchicine (PAN: DB01394; CID: 6167), with a significant structural similarity (Simchem = 0.878). As indicated by the crystal structure of the Colchicine-tubulin complex, Colchicine binds to the β-tubulin subunit of the microtubule at the interface with α-tubulin.(41) This example indicated that BASS had the potential to discover novel inhibitors and explore new starting points for lead optimization, demonstrating the advantage of BASS for identifying compounds with various chemical scaffolds, which may provide insight into ‘scaffold hopping’ against the target of interest.51,52
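The chemical similarity values (Simchem) quoted in this example are Tanimoto scores computed on binary fingerprints. The sketch below shows the score itself on toy data; the fingerprints are made-up sets of "on" bit positions, not real PubChem fingerprints, and returning 0.0 for two empty fingerprints is a convention chosen here for illustration.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto score between two fingerprints given as sets of 'on' bit positions."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


# Toy fingerprints: 4 shared bits, 6 bits set in total across both.
fp1 = {1, 4, 7, 9, 12}
fp2 = {1, 4, 7, 9, 15}
score = tanimoto(fp1, fp2)
print(round(score, 3))
```

A pair with high bioactivity profile similarity but a low Tanimoto score, such as Paclitaxel and Vinblastine above, is the kind of case the text describes as structurally unrelated.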
In the above example, we demonstrated that the targets of biological neighbor compounds could be inferred from the known targets of a drug molecule. It would be more practical and interesting to investigate, from a reverse perspective, whether BASS could be used to suggest new targets for a drug molecule by gathering known target information from its neighbor compounds (Figure 1). Dihydrofolate reductase (DHFR) converts dihydrofolate into tetrahydrofolate. The latter is a methyl group shuttle required for the de novo biosynthesis of purines, thymidylates, and certain amino acids, which are essential for DNA synthesis and cell multiplication.(53)
In this example, we used the experimental drug Metoprine (CID: 24466) as a query. According to DrugBank (PAN: DB04655), its annotated target is Histamine N-methyltransferase (HNMT).(54) Of its 13 neighbor compounds identified by BASS (Figure 2B), none was found to target HNMT according to the available target annotations. On the other hand, further investigation indicated that three neighbor compounds, Pyrimethamine (CID: 4993), NSC302325 (CID: 327404), and Methylbenzoprim (CID: 72438),(55,56) had been previously reported to target DHFR. According to DrugBank (PAN: DB00205), Pyrimethamine is an FDA-approved antimalarial drug that acts by inhibiting DHFR.(53) Based on ChEMBL annotation (PubChem BioAssay identifier, AID: 55830), NSC302325 is a DHFR inhibitor with an IC50 of 0.85 μM.(57) The direct annotation of DHFR as a target of Methylbenzoprim was not available in any of the above four databases. However, its annotated target in ChEMBL (AID: 56179 and 56314), bifunctional dihydrofolate reductase-thymidylate synthase (DHFR-TS), was found to be closely related to DHFR (BLAST E-value = 6e-136). The binding of Methylbenzoprim to DHFR was further supported by previous NMR experiments(55) as well as molecular modeling studies.(56) Using any one of the three compounds Pyrimethamine, NSC302325, and Methylbenzoprim as a query, BASS could identify Metoprine as a neighbor compound (Simbio = 0.800, 0.845, and 0.829; p-value = 9.4e-4, 3.7e-4, and 4.8e-4, respectively). Moreover, all three compounds were structurally related to the query Metoprine (Simchem = 0.950, 0.707, and 0.848, respectively). Therefore, it is natural to consider DHFR as a potential target of Metoprine, which was confirmed by further investigation into the target annotation in TTD (DrugID: DCL000304) and the literature.58−60
Polypharmacology is receiving increasing attention in drug discovery for exploring both side effects and new therapeutic opportunities.(61) As a step forward, BASS can be readily applied to predict the polypharmacology of a given compound by collecting known targets from its neighbor compounds. Here, we present such an example using the approved drug Amsacrine (CID: 2179) as a query (Figure 3). A total of 67 neighbor compounds were identified by BASS. More than a dozen of them were known DNA intercalators or cross-linkers according to DrugBank annotations and/or the literature. There were also several neighbor compounds that were previously reported as inhibitors of topoisomerase, type II alpha (TOP2A). The two targets DNA and TOP2A were also annotated for Amsacrine in DrugBank (PAN: DB00276). Additionally, Amsacrine together with one neighbor compound (CID: 2708; PAN: DB00291) were confirmed to interact with the enzyme glutathione S-transferase A2 (GSTA2). In a quantitative high-throughput screening assay (AID: 886) launched by the US National Institutes of Health Chemical Genomics Center (NCGC), both Amsacrine and its neighbor compound (CID: 3246719) demonstrated inhibitory activity against hydroxyacyl-coenzyme A dehydrogenase, type II (HADH2). In another bioassay (AID: 410) conducted by NCGC, Amsacrine and two neighbor compounds (CID: 24360 and 148869) were all found active against cytochrome P450, family 1, subfamily A, polypeptide 2 (CYP1A2). Therefore, it is straightforward to depict a polypharmacological graph of Amsacrine by gathering available target information predicted from its neighbor compounds (Figure 3).
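Gathering target information from neighbor compounds, as in the Amsacrine example, amounts to a simple aggregation step. The sketch below is an illustration under stated assumptions: the function name `suggest_targets`, the compound IDs, and the data layout are hypothetical, and ranking targets by the number of supporting neighbors is a choice made here, not a procedure described in the text.

```python
from collections import defaultdict


def suggest_targets(query_id, neighbors, annotations):
    """Collect targets annotated for neighbor compounds of a query.

    neighbors: dict {neighbor_id: bioactivity profile similarity}.
    annotations: dict {compound_id: set of known targets}.
    Returns a list of (target, supporting neighbor IDs, already_known_for_query),
    ordered by number of supporting neighbors, then by target name.
    """
    support = defaultdict(list)
    for neighbor_id in neighbors:
        for target in sorted(annotations.get(neighbor_id, ())):
            support[target].append(neighbor_id)
    known = annotations.get(query_id, set())
    ranked = sorted(support.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [(target, ids, target in known) for target, ids in ranked]


# Toy data with hypothetical compound IDs; target labels echo the example above.
annotations = {
    "query": {"DNA", "TOP2A"},
    "n1": {"DNA"},
    "n2": {"DNA", "TOP2A"},
    "n3": {"CYP1A2"},
}
neighbors = {"n1": 0.91, "n2": 0.84, "n3": 0.78, "n4": 0.76}
suggestions = suggest_targets("query", neighbors, annotations)
for target, ids, known in suggestions:
    print(target, ids, "known" if known else "new")
```

Targets flagged as new are hypotheses for follow-up, in the same sense that the text treats unverified predictions.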
The promising results from the overall evaluation of the predicted compound-target associations and those shown in the above examples demonstrated that bioactivity profile similarity search (BASS) may be applied to predict new targets for drugs and bioactive compounds from the target annotations of their neighbor compounds that are available in public databases. Nevertheless, a larger number of target predictions could not be verified, owing to insufficient target annotations in public databases or to difficulty in literature searching. Further (experimental) studies are therefore needed to verify the targets predicted here, especially those resulting from significant bioactivity profile similarity. For completely uncharacterized bioactive compounds, BASS may also help target identification by suggesting potential targets aggregated from their biological neighbor compounds. For interested readers, we included a list of query compounds that have no target annotation yet in any of the above four public databases or in the literature, together with their neighbor compounds with known target annotations (Supporting Information, Table S4).
It should be mentioned that the compound-target associations identified in this work were verified retrospectively by taking advantage of the target annotations derived from public databases or by literature searching, and we emphasize that this work could not have been done without open access to public databases, which now contain a vast amount of chemical biology data. For a number of cases (e.g., the microtubule example), the predictions were strongly convincing as supported by the crystal structures of ligand-target complexes. Nevertheless, for other cases, the reported compound-target annotations in the relevant databases or literature may require further investigation to better understand the underlying mechanism of binding. For example, though Paclitaxel and Vinblastine both bind to microtubules, they actually bind at very different sites, which may be responsible for their different modes of action. To address these issues, structural biology studies, such as NMR or X-ray diffraction experiments, would be particularly persuasive. With the growing availability of public databases containing ligand-target annotations, such as DrugBank, TTD, ChEMBL, and PubChem, the accuracy of BASS may be further improved.
Lead optimization based on chemical scaffold has been broadly embraced by medicinal chemists(62) as a central guiding principle to design ligands with higher potency and/or more desirable physicochemical properties.(63) It is therefore interesting to look into the chemical space of the neighbor compounds identified by BASS (Figure 4 and Supporting Information, Figure S2). Figure 4A shows the number of neighbor compounds within a certain range of chemical similarity as a function of bioactivity profile similarity using all query-neighbor pairs in the bioactivity profile database. As one can see, BASS was able to identify not only structurally similar neighbor compounds but also a considerable number of structurally dissimilar ones with related bioactivities. These may provide novel molecules or new starting points for future ligand design, which might not have been discovered by conventional medicinal chemistry efforts. Therefore, BASS has the appealing capability of ‘scaffold hopping’, as demonstrated in the above microtubule example and the data shown in Figure 4B as a whole. It thus represents a new strategy for identifying candidate compounds with diverse chemical scaffolds that are biologically relevant to the intended target. It is worth stressing that the threshold of bioactivity profile similarity for defining a neighbor compound is user adjustable, though a conservative threshold of 0.75 was used in this work. In fact, when a less stringent threshold of 0.70 (p-value = 5.22e-3) or lower was applied in BASS, we could still verify a number of predictions.
The idea of using a bioactivity profile (pattern or fingerprint) is, of course, not entirely new, and similar ideas have been proposed. Nevertheless, computational approaches that make use of different profiling data may vary, in particular in the research goals they pursue. For example, the “Connectivity Map” approach developed by Lamb et al.(12) employs mRNA expression profiles to establish connections among small molecules and between small molecules and diseases. The “biospectra analysis” approach by Fliri et al.64,65 aims to group compounds with related inhibitory bioactivities against a panel of protein targets and correlate them with biological functions. Our specific goal in this work is to associate compounds with targets based on the similarity of their NCI-60 cell line bioactivity profiles and their target annotations in public databases. We anticipate that BASS may benefit target identification for anticancer drug discovery. Analysis using BASS could generate hypotheses to understand both the mode of action and the mechanism of binding of bioactive compounds by suggesting new targets from well-characterized neighbor compounds. Our work could contribute to target prediction and to state-of-the-art drug repositioning. The free-of-charge screening service provided by the DTP/NCI would make BASS more appealing. By submitting their own compounds of interest, researchers could obtain high-quality and confidential bioactivity profiles, which in turn can be used as inputs for BASS to identify potential targets by consulting the known targets of compounds in the bioactivity profile database. Nevertheless, before additional experiments are done, it should be mentioned that BASS may only be applicable for identifying the targets of compounds that cause cellular responses in the NCI-60 cell lines.
We have presented a computational approach, BASS, for mutually identifying compound-target associations by comparing the bioactivity profiles that are derived from the NCI-60 cell lines. When two compounds share similar bioactivity profiles, the targets of either compound may be considered as the potential targets for the other compound. To evaluate BASS, each compound in the bioactivity profile database was used as a query to search against the entire database for neighbor compounds that may share common targets. An overall success rate of 44.8% was achieved for the predicted compound-target associations by using the prior knowledge of target annotations from public databases, and it was further improved to 48.6% when considering related protein targets. Analysis shows that BASS could not only identify structurally similar bioactive compounds that are biologically relevant to the target of interest but also suggest novel chemical scaffolds for the intended target. Moreover, BASS may represent an efficient strategy for integrating experimental data and newly emerging target information for any of the neighbor compounds. Therefore, BASS may be applied to suggest new targets for old drugs and provide insight into anticancer drug discovery, facilitating the study of the toxicity, promiscuity, and polypharmacology of drugs and bioactive compounds.
We thank the Intramural Research Program of the National Institutes of Health (NIH), National Library of Medicine (NLM) for funding support. We also thank the NIH Fellows Editorial Board (FEB) for manuscript revision.
The 60 bioassays (NCI-60) used in this study. The original data of bioactivity profile for 4296 compounds. The complete results of the 284 predicted compound-target associations using the 237 compounds with known targets as queries. The predicted compound-target associations for the query compounds which have no target annotations. The characteristics of the data set used in this study. The global view of chemical similarity as a function of bioactivity profile similarity. This material is available free of charge via the Internet at http://pubs.acs.org.