|Home | About | Journals | Submit | Contact Us | Français|
Quantitative analysis of known drug–target interactions emerged in recent years as a useful approach for drug repurposing and assessing side effects. In the present study, we present a method that uses probabilistic matrix factorization (PMF) for this purpose, which is particularly useful for analyzing large interaction networks. DrugBank drugs clustered based on PMF latent variables show phenotypic similarity even in the absence of 3D shape similarity. Benchmarking computations show that the method outperforms those recently introduced provided that the input data set of known interactions is sufficiently large—which is the case for enzymes and ion channels, but not for G-protein coupled receptors (GPCRs) and nuclear receptors. Runs performed on DrugBank after hiding 70% of known interactions show that, on average, 88 of the top 100 predictions hit the hidden interactions. De novo predictions permit us to identify new potential interactions. Drug–target pairs implicated in neurobiological disorders are overrepresented among de novo predictions.
Drug discovery and development has become increasingly challenging in recent years, evidenced by the estimated cost of around $1.8 billion for the development of a novel molecular entity with suitable pharmacological properties.1 This cost increase partly originates from the failure of many drug candidates in phase II or III clinical trials due to their toxicity or lack of efficacy.2 The efficiency of drug discovery and development might be improved by adopting a systemic approach that takes into consideration the interaction of existing drugs and candidate compounds with the entire network of target proteins and other biomolecules in a cell.3 Indeed, the “one gene, one drug, one disease” paradigm is widely recognized to fail in describing experimental observations.4 Many drugs act on multiple targets, and many targets are themselves involved in multiple pathways. For example, β-lactam antibiotics and most antipsychotic drugs exert their effect through interactions with multiple proteins.5,6 Biological networks are highly robust to single-gene knockouts, as recently shown for yeast where 80% of the gene knockouts did not affect cell survival.7 Similarly, 81% of the 1500 genes knocked out in mice did not cause embryonic lethality, further corroborating the robustness of biological networks against single target perturbagens.8 These results suggest that quantitative systems pharmacology strategies that take account of target (and drug) promiscuities can present attractive alternative routes to drug discovery.
Recent years have seen many network-based models adopted to reduce the complexity of, and efficiently explore, drug–target interaction systems.2,5,6,9 In particular, the development of computational methods that can efficiently assess potential new interactions became an important goal. In this regard, the important role that machine learning approaches such as active learning (AL) can play has been recently been highlighted.10 Computational approaches used to predict unknown drug–target interactions can be divided into roughly four categories: chemical-similarity-based methods,11−13 target-similarity-based methods,14−16 integrative (both target- and chemical-similarity-based) methods,17−23 and holistic approaches.24−29 The first two posit that if two entities are chemically or structurally similar they will share interactions. The integrative approaches combine the chemical- and target-similarity methods. While the intuition behind these approaches is very reasonable, their performance has been observed to be tied to the underlying similarity computation method. We also note that the utility of different methods may depend on the size of the data set being analyzed, e.g., computing chemical–chemical and target–target similarity matrices can be problematic for large databases like STITCH30 (that contains information on the interactions between more than 2.6 million proteins and 300000 chemicals). To overcome these limitations, holistic methods have been introduced, which utilize a number of different data sources such as gene expression perturbation25,26 or high-throughput screening.28
In this study, we propose a novel approach by using a collaborative filtering algorithm to predict interactions without reliance on chemical/target similarity or external data collection. We validate the utility of probabilistic matrix factorization (PMF) for predicting unknown drug–target interactions with the help of a detailed investigation of its performance. The method is shown to group drugs according to their therapeutic effects, irrespective of their three-dimensional (3D) shape similarity. Benchmarking computations show that the method outperforms recent methods17,20,22 when applied to large data sets of protein–drug associations, such as those of enzyme– and ion channel–drug pairs; whereas the performance falls short of these methods with decreasing size of the examined data set (e.g., GPCR- and nuclear receptor-drug data sets). The ability of the method to efficiently analyze and make inferences from large data sets of protein–drug interactions suggests that, with growing sizes of those data sets, the utility (and accuracy) of the method will further improve.
Application of the same benchmarking procedure to DrugBank31 confirms its ability to disclose hidden data: 88 out of the top 100 predictions (or 587 out of 1000) are found to hit known (but hidden) interactions, when only 30% of the entire data is used for training. Finally, when the method is trained on the entire data set of drug–target interactions compiled in DrugBank, de novo predictions for drug repurposing can be made along with the corresponding confidence levels. Top de novo predictions include many drugs indicated for neurodegenerative diseases or neurobiological disorders, including drug–target pairs apparently supported by previous experiments (not reported in DrugBank), e.g., ergotamine–serotonin receptor 1A (5HT1A),32 amoxapine-5-HT2A,33 and verapamil–calmodulin.34 In conclusion, the newly introduced computational method provides an efficient approach for identifying potential drug–target association between chemicals and targets and formulating new hypotheses for repurposable drugs or side effects, thus complementing those deduced from chemical–chemical or target–target similarities.
The drug–target interaction network is a bipartite graph with two types of nodes: drugs and targets (Supporting Information Figure S1). Each edge represents an interaction between a drug and a target. The drug–target interaction identification problem is to determine the missing edges that are likely to exist given all nodes and some of the edges in the network.
We used DrugBank (version of September 20, 2011) as the database.31 All drugs annotated therein as approved, along with their annotated targets, are included in our data set (i.e., we excluded compounds annotated as withdrawn or nutraceutical), resulting in N = 1413 drugs and M = 1050 targets with 4731 interactions among them. The interaction network displays small-world characteristics: many nodes have low degree and a few, very high degree, as illustrated in panels b and c of Supporting Information Figure S1, in line with previous studies on drug–target networks.35 On average, there are 3.35 interactions per drug, and 4.50 interactions per target.
PMF is a member of the collaborative filtering family of machine learning algorithms that decomposes the connectivity matrix, RN×M, of a bipartite graph of N drugs and M targets as a product of two matrices of latent variables (LVs).36,37RN×M is defined as
The matrix RN×M is modeled as the product of two matrices UTN×D and VD×M, that express each drug/target in terms of D LVs. Our objective is to find the best approximation for LVs, while avoiding overfitting. The predicted N×M is then expressed as
where UT and V are composed of N rows uiT and M columns vj, respectively, each being D-dimensional. The PMF adopts a probabilistic linear model with Gaussian noise to model the interaction. Therefore, the conditional probability over observed interactions is represented as
where f(x|μ, σ2) is the Gaussianly distributed probability density function for x, with mean μ and variance σ, and Iij is the indicator function equal to 1 if the entry Rij is known and 0 otherwise. Therefore, p(R|U, V, σ2) gives us a probabilistic representation of the connectivity matrix, R.37 Using zero-mean, spherical Gaussian priors on LVs, we can write
which leads to the log-likelihood of U and V given by
Here C is a term that does not depend on LVs; the first term on the right-hand side is the squared error function to be minimized; and the two summations over the square magnitudes of ui and vj are regularization terms that favor simpler solutions and penalize overfitting. The above log-likelihood directly follows from the Bayes’ rule where R stands for data, and U and V represent the model (see the Supporting Information for details). To learn an optimal model means to find the U and V matrices, or the D-dimensional LV vectors, ui (1 ≤ i ≤ N) and vj (1 ≤ j ≤ M), that maximize the log-likelihood function.
The PMF method yields the optimal ui and vj vectors corresponding to each drug, di, and each target tj, respectively. The basic idea is that the model is forced toward making a “no-interaction” prediction by the regularization—i.e., there is a penalty associated with any nonzero value in the LV matrices. However, there is also a penalty for failing to capture known interactions—i.e., if the dot product of the LV vectors corresponding to an interacting drug–target pair is close to zero. Therefore the learning of a model means to optimally balance out two objectives: developing a sufficiently complex model to describe the known interactions, but not overly complex to end up in overfitting. In this study, we use gradient descent for optimization. The adoption of higher D values usually yields more accurate results, although beyond a certain limit the increase in complexity and decrease in efficiency may not warrant the marginal improvement, if any, in prediction accuracy. D = 50 is adopted here as an optimal dimensionality for prediction runs. The method is highly efficient: a 50-dimensional model is trained on the entire DrugBank in approximately 2 s using a 2.00 GHz AMD Opteron processor. Moreover, the computing time to learn a PMF model scales linearly with the number of interactions, and as such, the method can be advantageously used for much larger data sets.
The AL strategy adopted in the present study is, in part, motivated by the success reported by Warmuth et al.38 who demonstrated that hit maximization is a viable AL strategy applicable to predicting drug–target interactions. The AL strategy adopted here also prioritizes the discovery of unknown interactions. Our method differs in that we aim at capturing the interactions between all drugs and targets, as opposed to predicting activity against a single target.
The procedure is the following: We begin with the set of N drugs and M targets, and known associations, schematically shown in Figure Figure11 by black connectors. The purpose is to identify new associations, indicated by red connectors. For each candidate interaction, say the possible interaction between di and tj, we compute the model’s estimate, f(Rij|uiTvj, σ2) (eq 3). The dot product uiTvj serves as a weight ωij for the edge/connector between di and tj. Clearly, ωij, or the likelihood of association between di and tj, is high when ui and vj have both large values of the same sign at the same dimension(s). For example, a relatively large weight may originate from the second component of both ui and vj, which means that the predicted association is mainly due to latent variable 2. We evaluated the statistical weights ωij(di, tj) for the N × M pairs of drug targets for two purposes: (i) benchmarking the methodology via an iterative AL scheme and (ii) making de novo predictions. In the former case, the method is benchmarked by hiding 70% of known interactions and examining whether the top-ranking prediction is a “hit”, i.e., whether it corresponds to a known (but hidden) interaction. The outcome from this test is fed back to the model, to repeat the calculation for the next prediction. Therefore, the AL model is updated at each iteration using the newly acquired “hit” or “miss” data until a predetermined number (m) of predictions are made. The passive learner (PL) makes the m predictions simultaneously without updating its model.
Qualitative illustration of the method for identifying drug–target interactions. The known interactions between drugs and targets (indicated by the black lines) are used to learn the LV vectors (shown adjacent to each node) that describe each ...
In the case of de novo predictions, all DrugBank data were used as input. De novo predictions also lend themselves to an AL scheme provided that the top-ranking prediction is experimentally tested and then the new hit or miss data are incorporated in the model to perform a new prediction, and so on, until the experimentation budget is exhausted.
To assess whether the LVs provide us with a pharmacologically meaningful metric, we examined the clustering of drugs in the D-dimensional space of the latent vectors. The clustering was performed for D = 30—the value that gave the lowest Akaike information criterion,39 using as basis the drug–drug distance L1(di, dj) = ∑k|uik – ujk| where uik designates the kth component of ui, and the summation is performed over D components.
Inasmuch as our method evaluates drugs based on their interaction profiles with targets, which in turn refer to specific therapeutic or phenotypic actions, the similarity of a pair of drugs should be high when their therapeutic effects are comparable and vice versa. Thus, the method will tend to cluster drugs that exhibit similar patterns of interactions (with target proteins), which we term as functionally similar drugs.
The heat map in Figure Figure2a2a displays the resulting organization of drugs in 30 clusters (indicated by different colors and indices along the axes). Supporting Information Table 1 lists the dominant therapeutic action associated with each cluster. The dark regions on the map indicate high functional similarity. The dark blocks along the diagonal show that most clusters include highly similar members, except for two (clusters 29 and 30), which apparently combine the outliers.
Given that (promiscuous) proteins present more than one site for ligand-binding, different functionalities may be modulated by chemical-structurally different drugs, depending on the binding site on the target (e.g., catalysis, substrate recognition, or allosteric signaling). Furthermore, a shared phenotype may arise from the targeting of different proteins along a given pathway. In order to make a better assessment of the properties of drugs grouped in those clusters, we examined their 3D structural similarities. High similarities would suggest that they bind similar epitopes, if not similar (or identical) structural domains or proteins. If, on the contrary, they are structurally dissimilar, this might indicate a different site on the same protein, a different target on the same pathway, or other indirect effect due to drug–target network connectivity.
The extent of 3D structure similarity between pairs of drugs was computed using the OpenEye Scientific software (http://www.eyesopen.com/). 3D similarity was reported to be a better predictor than 2D methods for off-target interactions and to perform equally well in on-target interactions,40 although 3D methods may suffer from more noise due to the conformational flexibility of the small molecule. We generated for each drug all possible stereoisomers using OpenEye FLIPPER41 and up to 200 conformers per stereoisomer using OpenEye OMEGA.41 All combinations of conformers accessible to the examined pair of drugs were examined using OpenEye Shape23 toolkit; and the best matching pair was adopted to assign a 3D similarity score. This computationally expensive task led to the heat map presented in panel b of Figure Figure2.2. The drugs (along the axes) are ordered as in panel a to enable visual comparison.
The comparison of Figure Figure22 shows that some clusters of functionally similar drugs (panel a) also exhibit some 3D similarities (panel b), whereas others display little structural similarity. We examined more closely the individual clusters to see if shared therapeutic functions were captured even when 3D similarities were absent. Figure Figure33 illustrates the results for cluster 14. This cluster essentially consists of antianxiety drugs, the majority of which are both functionally (panel b) and structurally (panel c) similar. However, the cluster also includes a structurally dissimilar drug, ethchlorvynol (panel a), which shares the same type of phenotypic action (as a sedative) as the majority of the cluster membership (mostly targeting GABA receptors). The present approach thus detects chemically or structurally distinctive drugs that share common activities, which would have been missed by methods based on ligand fingerprint similarities.
Another interesting observation concerns the cross-correlations between different clusters (i.e., the off-diagonal regions of the heat maps). We note for example that cluster 11 also contains a set of sedatives. LVs are able to capture the commonality between the clusters 11 and 14 as may be seen by the strong signal (dark region) at the off-diagonal region enlarged in Figure Figure4b.4b. The 3D similarity, on the other hand, cannot recognize the functional similarity and potential interference/side effects between these drugs in these two clusters (Figure (Figure4c).4c). See Supporting Information Figure S2, which illustrates the same behavior for another cluster, whose members are mostly antineoplastic agents, albeit with various 3D structures. The LVs thus provide information on drug groups that potentially share pathways or exhibit similar activity patterns despite their distinct physicochemical properties.
Strong cross-correlations between different clusters of drugs, consistent with their similar therapeutic functions. Cluster 11, color-coded cyan, is essentially composed of hypnotics and sedatives. Cluster 14 (dark gray) contains antianxiety drugs. The ...
To evaluate the performance of the method in comparison to previous work, we considered three important studies in this area, one recently published by Gonen22 and two by Yamanishi et al.17,20 Gonen used a Kernel-based matrix factorization (KBMF) with chemical and genomic similarities to predict multiple targets. Yamanishi et al., on the other hand, integrated chemical, genomic, and pharmacological data to map all drugs and targets to the same unified feature space where each protein–compound pair closer than a predefined threshold was predicted to interact. Our approach differs from both studies, in that PMF assumes an independent LV for each row and column with Gaussian priors; whereas KBMF employs LVs spanning all rows and columns with Gaussian process priors, and Yamanishi et al. project drugs and targets into a pharmacological space based on the eigenvalue decomposition of the graph-based similarity matrix.
The benchmarking procedure that we adopted is a 5-fold cross-validation of drugs on four target classes: enzymes, ion channels, G-protein coupled receptors (GPCRs), and nuclear receptors. In order to achieve comparable results, we used the same protocol as that adopted earlier, i.e., we divided our data set into five subsets; each was used as a test set, and the others, as training sets. Due to the randomness involved in the selection of subsets, we repeated the cross-validation experiments 100 times with randomly selected subsets and evaluated the average AUC (area under the receiver operating curve) for each subset. The first four rows in Table 1 compare the results (columns 6–10) for the four classes, and the fifth row lists the average performances weighted by the size of the interaction space. Our method performs best when applied to large data sets (e.g., enzymes and ion channels); whereas Gonen’s performs best in the case of GPCRs, and Yamanishi et al.20 exhibits the highest performance for nuclear receptors, where the present method yields a relatively low (0.642) AUC value. Examination of the statistical significance of our results (Supporting Information Figure S3a) indicates that the mean AUC values obtained for all four sets are highly robust. Their covariances vary from 2% (enzymes and ion channels) to 11% (nuclear receptors). Finally, the application of the same benchmarking protocol to DrugBank yielded an accuracy rate of 79.4 ± 0.01% (Table 1, last row, and Figure S3), supporting the utility of the method when applied to large data sets.
Properties of the Examined Space of Proteins–Drugs and Performance of the Present Method in Comparison to Others (*)
In principle, it might be intrinsically harder to make accurate predictions for larger data sets as the size of the potential interaction space N × M grows quadratically when the number of drugs and targets grow linearly, particularly if the number of known interactions is small. The occupancy of the N × M interaction matrix is only 1.5% in the enzyme class, which could make it difficult to learn an informative model. The present PMF technique, however, successfully learned an informative model and handled the complexity of interactions in this space of interactions, apparently due to the availability of a sufficiently large (absolute) number of known interactions (Supporting Information Figure S3b).
The ion channel drug class has the second largest number of known interactions among the four. Although the size of interaction space is 1 order of magnitude smaller than the enzyme class, there are 776 known interactions leading to a percent occupancy of 5.37% of all possible ion channel-drug associations. The success of our method in this case may be attributed to both the relatively large number of known interactions and the rich annotation of that class of interactions.
The two other classes, GPCRs and nuclear receptors, are significantly smaller in terms of their interaction space and/or occupancy of that space. Nuclear receptors comprise only 27 drugs, 22 targets, and 44 interactions. A method that relies solely on connectivity, like ours, cannot presumably formulate an informative model when the set of “edges” to construct the network connectivity matrix is incomplete. In those cases, the data that come from other sources, e.g. chemical similarity and genomic patterns, amend this lack of information. Consequently, methods that incorporate such features20,22 outperform ours.
To further examine the effect of scarcity of known interactions on the performance of the present method, we performed additional tests by varying the fraction of hidden interactions. The results are presented in Supporting Information Figure S4. Panels a–d show the performance on ion channels, enzymes, GPCRs, and nuclear receptors, respectively. These results show that the performance depends on the fraction of known interactions. To put the results into perspective, we indicated by a vertical dashed line in each panel the fraction of data (80%) used in previous studies20,22 for training purposes. Consistent with the above findings, ion channels yield the best result: previous AUC values22 (of 0.799; Table 1) are matched with about only 35% of the data. In the enzyme group, we match the performance of Yamanishi et al.20 (AUC of 0.845) with roughly 70% of the data used for training. GPCRs and nuclear receptors yield AUC values lower than those previously attained,20,22 irrespective of the fraction of hidden interactions.
In summary, the method is particularly suitable for screening and inferring repurposable drugs or potential side effects from large data sets where computational assessment of structure similarity kernels become prohibitively expensive. In cases where the data set of known interactions is too small, on the other hand, 2D or 3D similarity metrics provide more accurate assessments.
As a more stringent test, 3318 (70%) of the known 4731 interactions in DrugBank were randomly hidden, reducing the average number of interactions per drug from 3.35 to 1. The resulting “incomplete” interaction matrix was then used to predict the hidden interactions, one at a time (rank-ordered by statistical weights ωij(di, tj)) as described in Materials and Methods. The outcome was checked in a simulated experiment to assess whether the predicted interaction is a true positive (TP) or a false positive (FP). If the prediction is an existing, but hidden, interaction, the result is considered a TP (or hit), otherwise a FP (or miss). Then the model is updated in line with our AL scheme, and this loop is repeated until the completion of m = 1000 predictions. At that point, the simulation is halted and the overall performance of the model, or the hit ratio, is evaluated. Note that this method gives us a lower bound for hit ratio because the predictions are labeled as hits only if they are annotated in DrugBank, although they can be true but not yet observed experimentally or annotated in DrugBank.
The results are presented in Figure Figure5.5. The figure displays the number of hits as a function of the number of predictions, obtained with three approaches: active learning (dark blue curves), passive learning (dark red curves), and random (green). The approach is able to achieve, on average, 587 hits out of 1000 predictions via AL, 407 hits, via PL; and the corresponding variances (indicated by the dashed curves) are 35 and 46, respectively. Compared to the random probability of 2.23 hits per 1000 predictions, the AL result is a 263-fold improvement over random. The improvement of AL over PL is 1.44 fold. The AL improvement over random was reported to be up to 3.19-fold in a previous SVM-based study for predicting the activity of 1316 drugs against a single target.38 The same study also reported 1.59-fold improvement between passive and active learners. Closer examination of the results from the top 100 predictions (enlarged in the inset) further shows that hit ratios of 88.0 ± 4.7% and 82.2 ± 6.4% are obtained by the respective AL and PL protocols. The results are obtained with D = 50, which yields optimal results, as can be seen from Supporting Information Figures S5 and and66 display the dependence of the results on D, in support of the choice of D = 50.
Ability of the method to recapitulate hidden drug–target interactions. The number of drug–target interactions per drug was reduced from 3.35 (average) to 1 by hiding 70% of known interactions, selected randomly. Simulations were repeated ...
These results permit us to draw two conclusions. First, a hit ratio of 88% is attainable in the top 100 predictions (and 59% in top 1000) upon adopting a PMF-based AL strategy for identifying hidden/unknown interactions in a sparse (0.32% occupancy) data set of about 1.5 million potential interactions. Second, the AL method outperforms random by 2 orders of magnitude and PL by a ratio of 1.5 approximately, in support of AL strategy for predicting new interactions.
We used our method to predict new (potential) drug–target interactions after training our model on the latest available version of DrugBank (September 10, 2013) comprised of 5041 interactions between 1502 approved drugs and 1138 targets. The highest confidence pairs obtained for twenty distinct drugs are presented in Table 2. The pairs therein were observed to lie frequently among the top-ranking 10 pairs (in the space of N × M = 1.7 × 106 potential interactions), deduced from 104 independent runs initiated with different random numbers.
De novo Predictions, Rank-Ordered Based on Confidence
We note that the list of de novo predictions in Table 2 is dominated by drugs used for neurological or psychiatric disorders, consistent with the known pharmacological promiscuity of this of drugs. The last column in Table 2 lists the experimental support from the literature, if any, for the possible interaction of the drug–target pair in each row. Among pairs supported by previous experiments, we note ergotamine–serotonin receptor 1A (5HT1A),32 amoxapine-5-HT2A,33 verapamil–calmodulin,34 paliperidone-5-HTc,42 meperidine–sodium channels,43 and cinnarizine–calmodulin.44 We also note that chronic treatment with paroxetine has been recently reported to increase the mRNA levels of histamine receptor H1, indicating an association between paroxetine and Histamine receptor H1.45 Although paroxetine is a potent serotonin reuptake inhibitor, its weight gain side effect has been attributed to its medication action on histamine receptors.46
Many others (indicated as “indirect”) are interactions known to occur either with subtypes of the listed targets or proteins implicated in the same phenotype (e.g., citalopram induces norepinephrine receptor hypoactivity,47which may relate to norepinephrine transport by NET). Yet, the validity of these predictions need to be established by experiments. Here, for exploratory purposes, we examined the potential binding pose and energetics of verapamil–calmodulin. Verapamil is known as a Ca2+ channel entry blocker. Its interaction with calmodulin is supported by the fluorescence experiments32 and by the inhibition of calmodulin-stimulated (Ca2+ + Mg2+) ATPase activity.48 We used as template the Protein DataBank (PDB)49 structure of the cocrystal between calmodulin and trifluoperazine (PDB identifier 1CTR)50 and examined the binding affinity of verapamil relative to that of trifluoperazine, using the module SMINA,51 based on AutoDock VINA.52 The software yielded an attractive energy of −4.8 kcal/mol for trifluoperazine binding to calmodulin, which after energy minimization, becomes −6.8 kcal/mol. Docking of verapamil to the same location and energy minimization led to −5.8 kcal/mol (Figure (Figure7).7). While docking software with fixed target often fails to provide a quantitative assessment of binding affinity, and more elaborate simulation methods have been developed for druggability assessment (see, e.g., our recent work53), these results support, qualitatively, the shape complementarity between calmodulin and verapamil, as well as their favorable electrostatic interaction. Establishment of this and other predicted interactions, however, awaits experimental verifications by essays designed to test the specific drug–target interactions.
Structural model for possible binding pose of verapamil onto calmodulin. The molecular surfaces of both the drug and calmodulin are colored by electrostatic potential (a and b) and a close-up view is shown in panel c. Calmodulin surface, especially at ...
Over the last couple of years, there have been a number of computational studies performed to identify targets of existing drugs and drug candidates other than those originally known/proposed to be targeted. A pioneering study is that of Roth, Shoichet, and co-workers11,13 based on compound chemical similarities. Dudley et al. focused on inverse correlations between gene expression profiles in the presence of a drug and in a disease state.25 Yamanishi and his colleagues represented drugs and targets in an integrated “pharmacological space”.17,20 Gonen used a KBMF method where chemical and genomic similarities were integrated.22 We proposed a PMF-based AL methodology that can be advantageously used for large data sets.
The applicability of the method to large data sets is worth further attention, given that we will increasingly have access to bigger data (e.g., STITCH Database30), which will be exploited for repurposable drug identification. The software developed here, made accessible in http://www.csb.pitt.edu/Faculty/bahar/files/, is readily scalable. For very large data sets, which typically have more known interactions, the PMF is able to construct a better model using the plethora of available data; whereas when the number of known interactions is limited, the use of chemical and genomic kernels allows KBMF to outperform PMF. The application of KBMF to large data sets may, however, become challenging, For example, STITCH contains on the order of 106 proteins and 105 compounds, implying that 1012 sequence and 1010 chemical similarity comparisons are needed to make predictions. However, the PMF method is independent of chemical, structural or other similarity metrics, and its computation time scales linearly with the number of known interactions; and it proves to perform well on large data sets. The data sets reporting drug–target interactions are constantly improving in quality and quantity and, therefore, expected to give even better results when analyzed by an efficient tool. Finally, the extension of the method to analyzing big data (with millions of nodes) is foreseeable in the near future. The recently introduced GraphChi tool54 can be used for optimized and parallelized model learning for further performance improvements.
The fact that the PMF is independent of 2D/3D shape comparison methods commonly employed in drug–target pair inferences implies that the derived LVs capture similarities based on the interaction patterns of drugs at the cellular level, even if their molecular structures are dissimilar (see Figures Figures22 and and3).3). As such, the method may be advantageously used for lead hopping, thus complementing those (e.g., SVM classification algorithms) used in conjunction with 2D or 3D pharmacophoric fingerprints (see the work of Saeh et al.55). Inasmuch as the currently proposed method does not require structural data for proteins but knowledge of drug–target interactions, it can be advantageously applied to membrane proteins (major drug targets) for which structural data still remain sparse. It can also be used to make predictions across major drug or target classification boundaries. One implication is that the de novo predictions are not restricted to major drug or target classification boundaries.
A major utility of the developed tool is the ability to deliver testable hypotheses with regard to repurposable drugs, thus significantly reducing the search space for identifying potent applications of existing drugs (that proved to meet ADMET requirements). The number of experiments that can be efficiently conducted is usually limited, e.g. of the order of 102 if not 101 for high-confidence assays as opposed to the complete space of ~1.5 million combinations for the data set used in this study. The fact that the top-ranking predictions exhibit a hit ratio of 59% (for the top 1000 predictions; or 88% for top 100 predictions) suggest that de novo predictions made by the presently introduced method of approach applied to increasingly large data sets are likely to provide useful guidance for experimentally testing, streamlining or prioritizing existing or investigational drugs or new compounds. Another important byproduct is the probabilistic assessments on potential side effects, a topic that will become increasingly important with advances in personalized medicine.
Support from NIH grant nos. U19 AI068021 and PO1 DK096990 is gratefully acknowledged by I.B. M.C.C. and I.B. are thankful to Dr. Ahmet Bakan for useful ideas, and they benefited from insightful discussions with Prof. Lans Taylor. A scholarship award to F.H. from the China Scholarship Council is gratefully acknowledged. We are grateful to Professor Olaf G. Wiest for careful reading of the manuscript and for making valuable comments and suggestions, especially concerning the medicinal chemistry aspects of the study.
National Institutes of Health, United States
Details on the methods, five figures, S1–S5, and Table S1. The figures illustrate the following: the drug–target interaction network and histograms denoting degree distribution for drug and target nodes (Figure S1), the structural heterogeneities of drugs despite their functional similarity (S2), running average of AUC over the cross-validation steps, and the correlation between AUC and number of known interactions (S3), the dependence of performance on data set size and fraction of known interactions (S4), and a comparison of the predictions from AL and PL methods for various dimensionalities of LV space (S5). Table S1 lists the therapeutic function characteristic of each of the 30 PMF-predicted clusters of drugs. This material is available free of charge via the Internet at http://pubs.acs.org.
§ C.L. and F.H. provided equal contribution.
The authors declare no competing financial interest.
Visiting scholar from Tsinghua University.