Our first goal, to achieve statistical generalization (i.e.
, how well a test set drawn from the same class can be predicted from a training set), is best explained in the context of an RNAi study: the original dataset for which we developed our first generation analysis method.7
In this initial study, 29 shRNAs targeting diverse aspects of the DNA damage response and programmed cell death biology were validated for knockdown and examined for their ability to affect drug response in lymphoma cells in vitro.
Drug response was assessed by a mixed population of shRNA-infected cells in which 10–20% of the population harbors the shRNA (along with GFP as a marker). An RNAi signature of a drug was derived as the pattern across all shRNAs of how each shRNA affects cell response to the drug relative to uninfected cells. We found that we could use as few as 8 shRNAs to accurately predict new examples of drugs for topoisomerase I and II poisons, spindle poisons, alkylating agents, nucleotide depletion agents, and transcription/translation inhibitors, as well as histone-deacetylase inhibitors (HDACs) that the model had not been trained on.
Given that K
-NN worked well in this original study, we wished to test here whether it outperforms other machine learning algorithms in RNAi datasets. We initially chose to simulate drug categories instead of the drugs themselves. To do this we chose four drug categories: topoisomerase I poisons, alkylating/alkylating like agents, spindle poisons and nucleotide depletion drugs. These categories were defined by multiple examples of different drugs possessing distinct chemistries within each individual category. A category-based average and standard deviation were generated, and these parameters were then used to simulate “theoretical drugs” from these categories to generate 100 different training sets and 1000 tests per drug category. In these simulated RNAi training datasets, K
-NN dramatically outperformed all other techniques when 3 training drug examples were present (Fig. S1A
). However, as more training drug examples were added (N
= 10) the performance gap between K
-NN and other algorithms decreased (Fig. S1B
). This indicated that for RNAi studies based on small numbers of validated drug training examples K
-NN is a good choice.
Next, to probe utility for statistical generalization, we examined how K
-NN predictions would perform over the progression of time and with actual experimental error. Following the original study, we have generated many signatures across a variety of drug categories. Focusing on the deepest coverage, we chose four of those categories: topoisomerase II poisons, alkylating/alkylating like agents, spindle poisons and nucleotide depletion drugs for which 7–31 replicate drug signatures were tested over three years. We now randomly sampled these much larger datasets from the expected cumulative distribution function of individual shRNAs for various specific drugs using inverse transform sampling. Using 1000 simulated Chlorambucil (an alkylating agent), Doxorubicin (topoisomerase II poison), 5-FU (a nucleotide depletion agent), and vincristine (a spindle poison) signatures, we estimated the robustness of the K
-NN predictions for the more recent drug additions to the dataset. We observed that K
-NN had an accuracy of 98–100% (Fig. S2
). These results offered confidence that a K
-NN approach achieves comfortably strong statistical generalization in RNAi datasets.
Having shown statistical generalization, we next probed categories of drug mechanisms-of-action that the original shRNA signature had not been trained upon. Given drugs whose mechanisms-of-action were represented in the training set plus those whose mechanisms-of-action were not (kinase inhibitors, phosphatase inhibitors, HSP90 inhibitors, and others) we sought to examine the capability for discerning between these two cohorts of drugs – representing true positives and false positives, where a false positive is a “New” drug example that is not in the original training set.
We implemented a variety of machine learning techniques with the goal of identifying “New” drugs that did not belong to the specified training set categories, and found that many classic techniques failed in doing so (). Most algorithms produce a false positive prediction when a drug’s mechanism is not represented in the training set. The best of the unmodified approaches, the K-NN approach, could achieve a true positive ratio (TPR) of 100% but its false positive ratio (FPR) was 100% concomitantly.
We thus considered whether an ensemble algorithm approach might improve performance, so we tested two such methods. The first method employed an ensemble of binary support vector machines (SVMs) each trained separately on each of the distinct drug mechanisms of action in the training set. SVMs seek to define decision boundaries on the basis of multi-variate “support vectors”. By making a binary (single class, “in” versus
“out”) model for each drug category, we could impose rules on the family of predictions made by the ensemble of models. If a drug was predicted to belong to two groups, or was never predicted to belong to a training group, it was classified as having a new mechanism-of-action. The second method was an ensemble of the individual algorithms previously tested (), as we had noticed that correlation between algorithms for different test sets is low (Fig. S1
). This raised the notion that the different algorithms were intrinsically relying on different features in the data. With this in mind, we created a scoring criterion whereby drugs with prediction disagreements among diverse algorithms are counted as “New” mechanisms of action. Both kinds of ensemble approaches improved prediction performance by reducing false positives modestly, though with high losses of true positives. The SVM and algorithm ensembles had TPRs of 88% and 42% and FPRs of 67% and 0% respectively ().
Toward a yet more effective approach, we recognized that the signature-based drug mechanism-of-action subnetworks exhibit diverse sizes (i.e., as characterized by mean edge length) (). The range of weighted pairwise distances between nodes in a given subnetwork produces complex shapes along with heterogeneous sizes in molecular signature space. Moreover, different subnetworks should be able to tolerate the inclusion of novel drugs to a different extent based upon their size and relative spacing to each other in subnetworks, but a sound theoretical estimate of the intra-subnetwork distribution for a given drug network is elusive because the “true” shape of the mechanistic subnetwork across all possible efficacious chemistries on a given biochemical target is unknown. Furthermore, the generally small number of distinct chemistries per class for model training precludes firm empirical estimation of a distribution within a subnetwork. Accordingly, we proceeded to take advantage of the “negative control” drugs (i.e., those not falling into a given category) to build distributions of false positive predictions; the negative controls offer a reasonable approximation of a background distribution because they are sampled broadly across many mechanisms of action with diverse biological functions. The empirical cumulative distribution function of these false positive network expansions is plotted for each category ( bottom), and a kernel density estimate of the cdf permits estimate of the probability that a predicted linkage ratio could be generated in the negative control dataset ( bottom). Implementing this methodology across all drug categories clearly separates false positives into a “New” drug category. Calculated results shown in show this methodology to be a major improvement over the algorithms tested in , including the ensemble algorithms (). Using a p-value cutoff of 0.05, this advanced method yielded a TPR of 100% and an FPR of 8.3%; a more stringent cutoff of 0.01 yielded a TPR = 94% and an FPR of 0%.
In our original RNAi study we had found that there was predictive resolution among HSP90 inhibitors as well as some classes of kinase inhibitors, which represented entirely new biological classes of drug mechanisms-of-action that the shRNA signature had not been trained on. This was a surprising result, but we had performed that study in an ad hoc manner by simply adding two new drug examples to the trained signature and predicting a third. Nevertheless, it suggested a potential for signatures of molecular predictors to be sufficiently broad in their biological comprehension to garner predictive resolution for untrained classes. We thus sought to apply our new network-based signature strategy to other datasets as a systematic way to test this potential more broadly.
While our network based statistic utilizes the negative control distribution for out-of-network expansion and is clearly able to define drugs not belonging to a training set subnetwork, the main challenge becomes identifying high confidence functional groupings between these “New” compounds such that the groupings have similar mechanisms-of-action. These new functional groupings must then be incorporated back into the training set and integrated with our supervised methodology so that during an iterative process new examples of these unknown subnetwork mechanisms might be identified and compiled. Because “New” groupings must be found, unsupervised learning is a natural choice. However, one such method, hierarchical clustering, will always produce clusters, and it requires the user to decide how to interpret the hierarchy. Furthermore, the structure between groups in a cluster tree is often significantly altered as neighboring branches are combined in the next level of hierarchy (often using average or nearest linkage measurements). Using the “New” mechanisms of action set of drugs from , we, see that hierarchical clustering () can raise multiple options for interpreting natural groupings in the data.
Fig. 3 A strategy to update new drug mechanism subnetworks. (A) A hierarchical clustergram with average linkage. This clustergram contains all of the “New” drugs in an attempt to test an approach to updating new subnetworks to the training set. (more ...)
In an alternative approach, networks of pairwise distances can also be used to examine relationships between nodes (here drugs), especially if the topology is thresholded at distance cutoffs.19
This network based representation offers the advantage of displaying all edges of a certain distance, and grouped distances are not recalculated as a hierarchy changes. In our data set, as the topology threshold is increased (), the network loses nodes and coalesces around the more densely interconnected subnetworks. Hence, the network representation may more faithfully represent some aspects of the natural data connections within and across cluster groupings even if it fails to provide an unbiased or non-trivial stopping point. Given these strengths and weaknesses, we developed a consensus method that combines the natural representation of thresholded network topologies and the discrete interpretations of hierarchical clustering. Under the assumption that two or more clusters exist in the data, we vary cluster size and the number of edges included in the network topology. In this procedure, we reward consensus solutions that maintain within cluster edges and eliminate between cluster edges; the best scoring solution is selected. Finally, we refine the groupings in an iterative fashion using our supervised approach, and test statistical and biological generalization.
Given that we had previously examined inhibitors of HSP90 and EGFR in our earlier study, we wanted to see if our proposed methodology could identify them a priori. Here we are not looking for “New” mechanisms in the singular, but “New” subnetworks in a more unbiased and systematic manner. We performed the search for “New” subnetworks in the presence of control out of category compounds (sunitinib, gliotoxin, and cantharidin and AA2). Here, network consensus clustering identified an optimal solution comprising 4 clusters (). This solution separated gliotoxin into its own 1-member category; the other three clusters were: a 2-member cluster containing cantharidin and AA2; a 3-member cluster with the 3 EGFR inhibitors; and, a 6-member cluster containing all of the HSP90 inhibitors plus sunitinib. Using this solution we ran our iterative supervised refinement. Upon refinement, AG1478 and erlotinib could accurately predict gefitinib and no other drugs, settling on a 3-member EGFR inhibitor cluster. The closest four HSP90 inhibitors could accurately predict 17AAG, but all five failed to accurately predict sunitinib. This created a subnetwork update that only included the 5 HSP90 inhibitors. Finally, AA2 could not identify cantharidin as its nearest neighbor, so this 2-member cluster –one a direct activator of the apoptosis downstream of the mitochondria, and the other a phosphatase inhibitor – was not updated as a new subnetwork. Thus, using network consensus clustering and supervised prediction refinement we could identify the “New” drug subnetworks for HSP90 and EGFR inhibitors that had appeared in our previous study ().
With this success we wanted to test our approach in two other datasets, the modulatory profiling approach of Wolpaw13
and the Connectivity Map of Lamb.3
We aimed to test the generalization of the analysis strategies in different data types, and to examine a more natural workflow in datasets where more powerful analysis strategies were not fully developed.
In the Wolpaw dataset, small molecule tool compounds and some RNAi perturbations that affect broad areas of biology were used to create drug action signatures. To measure how an effect modulates small molecule action, Wolpaw quantified shifts across an entire dose effect curve. This spectrum of the dosing effects across different compounds and cell lines formed a dose response modulation signature able to hierarchically cluster drugs by putative mechanisms-of-action. Interestingly, while their clustering did not resolve topoisomerase II and I poisons, or a coherent DNA damage cluster (defined by the presence of topoisomerase poisons and akylating agents), it could identify clusters that possessed similar chemical structure (hydrophobic amines), acted in a Bax/Bak-independent manner, and seemed to kill cells by non-specific reactive mechanisms. This suggested to us that their dose effect modulation approach could provide effective resolution, albeit with a disparate biological focus. We decided to examine their data with our new analysis strategy to compare capabilities.
Following our RNAi-based network signature approach, we found that the Wolpaw potency modulation data alone was sufficient for accurate classification. We constructed a training set comprising topoisomerase II poisons, topoisomerase I poisons, spindle poisons, proteasome inhibitors, and alkylating agents that was able to accurately predict “New” mechanism-of-action classes, including erastin (a novel Bax/Bak-independent death-inducing compound), compounds sharing structures with hydrophobic amines, and mitochondrial disruptors azide and valinomycin (Fig. S3A
). Given these “New” compounds we employed network consensus clustering and received a three-way tie in the network–cluster solution. Given a tie, the most parsimonious solution included only 2 clusters (Fig. S3
). Supervised refinements on the most parsimonius solution reinforced this two cluster solution with 2 nodes and 1 edge (Fig. S3C
). These 2 clusters contained the erastin profiles, and the hydrophobic amines (Fig. S3B–D
). With these new subnetworks defined, an updated dataset incorporating these 2 subnetworks was generated to examine the statistical and biological generalization of our approach in the Wolpaw dataset. With respect to statistical generalization, our method was able to predict Mitoxantrone as a topoisomerase II poison, as well as numerous novel spindle poisons that were also accurately predicted to be spindle poisons. To examine biological generalization, we tested our extended model’s predictive capability on a Bax/Bak-independent erastin-like compound along with diverse new hydrophobic amines that were not part of the subnetwork update (Fig. S3E
). At p
-values of 0.1 and 0.05, our algorithm produced a TPR of 88% and 75% respectively, both with an FPR of 0%. Thus, our analysis methodology can be gainfully applied to different data types.
A current standard in the molecular signature field is the Connectivity Map (CMap).3
In this approach, human cancer cell lines are treated with drugs, then at 6 hours harvested for genome-wide microarray mRNA expression measurement. The CMap employs a rank-based pattern matching methodology to connect a query signature of mRNAs to other microarrays in the database. While in theory the lists generated by querying a signature can provide many potential hypotheses, in practice the lists require post hoc analysis to gain confidence in a proposed prediction.
As a first step toward testing our approach on the CMap data set, we examine the associated CMap analysis search tool by exploring lists thereby generated; we evaluated a variety of inclusion and exclusion criteria and present here a post hoc analysis of the best-case search results for two categories (see ESI†
and Supplemental Data
). To score the searches, a drug was counted in the top 5 or 25 even if it was the drug used to search the database (to aid in producing best case results). HDAC inhibitors are one of the best categories with respect to results, and are featured in the Lamb publication.3
Nonetheless, even within good prediction categories such as the HDAC inhibitors, the HSP90 inhibitor 17-AAG and PI3K inhibitor LY-294002 present substantial false positive issues. 17AAG and LY-294002 are always present in the top 25 results, and they are present more often than Scriptaid (a true HDACi) in the top 5 results. Accordingly, if one were to examine a TPR and FPR given the quantification of the top 5 or 25 results, the top 25 tally for HDACs would yield a best case TPR of 100% with an FPR of 66%. This threshold would allow one to accurately group Scriptaid with the other HDAC inhibitors, but at the expense of also grouping LY-294002 and 17AAG with the HDAC inhibitors (). Using the top 5 as a quantification test eliminates these false positives, but concomitantly also eliminates Scriptaid. This results in a best case TPR of 75% and an FPR of 0%. Examining another category, topoisomerase II poison searches appear much poorer than HDAC signatures. The HDACi, trichostatin A, and the topoisomerase I poison camptothecin occur more often than all of the topoisomerase II poisons in the top 25 search results, and all but doxorubicin in the top 5 search results (). For topoisomerase II poisons this generates a top 25 TPR of 0% and FPR of 100%, or a top 5 TPR of 20% and FPR of 0%. These results display a clear resolution gap between the Connectivity Map approach and the previously examined functional RNAi approaches.
Given this performance for the Connectivity Map, we wondered whether our previous successes in similar drug categories could be attributed to the disparate biology associated for functional cell death measurement vs
global mRNA measurement, or instead is a direct attribute of our analysis strategy. Thus, we aimed to improve the Connectivity Map performance for mRNA expression signatures using our drug network analysis method. Since the Connectivity Map data clusters by batch rather than by drug mechanism-of-action3
we took a “bottom-up” approach to feature selection (a deviation from our original technique in ref. 7
) (). We seek to select n features from each drug subnetwork based upon the entire drug mechanism of action subnetwork, and not single agents. To do this we add 1, 10, or 100 highly predictive mRNA subnetwork features at a time. To train the model, we added features for topoisomerase II poisons, topoisomerase I poisons, alkylating agents, nucleotide depletion agents, and spindle poisons. As we iteratively added features we assessed the leave-one-out cross-validation (LOOCV) across only these training classes. We found that 10 mRNAs per training category could adequately train a K
-NN classifier across all five training categories ().
Based on this training of the CMap dataset using our approach, we asked whether “New” mechanisms-of-action might now be identified, and with statistical and biological generalization as we earlier proved for RNAi and potency modulation datasets. To pose a stringent test set, we selected four drug categories: PI3K inhibitors (wortmanin), HDAC inhibitors (trichostatin A, valproic acid), MTOR inhibitors (sirolimus) and HSP90 inhibitors (geldanamycin, alvespimycin), as well as negative control microarrays chosen randomly across CMap batches that could be assumed to be random hits with no common mechanism. The latter should represent “New” mechanisms-of-action for which no partner compound exists in the limited set of 5 mechanisms of action that we trained our mRNA classifier on. All of the test set compounds were predicted to have “New” mechanisms-of-action (Fig. S5
), so we went on to perform network consensus clustering with the goal of finding new subnetworks. Network consensus clustering was able to identify four clusters corresponding to PI3K, HDAC, and HSP90 inhibitors. Supervised refinement reinforced this cluster selection and did not alter the subnetwork assignments (). As two of these clusters were uniformly comprised of experimental replicates of wortmanin, they were merged. Importantly, PI3K and HSP90 inhibitors – which had represented false positives for HDAC inhibitors when using the Connectivity Map’s analysis approach – were properly separated in our network consensus solution.
Fig. 5 Examining statistical and biological generalization in the Connectivity Map data. (A) The hierarchical clustering of “New” drugs from the Connectivity Map data. Connectivity Map instance ids are appended to the front of the drug name. (more ...)
To probe statistical and biological generalization, we updated the training set with wortmanin, alvespimycin, geldanamycin, trichostatin A, and valproic acid, and found that we could produce a statistically generalized result. We accurately predicted mitoxantrone and ellipticine as topoisomerase II poisons and podophylotoxin (along with others), as spindle poisons. Further, we found that our analysis produces a Connectivity Map mRNA signature that is biologically generalizable. We could use the same updated model that was trained on only 5 drug categories to accurately predict the identity of LY-294002, SAHA, Scriptaid, and 17AAG (examples of PI3k, HDAC, and HSP90 inhibitors respectively) while still recognizing the randomly chosen mRNA signatures from diverse batches as “New” drugs. Using our analysis approach with the Connectivity Map mRNA data, we can achieve a TPR of 90% and an FPR of 4% at a p value of 0.1 (). Furthermore, LY-294002, scriptaid, and 17AAG, which presented difficult tradeoffs when using the Connectivity Map methodology, were correctly predicted using our analysis.
Because we were able to enhance analysis of the Connectivity Map dataset for anti-cancer drug discovery, we sought an explanation for our methodological advance. It has been previously shown that the distribution of noise across a microarray compendium dataset can be highly correlated with the distribution of noise across experimental replicates.6
To examine the noise distribution in the Connectivity Map dataset, we plotted the error within a drug category against the error across the entire subset of the data we examined. Reminiscent of the Hughes compendium paper, the Connectivity Map yields a within-category error distribution that maps predominantly linearly with the error distribution across the data set (). This suggests that the majority of the error is due to experimental issues typically common to microarray measurements. However, some points lie off this distribution; they have a large distribution of error across the dataset, but low error within a drug category. The best Connectivity Map searches (i.e.
, those for HDAC inhibitors) have a larger proportion of search features within this second distribution. Compared to even the best topoisomerase II Connectivity Map search feature sets, there were many more low category noise features in the HDAC search set (feature sets shown are from the Fig. S4
searches) (). Features that were selected using our bottom up approach within the topoisomerase II poisons are dramatically enriched in this low noise section of the data set. Surprisingly, this enrichment extends beyond topoisomerase II poisons to HDAC inhibitors, which they were not trained upon (). We suggest that sampling from this low within category error distribution could be part of the statistical rationale for the better performance of our approach. Furthermore, the high error distribution was enriched for phosphorylation or acetylation keywords, neither of which appeared as functional annotations in our highly predictive feature sets ().
Since we were able to produce surprisingly accurate predictions of drug mechanism of action in the Connectivity Map, we aimed to interpret that predictive ability in a biological manner. Starting with DAVID, an online functional annotation tool, we were able to develop descriptions of the signature set, with respect to gene ontology/keyword designations, encompassing approximately 50% of the probes in the mRNA set. Next we plotted some of the average values across an entire drug category in the mRNA data set for mRNAs that were clearly associated with a consistent biology or GO term (). It is important to note that these interpretations are rationalizations, and only represent a snapshot of the mRNA cell state at the time point that they were taken (6 hours). Surprisingly, features trained on 5 of the 8 total drug mechanism networks generated biological interpretations in all drug categories.
Fig. 7 An examination of the mechanisms of biological generalization. (A) Left: Keywords for the features that were trained solely upon the alkylating agents. Right: A heatshock protein signature heatmap where entries are the average rank for a gene across all (more ...)
For the alkylating agents, functional annotation revealed many terms related to protein stress and heat shock. A heat shock protein (HSP) response representing three Affymetrix probes that measure heat shock protein mRNA levels (HSP70A/B and HSP105) appeared to be an analog gauge, seemingly measuring the drug induced stress response state in a continuous fashion across the various mechanism of action categories in the dataset (). A similar analog gauge was observed for the transcriptional regulation GO term genes identified in the topoisomerase II trained feature set () with drug categories displaying differences in the relative ordering of the amount of this transcriptional regulation signature in a continuous fashion.
Further inspection of the data revealed that in order to provide a broader interpretation of two of the training categories, we had to combine them. The nucleotide depletion and spindle poison features appear to form a molecular signature of cell cycle state. 9 of 10 Spindle poison features were associated with a mitotic cell cycle state, and included canonical mitotic kinases, as well as regulators of the mitotic checkpoint and cytokinesis. Spindle poisons exhibited clear evidence of mitotic arrest, with all of these 9 probes greatly upregulated. Nucleotide depletion agents (GO term: DNA replication) exhibited clear evidence of an S-phase arrest (Primase and CPT1 are involved in lagging strand synthesis and pre-replication complex assembly respectively and are highly upregulated). Furthermore, adding more resolution, the G1/S cyclin Cyclin E2 appeared most strongly in the S phase arrested nucleotide depletion category (consistent with a peak in expression at S-phase). The presence of Cyclin E2 in the absence of the core replication components (as well as the absence of the mitotic signature) suggested a G1 cell cycle phase for Alkylating agents and HDACs. Absence of Cyclin E2, and mitotic genes would suggest topoisomerase I and topoisomerase II poisons are arresting the MCF7 cells in G2 by 6 hours. Interestingly the heterogeneous nature of the PI3K data may suggest asynchronous cells at the time the mRNA measurements are taken (). Importantly these cell cycle rationalizations are supported by the literature since trichostatin A has been shown to induce Cyclin D1 degradation and a G1 arrest in MCF-7 cells.20
In contrast, the most obvious feature in the RNAi dataset is the presence or absence of a functional signature of DNA damage (p53 + CHK2). Within the DNA damage category, the character of the response (i.e., whether non homologous end joining (DNAPK) is necessary for repair or apoptosis) seems to add predictive value. Furthermore, the lack of p53 and the presence of BCL-2 family members helps predictive resolution outside of DNA damage category in RNAi datasets. Thus in the RNAi dataset, the type of cell death appears to be interpretable as a set of sensitivities that is either present or absent, whereas the Connectivity Map contains an mrna impression of cell cycle state and a cell stress/transcription gauge ().