Home  About  Journals  Submit  Contact Us  Français 
Traditional drug discovery practice usually follows the “one drug – one target” approach, seeking to identify drug molecules that act on individual targets, which ignores the systemic nature of human diseases. Pathwaybased drug discovery recently emerged as an appealing approach to overcome this limitation. An important first step of such pathwaybased drug discovery is to identify associations between drug molecules and biological pathways. This task has been made feasible by the accumulating data from highthroughput transcription and drug sensitivity profiling. In this paper, we developed “iPaD”, an integrative Penalized Matrix Decomposition method to identify drugpathway associations through jointly modeling of such highthroughput transcription and drug sensitivity data. A scalable biconvex optimization algorithm was implemented and gave iPaD tremendous advantage in computational efficiency over current stateoftheart method, which allows it to handle the evergrowing largescale data sets that current method cannot afford to. On two widely used real data sets, iPaD also significantly outperformed the current method in terms of the number of validated drugpathway associations that were identified. The Matlab code of our algorithm publicly available at http://licongjason.github.io/iPaD/
Identification of potential therapeutic drug molecules for a disease of interest, which is known as “drugdiscovery”, is a critical step in new drug development. Traditional drug discovery practice in the pharmaceutical industry usually follows the “one drug – one target” approach, which seeks to identify drugs that act on individual molecules that are involved in the pathological process [1]. However, this approach ignores the systemic nature of human diseases and may account for a significant increase in the attrition rate of new candidate drugs due to low efficacy and serious side effects [2], [3]. Specifically, the pathological process of a disease usually involves extremely complex interplays between many functionally related biological molecules within certain diseaserelated pathways [4]. Targeting a single component in a diseaserelated pathway may not provide sufficient interference to the whole pathological process, which may result in unsatisfactory therapeutic efficacy. Moreover, it does not allow researchers to investigate the effects of drug molecules at a systems level, limiting its utility in evaluating drug safety and toxicity in early drug development stages [5].
Pathwaybased drug discovery recently emerged as an appealing approach to overcome the limitations of the one drug – one target approach [6]–[8]. Instead of focusing on the interference of drug molecules on individual targets, pathwaybased analysis seeks to understand the effect of a drug molecule on a whole biological pathway. The knowledge about the effects of drug molecules upon the diseaserelated pathways would allow researchers to identify more effective candidate drugs. However, currently not much is known about the complex relationships between drug molecules and biological pathways, and an important first step is to infer associations between them before delving into their complex relationships.
Highthroughput transcription and drug sensitivity profiles have become routine products of pharmacogenomics studies [9]–[11]. For example, the NCI60 project [12], [13] performed whole genome transcription profiling of 60 cancer cell lines and measured their sensitivities to several thousands of drug molecules. Assuming the transcription and drug sensitivity levels are both associated with the activity levels of certain biological pathways, joint modeling of these two types of data could provide valuable information regarding the pathway activities, drugpathway and genepathway relationships. With this mindset, [14] proposed an integrative statistical framework named “iFad”, which is built on a Bayesian sparse factor analysis model, to infer drugpathway associations by jointly analyzing the gene expression data and drug sensitivity data. They applied their method to the NCI60 gene expression and drug sensitivity data set and identified significant drug associations for eight types of cancers separately. Despite its satisfactory power in identifying drugpathway associations, iFad performs statistical inferences using Markov Chain Monte Carlo (MCMC), which is very computationally expensive. Moreover, its results might be sensitive to several prior parameters that need to be specified by the users.
Besides the NCI60 project, a more recent effort to characterize the transcription and drug sensitivity profiles of various cancer cell lines is the Cancer Cell Line Encyclopedia (CCLE) project [15]. Compared with the NCI60 data set, a noteworthy feature of the CCLE data set is its large sample size – more than 400 cancer cell lines have been profiled for their transcription levels and drug sensitivities. Analyzing such a large data set using iFad would incur very huge computational cost. In fact, in [14], iFad was only used to analyze each cancer type separately (each cancer type has only 6 ~ 9 samples). Analyzing all the cell lines together would cost tens to hundreds of hours depending on the number of MCMC iterations. Apparently, the everincreasing data sizes of paired transcription and drug sensitivity data sets present an pressing need for more computationally efficient methods for drugpathway association analysis. In this article, we introduce “iPaD”, an integrative Penalized Matrix Decomposition method for drugpathway association analysis. Compared with iFad, our method has two unique features: 1) the association analysis is formulated into a biconvex optimization problem which can be solved with a very efficient algorithm and hence greatly reduces the computational cost compared to the MCMC algorithm in iFad; 2) our method is almost tuningfree – the only tuning parameter can be easily determined through crossvalidation. We demonstrated the practical utility of our method on two relatively large real data sets, the pooled NCI60 data set with all eight types of cancers and the CCLE data set. We also compared power to detect drugpathway associations between iPaD and iFad through comprehensive simulations. We organize the rest of this article as follows. We explain the iPaD model and its optimization algorithms in the next section, followed by real data applications and results for simulation studies. We summarize the conclusions in the final section.
Consider N samples (usually cell lines) with both transcription and drug sensitivity profiles available. Let ${Y}^{(1)}\in {\mathrm{\mathbb{R}}}^{N\times {G}^{(1)}}$ and ${Y}^{(2)}\in {\mathrm{\mathbb{R}}}^{N\times {G}^{(2)}}$ denote the transcription and drug sensitivity profiles respectively, where G^{(1)} is the number of genes that are measured and G^{(2)} is the number of drug molecules that are assayed. We consider the activity levels of biological pathways as a set of common latent factors that underly the gene expression and drug sensitivity profiles. The activity levels of K pathways for the N cell lines are denoted as a matrix X ^{N}^{×}^{K} and we use B^{(1)} and B^{(2)} to denote the loading matrices for pathwaygene associations and pathwaydrug associations. Then the gene expression and drug sensitivity profiles are modeled as follows:
where ${E}^{(1)}\in {\mathrm{\mathbb{R}}}^{N\times {G}^{(1)}}$ and ${E}^{(2)}\in {\mathrm{\mathbb{R}}}^{N\times {G}^{(2)}}$ are residuals. Instead of adopting a Bayesian inference approach, we seek to estimate the latent factors X and the loading matrices B^{(1)} and B^{(2)} by minimizing the following sum of squared residuals,
where ‖·‖_{F} stands for the Frobenius norm, i.e. ${\Vert A\Vert}_{F}=\sqrt{{\displaystyle {\sum}_{i}{\displaystyle {\sum}_{j}{A}_{ij}^{2}}}}$. We introduce an indicator matrix ${L}^{(1)}\in {\left\{0,1\right\}}^{K\times {G}^{(1)}}$ to represent the prior knowledge about genepathway relationships. Specifically, we let ${L}_{i,j}^{(1)}=1$ indicate that the jth gene is involved in the ith pathway and ${L}_{i,j}^{(1)}=0$ otherwise. Because genepathway relationships are relatively well defined in the public databases compared to drugpathway relationships, we assume that the known genepathway relationships are complete and accurate and focus on identifying novel drugpathway associations. Moreover, B^{(2)} matrix is typically sparse because a drug molecule is usually only associated with a few pathways, and vice versa. Based on these two assumptions, we formulate the following optimization problem:
where ‖·‖_{1} stands for the _{1} norm, i.e. ‖A‖_{1} = Σ_{i} Σ_{j} A_{i j}. This _{1} norm penalty is an effective and popular device to achieve sparse solutions in highdimensional regression which was first introduced in the socalled “lasso” (Least Absolute Shrinkage and Selection Operator) regression [16], which attempts to solve a highdimensional regression problem while applying an _{1} norm constraint on its solution. With such a constraint, some coefficients in the solution may be shrunk to exact zeros. The constraint on B^{(1)} is used to incorporate known genepathway relationships. The constraint on the columns of X ensures the identifiability of the model.
We note that problem (2) is a biconvex optimization problem, i.e., when X is fixed, optimizing B^{(1)} and B^{(2)} are both convex problems and when B^{(1)} and B^{(2)} are fixed, optimizing X is also a convex problem. This property naturally motivates the following alternating optimization algorithm (Algorithm).
Next, we describe the optimization algorithm for X, B^{(1)} and B^{(2)} separately.
The optimization problem for X can be rewritten as follows:
where Y = [Y^{(1)} Y^{(2)}] and B = [B^{(1)} B^{(2)}]. We used an iterative projected gradient descent algorithm to solve this problem. Specifically, at each iteration, the gradient of the objective function is calculated and we let X take a step along the negative direction of the gradient and projected it to the feasible region: $\{X:{\sum}_{i}{X}_{i,j}^{2}\le 1,\forall j=1,\dots ,K\}$. We use the Nesterov’s method to accelerate its convergence [17]. In Nesterov’s method, after each gradient descent iteration, the solution is moved further in the direction and step size given by the the last two steps, achieving a convergence rate of order 1/t^{2} instead of order 1/t of the ordinary gradient descent method.
Algorithm I: The alternating optimization algorithm for iPaD 

Data Input: Y^{(1)}, Y^{(2)}, L^{(1)} 
Parameter: λ 
Initialization: Set B^{(1)} = L^{(1)} and set B^{(2)} = 0 

Optimization: 
1. Optimize X: $\begin{array}{ll}X=\underset{X}{\mathrm{arg}min}& {\Vert {Y}^{(1)}X{B}^{(1)}\Vert}_{F}^{2}+{\Vert {Y}^{(2)}X{B}^{(2)}\Vert}_{F}^{2}\\ \phantom{\rule{2em}{0ex}}\text{subjectto}& {\displaystyle \sum _{i}{X}_{i,j}^{2}\le 1,\forall j=1,\dots ,K}\end{array}$ 
2. Optimize B^{(1)}: $\begin{array}{ll}{B}^{(1)}=\underset{{B}^{(1)}}{\mathrm{arg}min}& {\Vert {Y}^{(1)}X{B}^{(1)}\Vert}_{F}^{2}\\ \phantom{\rule{2.7em}{0ex}}\text{subjectto}& {B}_{i,j}^{(1)}=0,\forall (i,j):{L}_{i,j}^{(1)}=0\end{array}$ 
3. Optimize B^{(2)}: ${B}^{(2)}=\underset{{B}^{(2)}}{\mathrm{arg}min}\phantom{\rule{0.5em}{0ex}}{\Vert {Y}^{(2)}X{B}^{(2)}\Vert}_{F}^{2}+\mathrm{\lambda}{\Vert {B}^{(2)}\Vert}_{1}$ 
4. Repeat steps 1, 2 and 3 until convergence. 

The optimization of B^{(1)} is relatively trivial because we can decompose the problem into G^{(1)} separate ordinary least squares (OLS) problems by observing that each column of B^{(1)} can be optimized separately:
where ${Y}_{:,g}^{(1)}$ refers to the gth column of Y^{(1)}, ${X}_{:,{L}_{:,g}^{(1)}}$ refers to the submatrix of X consisting of the columns corresponding to the nonzero entries of ${L}_{:,g}^{(1)}$ and ${B}_{{L}_{:,g}^{(1)},g}^{(1)}$ refers to a subvector of the gth column vector of B^{(1)} corresponding to the nonzero entries of ${L}_{:,g}^{(1)}$.
Similar to B^{(1)}, we observe that each column of B^{(2)} can also be optimized separately, which decomposes the original problem into G^{(2)} separate lasso regressions:
We use the readily available and very efficient coordinate descent algorithm to solve each lasso problem [18], [19].
We note that this setup does not allow incorporating known drugpathway associations. When such prior knowledge is available and needs to be incorporated, we modify problem (5) as follows,
where ${L}^{(2)}\in {\{0,1\}}^{K\times {G}^{(2)}}$ is the indicator matrix representing the known drugpathway associations similarly to L^{(1)}.
In this section, we describe how missing values in the transcription data and drug sensitivity data are dealt with in iPaD. As mentioned earlier, when optimizing B^{(1)} and B^{(2)}, each column of these two matrices can be solved separately. Therefore, missing values can be simply removed from each column of Y^{(1)} and Y^{(2)} while solving each column of B^{(1)} and B^{(2)}.
Optimizing X is not straightforward because it needs to be solved as a whole matrix. Suppose that the observed values in Y are indexed by $\mathrm{\Omega}\in {\{0,1\}}^{N\times ({G}^{(1)}+{G}^{(2)})}$. We define an operator ${\mathcal{P}}_{\mathrm{\Omega}}$ that projects matrix X onto the linear space supported by Ω,
Then when Y^{(1)} or Y^{(2)} contains missing values, the optimization problem for X can be written as:
Define Ω_{} = 1 – Ω as the index matrix for the missing values in Y and rewrite the objective function in problem (7):
This simple observation motivates the following iterative algorithm to solve problem (7), in which the estimate of X at each iteration is then plugged into ${\mathcal{P}}_{{\mathrm{\Omega}}_{\perp}}(XB)$ for optimizing (8) in the next iteration (Algorithm 2). This is essentially the spirit of the softimpute algorithm from the incomplete matrix learning problem [20].
We prove that Algorithm 2 indeed gives the global optimal solution for problem (7) in the Supplementary Materials.
Algorithm II: Optimizing X with missing values in Y 

Data Input: Y, B, Ω 
Initialization: Initialize X 

Optimization: 
1. $\text{Set}\phantom{\rule{0.2em}{0ex}}Y={\mathcal{P}}_{\mathrm{\Omega}}(Y)+{\mathcal{P}}_{{\mathrm{\Omega}}_{\perp}}(XB)$ 
2. Optimize X: $\begin{array}{ll}X=\underset{X}{\mathrm{arg}min}\hfill & {\Vert YXB\Vert}_{F}^{2}\hfill \\ \phantom{\rule{1.8em}{0ex}}\text{subjectto}\hfill & {\displaystyle \sum _{i}{X}_{i,j}^{2}\le 1,\forall j=1,2,\dots ,K}\hfill \end{array}$ 
3. Repeat steps 1 and 2 until convergence. 

In our method, λ is a critical parameter that controls the sparsity of the drugpathway association matrix, B^{(2)}. Similar to many other penalized methods such as lasso, one way to assess the relative importance of the coefficients in B^{(2)} is to solve the problem for a decreasing sequence of λ values and record the order of the coefficients in which they became nonzero – the more important coefficients are supposed to become nonzeros earlier than the less important ones. However, this procedure does not allow evaluation of the significance of the coefficients. Alternatively, we can first find an appropriate value of λ and then perform permutation test to assess the significance of the coefficients.
We use tenfold crossvalidation to find an appropriate λ value. Specifically, the Y^{(2)} matrix is evenly partitioned into n folds as illustrated in Figure 1. In each round of crossvalidation, each of the ten folds is masked as missing values while the other entries are treated as observed values. Then we solve the iPaD problem for a sequence of λ values. The residual sum of squares (RSS) of the masked values for the ten folds are evaluated for each λ. The λ that gives the smallest RSS is chosen. Given the λ value, the significance of the coefficients in B^{(2)} can be assessed via permutation test. In each permutation, the rows in Y^{(2)} are shuffled whereas Y^{(1)} is kept unchanged. After the estimates of the B^{(2)} in the permuted data sets are obtained, the pvalue of each coefficient in the B^{(2)} matrix is calculated as follows,
where T is the total number of permutations, ${\stackrel{\sim}{B}}^{(2)(t)}$ is the B^{(2)} estimate in the tth permutation and ${\widehat{B}}^{(2)}$ is the B^{(2)} estimate in the original data.
We applied iPaD to analyze the same NCI60 data set that was analyzed in [14]. The detailed data preprocessing procedure can be found in their paper. The processed data set consists of 57 cell lines from eight different cancer types with gene expression levels of 1863 genes covering 58 KEGG pathways [21] and drug sensitivity profiles for 101 chemical compounds/drugs as measured by GI_{50} values, which is a commonly used measure for drug potency and is defined as the concentration required for 50% of maximum cell growth inhibition. The genepathway association matrix L^{(1)} has a density of 3.95% whereas the drugpathway association prior knowledge matrix L^{(2)} has a density of 0.51%. [14] advocated analyzing the eight types of cancer cell lines separately by demonstrating the heterogeneity among the inferred drugpathway association patterns across different cancer types. Therefore, we first used our method to analyze the eight types of cancer separately. However, we did not find much consistency between our results and the results reported in [14] (results not shown). A possible explanation is that the very limited sample size (only 6 ~ 9 cell lines per cancer type) for each cancer type probably could not provide reliable statistical inference. Then we pooled the data for the eight cancer types and analyzed them using both iFad and iPaD. For iFad, we ran 10,000 iterations with the first 7,000 iterations discarded as burnin. MCMC samples were recorded every tenth iteration. Both η_{0} and η_{1} parameters were set to zero for genepathway associations. η_{1} was set to zero and η_{0} was set at 0.05 for drugpathway associations. As for iPaD, we first used tenfold crossvalidation to choose the λ value and then performed 10,000 permutations to evaluate the significance levels of the identified associations. For the purpose of comparison, we also analyzed the data set with a Gene Set Enrichment Analysis (GSEA) method. Specifically, for each drug, we tested for its association with each gene using a simple linear regression and ranked the genes according to their significance levels of association. Then we performed a KolmogorovSmirnov test for the enrichment of significant genes in each pathway.
Due to the stochastic nature of the MCMC algorithm in iFad, we repeated the analysis for five times. On average, 123 drugpathway pairs had posterior probabilities greater than or equal to 0.9. Among them, 25.2 associations were validated in the CancerResource database [22], which was also used in [14] to validate their inferred drugpathway associations. Whereas for iPaD, although the optimization algorithm itself is deterministic, the permutation test is stochastic. Therefore we also ran iPaD for five times. On average, 277.2 associations had pvalues less than or equal to 0.05, with 81 validated in the CancerResource database. If we set the pvalue cutoff at 0.005, 59.4 associations were identified, with 20.2 associations validated in the CancerResource database. Among the top 100 associations identified by GSEA, 23 were validated. Apparently, the drugpathway pairs identified by iPaD are more enriched for validated associations than those identified by the other two methods. To better compare the results from these two methods, we performed the following analysis. First, the list of all drugpathway pairs were ranked based on their significance in both iFad, iPaD and GSEA results. Then we slid a bin that consists of 20 consecutive drugpathway pairs from the top to the bottom of the list. For example, the first bin consists of the 20 most significant drugpathway pairs and the next bin starts from the second most significant drugpathway pair and runs through the 21st most significant pair, and so on. Then the number of validated drugpathway pairs in each bin was counted. The results averaged for the five runs for the first 100 bins were shown in Figure 2a. Clearly, among the top drugpathway pairs identified by the three methods, iPaD had higher enrichment for validated associations. The top 50 drugpathway pairs for each run were listed in Table S1–S5 and Table S6–S10 for iFad and iPaD respectively. We also list the top 100 drugpathway associations identified through GSEA in Table S11.
We also applied both iFad and iPaD to the data set from the CCLE project. This data set contains 480 cancer cell lines with transcription profiles for 18,988 genes and drug sensitivities profiles for 24 chemical compounds/drugs. The drug sensitivities were measured by EC_{50}, IC_{50}, maximum activity (‘A_{max}’) and area over the doseresponse curve (‘activity area’). Like GI_{50}, both EC_{50} and IC_{50} are commonly used measures of drug potency. ‘A_{max}’, instead, is a measure of drug efficacy. In this paper, we chose the ‘activity area’ as our measure for drug sensitivity due to two considerations: 1) it simultaneously captures both drug potency and efficacy; 2) it had much fewer missing values than the other measures in the CCLE data set. We took logtransformation of both the transcription and drug sensitivity profile data before further analysis. Because this is also a cancerspecific data set, the genes in the CCLE data set were intersected with the 1,863 genes used in [14] and only 1,802 genes were left. The same 58 KEGG pathways used in [14] were also used in our analysis. The CCLE data set also provided the known targets of each of the 24 drugs. However, the target genes of two drugs were not annotated in any KEGG pathways and thus we were excluded them from our analysis. For the remaining 22 drugs, a pathway is defined as the target of a drug if any member of the pathway is a target of the drug. The known drugpathway associations were not used as prior knowledge but as postanalysis validations. Because the CCLE data set is relatively large, running 10,000 iterations for iFad is infeasible in practice. Therefore we ran only 2,000 iterations, with the first 1,000 iterations discarded as burnin. Correspondingly, only 2,000 permutations were performed in the iPaD analysis. However, the estimated pvalues are of limited resolutions with only 2,000 permutations. To overcome this limitation, we estimated the pvalues by approximating the null distribution of each coefficient as a mixture of a point mass at zero and a normal distribution.
Again, the analysis was repeated five times for iPaD and iFad. Moreover, we also analyzed the CCLE data set using the GSEA approach described earlier. On average, iFad identified 39.4 drugpathway pairs that had posterior probabilities no smaller than 0.9, with 4.8 of them being validated associations. Whereas for iPaD, 82 drugpathway pairs had pvalues no greater than 0.05 and 23 pairs were among the validated associations. Again, iPaD outperformed iFad in identifying the validated associations. As for GSEA results, among the top 100 associations, only 1 was validated. As a more comprehensive comparison of the results of the iFad and iPaD, we also performed similar analysis as we did on the NCI60 results. The enrichment curves for the first 100 bins for the two methods are shown in Figure 2b. The top 10 associations identified by iPaD were listed in Table 1. In the iPaD results, two EGFR inhibitors, Erlotinib and Lapatinib are associated with the ErbB signaling pathway. It is well known that EGFR is a key component of the ErbB signaling pathway and the latter plays a key role in the regulation of cell proliferation, migration, differentiation, apoptosis. Misregulations of these activities are hallmarks most cancer cells. In addition, two Raf kinase inhibitors, Sorafenib and PLX4720 are associated with the Chronic myeloid leukemia pathway. In the Chronic myeloid leukemia pathway, Raf kinase is a critical player that regulates the MAPK signaling pathway and hence controls cell proliferation. We also note that Panobinostat, a histone deacetylase (HDAC) inhibitor, was found to be associated with the regulation of actin cytoskeleton pathway. Although HDAC and cytoskeleton are seemingly unrelated, we found that a study published in 2012 [23] suggested that Panobinostat could also induce hyperacetylation of tublin, a major component of the microtubule cytoskeleton, and affects microtubule dynamics and organization. As for the iFad results, there was not much consistency between the top associations from the five repeats. The identified drugpathway pairs for each run are listed in Table S12–S16 and Table S17–S21 for iFad and iPaD, respectively. We also list the top 100 drugpathway associations identified through GSEA in Table S22.
We note that although pooling together data from different cancer types allows us to increase the sample size and better identify the pancancer common signals, it may also dilute the signals that are specific to certain cancer types. Therefore we also performed an analysis on the lung cancer cell lines of the CCLE data set. We chose lung cancer between it has the largest sample size (N=91) among all the cancer types. Similarly, iFad, iPaD and GSEA were separately used to analyze this data set. Similar to previous results, iPaD achieved the highest enrichment for validated associations, followed by iFad, and then GSEA. The enrichment curves for the three methods are shown in Figure 2c. The 15 associations with pvalues smaller than 0.01 in iPaD were listed in Table 2. Similar to previous results, EGFR inhibitors such as Lapatinib and Erlotinib continue to appear in validated association. In addition, multikinase inhibitor Sorafenib was found to be associated with the Focal adhesion pathway. Focal adhesions are subcellular structures that link cells to the extracellular matrix. Several studies have already reported the upregulation or activation of focal adhesion kinases in human lung cancer cells [24], [25]. In the focal adhesion pathway, the focal adhesion kinase is regulated by several target kinases of Sorafenib such as KDR, FLT4, PDGFRB and so on. A more comprehensive list of the identified associations on this lung cancer data set is in the supplementary materials (Table S23–S33).
The permutation test allows us to evaluate the significance of the identified drugpathway associations at the expense of substantially increased computational cost. However, we note that such cost is still much more affordable than the MCMC in iFad. For example, on the NCI60 data set, 10,000 MCMC iterations cost around 230 hours for iFad. Whereas for iPaD, even with 10,000 permutations, it took only about 30 hours. For the CCLE data set, 2,000 MCMC iterations cost about 150 hours. Yet it took only 6 hours on average to run 2,000 permutations with iPaD. All the computation times in this article were estimated on a standard laptop computer (2.4GHz dual core CPU with 4G memory running the Mac OS X 10.9 operating system). In addition, if the permutation test is still not affordable, iPaD could offer an alternative way to evaluate the relative importance of the identified associations without performing the costly permutation test. When a decreasing sequence of λ values are solved, the order in which the drugpathway association coefficients become nonzero can be used as indications of their relative importances, i.e., the coefficients that becomes nonzeros at larger λ values can be considered as more important. In the real data applications, we also plotted the enrichment curve obtained through this procedure in Figure 2 (denoted as ‘iPaD_seq’). Compared with their rankings of pvalues, the drugpathway pairs may be ranked differently using this procedure. But it is noteworthy that it still achieved better overall enrichment for validated associations than iFad on the two real data sets. Moreover, the computational cost was further greatly reduced – it took only about 15 and 30 minutes to obtain the rankings of the first 50% of the drugpathway pairs for the NCI60 data set and the CCLE data set, respectively.
We first generated a pathway activity level matrix X ^{N}^{×}^{K} with each entry simulated from a standard normal distribution independently. To better mimic real applications, we generated an indicator matrix ${L}^{(1)}\in {\{0,1\}}^{K\times {G}^{(1)}}$ by randomly choosing K pathways from the 58 KEGG pathways used in the previous section, where G^{(1)} is the number of unique genes that are involved in any of the K pathways. Given a predefined η value that controls the sparsity of B^{(2)}, we generated another indicator matrix ${L}^{(2)}\in {\{0,1\}}^{K\times {G}^{(2)}}$ by randomly choosing ηKG^{(2)} entries to be ones. After these two indicator matrices were generated, we simulated ${B}^{(1)}\in {\mathrm{\mathbb{R}}}^{N\times {G}^{(1)}}$ and ${B}^{(2)}\in {\mathrm{\mathbb{R}}}^{N\times {G}^{(2)}}$ by simulating their entries corresponding to the ones in their indicator matrices from independent standard normal distributions while keeping other entries as zeros. Finally, we simulated noise matrices ${E}^{(1)}\in {\mathrm{\mathbb{R}}}^{N\times {G}^{(1)}}$ and ${E}^{(2)}\in {\mathrm{\mathbb{R}}}^{N\times {G}^{(2)}}$ based on a predefined signaltonoise ratio (SNR). Specifically, we first obtained ${\stackrel{\sim}{Y}}^{(1)}=X{B}^{(1)}$ and ${\stackrel{\sim}{Y}}^{(2)}=X{B}^{(2)}$ and calculated the variance of each column of E^{(1)} and E^{(2)} as $\text{VAR}({E}_{:,g}^{(1)})=\text{VAR}({\stackrel{\sim}{Y}}_{:,g}^{(1)})/\text{SNR}$ and $\text{VAR}({E}_{:,g}^{(2)})=\text{VAR}({\stackrel{\sim}{Y}}_{:,g}^{(2)})/\text{SNR}$. Then each column of the two noise matrices was simulated from a normal distribution with zero mean and its calculated variance. The simulated transcription data and drug sensitivity data were obtained by ${Y}^{(1)}={\stackrel{\sim}{Y}}^{(1)}+{E}^{(1)}$ and ${Y}^{(2)}={\stackrel{\sim}{Y}}^{(2)}+{E}^{(2)}$. When a whole column of Y^{(2)} is zero (this will happen when a drug has no target pathway), we use the sample variance of all the entries in the nonzero columns as its variance and simulate its entries from a meanzero normal distribution.
After the simulation data were generated, we used both iPaD and iFad to analyze the data and measured their performances by the AUC (area under the receiver operating characteristic curve). For both methods, the genepathway relationships in L^{(1)} matrix was treated as prior knowledge whereas L^{(2)} was treated as unknown. For iPaD, we solved a decreasing sequence of λ values until 70% of the coefficients in B^{(2)} were nonzero and compared the order in which the coefficients in B^{(2)} became nonzero against their true association status in L^{(2)} to obtain the ROC (receiver operating characteristic) curve and then calculated the area under this curve (AUC). For iFad, both η_{0} and η_{1} parameters were set to zero for genepathway associations so that their association statuses were fixed during the MCMC iterations. The η_{1} parameter was set to 0 for drugpathway associations so that the known associations were never “turned off” whereas the η_{0} for drugpathway associations were set to 0.1 to allow novel associations to be inferred. The default values were used for the other parameters. The total number of MCMC iterations was set to 500 (burnin = 250). MCMC samples were recorded every tenth iteration. The AUCs for iFad were calculated by comparing the association posterior probabilities for the drugpathway pairs and their true association statuses. We only ran 500 iterations for two reasons: 1) the AUC almost plateaued at 500 iterations; 2) the computational cost is too high. Because the performance of iFad also depends on the number of MCMC iterations, we also evaluated its performance with 10 and 50 iterations. For the purpose of comparison, we also analyzed the simulated data set with GSEA.
The performances in terms of AUC and comprehensive time cost profiles of three methods are shown in Table 3 and and44 (the time cost for GSEA is not shown because its AUC is not competitive). We let the sample size N = 480, the number of drugs G^{(2)} = 22, the number of pathways K = 58, η = 0.1, and varied the SNR from 0.1 to 1. We also evaluated the effect of sample size by fixing SNR = 0.5 and varied N from 120 to 480. As expected, the AUC increases as the sample size and SNR increase. We also simulated data with unequal SNRs for transcription levels and drug sensitivities and with different η values. The AUCs of iFad and iPaD are always similar to each other regardless of the sample size and SNR when we run enough number of iterations for iFad. Both methods substantially outperformed GSEA. Although the performances of the two methods were similar, iPaD cost much less time than iFad. For example, when N = 480, it took about 5 minutes to run iPaD on the simulated data set. Whereas 500 MCMC iterations would cost more than one day for iFad. Even only 10 iterations would cost as much as half an hour for iFad.
We note that although the two methods had similar performances on the simulated data sets, iPaD surprisingly achieved higher enrichment for validated associations in the real data applications. One possible explanation is that the MCMC algorithm in iFad did not converge with the number of MCMC iterations used in our analyses. However, further increasing the number of iterations will make the computation time even more unaffordable in practice. Another explanation is that the default prior parameters of iFad might not be appropriate for the real data sets in our analyses. If this is the case, the results might indicate that iFad is not very robust with respect to the choice of its prior parameters in practice. It could also be explained by the potential difference between the simulation settings and the unknown nature of the real data sets. Finally, it might also be the case that the known drugpathway associations are not 100% complete and accurate. We admit that these associations are probably not complete, which may limit the interpretability of the results to some extent. But we think their accuracy should be relatively reliable since the drug targets are either well established in the literature or carefully curated in the specific databases by experts. Therefore we argue that our results still provide compelling evidence that iPaD performs better than iFad on the two real data sets.
Various highthroughput experiment data have become increasingly available to facilitate drug target identification, which have also posed a major challenge for effectively integrating these different types of data. We have developed a method called iPaD, which is based on penalized matrix decomposition, to jointly analyze transcription and drug sensitivity data to identify drugpathway associations. Thanks to its scalable biconvex optimization algorithm, iPaD has tremendous advantage in computational efficiency over the existing stateoftheart method called iFad, which has been demonstrated through both simulation studies and real data applications. Another notable advantage of iPaD is that only one parameter needs to be tuned and we have derived an efficient crossvalidation procedure to tune this parameter. Yet iFad has quite a few prior parameters that need to be specified by users. In terms of the accuracy of the identified associations, iFad and iPaD had similar performances on simulated data sets. In real data applications, iPaD seemed to outperform iFad, although the assessment might be complicated by the completeness and accuracy of the known associations that were used for validation. We also note for the two data sets we analyzed, the compounds in the CCLE data set are better characterized than those in the NCI60 data set in general. Therefore the known associations are more reliable for the CCLE data set and interpretation of its results is easier than that of the NCI60 data set. Therefore in the real data analysis, we focused on the interpretation of the results from the CCLE data set.
We note that statistical inference/test is notoriously difficult for problems with lassotype penalty and therefore we had to use a permutation test to evaluate the significance levels of the identified associations. Although the permutation test substantially increased the computational cost, iPaD still takes much less time than iFad in the real data applications. In addition, compared with MCMC, permutation is more amenable to parallel computing, and therefore the computation time of iPaD can be further reduced if the permutation procedure was parallelized. Moreover, as demonstrated in both simulation studies and real data applications, a much more computationally affordable and still useful way to evaluate the relative importance the identified drugpathway association is to rank them based on the order in which their coefficients become nonzeros. Indeed, this strategy cannot provide rigorous statistical inference as in iFad. But it should be kept in mind that the posterior probabilities provided by iFad need to be interpreted with caution as they can be affected by the several prespecified prior parameters, especially the prior probability of each drugpathway association coefficient being nonzero.
From a broader perspective, integration of distinct highthroughput data sets has also become a trending theme in biomedical research. For example, in the cancer genomics community, much effort has been spent on integrating highthroughput genomic data generated from various platforms (e.g. somatic mutations data, gene expression data, epigenomics data, etc). A commonly used strategy is also to treat various types of observed data as manifestations of a set of shared underlying latent factors [26], [27]. As these data sets become larger and more heterogeneous, a major task in the future is to develop useful and computationally efficient methods to perform joint analysis upon them. Compared with a previous method for drugpathway association analysis, our method has greatly increased the computational efficiency without much, if any, sacrifice in statistical power. However, as more types of highdimensional data sets become available in the future, their integrative analysis will call for more novel methods to address the forthcoming statistical and computational challenges.
This study was supported by NIH grants GM59507 and CA154295 and NSF grant DMS 1106738. We also thank Yale University High Performance Computing Center (funded by NIH RR19895) for the computation resource and data storage.
Cong Li received his bachelor’s degree in Bioinformatics from Huazhong University of Science and Technology, Wuhan, China in 2010. He also just received his Ph.D. degree in Computational Biology and Bioinformatics from Yale University, New Haven, Connecticut in 2015. His current research interests include statistical genetics and genomics, bioinformatics, pharmacogenomics, highdimensional statistics and data integration.
Can Yang received the bachelor’s and master’s degrees in Automatics from Zhejiang University, China, in 2003 and 2006, respectively. He received the Ph.D. degree in electronic and computer engineering from the Hong Kong University of Science and Technology in 2011. He worked as a postdoctoral associate (2011–2012) and an associate research scientist (2012–2014) at Yale University, New Haven, Connecticut. Now he is working as an assistant professor at department of mathematics, Hong Kong Baptist University. His research interests include statistical learning and bioinformatics.
Greg Hather holds a M.A. in Physics and a Ph.D. in Statistics from the University of California, Berkeley. Since graduating in 2008, he has worked for the healthcare and pharmaceutical industry, primarily supporting drug discovery and manufacturing. His most recent employers include Takeda Pharmaceuticals and Pfizer. His current research interests cover optimal experimental design, empirical Bayes methods, and genomic analysis.
Ray Liu is the head of Statistical Innovation and Consultation group at Takeda Pharmaceutical Company. His group is responsible for providing statistical consultation and project support to various functional areas in pharmaceutical R&D, including Discovery, CMC, Translational Research, and Outcome Research. Ray received his PhD degree from Columbia University. He is a member of ASA, PSTC Stats working group, IQ consortium, and NCBLF. His current research interests include analysis of pharmacogenomics data, semantic analysis and integrated analysis.
Hongyu Zhao is the Ira V. Hiscock Professor of Biostatistics and Professor of Statistics and Genetics, Chair of the Biostatistics Department, and the CoDirector of Graduate Studies of the InterDepartmental Program in Computational Biology and Bioinformatics at Yale University. He received his B.S. in Probability and Statistics from Peking University in 1990 and Ph.D. in Statistics from the University of California at Berkeley in 1995. His research interests are the developments and applications of statistical methods in molecular biology, genetics, drug developments, and personalized medicine. Some of his recent projects include large scale genome wide studies to identify genetic variants underlying complex diseases (Crohns disease, schizophrenia, bipolar, autism, and substance abuse), biological network modeling and analysis, disease biomarker identifications, proteomics, genome annotations, microbiome analysis, and systems biology study of herbal medicine. He has published over 370 articles on statistics, human genetics, bioinformatics, and proteomics, and edited two books on human genetics analysis and statistical genomics. Dr. Zhao is a CoEditor of Statistics in Biosciences, and serves on the editorial boards of several leading statistical and genetics journals. He was the recipient of the Mortimer Spiegelman Award for a top statistician in health statistics under the age of 40 awarded by the American Public Health Association. His research has also been recognized by the Evelyn Fix Memorial Medal and Citation by UC Berkeley, a Basil O’Connor Starter Scholar Award by the March of Dimes Foundation, election to the fellowship of the American Association for the Advancement of Science, the American Statistical Association, and the Institute of Mathematical Statistics.
Cong Li, Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA.
Can Yang, Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA. Department of Mathematics, Hong Kong Baptist University, Hong Kong, China.
Greg Hather, Takeda Pharmaceuticals International Co., Cambridge, MA, USA.
Ray Liu, Takeda Pharmaceuticals International Co., Cambridge, MA, USA.
Hongyu Zhao, Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA. Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA.
PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. 