To assess the quality of the EM training procedure, we compared the convergence of EM using the actual patient data relative to a null dataset in which tuples of gene expression and copy number (E,C) were permuted across the genes and patients. As expected, PARADIGM converged much more quickly on the true dataset relative to the null. As an example, we plotted the IPAs for the gene AKT1 as a function of the EM iteration (). One can see that the activities quickly converge in the first couple of iterations. EM quickly converged to an activated level when trained with the actual patient data, whereas it converged to an unchanged activity when given random data. The convergence suggests that the pathway structures and inference are able to successfully identify patterns of activity in the integrated patient data.
Fig. 4. Learning parameters for AKT1. IPAs are shown at each iteration of the EM algorithm until convergence. Dots show IPAs from permuted samples and circles show IPAs from real samples. The red line denotes the mean IPA in real samples and the green line denotes (more ...)
We next ran PARADIGM on both breast cancer and GBM cohorts. We developed a statistical simulation procedure to determine which IPAs are significantly different than what would be expected from a negative distribution. We constructed the negative distribution by permuting across all of the patients and across the genes in the pathway. Empirically, we found that permuting only among genes in the pathway was necessary to help correct for the fact that each gene has a different topological context determined by the network. In the breast cancer dataset, 56 172 IPAs (7% of the total) were found to be significantly higher or lower than the matched negative controls. On average, NCI pathways had 497 significant entities per patient and 103 out of 127 pathways had at least one entity altered in 20% or more of the patients. In the GBM dataset, 141 682 IPAs (9% of the total) were found to be significantly higher or lower than the matched negative controls. On average, NCI pathways had 616 significant entities per patient and 110 out of 127 pathways had at least one entity altered in 20% or more of the patients.
As another control, we asked whether the integrated activities could be obtained from arbitrary genes connected in the same way as the genes in the NCI pathways. To do this, we estimated the false discovery rate and compared it with SPIA (Tarca et al.
). Because many genetic networks have been found to be implicated in cancer, we chose to use simulated ‘decoy’ pathways as a set of negative controls. For each NCI pathway, we constructed a decoy pathway by connecting random genes in the genome together using the same network structure as the NCI pathway. We then ran PARADIGM and SPIA to derive IPAs for both the NCI and decoy pathways. For PARADIGM, we ranked each pathway by the number of IPAs found to be significant across the patients after normalizing by the pathway size. For SPIA, pathways were ranked according to their computed impact factor.
We found that PARADIGM excludes more decoy pathways from the top-most activated pathways compared with SPIA (). For example, in breast cancer, PARADIGM ranks 1 decoy in the top 10, 2 in the top 30 and 4 in the top 50. In comparison, SPIA ranks 3 decoys in the top 10, 12 in the top 30 and 22 in the top 50. The overall distribution of ranks for NCI IPAs are higher in PARADIGM than in SPIA, observed by plotting the cumulative distribution of the ranks (P<0.009, Kolmogorov-Smirnov test).
Fig. 5. Distinguishing decoy from real pathways with PARADIGM and SPIA. Decoy pathways were created by assigning a new gene name to each gene in a pathway. PARADIGM and SPIA were then used to compute the perturbation of every pathway. Each line shows the receiver-operator (more ...)
We sorted the NCI pathways according to their average number of significant IPAs per entity detected by our permutation analysis and tabulated the top 15 in breast cancer () and GBM (). Several pathways among the top 15 have been previously implicated in their respective cancers. In breast cancer, both SPIA and PARADIGM were able to detect the estrogen- and ErbB2-related pathways. In a recent major meta-analysis study, authors from Wirapati et al.
) found that estrogen receptor and ErbB2 status were two of only three key prognostic signatures in breast cancer. PARADIGM was also able to identify an AKT1-related PI3K signaling pathway as the top-most pathway with significant IPAs in several samples (). The anti-apoptotic AKT1 serine–threonine kinase is known to be involved in breast cancer and interacts with the ERBB2 pathway (Ju et al.
). In GBM, both FOXM1 and HIF-1-alpha transcription factor networks have been studied extensively and shown to be overexpressed in high-grade glioblastomas versus lower-grade gliomas (Liu et al.
; Semenza, 2000
Top PARADIGM pathways in breast cancer
Fig. 6. Patient sample IPAs compared with ‘within’ permutations for Class I PI3K signaling events mediated by Akt in breast cancer. Biological entities were sorted by mean IPA in the patient samples (red) and compared with the mean IPA for the (more ...)
To visualize the results of PARADIGM inference, we developed a ‘CircleMap’ visualization to display multiple datasets centered around each gene in a pathway (). In this display, each gene is associated with all of its data across the cohort by plotting concentric rings around the gene, where each ring corresponds to a single type of measurement or computational inference. Each tick in the ring corresponds to a single patient sample while the color corresponds to activated (red), deactivated (blue) or unchanged (white) levels of activity. We plotted CircleMaps for a subset of the ErbB2 pathway and included ER status, IPAs, expression and copy number data from the breast cancer cohort.
Fig. 7. CircleMap display of the ErbB2 pathway. For each node, ER status, IPAs, expression data and copy-number data are displayed as concentric circles, from innermost to outermost, respectively. The apoptosis node and the ErbB2/ErbB3/neuregulin 2 complex node (more ...)
Gene expression data have been used successfully to define molecular subtypes for various cancers. Cancer subtypes have been found that correlate with different clinical outcomes such as drug sensitivity and overall survival. We asked whether we could identify informative subtypes for GBM using PARADIGM IPAs rather than the raw expression data. The advantage of using IPAs is that they provide a summarization of copy number, expression and known interactions among the genes and may therefore provide more robust signatures for elucidating meaningful patient subgroups. We first determined all IPAs that were at least moderately recurrently activated across the GBM samples and found that 1755 entities had IPAs of 0.25 in at least 75 of the 229 samples. We collected all of the IPAs for these entities in an activity matrix. The samples and entities were then clustered using hierarchical clustering with uncentered Pearson correlation and centroid linkage (). Visual inspection revealed four obvious subtypes based on the IPAs with the fourth subtype clearly distinct from the first three.
Clustering of IPAs for TCGA GBM. Each column corresponds to a single sample, and each row to a biomolecular entity. Color bars beneath the hierarchical clustering tree denote clusters used for .
The fourth cluster exhibits clear downregulation of HIF-1-alpha transcription factor network as well as overexpression of the E2F transcription factor network. HIF-1-alpha is a master transcription factor involved in regulation of the response to hypoxic conditions. In contrast, two of the first three clusters have elevated EGFR signatures and an inactive MAP kinase cascade involving the GATA interleukin transcriptional cascade. Interestingly, mutations and amplifications in EGFR have been associated with high grade gliomas as well as glioblastomas (Kuan et al.
). Amplifications and certain mutations can create a constitutively active EGFR either through self stimulation of the dimer or through ligand-independent activation. The constitutive activation of EGFR may promote oncogenesis and progression of solid tumors. Gefitinib, a molecule known to target EGFR, is currently being investigated for its efficacy in other EGFR-driven cancers. Thus, qualitatively, the clusters appeared to be honing in on biologically meaningful themes that can stratify patients.
To quantify these observations, we asked whether the different GBM subtypes identified by PARADIGM coincided with different survival profiles. We calculated Kaplan–Meier curves for each of the four clusters by plotting the proportion of patients surviving versus the number of months after initial diagnosis. We plotted Kaplan–Meier survival curves for each of the four clusters to see if any cluster associated with a distinct IPA signature was predictive of survival outcome (). The fourth cluster is significantly different from the other clusters (P < 2.11 × 10−5; Cox proportional hazards test). Half of the patients in the first three clusters survive past 18 months; the survival is significantly increased for cluster 4 patients where half survive past 30 months. In addition, over the range of 20–40 months, patients in cluster 4 are twice as likely to survive as patients in the other clusters.
Kaplan-Meier survival plots for the clusters from .
The survival analysis revealed that the patients in cluster 4 have a significantly better survival profile. Cluster 4 was found to have an upregulation of E2F, which acts with the retinoblastoma tumor suppressor. Upregulation of E2F is therefore consistent with an active suppression of cell cycle progression in the tumor samples from the patients in cluster 4. In addition, cluster 4 was associated with an inactivity of the HIF-1-alpha transcription factor. The inactivity in the fourth cluster may be a marker that the tumors are more oxygenated, suggesting that they may be smaller or newer tumors. Thus, PARADIGM IPAs provide a meaningful set of profiles for delineating subtypes with markedly different survival outcomes.
For comparison, we also attempted to cluster the patients using only expression data or CNA data to derive patient subtypes. No obvious groups were found from clustering using either of these data sources, consistent with the findings in the original TCGA analysis of this dataset (TCGA, 2008
), (Supplementary Fig. 1
). This suggests that the interactions among genes and resulting combinatorial outputs of individual gene expression may provide a better predictor of such a complex phenotype as patient outcome.