Recent studies have emphasized the importance of pathway-specific interpretations for understanding the functional relevance of gene alterations in human cancers. Although signaling activities are often conceptualized as linear events, in reality they reflect the activity of complex functional networks assembled from modules that each respond to input signals. To acquire a deeper understanding of this network structure, we developed an approach to deconstruct pathways into modules represented by gene expression signatures. Our studies confirm that these signatures represent units of underlying biological activity linked to known biochemical pathway structures. Importantly, we show that these signaling modules provide tools to dissect the complexity of oncogenic states that define disease outcomes as well as response to pathway-specific therapeutics. We propose that this model of pathway structure constitutes a framework to study the processes by which information propagates through cellular networks, and to elucidate the relationships of fundamental modules to cellular and clinical phenotypes.
The phenotypic heterogeneity of human cancers presents major challenges to advancing our understanding of disease mechanisms as well as to developing effective strategies for therapeutic design. This heterogeneity is also reflected at a molecular level in the variations in activity of cell signaling pathways that control cell growth and determine cell fate, processes critical for driving the cancer phenotype. Recent studies describing in-depth analyses of gene mutations in a number of human cancers have emphasized the importance of placing such data in pathway-specific contexts (Ding et al., 2008; Jones et al., 2008; Network, 2008; Parsons et al., 2008; Wood et al., 2007). Certain biological processes do represent relatively simple series of biochemical events linked in an orderly fashion, such as the known biochemical pathways associated with energy metabolism. However, the extension of this notion of a linear pathway is neither useful nor appropriate as a description of the events associated with complex cellular responses to environmental inputs such as growth stimulation. Rather, the signaling events represent activities in complex networks of multiple signaling modules that each respond to given inputs (Segal et al., 2004). A module is the unit of signaling activity; one example is PI3K phosphorylating Akt to activate its kinase activity, another is cyclin D/Cdk4 phosphorylating Rb to eliminate its negative control of E2F. These modules are defined by the biochemical events that they mediate. They are assembled into pathways by virtue of the nature of the signaling processes, but this assembly is fluid, variable and context-dependent. For instance, PI3K can be activated by Ras, but PI3K can also be activated by a variety of other signaling events, so the PI3K module is part of the Ras pathway in one setting but part of another pathway in a different setting.
Ultimately, the complex assemblage of these signaling modules constitutes the signaling network that is activated in response to a particular input under a defined set of conditions.
The Ras signaling network, frequently altered in human cancers, exemplifies modular structure. Ras controls numerous processes related to cell proliferation and fate through interactions with secondary effectors (Shaw and Cantley, 2006). Mutations in Ras can alter its ability to interact with specific effectors, decoupling the downstream activities into discrete modules that contribute complementary activities critical to the initiation and maintenance of tumors (Lim and Counter, 2005; White et al., 1995). Of nearly a dozen effectors identified, the Raf kinase, RalGEF, and phosphoinositide-3-kinase (PI3K) modules are studied most thoroughly (Mitin et al., 2005). Since particular modules are connected to specific characteristics of the tumor phenotype, having an unbiased catalog of the modules that comprise pathways, as well as the means to measure them, will prove valuable in efforts to pinpoint the precise modules that drive a tumor phenotype.
It is thus critical to develop methods to assay the activity of individual signaling modules as the basic units of signaling activity. While measures of protein phosphorylation could be an approach, this is limited by the availability of reagents to carry out the assay (usually antibodies), the sensitivity of the measurements, and the capacity to do this on a scale sufficient to eventually reconstruct the signaling network. Gene expression data represents one form of data that is an accessible, useful source for these measurements. Ultimately, cell signaling events lead to changes in gene expression and thus, regardless of whether or not the module directly involves transcriptional activity, the eventual result of the signaling process will be a change in gene expression. Further, whole genome measures of gene expression from DNA microarray analysis provide the complexity of data that can discern the subtle distinctions in signaling events.
Genome-scale expression data has a proven ability to characterize the complex biological diversity in tumors or cell lines (Bild et al., 2006b; Segal et al., 2004). Multiple studies have shown that altered activity of a pathway, such as that caused by amplification of MYC or mutation in RAF, leads to distinctive patterns in the expression of genes, the expression signatures of the pathways (Adler et al., 2006; Solit et al., 2006). Even pathways that operate primarily through post-translational mechanisms such as phosphorylation cascades leave recognizable gene expression signatures (Bild et al., 2006a; Huang et al., 2003; Sweet-Cordero et al., 2005). For these pathways, the genes in the signatures reflect the downstream transcriptional consequences of protein-level regulation; while those genes may not coincide with the ones in the primary cascades, they nevertheless provide measures of upstream pathway activity. This suggests that the complexity of pathway machinery is reflected in the complexity of the expression data; we then need analysis methods to deconvolute this complexity and identify contributions of fundamental pathway modules.
To address this central question of deciphering pathway complexity, we have developed an approach to deconstruct pathways into underlying modules based on structure observed in gene expression profiles (Bild et al., 2006a; Lamb et al., 2006). Our approach builds on statistical factor analysis methods (Brunet et al., 2004; Carvalho et al., 2008; Lucas et al., 2006; Seo et al., 2007). By centering the analysis on the genes in a pathway, this analysis produces a set of pathway-related signatures that we hypothesize represent the activities of the modules of the pathway. To exemplify and test the approach, we deconstruct the Ras signaling and E2F transcriptional regulatory pathways to reveal a series of module signatures that can predict drug sensitivity and dissect clinical outcomes in practically meaningful ways. This generates a deeper understanding of the complexity of pathway function by elucidating the modules reflected in natural variability of genomic expression structure. The analysis also leads to opportunity for therapeutic advances through the identification and characterization of clinically relevant pathway modules that may now be more specifically targeted with drugs.
To identify gene expression signatures that represent the activity of pathway modules, we first define an initial set of genes on which to focus the pathway analysis. Since Ras function is mediated through protein interactions, we define the Ras pathway to be the proteins that bind to Ras either directly or with one degree of separation in a protein-protein interaction network (Supplemental Table S1) (Rual et al., 2005). Then, we apply a strategy based on statistical factor analysis using the Bayesian Factor and Regression Modeling (BFRM) tools (Carvalho et al., 2008; Lucas et al., 2006; Wang et al., 2007). Statistical factor modeling deconvolutes a gene expression data set into a series of underlying signatures with a model of the form X = AΛ + Ψ, where X is an n × m matrix of the gene expression data (n and m are the number of genes and samples in the data set, respectively), A is a sparsely defined n × k matrix indicating the genes in the signatures (k is the number of signatures) and defining weights between gene-signature pairs, Λ is a k × m matrix of the scores of the signatures across the data set, and Ψ reflects measurement error and residual biological noise in the data. The number of signatures k is estimated statistically. The analysis can thus identify underlying components of variation in expression that relate to multiple, intersecting sets of genes, whose signatures reflect subtle, modular aspects of expression variation related to the network under study.
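To make the dimensions of the factor model concrete, the following minimal sketch builds a toy data matrix from the form X = AΛ + Ψ. It only simulates data from given loadings and scores; it does not perform the sparse Bayesian estimation carried out by BFRM. The function names, the toy matrices and the Gaussian noise model are illustrative assumptions, not part of the original analysis.

```python
import random


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both given as lists of rows."""
    return [[sum(a[i][l] * b[l][j] for l in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def simulate_expression(loadings, scores, noise_sd, seed=0):
    """Return X = A * Lambda + Psi, with Psi drawn as independent Gaussian noise.

    The Gaussian noise model is an assumption made for this illustration.
    """
    rng = random.Random(seed)
    signal = matmul(loadings, scores)
    return [[value + rng.gauss(0.0, noise_sd) for value in row] for row in signal]


# Toy example: n = 4 genes, k = 2 signatures, m = 3 samples.
# Zeros in A encode sparsity: each gene belongs to only some signatures.
A = [[1.0, 0.0],
     [0.8, 0.0],
     [0.0, 1.2],
     [0.5, 0.7]]
Lam = [[2.0, 0.0, -1.0],   # scores of signature 1 across the 3 samples
       [0.0, 1.0, 1.0]]    # scores of signature 2 across the 3 samples

X = simulate_expression(A, Lam, noise_sd=0.0)  # noise-free, so X equals A * Lambda
```

The last gene loads on both signatures, which illustrates how intersecting gene sets can contribute to more than one component of variation.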
In the context of pathway analysis, BFRM is applied to an initial, selected set of genes identified as core for the pathway. To aid in exploring the structure of an incompletely defined pathway, analysis then iterates through a two-step cycle in which the factor model decomposition is supplemented with an evolutionary search to identify additional genes that, in terms of expression variation, relate to the estimated factors; these genes are candidate contributors to pathway structure (Carvalho et al., 2008; Wang et al., 2007). Hence the statistical analysis allows for iterative expansion of the initial set of genes to enrich the core pathway gene list; this provides a step towards improved pathway understanding by now also reflecting contributions to the complex patterns of expression variation across heterogeneous cancer data.
The analysis results in a collection of estimated statistical factors that define signatures: each signature is a set of genes with estimated weights (regression coefficients from the factor analysis). Any further expression sample, whether from tumors or cell lines, can then be scored for the level of activity of a signature by taking the weighted average of the expression levels of its genes. Cells or tumors with high scores for a given signature share similar activation levels, and those with low scores share the opposite levels; the magnitude of the scores differentiates levels of biological activity linked to the pathway module represented by the signature. For example, high scores coincide with high levels of Ras pathway activity measured on each of a number of factor scores related to multiple modules of the Ras pathway.
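The scoring step can be sketched as a weighted average over the genes of a signature. The sketch below is illustrative only: the gene names and weights are hypothetical, and dividing by the sum of absolute weights is an assumed normalization that the text does not specify.

```python
def signature_score(expression, weights):
    """Weighted average of expression levels over the genes of one signature.

    expression: dict mapping gene -> expression level in a single sample
    weights:    dict mapping gene -> signature weight (may be negative)
    Genes absent from the sample are skipped. Normalizing by the sum of
    absolute weights is an assumption made for this sketch.
    """
    shared = [gene for gene in weights if gene in expression]
    total_weight = sum(abs(weights[gene]) for gene in shared)
    if total_weight == 0:
        raise ValueError("no weighted signature genes found in the sample")
    return sum(weights[gene] * expression[gene] for gene in shared) / total_weight


# Hypothetical signature and sample, for illustration only.
weights = {"GENE_A": 2.0, "GENE_B": 1.0, "GENE_C": -1.0}
sample = {"GENE_A": 3.0, "GENE_B": 1.0, "GENE_C": 2.0, "GENE_D": 7.0}
score = signature_score(sample, weights)  # (6 + 1 - 2) / 4
```

Genes that are measured in the sample but carry no weight in the signature (GENE_D here) do not affect the score.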
Here we use the NCI-60 cancer cell line data set as the source of expression data, since this diverse collection of cancer cell lines exhibits widely varying activity in the Ras pathway (Ross et al., 2000).
Using the strategy outlined in Figure 1, we generate a collection of 20 signatures derived from the Ras core pathway (Figure 2A; Supplemental Table S2). For comparison, we also show the Ras pathway activity predicted for each cell line from the expression signature defined in (Bild et al., 2006a). We reason that if these signatures represent units of Ras-related gene expression, a subset of these signatures should correlate with activities of known effectors. We explore this with two distinct indicators of pathway activation measured on the NCI-60 cell line panel: the presence of mutations in the Ras pathway genes, and measures of the sensitivity of the cells to drugs that target specific Ras pathway modules (Supplemental Table S3). We reason that the activation of a particular module of the Ras pathway will create sensitivity to a drug that targets activities within that module.
As shown in Figure 2B, the scores for Ras signature 14 show a significant association with mutations in Raf, providing evidence that this signature quantifies activity in the Raf module of the Ras pathway. Furthermore, this signature is also strongly related to sensitivity to a drug that inhibits ERK; this distinguishes this signature as being related to signaling down the Raf-MEK-ERK module. A similar analysis finds that Ras signature 11 denotes activity in PI3K-Akt signaling. Statistical significance is assessed (see Methods), with p-values less than 0.0001 for both the Akt (signature 11) and Raf (signature 14; p = 0.000006) signatures, as well as for identifying two Ras pathway signatures by chance in an analysis with 20 signatures. These results show that the approach can identify precise signaling activities through specific downstream pathway modules.
To validate the capacity of the Ras pathway module signatures to predict pathway activity, we have taken advantage of the identification of Ras mutants that selectively activate the downstream effectors Raf, Ral, and PI3K (Lim and Counter, 2005). We generated RNA from cells expressing the mutant proteins and evaluated the gene expression data with each of the 20 previously derived Ras module signatures (GEO Accession GSE14934). As shown in Figure 3A, Ras signature 14, which was linked to the Raf effector arm based on the analysis in the NCI-60 dataset, also identified the cells expressing the Ras mutant activating Raf signaling and distinguished these from the cells in which the other two Ras pathway effectors were activated. Similarly, Ras signature 11, which was previously linked to the PI3K arm, also identified the cells expressing the Ras mutant activating the PI3K effector pathway and distinguished these from the other mutant cells. These findings provide a strong, independent validation of the capacity of these module signatures to accurately identify cells expressing the relevant Ras effector pathway.
Although the analysis derived from the NCI-60 dataset does not provide the opportunity to link a signature with the Ral effector arm given the lack of relevant drug sensitivity or mutation data, we did identify a Ras module signature that shows a link with this effector. Signature 17 is correlated with Ral and distinguishes these cells from those in which the other two Ras effector pathways have been activated (Figure 3B).
To further verify that these signatures recognize activation of Ras pathway modules, we tested whether they distinguish cells in which the relevant module is activated. We first evaluated gene expression data sampled from prostates of transgenic mice expressing activated Akt (GSE1413). As shown in Figure 3C, the Akt signature (signature 11) accurately discriminates the Akt+ samples from the controls. In contrast, the Raf signature (signature 14) does not discriminate these two groups of samples. Conversely, using expression data from a breast cancer cell line expressing Raf or its downstream effector MEK (GSE3542), the Raf signature (signature 14) discriminates the Raf- or MEK-expressing cells from the controls, whereas the Akt signature does not (Figure 3D). Finally, in a more heterogeneous data set of 90 melanomas that was sequenced for Raf mutations (GSE4845), the score of the Raf, but not the Akt, signature is linked to Raf mutations (Figure 3E) (Hoek et al., 2006).
Further analysis demonstrates that the Ras signatures can also be derived from an independent Ras pathway based on a Ras overexpression microarray experiment rather than from genes with known protein interactions (Supplemental Table S1; Supplemental Figure S1). Furthermore, we have verified through simulations that the signatures cannot be derived from randomly selected initial gene sets (data not shown). Hence, while the approach is specific for the pathway being analyzed, it is not sensitive to a specific definition of the pathway genes used to initialise the analysis.
Taken together, these data provide strong evidence that pathway signatures can be identified using this approach, that they are specific to the pathway module being measured, and that they are robust in their capacity to predict the activation of the pathway related to that signaling module.
The Rb/E2F network provides a second context and several examples of the utility of the approach. Rb regulates the activity of the family of E2F transcription factors that in turn control expression of genes critical for the G1/S and G2/M transitions (Hernando et al., 2004; Ishida et al., 2001; Muller et al., 2001; Zhu et al., 2004) (Figure 4A). This dichotomy of E2F function provides an opportunity to explore the extent to which signature analysis can reveal pathway module signatures linked to these distinct roles of E2F proteins.
Using the same strategy as in the Ras investigation, we deconstruct the E2F pathway with BFRM analysis applied to the NCI-60 data; this identifies eight signatures (Supplemental Tables S4–S5). Of these, one is significantly associated with S phase in the cell cycle and two with mitosis based on their association to drugs that affect either S phase or mitotic events (Figure 4B). Cell lines with high scores on E2F signature 3 are correlated with sensitivity to three drugs that target S phase with distinct mechanisms of action (Koster et al., 2007; Weinstein et al., 1992). High scores for this signature are also correlated with mutations in p16 (CDKN2A), a component of the G1/S checkpoint.
Next, we find that cell lines with high scores on E2F signatures 4 and 6 are sensitive to drugs that target the mitotic spindle (Weinstein et al., 1992). As expected, the converse is not true; the S phase signature 3 is not associated with sensitivity to mitotic drugs, and the mitotic signatures 4 and 6 are not associated with sensitivity to S phase drugs. Using a similar approach as above, the p-value for the S phase signature (signature 3) is 0.00003, and those for the mitosis signatures (signatures 4 and 6) are 0.000009 and 0.0009, respectively; the p-value for the entire analysis is p<0.00001. Thus, modular deconstruction of the E2F pathway identifies specific processes known to be related to the pathway. This further supports the value of the strategy as exhibited in the Ras analysis, and the view that the decomposition approach is general and can be extended to pathways with divergent mechanisms of action.
Ultimately, the utility of the pathway module signatures lies in the capacity to better understand and dissect the complexities of signaling events underlying clinically-relevant phenotypes. To explore and exemplify this, we have analyzed pathway expression modules in relation to the response of colon cancer patients to the EGFR-specific therapeutic cetuximab (Khambata-Ford et al., 2007; GSE5851) (Figure 5A). The activation status of EGFR, including the use of an EGFR pathway signature, is simply incapable of discriminating responses to cetuximab. It is thus of interest to ask whether discrimination can be obtained from a refined understanding of the EGFR network in terms of pathway module signatures. Following the same approach used for the generation of the Ras and E2F signaling modules, we created a set of 20 signatures derived from an initial EGFR signature (Supplemental Tables S6–S7). We then evaluated the collection of pathway module signatures, derived from EGFR, Ras, and E2F, for their capacity to distinguish response to cetuximab. Evaluation of the EGFR modules revealed one (signature 18) that did significantly distinguish cetuximab responders from non-responders (Figure 5B). In contrast, neither the Ras nor the E2F pathway module signatures could effectively distinguish response to cetuximab (Figure 5C).
As a second example of the use of pathway module signatures in a clinical context, we focus on past work that has shown a capacity of pathway signatures to dissect the complex heterogeneity across tumor types (Bild et al., 2006a). An example arises from the analysis of 74 lung tumors comprised of an approximately equal number of squamous cell carcinomas and adenocarcinomas (GSE3141) (Figure 6A). This analysis demonstrates the power of pathway signatures to identify subgroups of tumors, including two large subgroups that are characterized by Ras pathway activation and E2F3 pathway activation. We used the sets of Ras and E2F pathway module signatures to determine if these broad groups can be further dissected into clinically meaningful subtypes based on the activity of module signatures. As shown in Figure 6B, this analysis reveals that Ras pathway signature 9 significantly distinguishes low and high-risk survival groups. This has been further extended to a second lung tumor dataset involving 89 adenocarcinoma samples (GSE3593-ACOSOG) (Potti et al., 2006) (Supplemental Figure S2A). Among the E2F signaling modules, E2F signature 6, which is linked to the mitotic component of the E2F pathway, identifies a cohort of tumors with poor survival, and this result is again reproducible in an independent data set (GSE5828) (Supplemental Figure S2B).
Taken together, these analyses demonstrate the capacity of pathway module signatures to identify properties of tumors that relate to significant clinical phenotypes, providing the ability to improve tumor classification to reveal more precise prognosis, or to predict response to pathway-specific drugs, driven by models that represent the complexity of the underlying biological activities. This suggests that a rational strategy to target therapeutics may be improved by using an approach that takes into account variations in the ways signals are propagated through pathways.
An overarching goal of systems biology is the ability to understand the functioning of cellular signaling pathways, not as isolated units or linear sets of events, but as networks of interconnected events. The importance of developing an understanding at this level is emphasized by the recent studies of human cancers detailing the complex array of mutations that arise and the importance of placing this data in a pathway-specific context (Ding et al., 2008; Jones et al., 2008; Network, 2008; Parsons et al., 2008; Wood et al., 2007). A key challenge in addressing this goal is the availability of tools that can measure the variation in activity and output of the pathways in response to diverse inputs and cellular contexts. Multiple studies have shown the capacity of gene expression data to describe such subtle characteristics of biology not achievable through other means of analysis; hence our interest in taking such studies further to realise some of the potential to address the complexity of cell signaling events (di Bernardo et al., 2005; Ergun et al., 2007). The real challenge is to dissect the complexity of the gene expression information such that the resulting signatures reveal the discrete modules of the cell signaling pathways. By so doing, these signatures become tools that can provide a measure of the individual activities that foster pathway complexity.
Our initial investigations deconstruct the Ras, E2F, and EGFR pathways into collections of module signatures that describe refined, discrete or modular aspects of pathway function. This complements the classical view of pathways as wiring diagrams by providing structure in the form of modules that are measured by discrete gene expression signatures. As clearly demonstrated here, by relating various signatures to either drug sensitivity or presence of mutations it has been possible to link several of these to characterized modules of Ras and E2F pathway activity. A key strength of this approach is that it can uncover, via an unbiased and automated statistical analysis, signatures of pathway-related activities driven by unknown molecular mechanisms. As growing numbers of molecular activities are characterized with expression signatures, the ability to link these pathway module signatures with underlying mechanism will increase rapidly. Nevertheless, the ability to anchor the analysis on a pathway, combined with a rigorous methodology for exploring the surrounding functional landscape, sets each signature in an initial coarse pathway context. This proof-of-concept provided by our several oncogenic network examples suggests that the study of the additional pathways to provide further measures of modular activity will help to achieve the goal of deciphering cellular signaling on a genomic scale.
It is important to recognize that expression signatures also represent practical tools that can be of value in present day clinical practice, independently of their potential to contribute to improved understanding of pathway structure from a systems viewpoint. In particular, the ability to add value in a prognostic setting, as illustrated in the dissection of the squamous and adenocarcinoma samples, provides a very clear potential use of this information. Similarly, the capacity of module signatures to refine the prediction of therapeutic outcomes, such as in the example of drugs that target the EGFR signaling pathway, is clearly also relevant. We have previously demonstrated the potential of pathway-specific signatures to guide the use of various targeted therapies in a general sense. The work here now takes this an important step further by making use of signatures that quantify activity of more specific pathway modules for which drugs have been developed to then facilitate the development of strategies that can accurately identify those patients most likely to benefit from a given drug. The increased biological resolution and specificity of module signatures becomes critical when the complexity of signaling pathway alterations, resulting from the complex array of mutations and genome alterations in human tumors, otherwise obscures the ability of simple pathway analysis to accurately predict response.
The pathway module strategy offers an ability to unravel complex pathway structures and identify functional modules whose activities may be connected to molecular processes that mediate sensitivity to targeted therapeutics. Within this framework it is possible to explore the relationship between pathway function and subtle perturbations in the complex array of inputs, to investigate how this is influenced by the action of other signaling pathways and networks, and ultimately to provide more precise connections between molecular activities and their manifestations in disease.
The Ras and E2F pathway analyses used the gene expression data on the 59 NCI-60 cancer cell lines available on the NCI Developmental Therapeutics Program web site (Scherf et al., 2000). These cell lines represented nine tissues and included data on their sensitivities to almost 45,000 compounds in terms of GI50 numbers (Shoemaker, 2006) as well as mutational status on key cancer genes (Ikediobi et al., 2006). The gene expression data was generated by Novartis on Affymetrix U95A microarrays in triplicate and averaged. To select the most important genes, we discarded probe sets that exhibited very low levels and variance of expression (Supplemental Table S8). We predicted the Ras and E2F pathway activity on the NCI-60 and lung cancer data sets using procedures described previously (Bild et al., 2006a).
To deconstruct the Ras, E2F, and EGFR pathways into modules, we applied the evolutionary statistical factor analysis to the NCI-60 gene expression data set. We centered the analyses on sets of genes known to represent aspects of the core pathways of interest. Reasoning that Ras signaled through phosphorylation events that depended on physical protein interactions, we included in the Ras pathway the Ki-Ras, Ha-Ras, and N-Ras isoforms, and all proteins that physically interacted with these Ras proteins directly or indirectly through an intermediate protein. To do so, we used a protein network constructed by combining interactions from the BIND, BioGRID, DIP, HPRD, IntAct, MINT, and MIPS databases and the Vidal genome-wide yeast two-hybrid screen (Rual et al., 2005). This resulted in 589 genes that corresponded to 498 Affymetrix probe sets in the filtered NCI-60 data set. For the E2F pathway, since E2F regulated the transcription of genes directly, we created a gene set by combining multiple gene expression profiles of E2F function described in (Chang and Nevins, 2006), resulting in 224 genes matched to 216 probe sets. For the EGFR pathway, we generated a signature by comparing the expression profiles of EGF-treated Human Bronchial Epithelial Cells infected with adenoviruses expressing EGFR against those expressing a control, as described previously (Bild et al., 2006a). The sparse statistical factor analysis utilised the BFRM software that implements models and methods previously described (Wang et al., 2007); the software and examples are freely available at http://icbp.genome.duke.edu/bfrm.html. Analyses used the default parameters in all cases. We initialized models with one latent factor and iteratively included factors and genes until the analysis terminated after increasing the gene set to 1000 genes. This analysis identified 20 module signatures in the Ras pathway, 8 in E2F, and 20 in EGFR.
We obtained HEK-HT cells expressing mutant forms of Ras that signaled constitutively down the Raf (RasG12V,T35S), RalGEF (RasG12V,E37G), and PI3K (RasG12V,Y40C) branches of the pathway, as well as one with only wild-type Ras protein (Lim and Counter, 2005). We cultured the cells in DMEM (Dulbecco’s modified Eagle medium) supplemented with 10% FBS (fetal bovine serum). Then, from each cell type growing asynchronously, we collected total RNA in five independent replicates using RNeasy Mini kits from Qiagen. The Duke Microarray Facilities processed the RNA samples and hybridized them to Affymetrix U133A 2.0 microarrays. We normalized the raw CEL files using the MAS5 implementation in the Bioconductor toolkit. We made the data available in GEO (GSE14934). As above, we discarded the genes with the lowest levels and variance of expression (Supplemental Table S8). Then, we merged the Ras mutant expression data set with the NCI-60 data set by matching probe sets based on Entrez Gene IDs. For genes that matched multiple probe sets, we resolved the ambiguity by choosing the one with the most similar correlation structure as described in (Shankavaram et al., 2007). To project a Ras mutant onto the NCI-60 data set, we quantile normalized the merged data set, selected the 200 genes most correlated with the Ras mutant of interest, and used an SVM with a first-degree polynomial kernel to predict its activity on the NCI-60 cell lines (Chang and Lin, 2001). We calculated the association between the Ras sub-signature profiles and Ras mutant profiles using a non-parametric Kendall correlation and rejected all associations with p-values > 0.05.
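The quantile normalization step applied to the merged data set can be illustrated with a minimal sketch. It is a generic textbook version of the technique, not the implementation used in the analysis; the function name and the toy values are assumptions, and ties within a sample are broken by position rather than averaged.

```python
def quantile_normalize(columns):
    """Quantile normalize a list of samples (each a list of expression values).

    Every sample is forced onto the same reference distribution: the mean,
    across samples, of the values at each rank. Ties within a sample are
    broken by position, a simplification made for this sketch.
    """
    n = len(columns[0])
    if any(len(col) != n for col in columns):
        raise ValueError("all samples must contain the same number of genes")
    sorted_cols = [sorted(col) for col in columns]
    reference = [sum(col[rank] for col in sorted_cols) / len(columns)
                 for rank in range(n)]
    normalized = []
    for col in columns:
        order = sorted(range(n), key=lambda i: col[i])  # gene indices, low to high
        out = [0.0] * n
        for rank, gene_index in enumerate(order):
            out[gene_index] = reference[rank]
        normalized.append(out)
    return normalized


# Two hypothetical samples measured on three genes.
merged = [[5.0, 2.0, 3.0],
          [4.0, 1.0, 6.0]]
normalized = quantile_normalize(merged)
```

After normalization both samples contain the same set of values, so differences between platforms or batches in overall intensity distribution no longer dominate the comparison.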
To project the Ras, E2F, or EGFR pathway module signatures onto data sets of the mouse prostate (GSE1413), Raf/MEK breast cancer cell lines (GSE3542), Raf mutations in melanoma (GSE4840, GSE4841, GSE4843), lung cancer (GSE3141), lung adenocarcinoma validation (GSE3593-ACOSOG), cetuximab response (GSE5851), and lung squamous cell carcinoma validation (GSE5828), we processed the data as described above for the NCI-60 data set. Specific characteristics of the data sets and parameters used are reported in Supplemental Table S8. When the CEL files were not available, we used the signal data provided in GEO. The melanoma data set was provided on three platforms that we processed separately, and then combined the results. For the mouse data set, we converted the target probes from mouse to human using the HomoloGene database.
Drug sensitivities (quantified as the concentration resulting in 50% growth inhibition) were available for each cell line from the National Cancer Institute Developmental Therapeutics Program web site (http://dtp.nci.nih.gov/index.html). We calculated the association between the sub-signature profiles and drug sensitivities using a non-parametric Kendall correlation and rejected all associations with p-values > 0.05. Gene mutation data for each cell line was available from the COSMIC database (Bamford et al., 2004). For each gene, we split the cell lines into two groups based on the presence of a mutation and compared, between the groups, the scores for each signature using a Wilcoxon rank sum test, rejecting all comparisons with p-values > 0.05.
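As an illustration of the rank-based association measure, the sketch below computes the Kendall tau-b coefficient between signature scores and drug sensitivities by enumerating sample pairs. The variable names and numerical values are hypothetical, the choice of the tau-b variant is an assumption, and the p-value calculation used to filter associations is not reproduced here.

```python
import math


def kendall_tau_b(x, y):
    """Kendall tau-b rank correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    concordant = discordant = tied_x = tied_y = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0:
                tied_x += 1
            if dy == 0:
                tied_y += 1
            product = dx * dy
            if product > 0:
                concordant += 1
            elif product < 0:
                discordant += 1
    n_pairs = n * (n - 1) // 2
    denominator = math.sqrt((n_pairs - tied_x) * (n_pairs - tied_y))
    if denominator == 0:
        raise ValueError("tau is undefined when one variable is constant")
    return (concordant - discordant) / denominator


# Hypothetical signature scores and log10(GI50) values for four cell lines.
signature_scores = [0.2, 0.5, 0.9, 1.4]
log_gi50 = [-4.1, -5.0, -4.6, -6.2]
tau = kendall_tau_b(signature_scores, log_gi50)
```

In this toy example higher signature scores tend to accompany lower GI50 values, so tau is negative, which corresponds to greater drug sensitivity in cell lines with higher scores.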
To calculate the p-value of each signature, we used a sampling strategy: we generated one million random signatures, calculated their associations with drug sensitivities and gene mutations as above, and then computed the bootstrap significance levels for each original signature. To account for the multiple modules in an analysis (e.g. the two Ras modules in the Ras pathway analysis), we randomly simulated the same number of signatures (20 for Ras and 8 for E2F) and evaluated whether this set of random signatures related to the same pathway modules. We repeated this 100,000 times, and the p-value was the proportion of sampled analyses that could also identify the pathway modules.
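The general idea of a sampling-based significance level can be sketched as below. The text does not specify how the random signatures were constructed, so this example substitutes a simple stand-in (shuffling signature scores across samples) purely as an assumption; the function names and the two-sided comparison are likewise illustrative.

```python
import random

def empirical_p_value(observed, null_stats):
    """Proportion of null statistics at least as extreme as the observed one."""
    extreme = sum(1 for s in null_stats if abs(s) >= abs(observed))
    return extreme / len(null_stats)

def null_distribution(statistic, scores, outcome, n_random, seed=0):
    """Statistics obtained from randomly shuffled scores (assumed null model).

    statistic: any function of (scores, outcome) returning a number, such as
    a correlation between signature scores and drug sensitivities.
    """
    rng = random.Random(seed)
    shuffled = list(scores)
    null_stats = []
    for _ in range(n_random):
        rng.shuffle(shuffled)
        null_stats.append(statistic(shuffled, outcome))
    return null_stats
```

With one million random draws, as in the text, the smallest non-zero p-value that this proportion can resolve is 1e-6.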
We assessed the association between pathway module signatures and survival time as follows. For each signature, we split samples into two equally sized groups based on signature scores and generated Kaplan-Meier curves using the Prism software.
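The survival analysis was carried out in Prism; for readers who want to see the underlying computation, the split into two groups and the Kaplan-Meier product-limit estimate can be sketched as follows. The function names and the handling of an odd number of samples are assumptions of this sketch.

```python
def median_split(scores):
    """Flag the half of the samples with the highest signature scores.

    With an odd number of samples the high group gets the extra sample.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    high = set(order[len(scores) // 2:])
    return [i in high for i in range(len(scores))]

def kaplan_meier(times, events):
    """Kaplan-Meier estimate as a list of (time, survival probability).

    events[i] is True if the event (death) was observed at times[i],
    False if the sample was censored at that time.
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e)
        leaving = sum(1 for ti in times if ti == t)
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Both deaths and censored samples leave the risk set after time t.
        at_risk -= leaving
    return curve
```

One curve would be computed per group, using the flags from `median_split` to partition the survival times.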
In the colorectal tumor data set, we measured the association between the profiles of the Ras, E2F, and EGFR factors and the response to cetuximab. In each data set, we first discarded the samples for which the response could not be determined. We split the remaining samples into two groups, where the non-responders constituted the patients with progressive disease, and the responders were patients with stable or regressive disease. To calculate the significance of the difference of the factor scores between the two sets of patients, we used a non-parametric Wilcoxon rank sum test.
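The comparison of factor scores between responders and non-responders can be illustrated with a small rank sum test. This sketch uses mid-ranks for tied values and a two-sided normal approximation without a tie correction of the variance; the function name is hypothetical, and an exact test would be more appropriate for very small groups.

```python
import math

def rank_sum_test(group_a, group_b):
    """Wilcoxon rank sum statistic for group_a and a two-sided p-value."""
    pooled = sorted([(v, 0) for v in group_a] + [(v, 1) for v in group_b])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        # Positions i..j-1 hold tied values; they share the average of ranks i+1..j.
        midrank = (i + 1 + j) / 2
        for k in range(i, j):
            ranks[k] = midrank
        i = j
    n_a, n_b = len(group_a), len(group_b)
    w = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    mean_w = n_a * (n_a + n_b + 1) / 2
    sd_w = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (w - mean_w) / sd_w
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return w, p_value
```

Here `group_a` and `group_b` would hold the factor scores of the responders and the non-responders, respectively.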
Supplemental Figure S1: Alternative Ras Pathway. We deconstruct the Ras pathway using an alternative definition of the pathway based on genes whose expression changes in response to Ras activation, rather than proteins that interact with Ras. These plots, analogous to those in Figure 2B, show that deconstructing the Ras pathway from an alternate starting point can still recover signatures that relate to known biochemical modules of the Ras pathway.
Supplemental Figure S2: A. We find that Ras pathway signature 9 can also distinguish survival in a validation data set of 89 lung cancer adenocarcinomas (GSE3593-ACOSOG), supporting the results seen in Figure 6. B. E2F signature 6 can also distinguish survival in a validation data set of 59 squamous cell carcinomas (GSE5828).
Ras Pathways. We define the Ras pathway in two ways. The first, shown on the left, is based on proteins shown to interact with Ras in protein interaction networks. The second, shown on the right, consists of the genes whose transcriptional expression changes in response to overexpression of activated Ras.
Ras Pathway Module Signatures. We decompose the Ras pathway and identify 20 signatures. For each signature, the genes are listed under the “Gene Symbol” column, and the Affymetrix probe set is shown under “Probe Set ID.”
Targeted Compounds. This table shows the compounds used in this study. The “NSC” column lists the identifier for the compounds assigned by the FDA Nomenclature Standards Committee. “Compound” gives the name for the compound. “Mechanism/Target” contains information about the gene targeted or the mechanism of action. “PMID” lists the PubMed ID of a reference describing the compound, and “Reference” provides the author and title.
E2F pathway. The genes that comprise the E2F pathway are collected from a series of experiments showing transcriptional changes in response to E2F activation (Chang and Nevins, 2006).
E2F Pathway Module Signatures. We decompose the Rb/E2F pathway and identify 8 signatures. For each signature, the genes are listed under the “Gene Symbol” column, and the Affymetrix probe set is shown under “Probe Set ID.”
EGFR pathway. The genes that comprise the EGFR pathway are those that are up-regulated in response to treatment with EGF.
EGFR Pathway Module Signatures. We decompose the EGFR pathway and identify 20 signatures. For each signature, the genes are listed under the “Gene Symbol” column, and the Affymetrix probe set is shown under “Probe Set ID.”
Statistics on the processing of the data sets.
We thank Bernard Mathey-Prevot, Ashley Chi, Wencheng Zhu, and Steve Angus for helpful discussions; Chris Counter and Kevin O’Hayer for the HEK-HT cells; Marc Vidal and Wei Chen for the protein-protein interaction networks; and Shirin Khambata-Ford for the raw gene expression data in the cetuximab study. All aspects of the research were supported under the NCI Integrative Cancer Biology program via grant NIH 5-U54-CA112952-05 and by 5-RO1-CA106520-05 to JRN. Additional aspects related to statistical factor models were partially supported under NSF grants DMS-0102227 and 0342172. JTC is supported by postdoctoral fellowship #PF-05-047-01-GMC from the American Cancer Society and NIH K99-LM009837-01. We are grateful to Kaye Culler for her assistance in the preparation of the manuscript.
Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.