|Home | About | Journals | Submit | Contact Us | Français|
Conceived and designed the experiments: VNP GB JM DW LHA MRC. Performed the experiments: VNP GB DW. Analyzed the data: VNP GB LHA. Contributed reagents/materials/analysis tools: GB JM DW LHA MRC. Wrote the paper: VNP GB MRC.
Colorectal cancer progresses through an accumulation of somatic mutations, some of which reside in so-called “driver” genes that provide a growth advantage to the tumor. To identify points of intersection between driver gene pathways, we implemented a network analysis framework using protein interactions to predict likely connections – both precedented and novel – between key driver genes in cancer. We applied the framework to find significant connections between two genes, Apc and Cdkn1a (p21), known to be synergistic in tumorigenesis in mouse models. We then assessed the functional coherence of the resulting Apc-Cdkn1a network by engineering in vivo single node perturbations of the network: mouse models mutated individually at Apc (Apc1638N+/−) or Cdkn1a (Cdkn1a−/−), followed by measurements of protein and gene expression changes in intestinal epithelial tissue. We hypothesized that if the predicted network is biologically coherent (functional), then the predicted nodes should associate more specifically with dysregulated genes and proteins than stochastically selected genes and proteins. The predicted Apc-Cdkn1a network was significantly perturbed at the mRNA-level by both single gene knockouts, and the predictions were also strongly supported based on physical proximity and mRNA coexpression of proteomic targets. These results support the functional coherence of the proposed Apc-Cdkn1a network and also demonstrate how network-based predictions can be statistically tested using high-throughput biological data.
The majority of nonhereditary colorectal tumors arise via the sequential accumulation of mutations in key driver genes, where a mutation in a tumor suppressor (e.g. Apc) or oncogene (e.g. Kras) initiates the process, and a cascade of somatic mutations ensues . Although these mutations were classically thought to be comprised of a few genes (e.g. Apc, Kras, Trp53), recent large-scale sequencing efforts revealed that any given tumor includes (on average) 80 mutations, with as many as 15 lying in frequently mutated “driver” genes . In support of the hypothesis that these key genes function cooperatively in driving tumorigenesis, mouse models mutated at two driver genes simultaneously have shown a synergistic increase in tumor burden, including: Pten-Apc , Kras-Tgfb , and Apc-Trp53 . The evidence of synergistic, i.e. non-additive, increases in tumor burden suggest that the signaling pathways of two mutated genes may intersect downstream, and, thus, predicting and interrogating these points of intersection – as a biological network – is of significant interest. To trace the connections between genes, a variety of high-throughput datasets – e.g. protein-protein interactions (PPIs), gene coexpression, and transcription factor relationships – have been employed to infer functional associations that lend themselves to analysis as networks, in which each gene or protein is represented as a node and an interaction as an edge. Furthermore, network-based analyses can be used to identify biomarkers , to predict tumor progression , or to reveal the molecular alterations underlying disease .
However, our current knowledge of biological networks is far from complete. The coverage of current interactome databases is estimated to be less than 10% of the total number of interactions . Thus, when interpolating the connections between driver genes, network-based analyses that rely solely upon confirmed interactions may lack essential connections. As one goal of our research is to predict and analyze the functional paths between driver genes, a critical step was to develop a predictive framework to infer and evaluate novel connections between genes. The framework proposed here (modeled on Pathfinder ) infers missing edges using predictions from protein family relationships and filters these paths based on known association rules. On the other hand, since a cancer gene participates in multiple signaling pathways, there may be dozens – if not, hundreds – of paths by which two proteins functionally interact. Thus, a computational approach is required to limit the network space to the specific biological context of interest. To extract functionally relevant subnetworks, the framework detects highly probable signaling pathways based on gene-gene mRNA coexpression and Gene Ontology  association rules mined from published pathways.
We used the computational method to elucidate the connections between a well-known driver gene of intestinal cancer, Apc (adenomatous polyposis coli), to another gene also involved in cancer, Cdkn1a (previously known as p21). Though Cdkn1a was not found to be mutated in populations of human colorectal cancers studied to date , its expression level correlates with neoplastic progression and has a prognostic value greater than that of Trp53 . Further supporting its importance in neoplasia, the double mutant mouse, Apc1638N+/− Cdkn1a−/−, exhibits a synergistic increase in its tumor burden . After predicting the network linking Apc and Cdkn1a, we evaluated the relevance of these predictions by manipulating the underlying system: generating in vivo network perturbations in two mouse models, followed by systems-level ‘omic measurements from the small intestinal epithelium. The ‘omic measurements – both proteomic and genomic – of the perturbed system were used for the statistical testing of the predicted network, thus introducing the concept of evaluating in silico predictions against context-specific biological data.
The network analysis framework (illustrated in Figure 1, and explained in the Methods S1) employs the PathFinder architecture outlined previously . The raw network of publicly available physical interactions is first pruned of false positives using a logistic regression model that incorporates (i) the number of times a PPI is observed, (ii) the Pearson correlation of expression measurements for the corresponding genes, (iii) the proteins' small world clustering coefficient, and (iv) the protein subcellular localization data of interacting partners. Positive (1000 PPIs from the MIPS database of interactions) and negative training data sets (1000 randomly selected PPIs that are not in MIPS) are used in 1000 cross-validation trials to acquire the parameters that maximize the likelihood of a true interaction.
False negative interactions are inferred using sequence homology relationships. It was observed that proteins with similar sequences share similar interaction partners in the same organism , and, thus, proteins from the same family are also likely to have similar interaction patterns. The Pfam database, utilizing multiple sequence alignments and hidden Markov models (HMMs), uses sequence similarity to formulate protein family classifications  and serves as a useful tool for exploiting these relationships. Hence, we inferred an interaction edge if (i) two proteins do not interact with each other in the PPI network, and (ii) there exists at least one interaction between the families of these two proteins.
To identify those paths relevant to our model system of interest, coexpression data based on microarray experiments from the ApcMin/+ mouse small intestinal epithelium were obtained from the Gene Expression Omnibus (series GSE422 ); this study used laser-capture microdissection to sample the crypts of adenomas, carcinomas, and normal epithelium. In our implementation, we used Pfam release 23.0  and the Gene Ontology release in August 2008 . The search algorithm was extended to find pathways up to 6 nodes in length, and the threshold for the average coexpression of pathways was .
All animals were handled in strict accordance with good animal practice as defined by the relevant national and/or local animal welfare bodies, and all animal work was approved by the Institutional Animal Care and Use Committee (IACUC) of Albert Einstein College of Medicine (permit number 20070805). Apc1638N+/− and Cdkn1a−/− C57BL6/J mice were generated as described previously  and tissue samples were harvested using the method outlined by Weiser et al., resulting in crypt and villus populations of cells from the small intestine of Apc1638N+/−, Cdkn1a−/−, and wild-type mice .
2D Differential In Gel Electrophoresis (2D-DIGE) was performed as previously described . Differentially expressed proteins from crypt and villus fractions were identified in the mutant mice (Apc1638N+/− and Cdkn1a−/−) relative to the respective fractions from wild-type mice (4 replicates each). Univariate t-tests (unequal variances and equal sample sizes) and multivariate linear regression (coded in the R package LIMMA ) were performed. Gel spots were selected for LC-MS/MS identification based on these two t-statistics at the 0.05 level of significance.
Gel spots were excised, trypsin digested, and the peptides were subsequently analyzed by tandem LC-MS/MS on a LC Packings/Dionex Ultimate 3000 HPLC-Orbitrap XL (Finnigan, San Jose, CA) system . For interpretation of the MS/MS spectra, the MASCOT software package was used to search the SwissProt database; a null database of reversed peptide sequences was searched simultaneously to account for false positives. Identified proteins are listed in Table S1. Mascot DAT files have been made publicly available through the Proteomics Identifications Database , accession number 10638.
Microarray studies for crypt and villus populations from Apc1638N+/−, Cdkn1a−/−, and wild-type mice (4 replicates each) were conducted on Affymetrix Mouse Genome 2.0 chips according to published procedures . All data is MIAME compliant and the raw data have been made publicly available through the MIAME compliant database, the Gene Expression Omnibus , accession number GSE19338.
Raw .CEL files were processed in MATLAB using the Robust Multiarray Averaging procedure . To deal with multiple probes capturing different aspects of a gene product's behavior, we used all probes to represent a gene. Thus, in the following analysis, each Apc-Cdkn1a network node, i, was represented by ki probes on the array, resulting in a matrix of size q×n, where and . To determine whether the Apc-Cdkn1a network nodes were collectively differentially expressed in a tissue compartment (crypts or villi), we extended Hotelling's T2 statistic – a classical approach useful for testing gene groups  – to incorporate multiple experiments, as follows:
Where is the vector of mean mRNA intensity for all the q probes for a genetic background, G, where (Apc indicating Apc1638N+/−; Cdkn1a indicating Cdkn1a−/−; and WT indicating wild-type C57BL6/J). S is the absolute value of the unbiased pooled sample covariance matrix for each mutant:
Where Mutant can refer to either Apc1638N+/− or Cdkn1a−/−, and the absolute value in S is used to avoid imaginary components when taking the inverse root of S in . It should be noted that probes corresponding to Apc and Cdkn1a themselves were excluded, as these are expected to have extremely low intensity values (in the respective mutants) that would skew the perceived aggregate network effect. In , the difference of means, , for each mutant may be positive or negative for a probe i, so, unlike T2, V2 can be either positive or negative.
Given that , sample covariance estimates are not positive definite, and hence, the inverse is singular. To circumvent this issue, we set all covariances to zero for initial calculation of V2 and then calculate the significance of V2 using a permutation test (i.e. stochastically generating new “mutant” and “wild-type” phenotype labels), thus preserving the underlying covariance structure in the null distribution. Setting the off-diagonal elements of S to zero simplifies V2 to:
Thus, V2 is simply the sum of the product of scaled t-statistics calculated for each probe, in each of the two experimental perturbations. As the number of samples was small ( for mutant and wild-type, each), random noise was added to each permutated matrix to obtain an interpolated and smoothed empirical null distribution; the standard deviation, , of the noise for each probe, q, in the genetic background, G, was estimated by the sample standard deviation of each probe. 10000 such permutations were calculated to obtain the null distributions, which –as expected – resemble F-distributions (see Figure S1). Since Apc and Cdkn1a are both tumor suppressors and hypothesized to affect our network of interest in a similar fashion, we expect the t-statistics to vary in the same direction if the null hypothesis (of no joint effect) is to be rejected. Hence, we compute the p-value of V2 as the number of null observations greater than our observed value of V2. Calculating the p-value for the negative tail of the distribution would be useful if the perturbations were expected to have opposite molecular effects (e.g. Apc+/− paired with a Stat3+/− hypomorph).
While we present an analysis for a 2-node perturbation of a network, this analysis is extensible to k experimental perturbations by computing pairwise V2 statistics, resulting in a matrix:
Where represents the statistic between perturbations j and k; as shown, the diagonal reduces to a scaled version of Hotelling's T2 statistic for each experiment. As the statistics are each of a different scale, they cannot be compared directly, and, therefore, the significance of each matrix element should be calculated (as above) via a permutation test. Then, for the matrix of p-values, the diagonal elements provide information about the significance of individual experiments, while the off-diagonal values provide information about pairwise experimental significance. The total experimental support for network perturbations can then be calculated by aggregating off-diagonal p-values, e.g. by Fisher's method . We recommend this approach for dealing with perturbations; for perturbations, as in our case, the p-values can be interpreted directly.
To assess the importance of physical proximity, the topological distance between Apc-Cdkn1a network nodes and the respective proteomic targets was calculated. Physical PPI networks were assembled from BioGRID , the Human Protein Reference Database (HPRD) , and IntAct . Each network node was tested independently for the number of 2-hop paths connecting it to a set of n experimentally measured proteins, expressed as follows:
Where is the entry at row i and column j in the adjacency matrix, A, of the PPI network; i is a protein in the Apc-Cdkn1a network; j is an intermediate protein; and k is an experimentally measured protein. In this case, the experimental proteins were the proteomic targets from either Apc1638N+/− or Cdkn1a−/− mice. If there is at least one intermediate protein, j, for which a two-hop path exists between nodes i and k, then the 2-hop distance, , is 1; the total connectivity, , of protein i to the set of 2D-DIGE targets is simply the sum of the . Significance was calculated against an empirical null formulated from 10000 randomly generated sets of proteins also of size n.
To assess patterns of coregulation, mRNA coexpression values (Spearman's correlation coefficient) were calculated from the corresponding set of normalized microarray experiments, spanning wild-type, Apc1638N+/−, and Cdkn1a−/− crypts and villi; the probe with maximum intensity was used as the representative for a gene. To test the significance of mRNA-level correlations, a modified Kuiper's test statistic, K, was calculated between the group correlations (i.e. all probes on the array) and sample correlations (i.e. set of 2D-DIGE targets) for each node in the network independently; it is calculated as the sum of the maximal and minimal deviations of the sample, , and control (i.e. entire array), F, cumulative distribution functions :
As per the suggestions of Subramanian et al. , the Kuiper's statistic, K, was modified to improve its ability to detect bimodal shifts in location of the sample distribution (as one would expect coexpressed groups of proteins to show both positive and negative correlations):
Where S is the set of proteins being tested (either the Apc1638N+/− or Cdkn1a−/− 2D-DIGE targets); r is the ordered vector of correlation coefficients between the respective 2D-DIGE targets and a single network node; and normalizes to have sum 1. Significance testing was performed using a normal approximation of the empirical null: the empirical null was assembled from the modified K calculated for 500 randomly selected protein sets, each of size , and maximum likelihood estimation was used to fit a normal distribution. For exploring and illustrating the connections of significant (α=0.05) network nodes, we examine the subset of correlations, ry, where such that and ; and the subset of correlations, rp, where such that and (analogous to the “leading edge” subset of GSEA ). To identify differentially expressed nodes, we chose those nodes where the t-statistic (unequal variance) of the maximum intensity probe was such that in either the crypt or the villus compartment, where is the normal inverse cumulative distribution function.
Testing each node in the Apc-Cdkn1a network independently resulted in a p-value for each of the null hypotheses, where , and each hypothesis, , assumes that there is no relationship (physically-based or coexpression-based) between the Apc-Cdkn1a network node, i, and the 2D-DIGE targets. To test the group null hypothesis that all are simultaneously true, p-values were aggregated into a statistic, τ, suggested by Fisher; significance was assessed against a distribution with 2n degrees of freedom  (see also Methods S1). The mutated node (Apc in Apc1638N+/− or Cdkn1a in Cdkn1a−/−) was excluded from the respective analyses, as their extreme expression patterns skew the group-wise results.
The double mutant Apc1638N+/− Cdkn1a−/− mouse was previously shown to exhibit a synergistic increase in its tumor burden when compared with the single mutants . To identify the potential connections between Apc and Cdkn1a, we constructed a predictive framework that, first, learns the annotation patterns characteristic of known signaling pathways (e.g. those found in KEGG  and others) and, then, couples these patterns with tissue specific coexpression data to extract the most likely chains of interacting proteins involved in Apc-Cdkn1a signaling (illustrated in Figure 1). To identify only high-confidence pathways, a two-phase filtering process was first applied to the global PPI network. In the first phase, edges – compiled from mammalian interactions in BioGRID  and HPRD  – were pruned from the network if they did not resemble likely interactions (as defined by a logistic regression model), with the goal of reducing false positives among the reported interactions. To account for false negatives (Phase 2), interactions were added to the network by inferring relationships that are precedented in model organisms based on protein family relationships. After applying these measures to generate a synthetic network, we searched for likely connections between Apc and Cdkn1a using both gene coexpression data and Gene Ontology association rules.
To emphasize nodes and edges relevant to our biological system, we introduced a tissue-specific bias in our search for Apc-Cdkn1a connections by using gene expression data from the intestinal epithelium of ApcMin/+ mice. From these data, we calculated the mRNA-level coexpression value for individual edges via the gene-gene Pearson correlation coefficient. Next, all paths in the synthetic network linking the gene products of Apc and Cdkn1a were queried, and the predicted paths were filtered based on (i) the support of association rules for GO annotations and (ii) the average coexpression along a path; the result (at a significance level of α=0.01) is shown in Figure 2. The Apc-Cdkn1a network includes a number of previously known interactions (solid lines), as well as predicted interactions (dashed lines) based on: (i) protein family relationships, (ii) strength of GO association rules, and (iii) microarray coexpression along the specific path connecting Apc to Cdkn1a. As genetic interactions were included in the original interaction databases, the predicted network includes both physical and functional relationships.
At a systems-level, the proposed Apc-Cdkn1a network bears the statistically unlikely property of being saturated with oncogenes: 8 of the 20 proteins are annotated as oncogenes in OMIM (p-value<5×10−10 by Fisher's exact test, see Methods S1), and many of the remaining genes have been experimentally shown to act as oncogenes (e.g. Erbb3 , , Shc1 , Map2k1 ). Although the Apc-Cdkn1a network contains many well-studied proteins, the node degree (i.e. number of interactions) within the subnetwork does not strictly correlate with the node degree in the unfiltered interaction database (Pearson's correlation=0.51). For instance, while AKT1 has many known interactions, its commonly studied biological partners – namely, GSK3B and PTEN (both of which are associated with Apc  and Cdkn1a  signaling) – do not appear in the network. Other known interactions, such as that between SHC1 and SRC , are also absent from the network. Since our algorithm predicts connections biased by the biology of the system under study (through the use of gene expression data from ApcMin/+ mouse intestinal tissue), a particular protein or edge may not appear in the network if the pathway (i.e. chain of proteins) on which it resides does not meet the gene coexpression and/or GO association rule thresholds.
Conversely, the Apc-Cdkn1a network includes novel associations: those not contained within the source databases (dashed edges in Figure 2). Several of these interactions have recently been validated in focused studies (see Table 1), providing confidence that the framework is useful. In addition, the Apc-Cdkn1a network also suggests that certain interactions previously associated with other cancer models – such as the SRC-CCND1 functional association found in prostate cancer , or the phosphorylation of CDK4 by SRC in a cell line  – are relevant in this model of colon cancer.
As the Apc-Cdkn1a network represents the intersection of signaling pathways emanating from Apc and from Cdkn1a, we expect to observe functional changes in network-associated proteins in response to perturbations at either Apc or Cdkn1a. Single-node perturbations were developed in mouse models with mutations in either Apc (namely, Apc1638N+/−) or Cdkn1a (Cdkn1a−/−). While the Apc-Cdkn1a network was generated using tumor-specific ApcMin/+ data – a model harboring a number of background genetic lesions  – the intestinal tissue obtained from the Apc1638N+/− and Cdkn1a−/− mice at 3 months of age is relatively polyp free, thus allowing us to gauge the effect of a single genetic perturbation on the pre-neoplastic epithelium. Although this removes potential bias that is introduced by subsequent mutations of neoplastic tissue, this approach may also attenuate the flow of information between the two genes.
Since we are using the two perturbations to determine how well the Apc-Cdkn1a network can capture biological phenomena, we introduced a multivariate statistic, V2 to test if differences in mean mRNA abundance exist jointly between the Apc1638N+/− and Cdkn1a−/− models. By using V2, as illustrated in Figure 3, genes with mild differential expression in the two individual mutants can contribute to the overall support of the network, as V2 rewards those genes where each of the two independent t-statistics are both greater than 1. Statistical significance of V2 was tested against a permutation null, and, as our perturbations involved two tumor suppressors expected to have molecular effects in the same direction, we used the positive tail of the distribution. Knowing that many molecules “switch” expression (i.e. high to low, or vice versa) in the transition from crypts to villi , the microarray datasets for these two biological compartments were tested separately. We found that the Apc-Cdkn1a network was strongly supported (p-value=0.002) by the joint mRNA differential expression in the two mutants' crypt compartment. Network coherence was weaker (p-value=0.060) in the villus compartment, and the network as a whole was not differentially expressed in the villi of either mutant, noted in the two V2 matrices' p-values:
Where, as mentioned, the diagonal elements indicate the significance of differential expression within a mutant (as per Hotelling's T2), and the off-diagonal elements indicate significance of joint differential expression across mutants (as per V2). In the crypts, the network was differentially expressed in Cdkn1a−/− (p-value=0.009), but not in Apc1638N+/− (p-value=0.871), and, yet, was jointly supported by differential expression across both mouse models (p-value=0.002). This illustrates that small mRNA-level changes that are shared between multiple perturbations – on a gene-by-gene basis – provide joint support for the network hypothesis, while any individual perturbation may fail to demonstrate the claim.
To illustrate how the joint consideration of gene-wise behavior operates, each network node has color-coded bubbles for the t-statistics of both Apc1638N+/− and Cdkn1a−/− in Figure 3; the sum is shown at the intersection of each gene's bubbles. Though V2 employs products of t-statistics, the sum is better suited for visually demonstrating the principle that small mRNA effects can have a significant impact when considered together. We observe that several nodes that are differentially expressed in the crypts – ERBB3, JAK2, MAPK8, et al. – are no longer differentially expressed in the villus. In addition, some genes – e.g. CCNE1, CAV2, FGFR1, EGFR – switch their direction of expression between the crypts and villi.
The 2D-DIGE analysis reported 12 proteins differentially expressed for the Cdkn1a−/− intestinal epithelium (crypts and villi combined) versus wild type, and 31 proteins differentially expressed in the epithelium of the Apc1638N+/− mice versus wild type (Table S1). To test our network-based hypothesis, we first assumed that the set of regulatory molecules in the Apc-Cdkn1a network are independent. Then, the one and two-hop physical interactions were assessed for each network node (see Figure 4). While directly interacting neighbors (one hop) are typically useful in mapping signaling pathways, they did not associate much of the proteomic data with the network. Also, the few direct connections were not statistically significant; EGFR, for example, tends to have many interactions, and, thus, EGFR's direct connections to the 2D-DIGE targets were not more likely than expected by chance. However, analysis of indirect interactions (two hops) from network nodes captured the relationship to the majority of the 2D-DIGE targets (individual node's p-value <0.005), as illustrated in Figure 4B. Considering the network as a whole and aggregating the p-values (aggregate statistic, τ), the network was significantly (p-value of τ<1×10−15) physically associated with either the Apc1638N+/− or Cdkn1a−/− 2D-DIGE targets, suggesting that the proteome-level effects are at most 4-hops away from the causative mutations.
Based on the physical proximity of the 2D-DIGE targets to the Apc-Cdkn1a network, we hypothesized that these network proteins might be controlling the expression of the 2D-DIGE targets. To examine this relationship further, we studied the pattern of mRNA-level coexpression between network nodes and the 2D-DIGE targets. As before, the network nodes were assumed to be independent, and the pattern of coexpression was assessed for each node individually using a modified Kuiper's test statistic, K; nodes identified as (i) being differentially expressed () in either crypts or villi and (ii) having significant (α=0.05) coexpression with the 2D-DIGE targets are highlighted in Figure 5. Fifteen nodes in the Apc-Cdkn1a network had significant mRNA-level correlations to the Apc1638N+/− 2D-DIGE targets, and four of these were also differentially expressed. On the other hand, eight nodes had significant correlations to the Cdkn1a−/− 2D-DIGE targets, and four of these were individually differentially expressed. Considering coexpression relationships from theApc-Cdkn1a network as a whole, the p-value of τ for coexpression between Apc-Cdkn1a network nodes and Apc1638N+/− 2D-DIGE targets was strongly significant (all nodes excluding Apc and Cdkn1a, p-value<1×10−20; differentially expressed nodes, p-value=1.4×10−5). Given the magnitude of these group-wise statistics, however, the evidence for Apc-Cdkn1a network coexpression with the Cdkn1a−/− 2D-DIGE targets was not as well-supported (all nodes, p-value of τ=3.1×10−8; differentially expressed nodes, p-value of τ=1.6×10−3). Given that τ can be influenced by a few small p-values, we also calculated the probability of observing k p-values less than α=0.05, which, as a binomial distribution, is more sensitive to larger p-values. This also indicated that the Cdkn1a−/− 2D-DIGE targets were least supported by coexpression with the Apc-Cdkn1a network, with the p-values separated by two orders of magnitude again (p-value for coexpression of Cdkn1a−/− targets with differentially expressed network nodes was 3.3×10−5; p-value for coexpression of Apc1638N+/− targets was 3.1×10−7).
Finally, knowing that the use of a single probe per gene is often misleading, the above calculations for proteomic coexpression were also performed using all microarray probes for the network nodes and the proteomic targets. By this approach, the network as a whole was strongly coexpressed in either mutant (for differentially expressed probes, excluding those belonging to Apc or Cdkn1a, p-value of τ<1×10−20). However, for ease of interpretability and visualization, we discuss the results of the analysis using only the maximum intensity probe per gene.
As underscored by Wood et al. , colorectal cancer is the product of mutations in multiple genes operating simultaneously. Though tumors differ at the genetic level, their phenotypes intersect at histopathologic levels and, in our view, at the molecular level, as well, implying that the connections between unique sets of mutations may often merge at downstream signalling hubs. Thus, the reduction of genetic heterogeneity into clinically meaningful biological networks could have an impact in the context of personalized medicine. As a step towards this goal, we developed a computational framework capable of predicting functional connections between genes mutated in cancer, and we applied our methodology to define a network between Apc and Cdkn1a.
Networks, as abstractions of underlying molecular phenomena, offer the hope of distilling system-level structure from biological complexity. Given the many degrees of freedom in the underlying datasets (PPIs, microarrays, et al.), however, numerous network structures are possible for a particular biological context, and candidate networks are often evaluated solely based on topological significance (e.g. the G-score of MetaCore ). Yet, the value of a network may not be reflected topologically; for example, a highly-connected hub may not be highly active in a particular model system. If a network model truly reflects the underlying biology, then – from an engineering standpoint – perturbations of the underlying system should be manifest in and around the predicted network. Specifically, since a network connecting two cancer genes represents cross-connections between signaling pathways, one would expect that perturbations along a pathway would result in altered regulation of network-associated nodes.
A systems-based approach required to biologically evaluate network coherence, however, is not immediately amenable to the tools of classical molecular biology, which are designed to target single molecules or, at most, a few at a time. As an alternative, we outline the use of in vivo single node perturbations – by way of mouse models with targeted inactivation of specific loci – followed by gene and protein expression analysis to gauge systems-level effects. For further investigation, numerous mouse models for cancer biology are already available, and these resources can be productively mined to expand our understanding of cancer gene networks. Though we have demonstrated the value of biologically testing network predictions using 2D-DIGE and microarray data, many other types of screening tools could also be employed to test the functional coherence of predicted networks.
Due to the differences in coverage of proteomic and gene expression data, different approaches were required to probe the potential functional coherence of the network: V2 – a multivariate statistic – was used to gauge the effect of single node perturbations on mRNA-levels of the Apc-Cdkn1a network, while, due to reduced coverage, relational maps – physical and coexpression – were required to assess the effect of driver gene mutations on the proteome. We found that the Apc-Cdkn1a network was supported by the joint differential expression of mRNA in two different network perturbations, with stronger differential expression being observed in the crypts (Figure 3). The V2 statistic is presented in a framework that is extendable to multiple network perturbations – a feature that proves necessary in evaluating the biological coherence of networks, as small mRNA-level effects of an individual perturbation may fail to lend adequate support for a predicted network; coupling multiple perturbations together via the V2 matrix allows the integrity of the network to be assessed via a biologically multidimensional approach. It should be noted that, in testing the mRNA-level support for an individual network, a “self-contained” hypothesis is necessary, embodied by null distributions – such as the permutation null used here – modeling the population from which the samples (mice) were drawn; gene randomization methods, on the other hand, compare network expression patterns to stochastically chosen gene groups, which are bound to have a different and/or reduced covariance structure (especially without incorporating network structure to generate the null gene sets), leading to overinflated significance values .
After applying V2 to the mRNA data, we found that the network was better able to capture joint differential expression in the crypts than in the villi, suggesting that oncogenic transformations are initiated in the crypts by the network genes and then transduced to downstream targets in the villi. This is reflected in Figure 3, where the t-statistic of individual nodes is colored for both mouse models, and more nodes are seen to be brightly colored (i.e. highly differentially expressed) in the crypts. Interestingly, a large contingent of oncogenes – ERBB3, JAK2, MAP2K1, MAPK8 – are clearly downregulated in the crypts, while their expression levels diminish considerably in the villi, indicating that these genes turn “off” during the crypt-to-villus transition – a well-known feature of this biological compartment . Though the downregulation of oncogenes may appear counterintuitive, it is to be expected in these particular mouse models, where the tumorigenic phenotype is mild and the tissue has been harvested in the pre-neoplastic regime. Before the onset of tumors, the downregulation of oncogenes represents a homeostatic reflex of the tissue to the genetic perturbations, i.e. protective downregulation of oncogenes to compensate for the loss of Cdkn1a or Apc. In addition to genes turning “off” in the crypt-to-villus transition, several genes appear to switch their pattern of expression entirely. In particular, CDK3, CAV2, EGFR, and FGFR1 exhibit this behavior, which suggests that they play two different roles in the two compartments.
For visualization, we show only the probe with maximum intensity across all samples for each gene. Given the extent of alternative splicing and array manufacturing variation, however, a single probe can be misleading. Hence, for calculation of statistical significance, we use all probes to model a single gene – a more robust approach that is amenable to matrix-based calculations (such as V2). We concede that, since each gene is represented by a different number of probes, genes with many probes contribute proportionally more weight to the final V2 statistic. This is a useful feature, however, as we have more confidence about the true behavior of these well-probed genes – many of which are well-studied and important in cancer, such as Egfr (7 probes) and Mapk8 (5 probes) – and, thus, they deserve greater weight than the highly variable, single-probe genes.
Proteomic data, however, requires different analytical considerations, as the protein levels of the network nodes may not be directly measured in a given proteomic experiment. To make inferences about our network-based hypothesis, we used two different mappings: one based on physical interactions, and another based on mRNA correlations. From Figure 4, it is clear that the proteins measured in the 2D-DIGE experiment are not merely a random sampling from the proteome. Rather, they are physically close to (i) the hypothesized network as a whole (p-value of τ<1×10−15) and (ii) several individual signaling molecules (individual p-values<0.005). Specifically, the 2D-DIGE targets from the Apc1638N+/− and Cdkn1a−/− experiments have significant physical proximity to CTNNB1, FGFR1, ERBB3, CAV2, and CDKN1A itself. The tight physical proximity of the predicted network nodes to experimentally measured targets suggested that the signaling molecules more proximal to the mutations may regulate the proteomic targets, and we used mRNA-level coexpression to examine this relationship further. While coexpression relationships abound between the Apc1638N+/− 2D-DIGE targets and the Apc-Cdkn1a network nodes, this is less significant for the Cdkn1a−/− targets (p-value of τ=3.3×10−3). Taken together with the results of the mRNA analysis, this suggests that the hypothesized network nodes more effectively capture Cdkn1a−/− signaling at the mRNA level rather than at a proteomic level, whereas the opposite is true of Apc signaling.
While the network as a whole showed differences in the level of proteomic coexpression, two differentially expressed nodes – MAP2K1 and AKT1 – were significantly coexpressed with the measured proteome in both network perturbations (Figure 5). Interestingly, AKT1 was also found to be closely physically associated with the proteomic targets in Cdkn1a−/−, while MAP2K1 was physically associated with Apc1638N+/− 2D-DIGE targets (Figure 4). As we know that mRNA coexpression can provide evidence regarding the regulatory role of proteins , the mutual discovery of MAP2K1 and AKT1 in the two network perturbations – via both coexpression and physical connectivity to the perturbed proteome – suggests that these two proteins may serve as intersection points of Apc and Cdkn1a signaling. Also of interest is the observation that coexpression connections and physical connections tend to associate different subsets of the proteomic targets, as the more physically distant proteomic targets (e.g. SULT1B1, OTC, KRT19) are also the ones that tend to be coexpressed with multiple network nodes. Not only does this illustrate that physical and coexpression maps capture different dimensions of biological function, but it also illustrates the necessity of using both maps to provide complementary information in evaluating the molecular context of network hypotheses.
While the 2D-DIGE studies revealed many differentially expressed proteins, annexin A2 (ANXA2) was among the most highly ranked in its physical proximity to the hypothesized network. At the protein level, ANXA2 was upregulated in both mouse models of colon cancer (see Table S1). From studies of prostate cancer , ANXA2 upregulation is expected since in vitro experiments indicate that overexpression promotes a more invasive, proliferative cell phenotype. Though ANXA2 had high mRNA expression in one population of human colorectal tumors , it is downregulated in some populations of human prostate tumors  and colorectal cancer cell lines . Since activation of either Apc or Cdkn1a signaling leads to upregulation of ANXA2 in our studies, activation of alternative or repressive pathways may lead to such unexpected downregulation of ANXA2 in some tumors.
In addition to ANXA2, the mutant mice exhibited other protein-level alterations that can potentially contribute to tumorigenesis. Elongation factor E2 (EEF2), for example, was found to be upregulated in both mouse models, and its tumorigenic potential in gastrointestinal  and breast  cancers is well known. Though drugs inhibiting the EEF2 pathway (via its kinase) exist , and molecular chemotherapy targeting ANXA2 can also be envisioned, our network-based hypothesis suggests these changes may be controlled by specific upstream signaling molecules that integrate the information from mutated genes. Thus, in patients where levels of ANXA2 or EEF2 are elevated, molecular therapy targeting the proposed upstream network targets – such as MAP2K1 or AKT1 – may be more effective.
In conclusion, we outline a novel method for identifying networks that connect signaling pathways associated with cancer driver genes. The first step towards statistically analyzing these novel subnetworks was pursued using single node perturbations of the system in vivo, followed by network interrogation via high-throughput -omics experiments. Together, the various lines of evidence – mRNA differential expression, 2D-DIGE-target physical proximity, and 2D-DIGE-target coexpression – strengthen the hypothesis that the Apc-Cdkn1a network helps to mediate both Apc and Cdkn1a signaling. Thus, we show that using ‘omic data to test a network-based hypothesis not only allows one to assess the biological validity of in silico predictions, it also allows one to prune the hypothesis to identify molecular targets (e.g. SRC and EGFR) that are likely to integrate the various signaling pathways perturbed in cancer.
Null distributions of V2 in the crypts and villi.
(0.11 MB DOC)
Methods covering the statistical analysis of 2D-DIGE targets; the analysis of OMIM; and the construction of the filtered protein-protein interaction network.
(0.08 MB DOC)
Lists of 2D-DIGE targets identified in the intestinal crypts and villi of the Apc1638N+/− and Cdkn1a−/− mice.
(0.12 MB DOC)
Competing Interests: The authors have declared that no competing interests exist.
Funding: This work has been supported by the National Institutes of Health Grants UL1-RR024989 from the National Center for Research Resources (Clinical and Translational Science Awards) and P30-CA043703 from the Case Western Reserve University Comprehensive Cancer Center. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.