Colorectal cancer (CRC) is the second leading cause of cancer death in adult Americans. Interest in this complex disease is represented by a very mature body of research, much of it at the genomic level. Yet the identification and verification of proteins that have a functional role in the pathophysiology of CRC remain an important goal, as proteins directly mediate the functions dysregulated in the disease. Modern, high-throughput proteomic methods provide one way of profiling the significant changes in protein expression of tumor samples with respect to control, using tissue biopsies obtained from patients diagnosed with this disease.
Proteomic screening techniques are particularly useful for furthering the understanding of the mechanisms that underlie complex phenotypes like CRC, in that they provide information at the post-translational level. However, due to various biological and experimental constraints (e.g., ascertainment bias and physical properties of proteins), proteomic methods may screen only a limited fraction of proteins and protein isoforms present in cells and tissues. We propose that this limitation may be mitigated through the integration of proteomic data with genome-scale data sources, such as measurements of gene expression. In addition, protein-protein interaction (PPI) databases, which are rapidly growing in terms of both the quality and quantity of their annotations, provide another source of genome-scale data integration. Such integrative approaches can potentially lead to functional inference at the systems level, through identification of pathways and molecular sub-networks that are implicated in CRC.
In support of this approach, a recent review by Ideker and Sharan summarizes studies indicating that genes with a role in cancer tend to cluster together on well-connected sub-networks of protein-protein interactions. This suggests the hypothesis that the synergistic expression of multiple cancer-related genes at the level of mRNA can co-regulate the expression of proteins in their immediate “network neighborhood”. These differentially expressed proteins may be captured by expression proteomics experiments; thus, their network neighborhood should provide an ideal starting place to search for sub-networks with a possible role in the disease.
The effectiveness of network-based approaches to the identification of multiple disease markers has been demonstrated in the context of various diseases, including Huntington's disease, the inflammatory response, and human breast cancer. Furthermore, it was recently shown that “differentially expressed sub-network markers” were more accurate predictors of metastasis in breast cancer than single-gene markers. However, existing approaches generally quantify molecular expression using mRNA expression data alone, which captures post-transcriptional activity only to a limited extent. Consequently, inclusion of protein expression data in the search for sub-network markers has the potential to improve the effectiveness of systems biology approaches. However, it remains largely unknown how a network-based approach may be enhanced when starting with proteomic data.
In this paper, we propose a novel computational approach that takes into account certain topological features of the interactome, namely connectivity and proximity, for searching the neighborhoods of proteomic targets to find significant sub-networks implicated in CRC. In doing so, we partly overcome (i) the bias inherent in proteomic profiling experiments, particularly those that are gel-based, which are typically limited to capturing changes only in relatively abundant proteins, and (ii) the noise, missing data, and ascertainment bias in PPI data. This is accomplished by assessing the functional association between proteins: we quantify the statistical significance of network crosstalk through information-flow based modeling of the PPI network, and we develop a reference model that takes into account the network connectivity of proteomic targets. We hypothesize that identification of candidate sub-networks with a significant association to proteomic targets can reveal proteins that are not detected to be differentially expressed at the level of the proteome, but whose activity in the network may play a key role in maintaining the phenotype. Consequently, the proposed framework provides a means for expanding proteome expression data to infer a role for proteins that exhibit significant crosstalk to the proteomic targets. The flow of the proposed computational framework is illustrated in the schematic figure.
Schematic of an integrated, proteomics-first approach for the discovery of functional, candidate sub-networks in a disease phenotype.
A key objective of this study is to systematically elaborate a proteomics-driven approach as a sound method for inferring small sub-networks implicated in complex phenotypes, and ultimately to make these methods practically available to a wider community of researchers working in this area. For this purpose, we ground our approach on the hypothesis that the observed fold change of the proteomic targets may be associated with the synergistic dysregulation of their interacting partners at the level of mRNA. From a computational perspective, our hypothesis is based on the premise that sub-networks which exhibit significant association with the proteomic targets should also show a significant change in activity between control and cancer. To test this hypothesis, we first score each protein in the network based on its crosstalk with the proteomic targets. In order to account for noise, incompleteness of data, and ascertainment bias, we also develop novel methods for assessing the significance of these “crosstalk scores”. Then, for each proteomic target, we identify a candidate sub-network that is composed of its interacting partners with significant crosstalk scores. Subsequently, using an information theoretic measure, we evaluate the synergistic differential expression of these candidate sub-networks between control and disease, based on changes in mRNA expression obtained from microarray experiments performed on tissue biopsies collected from a cohort of patients with CRC. Finally, using the sub-networks that exhibit significant synergistic dysregulation as features, we develop classifiers to predict disease class across different data sets.
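The sub-network scoring step described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the information theoretic measure is the mutual information between a discretized sub-network activity and the class label, and that sub-network activity is the average of per-gene z-scored mRNA expression. The function names (`zscore`, `subnetwork_activity`, `discretize`, `mutual_information`), the equal-width binning, and the averaging rule are hypothetical choices, not details taken from this paper.

```python
import math
from collections import Counter


def zscore(values):
    """Standardize one gene's expression across samples (assumed normalization)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    sd = math.sqrt(var) or 1.0  # guard against constant expression
    return [(v - mean) / sd for v in values]


def subnetwork_activity(expr, genes):
    """Average the z-scored expression of the member genes, one value per sample."""
    z = [zscore(expr[g]) for g in genes]
    n_samples = len(z[0])
    return [sum(row[i] for row in z) / len(genes) for i in range(n_samples)]


def discretize(values, bins=2):
    """Equal-width binning of activity values (an illustrative choice)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0
    return [min(int((v - lo) / width), bins - 1) for v in values]


def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )
```

A candidate sub-network whose discretized activity separates control from tumor samples perfectly would score 1 bit under this sketch for balanced classes, while an activity that is independent of the label would score 0.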
The proposed computational approach for assessing functional association between proteomic targets and other proteins uses a random-walk based algorithm. Recently, Kohler et al. and Chen et al. used similar network algorithms to prioritize candidate disease genes implicated by linkage analysis in a variety of human diseases. Vanunu and Sharan developed a global, propagation-based method that exploits information on known causal disease genes and PPI confidence scores. Their method recovered known disease gene relationships more accurately than several other existing methods. In contrast to these applications, and rather than using raw scores obtained by such information flow based algorithms, we develop reference models to assess the statistical significance of these scores, with a view to identifying proteins that are significantly associated with proteomic targets. Furthermore, our biological hypothesis, which drives our approach, is that targets (proteomic or genomic) significant for the CRC phenotype may reside in or near cancer hotspots in the network, and thus present an ideal starting place to search for high-value sub-networks associated with the disease. Therefore, our computational approach does not rely on canonical disease-related genes or proteins; rather, it is a global, unbiased search that tries to identify network interactions statistically significant with respect to all targets in an experimentally-derived set.
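To make the distinction between raw information-flow scores and their statistical significance concrete, the following is a minimal sketch under stated assumptions. It uses a generic random walk with restart on an unweighted adjacency list and an empirical p-value obtained by re-running the walk from randomly drawn seed sets whose node degrees match those of the real seeds. The degree-matched sampling, the restart probability of 0.3, and all function names are illustrative assumptions; they stand in for, and should not be read as, the reference model developed in this paper.

```python
import random


def random_walk_with_restart(adj, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Iterate p <- restart * p0 + (1 - restart) * W p until convergence.

    adj maps each node to a list of neighbors; seeds is a set of start nodes.
    Nodes without neighbors simply lose their walking mass in this sketch.
    """
    p0 = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in adj}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {v: restart * p0[v] for v in adj}
        for u, neighbors in adj.items():
            if not neighbors:
                continue
            share = (1.0 - restart) * p[u] / len(neighbors)
            for v in neighbors:
                nxt[v] += share
        delta = sum(abs(nxt[v] - p[v]) for v in adj)
        p = nxt
        if delta < tol:
            break
    return p


def degree_matched_seeds(adj, seeds, rng):
    """Draw a random seed set with the same degree sequence as the real seeds."""
    by_degree = {}
    for v in adj:
        by_degree.setdefault(len(adj[v]), []).append(v)
    sampled = set()
    for s in sorted(seeds):
        pool = [v for v in by_degree[len(adj[s])] if v not in sampled]
        sampled.add(rng.choice(pool))
    return sampled


def crosstalk_p_values(adj, seeds, n_trials=200, restart=0.3, rng_seed=0):
    """Empirical p-value per node: how often a random seed set scores as high."""
    rng = random.Random(rng_seed)
    observed = random_walk_with_restart(adj, seeds, restart)
    exceed = {v: 0 for v in adj}
    for _ in range(n_trials):
        null_seeds = degree_matched_seeds(adj, seeds, rng)
        null = random_walk_with_restart(adj, null_seeds, restart)
        for v in adj:
            if null[v] >= observed[v]:
                exceed[v] += 1
    # Add-one correction keeps every p-value strictly positive.
    return observed, {v: (exceed[v] + 1) / (n_trials + 1) for v in adj}
```

Matching seed degrees in the null draws is one simple way to keep highly connected proteins from appearing significant merely because hubs receive more flow from any seed set.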
Our previous work in this area was limited in scope because we lacked access to the topology of the commercial PPI database we employed. This prevented us from assessing the importance of topology for sub-network generation, which is the primary focus of our computational approach in this study. Likewise, our network scoring and statistical hypothesis testing were greatly limited in the previous work due to incomplete access to an unpublished microarray data set. For the same reason, we were practically prevented from iteratively adjusting network search parameters in the commercial software, which would have generated a large list of candidate sub-networks for scoring.
Here we describe a new network search method for finding high-value candidate sub-networks associated with CRC. To overcome the limitations of the previous study and to permit independent evaluation of our methods, we utilize a public PPI database (HPRD) and public microarray data (Gene Expression Omnibus) to evaluate performance using two independent sets of proteomic targets obtained by 2D-PAGE that are also publicly available. We compare this result to that obtained using a set of CRC driver gene mutants as seeds for the network search. The basis for this test is the hypothesis that if mutated gene products map to cancer hotspots on the network, they would be similarly useful as seeds for our network search algorithm. To demonstrate the practical utility of our integrative approach, and to extend it beyond a purely theoretical computational framework, we validate by western blot several targets in a sub-network predicted by our method to be dysregulated, using a cohort of tissue biopsies not used in the original proteomic screen. Finally, we employ a cross-validation approach to compare the disease classification performance of the proteomic- versus genomic-derived sub-networks.
Our results show that the proposed proteomics-driven approach, as it integrates a variety of biologically relevant data, can identify significant sub-networks implicated in a complex phenotype, namely CRC. Definitions of terms used frequently in this paper are provided in the accompanying table.
Definition of terminology used frequently in this paper.