In this study, we introduced a novel network-based approach to discover driver mutations in cancer development. Compared with current approaches, it has several notable features. First, it detects critical mutations regardless of their frequency and identifies the responsive core modules within perturbed pathways or gene sets, which improves efficiency and avoids the use of irrelevant members. Second, the method is built on carefully constructed, high-quality molecular networks derived from HPRD, literature-curated, and manually screened networks. In this network, false-positive interactions are in principle further reduced by cutting the inter-GO connections and weighting the interactions by co-expression values, in contrast to networks inferred from co-expression levels alone [9] or literature-curated networks assembled from different contexts [10]. Additionally, our approach rests on the explicit hypothesis that phenotypic changes, represented by significant transcriptomic changes, respond to cancer driver mutations. Unlike other methods that integrate gene expression information only to infer modular network structures [9], we also used the differential expression levels of the modules to screen for the modules most likely influenced by drivers, to characterize those core modules, and to identify mutations enriched in them.
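The network refinement described above (cutting inter-GO connections, then weighting the remaining interactions by co-expression) can be illustrated with a minimal sketch. It assumes that an interaction is kept only when both genes share at least one GO annotation and that the weight is the Pearson correlation of their expression profiles; the function names, gene labels, term labels, and expression values are hypothetical toy data, not the pipeline used in this study.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no co-expression signal
    return cov / sqrt(var_x * var_y)


def build_weighted_network(edges, go_terms, expression):
    """Drop edges whose genes share no GO term; weight the rest by PCC.

    Assumption: "inter-GO" means the two genes have no annotation in common.
    """
    weighted = {}
    for gene_a, gene_b in edges:
        shared = go_terms.get(gene_a, set()) & go_terms.get(gene_b, set())
        if not shared:
            continue  # inter-GO connection: cut it
        weighted[(gene_a, gene_b)] = pearson(expression[gene_a], expression[gene_b])
    return weighted


# Toy inputs (hypothetical): three candidate interactions among four genes.
edges = [("G1", "G2"), ("G2", "G3"), ("G1", "G4")]
go_terms = {
    "G1": {"term_A"},
    "G2": {"term_A", "term_B"},
    "G3": {"term_B"},
    "G4": {"term_C"},
}
expression = {
    "G1": [1.0, 2.0, 3.0, 4.0],
    "G2": [2.0, 4.0, 6.0, 8.0],
    "G3": [8.0, 6.0, 4.0, 2.0],
    "G4": [1.0, 1.0, 2.0, 2.0],
}

network = build_weighted_network(edges, go_terms, expression)
for edge, weight in network.items():
    print(edge, round(weight, 3))
```

In this toy example the G1-G4 interaction is removed because the two genes share no term, while the two remaining edges carry their correlation as weight. Whether negative correlations are kept, clipped, or taken as absolute values is a design choice that this sketch leaves open.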
Our findings demonstrated a correlation between genetic mutations and phenotypic alterations at the module level rather than at the single-gene level. Such genotype-phenotype correlations have long been hypothesized but were previously probed only partially, in certain genes, e.g., EGFR, TP53, BRCA1, BRCA2, K-ras, and their pathways. This may indicate that, while impaired DNA repair pathways seem to result in mutations widely distributed over all genes, they also cause more damage, or the greatest responsiveness, in the core modules. The presence of core modules in all six cancer types suggests a potentially general mechanism, which supports the hypothesis that cells are modularly organized and that module disruption potentially causes cancer.
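One common way to quantify whether mutations concentrate in a module is a one-sided hypergeometric test. The sketch below is illustrative only: the enrichment statistic actually used is not restated in this section, so the choice of test, the function name, and the toy counts are all assumptions.

```python
from math import comb


def hypergeom_enrichment_p(total_genes, mutated_genes, module_size, mutated_in_module):
    """Probability of seeing at least `mutated_in_module` mutated genes
    when `module_size` genes are drawn at random from `total_genes`,
    of which `mutated_genes` are mutated (upper-tail hypergeometric)."""
    upper = min(mutated_genes, module_size)
    denominator = comb(total_genes, module_size)
    tail = sum(
        comb(mutated_genes, k) * comb(total_genes - mutated_genes, module_size - k)
        for k in range(mutated_in_module, upper + 1)
    )
    return tail / denominator


# Toy counts (hypothetical): 10 genes in the network, 3 of them mutated,
# and a 3-gene module in which all 3 members are mutated.
p_value = hypergeom_enrichment_p(10, 3, 3, 3)
print(p_value)
```

A small p-value here means that the module holds more mutated genes than random sampling of the network would be expected to produce; with many modules tested, a multiple-testing correction would normally follow.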
Furthermore, the robustness of these findings has also been demonstrated. On the one hand, we compared the impact of different strategies on the results. Specifically, when weighting the GO network, we used two definitions of co-expression level, PCC and SCC. The two strategies gave the same or similar results for network/module properties (Additional file 1: Table S1), module robustness (PCC: Figure ; SCC: Additional file 2: Figure S1), and mutation enrichment level (PCC: Figure ; SCC: Additional file 2: Figure S3). Together, these results suggest that the findings are stable across network weighting strategies. On the other hand, we demonstrated the robustness of our findings by analyzing ten different datasets from six cancer types, which gave consistent results.
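To make the two weighting strategies concrete, the sketch below computes both coefficients for one gene pair, assuming that PCC and SCC denote the Pearson and Spearman correlation coefficients. Spearman is computed as Pearson on average ranks; the helper names and expression values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient (PCC) of two profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation coefficient (SCC): Pearson on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy profiles (hypothetical): a monotonic but non-linear relationship.
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_b = [1.0, 4.0, 9.0, 16.0, 25.0]

pcc = pearson(gene_a, gene_b)
scc = spearman(gene_a, gene_b)
print(f"PCC = {pcc:.3f}, SCC = {scc:.3f}")
```

The two measures agree on the direction of the relationship but can differ in magnitude when it is non-linear, which is why comparing the downstream results under both definitions is a reasonable robustness check.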
The genotype-phenotype correlated changes we observed may be attributable to co-evolution at the network/module level. Proteins perform their functions by interacting with partners in a modular fashion, and modularity is thought to affect the co-evolution of proteins [33]. First, interacting proteins have been found to co-evolve to meet the structural constraints on the binding site [34]. Second, the member genes of a module would co-evolve to be co-opted for a new common function [35]. Thus, protein-protein interaction information, especially modularity, helps to establish the relationship between genotype and phenotype.
Core modules provide biological insight into cancer development. First, core modules are useful for identifying cancer drivers, as demonstrated in all three cancer types for which driver mutation data are available. Second, mutated genes in core modules tend to be hub genes and to be functionally similar. Closer links among mutated genes were found in core modules from the same cancer type. In addition, network relatedness between two different datasets was higher for colon cancer (0.242) and ccRCC (0.256) than for breast cancer (0.086) and NSCLC (0.096). This may imply a more complex development of breast cancer and NSCLC compared with colon cancer and ccRCC; alternatively, it may be due to heterogeneous histopathological features within the corresponding datasets. For breast cancer, the samples in wang05 [36] were lymph node-negative, whereas van02 [37] contained both lymph node-negative and lymph node-positive samples. For NSCLC, Sanchez10 [38] contained primary adenocarcinomas and squamous-cell carcinomas, whereas Beer02 [39] contained only primary adenocarcinomas. Third, among the mutated genes found in core modules from multiple cancer types, some, such as TP53, may play a central role in cancer pathways. Moreover, the cancer phylogenetic relationship based on the network relatedness of these genes reflects similar cellular origins across cancer types, which may be due to epigenetic factors, e.g., (1) a common mutational mechanism predisposed at an early stage of differentiation for certain cell types, or (2) similar challenges from the tumor microenvironment. This is also consistent with prior findings that tissue lineages can influence the mutational frequencies of certain oncogenes [40].
However, there is not sufficient evidence to draw conclusions about the causal relationship between mutations and expression changes, and many mutated genes within the core modules may be merely associated with them. In addition, owing to the limitations of public data, the tumor samples used for expression profiling differ from those used for genomic mutation data. Moreover, pathological conditions differ between datasets, although the results show that networks from the same cancer type, whether of the same or different pathological status, have higher network relatedness than those from different cancer types, suggesting that differences between cancer types dominated the comparison. With the rapidly increasing amount of available data, some aspects of our approach could be augmented by incorporating data from other dimensions, e.g., copy number variations or epigenetics, which could help reduce the false-positive rate and identify more explicit pathways. Meanwhile, more complete datasets for each patient under each pathological condition will become available in the future. The core modules revealed in this study are potentially valuable resources for elucidating how mutations arise, with general or specific roles in different cancer types; they provide insight into convergent cancer development in different organs and may also be clinically informative.