Biological systems consist of different multi-functional elements or modules that interact selectively, and often nonlinearly, to coordinately regulate complex behaviours (1
). Multiple data sources can reveal different aspects and levels of biological system function. Traditional computational or statistical approaches (2–5
), mainly focusing on one type of data source, cannot provide a ‘systems view’ or a ‘global picture’ of a complex biological system, such as cancer (1
). To achieve a greater understanding of the main features of complex biological processes or systems requires the effective integration of diverse sets of data and knowledge. Many methods have been developed to integrate different types of biological data, including combining protein–DNA interaction and gene expression data for regulatory network identification (7
) or gene set enrichment analysis for differentially expressed pathway identification (9–11
). Integrating protein–protein interaction (PPI) data with gene expression data has also been attempted for active PPI network identification (12–14
). The availability of high-dimensional microarray gene expression data and PPI data should support the identification of biologically meaningful and ‘cancer driver’-related networks for cancer studies (15
Protein–DNA interaction, PPI and/or molecular pathway data are rich in information about biological processes captured at different levels of system function. Some methods have been developed to identify significant gene sets or pathways involved in diseases or biological processes by incorporating prior biological knowledge to help understand underlying biological mechanisms. For example, gene set enrichment analysis or pathway enrichment analysis approaches (9–11
) were proposed by using ‘known’ membership information in functional gene clusters or pathways. Prior knowledge can also contain network structure information, such that PPIs, protein–DNA interactions or other knowledge from canonical signalling pathways can be conveniently represented as the edges in graphs. Based on the structure of a PPI network, identification of active subnetworks is becoming increasingly important in systems biology, as it can reveal the underlying mechanisms governing the observed changes in gene expression (12
). As such, biomarkers have been extended from traditional individual genes to a network of gene markers that reveal more biologically relevant information, often by incorporating PPI network or pathway information.
From PPI network data, several methods have been proposed to search for subnetworks with significant changes in gene expression under different conditions (12–14
). For example, Chuang et al.
) proposed a PPI network-based approach to identify biomarkers of metastasis in conjunction with breast cancer gene expression profiles. In this approach, biomarkers are not individual genes or proteins, but rather subnetworks of interacting proteins within a large human PPI network. Subnetworks identified by these methods not only suggest possible models of the molecular mechanisms underlying metastasis but they can also reveal key network hubs that are usually not detectable by examining only their differential expression. Ideker et al.
) first converted the significance value (P
-value) into a Z
-score (measuring the change in gene expression between two phenotypes), and then aggregated the Z
-scores of genes in a subnetwork to a network score for overall evaluation. A search algorithm was then implemented to find subnetworks with maximum network scores. Dittrich et al.
) further revised Ideker’s method by proposing a scoring function based on a mixture model for signal-noise decomposition and a search algorithm based on integer linear programming techniques.
These methods have achieved some success in identifying biologically relevant subnetworks, but with noticeable limitations. The foremost limitation is that genes in a PPI network are treated independently when the network score is calculated; dependency among the genes in a subnetwork is ignored during the network analysis. However, genes in a local subnetwork have functional relevance; therefore, they should form a significant subnetwork even though not all of them have significantly different gene expression values. Another limitation is that many hub genes, which are biologically important and have many interactions in a PPI network, often show little change in expression compared with their downstream genes. By selecting downstream genes rather than hub genes, the resulting subnetwork may not reveal the key upstream regulatory components of the system. Finally, because of the heterogeneity in tissue samples and the inherent noise in microarray data (17
), reproducibility of currently identified subnetworks is often low when tested on independent data sets (18
We propose a novel subnetwork identification method for network analysis of microarray data with two different phenotypes. Specifically, we present a bagging Markov random field (BMRF)-based method for subnetwork identification. The BMRF approach is built on a framework of Markov random field modelling and maximum a posteriori estimation (MRF–MAP) to integrate gene expression and PPI data. Note that an MRF model has been applied for network-based analysis to predict protein function using PPI data and has achieved some degree of success (20
). This success was largely attributed to its flexibility to represent different types of dependency in a network. A modified simulated annealing search algorithm is further implemented to avoid local maxima and reduce computational cost. Finally, an aggregation scheme, called bagging (22
), is developed to help identify confident subnetworks from bootstrap versions of resampled data. Note that a previous study has applied the bagging scheme to a Bayesian approach to infer gene regulatory networks and has achieved an improved performance over other methods (23
). Simulation experiments demonstrate the effectiveness of the proposed method. Comparing results against several benchmark methods show that our method consistently outperforms these other tools in identifying subnetworks and hub genes. For subnetwork identification in the clinical setting, we then apply our method to breast cancer data sets acquired before any drug therapy from two different conditions as follows: ‘untreated’ (patients received no systemic drug therapy) and ‘tamoxifen-treated’ (patients treated only with the anti-oestrogen tamoxifen after surgical removal of their primary breast tumours). Clinical outcome was used to define the output classes for prediction: ‘early recurrence’ (breast cancer subsequently recurred within certain years) and ‘late recurrence’ (breast cancer did not return during the follow-up period within certain years). The results show that our method can improve prediction performance with high reproducibility across different data sets and identify several important subnetworks associated with the development of ER+ breast cancer and/or tamoxifen resistance.