A central goal of systems biology is to elucidate the structural and functional architecture of the cell. To this end, large and complex networks of molecular interactions are being rapidly generated for humans and model organisms. A recent focus of bioinformatics research has been to integrate these networks with each other and with diverse molecular profiles to identify sets of molecules and interactions that participate in a common biological function, i.e., ‘modules’. Here, we classify such integrative approaches into four broad categories, describe their bioinformatic principles and review their applications.
Cellular organization is thought to be fundamentally modular1,2. At the molecular level, modules have been variously described as groups of genes, gene products, or metabolites that are functionally coordinated, physically interacting and/or co-regulated1-7. For example, a pioneering perspective1 on modular cell biology described a module as a distinct group of interacting molecules driving a common biological process; the ribosome, for instance, is a module that synthesizes proteins. Modules, in essence, are functional building blocks of the cell1-7.
In an effort to develop a complete map of biological modules underlying cellular architecture and function, large networks of inter-molecular interactions are being measured systematically for humans and many model species8-16. Such networks include physical associations underlying protein-protein, protein-DNA or metabolic pathways, as well as functional associations, including epistatic and synthetic lethal relationships between genes, correlated expression between genes, or correlated biochemical activities among other types of molecules (Supplementary Table 1). Numerous approaches have been advanced to mine such networks for identifying biological modules, including methods for clustering interactions and those based on topological features of the network such as degree and betweenness centrality (as reviewed5-7; see glossary). These approaches are based on the premise that modular structures such as protein complexes, signaling cascades, or transcriptional regulatory circuits display characteristic patterns of interaction5-7.
Module discovery in biological networks has been extremely powerful for elucidating molecular machineries underlying physiological and disease phenotypes5-7,17-19. Nonetheless, many challenges confound the interpretation of biological networks and their embedded modular structures. A first challenge relates to the sheer complexity of the problem at hand: it is not yet completely clear how to transform data for thousands of molecular interactions into functionally coherent models of cellular machinery. Second, technological biases in high-throughput approaches20-22 can compromise signal accuracy. For example, experimental artifacts, variability in coverage across datasets, sampling bias towards well-studied processes, limitations in screening power and inherent sensitivities in various assays can yield false positives and false negatives in interaction data23-26. Third, individual high-throughput experiments measuring a subset or type of interactions (e.g., protein-protein or protein-DNA) simply cannot expose the full interaction landscape of a cell. Finally, as molecular networks are commonly assembled in single, static experimental conditions, they grossly overlook the inherently dynamic nature of molecular interactions, which can be massively rewired during physiological or environmental shifts10,27,28. Hence, current network models reveal only partial and static snapshots of the cell.
A key strategy to address these challenges is data integration. In recent years, a rich collection of integrative methods has emerged for identification of network modules of high quality, broad coverage, and context-specific dynamics. Here, we review these integrative approaches, highlighting their logical underpinnings and biological applications. We classify integrative module discovery methods into four broad categories: identification of ‘active modules’ through integration of networks and molecular profiles, identification of ‘conserved modules’ across multiple species, identification of ‘differential modules’ across different conditions, and identification of ‘composite modules’ through integration of different interaction types. Together, these four categories encompass a wide spectrum of network integration strategies and available data types. An illustrative poster29 titled ‘Integrative Systems Biology’ was previously published and is recommended as an accompanying guide.
One of the most successful integrative approaches has been to overlay networks with molecular profiles to identify ‘active modules’. Molecular profiles of transcriptomic, genomic, proteomic, epigenomic and other cellular information are rapidly populating public databases (Supplementary Table 1). As these profiles capture dynamic and process-specific information correlated with cellular or disease states, they naturally complement interaction data, which are primarily derived under a single experimental condition. Computational integration of network and ‘omics’ profiles has thus become a popular strategy for extracting context-dependent active modules, which mark regions of the network showing striking changes in molecular activity (e.g., transcriptomic expression) or phenotypic signatures (e.g., mutational abundance) associated with a given cellular response4,30-38 (Figure 1; these regions have alternatively been described as network hotspots39,40 or responsive subnetworks41-43).
A large number of computational techniques have been developed that automate large-scale identification of active modules in an unbiased manner. Several of these methods have been packaged as publicly available application tools (Table 1). Given the rapid emergence of integrative methodologies, some effort has been made to compare their accuracy (precision), sensitivity (recall) or computational efficiency within individual method classes44-46. However, unbiased comparisons across different method classes using uniform data metrics will still need to be undertaken comprehensively47. The methods generally fall into three classes, as follows.
The first class of methods, themed SigArSearch (Significant-Area-Search)31,33,48, was previously reviewed43. Many of these methods33,41,44,48-56 descend from an early formulation, JActiveModules48 (also implemented as an application tool through the network analysis/visualization platform Cytoscape57; Table 1), which was the first to frame the active modules search task as an optimization problem. SigArSearch methods invoke three common procedural steps for module discovery (Figure 1). First, network nodes (molecules) and/or edges (interactions) are annotated with scores quantifying molecular activity, where activity is measured via molecular profiles such as gene expression levels, the most common data choice in such applications. Next, a scoring function is formulated to compute an aggregate score for each subnetwork, reflecting the overall activity of member nodes/interactions. Subsequently, a search strategy is devised to identify subnetworks with high scores, which mark active modules.
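The scoring step can be illustrated with a minimal sketch. It assumes that node activity is expressed as z-scores and that the aggregate score of a subnetwork is the sum of member z-scores divided by the square root of the subnetwork size, which is one common choice for such scoring functions. The function name `aggregate_z` and the gene labels and values are hypothetical, introduced here only for illustration.

```python
import math

def aggregate_z(node_z, members):
    """Aggregate activity of a subnetwork: sum of member z-scores / sqrt(size)."""
    members = list(members)
    if not members:
        raise ValueError("subnetwork must contain at least one node")
    return sum(node_z[n] for n in members) / math.sqrt(len(members))

# Hypothetical per-gene activity z-scores (e.g., derived from expression p-values).
node_z = {"A": 2.0, "B": 2.0, "C": -1.0, "D": 3.0}

# Adding the inactive node C lowers the aggregate score of the subnetwork.
print(aggregate_z(node_z, ["A", "B", "D"]))
print(aggregate_z(node_z, ["A", "B", "C", "D"]))
```

Dividing by the square root of the size keeps scores comparable across subnetworks of different sizes, so that a search strategy is not trivially rewarded for growing a module.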
Scoring and search of active modules present a range of computational considerations and implementations43. Different scoring functions have assumed scores on network nodes48, or edges41,58 or both59; or constrained scores by topology56 or signal content44; or prioritized by high-scoring ‘seed’ nodes60, including using strategies for computational color coding of ‘seed’ paths51,55. Active module search has proven to be a computationally difficult problem48. Hence, so-called heuristic solutions (e.g. based on greedy52,61-63, simulated annealing48 or genetic64 algorithms; Box 1) that optimize computing time by recovering high-scoring subnetworks without necessarily seeking the maximally-scoring (globally optimal) subnetworks have been widely applied. Nevertheless, exact methods that guarantee the detection of maximally scoring subnetworks, albeit at higher computational expense, have been programmed to run in fast time-scales44,45,65,66.
Simulated Annealing (SA): A probabilistic heuristic that attempts global optimization of a function in a large search space (analogous to a physical system) and aims to bring the system from an arbitrary initial state to a state of minimum energy. SA was the first heuristic to be applied to hotspot searching. In one SA framework48, sub-networks were expanded through iterative addition of active nodes (showing significant molecular activity) until no further gain in sub-network score was possible. A node was randomly selected per iteration Ti and its state toggled between active and inactive. The toggle was retained if the sub-network score increased as a result; otherwise, it was accepted according to a probability function in which η reflects the score of the highest-ranking component of a sub-network at a given iteration i.
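The toggle-and-accept loop described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the cited implementation: the acceptance rule is the standard Metropolis form exp(Δ/T), the cooling schedule is geometric, and the state score is the aggregate z-score of the best connected component of active nodes. All function names, parameters and the toy network are hypothetical.

```python
import math
import random

def components(active, adj):
    """Connected components of the subgraph induced by the active nodes."""
    seen, comps = set(), []
    for start in sorted(active):
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], set()
        while stack:
            u = stack.pop()
            comp.add(u)
            for v in adj[u]:
                if v in active and v not in seen:
                    seen.add(v)
                    stack.append(v)
        comps.append(comp)
    return comps

def best_component(active, adj, z):
    """Return (score, nodes) of the highest-scoring active component."""
    best_score, best_nodes = 0.0, set()
    for comp in components(active, adj):
        score = sum(z[n] for n in comp) / math.sqrt(len(comp))
        if score > best_score:
            best_score, best_nodes = score, comp
    return best_score, best_nodes

def anneal(adj, z, steps=2000, t_start=1.0, t_end=0.01, seed=0):
    """Toggle random nodes; keep worsening moves with Metropolis probability."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    active, score, best = set(), 0.0, (0.0, set())
    for i in range(steps):
        temp = t_start * (t_end / t_start) ** (i / steps)  # assumed geometric cooling
        node = rng.choice(nodes)
        active ^= {node}  # toggle the node between active and inactive
        new_score, comp = best_component(active, adj, z)
        delta = new_score - score
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            score = new_score
            if score > best[0]:
                best = (score, set(comp))
        else:
            active ^= {node}  # reject the move: undo the toggle
    return best

# Hypothetical toy network (a path A-B-C-D-E) with per-node z-scores.
adj = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C", "E"}, "E": {"D"}}
z = {"A": 2.5, "B": 2.0, "C": -0.5, "D": 1.8, "E": -2.0}
score, module = anneal(adj, z)
print(round(score, 3), sorted(module))
```

Accepting some score-lowering toggles early, while the temperature is high, is what lets the search escape local optima; as the temperature falls the procedure behaves increasingly like a greedy search.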
Greedy methods: Greedy algorithms make decisions that locally optimize each iterative step. For example, in one greedy-based scheme52, sub-networks were iteratively expanded from high-degree nodes until (i) the aggregate sub-network score fell below a predefined threshold, or (ii) sub-network size was saturated. Alternatively, only nodes within a fixed radius of the seed node were aggregated62. In a greedy variant of SA, the number of negative-scoring (inactive) nodes admitted per iteration was limited61.
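A greedy seed expansion of this kind might look like the sketch below. It is a simplified illustration that combines a fixed search radius with a size cap and stops as soon as no neighbor improves the aggregate score; the stopping rule, the function names (`greedy_module`, `within_radius`, `agg`) and the toy data are assumptions made for the example rather than details of the cited schemes.

```python
import math

def agg(z, nodes):
    """Aggregate z-score of a node set: sum / sqrt(size)."""
    return sum(z[n] for n in nodes) / math.sqrt(len(nodes))

def within_radius(adj, seed, radius):
    """Nodes reachable from the seed in at most `radius` hops (breadth-first)."""
    dist = {seed: 0}
    frontier = [seed]
    while frontier:
        nxt = []
        for u in frontier:
            if dist[u] == radius:
                continue
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    nxt.append(v)
        frontier = nxt
    return set(dist)

def greedy_module(adj, z, seed, radius=2, max_size=10):
    """Grow a module from the seed, adding the best-scoring neighbor each step."""
    allowed = within_radius(adj, seed, radius)
    module = {seed}
    while len(module) < max_size:
        neighbors = {v for u in module for v in adj[u]}
        candidates = (neighbors & allowed) - module
        if not candidates:
            break
        best = max(sorted(candidates), key=lambda v: agg(z, module | {v}))
        if agg(z, module | {best}) <= agg(z, module):
            break  # locally optimal: no neighbor improves the score
        module.add(best)
    return module, agg(z, module)

# Hypothetical toy network (a path A-B-C-D-E) with per-node z-scores.
adj = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C", "E"}, "E": {"D"}}
z = {"A": 2.5, "B": 2.0, "C": -0.5, "D": 1.8, "E": -2.0}
module, score = greedy_module(adj, z, seed="A")
print(sorted(module), round(score, 3))
```

In this toy case the expansion stops at {A, B} because admitting the inactive node C would lower the score, even though the active node D lies just beyond it. This illustrates why greedy searches can miss higher-scoring modules that require passing through a low-scoring node.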
Genetic Algorithms (GA): GAs mimic natural selection among members of a population to iteratively compute various combinations of solutions, selecting those with the best fitness (scores). In one GA-based hotspot detection method64, node fitness was estimated based on both molecular activity and network topology.
Exact approaches: Exact methods extract maximally scoring as well as highly scoring sub-networks in optimum timeframes44,45,65,66. One such method44 allowed fast recovery of modules by transforming the sub-network search task into the well-known prize-collecting Steiner tree (PCST) problem and solving it using integer linear programming (ILP).
Network propagation (network smoothing): These methods propagate network flow from select nodes to identify sub-networks accumulating the maximum ‘flow’ (or influence from neighboring nodes). In one such method67, an ‘influence graph’ was generated by releasing flow from cancer genes (source nodes) along interaction edges, where the weight $w$ of an edge between a pair of nodes $g_i$ and $g_k$ was given by $w(g_i, g_k) = \min(\mathrm{influence}(g_i, g_k), \mathrm{influence}(g_k, g_i))$. To identify cancer hotspots, the influence graph was decomposed into sub-graphs of connected maximum coverage, which is a computationally hard (NP-hard) problem. An alternate model of ‘enhanced influence’ was devised to reduce this complexity by multiplying edge weights ($w$) by the number of mutations associated with each interacting gene pair.
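The edge-weight definition and the ‘enhanced influence’ idea can be sketched as below, assuming the pairwise influence values have already been computed. Two choices here are assumptions made for illustration: the mutation count of a gene pair is taken as the larger of the two genes' counts, and sub-networks are obtained by dropping edges below a threshold and taking connected components. All names and numbers are hypothetical.

```python
def symmetric_weight(influence, gi, gk):
    """w(gi, gk) = min(influence(gi, gk), influence(gk, gi))."""
    return min(influence[(gi, gk)], influence[(gk, gi)])

def enhanced_weights(influence, mutations):
    """Scale each symmetric weight by a per-pair mutation count (assumed: max)."""
    genes = sorted(mutations)
    weights = {}
    for idx, gi in enumerate(genes):
        for gk in genes[idx + 1:]:
            w = symmetric_weight(influence, gi, gk)
            weights[(gi, gk)] = w * max(mutations[gi], mutations[gk])
    return weights

def components_above(weights, threshold):
    """Connected components after dropping edges lighter than `threshold`."""
    adj = {}
    for (gi, gk), w in weights.items():
        adj.setdefault(gi, set())
        adj.setdefault(gk, set())
        if w >= threshold:
            adj[gi].add(gk)
            adj[gk].add(gi)
    seen, comps = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], []
        while stack:
            u = stack.pop()
            comp.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        comps.append(sorted(comp))
    return comps

# Hypothetical asymmetric influence values and per-gene mutation counts.
influence = {("A", "B"): 0.4, ("B", "A"): 0.2, ("A", "C"): 0.05,
             ("C", "A"): 0.1, ("B", "C"): 0.01, ("C", "B"): 0.02}
mutations = {"A": 5, "B": 3, "C": 1}
weights = enhanced_weights(influence, mutations)
print(components_above(weights, threshold=0.5))
```

Taking the minimum of the two directed influences makes the weight symmetric and conservative: a pair is considered strongly linked only if each gene substantially influences the other.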
Co-clustering methods: These methods allow simultaneous clustering of interaction data and conditional profiles to identify co-regulated or correlated modules. In one bi-clustering method, cMonkey69, p-values of correlated expression ($r_{ik}$), sequence similarity ($s_{ik}$) and network connectivity ($q_{ik}$) were measured and the aggregate p-value was defined as a joint membership probability ($\pi_{ik}$). Using SA, nodes with high membership values ($\pi_{ik} \approx 1$) were iteratively aggregated; those with low values ($\pi_{ik} \approx 0$) were dropped; and those with intermediate values were added with decreasing probability per iteration (heat gradient) to identify hotspots.
The second group of methods for active module identification emulates the related concepts of diffusion flow and network propagation36,37,45,67-72. Analogous to fluid or heat flow through a system of pipes, network ‘flow’ is ‘diffused’ from nodes implicated in molecular profiles, such as differentially-expressed or known disease genes. The flow reaches outwards along network edges to subsequently identify active modules as subnetworks accumulating maximum flow.
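As a rough illustration of the diffusion idea, the sketch below uses a random walk with restart, which is one common formulation of network propagation; it is a stand-in chosen for simplicity and is not necessarily the diffusion model used by the tools cited here. The function name `propagate`, the restart probability and the toy network are assumptions.

```python
def propagate(adj, seeds, restart=0.3, iters=200, tol=1e-10):
    """Random walk with restart: iterate until the node scores stabilize."""
    nodes = sorted(adj)
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(iters):
        # Each node keeps a restart share and passes the rest to its neighbors.
        nxt = {n: restart * p0[n] for n in nodes}
        for u in nodes:
            if not adj[u]:
                continue  # assumed: no isolated nodes; their flow is not passed on
            share = (1.0 - restart) * p[u] / len(adj[u])
            for v in adj[u]:
                nxt[v] += share
        change = max(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if change < tol:
            break
    return p

# Hypothetical toy network: a path A-B-C-D with flow released from seed gene A.
adj = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
scores = propagate(adj, seeds={"A"})
for gene in sorted(scores, key=scores.get, reverse=True):
    print(gene, round(scores[gene], 4))
```

Genes close to the seed accumulate more flow than distant ones, so ranking or thresholding the resulting scores gives a simple way to delimit the neighborhood of a seed set.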
Recently, a series of bioinformatics tools including HotNet67, PARADIGM70, MEMo73 and Multi-Dendrix37 (Table 1) have incorporated propagation-based methods for network mapping of cancer mutations. These methods have proven particularly valuable for discovering mutational hotspots in human cancers67,70-74, and additionally for discriminating ‘driver’ oncogenic pathways from ‘passenger’ mutations37. For example, in one implementation of the application tool HotNet67, significantly mutated pathways in glioblastomas and adenocarcinomas were identified through network propagation of associated cancer mutation profiles. Here, diffusion flow was run on a human protein-protein network seeded from known cancer genes to map their global neighborhood of interaction. This operation translates to computing the net ‘influence’ of cancer genes on all remaining genes in the network (Box 1). The resulting ‘influence network’ (representing the full set of network connectivities surrounding cancer seed genes) was subsequently partitioned into weighted subnetworks, thresholded either by the number of patients in which they were mutated, or by the average number of somatic mutations associated per interacting gene pair in a given subnetwork, as informed by tumor sequence profiles. The highest-weighted subnetworks marked significantly mutated cancer pathways. Such strategies have become increasingly popular and data-rich due to the easy availability of genome sequence and other ‘omics’ profiles in public repositories such as The Cancer Genome Atlas75 (TCGA; http://cancergenome.nih.gov/).
Additionally, a number of propagation-based tools such as RegMod45, ResponseNet76 and NetWalker77 (Table 1) permit functional network analysis informed by transcriptomic data. For example, a network-optimization framework dubbed ResponseNet traces information flow from upstream response regulators through signaling and regulatory pathways embedded in integrated protein networks for providing pathway-based explanations for downstream transcriptional changes captured in gene expression profiles.
Network propagation methods are particularly suitable for annotation, ranking, or clustering of genes (such as disease genes) based on affiliations formed by network connectivity. In these situations, the precise architecture of a network may hold less concern. Rather, the primary motivation behind network propagation is to take advantage of the general functional proximity of genes to one another. Hence the phrase ‘network smoothing’ has come to describe such strategies.
The third group of methods employs simultaneous clustering of network interactions and the conditions under which these interactions are active, in a concept termed ‘bi-clustering’46. Clustering based on network connectivity alone has proven instrumental in defining basic principles of modular network organization7,78,79. Bi-clustering algorithms further expand these capabilities by evaluating both network connectivity and the correlation of performance across multiple biological datasets36,46,80,81. A quantitative assessment of bi-clustering methods was recently presented46. Many (bi)clustering methods have been adapted as application tools (Table 1), such as SANDY82, SAMBA83 and cMonkey69 (Box 1), that permit multiplexed data analysis by interpreting global network topology and statistics in the context of transcriptional regulatory information, differential expression profiles across multiple conditions and/or other biomedical evidence (phenotypic, sequence-based, literature, and/or clinical information).
Modules derived through such a broad spectrum of data, covering multiple levels of biological regulation, are providing increasingly comprehensive interpretation of biological systems. For example, methods have also been developed for identifying active modules within metabolic networks, in which omics or regulatory data are used to constrain the allowable metabolic fluxes through the reactions in the network. High-flux reactions (edges) are clustered together and reported as active modules. We refer the reader to recent reviews84,85 on integrative methods for modeling of metabolic networks through omics-based constraints. A version of the application tool, COBRA (constraint-based reconstruction and analysis; Table 1) permits omics-constrained analysis of genome-scale metabolic networks to predict feasible metabolic phenotypes and relevant modules under a given set of conditions86.
Active modules have been identified using a wide array of interaction types (e.g., protein-protein, regulatory and metabolic; Supplementary Table 1A) and ‘omics’ profiles (e.g., gene-expression, mutation status, RNAi phenotypes and other cellular state data; Supplementary Table 1B), any combination of which may be applied within a single module-finding application.
A great many applications have related to interpretation of omics profiles in the context of protein-protein interaction networks34,39,48,50,62,67,70,72-74,81,87. For example, a recent study72 established a comprehensive network view of molecular pathways altered in clear cell renal cell carcinoma (ccRCC) by analyzing a diverse cohort of TCGA-derived omics data including gene-expression, genome mutation, and methylation profiles in conjunction with human protein-protein interactions. The methods HotNet and PARADIGM were used to identify cancer-relevant active modules (Figure 1c), highlighting PI3K pathways and SWI/SNF chromatin remodeling complexes. Moreover, aberrant remodeling of cellular metabolism was found to recurrently affect tumor stage and severity. Similarly, employing the application program ResponseNet, yeast networks of protein-protein, metabolic and protein-DNA interactions were analyzed simultaneously with mRNA-profiling data to discover pathways responding to alpha-synuclein toxicity88.
Another study applied the JActiveModules method to detect protein-protein pathways showing dysregulated expression in human breast cancer62. Compared with individual cancer gene markers, these expression-based modules showed greater accuracy in distinguishing metastatic from non-metastatic breast cancers, demonstrating the superior power of module-based biomarkers for disease prognosis. Alternatively, co-clustering of RNAi data with protein-protein networks identified HCV-responsive modules in humans, establishing the role of the human ESCRT-III complex as an infection-permissive host factor81. Other discoveries of omics-derived modules using protein interaction knowledge have spanned a variety of model organisms, including metabolism in yeast48, drug response in Mycobacterium tuberculosis50, aging in Drosophila89, aging56 and embryogenesis in C. elegans34, and cellular responses to inflammation87, HIV infection61, or TNF-mediated stress90 in humans.
Another prominent group of applications relates to integration of omics profiles with protein-DNA interaction networks for identification of active regulatory pathways4,82,91. For example, co-clustering of protein-DNA interactions and multi-condition gene-expression profiles in yeast demonstrated widespread dynamic remodeling of transcription networks in response to diverse environmental stimuli82. It further showed that while a few transcriptional complexes act as constant “hubs” of transcription (see glossary), most appear transiently under particular conditions. In another study, differentially expressed arsenic-responsive pathways were extracted through overlay of transcriptional profiles on yeast protein-DNA networks using the JActiveModules platform91. It was found that transcriptional data recognized important transcriptional complexes in regulatory networks but not in metabolic networks, while phenotypic profiles (of arsenic sensitivity) mapped more cohesively onto metabolic networks.
Active module finding has also been applied to metabolic networks50,91-93. Constraint-based methods for analyzing metabolic networks, including the widely exploited flux balance analysis (FBA) method, predict steady-state distributions of metabolic fluxes based on various physico-chemical constraints such as rates of cellular growth and bioenergetics94. A recent variation on these methods adopts an integrative framework, whereby metabolic flux predictions are guided by omics or regulatory information (as reviewed84,85). For example, a genome-scale reconstruction of a human metabolic network (curated from literature evidence) was constrained using quantitative measures of gene and protein expression to predict tissue-specific metabolic uptake and release92. The study revealed a central role for post-transcriptional regulation in directing tissue-specific metabolic behaviors and associated metabolic diseases.
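For orientation, flux balance analysis is conventionally posed as a linear program over steady-state fluxes. The notation below is introduced here for illustration and does not appear in the original text: $S$ is the stoichiometric matrix, $v$ the vector of reaction fluxes, $c$ the weights defining the objective (for example, a growth reaction), and $v^{\min}_j$, $v^{\max}_j$ the lower and upper bounds on reaction $j$.

```latex
\begin{aligned}
\max_{v} \quad & c^{\top} v \\
\text{subject to} \quad & S v = 0, \\
& v^{\min}_j \le v_j \le v^{\max}_j \quad \text{for every reaction } j .
\end{aligned}
```

Under this formulation, one simple way to picture an omics-derived constraint is as a tightening of the bounds $v^{\min}_j$ and $v^{\max}_j$ for reactions whose enzymes are weakly expressed in the condition of interest; the cited methods differ in how exactly such constraints are derived and imposed.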
Discovery of active modules has paved the way for exciting diagnostic and therapeutic interventions. For example, active modules showing characteristic patterns of gene expression correlated with specific disease phenotypes can yield valuable biomarkers for disease classification62,95,96. Module-based biomarkers achieve greater predictive power and reproducibility over single gene markers, as demonstrated for the classification of numerous human disorders including Alzheimer’s97, diabetes36,98-100 and several forms of cancers including breast cancers45,55,62,99,101,102, ovarian cancer73,103,104, glioblastomas67,70,73,74, and others39,72,95,105,106. Because active modules can reveal pathway-centric insights reinforced by multiple lines of evidence, they naturally provide mechanistic explanations for complex traits and multi-genic diseases like cancer. Moreover, active modules can assist in discovery of drug-target pathways50,107 and predicting patient outcomes, such as response to chemotherapy55.
Biological networks undergo significant rewiring through evolutionary time, concomitant with gains, losses, or modifications in gene functions108-111. Therefore, network modules showing conservation over large evolutionary distances are likely to reflect well preserved ‘core’ functions maintained by natural selection. Discovery of such ‘conserved modules’ can address fundamental questions about biological regulation while predicting evolutionary principles shaping network architectures. Some publicly available tools for finding conserved modules are summarized in Table 1.
In one of the most fundamental approaches to identifying conservation at the network level, individual interactions have been observed to occur between orthologous gene pairs in two species, corresponding to conserved protein-protein (interologs)112 or conserved regulatory (regulogs)113 interactions. In one interesting extension of this idea, a network of co-expressed gene pairs in humans, flies, worms and yeasts was derived, and a clustering algorithm was then used to extract conserved modules underlying cell cycle regulation and other core cellular processes3.
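The interolog idea can be sketched as a lookup through an orthology map. The example assumes interactions are given as undirected gene pairs and that orthology is supplied as a dictionary from genes in one species to lists of genes in the other; self-interactions are ignored for simplicity. The function name `interologs` and all gene identifiers are hypothetical.

```python
def interologs(ppi_a, ppi_b, orthologs):
    """Interactions in species A whose orthologous pair also interacts in species B."""
    edges_b = {frozenset(edge) for edge in ppi_b}
    conserved = []
    for u, v in ppi_a:
        for ou in orthologs.get(u, ()):
            for ov in orthologs.get(v, ()):
                # Skip pairs mapping to the same gene (self-interactions ignored).
                if ou != ov and frozenset((ou, ov)) in edges_b:
                    conserved.append(((u, v), (ou, ov)))
    return conserved

# Hypothetical interaction lists for two species and a one-to-one orthology map.
ppi_a = [("a1", "a2"), ("a2", "a3")]
ppi_b = [("b2", "b1"), ("b1", "b3")]
orthologs = {"a1": ["b1"], "a2": ["b2"], "a3": ["b3"]}
print(interologs(ppi_a, ppi_b, orthologs))
```

Because the orthology map may list several genes per key, the same loop also covers many-to-many orthology, at the cost of reporting every matching pair.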
Beyond conservation of individual interactions, comparison of modules across species may reveal high overall consistency in structure and function despite lack of one-to-one correspondence at the level of individual molecules or interactions. Hence, a group of approaches have been developed to align complex network structures, paralleling advances in computational solutions for cross-species sequence comparison114. These network alignment approaches can be organized as follows:
Computational methods for network alignment have greatly advanced evolutionary comparisons of network modules. For example, local network alignment tools like PathBlast115 and NetworkBlast116 permit parallel comparisons of simple pathways (also known as linear paths) or subnetworks (also known as modules), respectively. These methods employ a common heuristic workflow whereby a merged representation of two networks, denoted the ‘network alignment graph’, is searched for conserved paths or subnetworks based on a probabilistic log-likelihood model of interaction densities.
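A stripped-down version of the alignment-graph construction is sketched below. It makes strong simplifying assumptions: each node of the alignment graph is a pair of putative orthologs, two nodes are linked only when both underlying protein pairs interact directly in their respective networks, and candidate conserved subnetworks are simply connected components above a minimum size. The probabilistic log-likelihood scoring used by the cited tools is not reproduced. All names and data are hypothetical.

```python
from itertools import combinations

def alignment_graph(ppi_a, ppi_b, ortholog_pairs):
    """Nodes are (gene_a, gene_b) ortholog pairs; edges need support in both networks."""
    edges_a = {frozenset(edge) for edge in ppi_a}
    edges_b = {frozenset(edge) for edge in ppi_b}
    nodes = sorted(set(ortholog_pairs))
    adj = {node: set() for node in nodes}
    for (a1, b1), (a2, b2) in combinations(nodes, 2):
        if a1 == a2 or b1 == b2:
            continue
        if frozenset((a1, a2)) in edges_a and frozenset((b1, b2)) in edges_b:
            adj[(a1, b1)].add((a2, b2))
            adj[(a2, b2)].add((a1, b1))
    return adj

def conserved_components(adj, min_size=3):
    """Connected components of the alignment graph with at least `min_size` nodes."""
    seen, found = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], []
        while stack:
            u = stack.pop()
            comp.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        if len(comp) >= min_size:
            found.append(sorted(comp))
    return found

# Hypothetical networks: the a1-a2-a3 path is conserved, the a3-a4 edge is not.
ppi_a = [("a1", "a2"), ("a2", "a3"), ("a3", "a4")]
ppi_b = [("b1", "b2"), ("b2", "b3")]
pairs = [("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")]
graph = alignment_graph(ppi_a, ppi_b, pairs)
print(conserved_components(graph))
```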
Network alignment has been progressively scaled for analysis of multiple (more than two) networks. For instance, fast computation of conserved modules across as many as ten species was achieved in one study117 by redefining the alignment graph in NetworkBlast and treating multiple networks as separate layers linked via common orthology (see glossary). Orthology, as in the above methods, is commonly defined based on sequence homology. However, each gene/protein may potentially harbor multiple orthologs and paralogs due to gene duplication events in any of the multiple species being compared. The resulting many-to-many correspondences between putative orthologs can introduce high computational complexity in network alignment methods, which can scale exponentially with the addition of each new species and corresponding network. To address this scalability issue when aligning graphs from multiple species, global alignment methods such as those implemented in a recent study118 and network alignment tools such as IsoRank119 and Graemlin120 allow for functional orthologs based on similar neighborhood topologies across species (i.e., the overall arrangement of interactions surrounding a gene or protein or molecule).
An important question in network evolution pertains to how evolutionary dynamics of genome modification shape network architecture over time121-123. Network alignment methods for scoring module conservation, such as MaWish124 and others, are increasingly incorporating evolutionary rates of gene deletion, insertion and/or duplication for accurate representation of the network evolution model. One study125 additionally accounted for the phylogenetic history of genes, through reconstruction of a conserved ancestral PPI network (CAPPI) from multiple species and its subsequent projection on the individual networks to identify conserved subnetworks across fly, worm and humans.
Conservation-based studies have provided fascinating insights into network evolution. For example, the identification of conserved metabolic genes and reactions across Archaea, Bacteria and Eukaryotes, followed by species clustering and simulations in the presence and absence of oxygen, evidenced that the emergence of all three domains of Life predated widespread availability of atmospheric oxygen, and that adaptability to oxygen was coupled with increased network complexity and, concurrently, increased biological complexity126.
Additionally, comparative analyses of conserved modules can supplement sequence-matching techniques for function prediction114,127-130, based on the premise that interaction partners of orthologous genes or proteins are likely to be functionally conserved as well. This was illustrated in the proof-of-principle application of NetworkBlast, where 4,645 previously uncharacterized protein functions were predicted based on their conserved interaction neighborhoods inferred based on pairwise alignment of protein-protein networks across yeast, worm, and fly116.
Evolutionary conservation can also support predictions of drug-action mechanisms: when a given drug is shown to target elements of a module that is conserved across two evolutionarily distant model organisms, the probability that the same drug also targets the corresponding conserved module in humans increases131. Furthermore, identification of evolutionarily diverged modules in pathogenic species can uncover pathogen-specific drug targets that are absent in humans132.
Molecular interactions can change dramatically in response to cellular cues, developmental stages, environmental stresses, pharmacological treatments and disease states32,101,130,133,134. Yet the inherently dynamic wiring of molecular networks remains under-explored at the systems level, as interaction data are typically measured under single conditions (e.g., standard laboratory growth media). Therefore, a number of so-called ‘differential’ network analyses (Figure 2) have adopted an experimental approach whereby biological networks are measured and compared across conditions to identify interactions and modules that are differentially present, absent or modified.
Analogous to ‘differential’ expression analyses, differential network analysis involves pair-wise subtraction of interactions mapped in different experimental conditions130. The subtractive process filters out ubiquitous interactions (so-called ‘housekeeping’ interactions130) that are common to all static conditions of interest. By selectively extracting interactions relevant to the studied condition or phenotype, this process reduces the typical complexity of static networks. Most notably, differential networks tap interaction spaces that are inaccessible to static networks, as individual interactions that may be too weak (in magnitude of interaction strength) to capture in either static condition can be identified solely based on the significance of their differential measure27,130. Such differential interactions, once identified, may be further organized into modules using a number of hierarchical or graph clustering methods47,135 or various Cytoscape57-based network analysis tools136,137.
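The subtraction step can be sketched as follows, assuming each condition provides a quantitative interaction score per gene pair. For simplicity the example keeps pairs whose score difference exceeds a fixed cutoff, whereas a real analysis would assess the statistical significance of each difference. The function name `differential_network`, the cutoff and the scores are hypothetical.

```python
def differential_network(cond_a, cond_b, threshold):
    """Keep gene pairs whose score changes by at least `threshold` between conditions."""
    diff = {}
    for pair in set(cond_a) | set(cond_b):
        # Pairs not measured in a condition are treated as score 0 (an assumption).
        change = cond_b.get(pair, 0.0) - cond_a.get(pair, 0.0)
        if abs(change) >= threshold:
            diff[pair] = change
    return diff

# Hypothetical interaction scores in an untreated and a treated condition.
untreated = {("g1", "g2"): 5.0, ("g1", "g3"): 1.5, ("g2", "g4"): 0.2}
treated = {("g1", "g2"): 5.1, ("g1", "g3"): -1.5, ("g2", "g4"): 3.0}
result = differential_network(untreated, treated, threshold=2.5)
for pair in sorted(result):
    print(pair, round(result[pair], 2))
```

In this toy example the strong but unchanged g1-g2 interaction is filtered out as a ‘housekeeping’ interaction, while g1-g3, which is weak in both conditions, is retained because its sign flips between them.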
Physical networks assembled from quantitative protein-DNA and protein-protein binding data under different conditions were some of the first to be analyzed in a differential mode. For example, utilizing standard ChIP-based assays for protein-DNA interactions in vivo (Supplementary Table 1), alterations in Transcription Factor-promoter binding following amino acid starvation10 or chemical induction of DNA-damage138 were mapped in yeast, providing insights into dynamic regulation of stress response pathways. Similar comparisons of protein–protein interactions following epidermal growth factor (EGF) treatment in yeast have shed light on EGF-dependent signaling139. A recent study140 exploring tissue-specific effects on network wiring demonstrated a profound role of tissue-regulated alternate splicing on dynamic remodeling of protein-protein networks. Using a luminescence-based mammalian interactome mapping approach (LUMIER) for measuring physical binding between experimentally chosen ‘bait’ (seed) and ‘prey’ (target) proteins, the authors mapped protein-protein interactions between normally functioning ‘prey’ proteins and several neurally-regulated ‘bait proteins’ that were genetically engineered to include or exclude specific exons with the purpose of exploring exon-dependent effects on network wiring in human cells. The study found that almost a third of neurally-regulated exons that were tested significantly modulated protein-protein interactions, and that overall, tissue-dependent exons participated in more protein-protein interactions than other proteins.
Differential analysis has also been performed across functional networks (i.e., as opposed to physical networks, see Supplementary Table 1). For instance, we applied an approach termed differential epistasis mapping (dE-MAP) to compare genetic networks induced by different types of DNA damaging agents27,141. In another example, gene co-expression networks from transcriptomic profiles of normal or prostate cancer samples were compared to identify subnetworks induced in prostate cancer142. Differential, but not static networks, in this study successfully recognized known prostate cancer-specific interactions for RAD50 and TRF2.
Similarly, metabolic networks assembled from correlated activities of liver metabolites were differentially compared between normal and diabetic conditions to identify functional regulators of diabetic dyslipidemias in humans143. It is likely that continued advances in differential network mapping and analysis will shed light on tissue-specific, spatio-temporal and dosage-dependent rewiring of biological networks.
Different types of biological interactions provide distinct, yet complementary, insights into cellular structure and function. For instance, protein-protein, regulatory and metabolic networks each reflect a different aspect of the physical architecture of a cell (Supplementary Table 1). Moreover, ‘genetic’ interactions, which quantify epistatic effects of one gene on the phenotype expressed by another, reveal functional relationships between gene pairs. A key opportunity lies in reconciling these complementary network views of the cell into cohesive models. Powerful integrative approaches aimed at identifying composite functional modules composed of multiple types of biological interactions are providing considerable advances in this direction.
One class of approaches maps ‘composite modules’ that are jointly supported by physical and genetic interactions144 (Figure 3). A common theme in these approaches13,129,145-147, implemented in the application PanGIA148 (Table 1), involves identification of overlapping clusters of physical and genetic interactions as ‘composite modules’ implicating genes acting ‘within’ a pathway. Clusters of genetic interactions bridging two different composite modules reflect inter-module dependencies running ‘between’ synergistic, compensatory or redundant pathways145. Integrative analysis of composite physical-genetic modules can reveal physical mechanisms underlying mutational phenotypes associated with genetic screens, or conversely, predict genetic dependencies between protein complexes mapped in physical binding assays. Module maps elucidating global physical-genetic interrelations have been assembled in a number of studies exploring Hsp90 signaling149, chromosomal biology13,146, RNA processing150, secretory pathways151, DNA damage response27, or global biological processes145,152.
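The distinction between ‘within’-module and ‘between’-module genetic interactions can be made concrete with a small sketch. It assumes that physical modules (for example, protein complexes) have already been defined and do not overlap; it is not the algorithm of any published tool, and all names and data are hypothetical.

```python
from collections import Counter


def classify_genetic_interactions(modules, genetic_edges):
    """Count genetic interactions falling within one module or bridging two.

    modules: dict of module name -> set of genes (assumed non-overlapping).
    genetic_edges: iterable of (gene, gene) pairs.
    """
    gene_to_module = {g: name for name, genes in modules.items() for g in genes}
    within, between = Counter(), Counter()
    for a, b in genetic_edges:
        mod_a, mod_b = gene_to_module.get(a), gene_to_module.get(b)
        if mod_a is None or mod_b is None:
            continue  # one gene is not assigned to any physical module
        if mod_a == mod_b:
            within[mod_a] += 1
        else:
            between[tuple(sorted((mod_a, mod_b)))] += 1
    return within, between


# Toy input (invented for illustration only).
modules = {"complex1": {"g1", "g2", "g3"}, "complex2": {"g4", "g5"}}
genetic_edges = [("g1", "g2"), ("g2", "g3"), ("g1", "g4"), ("g3", "g5"), ("g2", "g9")]
within, between = classify_genetic_interactions(modules, genetic_edges)
print(within, between)
```

Modules with many ‘within’ interactions are candidates for composite modules, whereas module pairs with many ‘between’ interactions suggest inter-module dependencies. A real analysis would score such enrichment against a background model rather than using raw counts.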
Integrative strategies have similarly uncovered ‘composite modules’ in signaling and regulatory networks, primarily through combined evaluation of protein-DNA (transcription factor (TF)-target) and protein-protein interactions11,59,153,154, or by additionally including genetic interactions152. In early work along these lines, composite ‘motifs’ composed of regulatory and protein-protein interactions among 2, 3 or 4 proteins were mapped and classified into distinct feed-forward loops, interacting transcriptional hubs and other logical circuits153. Such simple ‘motifs’ were thought to combine in recurrent patterns to form higher-order network ‘themes’, or complex functional modules associated with specific biological responses152. In other work along these lines154, yeast protein-protein and protein-DNA interaction networks were combined to identify 72 co-regulated protein complexes. Such co-regulated complexes are dense protein clusters (in protein-protein networks) whose members are jointly regulated by a common set of transcription factors (in corresponding protein-DNA networks). At the network level, these TF-protein co-complexes were visualized along with their regulatory relationships to the other (non-transcriptional) modules they regulate. Evolutionary comparison of these co-regulated complexes suggested that protein complexes may evolve with slower dynamics than protein-DNA transcriptional relationships. Related studies exploring co-regulated complexes in yeast have revealed cross-pathway communication between hyperosmotic, heat shock and oxidative stress response systems59, and elucidated signaling networks active during pheromone response53.
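As a rough illustration of the co-regulated complex idea, the sketch below takes a protein cluster and a TF-target map and reports the transcription factors that regulate most of the cluster. The coverage cut-off and all identifiers are assumptions made for this example; the cited studies use their own scoring schemes.

```python
def coregulating_tfs(complex_members, tf_targets, min_fraction=0.75):
    """Return TFs whose targets cover at least min_fraction of the complex.

    complex_members: iterable of proteins in one dense cluster.
    tf_targets: dict of TF -> set of target genes (from protein-DNA data).
    min_fraction: assumed coverage cut-off, not taken from the literature.
    """
    members = set(complex_members)
    if not members:
        return {}
    covered = {}
    for tf, targets in tf_targets.items():
        fraction = len(members & set(targets)) / len(members)
        if fraction >= min_fraction:
            covered[tf] = fraction
    return covered


# Toy input (invented for illustration only).
protein_complex = {"p1", "p2", "p3", "p4"}
tf_targets = {
    "TF_A": {"p1", "p2", "p3", "p4", "x1"},
    "TF_B": {"p1", "p2"},
    "TF_C": {"p1", "p2", "p3"},
}
regulators = coregulating_tfs(protein_complex, tf_targets)
print(regulators)
```

Here TF_A and TF_C qualify as shared regulators of the complex, while TF_B covers only half of its members and is excluded.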
Protein-DNA interactions have also been combined with metabolic networks to understand the effects of transcriptional regulation on biochemical output84,85,91,155-157. For instance, a method called PROM (probabilistic regulation of metabolism) was developed to facilitate automated and quantitative integration of regulatory interactions and other high-throughput data for constraint-based modeling of metabolic networks157. The method was applied for genome-scale analysis of an integrative metabolic-regulatory network model for Mycobacterium tuberculosis, incorporating information from over 2,000 TF-gene promoter interactions regulating 3,300 metabolic reactions, 1,300 expression profiles, and 1,905 deletion phenotypes from E. coli and M. tuberculosis. The method enabled powerful prediction of microbial growth phenotypes under various environmental perturbations and aided in identification of novel gene functions. Furthermore, the study isolated several transcription factor hubs (see glossary) regulating multiple target proteins in the pathogen-interactome as a strategy for uncovering promising anti-microbial drug-targets.
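The general idea of letting regulatory data constrain a metabolic model can be sketched as follows. This is a simplified illustration loosely inspired by probabilistic approaches such as PROM, not a reproduction of the published method: it estimates, from binarized expression profiles, how often a target gene is ON when its regulator is OFF, and scales a reaction's maximal flux by that probability. The function names, the binarization and the toy values are assumptions.

```python
def conditional_on_probability(tf_states, gene_states):
    """Estimate P(gene ON | TF OFF) from paired binarized expression profiles.

    Returns None when the TF is never OFF in the data, so that no
    constraint is inferred from missing evidence.
    """
    gene_when_tf_off = [g for t, g in zip(tf_states, gene_states) if not t]
    if not gene_when_tf_off:
        return None
    return sum(gene_when_tf_off) / len(gene_when_tf_off)


def scaled_flux_bound(v_max, probability):
    """Scale a reaction's upper flux bound for a simulated TF knockout."""
    return v_max if probability is None else probability * v_max


# Toy binarized profiles across eight conditions (invented for illustration).
tf_states = [1, 1, 0, 0, 0, 1, 0, 0]
gene_states = [1, 1, 0, 1, 0, 1, 0, 0]
p_on = conditional_on_probability(tf_states, gene_states)
bound = scaled_flux_bound(10.0, p_on)
print(p_on, bound)
```

In a full constraint-based analysis the scaled bound would be passed to a flux balance solver; that step is outside the scope of this sketch.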
A very recent trend has been to chain together more than one of the four integrative approaches described above to create network analysis pipelines of increasing sophistication and complexity. For example, network module-finding methods based on integration across molecular profiles and network types (e.g., for finding active modules or composite modules) have been extended across species for extracting co-functional modules that are also conserved. A multi-species and scalable framework, neXus (Network-cross(X)-species-Search)158, was developed for discovering conserved functional modules derived through parallel expression profiling in multiple species (Figure 4). Specifically, a clustering-based approach was used to extract sub-networks from functional linkage networks (incorporating a wide array of interaction and omics information) derived independently in mouse and human. Sub-networks were seeded from differentially expressed orthologues and simultaneously expanded in both species. Using programmatic constraints to threshold candidate sub-networks by network connectivity and molecular activity, conserved active sub-networks were nominated, which showed significant differential activity in stem cells relative to differentiated cells and shared similar patterns of gene expression across mouse and human. An extended version of the cMonkey framework designed for simultaneous (rather than sequential) data integration across multiple species159 (Table 1) further expands the scope of such analyses by allowing parallel evaluation of protein-protein interactions, transcriptomic data, sequence profiles, metabolic and signaling pathway models and comparative genomics from multiple species to infer conserved co-regulated modules.
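The seed-and-expand strategy described above can be sketched in a few lines. This is a deliberately simplified illustration and not the neXus algorithm itself: it assumes one-to-one orthologues, uses a single activity cut-off in place of the published connectivity and activity scores, and all identifiers are hypothetical. A gene is added only if it is connected to the growing module in both species and is active in both.

```python
def expand_conserved_module(seed, net_a, net_b, orthologs,
                            activity_a, activity_b, min_activity=1.0):
    """Grow a module from a seed gene, requiring support in both species.

    net_a, net_b: adjacency dicts (gene -> set of neighbours) per species.
    orthologs: dict mapping species-A genes to species-B genes (assumed 1:1).
    activity_a, activity_b: dicts of gene -> activity score, e.g. |log fold change|.
    Returns the module as a set of species-A gene names.
    """
    module = {seed}
    frontier = [seed]
    while frontier:
        gene = frontier.pop()
        for neighbor in net_a.get(gene, ()):
            partner = orthologs.get(neighbor)
            if neighbor in module or partner is None:
                continue
            # The orthologous edge must also exist in species B.
            connected_b = partner in net_b.get(orthologs[gene], ())
            active = (activity_a.get(neighbor, 0.0) >= min_activity
                      and activity_b.get(partner, 0.0) >= min_activity)
            if connected_b and active:
                module.add(neighbor)
                frontier.append(neighbor)
    return module


# Toy mouse (m*) and human (h*) networks, invented for illustration only.
net_mouse = {"m1": {"m2", "m3"}, "m2": {"m1", "m4"}, "m3": {"m1"}, "m4": {"m2"}}
net_human = {"h1": {"h2"}, "h2": {"h1", "h4"}, "h3": set(), "h4": {"h2"}}
orthologs = {"m1": "h1", "m2": "h2", "m3": "h3", "m4": "h4"}
activity_mouse = {"m1": 2.0, "m2": 2.0, "m3": 2.0, "m4": 2.0}
activity_human = {"h1": 2.0, "h2": 2.0, "h3": 2.0, "h4": 0.5}
module = expand_conserved_module("m1", net_mouse, net_human, orthologs,
                                 activity_mouse, activity_human)
print(sorted(module))
```

In this toy example m3 is rejected because its edge to the seed is not conserved in human, and m4 is rejected because its human orthologue is not active, leaving a conserved module of m1 and m2.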
Another recent study160 mapped global genetic networks in the fission yeast S. pombe and compared them with integrated maps of existing genetic and protein-protein networks (composite modules) in the divergent budding yeast S. cerevisiae, with the aim of identifying conserved functional modules. The authors demonstrated a hierarchical model for evolution of genetic interactions: interactions among genes whose products were in the same protein complex showed the highest degree of conservation, those involved within the same biological process showed lower but still significant conservation, whereas those participating in different biological processes were poorly conserved. Conservation of cross-pathway interactions between distinct biological processes was observed on a larger scale. Together, these observations reveal functional and evolutionary design principles underlying modular organization of cellular networks.
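The kind of comparison underlying such a hierarchical model can be sketched as a conservation rate computed per functional category. The categories follow the description above, but the function name and the toy interaction list are invented purely for illustration and do not reproduce any reported values.

```python
from collections import Counter


def conservation_by_category(interactions):
    """Fraction of conserved interactions per category.

    interactions: iterable of (category, is_conserved) pairs, one per
    genetic interaction that could be tested in both species.
    """
    totals, conserved = Counter(), Counter()
    for category, is_conserved in interactions:
        totals[category] += 1
        conserved[category] += int(is_conserved)
    return {category: conserved[category] / totals[category] for category in totals}


# Toy data (invented): 4/5, 2/5 and 1/10 interactions conserved.
interactions = (
    [("same_complex", True)] * 4 + [("same_complex", False)]
    + [("same_process", True)] * 2 + [("same_process", False)] * 3
    + [("different_process", True)] + [("different_process", False)] * 9
)
rates = conservation_by_category(interactions)
print(rates)
```

A hierarchical pattern would appear as a decreasing rate from same-complex to same-process to different-process pairs; assessing whether such differences are significant would require an appropriate statistical test.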
With continued progress in integrative bioinformatics pipelines and expansion in data-handling capabilities, a potentially very large combination of data types, conditions, species, time points and cell states should become amenable to joint evaluation for in-depth network analysis.
The past decade has witnessed explosive growth in data on biological networks9-14,16,161,162, albeit with inherent limitations24 and largely from a static perspective130. The integrative approaches reviewed here substantially increase the scope, scale and depth of network analyses by permitting joint interpretation of ensembles of biomedical information. While these strategies have greatly refined high-throughput data analysis by tackling several of its prevalent challenges, such as variability in accuracy, coverage and context-specificity, even greater power for mining biological knowledge remains to be achieved by combining such approaches. Combination strategies encompassing multiple algorithms, data types, conditions and species contexts are likely to maximize the performance, relevance and scope of module-assisted network analysis. For example, although it has not yet been attempted, it is conceivable to analyze differential networks (Approach 3) across multiple species (Approach 2) to detect conserved dynamic modules and process-specific pathways. Another challenging direction would be to study the evolution of composite modules, as it is becoming increasingly clear that different network types exhibit specific evolutionary dynamics, with, for example, regulatory interactions evolving faster than genetic, protein and metabolic networks163.
Module-based biomarkers derived through integrative network analyses also provide superior predictive performance in disease classification, especially when compared with single-gene disease markers that have been routinely annotated through genome-wide association studies (GWAS)38,62,71,72,164,165. Future work on integrative network analyses will provide further clues to pathway structures and highlight network-level dynamics underlying biological responses.
We gratefully acknowledge NIH grants P41 GM103504 and P50 GM085764 in support of this work.
Dr. Trey Ideker is Professor and Chief of Genetics at the UCSD School of Medicine. He received Bachelor’s and Master’s degrees from MIT in Computer Science and his Ph.D. from the University of Washington in Molecular Biology. Ideker is a pioneer in using genome-scale measurements to construct network models of the cell. His recent research includes mapping of networks governing the response to DNA damage and methods for network-based diagnosis of disease. Among Ideker’s accolades are the 2009 ISCB Overton Prize and features in The Scientist, Technology Review, New York Times, San Diego Union Tribune, and Forbes.
Dr. Mitra is a postdoctoral scholar in the laboratory of Dr. Trey Ideker at UCSD, Dept. of Medicine. Her research entails development and application of network-based approaches for systematic elucidation of biological and disease regulation. Her primary focus lies in delineating the network basis of cellular stress-response systems, particularly those relating to autophagy and aging. This work involves high-throughput experimental and computational pipelines for assembling large-scale maps of dynamic cellular networks. Dr. Mitra received her PhD in Genetics from the Albert Einstein College of Medicine, NY in 2007. Her graduate work explored chromosomal genetics and stem cell therapies for application against human cancers.
Dr. Anne-Ruxandra Carvunis is a postdoctoral scholar at the University of California, San Diego, where she conducts research in systems biology and evolutionary biology under the supervision of Professor Trey Ideker. She received a Bachelor’s degree in Biology, a Magistere title in Biology/Biochemistry, and a Master’s degree in Neuroscience from the Universite Paris VI and the Ecole Normale Superieure de Paris. She also holds a Master’s degree in Interdisciplinary Approaches to Life Sciences from the Universite Paris VII and the Ecole Normale Superieure de Paris. She received a PhD in Bioinformatics in 2011 from the University of Grenoble, France.
Mr. Sanath Kumar Ramesh received his Master’s degree in computer science from the University of California, San Diego. His research work in the laboratory of Prof. Trey Ideker focused on developing bioinformatics tools for functional analysis of biological networks using heterogeneous data-driven models. His current interests involve solving data-storage and other computational challenges faced in high-throughput network analyses, as well as creating network visualization platforms.