The goal of this work was to create and analyze an ensemble database that represents the superposition of machine-readable knowledge on the topologies of inflammatory networks in humans as a prelude to more detailed network analysis and mathematical modeling. Different subsets of on-line resources can be queried for gene names and interactions, and we show that it is possible to combine these data into a single SIF-compliant protein interaction network that is rich in information and amenable to Cytoscape import. Interactions vary in type, with some directed and signed and others undirected and unsigned, depending on the source of the data. The average number of interactions per node in the ensemble network is high (degree ~27) and the network displays a power-law degree distribution. The ensemble network also exhibits evidence of a bow-tie structure in which a multiplicity of pathway-specific receptors feed into a smaller set of highly interconnected intracellular kinases and signalling molecules, which then output onto a larger collection of pathway-specific transcription factors and effectors. Overall, genes from 128 pathways are present in the final network, but the majority of genes (> 50%) are pathway-specific, with fewer than 0.1% mapped onto 40+ pathways. The set of highly represented genes includes many of the cytosolic kinases lying in the middle of the bow-tie structure (PI3K/AKT, MAPK/ERK, JAK/STAT, NFκB).
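The merging step can be illustrated with a minimal sketch. It assumes each source has already been exported as SIF text (one `source relationship target [target ...]` record per line); the database names, interaction labels, and helper functions below are hypothetical and are not the pipeline used in this work.

```python
from collections import defaultdict

def parse_sif(text):
    """Yield (source, interaction_type, target) triples from SIF text."""
    for line in text.strip().splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # blank line or isolated node: no edge to record
        src, itype, targets = fields[0], fields[1], fields[2:]
        for tgt in targets:
            yield src, itype, tgt

def merge_sources(sources):
    """Map each edge to the set of databases that report it."""
    ensemble = defaultdict(set)
    for db_name, sif_text in sources.items():
        for edge in parse_sif(sif_text):
            ensemble[edge].add(db_name)
    return ensemble

def mean_degree(ensemble):
    """Average number of distinct neighbours per node, ignoring direction."""
    neighbours = defaultdict(set)
    for src, _, tgt in ensemble:
        neighbours[src].add(tgt)
        neighbours[tgt].add(src)
    return sum(len(n) for n in neighbours.values()) / len(neighbours)

# Toy input; the two "databases" are invented for illustration only.
sources = {
    "db_a": "EGF activates EGFR\nEGFR binds GRB2 SHC1",
    "db_b": "EGF activates EGFR\nGRB2 binds SOS1",
}
ensemble = merge_sources(sources)
average_degree = mean_degree(ensemble)
```

Keeping the set of supporting databases on each edge preserves provenance, so the same structure can later be used to ask how many sources agree on a given interaction.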
A striking feature of the databases from which the ensemble was assembled is that they are highly inconsistent with respect to the number of nodes and the number and identities of the interactions for a given node. For example, more than 80% of the interactions in the ensemble were specific to one database and fewer than 0.1% appeared in six or more databases (the remainder exhibited a power-law relationship to frequency). We find the root of this inconsistency to lie in the wide discrepancy in pathway annotations between databases. Even when we focused on highly studied pathways activated by EGF, TGF-β, TNF-α, and WNT ligands, we observed remarkably poor agreement (consistently less than 10%) regarding the constituents. What constitutes a “canonical” pathway therefore appears to be database (or even expert) specific. At both the biochemical and phenotypic levels, exogenous stimuli are known to exhibit profound cell- and context-specific effects [59]. Discrepancies in pathway annotations between databases may reflect this [22], but it is currently impossible to determine whether the primary problem is real biological variation, the absence of suitable controlled vocabularies, or another technical problem.
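One way to quantify agreement between databases on the constituents of a pathway is sketched below. The Jaccard index is used here as an assumed overlap measure (the text does not specify the exact metric), and the database names and gene sets are invented toy values.

```python
from itertools import combinations

def jaccard(a, b):
    """Size of the intersection divided by size of the union of two gene sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def pairwise_agreement(annotations):
    """Jaccard index for every pair of databases annotating the same pathway."""
    return {
        (d1, d2): jaccard(annotations[d1], annotations[d2])
        for d1, d2 in combinations(sorted(annotations), 2)
    }

# Hypothetical annotations of one pathway in three databases.
egf_pathway = {
    "db_a": {"EGF", "EGFR", "GRB2", "SOS1"},
    "db_b": {"EGF", "EGFR", "SHC1"},
    "db_c": {"EGFR", "PIK3CA", "AKT1"},
}
agreement = pairwise_agreement(egf_pathway)
```

Values close to zero for most pairs would correspond to the poor agreement described above; the toy sets here are deliberately small and overlap more than the real annotations do.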
Given extensive discussion about the “modularity” of biological networks [60], we asked whether the ensemble graph or the databases from which it was assembled show evidence of modularity. The simplest way to define a module is as a set of genes for which interactions among genes within the set are more frequent than interactions with genes outside of the set. Under this definition we observed that only 5% of edges in the ensemble network constituted intra-pathway interactions; the vast majority of interactions therefore crossed pathways (potentially representing sources of “cross-talk” and consistent with data arising from high-throughput interaction screens for components of MAPK, TGF-β, and TNF-α pathways [4]). Four obvious and non-exclusive explanations for these data suggest themselves: (i) biological pathways represented in the ensemble database are not modular in any meaningful sense and instead comprise closely connected networks; (ii) we cannot easily identify modularity in large networks through pathway annotation because the definition of these pathways is highly subjective and variable from one database to the next; (iii) the ensemble network contains many interactions that do not exist in reality; (iv) modularity can only be understood with respect to specific temporally restricted biological functions. The last possibility is the most interesting: while it is true that the MAP kinase cascade can be considered to be a component of a relatively well-defined enzymatic pathway that transduces signals from growth factor receptors to the cell nucleus, the organization of this cascade changes over time as receptors adapt and negative regulatory pathways are activated. Moreover, in cells exposed to a different growth factor, activation of the MAP kinase cascade can have very different biological consequences.
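The intra-pathway fraction can be computed along the following lines. This sketch assumes an edge counts as intra-pathway when its two endpoints share at least one pathway annotation, which is one plausible reading of the definition above; the edge list and annotations are toy values.

```python
def intra_pathway_fraction(edges, gene_pathways):
    """Fraction of edges whose endpoints share at least one pathway annotation."""
    intra = 0
    for a, b in edges:
        shared = gene_pathways.get(a, set()) & gene_pathways.get(b, set())
        if shared:
            intra += 1
    return intra / len(edges)

# Toy network: two small pathways joined by a single cross-talk edge.
edges = [
    ("EGFR", "GRB2"),
    ("GRB2", "SOS1"),
    ("SOS1", "TRAF2"),      # crosses from one pathway to the other
    ("TRAF2", "TNFRSF1A"),
]
gene_pathways = {
    "EGFR": {"EGF"},
    "GRB2": {"EGF"},
    "SOS1": {"EGF"},
    "TRAF2": {"TNF"},
    "TNFRSF1A": {"TNF"},
}
fraction = intra_pathway_fraction(edges, gene_pathways)
```

Because genes may belong to several pathways, the annotation is stored as a set per gene; an unannotated gene contributes only cross-pathway edges under this definition.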
While the degree of modularity among the 128 pathways annotated in the ensemble network may be low, it is statistically significant compared to what is expected by chance. This may constitute a form of “fuzzy” modularity, wherein diffuse and overlapping modules are integrated with one another and the broader cellular network, perhaps conferring flexibility in adapting to complex environmental perturbations [45]. Moreover, we observe a wide range in our statistical metric of modularity between pathway databases (ranging from P for GeneGo, to an average of P 0.5 for PANTHER), reflecting widely different curation standards. Results emerging from cancer genome sequencing projects validate the concept of pathways as functional modules. For tissue-specific and even clinically homogeneous cancer subtypes, thousands of diverse mutations have been catalogued. However, the majority of mutations can be mapped onto a limited number of canonical signal transduction pathways (TP53/RB1, PI3K/AKT, Wnt, Hedgehog, and TGF-β). Moreover, mutations within the same pathway are often functionally equivalent (exclusive), and specific combinatorial patterns of pathway activation/deactivation are required to induce transformation [61]. Viewing pathways as functional modules is thus a useful concept for integrating diverse molecular data and reducing biological complexity to simpler principles. The trick will be to learn to identify these modules in interaction graphs, perhaps by implementing automated network module detection algorithms [42], and comparing how such a priori defined modules overlap with annotated pathways.
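A label-permutation test is one common way to ask whether an observed level of modularity exceeds chance. The sketch below is an assumed stand-in for the statistical metric referred to above, which is not specified here; it simplifies by giving each gene a single pathway label, and all gene names are placeholders.

```python
import random
from itertools import combinations

def intra_fraction(edges, labels):
    """Fraction of edges joining two genes that carry the same pathway label."""
    return sum(1 for a, b in edges if labels[a] == labels[b]) / len(edges)

def permutation_p_value(edges, labels, n_perm=2000, seed=0):
    """Empirical P: how often shuffled labels give at least the observed fraction."""
    rng = random.Random(seed)
    observed = intra_fraction(edges, labels)
    genes = list(labels)
    values = [labels[g] for g in genes]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(values)  # reassign pathway labels, keeping pathway sizes fixed
        shuffled = dict(zip(genes, values))
        if intra_fraction(edges, shuffled) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)

# Toy graph: two fully connected groups of four genes plus one bridging edge.
group_a = ["A1", "A2", "A3", "A4"]
group_b = ["B1", "B2", "B3", "B4"]
edges = list(combinations(group_a, 2)) + list(combinations(group_b, 2)) + [("A1", "B1")]
labels = {g: "pathway_A" for g in group_a}
labels.update({g: "pathway_B" for g in group_b})

observed, p_value = permutation_p_value(edges, labels)
```

Shuffling labels preserves the network topology and the size of each pathway, so the null distribution reflects only the assignment of genes to pathways. A degree-preserving rewiring of the edges would be an alternative null model.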
Can pathway and interactome databases be used as tools for modeling functional experiments in specific cell types? Currently pathway databases are employed largely to generate static network maps for topological analysis and, with high-throughput genomic data, to assist in the identification of meaningful co-variation [64]. Increasingly, however, it is becoming recognized that computable models are crucial for the quantitative analysis of biological systems. The utility of computable models arises from their ability to make predictions that can be tested experimentally. A reasonable approach to building computable input–output models would involve assembling a comprehensive scaffold of molecular interactions, converting the scaffold into one or more models, and then comparing the models to various types of experimental data [10]. Qualitative formalisms such as Boolean logic appear to be effective in this role [9]. Moreover, by focusing on relatively restricted portions of interaction networks, it should also be possible to inform kinetic models of mass-action biochemistry [50]. In both cases, it is necessary to start with complete topologies [66], and both errors and omissions have profound implications for experimental design, data analysis, and model development.
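To make the idea of converting a scaffold into a computable model concrete, the following is a minimal sketch of a Boolean model with synchronous updates. The linear EGF-to-ERK wiring and the rule set are a deliberately simplified illustration, not a topology derived from the ensemble network.

```python
def step(state, rules):
    """Synchronous update: every node is recomputed from the previous state."""
    return {node: rule(state) for node, rule in rules.items()}

def simulate(state, rules, max_steps=50):
    """Iterate until a fixed point is reached or max_steps is exhausted."""
    for _ in range(max_steps):
        nxt = step(state, rules)
        if nxt == state:
            return nxt
        state = nxt
    return state

# Each rule maps the current state to the next value of one node.
rules = {
    "EGF": lambda s: s["EGF"],    # input node, held at its initial value
    "EGFR": lambda s: s["EGF"],
    "RAS": lambda s: s["EGFR"],
    "RAF": lambda s: s["RAS"],
    "MEK": lambda s: s["RAF"],
    "ERK": lambda s: s["MEK"],
}

def initial_state(ligand_present):
    state = {node: False for node in rules}
    state["EGF"] = ligand_present
    return state

stimulated = simulate(initial_state(True), rules)
unstimulated = simulate(initial_state(False), rules)
```

Comparing the predicted steady state of readouts such as `ERK` under different input conditions with measured data is the kind of test that an erroneous or incomplete topology would fail.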
Using four exemplary signalling systems (EGF, TGFB, TNF, and WNT), we show that downstream signalling kinases are connected to extracellular ligands via hundreds of alternative topologies, many of which are biochemically implausible in that they do not involve transmembrane receptors or the known topology of MAP kinase cascades. We refer to these as “bypass” edges. A number of algorithms are available for reducing such network redundancies and idiosyncrasies using topology alone [56] or using experimental data [67]. These may represent a tractable way to initiate model topologies in the absence of expert prior knowledge. We illustrate an alternate heuristic approach for utilizing interactome information to build out network complexity from a simple linear scaffold. Nonetheless, it is clear that additional research is required in this area.
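Candidate bypass routes can be flagged by enumerating ligand-to-kinase paths and checking which ones avoid every receptor. The sketch below applies only the receptor criterion (it does not check MAP kinase cascade order), and the graph, receptor set, and `max_len` cutoff are assumptions chosen for illustration.

```python
def simple_paths(graph, start, goal, max_len=8):
    """Yield loop-free directed paths from start to goal with at most max_len nodes."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == goal:
            yield path
            continue
        if len(path) >= max_len:
            continue  # extending this path would exceed the length cutoff
        for nxt in graph.get(node, ()):
            if nxt not in path:  # never revisit a node within one path
                stack.append((nxt, path + [nxt]))

def bypass_paths(paths, receptors):
    """Paths that reach the target without passing through any receptor."""
    return [p for p in paths if not receptors.intersection(p)]

# Toy directed graph with one canonical route and one direct ligand-to-kinase edge.
graph = {
    "EGF": ["EGFR", "MAPK1"],
    "EGFR": ["GRB2"],
    "GRB2": ["SOS1"],
    "SOS1": ["HRAS"],
    "HRAS": ["RAF1"],
    "RAF1": ["MAP2K1"],
    "MAP2K1": ["MAPK1"],
}
receptors = {"EGFR"}

paths = list(simple_paths(graph, "EGF", "MAPK1"))
bypass = bypass_paths(paths, receptors)
```

In a dense network the number of simple paths grows very quickly, so a length cutoff such as `max_len` is needed to keep the enumeration tractable.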