Abstraction of intracellular biomolecular interactions into networks is useful for data integration and graph analysis. Network analysis tools facilitate predictions of novel functions for proteins, prediction of functional interactions and identification of intracellular modules. These efforts are linked with drug and phenotype data to accelerate drug-target and biomarker discovery. This review highlights the currently available varieties of mammalian biomolecular networks, and surveys methods and tools to construct, compare, integrate, visualise and analyse such networks.
The exponential accumulation of intracellular molecular-biological data promises to advance the biomedical sciences to a stage where most of the important events that regulate mammalian cells under normal and pathophysiological conditions can be understood and tracked. This would permit the development of a new generation of personalised therapeutics as well as open doors in synthetic biology. Physicists, mathematicians and engineers are increasingly engaging in systems biology, bringing tools from different disciplines to model intracellular complexity. Modelling efforts can be divided into three categories: network inference, dynamical modelling and graph analysis. In contrast to graph analysis, network inference and dynamical modelling need quantitative details and/or large data sets to build models. The main challenge with network inference and dynamical modelling methods is that many model realisations can fit the same data. Hence, the question commonly asked is: ‘how do we really know whether the model represents the real system under investigation since there could be many alternative models that can fit the same data?’ Much of the currently available and rapidly accumulating experimental data attempting to capture intracellular regulation is qualitative, noisy, inaccurate and incomplete. Hence, validating models from an assemblage of possible models is difficult. Network integration and graph analysis, highlighted in this review, represent a practical alternative and complement to network inference and dynamical simulations.
Some of the challenges within this research domain are: ‘how to project lists of genes or proteins identified in multivariate experiments onto large-scale known intracellular interaction networks?’, ‘how to integrate different networks so they can be used as background knowledge to fill in missing gaps not captured experimentally?’ and ‘how to develop heuristics to overcome the NP-hardness of the graph search problem?’ Effective data integration with filtering, graph querying and visualisation tools are key components for success in this subfield of systems biology .
The abstract representation of interactions within a cell to networks (formally graphs) is becoming a conventional approach to deal with the large volume of data collected through emerging high-throughput technologies [6–8], and low-throughput studies reported in the research literature integrated and abstracted into networks. Different biological networks can be represented by different types of graphs (Fig. 1) . For example, protein–protein interaction networks can be represented as undirected graphs where nodes are proteins and edges represent direct physical interactions. Gene-regulatory networks can be abstracted to directed graphs where nodes are genes encoding transcription factors (or other types of proteins) and links represent transcriptional regulation. Metabolic networks can be represented as bipartite graphs where nodes are separated into two sets: enzymes and substrates . Although different graphs are used for different networks, the abstraction to networks helps with data integration . For example, Tanay et al.  used a bipartite graph to integrate different ‘omics’ data by using yeast genes as anchors. Most efforts in reconstructing in-silico regulatory networks are for model organisms, but it is recognised that mammalian cellular networks are critically needed to facilitate biomedical breakthroughs. This review will focus mostly on data sets and tools that deal with mammalian biomolecular intracellular networks.
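The three graph types described above can be sketched with plain Python data structures. This is a minimal illustration only: the function names, node labels and edge lists are hypothetical and do not come from any of the cited databases.

```python
from collections import defaultdict

def undirected(edges):
    """Protein-protein interactions: symmetric adjacency sets."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return dict(adj)

def directed(edges):
    """Gene regulation: regulator -> {target: sign of regulation}."""
    adj = defaultdict(dict)
    for source, target, sign in edges:
        adj[source][target] = sign
    return dict(adj)

def bipartite(edges):
    """Metabolism: enzymes and substrates kept as two separate node sets."""
    enzymes, substrates = set(), set()
    adj = defaultdict(set)
    for enzyme, substrate in edges:
        enzymes.add(enzyme)
        substrates.add(substrate)
        adj[enzyme].add(substrate)
    return enzymes, substrates, dict(adj)

# Hypothetical toy networks
ppi = undirected([("P1", "P2"), ("P2", "P3")])
grn = directed([("TF1", "geneA", "+"), ("TF1", "geneB", "-")])
enzymes, substrates, met = bipartite(
    [("E1", "glucose"), ("E1", "ATP"), ("E2", "ATP")]
)
```

Keeping the two node sets of the bipartite graph explicit makes it straightforward to use one set (for example genes) as anchors when integrating different data types.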
Several repositories collect, store and share experimentally determined mammalian protein–protein interactions. Such networks, stored as undirected graphs , include, for example: HPRD [14–16], MINT [17, 18], IntAct [19, 20], Reactome , DIP [22, 23] and BioGrid . Other similar databases infer mammalian protein–protein interactions using orthologs. Here, protein–protein interactions identified in lower organisms are inferred to also exist in mammalian cells. Such efforts include OPHID , HPID , IntNetDB , STRING  and POINT . Data warehouses consolidate different protein–protein databases by merging networks stored in different formats. These include, for example, UniHI [30, 31] which integrates both experimentally and computationally determined interactions; and cPath  or BioNetBuilder  which are Cytoscape  plug-ins that integrate interaction data stored in PSI-MI format . Systems such as Atlas , BioWarehouse , BIOZON [38, 39], INTEGRATOR  and the Gaggle  provide integration and querying capabilities with other types of biological data in addition to protein–protein interactions, metabolic, gene-regulation and cell-signalling networks. Several studies compared different mammalian protein–protein interactions databases to assess their overlap and coverage [31, 42].
Early efforts in reconstructing in silico protein–protein interaction networks did not consider that most interactions are mediated through structural domains. The fields of structural biology and network biology are coming closer as more attention is given to structural elements of protein–protein interaction networks. Once structural domains are considered, the domain network underlying the protein–protein interaction network can be delineated. Complementarily, databases and algorithms have been developed to predict protein–protein interactions based on structure and sequence. Additionally, databases of experimentally determined protein–protein interactions specific for a structural domain also exist. Careful analysis of protein–protein and domain–domain interactions in yeast showed that hubs can be divided into two distinct groups: single- and multi-binding hubs. Considering structure together with protein–protein interactions makes it possible to reconstruct macro-molecular complexes. For instance, Takamori et al. reconstructed synaptic vesicles after carefully measuring the size of most components. Besides synaptic vesicles, the molecular organisation of macro-molecular complexes is still mostly unknown. A first step towards assembling this 3D puzzle is identifying protein–protein interactions, their localisation and the functional relationship among the components.
In contrast to protein–protein interaction networks, the representation of cell signalling as a network captures functional relationships. Signalling networks are commonly represented as directed graphs with three types of links: activation, inhibition and neutral. Besides proteins, signalling networks also include small molecules such as calcium and cAMP. Examples of resources that store such networks are Science Signaling (previously STKE), the Cancer Cell Map (http://cancer.cellmap.org/cellmap/), KEGG [50, 51], GOLD.db and BioCarta (http://biocarta.com). Signalling networks regulate the response of cells to changes in the extracellular environment where signals, received at the cell surface by receptors, transduce information to effector proteins through cascades of coupled biochemical reactions. The most common signalling reaction is phosphorylation. Databases that record phosphorylation sites (predicted computationally or determined experimentally) are critical for tracking information flow through cell-signalling pathways. Recent efforts made substantial progress in this area [53, 54]. For example, NetworKIN is a web-based resource providing access to predicted as well as experimentally identified phosphorylation sites and the kinases responsible for the phosphorylations. Cell-signalling networks can be inferred directly from multivariate data using network inference methods such as Bayesian networks [55, 56]. Bayesian networks are used for constructing acyclic graphs based on statistical interdependencies among measured variables. Since Bayesian networks are part of the network inference modelling genre, describing them in detail is out of the scope of this review.
Directed graphs with activation/inhibition links can also be used to represent gene-regulatory networks. Here, genes are expressed as proteins that function as transcription factors (or more distal regulators) regulating the expression of other genes. Gene networks function at longer time scales compared with cell-signalling, metabolic or protein–protein interaction networks. Biotechnologies that can experimentally map gene-regulatory networks in high throughput are rapidly emerging. Network inference approaches are instrumental for building networks from perturbation or time-series data. These methods are typically applied to gene expression microarrays. ChIP-chip and ChIP-seq, comparative genomics (identifying conserved non-coding sequences as potential binding sites), or purely computational approaches that use known consensus DNA binding motifs [60, 61] can also be used to reconstruct in silico gene-regulatory networks. Several tools have been developed to integrate such data and give novice users easy access to gene-regulatory interactions. For example, MYBS, YEASTRACT, SGD [64, 65], SCPD, TRANSFAC, MAPPER, TRANSCompel, TRRD [70, 71] and SWISS-REGULON are web-based tools providing a user interface for an underlying gene-regulatory network database.
Metabolic networks are in general more complete and richer in quantitative information than protein–protein, cell-signalling and gene-regulatory networks. BioCyc and its subset database MetaCyc [74–76] are comprehensive resources for metabolic networks in many organisms. The Escherichia coli metabolic network is the most complete metabolic network in terms of experimental mapping, in silico prediction and computational analysis. It was extensively mapped by Palsson and co-workers [77–79]. A metabolic network for yeast was later similarly reconstructed by Förster et al. and was compared with the E. coli metabolic network.
Besides protein–protein, cell-signalling, gene-regulation and metabolic networks, there is a growing appreciation for non-canonical metabolites, non-protein biomolecules and non-conventional post-translational modifications that function in intracellular regulation. One example is miRNA networks. miRNAs are short (~22 nucleotide) transcripts that pair with (full-length) mRNAs of transcribed and translated genes, thereby suppressing their translation into proteins. Since these transcripts have known sequence, it is computationally simple to identify the network of interactions between miRNAs and the expressed genome. Shalgi et al. developed and analysed a network of transcription factors and miRNAs. Cui et al. used a large-scale cell-signalling network extracted manually from research literature to assess how endogenous miRNAs target and regulate components in the cell-signalling system.
A large amount of knowledge about functional regulatory interactions and the components involved in these interactions is embedded in the biomedical research literature from the past 30–40 years . Text mining is used to extract interactions using natural language processing (NLP) and information retrieval technologies . The first step in this process consists of extracting biological terms [87–89]. Protein and gene names and other biological entities are organised into dictionaries . It is important to resolve ambiguities in entity naming, for example, resolving synonymous names for proteins and genes [91–95]. Unique terms in biomedical text often represent a biological entity , whereas co-occurrence is often used to resolve ambiguity in names . A related effort is to automatically assign gene ontology (GO) annotation for biological terms [97, 98]. Taggers, such as ABNER , can be used to highlight different entities in biomedical text, and systems such as AliBaba  make use of tagged text to build networks from key terms in abstracts. iHOP [101, 102] tags biomedical text with the ability to navigate on the web from one highlighted term to another by clicking on hyperlinked terms. The most sophisticated systems, for example, GeneWays  and PathwayStudio [104, 105], use NLP to extract interactions.
Automatic literature search can be combined with data analysis of microarray gene expression profiling by identifying literature-based relationships between co-regulated genes. Text mining of disease phenotypes can be automatically linked to protein–protein interaction networks in order to identify enrichment of human phenotypes that correlate with disease genes. OMIM's morbid map is commonly used as a resource for text mining tools that attempt to integrate diseases with biomolecular networks. OMIM is an NCBI resource that mines relationships between genes and disease phenotypes. Text mining cannot be completely covered in this review. For additional information on this topic, readers may find the review by Krallinger and Valencia a helpful start.
Efforts to build biomolecular networks entail the challenges of visualisation and interoperability. There is a rapid emergence of desktop and web-based applications for pathway and network visualisation [110, 111]. Examples include VisANT [112, 113], PATIKAweb, Cytoscape, CellDesigner and AVIS, a light-weight viewer that uses the Google Gadget API to automatically visualise cell-signalling pathways. Visualisation tools support different network storage formats. For example, PIMWalker supports visualising data stored in PSI-MI [35, 118]. The systems biology markup language (SBML) was extended to provide information needed for network visualisation. Standard storage schemas such as SBML are important for interoperability.
Interoperability efforts attempt to develop standards for data sharing and exchange between isolated data sets and analysis tools. Many schemas are used to represent biomolecular intracellular networks. These include SBML, CellML, BioPAX, KGML and PSI-MI. All these formats use XML, which provides a flexible way to store data in a structured format with semantics about the data captured within the storage schema. Each storage schema listed above is geared towards handling different types of biomolecular networks. For example, PSI-MI is mostly useful for describing details about experiments, whereas SBML is useful for directly exporting networks into quantitative modelling tools such as the SBMLToolbox or others. BioPAX does not require quantitative information and is useful for network visualisation as well as data exchange. Some databases and their tools develop their own XML schemas. These include, for example, VisML developed for VisANT and KGML developed for KEGG. Although many standards exist, attention is still given to improving, consolidating and expanding them.
It is realised that interoperability efforts will greatly benefit from the development of ontologies. Ontologies are a set of terms that describe entities with encoded conceptual relationships between entities assembled and organised for specific knowledge domains. The gene ontology (GO) consortium attempts to provide controlled vocabulary and hierarchical relations for knowledge representation of function, cellular component and involvement in biological process for categorising genes and proteins [127–129]. GO is a part of a greater effort towards developing open biomedical ontologies  for a variety of biomedical and biological domains. GO has been useful for organising and extracting functional relationships between groups of genes. Genes and proteins identified experimentally are classified based on their common annotated functions. GO is now integrated with most leading gene and protein databases [131, 132].
The annotation of genes and proteins into GO terms was conducive to the development of many GO analysis tools. The common theme of these tools is identifying shared functions for groups of genes. GOLEM, GOlorize, DyGO, GObar, WEGO and GoSurfer are tools for visually exploring and analysing groups of genes using GO. BiNGO, GOToolBox, GOstat and GOTermFinder can be used to identify over-representation of GO terms in groups of genes. The most popular approach is to apply GO analysis to groups of genes that are either up-regulated or down-regulated in microarray experiments [143–146]. Tools such as DAVID and PathExpress go a step further and provide linkage to pathways using the KEGG database. The Blast2GO tool is an example of linking sequence with GO annotations.
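Over-representation of a GO term in a gene list is often assessed with a hypergeometric test. The sketch below shows that calculation in general form; it is not taken from any of the tools named above, whose statistics and multiple-testing corrections may differ. The function name and the numbers in the example are hypothetical.

```python
from math import comb

def hypergeom_pvalue(population, annotated, sample, hits):
    """Probability of seeing at least `hits` annotated genes when `sample`
    genes are drawn without replacement from `population` genes, of which
    `annotated` carry the GO term of interest."""
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total

# Hypothetical example: 20 genes in the background, 5 annotated with the
# term, and all 4 genes in the up-regulated list carry the annotation.
p_value = hypergeom_pvalue(population=20, annotated=5, sample=4, hits=4)
```

In practice one such test is run per GO term, so the resulting p-values need to be corrected for multiple testing before terms are reported as enriched.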
Ontologies were first developed for genes. Efforts are now shifting to developing ontologies for the relations between genes. GO is being extended to include interaction/pathway ontology. Interaction ontologies are less developed but are needed. Lu et al. proposed an ontology for classifying interactions. The Science Signaling database developed a database named CMADES which contains controlled hierarchical vocabularies that cover most types of reactions and their functional effects. This information can be readily converted into a standard ontology. BioPAX is a leading interaction-exchange standard in the field that uses ontologies. Once ontological relations and ontological events for links are specified, these can become a set of logical models and potentially dynamical models. For example, the INOH pathway database (http://www.inoh.org/ontology-viewer/) attempts to convert static pathways into dynamical event systems through the use of event ontologies. Systems such as HyBrow (hypotheses browser) use ontologies and experimental data to build dynamical testable hypotheses that can be used to design experiments. Although such systems are currently at a prototype level, it is expected that they will improve in the future. BioSigNet and PathwayLogic (http://www.csl.sri.com/projects/pathwaylogic/) are two other examples of logical languages developed to describe relationships between cellular components in the context of regulatory networks. Alternatively, the unified modelling language (UML) is a method from software engineering that is well accepted for developing complex software systems. To handle the inherent complexity of large-scale software systems, UML unifies eight different design views of a system before the software is implemented. This approach has been suggested as a potential language for representing and modelling biochemical networks. The above-mentioned efforts link network integration and graph analysis towards dynamical modelling.
The advantage of these approaches is their ability to embed complex reasoning, through knowledge representation, in a standard formulation for describing biomolecular regulatory networks.
Many proteins and genes do not have GO annotation. This means that these genes were identified experimentally but do not yet have an assigned function. Most computational approaches to predict function for genes use sequence similarity; but these efforts are gradually augmented by network topology-based approaches. For an excellent review on this endeavour, see . Predicting function for genes with unassigned function using protein–protein interaction networks is based on the observation that proteins that are known to interact, often share GO terms. The network-based prediction of protein function can be categorised into two groups: direct or module assisted . The direct method explores the protein–protein interaction neighbourhood around the uncategorised gene to assess the functional category most prevalent in the node's immediate neighbourhood. If a certain functional category is highly over-represented, it is used as a prediction for the function of the gene with the unassigned function. There are several algorithms that can be used for this purpose, for example, Markov random fields (MRFs). MRFs are used to identify neighbourhoods around a gene by applying a Markov random walk. A random walker, starting from the gene with unassigned function, is travelling randomly on edges and nodes from the protein–protein interaction network to visit nearby nodes with already assigned function [156, 157]. A simpler approach is to look at the enrichment of GO terms in the first-level neighbours . Chua et al.  showed that looking at different combinations of sets of first and second neighbours can significantly improve the functional prediction quality.
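The simplest direct method, scoring the GO terms found among first-level neighbours, can be sketched as follows. This is an illustrative majority vote only: the function name, protein labels and annotations are hypothetical, and published methods additionally test whether the enrichment is statistically significant.

```python
from collections import Counter

def predict_function(protein, ppi, annotations):
    """Return the GO term most frequent among the direct interaction
    partners of `protein`, or None if no neighbour is annotated."""
    counts = Counter()
    for neighbour in ppi.get(protein, ()):
        counts.update(annotations.get(neighbour, ()))
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Hypothetical toy data: X has no annotation, its neighbours do.
ppi = {"X": {"A", "B", "C"}, "Y": set()}
annotations = {
    "A": {"kinase activity"},
    "B": {"kinase activity", "transport"},
    "C": {"kinase activity"},
}
prediction = predict_function("X", ppi, annotations)
```

Extending the loop to second-level neighbours, with a lower weight for more distant nodes, is one way to approximate the combined first- and second-neighbour schemes mentioned above.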
Concerns were raised when it was found that two full-genome high-throughput yeast-2-hybrid screens that attempted to characterise protein–protein interactions in yeast showed little overlap . Protein–protein interaction networks from different sources have been compared and evaluated to identify biases, overlap and sources for false-positives and false-negatives . Comparing networks across species is useful for predicting interactions and function for proteins using orthologs . Several tools and algorithms have been developed for comparing and aligning different networks. These include NetAlign [163, 164], PathBLAST  and others [166, 167]. These tools combine interaction topology and sequence similarity to identify conserved network substructures across organisms, across networks and within the same network.
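A basic way to quantify the overlap between two interaction data sets is to treat each as a set of unordered protein pairs and compute the shared fraction. The sketch below uses the Jaccard index as one possible overlap measure; the function names and the two toy screens are hypothetical and do not reproduce the comparison methods of the cited studies.

```python
def edge_set(pairs):
    """Normalise interactions to unordered pairs, ignoring self-loops."""
    return {frozenset(pair) for pair in pairs if pair[0] != pair[1]}

def overlap(pairs_a, pairs_b):
    """Return the shared interactions and the Jaccard index of two screens."""
    a, b = edge_set(pairs_a), edge_set(pairs_b)
    shared = a & b
    union = a | b
    jaccard = len(shared) / len(union) if union else 0.0
    return shared, jaccard

# Hypothetical toy screens; ("P2", "P1") is the same interaction as ("P1", "P2").
screen_1 = [("P1", "P2"), ("P2", "P3"), ("P3", "P4")]
screen_2 = [("P2", "P1"), ("P4", "P5")]
shared, jaccard = overlap(screen_1, screen_2)
```

This only compares edges between identically named proteins. Cross-species comparison, as performed by the alignment tools listed above, additionally requires mapping orthologs before edges can be matched.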
Network motifs are small circuits of interacting components in directed graphs that are found to be highly over-represented in real networks compared with counts of the same motifs in shuffled networks created from the original networks. The different possibilities for links among a few nodes (i.e. 3–6) define different types of network motifs [168, 169]. Alon and co-workers were the first to introduce the network motif concept for analysing the topology of intracellular gene-regulatory networks. They analysed the gene-regulatory networks of Saccharomyces cerevisiae and E. coli to identify signature patterns of motifs in those networks. Przulj et al. used a similar approach to analyse protein–protein interaction networks. Instead of counting motifs, Przulj and co-workers searched for graphlets. Graphlets are similar to network motifs but are defined for undirected graphs. Network motifs identified in mammalian cell-signalling networks showed some similarities with motif patterns identified in gene-regulatory networks. The bifan motif, a four-node motif connecting two source nodes to two target nodes, was found to be the most abundant motif in most intracellular networks studied so far. This is probably due to the large number of isoforms resulting from the process of duplication–divergence.
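Counting occurrences of the bifan pattern in a directed graph can be sketched as below. This is a simplified count under stated assumptions: it counts every pair of source nodes that share two targets, without requiring that no other links exist among the four nodes, and it does not perform the comparison against shuffled networks that is needed to call a pattern a motif. The function name and the toy network are hypothetical.

```python
from itertools import combinations
from math import comb

def count_bifans(out_edges):
    """Count bifan patterns: two source nodes that both link to the same
    two target nodes. `out_edges` maps each source to its set of targets."""
    total = 0
    for a, b in combinations(sorted(out_edges), 2):
        shared = (out_edges[a] & out_edges[b]) - {a, b}
        total += comb(len(shared), 2)  # each pair of shared targets is one bifan
    return total

# Hypothetical toy network: S1 and S2 share three targets, S3 shares one.
network = {
    "S1": {"T1", "T2", "T3"},
    "S2": {"T1", "T2", "T3"},
    "S3": {"T1"},
}
bifans = count_bifans(network)
```

To assess over-representation, the same count would be repeated on many shuffled versions of the network that preserve each node's in- and out-degree, and the real count compared with that distribution.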
The concept of network motifs was found to be useful for making predictions. Albert and Albert  combined motif search algorithms with the SUGGEST machine-learning algorithm to predict interactions. Similarly, Yu et al.  used ‘defective cliques’ to predict interactions. Bu et al.  used network motifs identified in protein–protein networks to predict the function of proteins with unassigned GO functional classification.
Many real-world networks [175, 176], including intracellular networks, were shown to be organised in modules [177–180]. These modules can be identified with network clustering algorithms such as betweenness centrality clustering [175, 178, 181–184]. Betweenness centrality is computed for each node or link by counting the number of times shortest paths go through the node or link. This method allows identification of clusters by finding the nodes and links that connect clusters. Such nodes and links have relatively low connectivity but many shortest paths go through them. Simpler methods for network clustering use the shortest path length or the number of shared neighbours as the distance measure needed for finding clusters. For example, Rives and Galitski defined the distance between all pairs of nodes as a transformed shortest path $1/d^2$, where $d$ is the shortest path distance. The reason for not using $d$ directly is to emphasise shorter distances. Many more complicated methods for finding clusters in networks exist. For example, Frey and Dueck developed a method using message passing. Sen et al. used eigenmodes of the connectivity matrix and applied it to cluster the yeast protein–protein interaction network. Another approach is to compare real networks to randomly wired networks. With these methods, deviation from random connectivity towards modular structure can be identified.
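The transformed shortest-path measure, 1/d^2 with d the shortest path distance, can be computed with a breadth-first search from every node. The sketch below assumes an unweighted, undirected network stored as adjacency sets; the function names and the toy network are hypothetical, and the subsequent clustering step is not shown.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest path length (in links) from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adj.get(node, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist

def association_scores(adj):
    """Map each ordered pair of connected nodes (u, v) to 1/d^2."""
    scores = {}
    for u in adj:
        for v, d in bfs_distances(adj, u).items():
            if v != u:
                scores[(u, v)] = 1.0 / d ** 2
    return scores

# Hypothetical toy network: a linear chain A - B - C - D.
chain = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
scores = association_scores(chain)
```

Direct neighbours score 1, nodes two links apart score 0.25 and nodes three links apart score about 0.11, which shows how the transformation emphasises short distances. Pairs with no connecting path are simply absent from the result.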
Once clusters have been identified, their strength can be quantified. Radicchi et al. defined a community as strong if its nodes have more connections within the community than with other nodes in the network. This measure was inspired by the analysis of web communities. After modules have been identified from topology alone, they can be validated by observing whether the components in a module share similar GO terms [178, 189]. Instead of using GO just for validation, Lubovac et al. combined GO terms and network connectivity for module identification. Several software systems can assist non-specialists in identifying modules in networks. For example, MoNet is a Java implementation of the Girvan–Newman betweenness algorithm. MCODE is a tool that uses the concept of clustering coefficient to identify network clusters. MCL uses a Markov clustering algorithm. As different network clustering algorithms are developed, benchmarks are important for evaluating their performance.
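The strong-community criterion can be checked directly from the adjacency sets. The sketch below reads the definition as applying to every node of the community individually, which is an assumption about the wording above; the function name and the toy network are hypothetical.

```python
def is_strong_community(adj, community):
    """True if every member has more links inside the community than
    to nodes outside it."""
    members = set(community)
    for node in members:
        inside = len(adj[node] & members)
        outside = len(adj[node] - members)
        if inside <= outside:
            return False
    return True

# Hypothetical toy network: two triangles joined by the single link C - D.
adj = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C", "E", "F"},
    "E": {"D", "F"},
    "F": {"D", "E"},
}
triangle_is_strong = is_strong_community(adj, {"A", "B", "C"})
pair_is_strong = is_strong_community(adj, {"A", "B"})
```

The triangle passes because even node C, which carries the bridging link, has two internal links against one external link. The pair A, B fails because each member has as many links to C as to the other member.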
Using background knowledge encoded into biological intracellular networks such as metabolic, protein–protein interaction, gene-regulation or signalling pathways, it is possible to expand interactions around the neighbourhood of seed lists of genes or proteins, as long as they exist as nodes in the large background network [27, 196]. This concept was applied to analyse metabolic networks. Canonical signalling pathways were enriched using information from a protein–protein interaction network. The same concept was also applied to identify disease-gene modules. Asthana et al. used protein–protein interaction networks from multiple sources to fill in gaps and reconstruct protein complexes. A similar concept was applied to rank seed lists of genes based on their importance. The PageRank algorithm was applied to analyse lists of seed genes identified in microarray experiments in the context of a background protein–protein interaction network, where the genes are ranked based on their ‘importance’. Importance is defined based on the gene's degree of connectivity and also on the connectivity of the gene's direct neighbours. Morrison et al. used a similar method with a background network created using the GO database instead of using interaction networks.
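The idea of ranking nodes by their own connectivity and that of their neighbours can be illustrated with the standard PageRank power iteration. This is a sketch of plain PageRank on an undirected network only: the cited study's variant, including any use of expression values, is not reproduced, and the damping factor of 0.85, the iteration count and the toy network are assumptions.

```python
def pagerank(adj, damping=0.85, iterations=100):
    """Plain PageRank by power iteration. `adj` maps every node to the set
    of its neighbours; every neighbour must also appear as a key."""
    nodes = list(adj)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            if adj[node]:
                share = damping * rank[node] / len(adj[node])
                for neighbour in adj[node]:
                    new_rank[neighbour] += share
            else:
                # Isolated node: spread its rank uniformly over all nodes.
                for other in nodes:
                    new_rank[other] += damping * rank[node] / n
        rank = new_rank
    return rank

# Hypothetical toy network: hub H interacts with A, B and C.
star = {"H": {"A", "B", "C"}, "A": {"H"}, "B": {"H"}, "C": {"H"}}
ranks = pagerank(star)
ranked_genes = sorted(ranks, key=ranks.get, reverse=True)
```

In this toy example the hub receives the highest score and the three peripheral nodes tie, as expected from their identical connectivity.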
Projecting lists of seed nodes onto a background network can be done using different algorithms, for example: finding all shortest paths between all pairs of nodes in the seed list, finding the Steiner tree [204, 205], finding the minimum spanning tree, or expanding the neighbours around seed nodes using a random walker. Scott et al. implemented Steiner trees to connect seed lists of genes shown to be altered in microarray experiments using a background protein–protein interaction network. Steiner trees, commonly used to design telecommunication networks, are minimal trees that span the terminal (seed) nodes and may use additional intermediate nodes to connect them.
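The shortest-path option can be sketched as follows. The code connects every pair of seed nodes by one breadth-first shortest path through the background network and returns the union of those paths. It is a heuristic illustration, not a Steiner tree solver, and it keeps a single shortest path per pair rather than all of them; the function names and the toy network are hypothetical.

```python
from collections import deque
from itertools import combinations

def shortest_path(adj, source, target):
    """One shortest path from source to target as a list of nodes, or None."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for neighbour in sorted(adj.get(node, ())):
            if neighbour not in parent:
                parent[neighbour] = node
                queue.append(neighbour)
    return None

def connect_seeds(adj, seeds):
    """Union of one shortest path per seed pair: returns (nodes, edges)."""
    nodes, edges = set(seeds), set()
    for a, b in combinations(sorted(seeds), 2):
        path = shortest_path(adj, a, b)
        if path:
            nodes.update(path)
            edges.update(frozenset(step) for step in zip(path, path[1:]))
    return nodes, edges

# Hypothetical background network: seeds S1, S2, S3 all interact with X.
background = {
    "S1": {"X"}, "S2": {"X"}, "S3": {"X", "Y"},
    "X": {"S1", "S2", "S3"}, "Y": {"S3"},
}
sub_nodes, sub_edges = connect_seeds(background, {"S1", "S2", "S3"})
```

The intermediate node X is recruited into the subnetwork because it lies on the paths between seeds, whereas Y is left out. Such recruited intermediates are the candidates that seed-projection methods propose for experimental follow-up.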
Genome-wide association studies are used to identify mutations in human genes that can be linked to disease propensity [206–208]. By comparing single-nucleotide polymorphisms (SNPs) from individuals with a known common disease with those from healthy individuals of the same demographic background, SNPs associated with the disease group can be identified. Lists of genes with SNPs for a particular disease are rapidly emerging. For example, the Wellcome Trust Case Control Consortium analysed seven groups of 2000 patients having common diseases including: bipolar disorder, coronary artery disease, Crohn's disease, rheumatoid arthritis and type 1 and 2 diabetes. They found 24 genes with mutations that correlate with disease susceptibility. The cooperation between groups that engage in genome-wide association studies is critical for identifying SNPs because of the high cost of sequencing and the statistical power that requires large samples. The International HapMap consortium is a large-scale collaboration project that records and shares, through a central database, data on SNPs from four different demographically homogeneous populations [207, 208]. Databases reporting genes with their disease associations are emerging. Once disease genes are identified, disease gene lists can be projected onto biomolecular networks to identify modules that are perturbed in disease. In other words, a list of genes associated with a disease could serve as seed for graph analysis. This approach can be used to identify additional disease genes and potentially novel drug targets.
The rapid identification of disease genes allows for the construction of a human–disease/human–disease-gene bipartite network. This network is useful for identifying global relationships between different diseases. This type of network analysis can lead to the identification of functional modules of interacting disease genes and can be used to predict additional disease gene candidates [199, 210, 212]. At the same time, networks of drugs and drug targets can be developed [213, 214]. Combining and analysing such networks is a valuable initial step towards finding novel ways to reuse approved drugs and to better understand side effects.
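One way to read relationships between diseases from the bipartite disease/disease-gene network is to project it onto the disease nodes, linking two diseases whenever they share at least one gene. The sketch below shows this projection with placeholder disease and gene labels; the function name and data are hypothetical and no real associations are implied.

```python
from itertools import combinations

def project_diseases(disease_genes):
    """Link every pair of diseases that share at least one associated gene.
    Returns {(disease_1, disease_2): set of shared genes}."""
    links = {}
    for d1, d2 in combinations(sorted(disease_genes), 2):
        shared = disease_genes[d1] & disease_genes[d2]
        if shared:
            links[(d1, d2)] = shared
    return links

# Placeholder bipartite network: disease -> associated genes.
disease_genes = {
    "disease_A": {"G1", "G2"},
    "disease_B": {"G2", "G3"},
    "disease_C": {"G4"},
}
disease_links = project_diseases(disease_genes)
```

The same function applied to a drug-to-target mapping would give a drug network in which two drugs are linked when they share a target, which is one starting point for combining the two kinds of network.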
Network integration involves data mining and data standards at a bottom layer which addresses the need for interoperability. Graph analysis tools utilise the fused data sets at a top layer. Network integration and graph analysis of intracellular networks is only one effort in a broader biomedical science revolution that involves breakthroughs in genomics, structural biology and imaging  (Fig. 2). Although network integration and graph analysis deal with qualitative data, and as such, the analysis and the representation miss many important aspects of cellular regulation, this approach provides means to handle diverse and massive data sets more easily, and can produce useful predictions that can be validated experimentally . Network representation of intracellular complex systems can be integrated with networks of drugs, disease phenotypes and side effects. Identifying additional novel components that participate in pathways in mammalian cells can be fruitful for understanding disease mechanisms, for identifying new biomarkers and for discovering novel drug targets.
This research was supported by NIH Grant No. 1P50GM071558-01A27398 and start-up fund from Mount Sinai School of Medicine to A.M. A.M. would like to thank the anonymous reviewers for their very helpful comments and suggestions.