Lecture 1: Representation of biological systems as networks

Network representation of intracellular biological networks typically considers molecular components within a cell as nodes and their direct or indirect interactions as links (

1) (

Slide 2). Different types of intracellular molecular biological networks can be represented by different types of mathematical structures called graphs (

2) (

Slides 3 and 4). Some examples are (i) metabolic networks, for instance, the EcoCyc database provides access to metabolic networks across many organisms (

3) (

Slide 5); (ii) cell signaling networks (

Slides 6 to 8), such as one representing pathways in mammalian neurons (

4); (iii) kinase-substrate networks, such as the one we constructed for developing the tool kinase enrichment analysis (KEA) (

5) (

Slide 9); (iv) gene regulatory networks (

Slides 10 and 11), such as the one described in the RegulonDB database for

*Escherichia coli* (

6); (v) protein-protein interaction networks (

Slides 13 to 15), for instance, the human protein reference database (HPRD), which contains a large manually extracted data set describing mammalian protein-protein interactions (

7); (vi) epistasis interaction networks, which are networks created by connecting genes if they exhibit a genetic interaction when knocked out or down-regulated (

8) (

Slides 16 and 17); (vii) disease gene interaction networks, which connect diseases to genes that when mutated cause or contribute to the disease (

9) (

Slide 20); and (viii) drug interaction networks in which drugs are linked to their targets (

10,

11) (

Slide 21).

There are many different types of graphs, including directed or undirected graphs (

Slide 4), directed acyclic graphs (DAGs), trees, forests, minimum spanning trees, Boolean networks (

Slide 8), and Steiner trees (

12,

13). For in-depth coverage of these data structures and the algorithms used to traverse and manipulate such structures, the classic computer science text book by Cormen

*et al*. is recommended (

14). Network representation enables integration of data from many different studies into a single framework. For example, Tanay

*et al*. (

15) showed how many different types of experimental data can be integrated using genes as anchors (

Slide 22). Networks can be generated directly from time-series data or from perturbation data (

Slides 18 and 19). The topology of regulatory networks can be “reverse engineered” directly from data tables of the changing quantities of mRNA expression or protein abundance over time or under different perturbations using Bayesian networks (

16,

17) derived from advanced statistical learning techniques, or using tools, such as ARACNE, that use the concept of mutual information from information theory initially developed by Claude Shannon (

18).

From this lecture, students should have an understanding that there are different type of representations for biological networks, that networks can be based on integration of multiple sources of published information or reconstructed directly from the data, that some biological networks can be connected to diseases or drugs, and that different data sets can be integrated through the abstraction to a network representation (

Slide 25).

Lecture 2: Milestones and key concepts in network analysis

The field of graph theory, a subfield in mathematics, was first introduced by Euler, who used this approach to prove the famous Seven Bridges of Konigsberg problem in 1736 (

Slide 27). Major breakthroughs in the field of graph theory are attributed to Paul Erdos and Alfred Renyi, who studied the properties of random graphs (

Slide 28). They simply asked: What are the consequences of throwing on a table a bunch of buttons and then randomly connecting them with links (

Slide 28)? Erdos and Renyi are also famous for proving the Map Coloring problem, which answers the question, what is the minimum number of colors needed to distinctly color states on a map.

The seminal publications by Watts and Strogatz in 1998 (

23) (

Slides 29 to 31) and Barabási and Albert in 1999 (

24) (

Slides 32 to 34) popularized the notion that complex systems can be viewed as networks wherein components within a complex system can be represented as nodes and are linked through their interactions, which are called edges. This approach has been applied in many scientific disciplines, including systems biology and cell signaling research. Representing the complexity of biological regulatory systems as networks enables analysis of the network’s topology, which provides insight into the organizational principles of the cell, achieved through evolution. Network topology includes information about the general and specific properties of nodes, properties of edges, properties of the entire network (global topological properties), and modules within the network.

Properties of nodes include connectivity degree, which is the number of links for each node; node betweenness centrality (

25), which is the number of shortest paths that go through a node among all shortest paths between all possible pairs of nodes; closeness centrality, which is the average shortest path from one node to all other nodes; eigenvector centrality (

26), which is a more sophisticated centrality measure that assesses the closeness to highly connected nodes; and bioinformatic analyses of the molecules represented by the nodes, for example, their Gene Ontology annotations (

27), which describe the nodes’ function, location in the cell, and involvement biological processes. Properties of edges include edge betweenness centrality (

28), which is the number of number of shortest paths that go through an edge among all possible shortest paths between all the pairs of nodes; the types of relationship, for example, edges may represent activating or inhibiting relationships between a pair of nodes; and edge directionality, which indicates the upstream and downstream nodes connected by a particular link. Analyses based on the type of interactions, for example, phosphorylation, binding, gene regulation, or other types as specified in the

*Science Signaling* Connections Maps in the Database of Cell Signaling (

29), or those proposed by the Edge Ontology (

30) can be used to further characterize the network’s topology. Global topological characteristics of networks include connectivity distribution (

24), which is represented by a histogram showing how many nodes have a certain amount of links; characteristic path length (

23), which can be computed using Floyd-Warshall’s or Dijkstra’s algorithms and represents the average shortest path between all pairs of nodes; clustering coefficient (

23), which represents the local density of interactions by measuring the connectivity of neighbors for each node averaged over the entire network; grid coefficient, which extends the clustering coefficient from only looking at first neighbors to also examining second neighbors; network diameter, which represents the longest shortest path; and assortativity, which assesses whether nodes prefer to attach to other nodes on the basis of common nodal properties (

31). Networks can also have recurring circuits of a few nodes and their edges, which are called network motifs. Network motifs are recurring circuits composed of a few nodes and their edges (

Slide 38), and these types of circuits appear in the topology of biological regulatory networks much more frequently than they are observed in random or shuffled networks (

4,

32,

33). Some motifs are particularly important because they directly influence a system’s overall dynamics; examples include feedback loops (

34), feedforward loops (

35), bifans (

36,

37), and other types of cycles (

38). Network motifs can be identified within directed or oriented graphs or in undirected networks, which are called graphlets (

39) (

Slide 39), and a biological example of graphlets includes motifs present in protein-protein interaction networks.

Another characteristic of networks is their modularity, which represents the modules, or network clusters, which are dense areas of connectivity separated by regions of low connectivity. Modules can be found using unsupervised clustering algorithms, such as nearest neighbors clustering, Markov clustering, and betweenness centrality–based clustering, which uses nodes with high betweenness centrality and low connectivity to separate clusters (

25).

Understanding the structural organization of biological networks using these and other topology analysis measures gives clues to the evolutionary processes that may produce the observed topology of biological regulatory networks. Network evolution algorithms attempt to recreate realistic topologies on the basis of a few simple rules governing network growth. Network evolution models can help to intuitively understand the design principles embedded within complex biological regulatory networks. Some of these algorithms include rich-get-richer (

24) (

Slide 34), growth by duplication-divergence (

40) (

Slide 36), exponential growth (

24), and geometric growth (

41) (

Slide 37). Other models start with a random network, or a very regular structured network where the network gradually evolves to a realistic topology based on rules that reposition links (

23,

42) (

Slides 31 and 47).

To determine whether the topological properties observed in real networks deviate from random connectivity and are not observed by chance, topological observations can be compared to Erdos-Renyi random networks (

43) (

Slide 28) or other types of shuffled networks (

44), which serve as statistical controls.

By applying various topological analyses, general properties of biological regulatory networks emerge. Most biological molecular regulatory networks are scale-free (

Slide 32), meaning that the connectivity distribution, the distribution of edges per node, fits a power law (

24). The scale-free architecture makes the networks robust to random failures (

45). Biological networks are also small-world, meaning that they are highly clustered, with shortcuts that connect the clusters (

23) (

Slide 29). Biological networks have “party” hubs and “date” hubs (

46): Party hubs are nodes that interact with many proteins at one cellular compartment at a specific time, whereas date hubs are proteins that can be found in many places inside the cell and interact with various partners at different times. Hubs are also categorized as multisite or single-site (

47) (

Slide 40). Some cell signaling networks have a bow-like structure in which many receptors (one loop of the bow) share few downstream adaptor proteins (the knot in the bow) (

48), and these central adaptor nodes integrate information from the many receptors and then distribute the information to many effectors in multiple cell signaling pathways (the other loop of the bow) (

Slide 41). The topology of regulatory biochemical networks contains fewer feedback loops than observed in random Erdos-Renyi networks. In addition, the feedback loops tend to be nested (

38). In the topology of gene and cell signaling networks, bifan motifs are the most abundant (

32,

36), probably due to evolution through duplication and divergence (

40) (

Slide 43).

By examining the structure of nine directed networks from the real world (six biological and three human-made complex systems (

38)), we indentified, counted, and characterized the cycles (the closed loop pathways) in these networks. Most cycles in real networks are made of source and sink nodes, which deplete feedback loops (

Slide 44). Such topology contributes to optimal design for dynamical stability (

Slide 45). We also developed a network evolution model that starts with a random directed Erdos-Renyi–type network (

42). The network evolution process reassigns links if they contribute to a local increase of sources and sinks. After several evolutionary steps, the random network becomes dynamically stable and displays a power-law, scale-free, connectivity distribution. By adding a simple rule, the evolution can continue forever, maintaining the network scale-free structure, as well as displaying dynamics that are at the cusp between stability and chaos. In addition, the hubs, the highly connected nodes in the network, rise and fall in their connectivity while the global properties of the network remain (

Slide 46). Hence, the simple theoretical network model captures dynamical evolutionary features observed in real complex systems.

From this lecture, students should gain a basic understanding of the history of graph theory and how graph theory was used to characterize the topological properties of biological networks. Biological networks have distinct topologies, and this influences the system’s dynamical behavior (

Slide 48).

Lecture 3: Making predictions using network analysis

The networks described in the first lecture can be used to analyze new experimental data in the context of known biology. Different algorithms can be applied to predict or find “connections” between lists of “seed” nodes using prior knowledge of molecular interactions. For example, genes or proteins that are differentially regulated under control versus treatment conditions could be used as seed nodes for building functional networks using information from prior publications (

49,

50). The algorithms that can be used to find connections between genes, proteins, or both, using information from background networks to form subnetworks, include mean-first-pass-time (MFPT) (

51), nearest neighbor expansion, Steiner trees (

12,

13,

52), and the shortest path search algorithm (

53). We used the shortest path approach to predict key regulators of neurite outgrowth in Neuro2A cells stimulated with a cannabinoid agonist (

50) (

Slides 53 to 61) and to identify a previously unknown Noonan syndrome gene (

54) (

Slides 62 and 63). We also applied the Steiner tree approach to minimally connect up-regulated genes in breast cancer, as well as a Steiner tree to connect up-regulated genes in colorectal cancer (

13) (

Slide 64). The Steiner tree approach was also used by Huang and Fraenkel to connect signaling pathways to transcription factors (

52) (

Slide 65), and others have used a similar approach to connect commonly expressed genes in stem cells (

55) (

Slide 66). Such methods are useful for finding functional connections between nodes and predicting additional nodes that had not been detected experimentally. Similarly, network analyses that utilize background knowledge interaction networks can be used to predict interactions. For example, protein-protein interactions can be predicted by completing circuits that appear to have missing links in clusters within the network structure (

56) or by combining clusters in the network structure with function of nodes in clusters (

57) (

Slides 51 and 52). Not only can networks or portions of networks be predicted, but information about the molecules within the network can be inferred or predicted from various properties of the network and its constituents. For example, protein function can be predicted on the basis of known protein-protein interactions and the assumption that proteins close in network space are likely to share functions (

58) (

Slide 50).

Making predictions using network analysis and using Gene Set Enrichment Analysis (GSEA) (

59) [see also the Teaching Resource by Clark and Ma’ayan on GSEA that is part of this course (

60)] are conceptually related. Gene sets can be converted to networks, and networks can be converted to gene sets. For example, one can create a network by connecting genes if they are targeted by the same microRNAs, share many Gene Ontology terms, or are found in many pathways together. One can also create gene sets from networks. For instance, we have created a gene set of all genes encoding proteins in the human interactome that interact with at least 50 other proteins, which serves as a resource gene set library for the gene set enrichment analysis tool Lists2Networks (

61). We also created enrichment analysis tools for kinase enrichment analysis (KEA) (

62) (

Slide 67) and for chip-X enrichment analysis (ChEA) (

63) (

Slides 68 and 69) by integrating networks of kinase-substrate interactions for KEA or transcription factor-target interactions from chip-seq and chip-chip studies for ChEA. The resulting directed graphs, representing kinase-substrate interactions and transcription factor-target interactions, respectively, were converted to gene set libraries for the purpose of performing gene set enrichment analysis.

From this lecture, students should have an understanding of the algorithms that can be used to seed genes and proteins within biological networks and how network analysis can drive hypothesis generation (

Slide 70).