In the recent years, development of high-throughput genomic and proteomic techniques resulted in proteome-wide interaction networks (interactomes) in a number of model organisms (Ito et al., 2001
; Uetz et al., 2000
; Giot et al., 2003
; Li et al., 2004
; Stelzl et al., 2005
; Rual et al., 2005
; Ptacek et al., 2005
). Databases such as the BioGRID (Breitkreutz et al., 2008
), IntAct (Kerrien et al., 2007
), DIP (Salwinski et al., 2004
), and MINT (Chatr-Aryamontri et al., 2007
) have been established to collect and curate sets of interactions from different experiments and make them publicly available. Most databases contain physical binding interactions, while the BioGRID additionally includes genetic interactions (such as synthetic lethality) and biochemical interactions, which describe a biochemical effect of one protein upon another.
A protein (or a protein state) is mapped to a node in a cellular protein network. Hence, the solution of a channel mode context (as tensors Φ and
) will highlight an ITM consisting of the proteins most visited by a directed flow from sources to sinks, that is, the proteins lying on the most likely paths connecting sources and sinks. Clearly, biological interpretations of the model results will depend on the nature of interactions ascribed for links within the network graphs: the interpretation for an ITM from a genetic or functional network and that for an ITM from a physical network should be different. Here, we will mainly focus on the physical networks where interactions correspond to binding between two proteins (undirected) or a post-translational modification of one protein by another (directed). Each step of a random walk in such a network is equivalent to a physical event and dissipation naturally corresponds to protein degradation by a protease and negative feedback mechanisms that limit transmission of information. It is thus plausible that the information channels obtained by solving the channel mode with suitable sources and sinks may correspond to (portions of) actual signaling or gene regulation pathways. However, it is important to note that the biological validity of a network path is contingent upon the transitivity of biochemical effect along that path as not all protein-protein interactions have the same downstream effect. Also, even in the best case, the information obtained from a random walk models would be primarily qualitative since cellular processes in general do not correspond to linear models.
The simplest way to use the channel mode is for knowledge retrieval by exploring large networks. In many model organisms, it is possible to construct physical protein interaction networks that integrate proteome-wide data collected from results of multiple experiments from different sources using a variety of techniques. All three modes discussed in this paper, emitting, absorbing and channel, can be used to explore network neighborhoods of proteins of interest and learn more about their function(s). In particular, given two proteins, one set as a source and the other as a sink, one may use the channel mode to extract a sub-network containing only the proteins most relevant to the possible functional relation between them. By using graphical tools to visualize the sub-network and by examining the annotations for the individual proteins within it, one can learn about their role within the cell and hence understand the cellular context of the query proteins.
More complex settings of the channel mode can be used for hypothesis forming and confirmation. For example, using destructive interference between flows from multiple sources may reveal the points of crosstalk between different biological pathways that can be selected for further experimental investigation. Given one or more proteins of interest one can explore the hypothesis about their function by using the property that sinks split the flow. Set these proteins of interest as sources and set several sinks, each associated with an a different biological role. After running a channel mode, the sinks attracting most of the flow would point to the most likely cellular role of the proteins, given all alternatives. Of course, if all alternatives are biologically invalid, no valid functional inference can be made.
Since it is possible to arbitrarily specify sources and sinks and obtain model results that may not correspond to any cellular role, it is desirable to be able to check whether retrieved ITMs can be associated with any existing annotation. A common way to do so is through enrichment analysis (Huang et al., 2009
), which assigns terms from a controlled vocabulary such as Gene Ontology (Ashburner et al., 2000
) or KEGG (Kanehisa et al., 2010
) to a set of genes or proteins with weights. Each term from a controlled vocabulary annotates one or more proteins and enrichment analysis aims to retrieve, by statistical inference, those terms that best describe the set of submitted proteins with weights. While many enrichment tools were developed for analysis of microarrays (Huang et al., 2009
), we found that none of them are entirely suitable for analyzing the results of our model. We have therefore developed a novel tool, called SaddleSum
(Stojmirović and Yu, 2010
), which is based on asymptotic approximation of tail probabilities (Lugannani and Rice, 1980
). For each term, it computes the probability that a score greater than or equal to the sum of weights, for all the proteins associated with that term, can arise by chance. In the context of the channel mode, the quantities that can serve as input to SaddleSum
are source specific content, total content, and destructive interference.
Example: yeast pheromone pathway
As an illustration, we will consider the mating pheromone response pathway in Saccharomyces
, the one of the best understood signalling pathways in eukaryotes (Bardwell, 2005
). The mating signal is transferred from a membrane receptor to a transcription factor in nucleus, leading to transcription of mating genes. We will only provide a very brief overview of the pathway necessary for discussing our examples; more details are available in the review by Bardwell (2005
A mating pheromone binds the transmembrane G-protein coupled pheromone receptors Ste2p/Ste3p. This leads to dissociation of Ste4p and Ste18p, the membrane bound subunits of the G-protein complex, which also contains the subunit Gpa1p. Ste4p then binds to the protein kinase Ste20p, which is recruited to the membrane through Cdc42p, and the scaffold protein Ste5p. On the scaffold, a MAPK (mitogen activated protein kinase) cascade occurs, where each protein kinase in the cascade is activated by being phosphorylated by the previous kinase and in turn activates the next protein. In this case, the cascade goes Ste20p
Fus3p or Kss1p. The final activated kinase Fus3p or Kss1p then migrates to the nucleus where it phosphorylates the proteins Dig1p and Dig2p, the repressors of the Ste12p transcription factor activity. The Ste12p transcription factor can then, in coordination with other transcription factors such as Tec1p, promote the transcription of the mating genes.
As a basis for the underlying network, we used all physical yeast protein-protein interactions from the BioGRID-3.0.65 (Breitkreutz et al., 2008
). To improve the fidelity of the network, we removed every interaction reported by a single publication unless that publication described a low-throughput experiment, which we assumed to be more targeted. We considered an experiment low-throughput if it reported fewer than 300 interactions in total. We also removed all interactions labelled with the “Affinity Capture-RNA” experimental system since they were not protein-to-protein. The physical binding interactions were given a weight 1 in both directions while the interactions labeled as “Biochemical Activity” were interpreted as directional (bait
prey) and given a weight of 2. In cases where both physical and biochemical interactions were reported, only biochemical were considered. Since it is known (Steffen et al., 2002
) that proteins with a large number of non-specific interaction partners might overtake the true signaling proteins in the information flow modeling, we excluded a set of 165 nodes corresponding to cytoskeleton proteins, histones and chaperones. We found that the excluded nodes do not strongly affect the results of the particular examples presented here. For each example, we computed the normalized channel tensor summed over all sinks, that is
in our notation.
focuses solely on the MAPK cascade portion of the pheromone pathway, with Ste20p as a single source and Ste12p as a single sink. Selection of top proteins by participation ratio captures all important participants of the cascade but emphasizes a “shortcut” through Slt2p, which is a MAP kinase involved in a different signalling pathway. Upon examination of the reference (Zarzov et al., 1996
) used by the BioGRID to support the Ste20p
Slt2p link, we discovered that it does not anywhere claim existence of such interaction. Hence, we removed Slt2p from our network for all subsequent queries and reran the query. In addition to the true pathway, the second ITM emphasized a path through Nup53p (a nuclear core protein). We examined the publication (Lusk et al., 2007
) indicated by the BioGRID to support the Ste20p
Nup53p link and found that while it is true that Ste20p phosphorylates Nup53p in vitro
, another kinase was mainly responsible for its phosphorylation in vivo
. We therefore felt justified to exclude Nup53p as well. The final ITM resulting from the same query with Slt2p and Nup53p excluded in addition to the 165 proteins mentioned before is shown in . Enrichment analysis using the GO database (, column D) gives “receptor signaling protein serine/threonine kinase activity” as a top term under “Molecular Function” and “filamentous growth” as a top term under “Biological Process.” Hence, the final ITM agrees well with the canonical understanding of the MAPK cascade.
FIG. 3. ITMs for the MAPK cascade part of the yeast pheromone response obtained by running the normalized channel mode with Ste20p as the source and Ste12p as the sink (μ=0.85). In addition to the “standard” excluded (more ...)
FIG. 5. Gene Ontology term enrichment analysis of examples from and using SaddleSum. The most significant GO terms from the Biological Process and Molecular Function categories are shown on the left (number of annotated proteins is in (more ...)
To obtain an ITM best describing the entire pheromone response pathway, it is necessary to include two sources, the receptor Ste2p and the membrane-bound protein Cdc42p (). Including only Ste2p is not sufficient because of the shortcut through the link Gpa1p
Fus3p, which avoids the MAPK cascade. Furthermore, inclusion of Cdc42p is biologically sensible because Cdc42p activates Ste20p (Bardwell, 2005
) and is hence necessary for the MAPK cascade. Since the information flows from Ste2p and Cdc42p to Ste12p share some but definitely not all paths in common (), interference between them (), rather than total visits, is most appropriate to highlight the most important proteins in the signalling pathway.
illustrates the effect of changing the damping factor μ
. With μ
1 (), the flows from sources visit a much larger portion of the network (the average path length
) than with μ
) or μ
). The lower bound on path length is 3, the shortest distance from both sources to Ste12p. In terms of enrichment analysis with GO (, columns A–C), all three cases pick as significant the terms related to cell growth but with different statistical significance. In addition, both the μ
0.85 and μ
1 cases can be associated with terms related to MAP kinase and signal transduction. Hence, results for large μ
tend to give lower GO term E-values but with lower specificity while results for small μ
give very specific results but with less significant E-values. The μ
-dependence of E-values for any given term is not surprising since different μ
s correspond to different null models. Based on the images in , the enrichment results as well as our experience in other model contexts, the values of μ
around 0.85, corresponding to a random walk visiting about four more nodes than the minimum necessary to reach the sink, appear to give the best results in emphasizing biologically relevant channels.