We started with sets of HOG pathway sources and targets and an undirected, weighted PPI network for S. cerevisiae
from STRING composed of only physical binding edges (). We oriented the network 
and used the three source-target-based algorithms (Greedy, Betweenness, Direct-ST) and two global algorithms (Jaccard, Short-Path) to predict directed edges in this network using the relevant objective functions (Shortcuts
-X-SS). We evaluated each method with respect to its ability to: 1) reduce the objective function cost; 2) predict edges that lie within the STRING potential edges; and 3) predict edges that lie within the STRING potential edges that also connect known HOG-related nodes. For the Greedy method, we also performed cross-validation experiments.
The Greedy algorithm drastically reduces source-target distances
Our Greedy algorithm achieves the greatest cost reduction compared to the other four methods over all variants of the pathway-aware edge prediction problems (). Moreover, Greedy substantially decreased source-target distances after adding only a few edges. For example, after adding 3 edges, the Shortcuts
cost (measured as the total shortest-path distance amongst
source-target paths) can be reduced to approximately 60% of the original cost. In contrast, it takes 10 edges for Direct-ST to achieve the same ratio. The Betweenness algorithm does monotonically decrease the cost, however, because edges are added based on greater usage (as opposed to greater explicit cost reduction), its reduction is much slower than Greedy overall. The global methods (Jaccard and Short-Path) do not leverage the sources and targets and therefore are unable to reduce source-target distances at all; in general, there are an enormous number of possible edges that play no putative role in the pathway and it is difficult for these methods to disambiguate these edges from HOG-relevant edges. The tremendous cost reduction seen with the Greedy predictions implies that there are a few missing edges in the network whose addition may cover a large bulk of the information flow in the network.
The cost reduction achieved by the five methods for each objective function.
-SS and Shortcuts
-X-SS, both Greedy and Direct-ST perform equally well. This is because there are only 11 paths to optimize over instead of 55 (each target to a single source). Thus, a viable strategy is to find the target
that is furthest away from any source and connect a source directly to it. This can greatly reduce the cost function, even if no other path uses this edge, though this need not be the case in general.
Comparing the prediction accuracy of each method
Next, we judged the quality of the predictions based on how well they overlapped with the STRING potential edges and with HOG-relevant proteins (). In these tests, the accuracy of the method is the percentage of predicted edges, made from amongst all possible non-existent edges, that lied in the relevant set.
The prediction accuracy of the five methods for each objective function.
When only considering support in STRING (), we find that the global methods (Jaccard and Short-Path) significantly outperform the source-target-based methods. In particular, every prediction made by the Jaccard algorithm is correct according to STRING as are over 60% of the Short-Path predictions. This result agrees with previous studies that showed that network distance and shared topology are strong indicators for functional or physical relatedness 
. The probability of predicting a STRING potential edge from amongst all possible edges is only 3.5%, and thus most approaches perform significantly better than baseline.
This test, however, does not tell us whether the predictions bear any relevance to the HOG pathway, which is the primary focus of this study. To better home-in on HOG-relevant predictions, we filtered the STRING potential edges to only include those edges that connected two known HOG-related proteins. shows that the global methods do not make any predictions that relate to the HOG pathway. On the other hand, the Greedy predictions remain at the same level in both tests, which implies that its predictions tend to be highly accurate and
lie amongst HOG-related nodes. The difference is especially pronounced in the hop-restricted cases, where Greedy is more accurate than any other method by roughly 40% (Shortcuts
-X). Two of these edges connect Hog1 to known HOG transcription factors, Msn4 and Cin5 — both previously established interactions in KEGG 
or the literature 
(which are missing from the STRING database and thus do not appear in the original network we used). The probability of predicting a HOG-relevant STRING potential edge from amongst all possible edges is only 0.076%, which is much lower than the accuracy of all three source-target-based algorithms.
Of the top 15 predictions made by Greedy and Betweenness for the Shortcuts-X problem, only one prediction overlaps, and a similar trend holds for the other objectives. This likely stems from the fact that Greedy takes the magnitude of the cost reduction into account, whereas Betweenness only computes the number of shortest paths that use the candidate edge. Because both algorithms perform significantly better than baseline, this implies that they may provide complementary predictions and both may be reasonable depending on the use case.
Interestingly, despite their similar performance in cost reduction for Shortcuts
-SS and Shortcuts
-X-SS (), Greedy makes more accurate predictions than Direct-ST (). This is because there are many cases where a direct source-to-target prediction can be equivalently replaced by a target-target interaction. For example, if
was added in the first step, the predictions
) both equally reduce the cost from a single source (
) to the target
. However, target-target interactions are more likely to exist within the STRING potential edges than direct source-target edges, and indeed Greedy makes several TF-TF predictions (e.g.
), thereby giving it an advantage.
To show that the orientation step is indeed useful in extracting HOG paths given sources and targets, we ran each algorithm on the unoriented
STRING PPI network (Figure S2
). We found that for both hop-restricted objective functions, the Greedy algorithm makes more HOG-relevant predictions when using the oriented network (53% vs. 46% for Shortcuts
-X and 40% vs. 20% for Shortcuts
-X-SS, compared to using the unoriented network). Moreover, the global methods (Short-Path and Jaccard) also benefited significantly from the orientation, which implies that defining network neighbors more precisely can help in identifying putative interactions.
Overall, these results show that the global methods perform well in identifying putative interactions, but that the Greedy algorithm can home-in on more pathway-consistent interactions while drastically reducing source-target distances.
Integrating additional biological features into the framework
While predicting plausible edges from amongst all possible edges serves as a strong validation technique, in practice, we would also like to leverage other data sources (such as expression, sequence, and literature evidence) when making predictions. To naturally integrate these sources into our framework, instead of predicting from amongst all possible edges, we only predict from amongst the set of STRING potential edges (Methods
). Each potential edge is weighted by STRING with a confidence value in
, which we explicitly set to
; in the previous sections,
was given a default weight of 0). By using these data types and weights together, we can pinpoint putative interactions that have evidence from a wide variety of biological sources as well as evidence from the network.
presents the top 10 predictions made by the Greedy algorithm for the Shortcuts
objective function, many of which are known physical interactions missing from STRING. The
predictions have direct evidence of physical interaction according to BiOGRID 
, but were not present in the STRING network. The
predictions lied within the STRING binding edges (and thus represent physical interactions), but were either oriented in the opposite direction or were left out of the oriented network.
was originally oriented
, but the Greedy algorithm suggests that that this edge was either oriented incorrectly or is bidirectional.
was left out of the network because the orientation algorithm did not find any length-bounded paths that included this edge. Although in general biological pathways are short, this prediction exemplifies an exception where considering longer pathways through the edge
improves the source-target connectivity. These correct predictions demonstrate that our approach can correct for limitations of the edge orientation.
Top 10 predictions for Shortcuts using the Greedy algorithm.
For the following three predictions, we verified both the physical interaction between the two nodes and the directionality (which is not possible for edges validated with the undirected STRING or BioGRID databases). The
) involves two general stress TFs that play a substantial role in the HOG pathway 
. Harbison et al. 
showed that indeed Msn4 binds the MSN2
gene in the succinic acid stress condition. This study did not profile Msn4 DNA binding in osmotic stress, but it is plausible that this stress-activated TF could bind MSN2
in other conditions as well. The
) was recently shown by Pokholok et al. 
to occur in osmotic stress. We discuss the
) at length in the next section.
Overall, 7 of the top 10 predictions have support for direct physical binding in the cell. In addition, the
prediction was not directly supported in the literature but warrants further study. Both Reg1 and Msn4 have been shown to physically associate with the 14-3-3 proteins Bmh1 and Bmh2 
but have not yet been shown to directly interact with one another. Proteins with a common physical interaction partner may be more likely to directly interact themselves than proteins with other types of functional connections (e.g. genetic interactions) 
presents the top 10 predictions made by the Greedy algorithm for the Shortcuts
-X objective function, which attempts to model more biological constraints by imposing a hop-restriction on the source-target paths. Remarkably, the top three predictions (
) represent best-case predictions: The two genes/proteins involved are known to physically interact, the directionality is correct, and the interaction is highly relevant to osmotic stress response. In particular,
are core HOG pathway interactions that are well-characterized 
and appear in KEGG 
, but lack evidence for physical binding in STRING. The MAPK Hog1 is central to the HOG response program, and its activation of downstream TFs is a critical component of the response. The other two validated predictions involve HOG pathway members as well. Sho1 is a transmembrane osmosensor, and its branch of activation of Hog1 is known to be mediated by interaction with Cdc42 
interaction is also present as part of the related starvation subpathway of MAPK in KEGG 
. Finally, the
) is between two members of the Sho1 HOG pathway input branch 
. Overall, of the 659,719 STRING potential edges considered, only 0.0011% are in KEGG, and thus the fact that 3 of the top 10 predicted edges lie in KEGG is highly significant (
, Fisher's exact test).
Top 10 predictions for Shortcuts-X using the Greedy algorithm.
Other predictions whose physical interaction could not be validated also involve pairs of HOG pathway members. Some predictions occur between the two independent upstream input branches in the pathway (e.g.
) or between upstream proteins and proteins that are very far downstream (e.g.
). From an algorithmic standpoint, these edges do indeed provide faster diffusion of signal from sources to targets; however, they may not represent direct interactions that occur in the cell. In contrast, the
prediction is a shortcut within the Sho1 input branch, which contains the cascade
. Note that several of these predicted edges have very high weights (e.g.
) from STRING reflecting their strong functional dependencies, which makes them more likely to be selected by our algorithm. However, several predictions were made despite lower evidence (e.g.
), which suggests that their addition strongly aided source-target connectivity. Interestingly, none of the top 10 predictions directly connects a source to a target. This further necessitates an approach like ours versus Direct-ST.
To further validate our ability to extract accurate pathway-relevant predictions from within the potential set, we conducted 5-fold cross-validation experiments by leaving out HOG-relevant edges (see Methods
). The probability that a random prediction would recover a left-out edge from amongst all the potential edges is extremely small (0.033%). Using the Greedy algorithm, we found that 12% (16%) of the top 10 predictions for Shortcuts
-X) recovered a left-out edge. Recovering one correct edge (10%) yields a P-value of
and recovering two correct edges (20%) yields a P-value of
(Fisher's exact test). Both values are significant (our results lie between them) further supporting the ability of our method to make accurate edge predictions.
To explore the sensitivity of our results to the hop-restriction length, we repeated our computational experiments using a hop-restriction length of
. Overall, we found similar qualitative performance for the algorithms when predicting from amongst all possible edges (Figure S3
). However, when predicting from amongst the potential set, we found only a few overlapping predictions with those made when the hop length was 5. Interestingly, these included the well-known HOG interactions
, suggesting that the most confident and likely predictions are not wholly affected by the decreased hop restriction. Of course, some different predictions are also to be expected; for example, using a hop length of 4, the algorithm makes predictions for
. While these predictions make sense algorithmically, they do not make sense biologically because they attempt to shortcut the sources of the pathway directly to a core node (Hog1). This suggests that 4 hops may be too restrictive and may motivate using a hop restriction of 5 in future efforts.
We also found that our approach was able to recover missing interactions when not leveraging the STRING-derived weights (see Text S1
). This implies that our approach is not entirely dependent on the potential edge weights and that our objectives are well-defined.
: A novel prediction
To demonstrate our approach's ability to make novel, biologically meaningful predictions we selected
for experimental validation. This was a top prediction for two objective functions (for Shortcuts
-SS it was the
prediction and for Shortcuts
it was the
uncharacterized prediction; ). As we showed, the addition of a few edges can greatly reduce the objective function cost, and therefore we place more confidence in these top edges.
Verifying a directed protein-protein interaction at the mechanistic level requires extensive experimentation and is beyond the scope of this work. However, genetic experiments such as gene deletions can establish condition-specific causal relationships between proteins in signaling pathways. For instance, loss-of-function mutations and gene over-expression were used to identify and order the genes along the apoptosis pathway in C. elegans
. In our case, if Tpk2 controls the TF Sok2 in osmotic stress, TPK2
deletion should affect Sok2's regulatory activity in this condition. Because many interactions along signaling pathways occur post-translationally, we would not expect the SOK2
gene to be differentially expressed in the
mutant even if Tpk2 does activate or inhibit Sok2 at the protein level. Instead we determine the degree to which the deletion alters Sok2's function as a transcriptional regulator. As predicted, the knockout significantly affected genes bound by Sok2 (
, Fisher's exact test; see Supporting Text S1
for microarray details and Table S1
for lists of affected genes). The knockout alone cannot confirm whether the
interaction is direct or indirect, but clearly establishes that there is a functional connection between these proteins that is active in osmotic stress. Moreover, the orientation of the predicted
edge is correct because if Sok2 were upstream of Tpk2 in the pathway, its bound genes would be unaffected by TPK2
To test the significance of our knockout (KO) with other perturbation experiments, we used the Rosetta compendium 
of 300 KO expression experiments and compared the overlap between differentially expressed (DE) genes in each experiment with the list of Sok2 targets (see Supporting Text S1
). Of 301 experiments, only 31 (10.3%) had a lower P-value than the one obtained from our TPK2
KO. In the other direction, we considered 117 additional TFs for which a high confidence set of targets exists 
. For each, we computed the significance of the intersection between their targets and genes affected by the TPK2
deletion using Fisher's exact test. Similar as the test above, of the 118 tests only 14 (11.9%) had a lower P-value than our predicted Tpk2-Sok2 pair. Combined, our predicted interaction ranked close to the top 10% in these two independent analyses further supporting our prediction.