With advances in high-throughput experimental technologies, we are now witnessing a revolutionary change in our ability to measure and store various forms of interaction data (e.g. protein–protein, protein–DNA/RNA, protein–metabolites and genetic interactions) for different organisms. The central objective in Systems Biology is to fuse and analyze the diverse molecular, cellular, tissue-level and higher level data sources to deduce how sub systems and whole organisms work from these network of interactions.

A node in such a network is usually a gene or its corresponding protein, and an edge in a gene network represents a protein–protein interaction (PPI) or a transcription factor (TF)–DNA binding. A major challenge here is to identify the underlying regulatory pathways, each of which is a chain of interacting genes within a network. A regulatory pathway begins with a causal gene and ends at a target gene, and each gene in a pathway can either activate or deactivate some functions of its neighboring genes. The uncovering of potential pathways across genes helps biologists understand the cellular functions of each gene and protein. As noted earlier, a range of experimental methods such as two-hybrid analysis have been developed to uncover a large amount of interactions between genes and proteins, and gene knock-out experiments and some studies (

Bader *et al.*, 2004;

Mering *et al.*, 2002) have also computed the confidence level associated with an interaction. On the basis of discovery that genes with high centrality in a gene network usually have higher essentiality, lethality and pleiotropy (

Hahn and Kern, 2005;

Jeong *et al.*, 2001), several research works (

Bebek and Yang, 2007;

Froehlich *et al.*, 2007;

Scott *et al.*, 2006;

Suthram *et al.*, 2008;

Tu *et al.*, 2006;

Vaske *et al.*, 2009) have focused on inferring the regulatory pathways on a number of organisms such as fly, yeast and human.

Given a weighted directed gene network and a specific gene, we seek to develop efficient algorithms for the following sub problems: (i) UnknownCausal: if the given gene is a target gene, infer possible causal genes and their pathways; (ii) UnknownTarget: if the given gene is a causal gene, infer possible target genes and their pathways. Since, given a target gene, some biological experiments such as expression quantitative trait loci (eQTL) analysis can locate an approximate location of a causal gene and thus can provide candidate causal genes, another sub problem is (iii) CandidateCausal: given a causal gene, infer the most likely causal gene among the candidates.

Recently, approaches based on the random walk model have been suggested as a means to address this problem. A random walk typically starts from the given gene, walks through several nodes, and terminates according to some pre-defined parameters, such as walk length and edge weights. Generally, the number of visits to a node across multiple walks from a given gene gives a measure of influence or importance, which in turn provides a measure of how likely that node is involved in a pathway containing the given gene.

Tu *et al.* (2006) first proposed this model to address this inference problem with an additional requirement, rooted in domain knowledge, that the random walk could visit any node at most once.

Suthram *et al.* (2008) and

Missiuro *et al.* (2009) proposed the electrical circuit model as an analogy of gene networks, in which each edge is a resistor and its conductance is proportional to the edge's weight. This circuit model can be solved according to Kirchhoff's and Ohm's laws, and the amount of current flowing through each node can be interpreted as the possibility of this node being involved in a pathway. It has been shown that this circuit model can be interpreted as the random walk model (

Doyle and Snell, 1984), but it cannot directly apply on directed edges. The information flow model (

Stojmirović and {Yu}, 2009) used a similar approach to topic sensitive PageRank (

Haveliwala, 2003), whereas the electrical circuit model (

Suthram *et al.*, 2008) and PageRank-based method (

Voevodski *et al.*, 2009) can be seen as the special cases of the information model. As we discuss later, these approaches have some important limitations that we seek to overcome.

Specifically, in a clear departure from past work, here we discuss a novel approach to solving this inference problem by adapting the

*k shortest paths algorithm* to directly point out potential pathways. In this article, we only discuss simple paths, in which a node is not allowed to appear more than one time. Intuitively, this agrees with the prevailing belief among domain scientists that a regulatory pathway is unlikely to repeatedly pass a node. The classic

*k* shortest paths simple paths algorithm is due to (

Yen, 1971). It executes

*O*(

*n*) times Dijkstra algorithm to generate candidate paths for each of the

*k* shortest paths, so its time complexity is

*O*(

*kn*(

*m*+

*n*log

*n*)), where

*n* is the number of nodes and

*m* is the number of edges. Recently,

Hershberger *et al.* (2007) classified the candidate paths into different classes and adopted a replacement-edge strategy for each class, and

Gao *et al.* (2010) used the transformed graph with side cost to improve the efficiency of Yen's algorithm, but both of them have the same worst-case time complexity and space complexity as Yen's algorithm. All the algorithms discussed earlier address the

*single-pair* problem, which is to find the

*k* shortest paths between a pair of nodes in a weighted graph. However, in our inference problem, we need to calculate the

*k* shortest paths from the given node to each other node to approximate the potentiality of each gene being involved in a pathway. For this reason, we introduce a

*single-source*
*k* shortest paths algorithm.

Our algorithm is also based on Dijkstra algorithm, moreover adopting a data structure called {pseudo tree} to reduce the space storing all single-source *k* shortest paths. We show that an exact algorithm to solve this problem has high complexity and therefore we propose a heuristic approximation algorithm that appears to work well in practice—the single-source *k*-shortest paths algorithm. We note that the *k*-shortest paths may often have high overlap (low diversity, since may share a large number of edges) and one of our objectives is to explicitly identify diverse shortest paths. The domain intuition here is that, diverse paths between two genes might help one uncover non-redundant transduction pathways which are of interest to domain scientists. To accommodate this additional requirement, we propose a variant of our basic approach—a single-source *k diverse* shortest paths algorithm.

Following best practices

*in silico* validation procedures outlined by previous research on this problem (

Beyer *et al.*, 2006;

Suthram *et al.*, 2008;

Tu *et al.*, 2006), we demonstrate the efficacy and efficiency of the proposed approaches to inferring regulatory pathways on the Yeast gene network. Specifically, on the

CandidateCausal sub-problem, our shortest paths algorithm achieves significantly higher accuracy than extant state-of-the-art approaches (

Stojmirović and {Yu}, 2009;

Tu *et al.*, 2006), and on the

UnknownCausal and

UnknownTarget sub problems, the importance values assigned by our shortest path variant that explicitly accounts for diversity in paths, not only achieves higher enrichment but also enriches different functions. In terms of scalability, when compared with the exact single-pair

*k*-shortest paths algorithm (

Yen, 1971), as well as some recent variants (

Malviya *et al.*, 2011), on the yeast gene network, our method achieves up to two orders of magnitude speedup. We should also note that while our approach relies on a simple heuristic approximation procedure, we empirically observe no difference in quality when compared with the exact algorithm.