Techniques for the reconstruction of biological networks, such as genetic, metabolic or signaling networks, are used for getting insight into the inner mechanisms of the cell. They are usually based on perturbation experiments, e.g., gene knockouts or knockdowns, in which one or more network nodes (e.g. genes) are systematically perturbed and the effects on the other nodes are observed. More concretely, in the context of genetic regulatory networks with knockout experiments, the nodes of the networks are genes. Each gene is knocked out at least once and the expressions of the other genes are measured for each knockout experiment. The expression change with regard to the unperturbed wild type defines the influence of the knocked out gene on the other genes. Based on that, connections between the genes can be established. Using the difference in the expression between the perturbed and the wild type, weights and signs can be associated with the connections to quantify the influence and indicate over- and under-expression, respectively.
An important problem in this kind of network reconstruction is that direct connections between genes might be established that are spurious, i.e., do not exist in reality. We illustrate this with the following example. Suppose that the transcription factor A
, produced by gene a
, is needed to activate gene b
. The activation of b
results in the production of transcription factor B
which is encoded by b
. Let B
on its turn activate gene c
, which is manifested by the production of the corresponding protein C
. Also, assume that genes b
cannot be activated by any other gene/transcription factor. Now, if we disable gene a
, for instance by deleting it from the genome, the production of transcription factor A
is prevented and as a result B
will not be produced. Thus, our measurement of the expression of genes a
, and c
will show that a
directly influences both b
. In the reconstructed networks we prefer to directly relate two nodes only if there is a direct
influence between them. So, a direct connection between a
, resulting from the transitive
influence via b
, would be obsolete, and the process of removing such indirect influences is called transitive reduction
There are several ways to remove spurious direct relations depending on the representation of the biological networks. For instance, in [2
] this is done by comparing the measured influence strength between the direct and indirect interactions, i.e., an interaction chains that do not contain the direct interaction. The direct interaction is removed if it is weaker than the last edge of the indirect one. Similarly, in the genetic networks inference tool Aracne
the method of data processing inequality
is used to remove indirect interactions [3
]. This method works for undirected networks. For each triple of genes their pairwise interactions are checked. The one with the lowest score is assumed to be a result of an indirect interaction of the other two. Other approaches to remove obsolete interactions can be found in [4
Using TR for filtering out spurious connections was proposed by Wagner [5
]. The rule is to remove the direct interaction for each gene pair g
, if there is an alternative chain of interactions between g
. Going back to our example above, TR would mean that the undesired edge between a
, caused by the indirect interaction, is removed from the network.
The first algorithms for TR date from the seventies [1
] and there are several other relevant publications on this subject [6
]. In all previous works on TR the networks were represented as directed graphs without weights on the edges. However, in the network inference algorithms interaction strengths between genes usually play an important role. Motivated by this fact, the concept of TR of weighted graphs was introduced [4
], where the weights correspond to interaction strengths. Both these papers take interaction signs (promoting or inhibiting) into account.
] several types of TR are described depending on how individual edge weights are extended to paths to quantify the indirect interactions. One of the variants of TR coincides with the definition that we employ in this paper. However, the interaction strength along paths, which is actually used in [4
], is defined quite differently from ours. Also, they do not use thresholds in their definitions.
With regard to the definition of transitive closure for weighted graphs and the general theoretical background, the closest to our work is [9
]. Their work is also motivated by a biological application, in particular, the analysis of protein-protein interaction networks. The authors define the notion of transitive closure of weighted graphs, but stop short of introducing TR. Originally, we were inspired by their ideas about the transitive measure of interaction along a path in the network.
Unlike the previous work in [4
], we adopt an approach that disregards the interaction signs. A benefit of this decision is that the TR algorithms are of polynomial complexity and amenable to parallelisation. In contrast, the algorithm in [8
] is in the worst case NP-complete, which means that its runtime grows exponentially with the size of the networks. Tests with the Dream
] show that the omission of signs does not incur any degradation in overall performance compared to the signed weighted TR of [8
We present parallel versions of the TR algorithms for both unweighted and weighted directed graphs. These algorithms are developed for general purpose graphics processing units (GPUs). GPUs have been extensively used for various applications, including bioinformatics and systems biology. Since GPUs are standard in modern computers, parallel algorithms become an attractive possibility to speed up computations.
The crucial idea of TR on GPUs is to formulate the algorithm in terms of matrix operations. Since GPUs are very efficient in implementing the latter, this results in a remarkable speed-up so that networks consisting of tens of thousands of nodes can be handled within seconds on a standard desktop computer.
Parallel algorithms for computing transitive closure of a graph, which is closely related to TR, have been developed before e.g., [11
]. However, the only work that we could find that deals explicitly with parallel algorithms for TR is [13
]. Their algorithms are restricted to unweighted graphs and claim efficiency for the special case of sparse graphs. To the best of our knowledge, there does not exist a previous implementation of a TR algorithm on GPUs and thus, our work is novel in that sense, too.
Approach for unweighted graphs
We adopt a graph-theoretic framework to formally represent biological networks. A directed graphG
) is a pair of sets of nodesV
, i.e., an edge eE
is an ordered pair of nodes (v
. Without loss of generality, we identify V
with an arbitrary but fixed order represented by the set of numbers
, where n
is the number of nodes in V
. For example, in genetic networks, we associate the nodes with genes and the edges with interactions between genes. More formally, a gene i
is connected to a gene j
via an edge e
) if and only if i
. The graph can be equivalently represented by an adjacency matrixA
. For standard graphs which do not have weights associated with the edges, the elements Ai,j
of the matrix have value 1, if there is an edge from nodes i
to node j
or 0, otherwise.
from node i
to node j
is a sequence Pij
), where k0
, for 0<l
. Nodes kl
, where 1≤l
−1, are called intermediate
nodes. A cycle
is a path Pij
whose first and last node coincide, i.e., i
. A cycle that consists of one edge is called a self-loop
. A graph is acyclic
if it does not contain any cycles. We denote the set of edges of Pij
). Let Paths
) and Paths
) denote the sets of all paths in the graph G
and all paths from i
Definition 1 (Unweighted Transitive Closure and Reduction)
The transitive closure
of a graph G
is the graph GT
) with (ij
if and only if there exists a path from i
. The transitive reduction of an acyclic graphG
is the unique [1
] smallest graph Gt
), i.e., with the least number of edges, such that (Gt
Intuitively, this means that the transitive closure is preserved by the reduction, i.e., no information about reachability is lost. For an acyclic graph G
it can be shown [1
] that the TR Gt
can be obtained by removing each redundant
from the original graph G
for which there is an indirect path, i.e., not including edge (ij
), between i
. An example of TR for acyclic graphs is given in Figure .
Transitive reduction of an acyclic graph. The edge (a,c) is removed because of the indirect path (a,b,c). Similarly, (a,f) is removed because of the existence of several indirect paths between nodes a and f.
The definition of TR can be extended in a natural way to graphs with cycles. However, the reduced graphs are not unique and in general cannot be generated by deleting edges from the graph (see Figure ). To solve this, Aho, Garey and Ullman [1
] shrink the strongly connected components of the graph to single nodes and apply TR on the resulting acyclic graph. After the reduction, we expand the components as they were in the original graph, as is done in [14
Figure 2 Transitive reduction of a cyclic graph. The graphs in (b) and (c) are both transitive reductions of the graph in (a), since all three graphs have the same transitive closure. One can see that edges that do not exist in the original graph may occur in (more ...)
Extension to weighted graphs
We aim at modelling experiments that use node perturbations, e.g., gene knockouts, for the reconstruction of interactions between nodes. We already saw in the example above that spurious direct interactions are added if there exist an indirect path between two genes. Hence, the outcome of the experiments actually produces a transitive closure of the real (original) network. By applying TR as a kind of inverse operation of the transitive closure, we try to cancel these side effects by removing direct interactions between two nodes, if there is an alternative indirect path between them. Finding the TR amounts to reconstruction of the network as by removing those direct interactions we usually obtain a good approximation of the real network.
However, sometimes both direct and indirect interactions can exist at the same time. Examples of this are the feed-forward loops that occur in the genetic networks of many organisms [15
]. Unfortunately, in such cases, the TR as defined above will still remove the direct interaction. To avoid this anomaly, we use the notion of TR on weighted graphs
where the weights represent interaction uncertainties
. Knowing the interaction uncertainty allows to refine the edge removal criterion: an edge (ij
) is removed from the original graph only if it is less certain than some indirect interaction between i
. In other words, if an edge is at least as certain as all indirect interactions, then it is kept in the graph.
The interaction uncertainties are represented as weights
of the edges. Formally, we associate to a directed graph G
) a function
, which maps each edge to a real number, to obtain a weighted directed graphG
). The adjacency matrixA
becomes a matrix of weights where the special value
)} denotes the absence of an edge between two nodes. There are several plausible ways to choose the weight function w
. In this paper we assume that the weights are p
-values or similar measures, i.e., of the interval [0,1], which are obtained from the post-processing of the perturbation experiments. Thus, the smaller the weight w
, the less uncertain is the interaction between two genes.
Definition 2 ((Minimal) Transitive Interaction Uncertainty)
Transitive interaction uncertainty
along a path is defined as a function
that maps each path to a real number. We apply the principle of the weakest link
and define W
)} for a path P
. Then, for a given edge (i
we define the minimal transitive interaction uncertainty
as the strongest weakest link, i.e., the minimal transitive path uncertainty over all paths between nodes i
Note that if edge (i
, then (i
). This implies directly that h
). Recall that the criterion for preserving an edge in the reduced graph is that its uncertainty is not greater than the minimal uncertainty of the indirect paths, i.e., w
By putting the last two inequalities together, we can refine the edge preservation criterion to w(i,j)=h(i,j) and obtain the following definition:
Definition 3 (Weighted Transitive Reduction)
The transitive reduction of a weighted graph
) is the graph Gt
) with Et
)} and wt
) for all eEt
Informally, edge (i,j) is kept in Et if and only if its weight/uncertainty equals the minimal transitive uncertainty of all paths between i and j. The edge weights remain the same in the reduced graph. It is worth noting that the above definition of TR for weighted graphs supports the presence of cycles. An example of TR for weighted graphs is given in Figure .
Figure 3 Transitive reduction of a weighted graph. Edge (a,f) with weight 0.5 is removed because of the indirect path (a,d,e,f) with a smaller transitive interaction uncertainty of 0.4. In contrast to Figure , the edge (a,c) is not removed as (more ...)
Of course, there are other possible options to define the path weights based on the edge weights. For instance, summing up the edge weights to obtain the weight of the path is in some cases even a more natural choice than the max-min (weakest link) approach that we use. However, in the case when p-values (or similar metrics in the interval [0,1], e.g., correlation) are used, this is not the best option. Summing up the p-values of all edges in the path can produce a result which is greater than 1, i.e., something which is not a probability. Since a trivial path consisting of only one edge is also a path, it is preferable to have a weight for non-trivial paths that is of the same nature as the edge weight.
In general, the refined notion of TR with weights does not entirely resolve the anomaly of removing a direct interaction which exists in the network. One way to further improve the filtering of the edges is to use thresholds. We introduce a lower thresholdtlow determining that any edge e with w(e)≤tlow is unconditionally kept in Et, i.e., regardless of the existence of more certain indirect interactions. In this way we ensure that interactions which are measured with high certainty are not removed from the network. Similarly, we use an upper thresholdtup such that any edge e with w(e)≥tup is unconditionally removed from the network. Hence, very uncertain connections are always removed from the original graph G. This filtering with threshold tupis actually independent of the TR concept and it can be done as a pre- or post-processing step.
This can be shown by reasoning towards a contradiction. Assume that thresholding (TH) and TR are dependent, in other words, that the final result depends on the order in which TH and TR are performed. Say we have a graph G for which this holds. Then, there must be at least one edge e from a node v to a node v′ which is not present in either TH(Gt) or TH(G)t, but it is in the other. Let it not be present in TH(Gt) (the case when it is not in TH(G)tis similar). Then, either it was removed by TR (case 1), or by TH (case 2). Case 1: If e was removed by TR, there must exist a path P in G between v and v′ with W(P)<w(e). The assumption was that e is still present in TH(G)t. Therefore, we must have w(e)<tup. But then, also W(P)<tup, in other words, P must exist in TH(G). The existence of P in TH(G) means that e must be removed from TH(G) when applying TR, leading to a contradiction. Case 2: If e was removed by TH, then w(e)>=tup. But then, it would also be removed when applying TH on G, leading to a contradiction.
Using a threshold tlowsplits the edge weights into two sets. Within the sets the difference between values does not play any role. For instance, assuming tlow = 0.5, the difference between the edge weights 0.8 and 0.7 is the same as between 0.16 and 0.06. An analogous remark holds also for tup.
For many graph problems, the unweighted problem is often a special case of a more general weighted problem. For example, an algorithm to determine shortest paths in a weighted graph can be used to find shortest paths in unweighted graphs by assigning the same positive weight to every edge. For our problem of (weighted) TR, a similar analogy does not hold, i.e., we cannot use the algorithm for weighted cyclic graphs to calculate the TR of an unweighted cyclic graph. This fact results from the different natures of our definitions: For the weighted case, we choose the greatest weight on a path without considering its length. For the unweighted case, however, we will be adding up the edges of paths to obtain their length (cf. next section). The first approach is not affected by cycles – the transitive interaction uncertainty of a path with cycles is always greater or equal to the one for the same path with all cycles removed, thus, cycles are ignored “automatically” when searching a path with the minimal transitive interaction uncertainty. In contrast, for the second approach, cycles must be actively detected to ensure that only paths in which no node occurs more than once are considered (the longest simple path problem is NP-hard). Figure illustrates this with the path (abac
) which has a length greater than one but still there is no indirect alternative path from a
. For a similar reason, our algorithm for cyclic weighted graphs cannot be adapted to cyclic signed weighted graphs. The problems with cycles in signed graphs are discussed in [8
Transitive reduction of cyclic unweighted graphs needs cycle detection.