Traversing a phylogenetic network

In a network, *vertex traversal *refers to the process of visiting each vertex, exactly once, in a systematic way. Such traversals are classified by the order in which the vertices are visited. We need two types of network vertex traversals to describe our algorithms. These are well-known for phylogenetic trees, and we present them here for phylogenetic networks. The algorithms for the traversals given below start from any given vertex *v *in the network. In this paper, we will always perform the traversals from the root vertex of the network.

Pre-order traversal of a phylogenetic network from a vertex v

1. Visit the vertex *v*.

2. Recursively perform pre-order traversal from each child that has not yet been visited.

Post-order traversal of a phylogenetic network from a vertex v

1. Recursively perform post-order traversal from each child that has not yet been visited.

2. Visit the vertex *v*.

Since a phylogenetic network is a DAG, such traversals will visit all the vertices of the network exactly once. (Refer to [

18] for more details on existence on such traversals on DAGs). For the purposes of this paper, we assume that the vertices of a network are uniquely labelled by integers. Note that the leaves are already labelled from the set [

*n*]; and so we use other integers for other vertices. Whenever the child vertices of

*v *are extracted, they are also arranged in increasing order of their integer labelings and the pre- and post-order traversals are performed in this order. This will ensure the following: if vertices

*v *and

*v' *are such that there is no directed path between them, then the vertex

*v *is traversed prior to vertex

*v' *in the pre-order if and only if the vertex

*v *is traversed prior to the vertex

*v' *in the post-order. See the Figure for some examples. With this property, we notice that the pre- and post-order traversals from the root of a phylogenetic network each trace the same spanning tree, which we call here the

*traversal tree*.

Dynamic Programming solution

Dynamic programming is used to provide efficient solutions for finding the exact parsimony score when the network is a phylogenetic tree [

15,

16]. In this section, we show that the same approach can be generalized to phylogenetic networks. Sankoff's algorithm on a tree traverses the vertices of the tree via post-order while computing the minimum costs of each state at each vertex from the leaves to the root, and then chooses the best assignments on each vertex by backtracking from the root to the leaves by traversing the tree vertices via pre-order. Both the phases are presented for networks in Algorithms 1 and 2 respectively. We describe them briefly below. It can be noted that if the network is a tree, then our algorithms match with the pre-order and post-order phases of Sankoff's method for trees.

Given a phylogenetic network

*N*, with leaf vertices labeled [

*n*] and with state assignment function

*λ *over the alphabet Σ, assign to each vertex

*v * *V *a quantity

*S*_{v }(

*i*) for each

*i * Σ. In phylogenetic trees,

*S*_{v }(

*i*) denotes the minimum sum of costs of all the events from the vertex

*v *to all the leaves that are reachable from

*v*, given that

*v *is assigned state

*i *and all the descendant vertices from

*v *are each assigned a state. In networks, there is no simple way to compute such a quantity. Instead, we allow

*S*_{v }(

*i*) to be a lower-bound of the above exact score and it is calculated during the post-order traversal phase.

*Post-order traversal phase: *If *v *is a leaf of *N*, then *S*_{v }(*i*) is assigned 0 if the observed state is state *i*, and infinite otherwise. Now all we need is a recursion relationship to calculate *S*_{v }(*i*) for rest of the vertices. For each child *w *of *v*, we say *w *satisfies the *post-order traversal condition with respect to v*, or simply *traversal condition with respect to v *in view of the observation in the beginning of this section, if the following hold:

(i) The vertex *w *is a reticulate vertex and

(ii) if *v' *is the parent of *w *other than *v*, then the vertex *v *must be traversed prior to *v' *in the post-order traversal of *N*.

We now define recursively for each edge (*v, w*),

For a phylogenetic tree, *s*_{(v, w) }(*i*) always assumes the first of these quantities, and it thus gives the sum of the substitution costs along the edges of the tree that lie below the vertex *v*, provided the vertex *v *is assigned the state *i*. For phylogenetic networks, in order to account for the substitution costs along the edges that lie below a reticulate vertex *w *just a single time when vertex *v *is assigned the state *i*, we let the 'parent' *v *of *w *in the traversal tree account for all the substitution costs along all the edges that lie below *v*. On the other hand, if *v *is not a parent of *w *in the traversal tree, *s*_{(v, w) }(*i*) simply denotes the substitution cost from state *i *at vertex *v *to another state at *w *that is least expensive.

We then define

where the sum runs for all child(ren) vertex(s) *w *of *v*. As mentioned before, in phylogenetic trees, *S*_{v }(*i*) denotes the minimum possible sum of substitution costs along all the edges from the vertex *v *to all the leaves that are reachable from *v*, given that *v *is assigned state *i *and all the vertices reachable from *v *are each assigned a state.

In phylogenetic networks, while calculating *s*_{(v, w) }(*i*) where *w *is a reticulate vertex such that (*v, w*) is not an edge in the traversal tree, there is no prior knowledge of the state that will be later assigned at the reticulate vertex *w*. Thus *s*_{(v, w) }(*i*) can only be a lower bound of the edges of the network that lie below the vertex *v*, if the vertex *v *is assigned the state *i*. The reasoning for this is that *s*_{(v, w) }(*i*) is the substitution cost from state *i *at vertex *v *to another state at *w *that is least expensive, instead of the substitution cost from state *i *at *v *to the state at *w *that will be later assigned. Since the definition of *S*_{v }(*i*) depends on the definition of *s*_{(v, w) }(*i*), and they are defined recursively, we observe the following: *S*_{v }(*i*) is a lower bound on the sum of substitution costs along the edges of the network that are reachable from the vertex *v*, provided that *v *is assigned state *i *and all the descendant tree vertices are assigned a unique state, and the reticulate vertices are assigned two states that are not necessarily the same. The assigned states of the reticulate vertex contributes to a *conflict *if the states are not the same. Let us suppose that state *i *is assigned to the root vertex *r*, and all tree vertices are assigned a unique state, while the reticulate vertices are assigned two states. Then the cost *S*_{r }(*i*) denotes the minimum possible sum of substitution costs along all the edges of a traversal tree with one of states assigned for reticulate vertices, plus the sum of the substitution costs along the remaining reticulate edges with the alternate assignment state at the reticulate vertices. Since we seek an assignment on the vertices of the network with no conflicts in the reticulate vertices, *S*_{r }(*i*) is a lower bound on the cost of such assignment where the root vertex is assigned *i *and all vertices are assigned with a unique assignment.

During this phase, we also store the states

to be able to backtrack the state of *w *that achieves the quantity *s*_{(v, w) }(*i*) during the pre-order phase. See Algorithm 1.

*Pre-order traversal phase: *We first choose the minimum

where

*r *is the root vertex and assign the state that attains the minimum at the root vertex,

*i.e*., let

such that

*S*_{r }(

*i*_{r}) =

*S*. For any other vertex

*w *that is not a reticulate vertex, whose parent

*v *is already assigned with a state

*i*, we assign the state

*t*_{(v, w) }(

*i*). For a reticulate vertex

*w *whose parent vertices are

*v *and

*v'*, let us suppose that

*v *and

*v' *are assigned states

*i *and

*i' *respectively when traversing by the pre-order. The possible states

*j *=

*t*_{(v, w) }(

*i*) and

*j' *=

*t*_{(v', w) }(

*i'*) of

*w *that achieve

*s*_{(v, w) }(

*i*) and

*s*_{(v', w) }(

*i'*) respectively, need not be the same. In other words, it is possible that

*j *≠

*j'*. In this case, we have a

*conflict *on the reticulate vertex

*w*. Thus, the dynamic programming technique fails to give an extension for

*λ *whose parsimony score is

*S*. In this case, we simply choose between

*j *and

*j' *for

*λ*(

*w*) according to which of the vertices among

*v *and

*v' *is traversed first in the pre-order. Thus, if the vertex

*w *satisfies the traversal condition with respect to

*v *we have

.

After completing the pre-order phase, we can get the score corresponding to the extension

by first setting

*S' *=

*S *and updating

*S' *at each reticulate vertex

*w *as follows: The upper bound score

*S' *is updated corresponding to the assignment

*j *at vertex

*w *as

*S' -c*_{i' j' }+

*c*_{i' j}. See Algorithm 2. Figure shows an example of how the algorithm runs on a network. Since

*S*_{r }(

*i*) is a lower bound on the optimum assignment where the root vertex is assigned

*i *and all vertices are assigned with a unique assignment, and since

*S *= min

_{i }*S*_{r }(

*i*), we conclude that

*S *is a lower bound of the optimum we seek to find. See Lemma 1 for a formal proof.

**Lemma 1**. *The quantity S is a lower bound of the optimum parsimony score on the network N*.

*Proof*. By the construction of *S*, we have

where the second summand is for the reticulate vertex

*w *with parents are

*v *and

*v'*, such that

*v *satisfies the traversal condition w.r.t.

*w*. Thus the cost

is the substitution cost from the assigned state

at

*v *to the state

at

*w*. On the other hand, the cost

is the substitution cost from the assigned state

at

*v *to the state

at

*w*. Note that the state

is not necessarily same as the state

, and

*S *is the minimum among all assignments that may result in conflicts at the reticulate vertices.

Suppose

is the optimum parsimony score on

*N *with the function

*μ: V *(

*N*) → {0, 1, ..., |Σ| - 1} as the extension of

*λ *we have

where in the second summand

*w *is a reticulate vertex with parents

*v *and

*v'*. Since

*μ *is a conflict-free assignment that is contained in the set of all assignments among whose costs

*S *is the minimum (compare equation (3) and (4)) we have

. □

Now for the complexity of the algorithm. Suppose the network *N *has *n *leaves and *r *reticulate vertices. Then the number of vertices in *N *is 2(*n *+ *r*) -1. At each vertex *v *and for each state *i*, the quantity *S *can be computed in *O*(*k*^{2}) time, where *k *= |Σ|. The pre-order traversal step involves finding *S *in *O*(*k*) complexity and assigning the best states for each vertex. Also, fixing conflicting reticulate vertex states takes *O*(*r*) time. Thus the complexity of the algorithm (presented here) to find a lower and an upper bound is *O*((*n *+ *r*)*k*^{2}). An alternate upper bound can be obtained in *O*(*nk*^{2}) by simply assigning during the post-order traversal phase, for each reticulate vertex the state that occurs the maximum number of times at the leaves reachable from the respective reticulate vertex; and proceeding via finding *S*_{v}(*i*) for the remaining vertices. The exact optimum can also be obtained by restricting the possible states to a single state for each reticulate vertex, by running the dynamic programming algorithm for each of the *k*^{r }combinations of states for the reticulate vertices, and choosing the minimum among all of them. The time-complexity of this process is *O*(*nk*^{r+2}).

**Algorithm 1 **Post-order traversal phase: Calculate the cost of each state at each vertex

1: Input: Network *N *and the observed states from Σ at the leaves of *N, i.e*., a state assignment function *λ *over the alphabet Σ for *N*.

2: For each leaf *v*, let *S*_{v }(*i*) = 0 if *λ*(*v*) = *i *and ∞ otherwise.

3: Repeat in post-order for each in internal vertex (root, internal tree vertex or reticulate vertex) *v *in *N*: For each state *i*, compute *S*_{v }(*i*) given in (1) and *t*_{(v, w) }(*i*) for each child *w *of *v*, given in (2).

4: Output: {(

*S*_{v}(

*i*), [

*t*_{(v, w)}(

*i*):

*w *is a child of

*v*]):

*v * *V *(

*N*),

*i * Σ}.

Minimizing the number of mutations on a phylogenetic network

The Fitch algorithm [

17] counts the number of changes in a bifurcating phylogenetic tree for any character set, where the states can change from any state to any other state. Thus, the cost matrix is such that its diagonal elements are all zeros and the off-diagonal elements are all ones. In this section, we show how Fitch's algorithm extends to finding upper and lower bounds for the number of evolutionary changes in a given phylogenetic network. First, we show that the Fitch algorithm can be extended to give an upper bound for the optimum parsimony score. As before, the post-order and the pre-order traversal phases are given in Algorithms 3 and 4 below. See Figure for an example run of the algorithm.

**Algorithm 2 **Pre-order traversal phase: Calculate lower and upper bounds of the optimum and the corresponding assignment of the upper bound

1: Input: {(

*S*_{v }(

*i*), [

*t*_{(v, w) }(

*i*):

*w *is a child of

*v*]):

*v * *V *(

*N*),

*i * Σ}.

2: Let

*S *= min

_{i }*S*_{r }(

*i*), where

*r *is the root vertex and let

.

3: Let *S' *= *S*

4: For each vertex

*w *in pre-order whose parent vertex

*v *immediately preceeds

*w *in the pre-order, let

where

.

5: Visit each reticulate vertex

*w *with parents

*v *and

*v' *such that

*w *satisfies the traversal condition with respect to

*v*, with

,

,

and update

*S' *as follows:

6: Output: (Lower bound, Upper bound) = (

*S, S'*); extension corresponding to the upper bound score

.

**Algorithm 3 **Post-order traversal phase: Calculate the optimum

1: Input: Phylogenetic network *N *and a state assignment function *λ *over the alphabet Σ for *N*.

2: For every leaf *v *of *N*, we are given *A*(*v*) = {*λ*(*v*)}, a singleton set containing the observed state at the leaf.

3: Set *UB *= 0.

4: Recurse using post-order: For a vertex *v *of *T *with children *w*_{1 }and *w*_{2}, let

and

If the vertex *v *has a single child *w*, then

and

5: ({

*A*(

*v*):

*v * *V *(

*N*)},

*UB*)

Since the pre-order traversal phase gives a conflict-free assignment on the vertices,

*UB *is an upper bound. This is a special case of the dynamic algorithm presented for general cost-matrix. Suppose we restrict

*N *to be a phylogenetic network with no sister reticulations, then any Fitch solution on any tree

*T *in

forms a lower bound for the optimal score on networks; and adding the cost on edges not in

*T *gives an upper bound for the optimal score. Thus, it is possible to calculate our lower bound for counting the number of character changes only for phylogenetic networks with no sister reticulations, where it is straightforward to find a tree in

.

**Algorithm 4 **Pre-order traversal phase: Assigning the states

1: Input: Phylogenetic tree

*N *and ({

*A*(

*v*):

*v * *V *(

*N*)},

*UB*).

2: For every vertex

*v *in the tree that is not already assigned, the algorithm computes

as follows: For the root

*r*,

, where

*σ *is an arbitrary element of

*A*(

*r*). Assign recursively via pre-order: For a vertex

*v *whose parent

*u *is assigned,

3: Fixing the score: for each reticulate vertex

*v*, if

*u' *is not the parent in pre-order, and if

, but

, then increment

*UB *by 1.

4: Output:

*UB *and extension function

of

*λ*.