2.1.

Notation and background
Let
H be an input matrix that specifies a set of
N taxa
χ, over a set of
m characters

such that
Hij represents the
jth character of the
ith taxon. The taxa of
H represent the terminal nodes of the Steiner tree inference. Further, let
nk be the number of admissible states of the
kth character
ck. The set of all possible states is the space

. We will represent the
ith character of any element

, by (
b)
i The state space

can be represented as a graph

with the vertex set

and edge set

, where
δ[
a,
b]

=

0 if
a
=
b and 1 otherwise. Furthermore, let

be a set of weights, such that
αp[
i,
j] represents an edge length for a transition between states

for character
cp. We will assume that these lengths are positive (states that share zero edge length are indistinguishable), are symmetric in
i,
j, and satisfy the triangle inequality.
Non-negativity and symmetry are basic properties for any reasonable definition of length. If a particular triplet of states (say
i,
j,
k) does not satisfy the triangle inequality in
equation 1, we can set
αp[
i,
k]

=
αp[
i,
j]

+
αp[
j,
k] and still ensure that the shortest path connecting any set of states remains the same. We can now define a distance
dα over

, such that for any two elements
Given any subgraph
K
=

(
VK,
EK) of

, we can define the length of
K to be the sum of the lengths of all the edges

. The MP phylogeny problem for
χ is equivalent to constructing the minimum Steiner tree
T* displaying the set of all specified taxa
χ, i.e., any tree
T*(
V *,
E*) such that
* and
L(
T*) is minimum. Note that
T* need not be unique.
2.3.

Buneman graph
The Buneman graph was introduced as a pruning of the complete graph for the special case of binary valued characters. For this special case, it is useful to introduce the notion of binary splits
cp(0)|
cp(1) for each character

, which partition the set of taxa
χ into two sets
cp(0) and
cp(1) corresponding to the value expressed by
cp. Each of these sets is called a block of
cp. Each vertex of the Buneman graph

can be represented by an
m-tuple of blocks

, where
ij
=

0 or 1, for

. To construct the Buneman graph, a rule is defined for discarding/retaining the subset of vertices contained in each pair of overlapping blocks [
cp(
ip),
cq(
iq)] for each pair of characters

. All vertices that satisfy

for any pair of characters (
cp,
cq) can be eliminated, while those for which

for all [
cp(
ip),
cq(
iq)] are retained. Buneman previously established for the binary case that the retained vertex set will contain all terminal and Steiner nodes of all distinct minimum length Steiner trees.
We extend this prior result to the weighted multi-state case by presenting an algorithm analogous to the binary case to construct a graph with these properties.
2.4.

Algorithm for constructing the generalized Buneman graph
Briefly, the algorithm looks at the input matrix projected onto each distinct pair of characters
p,
q and constructs a
np
×
nq matrix
C(
p,
q), where the
i
×
jth element
C(
p,
q)
ij is 1 only if there is at least one taxon
t such that (
t)
p
=
i and (
t)
q
=
j and zero otherwise. The algorithm then implements a rule for each such pair of characters
p,
q that allows us to enumerate the possible states of those characters in any optimal Steiner tree. For clarity, we will assume that each state for each character is expressed in at least one input taxon, since states that are not present in any taxa cannot be present in a minimum length tree because of the triangle inequality. The rule is defined by a
np
×
nq matrix
R(
p,
q) determined by the following algorithm:
- 1.
R(p, q)ij
←
C(p, q)ij for all
and
. - 2.
If all non-zero entries in C(p, q) are contained in the set of elements
- for a unique pair
and
then R(p, q)xy
←
1 for all x, y such that either x
=
i or y
=
j. (See where this pair of states are denoted ipq and iqp.) - 3.
If the condition in step 2 is not satisfied, then set R(p, q)ij
←
1 for all i, j.
This set of rules {
R} then defines a subgraph

for each pair of characters
p,
q, such that any vertex

if and only if

. The intersection of these subgraphs

then gives the generalized Buneman graph for
χ given any set of distance metrics

. Note that the Buneman graph of any subset of
χ is a subset of

. It is easily verified that, for binary characters, our algorithm yields the standard Buneman graph.
The remainder of this article will make two contributions. First, it will show that the generalized Buneman graph

defined above contains all minimum Steiner trees for the input taxa
χ. This will in turn establish that restricting the search space for minimum Steiner trees to

will not affect the correctness of the search. The article will then empirically demonstrate the value of these methods to efficiently finding minimum Steiner trees in practice.
Before we prove that all Steiner minimum trees connecting the taxa are displayed in

, we need to introduce the notion of a
neighborhood decomposition. Suppose we are given any tree
T(
V,
E) displaying the set of taxa
χ. We will contract each degree-two Steiner node (i.e., any node that is not present in
χ) and replace its two incident edges by a single weighted edge. Such trees are called
X-Trees (Semple and Steel,
2003). Each X-Tree can be uniquely decomposed into its
phylogenetic X-Tree components, which are maximal subtrees whose leaves are taxa. Formally, each phylogenetic X-Tree
P(
ψ) consists of a set of taxa

and a tree displaying them, such that there is a bijection or labeling
η :
lP
→
ψ between elements of
ψ and the set of leaves

(Semple and Steel,
2003) (). All vertices in
P(
ψ) with degree 3 or higher will be called
branch points. From now on, we will assume that, given any input tree, such a decomposition has already been performed (). Two phylogenetic X-Trees
P(
ψ) and
P′(
ψ) are considered
equivalent if they have identical length and the same tree topology. By identical tree topology, we mean there is a bijection between the edge set of the two trees, such that removing any edge and its image partitions the leaves into identical bi-partitions. We define two trees to be
neighborhood distinct if, after neighborhood decomposition, they differ in at least one phylogenetic X-Tree component. We define a labeling of the phylogenetic X-Tree as an injective map

between the vertices of
P(
ψ) and those of the graph

such that Γ
u represents the character string for the image of vertex
u in

. Since leaf labels are fixed to be the character strings representing the corresponding taxa,

for any leaf

Identical phylogenetic X-Trees can, however, differ in the labels Γ
u of internal branch points

.
We will use a generalization of the Fitch-Hartigan algorithm to weighted parsimony proposed by Erdos and Szekely (
1994) and Wang et al. (
1996). The algorithm uses a similar forward pass/backward pass technique to compute an optimal labeling for any phylogenetic X-Tree
T(
ψ). Arbitrarily root the tree
T(
ψ) at some taxon
ζ and starting with the leaves compute the minimum length
minL(Γ
b,
Tb) of any labeling of the subtree
Tb consisting of the vertex
b and its descendants, where the root
b is labeled Γ
b as follows:
- 1.
If Γb labels a leaf
,
and
∞
otherwise. - 2.
If b has k children
, and Tv is the subtree consisting of
and its descendants,
- where the minimum is to be taken over all possible labels Γv for each character and for each child
.
The optimal labeling of
T(
ψ) is one that minimizes the length at the root:
L(
T)

=
minL(
ηζ,
Tζ). Labels for each descendant are inferred in a backward pass from the root to the leaves and using
equation 3. Note that the minimum length of a tree is just the sum of minimum lengths for each character, i.e.,

, where

is the minimum cost of tree
Tb rooted at
b for character
cs.
Briefly, our proof is structured as follows: Given any phylogenetic X-Tree
T(
ψ) labeling (typically denoted Γ below), we will show that the generalized Buneman pruning algorithm for each pair of characters (
cp,
cq) defines a subgraph
Bpq which contains at least one possible labeling of no higher cost (typically denoted Φ below) for
T(
ψ). We will then show that the intersection of these subgraphs

thus contains an optimal labeling for
T(
ψ).
If the pruning condition in step 2 of the algorithm that defines the Buneman graph is not implemented for the pair of characters (
cp,
cq), then

and all labels are necessarily inside
Bpq. We prove the following lemma for the case when the pruning condition is satisfied, ie., there exist unique states
ipq of
cp and
iqp of
cq, such that each element in the set of leaves

either has (
ηt)
p
=
ipq or (
ηt)
q
=
iqp or both. Each time we relabel vertices, we will keep all characters except
cp and
cq fixed. To economize our notation, we will represent the sum of costs in
cp and
cq of the tree
T labeled by Γ, which has some branch point
b as the root, simply by writing
L(Γ,
T)

=
L(Γ,
T)
(p)
+
L(Γ,
T)
(q). We use the notation Γ
x
=

[(Γ
x)
p, (Γ
x)
q] to represent the label for a vertex
x and suppress the state of all other characters.
Lemma 1 Given any phylogenetic X-Tree T(
ψ)
with 
,
and a labeling Γ,
such that an internal branch point
is labeled outside Bpq,
i.e.,

,
there exists an alternate labeling Φ
of T(
ψ)
inside Bpq such that
- either L(Γ, T)
≥
L(Φ, T)
+
dα[Γb, Φb], or — - L(Γ, T)
≥
L(Φ, T) for each of the following choices: Φb
=
[ipq, iqp] or [ipq, (Γb)q] or [(Γb)p, iqp], and Φv
=
Γv for all v
≠
b. We will call a tree that satisfies this second condition a (cp, cq)-Tree
Proof We will use induction on the number of internal branch points outside
Bpq to prove the claim. Without loss of generality, we can consider all branch points of
T(
ψ) to be labeled outside
Bpq. If some branch points are labeled inside
Bpq, then they can be treated as leaves of smaller X-Tree(s) that have all branch points outside
Bpq. This is similar to the neighborhood decomposition we performed earlier for those branch points that were present in the set of input taxa. The set of branch points is then the set

.
For the base case, assume all the leaves are joined at a single branch point
b to form a star of degree |
ψ| (, without the root
ζ). We can group the leaves into three sets:
The cost of the tree for
cp and
cq, with branch point Γ
b
=

[
x,
y], is
The only way for
L(Γ,
T)
(p)
+
L(Γ
b,
T)
(q) to be minimum with
x
≠
i pq and
y
≠
iqp, is if
III
=

![[empty]](/corehtml/pmc/pmcents/empty.gif)
and |
I|

=

|
II|. For contradiction, suppose |
I|

+

|
III|

>

|
II|. We could then define a labeling Φ identical to Γ over all characters, except Φ
b
=

[
ipq,
y], such that
dα[Γ
b, Φ
b]

=
αp[Γ
b, Φ
b]. We could then reduce the length, since
where the last inequality follows from the triangle inequality. Similarly, if |
II|

+

|
III|

>

|
I|, we could define Φ
b
=

[
x,
iqp] and arrive at
L(Γ,
T)
(q)
≥
L(Φ,
T)
(q)
+
d α[Γ
b, Φ
b].
On the other hand if |
I|

=

|
II| and
III
=

![[empty]](/corehtml/pmc/pmcents/empty.gif)
setting Φ
b
=

[
ipq,
y] or Φ
b
=

[
x,
iqp] or Φ
b
=

[
ipq,
iqp] all achieve a length no more than
L(Γ,
T)
(p)
+
L(Γ,
T)
(q). Therefore, this is a (
cp,
cq)-Tree. This proves the base case for our proposition.
We will now assume that the claim is true for all trees with
n branch points or less. Suppose we have a labeled tree
T(
ψ) with
n
+

1 branch points which are all outside
Bpq. Let

be the children of a branch point
b in
T(
ψ) and

be the subtrees of each

and their descendants. Note that some of these descendants may be leaves. Since
T(
ψ) has at least two branch points, one of its descendants (say
v1) must be a branch point (). Let

be the subtree consisting of
b and all its other descendants. For clarity we will use the notation Γ
b
=

[
xb,
yb] and Γ
v1
=

[
x1,
y1]. This implies,
There are four possibilities.
- 1.
Both Tb and T1 are (cp, cq)-Trees with n or less branch points - In this case, by induction, both Tb and T1 can be relabeled with Φb and Φv1 of the form [ipq, iqp]. Since the cost in cp and cq of the edge (b, v1) is now zero, we have an optimal labeling of T(ψ) within Bpq and L(Γ, T)
≥
L(Φ, T). Note that each of the choices of the form [ipq, y1] or [x1, ipq] for relabeling of b also satisfy property 2 of the claim. Therefore, this is a (cp, cq)-Tree. - 2.
Tb is a (cp, cq)-Tree, but T1 is not. Therefore, there is a labeling Φ of T1 with either Φv1
=
[ipq, y1] and/or Φv1
=
[x1, ipq] such that
Let us assume for concreteness that Φ
v1
=

[
ipq,
y1]. It will become clear that the argument works for the other possible choices. Since,
Tb is a (
cp,
cq)-Tree, by induction, we can choose a labeling of
Tb with Φ
b
=

[
ipq,
yb], such that
L(Γ,
Tb)

≥
L(Φ,
Tb). This gives
Comparing the previous two equations with
equation 6, we get,
which satisfies the first possibility of the claim. It should be clear that if Φ
v1
=

[
x1,
iqp] then the choice Φ
b
=

[
xb,
iqp] would give an identical bound.
- 3.
T1 is a (cp, cq)-Tree, but Tb is not. This case is similar to the previous one. Since Tb has less than n branch points, which are all outside Bpq, and it is not a (cp, cq)-Tree, we have from induction a labeling Φ of Tb with either Φb
=
[ipq, yb] and/or Φb
=
[xb, ipq] such that
As before, let us assume Φ
b
=

[
ipq,
yb] for concreteness. Since
T1 is a (
cp,
cq)-Tree, we can choose a labeling with Φ
v1
=

[
ipq,
y1] such that
L(Γ,
T1)

≥
L(Φ,
T1). This gives,
Comparing the previous two equations with
equation 6, we get,
An identical argument carries through if Φ
b
=

[
xb,
iqp].
- 4.
Neither T1 or Tb are (cp, cq)-Trees. It follows from induction that there is a labeling Φ such that L(Γ, Tb)
≥
L(Φ, Tb)
+
dα[Γb, Φb] and L(Γ, T1)
≥
L(Φ, T1)
+
dα[Γv1, Φv1]. There are two possibilities in this case.
- (a) (Φb
=
[ipq, yb] and Φv1
=
[ipq, y1]) or (Φb
=
[xb, iqp] and Φv1
=
[x1, iqp]). As before, we will prove the claim for the former possibility while the later case can be proved by an identical argument.
This also satisfies the claim. The proof for Φ
b
=

[
xb,
iqp] and Φ
v1
=

[
x1,
iqp] is identical.
- (b) (Φb
=
[ipq, yb] and Φv1
=
[x1, iqp]) or (Φb
=
[xb, iqp] and Φv1
=
[ipq, y1]). As before, we show the calculation for the former possibility. In this case
But if we now relabel
b and
v1 with

and

while

for all other
v, we get

and

. This immediately gives,
Identical arguments work for the choices

and

.
This proves that, if either of the two possibilities claimed are always true for an X-Tree with
n branch points or less, then they are also true for a tree with
n
+

1 branch points. The proof for arbitrary
n follows from induction.


■
Corollary 1 Given a minimum length phylogenetic X-Tree T(
ψ),
there is an optimal labeling for each branch point within 
.
Proof Lemma 1 establishes that for any minimum Steiner tree labeled by Γ and any branch point

such that

, an alternative optimal labeling Φ exists such that Φ
b is inside the union of blocks
If we root the tree at
b, the new optimal labeling for all its descendants is inferred in a backward pass of the Erdos-Szekely algorithm. This ensures that each branch point in a minimum length phylogenetic X-Tree is labeled inside
Bpq. Let

, where the intersection is taken over all pair of characters for which the pruning condition is satisfied. It follows from Lemma 1 that
Sb also contains an alternate optimal labeling of
T(
ψ). Note that
Sb is a non-empty subset of

. This must be true because given a character pair
cp,
cq, each union of blocks contains at least one taxon and so the rule matrix
R(
p,
q) that defines the Buneman graph must have ones for each of these blocks. Therefore each element in
Sb represents a distinct vertex of the Buneman graph.


■
As argued before, any minimum Steiner tree can be decomposed uniquely into phylogenetic X-Tree components and the previous corollary ensures that each phylogenetic X-Tree can be labeled optimally inside the generalized Buneman graph. It follows that all distinct minimum Steiner trees are contained inside the generalized Buneman graph.
2.5.

Integer linear program (ILP) construction
We briefly summarize the ILP flow construction used to find the optimal phylogeny. We convert the generalized Buneman graph into a directed graph by replacing an edge between vertices
u and
v with two directed edges (
u,
v), (
v,
u) each with weight
wuv as determined by the distance metric. Each directed edge has a corresponding binary variable
su,
v in our ILP. We arbitrarily choose one of the taxa as the root
r, which acts as a source for the flow model. The remaining taxa
T
![[equivalent]](/corehtml/pmc/pmcents/equiv.gif)

χ


−

{
r} correspond to sinks. Next, we set up real-valued flow variables

, representing the flow along the edge (
u,
v) that is intended for terminal
t. The root
r outputs |
T| units of flow, one for each terminal. The Steiner tree is the minimum-cost tree satisfying the flow constraints. This ILP was described in Sridhar et al. (
2007), and we refer the reader to that article for further details. The ILP for this construction of the Steiner tree problem is the following: