PMCCPMCCPMCC

Search tips
Search criteria 

Advanced

 
Logo of cmbMary Ann Liebert, Inc.Mary Ann Liebert, Inc.JournalsSearchAlerts
Journal of Computational Biology
 
J Comput Biol. 2011 March; 18(3): 445–457.
PMCID: PMC3123839

Generalized Buneman Pruning for Inferring the Most Parsimonious Multi-State Phylogeny

Abstract

Accurate reconstruction of phylogenies remains a key challenge in evolutionary biology. Most biologically plausible formulations of the problem are formally NP-hard, with no known efficient solution. The standard in practice are fast heuristic methods that are empirically known to work very well in general, but can yield results arbitrarily far from optimal. Practical exact methods, which yield exponential worst-case running times but generally much better times in practice, provide an important alternative. We report progress in this direction by introducing a provably optimal method for the weighted multi-state maximum parsimony phylogeny problem. The method is based on generalizing the notion of the Buneman graph, a construction key to efficient exact methods for binary sequences, so as to apply to sequences with arbitrary finite numbers of states with arbitrary state transition weights. We implement an integer linear programming (ILP) method for the multi-state problem using this generalized Buneman graph and demonstrate that the resulting method is able to solve data sets that are intractable by prior exact methods in run times comparable with popular heuristics. We further show on a collection of less difficult problem instances that the ILP method leads to large reductions in average-case run times relative to leading heuristics on moderately hard problems. Our work provides the first method for provably optimal maximum parsimony phylogeny inference that is practical for multi-state data sets of more than a few characters.

Key words: combinatorial optimization, linear programming, molecular evolution, phylogenetic trees

1. Introduction

One of the fundamental problems in computational biology is that of inferring evolutionary relationships between a set of observed amino acid sequences or taxa. These evolutionary relationships are commonly represented by a tree (phylogeny) describing the descent of all observed taxa from a common ancestor, a reasonable model provided we are working with sequences over small enough regions or distant enough relationships that we can neglect recombination or other sources of reticulation (Posada and Crandall, 2001). Several criteria have been implemented in the literature for inferring phylogenies, of which one of the most popular is maximum parsimony (MP). MP defines the tree(s) with the fewest mutations as the optimum, generally a reasonable assumption for short time-scales or conserved sequences. It is a simple, non-parametric criterion, as opposed to common maximum likelihood models or various popular distance-based methods (Felsenstein, 2004). Nonetheless, MP is known to be NP-hard (Foulds and Graham, 1982), and practical implementations of MP are therefore generally based on heuristics which do not guarantee optimal solutions.

For sequences where each site or character is expressed over a set of discrete states, MP is equivalent to finding a minimum Steiner tree displaying the input taxa. For example, general DNA sequences can be expressed as strings of four nucleotide states and proteins as strings of 20 amino acid states. Recently, Sridhar et al. (2007) used integer linear programming to efficiently find global optima for the special case of sequences with binary characters, which are important when analyzing single nucleotide polymorphism (SNP) data. The solution was made tractable in practice in large part by a pruning scheme proposed by Buneman and extended by others (Buneman, 1971; Barthẽlemy, 1989; Bandelt et al., 1995). The so-called Buneman graph equation M1 for a given set of observed strings is an induced sub-graph of the complete graph equation M2 (whose nodes represent all possible strings of mutations) such that equation M3 still contains all distinct minimum Steiner trees for the observed data. By finding the Buneman graph, one can often greatly restrict the space of possible solutions to the Steiner tree problem. While there have been prior generalizations of the Buneman graph to non-binary characters (Bandelt et al., 1999; Huber and Moulton, 2002), they do not provide any comparable guarantees usable for accelerating Steiner tree inference.

In this article, we provide a new generalization of the definition of Buneman graph for any finite number of states that guarantees the resulting graph will contain all distinct minimum Steiner trees of the multi-state input set. Further, we allow transitions between different states to have independent weights. We then utilize the integer linear programming techniques developed in Sridhar et al. (2007) to find provably optimal solutions to the multi-state MP phylogeny problem. We validate our method on four specific data sets chosen to exhibit different levels of difficulty: a set of nucleotide sequences from Oryza rufipogon (Zhou et al., 2008), a set of human mt-DNA sequences representing prehistoric settlements in Australia (Hudjashov et al., 2007), a set of HIV-1 reverse transcriptase amino acid sequences, and finally, a 500 taxa human mitochondrial DNA data set. We further compare the performance of our method, in terms of both accuracy and efficiency, with leading heuristics, PAUP* (Swofford, 2009) and the pars program of PHYLIP (Felsenstein, 2008), showing our method to yield comparable and often far superior run times on non-trivial data sets.

2. Methods

2.1. Notation and background

Let H be an input matrix that specifies a set of N taxa χ, over a set of m characters equation M4 such that Hij represents the jth character of the ith taxon. The taxa of H represent the terminal nodes of the Steiner tree inference. Further, let nk be the number of admissible states of the kth character ck. The set of all possible states is the space equation M5. We will represent the ith character of any element equation M6, by (b)i The state space equation M7 can be represented as a graph equation M8 with the vertex set equation M9 and edge set equation M10, where δ[a, b] = 0 if a = b and 1 otherwise. Furthermore, let equation M11 be a set of weights, such that αp[i, j] represents an edge length for a transition between states equation M12 for character cp. We will assume that these lengths are positive (states that share zero edge length are indistinguishable), are symmetric in i, j, and satisfy the triangle inequality.

equation M13
(1)

Non-negativity and symmetry are basic properties for any reasonable definition of length. If a particular triplet of states (say i, j, k) does not satisfy the triangle inequality in equation 1, we can set αp[i, k] = αp[i, j] + αp[j, k] and still ensure that the shortest path connecting any set of states remains the same. We can now define a distance dα over equation M14, such that for any two elements equation M15

equation M16
(2)

Given any subgraph K = (VK, EK) of equation M17, we can define the length of K to be the sum of the lengths of all the edges equation M18. The MP phylogeny problem for χ is equivalent to constructing the minimum Steiner tree T* displaying the set of all specified taxa χ, i.e., any tree T*(V *, E*) such that equation M19* and L(T*) is minimum. Note that T* need not be unique.

2.2. Pre-processing

Before we construct the generalized Buneman graph corresponding to an input, we perform a basic pre-processing of the data. The set of taxa in the input H might not all be distinct over the length of sequence represented in H. These correspond to identical rows in H and are eliminated. Similarly, characters that do not mutate for any taxa do not affect the true phylogeny and can be removed. Furthermore, if two characters are expressed identically in χ (modulo a relabeling of the states), we will represent them by a single character with each edge length replaced by the sum of the edge lengths of the individual characters. In case there are n such non-distinct characters, one of them is given edge lengths equal to the sum of the corresponding edges in each of the n characters and the rest are discarded. These basic pre-processing steps are often useful in considerably reducing the size of input.

2.3. Buneman graph

The Buneman graph was introduced as a pruning of the complete graph for the special case of binary valued characters. For this special case, it is useful to introduce the notion of binary splits cp(0)|cp(1) for each character equation M20, which partition the set of taxa χ into two sets cp(0) and cp(1) corresponding to the value expressed by cp. Each of these sets is called a block of cp. Each vertex of the Buneman graph equation M21 can be represented by an m-tuple of blocks equation M22, where ij = 0 or 1, for equation M23. To construct the Buneman graph, a rule is defined for discarding/retaining the subset of vertices contained in each pair of overlapping blocks [cp(ip), cq(iq)] for each pair of characters equation M24. All vertices that satisfy equation M25 for any pair of characters (cp, cq) can be eliminated, while those for which equation M26 for all [cp(ip), cq(iq)] are retained. Buneman previously established for the binary case that the retained vertex set will contain all terminal and Steiner nodes of all distinct minimum length Steiner trees.

We extend this prior result to the weighted multi-state case by presenting an algorithm analogous to the binary case to construct a graph with these properties.

2.4. Algorithm for constructing the generalized Buneman graph

Briefly, the algorithm looks at the input matrix projected onto each distinct pair of characters p, q and constructs a np × nq matrix C(p, q), where the i × jth element C(p, q)ij is 1 only if there is at least one taxon t such that (t)p = i and (t)q = j and zero otherwise. The algorithm then implements a rule for each such pair of characters p, q that allows us to enumerate the possible states of those characters in any optimal Steiner tree. For clarity, we will assume that each state for each character is expressed in at least one input taxon, since states that are not present in any taxa cannot be present in a minimum length tree because of the triangle inequality. The rule is defined by a np × nq matrix R(p, q) determined by the following algorithm:

  • 1. R(p, q)ij  C(p, q)ij for all equation M27 and equation M28.
  • 2. If all non-zero entries in C(p, q) are contained in the set of elements

equation M29

  • for a unique pair equation M30 and equation M31 then R(p, q)xy  1 for all x, y such that either x = i or y = j. (See Figure 1 where this pair of states are denoted ipq and iqp.)
    FIG. 1.
    An example of the generalized Buneman pruning condition. If all taxa in χ are present in the shaded region, vertices in all other blocks can be discarded.
  • 3. If the condition in step 2 is not satisfied, then set R(p, q)ij  1 for all i, j.

This set of rules {R} then defines a subgraph equation M32 for each pair of characters p, q, such that any vertex equation M33 if and only if equation M34. The intersection of these subgraphs equation M35 then gives the generalized Buneman graph for χ given any set of distance metrics equation M36. Note that the Buneman graph of any subset of χ is a subset of equation M37. It is easily verified that, for binary characters, our algorithm yields the standard Buneman graph.

The remainder of this article will make two contributions. First, it will show that the generalized Buneman graph equation M38 defined above contains all minimum Steiner trees for the input taxa χ. This will in turn establish that restricting the search space for minimum Steiner trees to equation M39 will not affect the correctness of the search. The article will then empirically demonstrate the value of these methods to efficiently finding minimum Steiner trees in practice.

Before we prove that all Steiner minimum trees connecting the taxa are displayed in equation M40, we need to introduce the notion of a neighborhood decomposition. Suppose we are given any tree T(V, E) displaying the set of taxa χ. We will contract each degree-two Steiner node (i.e., any node that is not present in χ) and replace its two incident edges by a single weighted edge. Such trees are called X-Trees (Semple and Steel, 2003). Each X-Tree can be uniquely decomposed into its phylogenetic X-Tree components, which are maximal subtrees whose leaves are taxa. Formally, each phylogenetic X-Tree P(ψ) consists of a set of taxa equation M41 and a tree displaying them, such that there is a bijection or labeling η : lP  ψ between elements of ψ and the set of leaves equation M42 (Semple and Steel, 2003) (Fig. 2). All vertices in P(ψ) with degree 3 or higher will be called branch points. From now on, we will assume that, given any input tree, such a decomposition has already been performed (Fig. 2). Two phylogenetic X-Trees P(ψ) and P′(ψ) are considered equivalent if they have identical length and the same tree topology. By identical tree topology, we mean there is a bijection between the edge set of the two trees, such that removing any edge and its image partitions the leaves into identical bi-partitions. We define two trees to be neighborhood distinct if, after neighborhood decomposition, they differ in at least one phylogenetic X-Tree component. We define a labeling of the phylogenetic X-Tree as an injective map equation M43 between the vertices of P(ψ) and those of the graph equation M44 such that Γu represents the character string for the image of vertex u in equation M45. Since leaf labels are fixed to be the character strings representing the corresponding taxa, equation M46 for any leaf equation M47 Identical phylogenetic X-Trees can, however, differ in the labels Γu of internal branch points equation M48.

FIG. 2.
An input tree and its phylogenetic X-Tree components, with taxa labeled by integers.

We will use a generalization of the Fitch-Hartigan algorithm to weighted parsimony proposed by Erdos and Szekely (1994) and Wang et al. (1996). The algorithm uses a similar forward pass/backward pass technique to compute an optimal labeling for any phylogenetic X-Tree T(ψ). Arbitrarily root the tree T(ψ) at some taxon ζ and starting with the leaves compute the minimum length minLb, Tb) of any labeling of the subtree Tb consisting of the vertex b and its descendants, where the root b is labeled Γb as follows:

  • 1. If Γb labels a leaf equation M49, equation M50 and  otherwise.
  • 2. If b has k children equation M51, and Tv is the subtree consisting of equation M52 and its descendants,

equation M53
(3)

  • where the minimum is to be taken over all possible labels Γv for each character and for each child equation M54.

The optimal labeling of T(ψ) is one that minimizes the length at the root: L(T) = minL(ηζ, Tζ). Labels for each descendant are inferred in a backward pass from the root to the leaves and using equation 3. Note that the minimum length of a tree is just the sum of minimum lengths for each character, i.e., equation M55, where equation M56 is the minimum cost of tree Tb rooted at b for character cs.

Briefly, our proof is structured as follows: Given any phylogenetic X-Tree T(ψ) labeling (typically denoted Γ below), we will show that the generalized Buneman pruning algorithm for each pair of characters (cp, cq) defines a subgraph Bpq which contains at least one possible labeling of no higher cost (typically denoted Φ below) for T(ψ). We will then show that the intersection of these subgraphs equation M57 thus contains an optimal labeling for T(ψ).

If the pruning condition in step 2 of the algorithm that defines the Buneman graph is not implemented for the pair of characters (cp, cq), then equation M58 and all labels are necessarily inside Bpq. We prove the following lemma for the case when the pruning condition is satisfied, ie., there exist unique states ipq of cp and iqp of cq, such that each element in the set of leaves equation M59 either has (ηt)p = ipq or (ηt)q = iqp or both. Each time we relabel vertices, we will keep all characters except cp and cq fixed. To economize our notation, we will represent the sum of costs in cp and cq of the tree T labeled by Γ, which has some branch point b as the root, simply by writing L(Γ, T) = L(Γ, T)(p) + L(Γ, T)(q). We use the notation Γx = [(Γx)p, (Γx)q] to represent the label for a vertex x and suppress the state of all other characters.

Lemma 1

Given any phylogenetic X-Tree T(ψ) with equation M60, and a labeling Γ, such that an internal branch point equation M61 is labeled outside Bpq, i.e., equation M62, there exists an alternate labeling Φ of T(ψ) inside Bpq such that

  1. either L(Γ, T)  L(Φ, T) + dαb, Φb], or —
  2. L(Γ, T)  L(Φ, T) for each of the following choices: Φb = [ipq, iqp] or [ipq, (Γb)q] or [(Γb)p, iqp], and Φv = Γv for all v  b. We will call a tree that satisfies this second condition a (cp, cq)-Tree

Proof

We will use induction on the number of internal branch points outside Bpq to prove the claim. Without loss of generality, we can consider all branch points of T(ψ) to be labeled outside Bpq. If some branch points are labeled inside Bpq, then they can be treated as leaves of smaller X-Tree(s) that have all branch points outside Bpq. This is similar to the neighborhood decomposition we performed earlier for those branch points that were present in the set of input taxa. The set of branch points is then the set equation M63.

For the base case, assume all the leaves are joined at a single branch point b to form a star of degree |ψ| (Fig 3a, without the root ζ). We can group the leaves into three sets:

  1. equation M64
  2. equation M65
  3. equation M66
FIG. 3.
(a) The base case of a degree |ψ| star that can be attached to a parent vertex ζ in the Erdos-Szekely algorithm. (b) T(ψ) for the general case (see Lemma 1).

The cost of the tree for cp and cq, with branch point Γb = [x, y], is

equation M67
(4)

The only way for L(Γ, T)(p) + Lb, T)(q) to be minimum with x  i pq and y  iqp, is if III = [empty] and |I| = |II|. For contradiction, suppose |I| + |III| > |II|. We could then define a labeling Φ identical to Γ over all characters, except Φb = [ipq, y], such that dαb, Φb] = αpb, Φb]. We could then reduce the length, since

equation M68
(5)

where the last inequality follows from the triangle inequality. Similarly, if |II| + |III| > |I|, we could define Φb = [x, iqp] and arrive at L(Γ, T)(q)  L(Φ, T)(q) + d αb, Φb].

On the other hand if |I| = |II| and III = [empty] setting Φb = [ipq, y] or Φb = [x, iqp] or Φb = [ipq, iqp] all achieve a length no more than L(Γ, T)(p) + L(Γ, T)(q). Therefore, this is a (cp, cq)-Tree. This proves the base case for our proposition.

We will now assume that the claim is true for all trees with n branch points or less. Suppose we have a labeled tree T(ψ) with n + 1 branch points which are all outside Bpq. Let equation M69 be the children of a branch point b in T(ψ) and equation M70 be the subtrees of each equation M71 and their descendants. Note that some of these descendants may be leaves. Since T(ψ) has at least two branch points, one of its descendants (say v1) must be a branch point (Fig. 3b). Let equation M72 be the subtree consisting of b and all its other descendants. For clarity we will use the notation Γb = [xb, yb] and Γv1 = [x1, y1]. This implies,

equation M73
(6)

There are four possibilities.

  • 1. Both Tb and T1 are (cp, cq)-Trees with n or less branch points - In this case, by induction, both Tb and T1 can be relabeled with Φb and Φv1 of the form [ipq, iqp]. Since the cost in cp and cq of the edge (b, v1) is now zero, we have an optimal labeling of T(ψ) within Bpq and L(Γ, T)  L(Φ, T). Note that each of the choices of the form [ipq, y1] or [x1, ipq] for relabeling of b also satisfy property 2 of the claim. Therefore, this is a (cp, cq)-Tree.
  • 2. Tb is a (cp, cq)-Tree, but T1 is not. Therefore, there is a labeling Φ of T1 with either Φv1 = [ipq, y1] and/or Φv1 = [x1, ipq] such that
equation M74
(7)

Let us assume for concreteness that Φv1 = [ipq, y1]. It will become clear that the argument works for the other possible choices. Since, Tb is a (cp, cq)-Tree, by induction, we can choose a labeling of Tb with Φb = [ipq, yb], such that L(Γ, Tb)  L(Φ, Tb). This gives

equation M75
(8)

Comparing the previous two equations with equation 6, we get,

equation M76
(9)

which satisfies the first possibility of the claim. It should be clear that if Φv1 = [x1, iqp] then the choice Φb = [xb, iqp] would give an identical bound.

  • 3. T1 is a (cp, cq)-Tree, but Tb is not. This case is similar to the previous one. Since Tb has less than n branch points, which are all outside Bpq, and it is not a (cp, cq)-Tree, we have from induction a labeling Φ of Tb with either Φb = [ipq, yb] and/or Φb = [xb, ipq] such that
equation M77
(10)

As before, let us assume Φb = [ipq, yb] for concreteness. Since T1 is a (cp, cq)-Tree, we can choose a labeling with Φv1 = [ipq, y1] such that L(Γ, T1)  L(Φ, T1). This gives,

equation M78
(11)

Comparing the previous two equations with equation 6, we get,

equation M79
(12)

An identical argument carries through if Φb = [xb, iqp].

  • 4. Neither T1 or Tb are (cp, cq)-Trees. It follows from induction that there is a labeling Φ such that L(Γ, Tb)  L(Φ, Tb) + dαb, Φb] and L(Γ, T1)  L(Φ, T1) + dαv1, Φv1]. There are two possibilities in this case.
  • (a) (Φb = [ipq, yb] and Φv1 = [ipq, y1]) or (Φb = [xb, iqp] and Φv1 = [x1, iqp]). As before, we will prove the claim for the former possibility while the later case can be proved by an identical argument.
equation M80
(13)
equation M81
(14)

This also satisfies the claim. The proof for Φb = [xb, iqp] and Φv1 = [x1, iqp] is identical.

  • (b) (Φb = [ipq, yb] and Φv1 = [x1, iqp]) or (Φb = [xb, iqp] and Φv1 = [ipq, y1]). As before, we show the calculation for the former possibility. In this case
equation M82
(15)

Combining this with equation 6 we get,

equation M83
(16)

But if we now relabel b and v1 with equation M84 and equation M85 while equation M86 for all other v, we get equation M87 and equation M88. This immediately gives,

equation M89
(17)

Identical arguments work for the choices equation M90 and equation M91.

This proves that, if either of the two possibilities claimed are always true for an X-Tree with n branch points or less, then they are also true for a tree with n + 1 branch points. The proof for arbitrary n follows from induction.

Corollary 1

Given a minimum length phylogenetic X-Tree T(ψ), there is an optimal labeling for each branch point within equation M92.

Proof

Lemma 1 establishes that for any minimum Steiner tree labeled by Γ and any branch point equation M93 such that equation M94, an alternative optimal labeling Φ exists such that Φb is inside the union of blocks

equation M95

If we root the tree at b, the new optimal labeling for all its descendants is inferred in a backward pass of the Erdos-Szekely algorithm. This ensures that each branch point in a minimum length phylogenetic X-Tree is labeled inside Bpq. Let equation M96, where the intersection is taken over all pair of characters for which the pruning condition is satisfied. It follows from Lemma 1 that Sb also contains an alternate optimal labeling of T(ψ). Note that Sb is a non-empty subset of equation M97. This must be true because given a character pair cp, cq, each union of blocks contains at least one taxon and so the rule matrix R(p, q) that defines the Buneman graph must have ones for each of these blocks. Therefore each element in Sb represents a distinct vertex of the Buneman graph.

As argued before, any minimum Steiner tree can be decomposed uniquely into phylogenetic X-Tree components and the previous corollary ensures that each phylogenetic X-Tree can be labeled optimally inside the generalized Buneman graph. It follows that all distinct minimum Steiner trees are contained inside the generalized Buneman graph.

2.5. Integer linear program (ILP) construction

We briefly summarize the ILP flow construction used to find the optimal phylogeny. We convert the generalized Buneman graph into a directed graph by replacing an edge between vertices u and v with two directed edges (u, v), (v, u) each with weight wuv as determined by the distance metric. Each directed edge has a corresponding binary variable su, v in our ILP. We arbitrarily choose one of the taxa as the root r, which acts as a source for the flow model. The remaining taxa T [equivalent] χ   {r} correspond to sinks. Next, we set up real-valued flow variables equation M98, representing the flow along the edge (u, v) that is intended for terminal t. The root r outputs |T| units of flow, one for each terminal. The Steiner tree is the minimum-cost tree satisfying the flow constraints. This ILP was described in Sridhar et al. (2007), and we refer the reader to that article for further details. The ILP for this construction of the Steiner tree problem is the following:

equation M99
(18)

3. Validation

We implemented our generalized Buneman pruning and the ILP in C++. The Buneman graph was constructed using the method described in Sridhar et al. (2007). The ILP was solved using the Concert callable library of CPLEX 10.0. We compared the performance of our method with popular heuristic methods for maximum parsimony phylogeny inference—pars and protpars, which are part of the freely available PHYLIP package (Felsenstein, 2008), and PAUP* (Swofford, 2009), the leading commercial phylogenetics package. We attempted to use PHYLIP's exact branch-and-bound method DNA penny for nucleotide sequences, but discontinued the tests when it failed to solve any of the data sets in under 24 hours. In each case, pars and PAUP* were run with default parameters.

3.1. Experiments

We first report results from three moderate-sized data sets selected to provide varying degrees of difficulty: a set of 1,043 sites from a set of 41 sequences of O. rufipogon (red rice) (Zhou et al., 2008), 245 positions from a set of 80 human mt-DNA sequences reported by Hudjashov et al. (2007), and 176 positions from 50 HIV-1 reverse transcriptase amino acid sequences. The HIV sequences were retrieved by NCBI BLASTP (Altschul et al., 1997) searching for the top 50 best aligned taxa for the query sequence GI 19571541 and default parameters. We performed three sets of experiments for this protein data set. The first set considered simple unweighted parsimony (HIV-1 RT simple), the second used weighted parsimony with step matrices derived from the genetic code (HIV-1 RT protpars) as used in protpars of the PHYLIP package, and the third used the BLOSUM62 substitution matrix1. (HIV-1 RT BLOSUM) by setting the most negative score to zero and increasing all entries accordingly. Diagonal entries representing the cost of no mutation were set to zero. Since PHYLIP does not allow arbitrarily weighted unordered characters, analysis using BLOSUM62 could only be performed for ILP and PAUP*.

We next added additional tests on larger data sets all derived from human mitochondrial DNA. The mtDNA data was retrieved from NCBI BLASTN, searching for the 500 best aligned taxa for the query sequence GI 61287976 and default parameters. The complete set of 16,546 characters (after removing indels) was then broken into four windows of varying sizes and characteristics: the first 3,000 characters (mt3000), the first 5,000 characters (mt5000a), the next 5,000 characters (mt5000b), and the first 10,000 characters (mt10000).

Finally, we added an additional set of experiments assessing average-case performance over smaller instances, also based on the mtDNA data. For these experiments, we considered only those mutating sites that were phylogenetically informative (mt Windows). Phylogenetically informative sites are those where at least two different nucleotides are present in two or more taxa. Screening out uninformative sites reduced the data set to 415 sites and 414 taxa after preprocessing. Next, we split the remaining sites into overlapping windows of 10 characters each, so that successive windows shared nine characters. We then computed the phylogenies for each window by sliding the starting position across the entire set of phylogenetically informative characters, yielding a total of 406 distinct phylogenetic inferences.

Table 1 summarizes the results of all of the analyses.

Table 1.
Pruning and Run Time Results for the Data Sets Reported

3.2. Results

For the set of 41 sequences of lhs-1 gene from O. rufipogon (red rice) (Zhou et al., 2008), our method pruned the full graph of 218 * 32 nodes (after screening out redundant characters) to 58. Figure 4a shows the resulting phylogeny. Both PAUP* and pars yielded an optimal tree although more slowly than the ILP (2.09 and 2.57 seconds, respectively, as opposed to 0.29 seconds).

FIG. 4.
Most parsimonious phylogenies. (a) lhs-1 gene for O. rufipogon (Zhou et al., 2008). (b) Human mt-DNA (Hudjashov et al., 2007). (c) HIV-1 RT proteins (unweighted) (Altschul et al., 1997). Edges are labeled by their lengths in parentheses, followed by sites ...

For the 245-base human mt-DNA sequences, the generalized Buneman pruning was again highly efficient, reducing the state set from 228 after removing redundant sequences to 64. Figure 4b shows the phylogeny returned. While PAUP* was able to find the optimal phylogeny (although it was again slower at 5.69 versus 0.48 seconds), pars yielded a slightly sub-optimal phylogeny (length 45 instead of 44) in a comparable run time (0.56 seconds).

For HIV-1 sequences, our method pruned the full graph of 216 * 3 * 42 possible nodes to a generalized Buneman graph of 297 nodes, allowing solution of the ILP in about two minutes. Figure 4c shows an optimal phylogeny for the data. PAUP* was again able to find the optimal phylogeny and in this case was faster than the ILP (3.84 as opposed to 127.5 seconds). pars required a shorter run time of 0.30 seconds, but yielded a sub-optimal tree of length 42, as opposed to the true minimum of 40.

For weighted parsimony analysis, we preprocessed the protein data to remove sites containing the ambiguous states X and Z. This reduced the data to 30 taxa and 16 characters, resulting in a generalized Buneman graph of just 88 nodes. For the protpars step matrix, all three methods gave the correct cost of 33. protpars was the fastest at 0.05 seconds followed by ILP (1.31 seconds) and PAUP* (4.30 seconds). For the BLOSUM62 step matrix, PAUP* yielded the correct length of 296 but was slightly slower than ILP (3.12 seconds as compared to 0.49 seconds).

For the four larger mitochondrial data sets, Buneman pruning was again highly effective in reducing graph size relative to the complete graph, although the ILP approach eventually proves impractical when Buneman graph size grows sufficiently large. Two of the data sets yielded Buneman graphs of size below 400, resulting in ILP solutions orders of magnitude faster than the heuristics. mt5000a, however, yielded a Buneman graph of over 1,000 nodes, resulting in an ILP that ran more slowly than the heuristics. mt10000 resulted in a Buneman graph of over 6,000 nodes, leading to an ILP too large to solve. pars was faster than PAUP* in all cases, but PAUP* found optimal solutions for all three instances we can verify while pars found a sub-optimal solution in one instance.

The last set of experiments used a sliding window of 10 phylogenetically informative characters across the entire sequence. Figure 5 shows the variation in tree length across the entire set of 406 windows. Here, ILP was orders of magnitude faster than either heuristic. ILP was able to scan the entire set of windows in 19.09 seconds. In comparison, pars took over 14 hours. pars was able to find the minimal length in all but 5 instances. PAUP* results are not reported, because it proved too slow to conduct the test in a reasonable time scale. When run for 24 hours, it solved only the first window (which took 6 hours), although it did find the optimal answer in this case. We tried to implement PAUP* on several randomly selected windows, but it failed to terminate in under 24 hours on any of them. However, we note that, in each instance, PAUP* was able to find the correct answer in less than a minute.

FIG. 5.
Variation in tree length for sliding windows of 10 phylogenetically informative characters for the mitochondrial DNA data set.

3.3. Remarks

This last set of experiments demonstrates that our method is often much faster than the heuristics on easy to moderately hard data sets. This is an advantage of the ILP approach, since it is quickly able to prove the optimality of the solution, whereas both pars and PAUP* terminate search according to heuristic estimates. Furthermore, we conjecture that the size of the generalized Buneman graph provides a good indication of the true hardness of the problem. In practice, we have found that problem instances where equation M100 is equation M101 are solvable to optimality within a day. For harder instances, memory requirements increase rapidly and an exact solution is not feasible.

We can thus conclude that the generalized Buneman pruning approach developed here is very effective at reducing problem size, but solving provably to optimality does eventually become impractical for large data sets. Heuristic approaches remain a practical necessity for such cases even though they cannot guarantee, and do not always deliver, optimality. Comparison of PAUP* to pars and the ILP suggests that more aggressive sampling over possible solutions by the heuristics can lead to optimality even on very difficult instances but at the cost of generally greatly increased run time on the easy to moderate instances.

4. Conclusion

We have presented a new method for finding provably optimal maximum parsimony phylogenies on multi-state characters with weighted state transitions, using ILP. The method builds on a novel generalization of the Buneman graph for characters with arbitrarily large but finite state sets and for arbitrary weight functions on character transitions. Although the method has an exponential worst-case performance, empirical results show that it is fast in practice. The method provides a feasible alternative for data sets as large as a few hundred taxa and is often far more efficient than standard heuristics on easier problem instances. While there are many efficient heuristics for recontructing MP phylogenies, our results cater to the need for provably exact methods that are fast enough to solve the problem for biologically relevant multi-state data sets. Our work could potentially be extended to include more sophisticated integer programming techniques that have been successful in solving large instances of other hard optimization problems—for instance, the recent solution of the 85,900-city traveling salesman problem pla85900 (Applegate et al., 2009). The theoretical contributions of this article may also prove useful to work on open problems in multi-state MP phylogenetics, to accelerating methods for related objectives, and to sampling among optimal or near-optimal solutions.

Footnotes

1We used sample files available at http://paup.csit.fsu.edu/nfiles.html

Acknowledgments

N.M. would like to thank Ming-Chi Tsai for several useful discussions. N.M., G.B., R.R., and R.S. were supported in this work by the National Science Foundation (grant 0612099). R.S. was additionally supported by the National Institutes of Health (grant 1R01CA140214).

Disclosure Statement

No competing financial interests exist.

References

  • Altschul S.F. Madden T.L. Schaffer A.A., et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402. [PMC free article] [PubMed]
  • Applegate D.L. Bixby R.E. Chvatal V., et al. Certification of an optimal TSP tour through 85,900 cities. Oper. Res. Lett. 2009;37:11–15.
  • Bandelt H.J. Forster P. Sykes B.C., et al. Mitochondrial portraits of human populations using median networks. Genetics. 1995;141:743–753. [PubMed]
  • Bandelt H.J. Forster P. Rohl A. Median-joining networks for inferring intraspecific phylogenies. Mol. Biol. Evol. 1999;16:37–48. [PubMed]
  • Barthẽlemy J. From copair hypergraphs to median graphs with latent vertices. Discrete Math. 1989;76:9–28.
  • Buneman P. Hodson F. Kendall D.G. Tautu P. Mathematics in the Archeological and Historical Sciences. Edinburgh University Press; Edinburgh, UK: 1971. The recovery of trees from measures of dissimilarity, 387–395.
  • Erdos P.L. Szekely L.A. On weighted multiway cuts in trees. Math. Program. 1994;65:93–105.
  • Felsenstein J. Inferring Phylogenies. Sinauer Associates; Sunderland, MA: 2004.
  • Felsenstein J. PHYLIP (phylogeny Inference package), version 3.6. Distributed by author, Department of Genome Sciences. University of Washington; Seattle: 2008.
  • Foulds L.R. Graham R.L. The Steiner problem in phylogeny is NP-complete. Adv. Appl. Math. 1982;3:43–49.
  • Huber K.T. Moulton V. The relation graph. Discrete Math. 2002;244:153–166.
  • Hudjashov G. Kivisild T. Underhill P.A., et al. Revealing the prehistoric settlement of Australia by Y chromosome and mtDNA analysis. Proc. Natl. Acad. Sci. USA. 2007;104:8726–8730. [PubMed]
  • Posada D. Crandall K. Intraspecific gene genealogies: trees grafting into networks. Trends Ecol. Evol. 2001;16:37–45. [PubMed]
  • Semple C. Steel M. Phylogenetics. Oxford University Press; New York: 2003.
  • Sridhar S. Lam F. Blelloch G., et al. Efficiently finding the most parsimonious phylogenetic tree. Lect. Notes Comput. Sci. 2007;4463:37–48.
  • Swofford D. PAUP* 4.0. Sinauer Associates; Sunderland, MA: 2009.
  • Wang L. Jiang T. Lawler L. Approximation algorithms for tree alignment with a given phylogeny. Algorithmica. 1996;16:302–315.
  • Zhou H.F. Zheng X.M. Wei R.X., et al. Contrasting population genetic structure and gene flow between Oryza rufipogon and Oryza nivara. Theor. Appl. Genet. 2008;117:1181–1189. [PubMed]

Articles from Journal of Computational Biology are provided here courtesy of Mary Ann Liebert, Inc.