Home | About | Journals | Submit | Contact Us | Français |

**|**J Comput Biol**|**PMC3123839

Formats

Article sections

Authors

Related links

Journal of Computational Biology

J Comput Biol. 2011 March; 18(3): 445–457.

PMCID: PMC3123839

Address correspondence to: *Dr. Russell Schwartz, 654B Mellon Institute, Department of Biological Sciences, Carnegie Mellon University, 4400 Fifth Avenue, Pittsburgh, PA 15213. E-mail:*Email: russells/at/andrew.cmu.edu

Copyright 2011, Mary Ann Liebert, Inc.

This article has been cited by other articles in PMC.

Accurate reconstruction of phylogenies remains a key challenge in evolutionary biology. Most biologically plausible formulations of the problem are formally NP-hard, with no known efficient solution. The standard in practice are fast heuristic methods that are empirically known to work very well in general, but can yield results arbitrarily far from optimal. Practical exact methods, which yield exponential worst-case running times but generally much better times in practice, provide an important alternative. We report progress in this direction by introducing a provably optimal method for the weighted multi-state maximum parsimony phylogeny problem. The method is based on generalizing the notion of the Buneman graph, a construction key to efficient exact methods for binary sequences, so as to apply to sequences with arbitrary finite numbers of states with arbitrary state transition weights. We implement an integer linear programming (ILP) method for the multi-state problem using this generalized Buneman graph and demonstrate that the resulting method is able to solve data sets that are intractable by prior exact methods in run times comparable with popular heuristics. We further show on a collection of less difficult problem instances that the ILP method leads to large reductions in average-case run times relative to leading heuristics on moderately hard problems. Our work provides the first method for provably optimal maximum parsimony phylogeny inference that is practical for multi-state data sets of more than a few characters.

One of the fundamental problems in computational biology is that of inferring evolutionary relationships between a set of observed amino acid sequences or taxa. These evolutionary relationships are commonly represented by a tree (phylogeny) describing the descent of all observed taxa from a common ancestor, a reasonable model provided we are working with sequences over small enough regions or distant enough relationships that we can neglect recombination or other sources of reticulation (Posada and Crandall, 2001). Several criteria have been implemented in the literature for inferring phylogenies, of which one of the most popular is maximum parsimony (MP). MP defines the tree(s) with the fewest mutations as the optimum, generally a reasonable assumption for short time-scales or conserved sequences. It is a simple, non-parametric criterion, as opposed to common maximum likelihood models or various popular distance-based methods (Felsenstein, 2004). Nonetheless, MP is known to be NP-hard (Foulds and Graham, 1982), and practical implementations of MP are therefore generally based on heuristics which do not guarantee optimal solutions.

For sequences where each site or character is expressed over a set of discrete states, MP is equivalent to finding a minimum Steiner tree displaying the input taxa. For example, general DNA sequences can be expressed as strings of four nucleotide states and proteins as strings of 20 amino acid states. Recently, Sridhar et al. (2007) used integer linear programming to efficiently find global optima for the special case of sequences with binary characters, which are important when analyzing single nucleotide polymorphism (SNP) data. The solution was made tractable in practice in large part by a pruning scheme proposed by Buneman and extended by others (Buneman, 1971; Barthẽlemy, 1989; Bandelt et al., 1995). The so-called Buneman graph for a given set of observed strings is an induced sub-graph of the complete graph (whose nodes represent all possible strings of mutations) such that still contains all distinct minimum Steiner trees for the observed data. By finding the Buneman graph, one can often greatly restrict the space of possible solutions to the Steiner tree problem. While there have been prior generalizations of the Buneman graph to non-binary characters (Bandelt et al., 1999; Huber and Moulton, 2002), they do not provide any comparable guarantees usable for accelerating Steiner tree inference.

In this article, we provide a new generalization of the definition of Buneman graph for any finite number of states that guarantees the resulting graph will contain all distinct minimum Steiner trees of the multi-state input set. Further, we allow transitions between different states to have independent weights. We then utilize the integer linear programming techniques developed in Sridhar et al. (2007) to find provably optimal solutions to the multi-state MP phylogeny problem. We validate our method on four specific data sets chosen to exhibit different levels of difficulty: a set of nucleotide sequences from *Oryza rufipogon* (Zhou et al., 2008), a set of human mt-DNA sequences representing prehistoric settlements in Australia (Hudjashov et al., 2007), a set of HIV-1 reverse transcriptase amino acid sequences, and finally, a 500 taxa human mitochondrial DNA data set. We further compare the performance of our method, in terms of both accuracy and efficiency, with leading heuristics, PAUP* (Swofford, 2009) and the pars program of PHYLIP (Felsenstein, 2008), showing our method to yield comparable and often far superior run times on non-trivial data sets.

Let *H* be an input matrix that specifies a set of *N* taxa *χ*, over a set of *m* characters such that *H _{ij}* represents the

(1)

Non-negativity and symmetry are basic properties for any reasonable definition of length. If a particular triplet of states (say *i*, *j*, *k*) does not satisfy the triangle inequality in equation 1, we can set *α _{p}*[

(2)

Given any subgraph *K*=(*V _{K}*,

Before we construct the generalized Buneman graph corresponding to an input, we perform a basic pre-processing of the data. The set of taxa in the input *H* might not all be distinct over the length of sequence represented in *H*. These correspond to identical rows in *H* and are eliminated. Similarly, characters that do not mutate for any taxa do not affect the true phylogeny and can be removed. Furthermore, if two characters are expressed identically in *χ* (modulo a relabeling of the states), we will represent them by a single character with each edge length replaced by the sum of the edge lengths of the individual characters. In case there are *n* such non-distinct characters, one of them is given edge lengths equal to the sum of the corresponding edges in each of the *n* characters and the rest are discarded. These basic pre-processing steps are often useful in considerably reducing the size of input.

The Buneman graph was introduced as a pruning of the complete graph for the special case of binary valued characters. For this special case, it is useful to introduce the notion of binary splits *c _{p}*(0)|

We extend this prior result to the weighted multi-state case by presenting an algorithm analogous to the binary case to construct a graph with these properties.

Briefly, the algorithm looks at the input matrix projected onto each distinct pair of characters *p*, *q* and constructs a *n _{p}*×

- 1.
*R*(*p*,*q*)_{ij}←*C*(*p*,*q*)_{ij}for all and . - 2.If all non-zero entries in
*C*(*p*,*q*) are contained in the set of elements

- for a unique pair and then
*R*(*p*,*q*)_{xy}←1 for all*x*,*y*such that either*x=i*or*y=j*. (See Figure 1 where this pair of states are denoted*i*and_{pq}*i*.)_{qp} - 3.If the condition in step 2 is not satisfied, then set
*R*(*p*,*q*)_{ij}←1 for all*i*,*j*.

This set of rules {*R*} then defines a subgraph for each pair of characters *p*, *q*, such that any vertex if and only if . The intersection of these subgraphs then gives the generalized Buneman graph for *χ* given any set of distance metrics . Note that the Buneman graph of any subset of *χ* is a subset of . It is easily verified that, for binary characters, our algorithm yields the standard Buneman graph.

The remainder of this article will make two contributions. First, it will show that the generalized Buneman graph defined above contains all minimum Steiner trees for the input taxa *χ*. This will in turn establish that restricting the search space for minimum Steiner trees to will not affect the correctness of the search. The article will then empirically demonstrate the value of these methods to efficiently finding minimum Steiner trees in practice.

Before we prove that all Steiner minimum trees connecting the taxa are displayed in , we need to introduce the notion of a *neighborhood decomposition*. Suppose we are given any tree *T*(*V*, *E*) displaying the set of taxa *χ*. We will contract each degree-two Steiner node (i.e., any node that is not present in *χ*) and replace its two incident edges by a single weighted edge. Such trees are called *X-Trees* (Semple and Steel, 2003). Each X-Tree can be uniquely decomposed into its *phylogenetic X-Tree* components, which are maximal subtrees whose leaves are taxa. Formally, each phylogenetic X-Tree *P*(*ψ*) consists of a set of taxa and a tree displaying them, such that there is a bijection or labeling *η* : *l _{P}*→

We will use a generalization of the Fitch-Hartigan algorithm to weighted parsimony proposed by Erdos and Szekely (1994) and Wang et al. (1996). The algorithm uses a similar forward pass/backward pass technique to compute an optimal labeling for any phylogenetic X-Tree *T*(*ψ*). Arbitrarily root the tree *T*(*ψ*) at some taxon *ζ* and starting with the leaves compute the minimum length *minL*(Γ_{b}, *T _{b}*) of any labeling of the subtree

- 1.If Γ
_{b}labels a leaf , and∞otherwise. - 2.If
*b*has*k*children , and*T*is the subtree consisting of and its descendants,_{v}

(3)

- where the minimum is to be taken over all possible labels Γ
_{v}for each character and for each child .

The optimal labeling of *T*(*ψ*) is one that minimizes the length at the root: *L*(*T*)=*minL*(*η _{ζ}*,

Briefly, our proof is structured as follows: Given any phylogenetic X-Tree *T*(*ψ*) labeling (typically denoted Γ below), we will show that the generalized Buneman pruning algorithm for each pair of characters (*c _{p}*,

If the pruning condition in step 2 of the algorithm that defines the Buneman graph is not implemented for the pair of characters (*c _{p}*,

*Given any phylogenetic X-Tree T*(*ψ*) *with* , *and a labeling* Γ, *such that an internal branch point* *is labeled outside B _{pq}*,

*either L*(Γ,*T*)≥*L*(Φ,*T*)+*d*[Γ_{α}_{b}, Φ_{b}],*or —**L*(Γ,*T*)≥*L*(Φ,*T*)*for each of the following choices:*Φ_{b}=[*i*,_{pq}*i*]_{qp}*or*[*i*, (Γ_{pq}_{b})_{q}]*or*[(Γ_{b})_{p},*i*],_{qp}*and*Φ_{v}=Γ_{v}*for all v≠b. We will call a tree that satisfies this second condition a*(*c*,_{p}*c*)_{q}*-Tree*

We will use induction on the number of internal branch points outside *B _{pq}* to prove the claim. Without loss of generality, we can consider all branch points of

For the base case, assume all the leaves are joined at a single branch point *b* to form a star of degree |*ψ*| (Fig 3a, without the root *ζ*). We can group the leaves into three sets:

The cost of the tree for *c _{p}* and

(4)

The only way for *L*(Γ, *T*)^{(p)}+*L*(Γ _{b}, *T*)^{(q)} to be minimum with *x≠i* _{pq} and *y≠i _{qp}*, is if

(5)

where the last inequality follows from the triangle inequality. Similarly, if |*II*|+|*III*|>|*I*|, we could define Φ_{b}=[*x*, *i _{qp}*] and arrive at

On the other hand if |*I*|=|*II*| and *III*= setting Φ_{b}=[*i _{pq}*,

We will now assume that the claim is true for all trees with *n* branch points or less. Suppose we have a labeled tree *T*(*ψ*) with *n*+1 branch points which are all outside *B _{pq}*. Let be the children of a branch point

(6)

There are four possibilities.

- 1.Both
*T*and_{b}*T*_{1}are (*c*,_{p}*c*)-Trees with_{q}*n*or less branch points - In this case, by induction, both*T*and_{b}*T*_{1}can be relabeled with Φ_{b}and Φ_{v1}of the form [*i*,_{pq}*i*]. Since the cost in_{qp}*c*and_{p}*c*of the edge (_{q}*b*,*v*_{1}) is now zero, we have an optimal labeling of*T*(*ψ*) within*B*and_{pq}*L*(Γ,*T*)≥*L*(Φ,*T*). Note that each of the choices of the form [*i*,_{pq}*y*_{1}] or [*x*_{1},*i*] for relabeling of_{pq}*b*also satisfy property 2 of the claim. Therefore, this is a (*c*,_{p}*c*)-Tree._{q} - 2.
*T*is a (_{b}*c*,_{p}*c*)-Tree, but_{q}*T*_{1}is not. Therefore, there is a labeling Φ of*T*_{1}with either Φ_{v1}=[*i*,_{pq}*y*_{1}] and/or Φ_{v1}=[*x*_{1},*i*] such that_{pq}

(7)

Let us assume for concreteness that Φ_{v1}=[*i _{pq}*,

(8)

Comparing the previous two equations with equation 6, we get,

(9)

which satisfies the first possibility of the claim. It should be clear that if Φ_{v1}=[*x*_{1}, *i _{qp}*] then the choice Φ

- 3.
*T*_{1}is a (*c*,_{p}*c*)-Tree, but_{q}*T*is not. This case is similar to the previous one. Since_{b}*T*has less than_{b}*n*branch points, which are all outside*B*, and it is not a (_{pq}*c*,_{p}*c*)-Tree, we have from induction a labeling Φ of_{q}*T*with either Φ_{b}_{b}=[*i*,_{pq}*y*] and/or Φ_{b}_{b}=[*x*,_{b}*i*] such that_{pq}

(10)

As before, let us assume Φ_{b}=[*i _{pq}*,

(11)

Comparing the previous two equations with equation 6, we get,

(12)

An identical argument carries through if Φ_{b}=[*x _{b}*,

- 4.Neither
*T*_{1}or*T*are (_{b}*c*,_{p}*c*)-Trees. It follows from induction that there is a labeling Φ such that_{q}*L*(Γ,*T*)≥_{b}*L*(Φ,*T*)+_{b}*d*[Γ_{α}_{b}, Φ_{b}] and*L*(Γ,*T*_{1})≥*L*(Φ,*T*_{1})+*d*[Γ_{α}_{v1}, Φ_{v1}]. There are two possibilities in this case.

- (a) (Φ
_{b}=[*i*,_{pq}*y*] and Φ_{b}_{v1}=[*i*,_{pq}*y*_{1}]) or (Φ_{b}=[*x*,_{b}*i*] and Φ_{qp}_{v1}=[*x*_{1},*i*]). As before, we will prove the claim for the former possibility while the later case can be proved by an identical argument._{qp}

(13)

(14)

This also satisfies the claim. The proof for Φ_{b}=[*x _{b}*,

- (b) (Φ
_{b}=[*i*,_{pq}*y*] and Φ_{b}_{v1}=[*x*_{1},*i*]) or (Φ_{qp}_{b}=[*x*,_{b}*i*] and Φ_{qp}_{v1}=[*i*,_{pq}*y*_{1}]). As before, we show the calculation for the former possibility. In this case

(15)

Combining this with equation 6 we get,

(16)

But if we now relabel *b* and *v*_{1} with and while for all other *v*, we get and . This immediately gives,

(17)

Identical arguments work for the choices and .

This proves that, if either of the two possibilities claimed are always true for an X-Tree with *n* branch points or less, then they are also true for a tree with *n*+1 branch points. The proof for arbitrary *n* follows from induction.■

*Given a minimum length phylogenetic X-Tree T*(*ψ*), *there is an optimal labeling for each branch point within* .

Lemma 1 establishes that for any minimum Steiner tree labeled by Γ and any branch point such that , an alternative optimal labeling Φ exists such that Φ_{b} is inside the union of blocks

If we root the tree at *b*, the new optimal labeling for all its descendants is inferred in a backward pass of the Erdos-Szekely algorithm. This ensures that each branch point in a minimum length phylogenetic X-Tree is labeled inside *B _{pq}*. Let , where the intersection is taken over all pair of characters for which the pruning condition is satisfied. It follows from Lemma 1 that

As argued before, any minimum Steiner tree can be decomposed uniquely into phylogenetic X-Tree components and the previous corollary ensures that each phylogenetic X-Tree can be labeled optimally inside the generalized Buneman graph. It follows that all distinct minimum Steiner trees are contained inside the generalized Buneman graph.

We briefly summarize the ILP flow construction used to find the optimal phylogeny. We convert the generalized Buneman graph into a directed graph by replacing an edge between vertices *u* and *v* with two directed edges (*u*, *v*), (*v*, *u*) each with weight *w _{uv}* as determined by the distance metric. Each directed edge has a corresponding binary variable

(18)

We implemented our generalized Buneman pruning and the ILP in C++. The Buneman graph was constructed using the method described in Sridhar et al. (2007). The ILP was solved using the Concert callable library of CPLEX 10.0. We compared the performance of our method with popular heuristic methods for maximum parsimony phylogeny inference—pars and protpars, which are part of the freely available PHYLIP package (Felsenstein, 2008), and PAUP* (Swofford, 2009), the leading commercial phylogenetics package. We attempted to use PHYLIP's exact branch-and-bound method DNA penny for nucleotide sequences, but discontinued the tests when it failed to solve any of the data sets in under 24 hours. In each case, pars and PAUP* were run with default parameters.

We first report results from three moderate-sized data sets selected to provide varying degrees of difficulty: a set of 1,043 sites from a set of 41 sequences of *O. rufipogon* (red rice) (Zhou et al., 2008), 245 positions from a set of 80 human mt-DNA sequences reported by Hudjashov et al. (2007), and 176 positions from 50 HIV-1 reverse transcriptase amino acid sequences. The HIV sequences were retrieved by NCBI BLASTP (Altschul et al., 1997) searching for the top 50 best aligned taxa for the query sequence GI 19571541 and default parameters. We performed three sets of experiments for this protein data set. The first set considered simple unweighted parsimony (HIV-1 RT simple), the second used weighted parsimony with step matrices derived from the genetic code (HIV-1 RT protpars) as used in protpars of the PHYLIP package, and the third used the BLOSUM62 substitution matrix^{1}. (HIV-1 RT BLOSUM) by setting the most negative score to zero and increasing all entries accordingly. Diagonal entries representing the cost of no mutation were set to zero. Since PHYLIP does not allow arbitrarily weighted unordered characters, analysis using BLOSUM62 could only be performed for ILP and PAUP*.

We next added additional tests on larger data sets all derived from human mitochondrial DNA. The mtDNA data was retrieved from NCBI BLASTN, searching for the 500 best aligned taxa for the query sequence GI 61287976 and default parameters. The complete set of 16,546 characters (after removing indels) was then broken into four windows of varying sizes and characteristics: the first 3,000 characters (mt3000), the first 5,000 characters (mt5000a), the next 5,000 characters (mt5000b), and the first 10,000 characters (mt10000).

Finally, we added an additional set of experiments assessing average-case performance over smaller instances, also based on the mtDNA data. For these experiments, we considered only those mutating sites that were phylogenetically informative (mt Windows). Phylogenetically informative sites are those where at least two different nucleotides are present in two or more taxa. Screening out uninformative sites reduced the data set to 415 sites and 414 taxa after preprocessing. Next, we split the remaining sites into overlapping windows of 10 characters each, so that successive windows shared nine characters. We then computed the phylogenies for each window by sliding the starting position across the entire set of phylogenetically informative characters, yielding a total of 406 distinct phylogenetic inferences.

Table 1 summarizes the results of all of the analyses.

For the set of 41 sequences of lhs-1 gene from *O. rufipogon* (red rice) (Zhou et al., 2008), our method pruned the full graph of 2^{18}*3^{2} nodes (after screening out redundant characters) to 58. Figure 4a shows the resulting phylogeny. Both PAUP* and pars yielded an optimal tree although more slowly than the ILP (2.09 and 2.57 seconds, respectively, as opposed to 0.29 seconds).

For the 245-base human mt-DNA sequences, the generalized Buneman pruning was again highly efficient, reducing the state set from 2^{28} after removing redundant sequences to 64. Figure 4b shows the phylogeny returned. While PAUP* was able to find the optimal phylogeny (although it was again slower at 5.69 versus 0.48 seconds), pars yielded a slightly sub-optimal phylogeny (length 45 instead of 44) in a comparable run time (0.56 seconds).

For HIV-1 sequences, our method pruned the full graph of 2^{16}*3*4^{2} possible nodes to a generalized Buneman graph of 297 nodes, allowing solution of the ILP in about two minutes. Figure 4c shows an optimal phylogeny for the data. PAUP* was again able to find the optimal phylogeny and in this case was faster than the ILP (3.84 as opposed to 127.5 seconds). pars required a shorter run time of 0.30 seconds, but yielded a sub-optimal tree of length 42, as opposed to the true minimum of 40.

For weighted parsimony analysis, we preprocessed the protein data to remove sites containing the ambiguous states X and Z. This reduced the data to 30 taxa and 16 characters, resulting in a generalized Buneman graph of just 88 nodes. For the protpars step matrix, all three methods gave the correct cost of 33. protpars was the fastest at 0.05 seconds followed by ILP (1.31 seconds) and PAUP* (4.30 seconds). For the BLOSUM62 step matrix, PAUP* yielded the correct length of 296 but was slightly slower than ILP (3.12 seconds as compared to 0.49 seconds).

For the four larger mitochondrial data sets, Buneman pruning was again highly effective in reducing graph size relative to the complete graph, although the ILP approach eventually proves impractical when Buneman graph size grows sufficiently large. Two of the data sets yielded Buneman graphs of size below 400, resulting in ILP solutions orders of magnitude faster than the heuristics. mt5000a, however, yielded a Buneman graph of over 1,000 nodes, resulting in an ILP that ran more slowly than the heuristics. mt10000 resulted in a Buneman graph of over 6,000 nodes, leading to an ILP too large to solve. pars was faster than PAUP* in all cases, but PAUP* found optimal solutions for all three instances we can verify while pars found a sub-optimal solution in one instance.

The last set of experiments used a sliding window of 10 phylogenetically informative characters across the entire sequence. Figure 5 shows the variation in tree length across the entire set of 406 windows. Here, ILP was orders of magnitude faster than either heuristic. ILP was able to scan the entire set of windows in 19.09 seconds. In comparison, pars took over 14 hours. pars was able to find the minimal length in all but 5 instances. PAUP* results are not reported, because it proved too slow to conduct the test in a reasonable time scale. When run for 24 hours, it solved only the first window (which took 6 hours), although it did find the optimal answer in this case. We tried to implement PAUP* on several randomly selected windows, but it failed to terminate in under 24 hours on any of them. However, we note that, in each instance, PAUP* was able to find the correct answer in less than a minute.

This last set of experiments demonstrates that our method is often much faster than the heuristics on easy to moderately hard data sets. This is an advantage of the ILP approach, since it is quickly able to prove the optimality of the solution, whereas both pars and PAUP* terminate search according to heuristic estimates. Furthermore, we conjecture that the size of the generalized Buneman graph provides a good indication of the true hardness of the problem. In practice, we have found that problem instances where is are solvable to optimality within a day. For harder instances, memory requirements increase rapidly and an exact solution is not feasible.

We can thus conclude that the generalized Buneman pruning approach developed here is very effective at reducing problem size, but solving provably to optimality does eventually become impractical for large data sets. Heuristic approaches remain a practical necessity for such cases even though they cannot guarantee, and do not always deliver, optimality. Comparison of PAUP* to pars and the ILP suggests that more aggressive sampling over possible solutions by the heuristics can lead to optimality even on very difficult instances but at the cost of generally greatly increased run time on the easy to moderate instances.

We have presented a new method for finding provably optimal maximum parsimony phylogenies on multi-state characters with weighted state transitions, using ILP. The method builds on a novel generalization of the Buneman graph for characters with arbitrarily large but finite state sets and for arbitrary weight functions on character transitions. Although the method has an exponential worst-case performance, empirical results show that it is fast in practice. The method provides a feasible alternative for data sets as large as a few hundred taxa and is often far more efficient than standard heuristics on easier problem instances. While there are many efficient heuristics for recontructing MP phylogenies, our results cater to the need for provably exact methods that are fast enough to solve the problem for biologically relevant multi-state data sets. Our work could potentially be extended to include more sophisticated integer programming techniques that have been successful in solving large instances of other hard optimization problems—for instance, the recent solution of the 85,900-city traveling salesman problem pla85900 (Applegate et al., 2009). The theoretical contributions of this article may also prove useful to work on open problems in multi-state MP phylogenetics, to accelerating methods for related objectives, and to sampling among optimal or near-optimal solutions.

^{1}We used sample files available at http://paup.csit.fsu.edu/nfiles.html

N.M. would like to thank Ming-Chi Tsai for several useful discussions. N.M., G.B., R.R., and R.S. were supported in this work by the National Science Foundation (grant 0612099). R.S. was additionally supported by the National Institutes of Health (grant 1R01CA140214).

No competing financial interests exist.

- Altschul S.F. Madden T.L. Schaffer A.A., et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402. [PMC free article] [PubMed]
- Applegate D.L. Bixby R.E. Chvatal V., et al. Certification of an optimal TSP tour through 85,900 cities. Oper. Res. Lett. 2009;37:11–15.
- Bandelt H.J. Forster P. Sykes B.C., et al. Mitochondrial portraits of human populations using median networks. Genetics. 1995;141:743–753. [PubMed]
- Bandelt H.J. Forster P. Rohl A. Median-joining networks for inferring intraspecific phylogenies. Mol. Biol. Evol. 1999;16:37–48. [PubMed]
- Barthẽlemy J. From copair hypergraphs to median graphs with latent vertices. Discrete Math. 1989;76:9–28.
- Buneman P. Hodson F. Kendall D.G. Tautu P. Mathematics in the Archeological and Historical Sciences. Edinburgh University Press; Edinburgh, UK: 1971. The recovery of trees from measures of dissimilarity, 387–395.
- Erdos P.L. Szekely L.A. On weighted multiway cuts in trees. Math. Program. 1994;65:93–105.
- Felsenstein J. Inferring Phylogenies. Sinauer Associates; Sunderland, MA: 2004.
- Felsenstein J. PHYLIP (phylogeny Inference package), version 3.6. Distributed by author, Department of Genome Sciences. University of Washington; Seattle: 2008.
- Foulds L.R. Graham R.L. The Steiner problem in phylogeny is NP-complete. Adv. Appl. Math. 1982;3:43–49.
- Huber K.T. Moulton V. The relation graph. Discrete Math. 2002;244:153–166.
- Hudjashov G. Kivisild T. Underhill P.A., et al. Revealing the prehistoric settlement of Australia by Y chromosome and mtDNA analysis. Proc. Natl. Acad. Sci. USA. 2007;104:8726–8730. [PubMed]
- Posada D. Crandall K. Intraspecific gene genealogies: trees grafting into networks. Trends Ecol. Evol. 2001;16:37–45. [PubMed]
- Semple C. Steel M. Phylogenetics. Oxford University Press; New York: 2003.
- Sridhar S. Lam F. Blelloch G., et al. Efficiently finding the most parsimonious phylogenetic tree. Lect. Notes Comput. Sci. 2007;4463:37–48.
- Swofford D. PAUP* 4.0. Sinauer Associates; Sunderland, MA: 2009.
- Wang L. Jiang T. Lawler L. Approximation algorithms for tree alignment with a given phylogeny. Algorithmica. 1996;16:302–315.
- Zhou H.F. Zheng X.M. Wei R.X., et al. Contrasting population genetic structure and gene flow between Oryza rufipogon and Oryza nivara. Theor. Appl. Genet. 2008;117:1181–1189. [PubMed]

Articles from Journal of Computational Biology are provided here courtesy of **Mary Ann Liebert, Inc.**

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's Canada Institute for Scientific and Technical Information in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |