PMCCPMCCPMCC

Search tips
Search criteria 

Advanced

 
Logo of sysbioLink to Publisher's site
 
Syst Biol. 2010 October; 59(5): 584–593.
Published online 2010 September 3. doi:  10.1093/sysbio/syq044
PMCID: PMC2950835

An Intuitive, Informative, and Most Balanced Representation of Phylogenetic Topologies

Abstract

The recent explosion in the availability of genetic sequence data has made large-scale phylogenetic inference routine in many life sciences laboratories. The outcomes of such analyses are, typically, a variety of candidate phylogenetic relationships or tree topologies, even when the power of genome-scale data is exploited. Because much phylogenetic information must be buried in such topology distributions, it is important to reveal that information as effectively as possible; however, existing methods need to adopt complex structures to represent such information. Hence, researchers, in particular those not experts in evolutionary studies, sometimes hesitate to adopt these methods and much phylogenetic information could be overlooked and wasted. In this paper, we propose the centroid wheel tree representation, which is an informative representation of phylogenetic topology distributions, and which can be readily interpreted even by nonexperts. Furthermore, we mathematically prove this to be the most balanced representation of phylogenetic topologies and efficiently solvable in the framework of the traveling salesman problem, for which very sophisticated program packages are available. This theoretically and practically superior representation should aid biologists faced with abundant data. The centroid representation introduced here is fairly general, so it can be applied to other fields that are characterized by high-dimensional solution spaces and large quantities of noisy data. The software is implemented in Java and available via http://cwt.cb.k.u-tokyo.ac.jp/.

Keywords: Centroid wheel tree, centroid representation, phylogenetic tree, probability distribution, traveling salesman problem

In this era of 1000 sequenced genomes, it is routine in most life sciences laboratories to search sequence databases for evolutionarily related sequences and build phylogenetic trees, which often become very large. The most common outcomes of such analyses are a variety of candidate tree topologies, even when the power of genome-scale data is exploited (Ciccarelli et al. 2006). More than one topology is usually produced by a number of methods, for example, the traditional bootstrap analysis (Felsenstein 1985) and the more modern Bayesian Markov chain Monte Carlo method (Huelsenbeck et al. 2001). Biologists today, therefore, tend to be faced with more and more candidate phylogenetic topologies.

Figure 1 schematically illustrates two widely used representations for summarizing phylogenetic topology distributions. Consider a phylogenetic relationship of four clades is investigated with three candidate topologies inferred as shown (Fig. 1a). The most commonly adopted representation is to simply put bootstrap value or posterior probability on each internal branch of the “best tree”, which is usually the tree with the highest likelihood or that created using the original sequence alignment (Fig. 1b). These values are called “supports”, defined as the proportion that the candidate topologies contain the corresponding “splits”, that is, the bipartition systems of all clades. An apparent drawback of this representation is that it cannot convey information on candidate splits that are absent from the best tree. For example, although the second topology in Figure 1a is actually supported almost as strongly as is the first, Figure 1b does not reveal anything about it and focuses only on the first topology. To avoid such potentially misleading bias, researchers sometimes adopt another common representation, the consensus tree (Fig. 1c). This tree consists of splits with support greater than an arbitrary value. However, this representation also fails to reveal valuable information about the distribution. Although neither the first nor the second topology in Figure 1a is supported with enough confidence individually, they are still more strongly supported than that shown on the third, but this information is not reflected at all in Figure 1c.

FIGURE 1.
Commonly used representations of phylogenetic topology distributions. a) Three candidate topologies for the four clades and their example occurrence probabilities. b) The best tree representation. c, d) Two possible consensus tree representations. They ...

To better represent topology distributions, three major approaches have been intensively investigated. The first is to simply provide reliability information on every possible split, for example, in the form of bar graphs, by abandoning use of geometrical structures. One of the most established methods in this category is related to spectral analysis (Hendy and Penny 1993, Charleston 1998). Given sequence or distance data, this method estimates the support for every possible split by correcting for the effects of parallel and multiple substitutions, independently of the choice of a phylogenetic topology. Then, the estimated support values are presented as a bar graph along with a tree that best describes the data. Hence, for splits that do not appear in this tree, the phylogenetic relationships between them need to be inferred by researchers. This becomes particularly painstaking when clade numbers grow because, in this case, the numbers of possible splits grow exponentially and the relationships between them become much more complicated. The second approach presents multiple topologies instead of just one. Some methods try to choose as few topologies as possible to represent the distribution (Wilkinson 1996, Stockham et al. 2002, Bonnard et al. 2006), whereas others regard each topology as a data point and project these points onto some space to visualize the distribution (Hillis et al. 2005, Nye 2008). A similar drawback of these methods is that they ultimately require researchers to investigate the multiple topologies and combine their information, by themselves, to interpret the distribution. The third approach is the application of phylogenetic networks, which try to visualize information within networks instead of trees (Huson and Bryant 2006). This approach comprises an array of sophisticated methods, including so-called reticulate networks, which are particularly effective in illustrating some specific evolutionary events such as hybridizations, horizontal gene transfers, and recombination (Huson and Kloepper 2005). Nonetheless, phylogenetic networks also have several drawbacks. First, from the viewpoint of biology, trees would be still more appropriate than networks for representing some basic aspects of evolution, such as the evolution of species and most eukaryotic genes (Galtier and Daubin 2008). Second, from the viewpoints of mathematics and statistics, it is sometimes difficult to interpret phylogenetic networks (Galtier and Daubin 2008). Complicating the situation is the fact that they are based on evolutionary models that vary considerably between individual methods (Huson and Bryant 2006) and techniques whose statistical properties are not clear (e.g., greedy algorithms; Holland and Moulton 2003, Bryant and Moulton 2004, Huson et al. 2004). Finally, from the viewpoint of intuitiveness, they can become just complex, especially for non-experts of evolutionary studies (see, e.g., Huson et al. 2004). In light of this, researchers have sometimes hesitated to adopt the use of phylogenetic networks.

In this paper, we propose the “centroid wheel tree” (CWT) representation, which best reflects the entire distribution of candidate phylogenetic topologies and which can be interpreted intuitively even by nonexperts. The CWT is based on the sound mathematics of “centroid representation”, which we introduce as a theoretically balanced representation of probability distributions. This is an extension of the “centroid estimation”, by which estimation is carried out with the “centroid” of candidate solutions instead of the best solution (Carvalho and Lawrence 2008). This technique has recently been reported to be effective in bioinformatics when applied to problems characterized by high-dimensional solution space and considerable noise (Hamada et al. 2009, jdmvam09). We show that the problem of finding the CWT can be solved within the framework of the traveling salesman problem (TSP), a rigorously studied branch of optimization problems for which very sophisticated program packages are available (Applegate et al.). The name wheel tree is taken from the English common name of Trochodendron aralioides (an eudicot native to East Asia) for two of its defining characteristics: leaves growing in wheel-like shapes and a mixture of advanced and specialized characters in the eudicot family.

FORMULATION AND PROPERTIES

Key Idea

The key idea behind the CWT is to assign special meaning to circular orderings of branches around multifurcating nodes (i.e., nodes adjacent to ≥4 branches) in consensus trees. These nodes are generated by collapsing weakly supported edges to zero lengths while the consensus trees are being built. Traditionally, in the context of phylogenetic analysis, and in the wider field of graph theory, only connectivity is considered and such circular orderings in layout are usually ignored. For example, the consensus trees in Figure 1c,d have been treated identically. However, humans can interpret the two trees differently. In Figure 1d, clade human can be interpreted as more distant from opossum than from mouse and dog. What is most important is that, in this miniature data set from Figure 1a, human appears next to opossum less frequently and would be represented by Figure 1d more adequately than by Figure 1c. Hereafter, we call such multifurcations that consider circular orderings of the branches “wheel nodes”, and consensus trees that contain wheel nodes “wheel trees.”

Formulation

The concrete procedure for building CWTs is given in Figure 2a. The procedure accepts phylogenetic trees containing the same set of taxonomic units and first builds a consensus tree with a given threshold. It can also accept weight values that reflect observed frequencies or probabilities of the trees. Then, for each multifurcating node (v) on the consensus tree, the best circular ordering of the branches adjacent to v (v-branches) is calculated through the following steps. First, for each branch pair of v-branches, sum the distances (the minimum number of edges) between the corresponding branch pairs over all input trees. See the next paragraph for trees that do not contain all splits of v-branches. Second, let these values be the “traveling cost” between each branch pair. Third, solve the TSP tour that minimizes the total traveling cost to obtain the best circular ordering of v-branches. Intuitively, this traveling cost was designed to become large if distant splits in the input trees are successive around the wheel nodes. In other words, the TSP solution minimizing the total cost makes close splits in the input trees successive as much as possible, embodying the key idea described in the previous paragraph. After repeating the above for all multifurcating nodes, finally visualize the consensus tree by using the TSP circular orderings for branches around each multifurcation (wheel node).

FIGURE 2.
Procedure for building CWTs. a) The procedure first builds a consensus tree with a given threshold. Then, all branches adjacent to its multifurcating nodes are circularly ordered to best represent the topology distributions, as close branches in the input ...

It is possible that the input trees contain a tree T that does not contain all splits of v-branches. In this case, the distances on T are defined by converting T into a set of trees containing those splits (Fig. 2b). The basic idea is that, for example, in Figure 2b, the branch leading to (e,f) would be observed at the positions of taxa e and f with the same probabilities. More precisely, for each branch b of v-branches, collect all its descendant taxa on the consensus tree by regarding v as the ancestor and choose one of them randomly. Then, mark the external branch adjacent to this taxon on the tree T as corresponding to b. After repeating these steps for all branches of v-branches, remove all unnecessary parts from T (external branches that are not marked and internal nodes that are attached to only two branches). The distances between each pair of v-branches are then calculated by using the processed tree. Finally, by repeating the above steps for all possible combinations of the taxa choice and averaging the distances, the expected distance between each split pair is obtained. Note that, if T contains all splits of v-branches, the procedure above always results in the same tree, in which topological relationships among the branches corresponding to v-branches are the same as in the original T. Therefore, this is a natural extension of the original definition of distance, letting the circular orderings reflect the information in the whole topology distribution as much as possible.

It should also be noted that although trees are treated as unrooted throughout this paper, it is not difficult to root a CWT because it retains its tree shape and our current implementation can actually accept rooted trees (see Software Availability). The more precise pseudocode and formulation are given in the Appendix.

CWT is the Most Balanced Representation

In this section, we show that the CWT produced by the procedure outlined above not only embodies the key idea we sketched out but, mathematically, is the centroid representation, which is the most balanced representation of the topology distributions regarding the loss functions we define below. We also provide an intuitive explanation in the last paragraph.

Let us first introduce the concept of the centroid representation. Let θ, D, and P(θ||D) be unobserved data, observed data, and the posterior probability of θ given D, respectively. Traditionally, a maximum a posteriori estimator An external file that holds a picture, illustration, etc.
Object name is sysbiosyq044fx1_ht.jpg = argmaxθP(θ||D) is often adopted to represent the whole distribution of candidate θ. However, An external file that holds a picture, illustration, etc.
Object name is sysbiosyq044fx1_ht.jpg may be an ineffective representation of collections of small-probability solutions in high-dimensional spaces with considerable noise (Carvalho and Lawrence 2008), features that are shared by phylogenetic inference. Instead, the centroid representation tries to capture the distribution of the posterior probability mass in the ensemble. Given a “loss function” L([var phi],θ) that quantifies how each θ deviates from the representation [var phi], the centroid representation is defined by [var phi]Centroid = argmin[var phi]θL([var phi],θ)P(θ||D), which is theoretically the best for minimizing the expected value of the loss function. If [var phi]Centroid itself is a candidate solution, it is called the “centroid estimator,” which is introduced in detail in (Carvalho and Lawrence 2008), along with important extensions to the Hamming loss function.

A CWT is the centroid representation regarding the two loss functions below. Assume that a wheel tree (Φ) and one of the input trees (Ti[set membership]{T1,…,TN}) are given, and let lLayout(Φ,v,Ti) be the sum of the expected distances between any Ti branch pair whose corresponding branch pair is successive around a wheel node v in Φ. Then the “layout incongruity loss function” LLayout(Φ,Ti) is defined as the total sum of lLayout(Φ,v,Ti) over every wheel node, to quantify how all circular orderings in Φ agree with Ti. Among all possible Φ, a CWT minimizes the expected loss function against all input trees T1,…,TN. In addition, the traditional consensus tree that CWT relies on is also a centroid representation regarding the split incongruity loss function, which quantifies how the split sets of two trees disagree with each other. See the Appendix for details and proofs.

It is useful to provide some intuitive explanation of the fact that a CWT is the centroid representation here. Imagine a heap of stones, each of which has a drawing of a phylogenetic topology and whose mass is proportional to the probability of the tree. If we place the stones so that similar topologies are close to each other, a map of a topology distribution is created, as in Hillis et al. (2005) and Nye (2008). If this distribution is unimodal and unbiased, the heaviest stone or the best tree is anticipated to be at the “center” of the distribution and represent it well, although sometimes this assumption does not hold true, as discussed in the references above. Instead, the CWT representation tries to find the centroid (or the center of gravity) of the topologies, by placing them on a light plate on which each position corresponds to a wheel tree. In classical mechanics, the centroid is the point that balances the moment of gravity and minimizes the sum of the products of mass and squared distances. This clearly resembles the definition of a CWT, as we use the cost functions defined above instead of the squared distances, to quantify how close a wheel tree and a phylogenetic topology are. Therefore, the CWT representation can be regarded as the centroid that balances the background topology distributions best.

APPLICATIONS

Visualization of CWTs

In addition to making the circular orderings around multifurcating wheel nodes best represent the topology distributions, it is useful if some statistical information on them is available. Our current implementation of the CWT program provides visualization as in Figure 3.

FIGURE 3.
Information that the CWT representation conveys. Branches around multifurcating nodes (wheel nodes) are circularly ordered by the procedure in Figure 2. As in the ordinary phylogenetic representations, numbers at internal branches (X) are the support ...

First, as in ordinary phylogenetic representations, numbers at internal branches are the proportions of the input trees containing the corresponding splits. Second, numbers around the wheel nodes indicate the proportions that the flanking splits constitute in a monophyletic group. For example, in Figure 3, Z1% of the trees have the 3-furcation {a || d || b,c,e,f} (i.e., the three splits {a || b,c,d,e,f}, {d || a,b,c,e,f}, and {a,d || b,c,e,f}). Third, numbers within wheel nodes indicate the extent to which each circular ordering of the branches naturally represents the candidate topologies. In other words, the numbers indicate the proportion of the input topologies that can be restored by just “pulling out” branches without changing the orderings. (See the right-hand box in Figure 3 for an example. As another example, the value for the wheel node in Figure 1d is 85%, which is the total support of the first and second topologies in Figure 1a). Mathematically, this equals the proportion of the ordering that constitutes a shortest TSP tour, given the distance matrices derived using each topology only.

Note that, at the default settings, the values within and around the wheel nodes are calculated “on the assumption of the consensus tree topology.” For the input trees that do not have all the split sets adjacent to the wheel node, the proportion of topologies within all possible combinations of the corresponding branch positions (see “Formulation” and Fig. 2b) that contain the corresponding topologies are added to those values. Therefore, they are not strict support values (like the ones at internal branches) but expected values based on the assumption of the consensus tree topology. For the users' sake, the current implementation also offers options for showing the strict values that input trees contain the corresponding topologies as well as the average distances used for the TSP calculation.

CWT Applied to a Real Data Set

Figure 4a is a CWT representation derived from a real data set. The input trees were 246 phylogenetic trees obtained by applying the maximum-likelihood method to reliable single-copy orthologs conserved among 21 fungal species (downloaded from FUNYBASE; Marthey et al. 2008); a 60% threshold was used in building the consensus tree.

FIGURE 4.
CWT for 264 trees based on single orthologs conserved over 21 fungal species. a) CWT based on the 60% consensus tree. The values at the wheel nodes include the estimated values for trees that do not contain all splits adjacent to them (default option). ...

The wheel node indicated by the thick arrow connects the four splits Ago, Kla, X, and Y. This node indicates that, given the topology of the consensus tree, 73% of the input trees are expected to contain either of the splits {Ago,Kla || X,Y} or {Kla,X || Ago,Y}, and 39% and 34% contain the 3-furcations {Ago || Kla || X,Y} and {Kla || X || Ago,Y}, respectively. Neither the best tree nor the consensus tree representation provides such detailed information, unless multiple trees are used or the tree shapes are abandoned as in the phylogenetic network representation (Fig. 4b). Note that the values on the opposite sides of the wheel nodes are the same. This is because in cases of 4-furcating nodes, for example, if the splits Ago and Kla are adjacent (i.e., {Ago || Kla || X,Y}) then X and Y are adjacent (i.e., {X || Y || Ago, Kla}). This property does not hold true without the consensus tree topology assumption, because the input trees can contain trees that do not contain the splits X and Y and in such cases trees with {Ago || Kla || X,Y} can be without {X || Y || Ago,Kla} (Fig. 4a, left sides of the dotted rectangles). It is also worthwhile to examine the average distance around each wheel node used in the TSP calculation (Fig. 4a, right sides of the dotted rectangles); in cases of 4-furcating nodes, the distances between splits are equal to the proportions that they do not constitute in a monophyletic group.

In addition, because a CWT uses a tree shape, taxonomic groups at multiple levels can be recognized fairly intuitively. In particular, a CWT can suggest the existence of taxonomic groups with support below the consensus-tree threshold, in addition to those making splits on the consensus trees, as successive branches around wheel nodes. Figure 5a is a CWT representation of the same data set as in Figure 4, based on an 80% consensus tree. To avoid a cluttered appearance because of the many multifurcations, a color visualization option that uses colors instead of numbers is used to show the support values around the wheel nodes. For example, the thick gray arc lines α, β, γ, and δ indicate the class Eurotiomycetes, class Sordariomycetes, order Hypocreales, and subphylum Agaricomycotina, respectively. An interesting application is the star-like CWT, obtained by specifying high threshold values where no split is supported at a level above the threshold (Fig. 5b; branch lengths are ignored and only the topology is shown). The star-like CWT shows “the optimal sequential ordering” of all taxa based on the distribution of the input topologies and, as a result, many biological groups appear as successive branches around the wheel node (thick gray arc lines). By virtue of the sophistication of the present TSP solvers (Applegate et al.), it takes only a few seconds to obtain such star-like CWTs of this size.

FIGURE 5.
Taxonomic groups as successive branches around wheel nodes. a) CWT based on the 80% consensus tree for the same data set as in Figure 4. This is the color visualization, which uses colors instead of numbers to show values around the wheel nodes, to avoid ...

Characteristics of CWTs Clarified by an Artificial Data Set

To help understand how the CWT representation works, we show an artificially created example that sheds light on the characteristics of CWTs. Figure 6a is a data set containing 15 trees, and Figure 6b,c are the produced CWT representations based on the 50% majority-rule consensus tree showing support values and average distances, respectively. The strict option does not change the support values because there is no nontrivial split exceeding the 50% threshold (Fig. 6d) and the consensus tree contains trivial splits only. Figure 6e shows the sums of the distances between each split pair, which are used for the TSP calculation for the circular ordering determination.

FIGURE 6.
Characteristics of CWTs demonstrated with an artificial data set. a) The 15 input tree topologies. b) The CWT based on the 50% majority rule consensus tree. The support values do not change by the strict value option because this CWT contains trivial ...

Two notable characteristics of CWT are demonstrated with this example. First, though the most frequent split is {c,e || a,b,d} (Fig. 6d), the distance between c and e is not the smallest (Fig. 6e). In other words, they are given the highest priority if we construct a consensus tree by adopting highly supported splits in a greedy manner (Greedy consensus tree, Fig. 6f) but not if we build the CWT. This is because the CWT considers more information than the existences of splits, that is, it also considers the distances between them. In this example, though the split {a,b || c,d,e} appears only once, the splits a and b are always within a distance of 1 in all trees and thus the cost for making them successively becomes small. Such differences can be made clear by visualizing and comparing the support values and the average distances (Fig. 6b,c; we recommend trying these options, especially if weak support values are displayed). Though the support for the 3-furcation {c || e || a,b,d} is over twice as ones of {a || d || b,c,e} and {b || d || a,c,e}, the distances between c and e, a and d, and b and d are the same. Such information cannot be easily obtained by glancing at the input trees (Fig. 6a), the split sets (Fig. 6d), or the greedy consensus tree (Fig. 6f). The other characteristic of CWTs that can be seen in this example is that, although the smallest distance is between the splits a and b (Fig. 6e), they are not successive around the resultant wheel node (Fig. 6b,c). This occurs because of the intrinsic nature of TSP: choosing locally optimal paths does not always derive the globally best tour, and in this example the total distance of any tour containing the path a–b exceeds that of the tour a–c–e–b–d (Fig. 6e). This is intentional and reflects the very objective of a CWT: the most balanced representation of the topology distributions.

Software Availability

The software for building a CWT was implemented in Java, and is available at http://cwt.cb.k.u-tokyo.ac.jp/. Users can run the program on the Web or download it under the GNU General Public License. To run the program locally, it is necessary to install Java SE 5.0 (or higher), Concorde (one of the best exact TSP solvers currently available; Applegate et al.), Phyutility Java archive; Smith and Dunn 2008), Apache XML Graphics Commons Java archives, and Args4j Java archive.

The software accepts any set of rooted or unrooted phylogenetic trees in the Newick format covering the same taxa set, with one tree per line. A real value separated by space at the beginning of each line can be provided to indicate the weight of the tree. A threshold value for building the consensus tree should be designated. If a threshold >100 is given, no split is supported stronger than the threshold and thus a star-like CWT is produced. Options to use rooted trees, show strict support values, show average distances, use the color visualization, and output only topologies are available.

The software produces the CWT representation as .nhx, .ps, and .png files. The .nhx files are in the New Hampshire eXtended format (http://www.phylosoft.org/NHX/), which is an extension of the conventional Newick format. The extra information described in the main text is given by using “:XN=” tags wrapped by “[&&NHX” and “]”, which mean “custom data associated with nodes” in the NHX format. For example, if (i) a 4-furcating wheel node is surrounded by Clade1 to Clade4, (ii) the values flanked by Clade1 and Clade2, Clade2 and Clade3, Clade3 and Clade4, and Clade4 and Clade1 are Val1, Val2, Val3, and Val4, respectively, and (iii) the value within the wheel node is ValC, the notation is “(Clade1, Clade2, Clade3, Clade4)[&&NHX: XN=ValC||Val1,Val2,Val3,Val4];” (in the case that the node is the root) or “(Clade2, Clade3, Clade4)[&&NHX: XN=ValC||Val1,Val2,Val3,Val4]Clade1” (otherwise). That is, circular orderings of branches are designated by appearance order and statistical information is given in the square brackets. “:B=” tags can also be inserted before the “:XN=” tags, to represent confidence values for parent branches.ps (PostScript) and .png (Portable Network Graphics) files provide 2-dimensional visualizations of the CWT. The visualization is based on a modified radial layout that follows the TSP-based circular orderings of branches.

CONCLUSION AND PERSPECTIVES

In this paper, we introduced the CWT representation, which provides rich information on phylogenetic topology distributions in a highly intuitive and most balanced manner. Examples based on real and artificial data sets were also provided to show the advantages and characteristics of CWTs. Better phylogenetic representations are of increasing importance because DNA sequencing technologies are advancing at an exponential rate, leading to ubiquitous demands for interpreting large-scale phylogenetic analyses. In addition, because a CWT conceptually resembles an intermediate data structure in phylogenetic network construction (Dress and Huson 2004), CWT could possibly be used to extend them. Furthermore, because the basic concept of the centroid representation is fairly general, it may also be applied to fields beyond the life sciences that are tending toward high-dimensional solution space, considerable noise, and data explosions.

FUNDING

This work was supported by Grant-in-Aid for Scientific Research on Priority Areas “Systems Genomics” (17017002) from the Ministry of Education, Culture, Sports, Science and Technology of Japan and Grant-in-Aid for Young Scientists (Start-up) (21810005) from the Japan Society for the Promotion of Science.

Acknowledgments

The authors thank M. Hamada, K. Kin, Y. Watanabe, J. Sullivan, M. Holder, and the anonymous reviewers for valuable suggestions and comments on the manuscript.

APPENDIX

Definitions

Let T1,…,TN be N input phylogenetic trees and w1,…,wN be their weights. Each phylogenetic tree is defined by Ti = (Vi,Ei), where Vi is the set of all external and internal nodes and Ei is the set of all branches (edges). Hereafter, all trees are assumed to have the same set of taxa X = {x1,…,xM} on the external nodes.

Any branch e on a tree T = (V,E) splits the taxa set X into two groups. We call such a bipartition system Split(e,T). Let SplitSet(T) = {Split(e,T)||e[set membership]E} and Freq(s) be the weighted frequency of the trees that contain the split s in T1,…,TN, i.e., Freq(s) = ∑i = 1Nδiwi where δi = 1ifs[set membership]SplitSet(Ti) and δi = 0otherwise. AdjBranches(v,T) is the set of branches directly attached to node v on T, and MultiFurcatingNodes(T) = {v[set membership]V||||AdjBranches(v,T)|| ≥ 4}. If e[set membership]AdjBranches(v,T), Descendants(e,v,T) is the subset of X that can be traversed from the node adjacent to e that is not v, without crossing e. Let ExtBranch(x,T) be an external branch on T attached to the taxon x.

For Xsub[subset, dbl equals]X and {xa,xb}[subset, dbl equals]Xsub, we define Dist(xa,xb,Xsub,T) as follows. First, recursively remove any external branch e from T if e[negated set membership]{ExtBranch(x,T)||x[set membership]Xsub}. Second, recursively remove any internal node v if ||AdjBranches(v,T)|| = 2 by replacing AdjBranches(v,T) with one internal branch to keep the tree T connected. Then Dist(xa,xb,Xsub,T) is the minimum number of edges on the converted tree T that separate ExtBranch(xa,T) and ExtBranch(xb,T). Note that ExtBranch(xa,T) and ExtBranch(xb,T) themselves are not counted (i.e., if they are adjacent each other, then Dist(xa,xb,Xsub,T) = 0).

Pseudocode for Obtaining CWTs

Given T1,…,TN, w1,…,wN, and a threshold value γ for building the consensus tree, the following pseudocode produces a CWT. Note that the function c is symmetric. i.e., c(ep,eq)[equivalent]c(eq,ep).

Construct the consensus tree TC = (VC,EC), where An external file that holds a picture, illustration, etc.
Object name is sysbiosyq044fx2_ht.jpg

An external file that holds a picture, illustration, etc.
Object name is sysbiosyq044fx3_ht.jpg

Lay out TC by following the TSP-cycle orderings for the branches around MultiFurcatingNodes(TC).

Consensus Tree is the Centroid Representation

It can be shown that the consensus tree is in fact the centroid representation of all candidate trees regarding the loss function of “split incongruity”, which quantifies the degree of disagreement between split sets of two trees. More precisely, given a tree representation T = (V,E) and a candidate tree Ti = (Vi,Ei), the “split incongruity loss function” is defined as

An external file that holds a picture, illustration, etc.
Object name is sysbiosyq044fx4_ht.jpg

where

An external file that holds a picture, illustration, etc.
Object name is sysbiosyq044fx5_ht.jpg

This function becomes large if T contains splits that are absent from Ti (“false positives”) and vice versa (“false negatives”). ξ designates the relative penalties for the two types of errors and usually ξ ≥ 1 because by convention false positives are more undesirable, given that they would exaggerate weak phylogenetic signals. To obtain the centroid representation, it is necessary to know the posterior probability P(Ti||D) in addition to the loss function and, for example, both bootstrap and Bayesian methods have been developed to give approximate values for P(Ti||D) (Felsenstein 1985, Huelsenbeck et al. 2001). If we define w1,…,wN as wi[proportional, variant]P(Ti||D), the “split incongruity centroid tree” TSplitCentroid is the tree that fulfills

An external file that holds a picture, illustration, etc.
Object name is sysbiosyq044fx6_ht.jpg

(Berry and Gascuel 1996, Holder et al. 2008, Margush and McMorris 1981). Therefore, TSplitCentroid with the penalty parameter ξ is the consensus tree with the threshold ξ/(1 + ξ). In other words, the consensus tree with the threshold γ is TSplitCentroid with the penalty parameter γ/(1 − γ). Note that if 1 ≤ ξ then 1/2 ≤ γ < 1.

CWT is the Centroid Representation in Double Metrics

As was already described, the consensus tree TC = (VC,EC) still possesses ambiguity with regard to branch orderings around the multifurcating nodes. Each layout can be specified by a function f(v,t):VC×N→N that is defined for v[set membership]MultiFurcatingNodes(TC) and t = 1,…,k(v,TC), where k(v,TC) = ||AdjBranches(v,TC)|| and {e1,e2,…,ek(v,TC)} = AdjBranches(v,TC), and the cycle (ef(v,1)ef(v,2)ef(v,k(v,TC))) specifies the circular ordering of AdjBranches(v,TC). In the following paragraph, we show that, among every possible layout of consensus trees, the CWT is the centroid representation regarding the loss function of “layout incongruity”.

Let Φ = (TC,f) be a layout-specified tree representation of TC. Then, the “layout incongruity loss function” is defined as

An external file that holds a picture, illustration, etc.
Object name is sysbiosyq044fx7_ht.jpg

where

An external file that holds a picture, illustration, etc.
Object name is sysbiosyq044fx8_ht.jpg

Intuitively, for each split pair sa and sa that correspond to successive branches around each multifurcating node v in Φ, lLayout sums the distances on the tree Ti between the expected positions of the corresponding branches on the assumption of the topology TC. Then, LLayout sums it for all wheel nodes to quantify how well Φ represents Ti. Then the “layout incongruity centroid tree” ΦLayoutCentroid is the tree representation that minimizes the expected LLayout for all T1,…,TN:

An external file that holds a picture, illustration, etc.
Object name is sysbiosyq044fx9_ht.jpg

where c is the traveling cost calculated in the pseudocode. The final term is a minimum if we choose f to minimize the term in the bracket for each multifurcating node v, because they are independent of each other. Such an f is exactly the TSP tour obtained in the pseudocode; therefore, CWT is ΦLayoutCentroid and the most balanced according to the two measures of split incongruity and layout incongruity.

References

  • Applegate D, Bixby RE, Chvátal V, Cook W. http://www.tsp.gatech.edu/concorde.html.
  • Berry V, Gascuel O. On the interpretation of bootstrap trees: appropriate threshold of clade selection and Induced gain. Mol. Biol. Evol. 1996;13:999–1011.
  • Bonnard C, Berry V, Lartillot N. Multipolar consensus for phylogenetic trees. Syst. Biol. 2006;55:837–843. [PubMed]
  • Bryant D, Moulton V. Neighbor-net: an agglomerative method for the construction of phylogenetic networks. Mol. Biol. Evol. 2004;21:255–265. [PubMed]
  • Carvalho LE, Lawrence CE. Centroid estimation in discrete high-dimensional spaces with applications in biology. Proc. Natl. Acad. Sci. U S A. 2008;105:3209–3214. [PubMed]
  • Charleston MA. Spectrum: spectral analysis of phylogenetic data. Bioinformatics. 1998;14:98–99. [PubMed]
  • Ciccarelli FD, Doerks T, von Mering C, Creevey CJ, Snel B, Bork P. Toward automatic reconstruction of a highly resolved tree of life. Science. 2006;311:1283–1287. [PubMed]
  • Dress AW, Huson DH. Constructing splits graphs. IEEE/ACM Trans Comput Biol. Bioinform. 2004;1:109–115. [PubMed]
  • Felsenstein J. Confidence limits on phylogenies: an approach using the bootstrap. Evolution. 1985;39:783–791.
  • Galtier N, Daubin V. Dealing with incongruence in phylogenomic analyses. Phil. Trans. R. Soc. Lond. Ser. B. 2008;363:4023–4029. [PMC free article] [PubMed]
  • Hamada M, Kiryu H, Sato K, Mituyama T, Asai K. Prediction of RNA secondary structure using generalized centroid estimators. Bioinformatics. 2009;25:465–473. [PubMed]
  • Hendy MD, Penny D. Spectral Analysis of Phylogenetic Data. J Classif. 1993;10:5–24.
  • Hillis DM, Heath TA, St John K. Analysis and visualization of tree space. Syst Biol. 2005;54:471–482. [PubMed]
  • Holder MT, Sukumaran J, Lewis PO. A justification for reporting the majority-rule consensus tree in Bayesian phylogenetics. Syst Biol. 2008;57:814–821. [PubMed]
  • Holland B, Moulton V. In: Consensus networks: a method for visualising incompatibilities in collections of trees. Benson G, Page R, editors. 2003. WABI 2003, LNBI 2812. Berlin (Germany): Springer. p. 165–176.
  • Huelsenbeck JP, Ronquist F, Nielsen R, Bollback JP. Bayesian inference of phylogeny and its impact on evolutionary biology. Science. 2001;294:2310–2314. [PubMed]
  • Huson DH, Bryant D. Application of phylogenetic networks in evolutionary studies. Mol Biol Evol. 2006;23:254–267. [PubMed]
  • Huson DH, Dezulian T, Klopper T, Steel MA. Phylogenetic super-networks from partial trees. IEEE/ACM Trans. Comput. Biol. Bioinform. 2004;1:151–8. [PubMed]
  • Huson DH, Kloepper TH. Computing recombination networks from binary sequences. Bioinformatics. 2005;21(Suppl 2):ii159–ii165. [PubMed]
  • Joshi A, De Smet R, Marchal K, Van de Peer Y, Michoel T. Module networks revisited: computational assessment and prioritization of model predictions. Bioinformatics. 2009;25:490–6. [PubMed]
  • Margush T, McMorris FR. Consensus n-Trees. Bull. Math. Biol. 1981;43:239–244.
  • Marthey S, Aguileta G, Rodolphe F, Gendrault A, Giraud T, Fournier E, Lopez-Villavicencio M, Gautier A, Lebrun MH, Chiapello H. FUNYBASE: a FUNgal phYlogenomic dataBASE. BMC Bioinform. 2008;9:456. [PMC free article] [PubMed]
  • Nye TM. Trees of trees: an approach to comparing multiple alternative phylogenies. Syst. Biol. 2008;57:785–794. [PubMed]
  • Smith SA, Dunn CW. Phyutility: a phyloinformatics tool for trees, alignments and molecular data. Bioinformatics. 2008;24:715–6. [PubMed]
  • Stockham C, Wang LS, Warnow T. Statistically based postprocessing of phylogenetic analysis by clustering. Bioinformatics. 2002;18(Suppl 1):S285–S293. [PubMed]
  • Wilkinson M. Majority-rule reduced consensus trees and their use in bootstrapping. Mol. Biol. Evol. 1996;13:437–44. [PubMed]

Articles from Systematic Biology are provided here courtesy of Oxford University Press