Front Neurosci. 2010; 4: 47.

PMCID: PMC2928700

Elissaveta Arnaoudova,^{1,}^{†} David C. Haws,^{2,}^{†} Peter Huggins,^{3} Jerzy W. Jaromczyk,^{1} Neil Moore,^{1} Christopher L. Schardl,^{4} and Ruriko Yoshida^{2,}^{*}

Edited by: Raina Robeva, Sweet Briar College, USA

Reviewed by: Tom M. W. Nye, Newcastle University, UK; Liang Liu, Harvard University, USA

*Correspondence: Ruriko Yoshida, Department of Statistics, University of Kentucky, 817 Patterson Office Tower, Lexington, KY 40506-0027, USA. e-mail: ruriko.yoshida@uky.edu

This article was submitted to Frontiers in Systems Biology, a specialty of Frontiers in Neuroscience.

Received 2010 April 13; Accepted 2010 June 9.

Copyright © 2010 Arnaoudova, Haws, Huggins, Jaromczyk, Moore, Schardl and Yoshida.

This is an open-access article subject to an exclusive license agreement between the authors and the Frontiers Research Foundation, which permits unrestricted use, distribution, and reproduction in any medium, provided the original authors and source are credited.

We propose a statistical method to test whether two phylogenetic trees with given alignments are significantly incongruent. Our method compares the two distributions of phylogenetic trees given by two input alignments, instead of comparing point estimates of trees. This statistical approach can be applied to gene tree analysis, for example to detect unusual events in genome evolution such as horizontal gene transfer and reshuffling. Our method uses the difference of means to compare two distributions of trees, after mapping trees into a vector space. Bootstrapping alignment columns can then be applied to obtain *p*-values. To compute distances between means, we employ a “kernel method” which speeds up distance calculations when trees are mapped into a high-dimensional feature space, e.g., the splits or quartets feature space. In this pilot study, we first test our statistical method on data sets simulated under a coalescence model, to test whether two alignments are generated by congruent gene trees. We follow our simulation results with applications to data sets of gophers and lice, grasses and their endophytes, and different fungal genes from the same genome. A companion toolkit, `Phylotree`, is provided to facilitate computational experiments.

Estimating differences between phylogenetic trees is one of the fundamental questions in computational biology. Conflicting phylogenies arise when, for example, different phylogenetic reconstruction methods are applied to the same data set, or even when one reconstruction method is applied to multiple different genes. Gene phylogenies may be codivergent by virtue of congruence (identical trees) or insignificant incongruence. Otherwise, they may be significantly incongruent (Maddison, 1997). All of these outcomes are fundamentally interesting. Congruence of gene trees (or subtrees) is often considered the most desirable outcome of phylogenetic analysis, because such a result indicates that all sequences in the clade are orthologs (homologs derived from the same ancestral sequence without a history of gene duplication or lateral transfer), and that discrete monophyletic clades can be unambiguously identified, perhaps supporting novel or previously described taxa. In contrast, gene trees that are incongruent are often considered problematic because the precise resolution of speciation events seems to be obscured. Thus, it would also be very useful to identify significant incongruencies in gene trees because these represent non-canonical evolutionary processes (e.g., Maddison and Knowles, 2006; Edwards et al., 2007; Liu et al., 2008). In this paper we propose a statistical hypothesis test which tells whether two phylogenetic trees are significantly incongruent with each other by comparing two distributions of phylogenetic trees, instead of comparing two point estimates. More specifically, we compare two distributions of trees using the *difference of means*. Our statistical hypotheses are:

*H*_{0}: Phylogenetic trees *T*_{1} and *T*_{2} are congruent.

*H*_{1}: Phylogenetic trees *T*_{1} and *T*_{2} are incongruent.

Usually a statistical test on the above hypotheses considers point estimates of the trees obtained by a tree reconstruction method, such as maximum likelihood (ML) estimates (Felsenstein, 1981; Galtier et al., 2005) or the neighbor-joining method (Saitou and Nei, 1987). See Schardl et al. (2008) and references within for an overview. Variation of reasonable tree estimates can be assessed, for example, by using the bootstrap or jackknife method.

There are several techniques to test if gene trees are codiverged. For example, the Bayesian estimation methods (e.g., Ane et al., 2007; Edwards et al., 2007; Liu and Pearl, 2007), the Templeton test implemented in `paup*` (Swofford, 1998; e.g., Ge et al., 1999), the partition-homogeneity test (PHT) also implemented in `paup*` (e.g., Voigt et al., 1999), the Kishino–Hasegawa (KH) test (Kishino and Hasegawa, 1989), the Shimodaira–Hasegawa (SH) test (Shimodaira and Hasegawa, 1999), and the likelihood ratio test (LRT; e.g., Vilaa et al., 2005) are statistical methods to see if there is a “significant” level of incongruence between the trees [these methods are also called partition likelihood support (PLS; Lee and Hugall, 2003)]. However, there is a limitation in many methods for comparing two phylogenetic trees: it is implicitly assumed that the two given trees are correctly estimated phylogenies. In reality, trees are estimated from observed data (e.g., fossil record, sequence data), and tree uncertainty is the rule instead of the exception. Holmes (2005) summarized a framework for statistical hypothesis testing on trees, including methods using distributions of phylogenetic trees, such as the posterior distribution or the bootstrap sampling distribution of trees. Holmes (2005) also briefly described a statistical method to compare two bootstrap sampling distributions of trees, using the mean and variance of each distribution. Here we expand these methods to use posterior means, instead of tree-valued estimators, to estimate trees, and we apply the bootstrap method to assess variation in the posterior means.

This paper is organized as follows: In Section “Materials and Methods,” we state our method. In Section “Results,” we show simulation studies with data generated by the software `Mesquite` (Maddison and Knowles, 2006) and compare our method with the method described in Example 3 of Section 4.4.1 in Holmes (2005) as well as the SH test. In Section “Discussion,” we apply our method to well-known gopher-louse data sets from Hafner and Nadler (1990) and grass-endophyte data sets from Schardl et al. (2008). We end with a discussion.

Let $\mathcal{T}_n$ be the space of trees on the set

**Definition 1:** *Given a map into a normed space* $v:\mathcal{T}_n\to$

The difference between trees $T_1,T_2\in\mathcal{T}_n$ can be quantified as the distance $\|v(T_1)-v(T_2)\|$, where $\|\cdot\|$ is any norm. In this paper we will focus on

A notable example of our framework is the *dissimilarity map distance*.

**Definition 2:** *For* $T\in\mathcal{T}_n$*, let* $v(T)=(d_{1,2}^{T},d_{1,3}^{T},\dots,d_{n-1,n}^{T})\in\mathbb{R}^{n(n-1)/2}$

$$d(T_1,T_2)=\|v(T_1)-v(T_2)\|=\sqrt{(d_{1,2}^{T_1}-d_{1,2}^{T_2})^{2}+\dots+(d_{n-1,n}^{T_1}-d_{n-1,n}^{T_2})^{2}},$$

*where* $\|\cdot\|$ *represents the* $L_2$ *norm (Euclidean length)*.

In our computational experiments, we will use the dissimilarity map distance. Dissimilarity map distance was studied in Buneman (1971). One can also consider a variation where all edge lengths are set to 1. The arising dissimilarity map distance is called the *path difference* (Steel and Penny, 1993) and only depends on tree topologies.
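As a minimal sketch of Definition 2, the Python below builds the vector of leaf-to-leaf path lengths for a small tree and takes the Euclidean distance between two such vectors. It assumes that $d_{i,j}^{T}$ is the path length between leaves $i$ and $j$, and the edge-list input format, the function names, and the two example quartet trees are illustrative choices rather than anything taken from `Phylotree`.

```python
from itertools import combinations
import math


def leaf_distances(edges, leaves):
    """Return v(T): path lengths for all leaf pairs, in lexicographic pair order.

    edges  -- iterable of (node_a, node_b, length) describing an unrooted tree
    leaves -- ordered list of leaf labels
    """
    adj = {}
    for a, b, w in edges:
        adj.setdefault(a, []).append((b, w))
        adj.setdefault(b, []).append((a, w))

    def dist_from(src):
        # Walk the tree from src; each node is reached by exactly one path.
        dist = {src: 0.0}
        stack = [src]
        while stack:
            node = stack.pop()
            for nxt, w in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + w
                    stack.append(nxt)
        return dist

    all_dist = {leaf: dist_from(leaf) for leaf in leaves}
    return [all_dist[i][j] for i, j in combinations(leaves, 2)]


def dissimilarity_map_distance(edges1, edges2, leaves):
    """Euclidean (L2) distance between the two leaf-distance vectors."""
    v1 = leaf_distances(edges1, leaves)
    v2 = leaf_distances(edges2, leaves)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))


def path_difference(edges1, edges2, leaves):
    """Same distance after setting every edge length to 1 (topology only)."""
    def unit(edges):
        return [(a, b, 1.0) for a, b, _ in edges]
    return dissimilarity_map_distance(unit(edges1), unit(edges2), leaves)


# Hypothetical example: quartet trees with splits 12|34 and 13|24.
leaves = [1, 2, 3, 4]
tree_a = [(1, "u", 0.1), (2, "u", 0.2), ("u", "w", 0.3), (3, "w", 0.1), (4, "w", 0.2)]
tree_b = [(1, "u", 0.1), (3, "u", 0.1), ("u", "w", 0.3), (2, "w", 0.2), (4, "w", 0.2)]
print(path_difference(tree_a, tree_b, leaves))
```

For the two quartets, four of the six leaf pairs change their unit-length path by exactly one edge, so the path difference is $\sqrt{4}=2$.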

In our framework, we are given *D*_{1},*D*_{2}, each a collection of *n* aligned homologous sequences. We assume *D*_{1},*D*_{2} were generated by models of sequence evolution on unknown trees $T_1,T_2\in\mathcal{T}_n$. After mapping trees into a vector space, we define our statistical hypotheses:

$$\begin{array}{c}H_0:\|v(T_1)-v(T_2)\|=0;\\ H_1:\|v(T_1)-v(T_2)\|>0.\end{array}$$

(1)

For convenience, we describe our approach as comparing two gene trees $T_1,T_2\in\mathcal{T}_n$ from the same set of species. One can also compare a phylogeny for host species and a phylogeny for corresponding parasites, as we do in Section “Experiments with Real Data Sets.”

Random fluctuations in sequence evolution can cause reconstructed gene trees for *D*_{1} and *D*_{2} to look at least slightly different, even if the true underlying trees are equal. Thus we need a way to tell if the difference between two estimated trees is “significant.”

One classical approach to assess variability in reconstructed trees is the bootstrap (Felsenstein, 1981). The bootstrap generates new hypothetical sequence alignments, by sampling (with replacement) columns of aligned sequence. Then trees can be re-estimated for each hypothetical alignment. One common application of the bootstrap is to measure support for each clade; clades that appear in most bootstrap replicate trees are regarded as likely clades in the true tree.
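The column resampling step can be sketched in a few lines of Python. This is only an illustration of the idea: the dictionary representation of an alignment and the function name `bootstrap_alignment` are assumptions made here, and tree re-estimation on each replicate would be done by external software.

```python
import random


def bootstrap_alignment(alignment, n_columns=None, rng=None):
    """Resample alignment columns with replacement.

    alignment -- dict mapping taxon name to an aligned sequence string
    n_columns -- number of columns to draw (default: original alignment length)
    rng       -- optional random.Random instance, for reproducibility
    """
    rng = rng or random.Random()
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    length = lengths.pop()
    n_columns = length if n_columns is None else n_columns
    cols = [rng.randrange(length) for _ in range(n_columns)]
    # The same column indices are applied to every taxon, so columns stay intact.
    return {taxon: "".join(seq[c] for c in cols) for taxon, seq in alignment.items()}


# Hypothetical toy alignment.
toy = {"taxon_a": "ACGTAC", "taxon_b": "ACGTTC", "taxon_c": "ACCTAC"}
replicate = bootstrap_alignment(toy, rng=random.Random(0))
print(replicate)
```

Drawing whole columns, rather than individual characters, keeps the site patterns across taxa unchanged, which is what the tree estimator relies on.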

Here we propose a bootstrap procedure to assess significance of the distance between two trees. Our method is based on the triangle inequality. Namely, if $v({\widehat{T}}_{1}),v({\widehat{T}}_{2})$ are estimators for *v*(*T*_{1}),*v*(*T*_{2}), then the triangle inequality says

$$\|v(T_1)-v(T_2)\|\ge\|v(\widehat{T}_1)-v(\widehat{T}_2)\|-\|v(T_1)-v(\widehat{T}_1)\|-\|v(T_2)-v(\widehat{T}_2)\|,$$

(2)

which gives a lower bound on the distance between the true trees $T_1,T_2\in\mathcal{T}_n$. Here the test statistic is $\|v(\widehat{T}_1)-v(\widehat{T}_2)\|$. Under the null hypothesis we have ||
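For readers who want the intermediate step, Eq. 2 can be obtained by applying the triangle inequality twice to the path $v(\widehat{T}_1)\to v(T_1)\to v(T_2)\to v(\widehat{T}_2)$. The following is a short worked derivation using only the symbols already introduced above:

$$\begin{aligned}
\|v(\widehat{T}_1)-v(\widehat{T}_2)\| &\le \|v(\widehat{T}_1)-v(T_1)\|+\|v(T_1)-v(T_2)\|+\|v(T_2)-v(\widehat{T}_2)\|,\\
\|v(T_1)-v(T_2)\| &\ge \|v(\widehat{T}_1)-v(\widehat{T}_2)\|-\|v(T_1)-v(\widehat{T}_1)\|-\|v(T_2)-v(\widehat{T}_2)\|.
\end{aligned}$$

If $H_0$ of Eq. 1 holds, the middle term of the first line is zero, so the distance between the two estimates can be no larger than the sum of the two estimation errors. This appears to be the condition written as *d*_{1}≤*d*_{2}+*d*_{3} in the Discussion.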

The bootstrap procedure we have proposed can be applied with any tree estimator, such as neighbor-joining or ML. Since we are presuming tree uncertainty is high, and Bayes estimator trees are more accurate than neighbor-joining or ML (Huggins et al., 2010), we prefer a Bayes estimator approach.

Given an alignment *D*, generated by sequence evolution on an unknown tree $T\in\mathcal{T}_n$, Bayesian MCMC sampling methods will approximately sample from the posterior distribution

$$\widehat{\Delta}=\frac{1}{N_1}\sum_{i=1}^{N_1}v(t_i)-\frac{1}{N_2}\sum_{i=1}^{N_2}v(s_i),$$

(3)

and $\|\widehat{\Delta}\|$ is an estimator for $\|v(T_1)-v(T_2)\|$.

Some feature space maps produce very high-dimensional feature vectors *v*(*T*_{1}),*v*(*T*_{2}) for trees $T_1,T_2\in\mathcal{T}_n$, yet the distance ||

**Proposition 1:** *Let* $x_1,x_2,y_1,y_2$

$$\|\mu_x-\mu_y\|^{2}=\mathbb{E}\left[\|x_1-y_1\|^{2}\right]-\frac{1}{2}\mathbb{E}\left[\|x_1-x_2\|^{2}\right]-\frac{1}{2}\mathbb{E}\left[\|y_1-y_2\|^{2}\right].$$

(4)

A proof of Proposition 1 is provided in Supplementary Material. Using the proposition and a subroutine which computes the norm in Definition 2, the length $\|\widehat{\Delta}\|$, our estimate of $\|v(T_1)-v(T_2)\|$, can be computed from the samples $\{t_1,\dots,t_{N_1}\}$ and $\{s_1,\dots,s_{N_2}\}$.
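To make the kernel idea concrete, here is a minimal Python sketch that recovers the norm of a difference of sample means using only pairwise squared distances, and checks it against the direct computation. It relies on the sample analogue of Proposition 1, namely that the mean of $\|x_i-y_j\|^2$ over all pairs, minus half the mean of $\|x_i-x_j\|^2$ and half the mean of $\|y_i-y_j\|^2$ (each taken over all ordered pairs, including $i=j$), equals $\|\bar{x}-\bar{y}\|^2$ exactly. The function names and the toy vectors are assumptions for illustration, not the `Phylotree` implementation.

```python
import math


def sq_dist(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def mean_sq_dist(xs, ys, dist2=sq_dist):
    """Average squared distance over all pairs (x, y), x in xs and y in ys."""
    return sum(dist2(x, y) for x in xs for y in ys) / (len(xs) * len(ys))


def diff_of_means_norm(xs, ys, dist2=sq_dist):
    """||mean(xs) - mean(ys)|| from pairwise squared distances only.

    dist2 may be any routine returning a squared distance between two
    sampled objects, so explicit feature vectors are never required.
    """
    sq = (mean_sq_dist(xs, ys, dist2)
          - 0.5 * mean_sq_dist(xs, xs, dist2)
          - 0.5 * mean_sq_dist(ys, ys, dist2))
    return math.sqrt(max(sq, 0.0))  # guard against tiny negative round-off


def diff_of_means_norm_direct(xs, ys):
    """Reference computation that forms both mean vectors explicitly."""
    dim = len(xs[0])
    mx = [sum(x[k] for x in xs) / len(xs) for k in range(dim)]
    my = [sum(y[k] for y in ys) / len(ys) for k in range(dim)]
    return math.sqrt(sq_dist(mx, my))


# Hypothetical feature vectors standing in for v(t_i) and v(s_i).
xs = [[0.0, 0.0], [2.0, 0.0]]
ys = [[1.0, 3.0], [1.0, 5.0]]
print(diff_of_means_norm(xs, ys), diff_of_means_norm_direct(xs, ys))
```

The appeal of the pairwise form is that `dist2` can be replaced by any tree-distance subroutine, which matters when the feature space is too large to write vectors down explicitly.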

In this section we estimate posterior distributions of phylogenetic trees via the MCMC-based software `MrBayes` (Huelsenbeck and Ronquist, 2001) and apply the difference of means method to test whether two phylogenetic trees are incongruent, i.e., the hypotheses in Eq. 1. For our exploratory simulation study, we compare two gene trees generated under coalescent models (Maddison and Knowles, 2006). For two gene trees generated under two respective species trees, there are two different congruences that could be tested, namely (a) whether the underlying species trees are congruent, and (b) whether the gene trees are congruent. Our method is designed for (b), not for (a), and we do not propose a test for (a) in this paper. Simulated data sets were generated using the software `Mesquite` (Maddison and Knowles, 2006) with parameters chosen similar to Maddison and Knowles (2006), to emulate real data and test the effectiveness of our method. `Mesquite` takes two parameters: the species depth in terms of number of generations and the population size in terms of number of individuals. Three simulation sets were generated, determined by the species depths of 100,000, 600,000, and 1,000,000. The effective population size was fixed to 100,000 for all data sets. For each simulation set, two species trees, species tree 1 and 2, with eight species were generated using the pure birth Yule process in `Mesquite`. Sequence alignments were generated by `Mesquite` under the HKY85 model with a transition–transversion ratio of 3.0 and a discrete gamma distribution with four categories and shape parameter 0.8. In all our simulations, we set the stationary probability distribution π=(0.3, 0.2, 0.2, 0.3) for A, C, G, T, respectively, the 3:2 AT:GC ratio was maintained through all trees, and our sequences were generated with 1,000 base pairs.
The coalescence gene trees generated had branch lengths in terms of the coalescence model and therefore a scaling factor of 3·10^{−8} was used to yield sequences with sequence divergence similar to real data. Table 1 shows sequence divergences. The sequence divergence was calculated in two ways: (i) the average percent pairwise difference between all sequences (Maddison and Knowles, 2006), and (ii) the minimum of the pairwise percent differences among sequences (Guindon and Gascuel, 2003).
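The two divergence summaries can be illustrated with a short Python sketch. It assumes that percent difference means the share of aligned columns at which two sequences differ, with every column counted; how gaps or ambiguous characters were treated is not stated in the text, and the function names are made up for this example.

```python
from itertools import combinations


def percent_difference(seq_a, seq_b):
    """Percentage of aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same aligned length")
    mismatches = sum(a != b for a, b in zip(seq_a, seq_b))
    return 100.0 * mismatches / len(seq_a)


def divergence_summary(sequences):
    """Return (average, minimum) percent pairwise difference over all pairs."""
    diffs = [percent_difference(a, b) for a, b in combinations(sequences, 2)]
    return sum(diffs) / len(diffs), min(diffs)


# Hypothetical toy alignment of three sequences.
average, minimum = divergence_summary(["AAAA", "AAAT", "AATT"])
print(average, minimum)
```

In the toy alignment the three pairwise differences are 25%, 50%, and 25%, so the average is one third of 100 and the minimum is 25.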

In order to estimate posterior distributions we used the MCMC-based software `MrBayes` with the following parameters: (i) for the model: HKY85+Gamma, shape parameter: 0.8, transition–transversion ratio: 3.0; and (ii) for MCMC runs: number of runs: 1, number of chains: 2, chain length: 100,000, sample frequency: 1,000, burn-in: 25%. For bootstrap sampling we sampled 100 bootstrap samples with sample size of 1,000 columns since the simulated sequences are generated with 1,000 base pairs.

We generated simulated data sets in three different ways: (i) two separate sequence data sets generated from the same gene tree, (ii) sequence data sets generated from two different gene trees under the same species tree, and (iii) sequence data sets generated from two different gene trees whose species trees are also different. We tested 10 gene trees for each species depth (i.e., 30 different gene trees in total) generated under the same species tree. One can find the species trees we used in Figure 2. We used two sets of sequences generated under the HKY model with the same tree for each test. We have the three species depths of 1,000,000, 600,000, and 100,000, with a fixed population size of 100,000. Notice that we do not observe any Type I errors with our testing method; however, in within-species comparisons at species depth of 1,000,000 the *p*-values were high in general. Also notice that for pairs of gene trees generated from different species trees under the coalescence model, the *p*-values were less than 0.001 for all pairs of genes from 1,000,000 and 600,000 species depth. However, in the case of species depth 100,000 we see that only one pair (Species1_g0/Species2_g7) has a *p*-value less than 0.05 (see **Table S4** in Supplementary Material).

*p*-values and distance between true trees appear strongly correlated. We fitted correlations between *p*-values and distance between true trees, as well as between *p*-values and the difference of means for the posterior distributions given the original sequence data sets, using a function called *loess* (Figure 3A). The fitted lines show negative correlation between the *p*-values and the distance between true trees and also negative correlation between the *p*-values and the difference of means. Note that in Figure 3A the fitted lines for distances between true trees and for differences of means are within their confidence intervals for any *p*-value below the α-level (0.05 in our case). In fact they are within their confidence intervals up to a *p*-value of 0.3. This means the differences of means with posterior distributions given the original sequence data sets are good measurements for distance between true trees for our statistical tests. This is particularly important since we usually do not know the true trees with biological data sets. For complete results of our simulations see **Tables S1 and S3** in Supplementary Material. We appear to have Type II errors, since the true gene trees are very close to each other (see **Table S4** in Supplementary Material). Also, since the bound provided in Eq. 2 is not tight for some cases, it is conservative, i.e., it tends to give higher *p*-values. Thus we have some power loss in our method.

We also compared our method with two others: the statistical hypothesis testing described in Example 3 of Section 4.4.1 in Holmes (2005), and an application of the SH test (Shimodaira and Hasegawa, 1999). For the method in Holmes (2005), to compute the ML trees we used `Raxml` (Stamatakis, 2006), and to compute *p*-values we used `R` (Feinerer and Hornik, 2009). We used a bootstrap sample size of 1,000. In our simulations, the method in Holmes (2005) had higher power than ours, but it exhibited a Type I error rate of 13%, while our method committed no Type I errors (see **Tables S1 and S3** in Supplementary Material for details).

For the SH test we used `paup*` (Swofford, 1998). The bootstrap sample size was chosen to be 100 (the same as our method), and the number of random tree topologies was chosen to be 1,000. Note that SH is designed to test whether a given tree *T*_{1} is contained in the confidence region for an unknown tree *T*_{2}. In our framework, both *T*_{1} and *T*_{2} are unknown. Thus we applied the SH procedure twice: once to test whether the ML estimate ${\widehat{T}}_{1}$ is in the confidence region for *T*_{2}, and once to test whether ${\widehat{T}}_{2}$ is in the confidence region for *T*_{1}. If both tests reject, then we declare that the overall procedure rejects *T*_{1}=*T*_{2}. We call this the “paired SH test.” To run the paired SH test at level α, each of the two individual SH tests is run at level α.
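The decision rule of the paired SH test is simple enough to state as code. The sketch below only combines two *p*-values that are assumed to have been produced already by an external SH run; the function name, argument names, and the convention of rejecting when *p* < α are illustrative assumptions.

```python
def paired_sh_rejects(p_t1_in_region_of_t2, p_t2_in_region_of_t1, alpha=0.05):
    """Reject T1 = T2 only if both individual SH tests reject at level alpha.

    Each argument is the p-value from one SH test, computed elsewhere.
    """
    return p_t1_in_region_of_t2 < alpha and p_t2_in_region_of_t1 < alpha


# Hypothetical p-values: only the first pair leads to an overall rejection.
print(paired_sh_rejects(0.01, 0.03))
print(paired_sh_rejects(0.01, 0.20))
```

Because the overall procedure rejects only when both individual tests reject, its false-positive rate cannot exceed that of either individual test.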

With these parameters, neither SH nor our method exhibited any false positives when the nominal Type I error rate was set to α≤0.1. For α≥0.05, SH had slightly more power, but our method was much more powerful than SH for small α. See Figure 3B for a power comparison of our method against SH; **Tables S1 and S3** in Supplementary Material also contain detailed *p*-value information for each test.

We tested our method with a well-known gopher-louse data set (Hafner and Nadler, 1990); see Table 2. This data set contains 17 taxa of lice and 15 taxa of gophers. In order to satisfy the requirement for an equal number of leaves for tree comparison we constructed four individual data sets reflecting all possible pairings of the two gopher species involved in the possible host jumps with their apparent parasitic louse species: (dataset 1) Thomomys talpoides–Thomomydoecus barbarae, Thomomys bottae–Thomomydoecus minor; (dataset 2) Thomomys talpoides–Geomydoecus thomomyus, Thomomys bottae–Thomomydoecus minor; (dataset 3) Thomomys talpoides–Thomomydoecus barbarae, Thomomys bottae–Geomydoecus actuosi; (dataset 4) Thomomys talpoides–Geomydoecus thomomyus, Thomomys bottae–Geomydoecus actuosi.

The posterior distributions were estimated using MrBayes with the following parameters: (i) for the model: GTR+Gamma+Invariant sites; (ii) for MCMC: number of runs: 1, number of chains: 2, chain length: 100,000, sample frequency: 1,000, burn-in: 25%; and (iii) for bootstrap sampling: 100 bootstrap samples with sample size of 379 columns which is the length of sequence alignments in the data sets.

We also tested our method with the data sets from Schardl et al. (2008). After removing cases of apparent host jumps, the data sets contain sequences from 20 taxa of grasses and 20 taxa of endophytes. Sequences were aligned with the aid of `PILEUP` implemented in `SEQWeb` Version 1.1 with `Wisconsin Package` Version 10 (Genetics Computer Group, Madison, WI). PILEUP parameters were adjusted empirically; a gap penalty of 2 and a gap extension penalty of 0 resulted in reasonable alignment of intron–exon junctions and intron regions of endophyte sequences, and of intergenic spacer and intron regions of cpDNA sequences. Alignments were scrutinized and adjusted by eye, using tRNA or protein coding regions as anchor points. For phylogenetic analysis of the symbionts, sequences from *tubB* (encoding β-tubulin) and *tefA* (encoding translation elongation factor 1-α) were concatenated to create a single, contiguous sequence of approximately 1400 bp for each endophyte, of which 357 bp was exon sequence and the remainder was intron sequence. For phylogenetic analysis of the hosts, sequences for both cpDNA intergenic regions (*trnT*-*trnL* and *trnL*-*trnF*) and the *trnL* intron were aligned individually then concatenated to give a combined alignment of approximately 2200 bp. Analysis was also performed using the sequences from *tubB* and *tefA* separately.

The posterior distributions were estimated using MrBayes with the following parameters: (i) for the model: GTR+Gamma+Invariant sites; (ii) for MCMC: number of runs: 1, number of chains: 2, chain length: 100,000, sample frequency: 1,000, burn-in: 25%; and (iii) for bootstrap sampling: 100 bootstrap samples, number of bootstrap columns equals length of original alignment.

These results are interesting in comparison with the prior finding of a significant relationship between the phylogenies of the grasses and their endophytes (Schardl et al., 2008). The previous analysis indicated a significant relationship between ages of corresponding nodes in endophyte and grass phylogenies, addressing whether divergences of grass and endophyte clades tended to occur at approximately the same time. In contrast, results of the analysis above suggest that the grass and endophyte phylogenies are significantly different (Table 2). We conclude that such a relationship of node ages does not necessarily imply similar phylogenetic histories. This is reasonable because the relationship of grasses and their endophytes is expected to be one of diffuse cospeciation at best. Individual species of endophyte may be associated with genera or tribes of grasses, but rarely with individual species. This contrasts with the gopher–gopher louse situation, where evidence suggests a much stricter coevolutionary relationship (Table 2).

We chose an additional biological data set to compare phylogenies of genes that occur together in endophyte genomes. Whereas *tefA* and *tubB* are housekeeping genes present in all isolates, *lolC* is a secondary metabolism gene sporadically present in endophyte isolates (Spiering et al., 2002). It has been suggested that such sporadically occurring secondary metabolism genes may be distributed in fungi largely by horizontal gene transfer (Walton, 2000). To investigate this possibility in the case of *lolC*, we used our approach to test whether the phylogenies of these three genes were significantly different. The most likely trees obtained by MCMC showed related but non-identical topologies (Figure 4; note placement of genes from *Epichloe festucae* and *Epichloe brachyelytri*). Our test found no significant difference between the phylogenies, although the *p*-values appear stochastically smaller than the *p*-values observed for simulated data under the null. This perhaps reflects the conservative nature of our test. Removing either *Epichloe festucae* or *Epichloe brachyelytri* altered the results only slightly (Table 3). These results indicate that *lolC* evolution was largely or exclusively by descent, and disfavor horizontal transfer as an explanation for the sporadic distribution of this gene.

To facilitate computations for our experiments, we developed a set of programs, collectively called `Phylotree`. `Phylotree` is organized as a collection of scripts for running a complete computational experiment starting from sequence alignments, then sampling phylogenetic trees and computing distances between phylogenetic trees and their distributions (see Section “Materials and Methods”). Supported distance measures include path difference, dissimilarity map distance, and Robinson–Foulds distance. Available scripts allow for selecting the number of columns and the number of bootstrap samples and for linking taxa in the alignments, and provide flexibility for using different sampling methods (e.g., `MrBayes` or `BEAST`) and distance measures. This is free software, and will be distributed under the terms of the GNU General Public License. One can download the software at `http://csurs7.csr.uky.edu/phylotree/`. The login information can be obtained at `http://cophylogeny.net/research.php`.

In this paper we presented a method to determine if two phylogenetic trees with given alignments are significantly incongruent. Our method computes the difference of means of posterior distributions of trees, which has the advantage of using entire tree distributions, as opposed to single tree estimators.

In this paper we used the triangle inequality (*d*_{1}≤*d*_{2}+*d*_{3} in Figure 1) to derive a bootstrap procedure to compute *p*-values (we included the box plots for *p*-values and the ROC curve for our method, see **Figures S1 and S2** in Supplementary Material). However, our bootstrap procedure appears to be very conservative, producing *p*-values whose null distribution is stochastically much larger than uniform *U*(0,1). Thus in order to increase the power we might want to consider different criteria for computing *p*-values. One approach may be to define $v({\widehat{T}}_{1}),v({\widehat{T}}_{2})$ to be the average of bootstraps $\{v({T}_{1}^{*})\},\{v({T}_{2}^{*})\},$ rather than the initial tree estimates. Another possibility is to replace the triangle inequality with a max condition [e.g., in Figure 1 use the condition *d*_{1}≤max(*d*_{2},*d*_{3})]. We explored this in the Supplementary Material, and it seems that the max condition provides much more power, but is somewhat anti-conservative.
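The difference between the two criteria can be illustrated with a small Python sketch. The exact bootstrap procedure is not spelled out in this text, so the sketch assumes that the *p*-value is the fraction of bootstrap replicates for which the chosen condition holds, with *d*_{1} the observed distance between the two tree estimates and *d*_{2}, *d*_{3} the distances from each estimate to its bootstrap replicate. The function name and the toy numbers are hypothetical.

```python
def bootstrap_p_value(d1, d2_reps, d3_reps, condition="triangle"):
    """Fraction of bootstrap replicates that satisfy the chosen condition.

    d1       -- observed distance between the two tree estimates
    d2_reps  -- per-replicate distances for the first alignment
    d3_reps  -- per-replicate distances for the second alignment
    condition -- "triangle" for d1 <= d2 + d3, "max" for d1 <= max(d2, d3)
    """
    if len(d2_reps) != len(d3_reps) or not d2_reps:
        raise ValueError("need the same positive number of replicates for both")
    if condition == "triangle":
        def bound(d2, d3):
            return d2 + d3
    elif condition == "max":
        def bound(d2, d3):
            return max(d2, d3)
    else:
        raise ValueError("condition must be 'triangle' or 'max'")
    hits = sum(d1 <= bound(d2, d3) for d2, d3 in zip(d2_reps, d3_reps))
    return hits / len(d2_reps)


# Hypothetical distances from four bootstrap replicates.
d2 = [0.4, 0.7, 0.2, 1.2]
d3 = [0.5, 0.6, 0.3, 0.1]
print(bootstrap_p_value(1.0, d2, d3, "triangle"))
print(bootstrap_p_value(1.0, d2, d3, "max"))
```

Since max(*d*_{2},*d*_{3}) never exceeds *d*_{2}+*d*_{3} for nonnegative distances, the max condition can only lower the fraction, which is consistent with it being more powerful but less conservative.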

In this paper we used the dissimilarity map as a feature space. However, there are other common tree features which can be used to define different feature spaces. Examples of distances derived from tree features include (normalized) Robinson–Foulds distance (Robinson and Foulds, 1981); quartet distance (Estabrook et al., 1985); and the path difference metric (Steel and Penny, 1993). Of course, in all the above examples, we could choose any vector space norm, such as $L_p$ for any

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The Supplementary Material for this article can be found online at http://www.frontiersin.org/neuroscience/systemsbiology/paper/10.3389/fnins.2010.00047

David Haws, Elissaveta Arnaoudova, Jerzy W. Jaromczyk, Christopher L. Schardl, and Ruriko Yoshida are supported by NIH R01 grant 5R01GM086888. Elissaveta Arnaoudova, Jerzy W. Jaromczyk, and Neil Moore developed the software `Phylotree`. Peter Huggins is supported by the Lane Fellowship in Computational Biology at Carnegie Mellon University. We thank the anonymous referees for useful comments which improved this paper.

- Ane C., Larget B., Baum D. A., Smith S. D., Rokas A. (2007). Bayesian estimation of concordance among gene trees. Mol. Biol. Evol. 24, 412–426. doi: 10.1093/molbev/msl170
- Buneman P. (1971). “The recovery of trees from measures of similarity,” in Mathematics of the Archaeological and Historical Sciences, eds Hodson F., Kendall D., Tautu P. (Edinburgh: Edinburgh University Press), 387–395
- Edwards S., Liu L., Pearl D. (2007). High-resolution species trees without concatenation. Proc. Natl. Acad. Sci. U.S.A. 104, 5936–5941. doi: 10.1073/pnas.0607004104
- Estabrook G., McMorris F., Meacham C. (1985). Comparison of undirected phylogenetic trees based on subtrees of four evolutionary units. Syst. Zool. 34, 193–200. doi: 10.2307/2413326
- Feinerer I., Hornik K. (2009). wordnet: WordNet Interface. R package version 0.1-5. http://CRAN.R-project.org/package=wordnet
- Felsenstein J. (1981). Evolutionary trees from DNA sequences. J. Mol. Evol. 17, 368–376. doi: 10.1007/BF01734359
- Galtier N., Gascuel O., Jean-Marie A. (2005). “An introduction to Markov models in molecular evolution,” in Statistical Methods in Molecular Evolution, ed. Nielsen R. (New York: Springer), 3–24
- Ge S., Sang T., Lu B., Hong D. (1999). Phylogeny of rice genomes with emphasis on origins of allotetraploid species. Proc. Natl. Acad. Sci. U.S.A. 96, 14400–14405. doi: 10.1073/pnas.96.25.14400
- Guindon S., Gascuel O. (2003). A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst. Biol. 52, 696–704. doi: 10.1080/10635150390235520
- Hafner M. S., Nadler S. A. (1990). Cospeciation in host parasite assemblages: comparative analysis of rates of evolution and timing of cospeciation events. Syst. Zool. 39, 192–204. doi: 10.2307/2992181
- Holmes S. (2005). “Statistical approach to tests involving phylogenies,” (Chapter 4) in Mathematics of Phylogeny and Evolution, ed. Gascuel O. (New York: Oxford University Press), 91–117
- Huelsenbeck J., Ronquist F. (2001). Mrbayes: Bayesian inference in phylogenetic trees. Bioinformatics 17, 754–755. doi: 10.1093/bioinformatics/17.8.754
- Huggins P., Li W., Haws D., Friedrich T., Liu J., Yoshida R. (2010). Bayes estimators for phylogenetic reconstruction. Syst. Biol. (in press).
- Kishino H., Hasegawa M. (1989). Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data. J. Mol. Evol. 29, 170–179. doi: 10.1007/BF02100115
- Lee M. S. Y., Hugall A. F. (2003). Partitioned likelihood support and the evaluation of data set conflict. Syst. Biol. 52, 15–22. doi: 10.1080/10635150390132650
- Liu L., Pearl D. (2007). Species trees from gene trees: reconstructing Bayesian posterior distributions of a species phylogeny using estimated gene tree distributions. Syst. Biol. 56, 504–514. doi: 10.1080/10635150701429982
- Liu L., Pearl D., Brumfield R., Edwards S. (2008). Estimating species trees using multiple-allele DNA sequence data. Evolution 62, 2080–2091. doi: 10.1111/j.1558-5646.2008.00414.x
- Maddison W. (1997). Gene trees in species trees. Syst. Biol. 46, 523–536
- Maddison W., Knowles L. (2006). Inferring phylogeny despite incomplete lineage sorting. Syst. Biol. 55, 21–30. doi: 10.1080/10635150500354928
- R Development Core Team. (2004). R: A Language and Environment for Statistical Computing. Vienna: R Foundation for Statistical Computing. http://www.R-project.org
- Robinson D. F., Foulds L. R. (1981). Comparison of phylogenetic trees. Math. Biosci. 53, 131–147. doi: 10.1016/0025-5564(81)90043-2
- Saitou N., Nei M. (1987). The neighbor joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4, 406–425
- Schardl C. L., Craven K. D., Speakman S., Lindstrom A., Stromberg A., Yoshida R. (2008). A novel test for host-symbiont codivergence indicates ancient origin of fungal endophytes in grasses. Syst. Biol. 57, 483–498. doi: 10.1080/10635150802172184
- Shimodaira H., Hasegawa M. (1999). Multiple comparisons of log-likelihoods with applications to phylogenetic inference. Mol. Biol. Evol. 16, 1114–1116
- Spiering M., Wilkinson H., Blankenship J., Schardl C. (2002). Expressed sequence tags and genes associated with loline alkaloid expression by the fungal endophyte *Neotyphodium uncinatum*. Fungal Genet. Biol. 36, 242–254. doi: 10.1016/S1087-1845(02)00023-3
- Stamatakis A. (2006). RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22, 2688–2690. doi: 10.1093/bioinformatics/btl446
- Steel M., Penny D. (1993). Distributions of tree comparison metrics-some new results. Syst. Biol. 42, 126–141
- Swofford D. L. (1998). PAUP*. Phylogenetic Analysis Using Parsimony (* and Other Methods). Sunderland, MA: Sinauer Associates
- Vilaa M., Vidal-Romani J. R., Björklund M. (2005). The importance of time scale and multiple refugia: incipient speciation and admixture of lineages in the butterfly *Erebia triaria* (Nymphalidae). Mol. Phylogenet. Evol. 36, 249–260. doi: 10.1016/j.ympev.2005.02.019
- Voigt K., Cicelnik E., O'Donnell K. (1999). Phylogeny and PCR identification of clinically important zygomycetes based on nuclear ribosomal-DNA sequence data. J. Clin. Microbiol. 37, 3957–3964
- Walton J. (2000). Horizontal gene transfer and the evolution of secondary metabolite gene clusters in fungi: an hypothesis. Fungal Genet. Biol. 30, 167–171. doi: 10.1006/fgbi.2000.1224
- Yang Z., Rannala B. (1997). Bayesian phylogenetic inference using DNA sequences: a Markov chain Monte Carlo method. Mol. Biol. Evol. 14, 717–724
