Despite huge advances both in evolutionary theory and in sequencing technology, estimating the “Tree of Life” even for a small subset of species can be challenging. Better mathematical models and more data improve our ability to infer a single gene phylogeny, but a gene history may be different from the species phylogeny. The potential for a discrepancy between the gene tree and the species tree has been known for decades and is especially problematic for closely related species or species with large population sizes. Building a species tree requires combining information from multiple genes; all gene phylogenies need to be “embedded” inside the species history while not violating the species tree constraints: The time of a common ancestor of a gene cannot be more recent than the time of divergence of the respective species.
This simple yet useful view assumes no significant gene flow between species, such as horizontal gene transfer, reassortment, or introgression.
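The embedding constraint can be made concrete with a small check. The following is a minimal sketch, not the method of any particular package: it assumes a toy tree representation (a leaf is a string, an internal node is a tuple of height and two subtrees), and the names `leaves`, `mrca_height`, `is_compatible`, and `species_of` are hypothetical, introduced only for illustration.

```python
from itertools import combinations

# Assumed toy representation (not from the paper): a tree is either a leaf
# name (str) or a tuple (height, left_subtree, right_subtree), with heights
# measured backwards in time from the present.

def leaves(tree):
    """Return the leaf names below a node."""
    if isinstance(tree, str):
        return [tree]
    _, left, right = tree
    return leaves(left) + leaves(right)

def mrca_height(tree, a, b):
    """Height of the most recent common ancestor of two distinct leaves."""
    if isinstance(tree, str):
        raise ValueError("a leaf has no internal ancestor below it")
    height, left, right = tree
    for child in (left, right):
        names = leaves(child)
        if a in names and b in names:
            return mrca_height(child, a, b)
    return height

def is_compatible(gene_tree, species_tree, species_of):
    """True if no pair of genes from different species coalesces more
    recently than the divergence of those two species."""
    for a, b in combinations(leaves(gene_tree), 2):
        sa, sb = species_of[a], species_of[b]
        if sa == sb:
            continue  # within-species coalescence is unconstrained here
        if mrca_height(gene_tree, a, b) < mrca_height(species_tree, sa, sb):
            return False
    return True

# Hypothetical example: A and B diverge at time 1.0, and (A,B) from C at 2.0.
species_tree = (2.0, (1.0, "A", "B"), "C")
species_of = {"a1": "A", "b1": "B", "c1": "C"}

concordant = (2.5, (1.2, "a1", "b1"), "c1")   # fits inside the species tree
too_recent = (2.5, (0.8, "a1", "b1"), "c1")   # a1/b1 coalesce before the split
discordant = (3.0, (2.2, "b1", "c1"), "a1")   # different topology, still fits
```

In this sketch the third gene tree has a different topology from the species tree yet satisfies the constraint, which illustrates how a gene history can legitimately differ from the species phylogeny.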
Early theoretical work included the analytical derivation of the probabilities for different gene tree topologies relating four individuals from two different species and showed that when the two populations diverged only recently an incorrect tree is not the exception but a common occurrence (Tajima 1983). Analytical results were also known for three individuals from three species (Nei 1987). By the late 1980s, the discrepancy between species trees and gene trees was considered common knowledge, and Pamilo and Nei (1988) suggested that combining information from several independent loci was better than adding more samples. Pamilo and Nei also mentioned that a short branch in a species tree makes it likely that a gene tree has a different topology irrespective of the rest of the tree.
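For the three-species case with one sample per species, the standard coalescent result is that the gene tree topology differs from the species tree with probability (2/3)·exp(−T), where T is the length of the internal species tree branch in coalescent units. The sketch below evaluates this expression; it assumes a diploid population, so that T = t / (2·Ne) for a branch of t generations, and the function name and parameters are illustrative rather than taken from the cited works.

```python
import math

def prob_discordant(t_generations, n_e):
    """Probability that a three-taxon gene tree (one sample per species)
    has a different topology from the species tree.

    t_generations: length of the internal species tree branch, in generations
    n_e: effective size of the ancestral population on that branch
         (assumed diploid, so one coalescent unit is 2 * n_e generations)
    """
    branch_in_coalescent_units = t_generations / (2.0 * n_e)
    return (2.0 / 3.0) * math.exp(-branch_in_coalescent_units)
```

Under these assumptions the probability depends only on the internal branch length relative to population size, which is consistent with the observation that a short branch makes discordance likely regardless of the rest of the tree.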
There are many potential sources of discrepancy between gene trees and species trees, including horizontal transfer, lineage sorting, and gene duplication/extinction. Early approaches to species tree estimation in the face of multiple gene trees included a parsimony-based method for constructing a species tree topology from gene trees (Maddison 1997). Many of the sources of inconsistency between gene trees and species trees have since been subject to further research, with the focus being on the development of statistical inference procedures. In this paper, we use the term “multispecies coalescent models” for models that emphasize incomplete lineage sorting as the main source of inconsistency between gene trees and species trees.
Recent research into multigene phylogenetics demonstrates that the common approach of concatenating sequences from multiple genes generates the wrong kind of average (Degnan and Rosenberg 2006) and can lead to poor estimation of the species tree (Kubatko 2007). Although this common practice of concatenation can result in a well supported but incorrect tree, it is still a widely used method (Rokas et al. 2003; Wu and Eisen 2008), largely because of a lack of alternatives.
It has also been shown that the straightforward procedure of using the estimated gene tree topology that occurs most often among a set of loci can be asymptotically guaranteed to produce the wrong estimate of the species tree in the so-called anomaly zone (Degnan and Rosenberg 2006). Two recent studies examined the performance of various methods in this problematic region of species trees (Huang and Knowles 2009; Liu and Edwards 2009).
A number of researchers have taken advantage of the multispecies coalescent model to develop methods that reconcile a set of gene trees with a shared species tree (Wilson and Balding 1998; Rannala and Yang 2003; Wilson et al. 2003; Liu and Pearl 2007; Liu et al. 2008). The multispecies coalescent assumes that each gene tree represents the relationships between orthologous genes from a small sample of individuals from multiple species and that there is no horizontal gene transfer or admixture between individuals from different species. A number of Bayesian approaches to inference have been developed in this context. The software package BATWING (Wilson and Balding 1998; Wilson et al. 2003) was developed to estimate a species tree from a single gene tree, including the times of speciation, population sizes, and growth rates. The MCMCcoal (Rannala and Yang 2003) software package estimates ancestral population sizes and divergence times on a known species tree based on a strict molecular clock and multiple gene trees. Finally, BEST provides estimates of the species tree topology, divergence times, and ancestral population sizes from a set of gene trees via an importance sampling method (Liu and Pearl 2007; Liu et al. 2008).
The Species Tree
Coalescent theory explicitly links the effective population size with the ancestral history of a small sample of genes from a population. In the context of Bayesian phylogenetic analysis, the coalescent acts as a prior distribution for gene trees. In its basic form, it is restricted to analyzing genes of individuals from the same species but it can be extended in a natural way to serve as a prior when building a multiple species phylogeny.
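To illustrate how the coalescent links population size to a gene genealogy, the sketch below computes the log density of a set of coalescent times under the basic single-population coalescent. It assumes a constant population size and time units in which each pair of lineages coalesces at rate 1 / pop_size; the function name and its arguments are hypothetical and are not taken from any software package discussed here.

```python
import math

def log_coalescent_density(coalescent_times, n_samples, pop_size):
    """Log density of a genealogy's coalescent times under a constant-size
    single-population coalescent.

    coalescent_times: times (backwards from the present) of the
                      n_samples - 1 coalescent events
    n_samples: number of sampled gene copies at the present
    pop_size: population size, scaled so that a pair of lineages
              coalesces at rate 1 / pop_size (an assumption of this sketch)
    """
    if len(coalescent_times) != n_samples - 1:
        raise ValueError("expected n_samples - 1 coalescent events")
    log_p = 0.0
    previous_time = 0.0
    lineages = n_samples
    for t in sorted(coalescent_times):
        pairs = lineages * (lineages - 1) / 2.0
        waiting_time = t - previous_time
        # One specific pair coalesces (rate 1 / pop_size) after no pair
        # among the current lineages has coalesced during waiting_time.
        log_p += -math.log(pop_size) - pairs * waiting_time / pop_size
        previous_time = t
        lineages -= 1
    return log_p
```

Used as a prior in a Bayesian analysis, a density of this kind favors genealogies whose coalescent times are plausible for a given population size. In this sketch, larger values of `pop_size` make long waiting times between coalescent events less heavily penalized.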
Please bear in mind that “species” above and in the rest of the text is not necessarily the same as a taxonomic rank, but designates any group of individuals that, after some “divergence” time, have no history of breeding with individuals outside that group. A species tree defines barriers to gene flow, and so the term is a catch-all for taxonomic rank, subspecies, or any diverging “population structure.”
A species tree specifies ancestral relationships (tree topology), the times ancestral species separated into two species (divergence times), and the population size history for each species. Each species (extant or ancestral) is represented by one branch of the species tree.
Gene trees are “embedded” inside a species tree by following the stochastic coalescent process back in time from the present within each branch, a process known as a multispecies coalescent or the “censored coalescent” (Rannala and Yang 2003). A species tree can be visualized by setting the y axis proportional to time and the intervals on the x axis proportional to population size, as in Wilson et al. (2003).
Species tree visualization. One locus for three individuals from each of the three species giving a total of nine samples. Current population size (t = 0) of A is 2 and at time 1.5 (where it split from B) the population size is 1.
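The population sizes quoted for species A can be read as the two endpoints of a population size history along one branch. The sketch below interpolates between them. The linear form of the change is an assumption made here for illustration, since the caption gives only the endpoint values, and the function and parameter names are hypothetical.

```python
def population_size(t, size_at_start, size_at_end, branch_length):
    """Population size at time t along a single species tree branch.

    Assumes (for illustration only) that the size changes linearly from
    size_at_start at the recent end of the branch (t = 0) to size_at_end
    at the ancestral end (t = branch_length).
    """
    if not 0.0 <= t <= branch_length:
        raise ValueError("t must lie within the branch")
    fraction = t / branch_length
    return size_at_start + fraction * (size_at_end - size_at_start)

# Species A from the caption: size 2 at t = 0 and size 1 at t = 1.5.
size_midway = population_size(0.75, 2.0, 1.0, 1.5)
```

Under the linear assumption, the size of A halfway along its branch is 1.5, midway between the two values given in the caption.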
Multiple samples per species are necessary for a complete estimation. Even two samples per species are sufficient, given enough loci. A single sample means no coalescent events for that extant species and so no information to estimate population size. This may in turn have a detrimental effect on inferring speciation times and perhaps even species topology.
In this paper, we describe a full Bayesian framework for species tree estimation. We have attempted to combine the best aspects of previous methods to provide joint inference of a species tree topology, divergence times, population sizes, and gene trees from multiple genes sampled from multiple individuals across a set of closely related species. We have achieved this by extending BEAST, a large existing open source software package for Bayesian phylogenetic inference (Drummond and Rambaut 2007). The new method is named *BEAST (pronounced “star beast”).