Search tips
Search criteria 


Logo of molbiolevolLink to Publisher's site
Mol Biol Evol. 2011 January; 28(1): 523–532.
Published online 2010 August 27. doi:  10.1093/molbev/msq224
PMCID: PMC3002242

Choosing among Partition Models in Bayesian Phylogenetics


Bayesian phylogenetic analyses often depend on Bayes factors (BFs) to determine the optimal way to partition the data. The marginal likelihoods used to compute BFs, in turn, are most commonly estimated using the harmonic mean (HM) method, which has been shown to be inaccurate. We describe a new more accurate method for estimating the marginal likelihood of a model and compare it with the HM method on both simulated and empirical data. The new method generalizes our previously described stepping-stone (SS) approach by making use of a reference distribution parameterized using samples from the posterior distribution. This avoids one challenging aspect of the original SS method, namely the need to sample from distributions that are close (in the Kullback–Leibler sense) to the prior. We specifically address the choice of partition models and find that using the HM method can lead to a strong preference for an overpartitioned model. In contrast to the HM method and the original SS method, we show using simulated data that the generalized SS method is strikingly more precise (repeatable BF values of the same data and partition model) and yields BF values that are much more reasonable than those produced by the HM method. Comparisons of HM and generalized SS methods on an empirical data set demonstrate that the generalized SS method tends to choose simpler partition schemes that are more in line with expectation based on inferred patterns of molecular evolution. The generalized SS method shares with thermodynamic integration the need to sample from a series of distributions in addition to the posterior. Such dedicated path-based Markov chain Monte Carlo analyses appear to be a cost of estimating marginal likelihoods accurately.

Keywords: phylogenetics, Bayes factor, marginal likelihood, harmonic mean method, stepping-stone method, partitioning


Partitioned analyses are now routine for multi-gene data sets in Bayesian phylogenetics (Nylander et al. 2004, Brandley et al. 2005, Brown and Lemmon 2007, Clarke and Middleton 2008, Brown et al. 2009, Liu et al. 2010). It is widely known that different genes or different codon positions experience different selection pressures. By partitioning, a better fit of model to data can be achieved, and the models used better reflect the molecular evolutionary forces at work. However, more partitions or more complex models mean more parameters are estimated, increasing the variability of estimates given a fixed and finite amount of data. The question is how to choose an economical partition strategy for the data that allows the model to fit the data well but discourages unnecessary partitions that contribute little to goodness-of-fit.

The Bayes factor (BF) has been shown to be a useful criterion for model selection in Bayesian inference:

An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsq224fx1_ht.jpg

The BF is the ratio of the marginal likelihood under one model, f(y|M0), to the marginal likelihood under an alternative model, f(y|M1), for fixed data y. If the ratio is larger than 1.0, model M0 is favored; if less than 1.0, model M1 is favored. The language of odds ratios is used in discussions of BFs: For example, BF01 represents the BF for model M0 and against model M1. The marginal likelihood of model M is a weighted average (expected value) of the likelihood, f(y|θ,M), where the weights are provided by the prior, π(θ|M), and θ[set membership]Θ may be multidimensional and model specific:

An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsq224fx2_ht.jpg

The marginal likelihood is thus a measure of the average fit of model M to data y, which contrasts with the maximized likelihood used by likelihood ratio tests (Wilks 1938), the Akaike information criterion (Akaike 1974), and the Bayesian information criterion (Schwarz 1978), all of which make use of the fit of the model at its best-fitting point in parameter space Θ.

The estimation of marginal likelihoods is a challenging task because no closed-form expression exists for most phylogenetic applications. The solution has been to resort to numerical approximation using Markov chain Monte Carlo (MCMC), and many methods have been proposed, including the harmonic mean (HM) method (Newton and Raftery 1994), bridge sampling (Meng and Wong 1996), path sampling (Gelman and Meng 1998), thermodynamic integration (TI; Lartillot and Philippe 2006), reversible jump MCMC (Huelsenbeck et al. 2004), and the stepping-stone (SS) method (Xie et al. 2010). Among these, the HM of the likelihoods computed from samples taken from the Bayesian posterior probability distribution is the most broadly used in Bayesian phylogenetics. The popularity of the HM estimator is due to its easy calculation and the fact that currently no other choice is provided by most Bayesian phylogenetics software. In contrast to its popularity in Bayesian phylogenetics, HM has been controversial in the statistics community from the moment it was proposed due to the fact that it is a biased estimator of the marginal likelihood (the estimate is expected to be higher than the true value; Xie et al. 2010) and has a large and unpredictable variance (which can be infinite) (Neal 1994).

Recently, improved means of estimating marginal likelihoods have been introduced into Bayesian phylogenetics. Lartillot and Philippe (2006) described TI and Xie et al. (2010) introduced the SS method, both of which exceed HM greatly in both accuracy and precision. The Savage–Dickey ratio (Verdinelli and Wasserman 1995, Suchard et al. 2001) and reversible jump MCMC (Huelsenbeck et al. 2004) can also be used to accurately estimate the BF directly when models are nested.

The primary question addressed in this paper is: “If the marginal likelihoods of models were more accurately estimated, would less-partitioned models be used in Bayesian phylogenetic analyses?” This is an important question because large Bayesian analyses generally require longer run times, which leads to small effective sample sizes if run times are not adjusted to be proportional to the number of estimated parameters. Reducing the number of data subsets should therefore yield higher effective sample sizes for a given amount of computational effort. Unnecessarily complex models also have more diffuse posterior distributions, so using less-partitioned models is expected to increase confidence in the inferences made. Finally, partitioned analyses can lead to bizarre parameter estimates. For example, it is possible for second codon positions to appear to evolve faster than first or even third codon positions (Marshall 2010), and the estimated proportion of invariable sites can be as high as 0.96 even when all sites are variable (Appendix 1). Such abnormalities do not appear to occur with unpartitioned models. Although eliminating all partitions is an extreme solution that may reduce performance due to poor goodness-of-fit (Brown and Lemmon 2007), using a more accurate marginal likelihood estimator may favor a less-partitioned model that alleviates some of these pathologies without reducing goodness-of-fit appreciably.

Another question addressed is: “Is it possible to further improve the efficiency of SS (Xie et al. 2010) so that results of comparable accuracy can be obtained with less computational effort?” Inspired by the geometric path approach taken in Lefebvre et al. (2010), we show that use of a straightforward reference distribution substantially increases the computational efficiency of SS.

Our interest in the relationship of HM to partitioning was initially aroused by figure 6 of Brown and Lemmon (2007), which plotted twice the natural logarithm of the BF (estimated using HM) for a partitioned model against an unpartitioned model when the unpartitioned model was the true model. In such cases, partitioning is unnecessary and while 2log(BF) is not guaranteed to be less than zero in this case, it is reasonable to expect few, if any, positive values. In contrast to expectation, approximately 31% of Brown and Lemmon's 2log(BF) values were above 0, and more than 5% were above 10. This means that in nearly one third of the simulated data sets analyzed, a clearly overpartitioned model would have been chosen using HM-based BF comparisons. We decided to conduct a study similar to the one represented in figure 6 of Brown and Lemmon (2007) to evaluate the effect of marginal likelihood estimation accuracy on model choice. Our expectation was that few, if any, data sets simulated under an unpartitioned model would be chosen by a BF for a partitioned model against the unpartitioned (true) model when marginal likelihoods of the two models were estimated accurately.

Materials and Methods

Simulated Data

Simulations were similar to those described in Xie et al. (2010). The number of taxa in each simulated data set was decided by drawing from the set of integers from 4 to 20 uniformly (and inclusively), and the number of nucleotide sites was an even number uniformly chosen from 100 to 5,000. For each simulated data set, a tree topology was chosen at random from all possible labeled unrooted binary tree topologies (i.e., the proportional-to-distinguishable model), and internal branch lengths, external branch lengths, base frequencies, and general time reversible (GTR) exchangeabilities were drawn from Gamma(10,0.001), Gamma(1,0.1), Dirichlet(100,100,100,100), and Dirichlet (100,100,100,100,100,100) distributions, respectively. The discrete gamma distribution (ten categories) was used to impart among-site rate heterogeneity, with the gamma shape parameter drawn from a Gamma(2,3) distribution. The 200 data sets used for this example were thus independently and identically distributed. Although, technically, this generating model produced all data sets from a GTR + G distribution, the distributions of parameters were chosen so that many simulation replicates come close to various submodels of the GTR + G model. For example, the Gamma(2,3) distribution used to choose the shape parameter for among-site rate heterogeneity produces values greater than 5 (i.e., effective rate homogeneity) about 50% of the time. Thus, about 50% of data sets could be fit nearly as well by the GTR model as by the (true) GTR + G model. Previously (Xie et al. 2010), we used this simulation scheme to address choice of substitution models in the context of unpartitioned analyses; here, our purpose is to instead explore choice among different possible ways to partition data.

Empirical Data

In addition to simulations, we also evaluated empirical data that have been a focal point for discussions of artifacts associated with partitioning in Bayesian analyses (Marshall et al. 2006). This data set is available at TreeBase (, study accession number S1679) and comprises 32 taxa and 2,152 nucleotide sites. For our analyses, we omitted the tRNA gene, leaving data for four protein-coding genes (COI, COII, ATPase8, and ATPase6), and reducing the total number of sites to 2,090.

Generalized SS Method

We describe here a modification of the SS method described by Xie et al. (2010). The modified version is considerably more efficient and does not require sampling from distributions close to the prior (which can be problematic for vague priors). This generalized SS introduces a reference distribution, which in practice is a product of independent probability densities parameterized using samples from the posterior distribution. Although the original SS method does not require samples from the posterior distribution, in practice, the posterior is explored as a means of burning-in the chain and this modified version uses this burn-in period to parameterize its reference distribution. However, if samples from a previous extensive MCMC analysis of the posterior are available, it is advisable to use this previous sample to parameterize the reference distribution. A reference distribution that approximates the posterior closely requires less computational effort to accurately estimate the marginal likelihood. Remarkably, as we will later show, if the reference distribution exactly equals the posterior distribution, the marginal likelihood can be estimated exactly (i.e., in this case, MCMC sampling error does not affect the estimate).

Consider the unnormalized density function qβ, which has normalizing constant cβ yielding the normalized density pβ:

An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsq224fx3_ht.jpg

where y represents the data (e.g., nucleotide sequences), θ is the vector of model parameters, M is the model under consideration, f(y|θ,M) is the likelihood function, π(θ|M) the actual model prior, and π0(θ|M) is the reference distribution. The density pβ is a form of power posterior that is equivalent to the posterior distribution when β = 1 but equivalent to the reference distribution when β = 0. This differs from the original SS method (Xie et al. 2010), where the actual prior distribution is sampled when β = 0. The power posterior can be difficult to sample if β is near 0.0 and the prior is diffuse (normally the case), so using a reference distribution facilitates sampling from pβ regardless of the value of β. The goal is to estimate the ratio c1.0/c0.0, which is equivalent to the marginal likelihood because c0.0 = 1.0 if the reference distribution is proper (which is assumed throughout). Similar to the original SS method, this ratio can be expressed as a product of K ratios:

An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsq224fx4_ht.jpg

where An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsq224fx5_ht.jpg. Each ratio cβk/cβk − 1 is estimated by importance sampling, using pβk − 1 as the importance sampling density. Because pβk − 1 is only slightly different from pβk, it serves as an excellent importance distribution. One of the K ratios, rk, can thus be expressed as follows:

An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsq224fx6_ht.jpg

An estimator An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsq224fx7_ht.jpg is constructed using samples θk − 1,i(i = 1,2,…,n) from pβk − 1:

An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsq224fx8_ht.jpg

Numerical stability can be improved by factoring out the largest sampled term, An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsq224fx9_ht.jpg:

An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsq224fx10_ht.jpg

On the log scale,

An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsq224fx11_ht.jpg

Finally, summing An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsq224fx7_ht.jpg over all K ratios yields the overall estimator:

An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsq224fx12_ht.jpg

This approach reduces to the original SS method if the reference distribution is equal to the actual prior. However, the reference distribution would normally be chosen to be closer (in the Kullback–Leibler sense) to the posterior than the actual prior, resulting in importance distributions that better approximate the distribution in the numerator of each ratio. In practice, samples from the posterior distribution (βk = 1) are used to parameterize the joint reference distribution π0(θ|M). For each component θ of the model (where a component could be an individual parameter or a block of correlated parameters, such as base frequencies), the marginal posterior sample mean (An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsq224fx13_ht.jpg) and variance (An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsq224fx14_ht.jpg) are used to parameterize an independent reference distribution for θ. For example, if θ represents the gamma shape parameter used for modeling among-site rate heterogeneity, a Gamma(An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsq224fx15_ht.jpg) distribution would be used as the reference distribution for θ because the mean of a Gamma(a,b) distribution is ab and its variance is ab2. Relative base frequencies are assigned a Dirichlet(a,c,g,t) distribution. The means (An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsq224fx16_ht.jpg) and variances (An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsq224fx17_ht.jpg) of the sampled base frequencies may be used to parameterize the Dirichlet reference distribution as follows (using least squares to estimate m, the sum of all parameters),

An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsq224fx18_ht.jpg

Similarly, a Dirichlet(a,b,c,d,e,f) reference distribution can be constructed for the GTR exchangeability parameters using sample means (An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsq224fx19_ht.jpg) and variances (An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsq224fx20_ht.jpg). The joint reference distribution is simply the product of these independent reference distributions.

Different subsets of a partition scheme are often given their own relative substitution rate. These subset relative rates are known as rate multipliers in MrBayes (Ronquist and Huelsenbeck 2003), where they are introduced using the command prset ratepr = variable. Subset relative rates, by definition, have mean 1.0, which precludes the use of a Dirichlet prior or reference distribution. We use instead a transformed Dirichlet distribution (which we term here a subset relative rate distribution) to accommodate subset relative rates. Consider the case of a partition that defines n subsets, with proportion pi of the total sites assigned to subset i. Let Y~Dirichlet(c1,c2,…,cn) and Y = {yi:yi = xipi}. The variable X = {xi} has a subset relative rate distribution with density function

An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsq224fx21_ht.jpg

To parameterize a subset relative rates reference distribution, we transform sampled relative rate vectors using the subset proportions to form samples that are Dirichlet distributed. The method described above for parameterizing a Dirichlet reference distribution is then used to obtain c1,c2,…,cn for the subset relative rates reference distribution.

Simulated Data Analysis

Each of the 200 simulated data sets was subjected to six separate MCMC analyses for the purpose of estimating marginal likelihoods: 1) an analysis of unpartitioned data in which HM was used to estimate the marginal likelihood, 2) an analysis in which data were partitioned into two equal-sized subsets and HM was used to estimate the marginal likelihood, 3) an analysis in which original SS was used to estimate the marginal likelihood for the unpartitioned data, 4) an analysis in which original SS was used to estimate the marginal likelihood for the bipartitioned data, 5) an analysis in which generalized SS was used to estimate the marginal likelihood for the unpartitioned data, and 6) an analysis in which generalized SS was used to estimate the marginal likelihood for the bipartitioned data. Separate analyses were required for HM, original SS, and generalized SS because both SS methods require special MCMC analyses to be performed in which the target distribution varies from the posterior to either the actual prior or the reference distribution over the course of the run. HM analyses were allotted approximately the same amount of computational effort as SS analyses. For bipartitioned analyses, the first subset was always the first n/2 sites and the second subset always the last n/2 sites in a data set of size n.

The GTR + G model was used for all analyses, and the tree topology was fixed to the true tree topology used to generate the data. In the case of partitioned models, all parameters were unlinked except the branch lengths. Prior distributions (π(θ|M)) were as follows: exponential(10) for all branch lengths, Dirichlet(1,1,1,1) for base frequencies, Dirichlet(1,1,1,1,1,1) for the GTR exchangeabilities, and exponential(0.01) for the discrete gamma shape parameter. The subset relative rates were fixed to 1.0 for all subsets in these analyses.

For HM analyses, 22,000 MCMC cycles were employed, and all parameters were updated once per cycle. A slice sampler (Neal 2003) was used to update branch lengths and the discrete gamma shape parameter. A Metropolis–Hastings Dirichlet proposal (Metropolis et al. 1953, Hastings 1970) was used to update base frequencies and GTR exchangeabilities. The Markov chain was sampled every ten cycles, providing 2,200 samples in which the first 200 samples were discarded as burn-in.

For the generalized SS analyses, 2,000 MCMC cycles were devoted to each of the 11 β-power posteriors (K = 10), again sampling every ten cycles. The first step served the dual purpose of serving as a burn-in period and providing samples from the posterior distribution for parameterizing the reference distribution. The 11 β values were equally spaced along the path from 1.0 to 0.0 (using a reference distribution that approximates the posterior obviates the need to place more sampling effort near β = 0).

For the original SS analyses, all the settings were the same as the generalized SS analyses except that: 1) the first step served as a burn-in period but no reference distribution was parameterized and 2) the 11 β values were chosen according to evenly spaced quantiles of the Beta(0.3,1) distribution, placing most values of β near 0 as recommended by Xie et al. (2010).

All analyses were repeated with a different pseudorandom number generator seed so HM, original SS, and generalized SS could be compared on the basis of repeatability.

Empirical Data Analysis

Four partition schemes were compared: “None” (unpartitioned data, data from all four genes concatenated), “Gene” (4 data subsets, each corresponding to a gene), “Codon” (3 data subsets, each corresponding to a codon position), and “Both” (12 data subsets, with each of the three codon positions in each of the four genes given its own partition subset). A GTR + G model was applied to each subset regardless of the number of subsets (1, 3, 4, or 12) in the partition model. Branch lengths were linked across subsets and the tree topology was fixed at the maximum likelihood topology found by heuristic (Tree-Bisection-Reconnection, or TBR, branch swapping) search in PAUP* 4b10 (Swofford 2002) assuming the “None” partition model and the GTR + G substitution model. Thus, the simplest model (“None”) has 70 free parameters (2×32 − 3 = 61 branch lengths, 5 GTR exchangeabilities, 3 relative base frequencies, and 1 discrete gamma shape parameter), whereas the most complex model (“Both”) tested has 180 free parameters (2×32 − 3 = 61 branch lengths, 5×12 = 60 GTR exchangeabilities, 3×12 = 36 relative base frequencies, 1×12 = 12 discrete gamma shape parameters, and 12 − 1 = 11 subset relative rates). The marginal likelihood of each partition model was estimated using two methods: HM and generalized SS. The software Phycas (Lewis et al. 2008) was used.

For HM analyses, a single Markov chain was allowed to burn-in for 500 cycles, where one cycle involved updating all parameters at least once (base frequencies, GTR exchangeabilities, and subset relative rates were updated ten times per cycle). In addition, an update affecting all branch lengths (tree rescaling) was attempted once per cycle. These updates were effected either by slice sampling (branch lengths and discrete gamma shape parameters; Neal 2003) or Metropolis–Hastings proposals (base frequencies, GTR exchangeabilities, subset relative rates, and tree rescaling; Metropolis et al. 1953, Hastings 1970). Following the burn-in period, the chain was allowed to run for 25,000 additional cycles and was sampled once per cycle.

Generalized SS analyses were identical to HM analyses with the exception that after 500 burn-in cycles, 1,000 MCMC cycles were devoted to each of the 25 β-power posteriors (K = 24). The first step provided samples from the posterior distribution for parameterizing the reference distribution. The 25 β values were equally spaced along the path from 1.0 to 0.0. The amount of computation was intentionally made identical for HM and the generalized SS analyses so that comparisons of the performance of HM versus the generalized SS would be fair.



The simulation experiment compared BF estimated using the HM, original SS, and generalized SS methods. Two hundred data sets of varying sizes (both number of taxa and number of sites) were simulated using a GTR + G model, but the data sets varied in the parameter values used in the generating model. The 200 points in the plots in figure 1 represent the quantity 2log(BF) calculated for each data set. Each BF value measures the marginal likelihood of the partitioned model (two equal-sized subsets) divided by the marginal likelihood of the unpartitioned model. Because the true model was unpartitioned, the expectation is that 2log(BF) would be negative for all data sets, indicating that the unpartitioned model fits the data better on average than the (arbitrarily and unnecessarily) partitioned model. For HM analyses (fig. 1a), 43 (21.5%) points are greater than zero. This result is qualitatively similar to that reported by Brown and Lemmon (2007) in their figure 6, even though those authors used a more complicated generating model. In contrast, only one 2log(BF) value was above zero when BF values were estimated using the original and generalized SS methods (fig. 1b and c), and the variance of generalized SS estimates is clearly smaller than the variance of estimates made using the original SS method. The single 2log(BF) greater than zero was from one of the smallest data sets simulated (only 130 sites and 4 taxa). Using both SS methods, increasing the number of sites generally resulted in a stronger preference for the unpartitioned model, whereas no such trend was evident for HM analyses.

FIG. 1.
Plots relating the number of sites to twice the natural logarithm of the BF (2log(BF)) in favor of the partitioned model (with two equal-size subsets) over the unpartitioned model for 200 data sets simulated under a diversity of unpartitioned GTR + G ...

We also investigated repeatability of HM, original SS, and generalized SS by analyzing each of the 200 data sets twice using different pseudorandom number seeds. Ideally, the same estimate of 2log(BF) should result from independent analyses. Plotting the 2log(BF) values obtained from seed 1 against the 2log(BF) values from seed 2 should therefore result in all values being very close to the 45о diagonal line indicating perfect identity. It is clear that HM (fig. 2a) is far less repeatable than the generalized SS (fig. 2c). Principal component analyses reveal that 99.9% of the variance is explained by the first principal component for the generalized SS estimates, whereas only 70.2% of the variance is explained by the first principal component for HM estimates. The original SS (fig. 2b) is intermediate in repeatability.

FIG. 2.
Scatterplots showing twice the natural logarithm of the BF (2log(BF)) estimated using two independent analyses started with different pseudorandom number seeds. (a) Left: 2log(BF) estimated using the HM method. (b) Middle: 2log(BF) estimated using the ...

Empirical Example

Although we have shown that HM behaves poorly when used for choosing a partition model on the basis of simulated data, it can be argued that our simulations present a scenario (each site evolving under exactly the same model) that is perhaps never found in the real world. Also, biologists use their experience in choosing a partitioning scheme so that partition placement is not arbitrary, as it was in our simulated data example. The value of the simulation experiment lies in the fact that we know the true model and can judge whether methods are behaving as they should under ideal circumstances. The poor performance of the HM method under such straightforward conditions does not bode well for its use in much more complex real data situations.

A natural question at this point might be: “Does using generalized SS instead of HM make any difference in actual practice?” To answer this question, we reanalyzed a four-gene New Zealand Kikihia (cicada) data set used by Marshall et al. (2006) to illustrate problems with branch lengths that can arise in partitioned Bayesian analyses. Table 1 and figure 3 show the results of applying both HM and generalized SS to these data under four possible partition models. Both HM and generalized SS agree that partitioning by codon is a good idea but disagree on whether partitioning by gene is beneficial. Given the choice between partitioning by gene (four subsets) and not partitioning (one subset), HM prefers the partitioned model, whereas generalized SS chooses the unpartitioned model. Likewise, given the choice between partitioning by codon (3 subsets) and partitioning by both gene and codon (12 subsets), HM chooses to partition by both gene and codon, whereas generalized SS chooses the simpler model that partitions by codon only.

Table 1.
Mean Log Marginal Likelihoods and Standard Deviations Based on 20 Independent Replicates from Analysis of the Four-gene New Zealand Cicada Data Set
FIG. 3.
Results of applying the HM and generalized SS methods to the empirical New Zealand cicada data set for four different partitioning schemes: unpartitioned (None), partitioned by gene (Gene, 4 subsets), partitioned by codon (Codon, 3 subsets), and partitioned ...

Figure 3 also clearly shows the bias in HM: For each partition model, the HM estimate of the marginal likelihood is considerably greater than the generalized SS estimate. The greater variability of HM estimates is also evident.


Generalized SS Method

The SS method described here is an important generalization of the original method described by Xie et al. (2010). When the reference distribution equals the actual prior, the generalized method equals the original method; however, choosing a reference distribution that approximates the posterior rather than the prior results in a much more stable and efficient estimator. It is more stable because the series of power posterior distributions being explored are all much more similar to one another, and sampling does not become problematic when the power β is close to zero (if anything, sampling becomes more straightforward at this end of the path). Because it is a generalization of the previous method, we prefer to retain the name SS for the new method. To avoid potential confusion, when the original SS method is applied researchers should explicitly state that the reference distribution equals the actual prior. Because of its improved performance, the generalized version described here (where the reference distribution approximates the posterior distribution) should be the default form of the SS method. To see how the generalized SS method can be more efficient than the original SS method, consider the case in which the reference distribution exactly equals the posterior distribution. In this special case, the overall ratio r can be determined exactly with only a single sampled point! Substituting π0(θ) = f(y|θ)π(θ)/f(y) into equation (2) and assuming only n = 1 point was sampled for each value of k (and assuming K = 1, if desired),

An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsq224fx22_ht.jpg

(Dependence on the model M suppressed for notational clarity.) Although this result has no application in practice (because the exact value of f(y) must be known in order to compute the reference distribution density), it illustrates the importance of choosing a reference distribution that is a good approximation of the posterior distribution. If considerable effort has already gone into approximating the posterior, it behooves the investigator to use that information in constructing the reference distribution.

Despite the efficiency and stability improvements, the original SS still has a place in Bayesian model selection, particularly in models involving latent variables. For example, assume that each site is assigned a rate category and the number and composition of these rate categories is determined by a Dirichlet process (DP) prior (e.g., see Huelsenbeck and Suchard 2007). The DP prior governs not only model parameters (the rates of rate categories) but also latent variables (the assignments of each site to a rate category), complicating the definition of the reference distribution. In such cases, a hybrid SS approach is possible in which all model parameters unrelated to the DP prior are included in the parameterized reference distribution, with elements such as the DP prior being given a reference distribution equivalent to their actual prior.

BFs and Data Set Size

Intuitively, support for the true model over an overparameterized competing model will grow with the size of data sets (more taxa and longer sequences), and in our simulation experiment, this trend is easy to see when generalized SS is used (fig. 1c) but not when HM is used (fig. 1a) to estimate marginal likelihoods. To support our intuition, we devised the following example using normal distributions, which has the advantage that results are exact (see Appendix 2 for a detailed derivation). Suppose x1,…,xn are drawn from a normal distribution with mean 0 and variance σ2. This is analogous to simulating an unpartitioned nucleotide sequence data set from a given tree topology τ with a branch length set υ and a known nucleotide substitution model. These data may be analyzed using two models. Model M0 treats x1,…,xn/2 as if drawn from one normal distribution, N(μ1,σ2), and xn/2 + 1,…,xn as if drawn from a second, potentially distinct normal distribution, N(μ2,σ2). The means of the two normal distributions are allowed to be different but variance σ2 is shared, which is analogous to a Bayesian phylogenetic analysis in which one or more substitution model parameters are estimated for each partition subset but some (e.g., branch lengths and tree topology) are linked across subsets. Model M1 treats x1,…,xn as if drawn from a single normal distribution N(μ3,σ2), which is the analogue of an unpartitioned Bayesian phylogenetic analysis. Both models assume that the variance σ2 is identical for all n observations.

To obtain a closed-form expression, conjugate priors are used for both mean μ and variance σ2, and the mean μ is dependent on the variance σ2. That is, priors are

An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsq224fx23_ht.jpg

After centering the observations An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsq224fx24_ht.jpg, the BF is

An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsq224fx25_ht.jpg

which demonstrates that BF01 is a monotonically decreasing function of data size, hence, the unpartitioned model is always the expected winner, even for data sets containing as few as n = 2 observations. A plot of 2log(BF01) against n (see Supplementary fig. S1, Supplementary Material online) is similar to the trend in generalized SS-based 2log(BF) values for the simulated data (fig. 1c).

Performance of Generalized SS on Simulated and Empirical Data

Our simulation experiment demonstrated that use of the HM method to compute BFs can potentially be very misleading when using BFs to make decisions about which partition model is best. In more than 1/5 of the data sets analyzed, the HM method would lead one to choose a partitioned model when an unpartitioned model was the true model. In contrast, the generalized SS method described here would have recommended partitioning in only 1 of the 200 data sets. Repeated independent analyses showed that generalized SS is much more repeatable than the HM method.

Our analysis of data on New Zealand Kikihia cicadas illustrated that using HM can result in overpartitioning in real data as well. The four genes used in this study are quite similar in their pattern of substitution (see supplementary table S1, Supplementary Material online). Based on the tree length estimates, all the genes evolve at a rate within 35% of the average rate. Likewise, patterns of rate heterogeneity (shape parameter α), base frequencies (πi), and GTR exchangeabilities (rij) (with the exception of the transition rates rAG and rCT) are similar in magnitude across genes. The similarity of parameter estimates across genes makes it surprising that partitioning by gene would be favored. It therefore makes sense that accurately estimated marginal likelihoods do not support partitioning by gene.

One drawback of SS is that it requires a special MCMC analysis that explores a series of power posteriors. This appears to be a requirement for accurate direct estimates of marginal likelihoods. The HM estimate, on the other hand, can be obtained essentially for free because it requires only samples already needed to approximate the posterior distribution. The extra cost of SS does not appear to be prohibitive, however. If a sample from the posterior is available, it can be used to parameterize the reference distribution and provide good starting values for the power posterior MCMC. Very few additional samples are required if a very accurate reference distribution is available. Even if no previous posterior sample is available, the SS method requires less computational effort than a HM estimate would require to deliver comparable accuracy. One slight advantage of SS over HM is that the last step (β = 0.0) requires only draws from ordinary probability distributions and is thus relatively much faster due to the fact that the likelihood need not be computed for proposed values that are ultimately rejected.

In this study, the topology was fixed when estimating the marginal likelihood using the SS method; however, there is often interest in estimating marginal likelihoods that account for uncertainty in the topology. There is nothing to prevent the estimation of marginal likelihoods using the original SS method when the topology is allowed to vary during an MCMC run, but varying the topology complicates generalized SS because of the need to define a reference distribution for topologies that provides a good approximation to the posterior. This important expansion of SS to accommodate topological uncertainty is the subject of ongoing research in our group.

We have demonstrated that the SS method using a reference distribution that approximates the posterior is much more accurate and repeatable for estimating marginal likelihoods than the currently popular HM method. We have shown that using SS can result in the choice of a different (and simpler) partition model than HM method for empirical data. We therefore recommend that SS be used instead of HM when using BFs to decide which partitioning scheme is best. The SS method is implemented in the free, open source software Phycas (Lewis et al. 2008).

Supplementary Material

Supplementary figure S1 and table S1 are available at Molecular Biology and Evolution online (

Supplementary Material

Supplementary Data:


This work was stimulated by very constructive criticism we received from the reviewers of our original paper describing SS method. In particular, Nicolas Lartillot and Marc Suchard pointed out the limitations of sampling distributions near the prior. We hope that the new approach eliminates their concerns and thank them for their very helpful reviews of our previous work. We thank Nicolas Lartillot again and one anonymous reviewer for their relevant suggestions on this paper. We also thank the UConn Biotechnology Center's Bioinformatics Facility for use of their Linux cluster. This work was supported by the National Institutes of Health (GM70335 and CA74015 to M.H.C.) and the National Science Foundation (EF0331495 to P.O.L. and DMS0723557 to M.H.C.).


Case in Which the Proportion of Invariable Sites Exceeds the Proportion of Constant Sites

To demonstrate that invariable sites (I) models can misbehave when data are partitioned, we used Seq-Gen 1.3.2 (Rambaut and Grassly 1997) to simulate a partitioned data set in which 5,000 sites were assigned to subset 1 (the “large slow” subset), and 100 sites were assigned to subset 2 (the “small fast” subset). The sites in the small fast subset evolved 50 times faster than those in the large slow subset. The tree model was the maximum likelihood tree (GTR + G model) for the same 32-taxon cicada data set used as the empirical example in this paper. The generating model was Jukes–Cantor (JC) with no among-site rate heterogeneity other than the difference in rate among subsets. We analyzed this single simulated data set with Phycas using a JC + I model for both subsets, with topology fixed to the true tree and branch lengths linked across subsets but unlinking the proportion of invariable sites parameter, pinvar. Results are shown in the first row of table 2. Despite the fact that all but one of the sites in the small fast subset were variable, pinvar was estimated to be 0.96 for this subset. The model thus considers 96% of the sites in this subset to be incapable of varying when in reality 99% of them are, in fact, indisputably variable. This serves to show that partitioning can force otherwise well-behaved models to explain data in biologically unreasonable ways. In this case, the branch lengths are largely determined by the large slow subset (note the underestimated tree length), which makes it difficult for the model to explain the fast-evolving sites in the small fast subset. To explain these sites, the model finds that it can increase the effective tree length for the second subset by increasing the proportion of invariable sites parameter to absurd levels. The invariable sites model is a mixture model involving two relative rates that, by definition, have expectation 1.0:

An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsq224fx26_ht.jpg
Table 2.
Tree Length, Subset Relative Rates (m1 and m2), and Proportion of Invariable Sites Parameter Values (pinvar,1 and pinvar,2) for Two Subsets (subscripts 1 and 2). In Total, 549 (11%) of the 5,000 Sites in Subset 1, and 99 (99%) of the 100 Sites in Subset ...

The last step results from the fact that r0 = 0 (i.e., invariable sites evolve, by definition, at zero rate). Bumping up pinvar to 0.96 allows the model to effectively increase each branch length by a factor of 27, which is very close to the estimated relative rate (25) for this subset in a model (JC + M) that allows subset relative rates to be free parameters (second row of table 2). Note that using a model (JC + I + M, third row of table 2) having both pinvar parameters for each subset as well as subset relative rates behaves more sensibly than JC + I. This is because the subset relative rates can account for the difference in rate among subsets, allowing pinvar to go back to measuring the proportion of invariable sites. This JC + I + M model is considered best by the HM method, yet still rather seriously overestimates pinvar for the first partition. In contrast, the generalized SS method described here shows a slight preference for the (true) JC + M model over the more complex JC + I + M model.

Appendix 2

Derivation of the BF in the Normal Example

Assume data x1,…,xn~N(0,σ2) and analyze it with two models, partitioned and unpartitioned. For the partitioned model (M0), suppose x1,,xn2N(μ1,σ2) and xn2+1,,xnN(μ2,σ2), and set priors as:

An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsq224fx27_ht.jpg

The joint prior is

An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsq224fx28_ht.jpg

According to the definition of marginal likelihood,

An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsq224fx29_ht.jpg

where An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsq224fx30_ht.jpg.

For the unpartitioned model (M1), suppose x1,…,xn~N(μ3,σ2) and set priors as follows:

An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsq224fx31_ht.jpg

The marginal likelihood is

An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsq224fx32_ht.jpg

where An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsq224fx33_ht.jpg.

The BF in favor of the partitioned model is

An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsq224fx34_ht.jpg

Assuming that the data are centered An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsq224fx35_ht.jpg, the BF simplifies to

An external file that holds a picture, illustration, etc.
Object name is molbiolevolmsq224fx36_ht.jpg


  • Akaike H. A new look at statistical model identification. IEEE Trans Automat Contr. 1974;19:716–723.
  • Brandley M, Schmitz A, Reeder T. Partitioned Bayesian analyses, partition choice, and the phylogenetic relationships of scincid lizards. Syst Biol. 2005;54:373–390. [PubMed]
  • Brown JM, Lemmon AR. The importance of data partitioning and the utility of Bayes factors in Bayesian phylogenetics. Syst Biol. 2007;56:643–655. [PubMed]
  • Brown MW, Spiegel FW, Silberman JD. Phylogeny of the “Forgotten” Cellular Slime Mold, Fonticula alba, Reveals a Key Evolutionary Branch within Opisthokonta. Mol Biol Evol. 2009;26:2699–2709. [PubMed]
  • Clarke JA, Middleton KM. Mosaicism, modules, and the evolution of birds: results from a Bayesian approach to the study of morphological evolution using discrete character data. Syst Biol. 2008;57:185–201. [PubMed]
  • Gelman A, Meng X-L. Simulating normalizing constants: from importance sampling to bridge sampling to path sampling. Stat Sci. 1998;13:163–185.
  • Hastings WK. Monte Carlo sampling methods using Markov chains and their applications. Biometrika. 1970;57:97–109.
  • Huelsenbeck JP, Larget B, Alfaro ME. Bayesian phylogenetic model selection using reversible jump Markov chain Monte Carlo. Mol Biol Evol. 2004;21:1123–1133. [PubMed]
  • Huelsenbeck JP, Suchard MA. A nonparametric method for accommodating and testing across-site rate variation. Syst Biol. 2007;56:975–987. [PubMed]
  • Lartillot N, Philippe H. Computing Bayes factors using thermodynamic integration. Syst Biol. 2006;55:195–207. [PubMed]
  • Lefebvre G, Steele R, Vandal AC. A path sampling identity for computing the Kullback-Leibler and J-divergences. Comput Stat Data Anal. 2010;54:1719–1731.
  • Lewis PO, Holder MT, Swofford DL. Phycas: software for phylogenetic analysis. Storrs (CT): University of Connecticut; 2008. Available from
  • Liu H, Aris-Brosou S, Probert I, de Vargas C. A time line of the environmental genetics of the haptophytes. Mol Biol Evol. 2010;27:161–176. [PubMed]
  • Marshall DC. Cryptic failure of partitioned Bayesian phylogenetic analyses: lost in the land of long trees. Syst Biol. 2010;59:108–117. [PubMed]
  • Marshall DC, Simon C, Buckley TR. Accurate branch length estimation in partitioned Bayesian analyses requires accommodation of among-partition rate variation and attention to branch length priors. Syst Biol. 2006;55:993–1003. [PubMed]
  • Meng X-L, Wong WH. Simulating ratios of normalizing constants via a simple identity: a theoretical exploration. Stat Sin. 1996;6:831–860.
  • Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E. Equation of state calculations by fast computing machines. J Chem Phys. 1953;21:1087–1092.
  • Neal RM. Contribution to the discussion of “Approximate Bayesian inference with the weighted likelihood bootstrap” by Michael A. Newton and Adrian E. Raftery. J R Stat Soc Ser A (Methodological) 1994;56:41–42.
  • Neal RM. Slice sampling. Ann Stat. 2003;31:705–741.
  • Newton MA, Raftery AE. Approximate Bayesian inference with the weighted likelihood bootstrap (with discussion) J R Stat Soc Ser B (Methodological) 1994;56:3–48.
  • Nylander J, Ronquist F, Huelsenbeck J, Nieves-Aldrey J. Bayesian phylogenetic analysis of combined data. Syst Biol. 2004;53:47–67. [PubMed]
  • Rambaut A, Grassly NC. Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput Appl Biosci. 1997;13:235–238. [PubMed]
  • Ronquist F, Huelsenbeck JP. MRBAYES 3: Bayesian phylogenetic inference under mixed models. Bioinformatics. 2003;19:1572–1574. [PubMed]
  • Schwarz GE. Estimating the dimension of a model. Ann Stat. 1978;6:461–464.
  • Suchard MA, Weiss RE, Sinsheimer JS. Bayesian selection of continuous-time Markov chain evolutionary models. Mol Biol Evol. 2001;18:1001–1013. [PubMed]
  • Swofford DL. PAUP*: phylogenetic analysis using parsimony (*and other methods) Sunderland (MA): Sinauer Associates; 2002.
  • Verdinelli I, Wasserman L. Computing Bayes factors using a generalization of the Savage-Dickey density ratio. J Am Stat Assoc. 1995;90:614–618.
  • Wilks SS. The large-sample distribution of the likelihood ratio for testing composite hypotheses. Ann Stat. 1938;9:60–62.
  • Xie W, Lewis PO, Fan Y, Kuo L, Chen M-H. Forthcoming. Improving Marginal likelihood estimation for Bayesian phylogenetic model selection. Syst Biol. 2010 [PMC free article] [PubMed]

Articles from Molecular Biology and Evolution are provided here courtesy of Oxford University Press