PMCCPMCCPMCC

Search tips
Search criteria 

Advanced

 
Logo of bioinfoLink to Publisher's site
 
Bioinformatics. 2009 October 1; 25(19): 2530–2536.
Published online 2009 August 6. doi:  10.1093/bioinformatics/btp473
PMCID: PMC2800350

Improving phylogenetic analyses by incorporating additional information from genetic sequence databases

Abstract

Motivation: Statistical analyses of phylogenetic data culminate in uncertain estimates of underlying model parameters. Lack of additional data hinders the ability to reduce this uncertainty, as the original phylogenetic dataset is often complete, containing the entire gene or genome information available for the given set of taxa. Informative priors in a Bayesian analysis can reduce posterior uncertainty; however, publicly available phylogenetic software specifies vague priors for model parameters by default. We build objective and informative priors using hierarchical random effect models that combine additional datasets whose parameters are not of direct interest but are similar to the analysis of interest.

Results: We propose principled statistical methods that permit more precise parameter estimates in phylogenetic analyses by creating informative priors for parameters of interest. Using additional sequence datasets from our lab or public databases, we construct a fully Bayesian semiparametric hierarchical model to combine datasets. A dynamic iteratively reweighted Markov chain Monte Carlo algorithm conveniently recycles posterior samples from the individual analyses. We demonstrate the value of our approach by examining the insertion–deletion (indel) process in the enolase gene across the Tree of Life using the phylogenetic software BALI-PHY; we incorporate prior information about indels from 82 curated alignments downloaded from the BAliBASE database.

Contact: ude.alcu@lgnail

Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

In an experimental study, it is often possible to collect further data to more accurately estimate model parameters. In contrast, in comparative analyses of molecular sequence data, sequences are finite and typically complete, and more data cannot be collected. Additional sequences cannot be directly included in the analysis because that changes the inference target. Decreasing estimation standard errors in phylogenetic analysis requires a radically different approach with which to inject additional information into the analysis.

Bayesian methods can improve inference accuracy by incorporating proper informative prior distributions into the analysis (Carlin and Louis, 2008; Gelman et al., 2003). However, concerns often exist about the appropriateness of any given piece of prior information (Box and Tiao, 1992). The ability to generate appropriate prior distributions for phylogenetic parameters would be a useful mechanism to reduce estimation error (Alfaro and Holder, 2006).

Many phylogenetic conclusions are indeed sensitive to prior information (Rannala, 2002; Zwickl and Holder, 2004). For example, Yang and Rannala, (2005) found that the posterior distribution of the phylogenetic tree is sensitive to the prior specification for internal branch lengths. When there is sensitivity, it is preferable to have priors that independent observers will recognize as encoding appropriate and relevant information. Internal branch length priors based on scientific information will minimize concern regarding this influence (Kolaczkowski and Thornton, 2007).

Subjective priors form the basis of traditional Bayesian proper prior construction. One elicits these priors from the statistician or other expert researchers (e.g. Bedrick et al., 1996, 1997) or studies reported in the literature. In the case of phylogenetic analyses, prior information is often available in the form of previous analyses of similar datasets, either in the literature or from a biologist's own laboratory. For example, further sequence data from similar taxa or genes is often available.

An objective approach to proper prior construction begins by building a Bayesian hierarchical model involving the dataset(s) of interest and additional similar datasets whose analyses may not be of immediate interest. Biologists can find or construct datasets similar to their own data either from their lab or from proliferating publicly available sequence databases. Candidate databases include the benchmark alignments database (BAliBASE, Thompson et al., 1999; http://bips.u-strasbg.fr/fr/Products/Databases/BAliBASE/, version 1) described in Section 2, and HIV sequence databases at the Los Alamos National Laboratory for researchers who study HIV sequences and drug resistance (http://www.hiv.lanl.gov/content/index).

For aligned or alignable sequences, biologists often use publicly available software to estimate parameters describing the evolutionary process and the unknown phylogenies that give rise to the sequences. These programs fit complex Bayesian models; running time can be substantial. Priors are relatively vague, at least as the default option, as in MrBayes (Huelsenbeck and Ronquist, 2001) and bali-phy (Suchard and Redelings, 2006). These programs are designed to fit single datasets and are not designed to fit a hierarchical model containing multiple datasets. Informative priors may or may not be easily set in the software, depending on implementation specifics.

In this article, we introduce easy-to-use methods that allow biologists to construct informative priors for particular analyses using additional information either from their own lab or public sequence databases. We assume that K (one or a small number) of datasets are to be analyzed using the publicly available software.

The first step involves the construction of an additional N-related datasets of sequences. These additional datasets are similar in the sense that the K + N datasets may be thought of as random samples from the same sample space of datasets (exchangeable in Bayesian parlance), possibly after adjusting for covariates. Second, we analyze each of the N + K datasets independently using the chosen software and obtain Markov chain Monte Carlo (MCMC) samples from the marginal posterior distributions of the phylogenetic model parameters of interest. We refer to each of these N + K analyses as an individual analysis. In the third step, we propose to fit a semiparametric Bayesian hierarchical random effects model that combines the N + K individual analyses.

Our hierarchical model exploits a mixture of Dirichlet processes (DPs) (Sethurman, 1994; Escobar and West, 1995) for the prior of the random effects, allowing for equality in parameters across some, but not necessarily all, analyses. In effect, the hierarchical model with mixture DP prior simultaneously enables us to borrow strength from the additional analyses to improve individual inferences and protects against pooling the analysis of interest with analyses lacking information about the parameters of interest. The hierarchical model then updates the individual analysis posterior given a single dataset to a posterior given the full set of N + K datasets. Our inference tools for this step avoid rewriting the public domain software, yet jointly fit all datasets simultaneously by reusing the available MCMC output generated from the individual analyses.

This third step can deliver a convenient proper prior that may be reused to update the existing software for future analyses. We approximate the posterior predictive distribution (PPD) of the parameters of interest; this distribution is exactly the correct prior for a new analysis. When feasible, we recommend incorporating the posterior predictive distributions as an option into the public software for the use in analyzing future data-related sets of interest. Alternatively, we show how to update a future analysis without modifying the software or rerunning our software.

Our hierarchical prior model has several distinct features; first, parameters of interest can be univariate, bivariate or higher dimensional. Second, our approach is practicable whether or not we can make software modifications. Third, additional datasets may be directly exchangeable with the datasets of interest, or may be exchangeable after adjustment for covariates. Fourth, rather than applying the nonparametric DP prior directly to the random effects, we instead apply it to the hyperparameters of the random effects distribution for additional flexibility. Fifth, we exploit a dynamic iteratively reweighting MCMC algorithm (DyIRMA) proposed by Liang and Weiss (2007) that replaces the slow individual analysis updates with efficient resampling procedures.

Our approach is in the spirit of empirical Bayes (EB) (Carlin and Louis, 2000; Efron, 1996, Morris, 1983; Robbins, 1956) in the sense that the informative (empirical) priors are derived from data rather than from a priori assumptions. A key difference is that we start with K datasets of interest, then gather N similar datasets whose output may not be of direct interest. Thus, our approach is fully Bayesian and makes use of hierarchical models, but the primary goal is to estimate K random effects, rather than for the more common purpose of estimating parameters of the hyperprior. It may be that given the additional N datasets, the investigators may switch attention to the hyperparameter estimates, but that is not our intent; we remain focused on the K original analyses.

2 DATA AND APPLICATION

We wish to examine the insertion–deletion (indel) process in the enolase gene. Our motivating dataset consists of 37 aligned enolase sequences identified in Bapteste and Philippe (2002).

Indels are not directly observed in sequences; rather, one infers their existence through sequence alignment. Appropriate inference about where indel events occur depends on correctly specifying indel process parameters. Sequence alignment in turn depends on the tree relating the sequences; yet inference about the phylogenetic tree depends on the alignment. Considerable uncertainty exists regarding the evolutionary tree-relating taxa across the Tree of Life in turn causing difficulty in alignment. To control this uncertainty and protect against bias (Lake, 1991), we use a statistical model that jointly infers alignment and phylogeny (Redelings and Suchard, 2005). This joint framework uses a hidden Markov model (HMM) to characterize the indel process, parameterized by an indel opening rate λ and expected additional indel length, l=ε/(1 − ε) (Redelings and Suchard, 2007), which is one less than the mean indel length.

Our specific scientific objective is to learn about the bivariate distribution of (log λ, log l) in the enolase genes of a specific set of sequences. Unfortunately, the enolase data are sparse, causing uncertain inference for the key parameters (log λ, log l). We apply our approach to increase the precision of inference.

First, we identify multiple additional datasets that when properly analyzed provide useful information about the parameters of interest. We turn to the BAliBASE database to furnish these additional datasets. BAliBASE is a well-developed, publicly available database of curated alignments for evaluation of sequence alignment programs. The BAliBASE first reference set contains 82 reliable alignments, of 3 to 5 equi-distant sequences each, where the percent identity between two sequences lies within an identified range. The diversity level can be low (<25%), medium (25–35%) and high (>35%). The alignment length can be short, medium or long.

We analyze individual datasets with the software package bali-phy. bali-phy uses an MCMC algorithm to sample from the joint posterior distribution of alignment, alignment parameters, phylogenetic tree and phylogenetic parameters given a set of sequences (Suchard and Redelings, 2006). Under the joint alignment and phylogeny model, the computational complexity of fitting a single dataset is large; simultaneously fitting multiple datasets is currently impractical. Our DyIMRA algorithm side-steps this severe limitation.

To better illustrate the value of our approach and algorithm, we partitioned the 37 enolase gene sequences into seven separate datasets containing 4–7 taxas each. These seven datasets in turn provide seven similar illustrations of our approach rather than a single example. Further, this demonstrates the particular benefit of our approach for small datasets.

3 METHODOLOGY

For simplicity, we assume K datasets Y1,…, YK of specific interest with corresponding parameters μ1, …, μK of interest. We collect an additional N datasets YK+1,…, YK+N that we consider similar to the datasets of interest. The N datasets are not of interest at the current time and are collected only to increase the precision of parameters μ1,…, μK.

3.1 Individual analyses

Define [X] and [X|Z] to be the densities of random variable X and the conditional density of X given Z, respectively.

Each dataset Yi, i = 1,…, K + N is assumed to be drawn independently from probability density [Yii, θi, xi], where parameter of interest μi is a D-dimensional vector, θi is a vector of nuisance parameters: additional parameters not of interest and xi are covariates that describe known differences between datasets. Parameters of interest μi have different values across datasets. In our situation, D = 2 with μi=(log λi, logli). The nuisance parameters θi may not be comparable across datasets. For example, the alignment of a set of molecular sequences or the tree relating those sequences is meaningful for those sequences and is not relevant for other sequences.

We use the public domain Bayesian software bali-phy to separately analyze each Yi with sampling model [Yii, θi, xi] and prior [μi|ϕ = ϕ0] [θii], where ϕ is a vector of prior parameters, which, when we fix ϕ = ϕ0, we recover the prior in the software. To extract maximal information, we fix the alignments, but not the indel process, in the additional N BAliBASE datasets to their curated estimates. For the K datasets of interest, we have no such prior information and estimate their alignments as a part of their analyses. When analyzed independently, the K + N datasets give rise to K + N separate posteriors [μi, θi|Yi, xi, ϕ = ϕ0], which we call the individual posteriors.

3.2 Semiparametric random effects model

We now combine information from the individual posteriors using our hierarchical model to improve inference on μi. To do this, we specify a joint posterior for the entire vector μ=(μ1,…, μK+N)′ given the complete data Y = (Y1,…, YK+N).

To pool information across datasets, we model individual μi=(μi1,…, μiD)′ as coming from a D-variate normal (MVND)

equation image
(1)

given a Q+1 vector of covariates xid=(1, xid1,…, xidQ)′ including the intercept, unknown regression coefficients αd=(α0d,…, αQd)′, a D × D diagonal covariance matrix Σ1 = diag(σ12,…, σD2) and unknown dataset-specific random effects βid for each dimension d = 1,…, D of μi. In our example, we have Q = 4 covariates, two indicators each for sequence length (median and long) and identity (median and high). We use these covariates to predict both μi1 and μi2 for a total of Q + 1 = 5 regression coefficients αd per parameter of interest. In general, each parameter can have different predictors, this makes for an increase in notational complexity of one additional subscript and an increase in data management issues, but it presents no theoretical difficulties. The multivariate normal assumption is assumed to hold possibly after transformation, as in our case where we take logs of both parameters before modeling.

To complete the hierarchical specification of (1), we merge all fixed-effects coefficients into a (Q + 1) × D matrix α with columns α1,…, αD. Each row α(q) of α corresponds to the same covariate q = 1,…, Q + 1 across the dimensions of μi. We also form βi=(βi1,…, βiD). To relax parametric assumptions on βi, we envision an unknown, discrete probability distribution P as the prior for the βi and model P with a DP, such that

equation image
(2)

where P0 is the mean of the DP generating P with concentration parameter M. We fix M, a0, R0, cd, sd, η=(η0,…, ηQ) and V = (V0,…, VQ) as a priori known constants of the hyper-prior. We estimate the unknown hyper-parameters, ϕ=(α, β, Σ0, Σ1) as part of the model. A fixed value ϕ0 of ϕ is assumed to yield the priors used in the individual analyses, so that [μi]=[μi|ϕ = ϕ0].

3.3 Computation

The public domain software is used to fit the individual models [Yii, θi, xi] with prior [μi|ϕ=ϕ0] [θii]. This software in turn produces posterior samples from [μi , θi|Yi]. This model and resulting posterior is a complex function of μi and the posterior does not have a closed form of algebraic representation. We run each individual model i = 1,…, K + N, producing a MCMC sample from each posterior.

We do not wish to make the huge investment that would be necessary to reprogram this model and combine it with our hierarchical models (1) and (2). Furthermore, fitting the individual prior can take hours if not days to produce a single sample; rerunning the K + N individual models jointly along with our hierarchical model could produce a problem that might well push past the limits of computational tractability. In the not uncommon situation where all K + N analyses might be of interest, for example, when all sets of sequences come from the same lab, we would not want to rerun all the individual analyses, wasting the computer time already spent on each analysis. Our approach uses a divide and conquer approach to break the problem into the K + N individual models plus one hierarchical model that uses the results from the individual models resulting in more manageable programming and computational efforts.

To combine the results of the K + N individual posteriors, a conventional Gibbs (Gelfand and Smith, 1990) or Metropolis–Hasting (Tierney, 1994) sampling algorithm is not possible. Instead, Liang and Weiss (2007) propose DyIRMA to replace the difficult sampling of μi with a weighted sampling of the MCMC realizations of μi from the individual posteriors, eliminating the need for redundant reprogramming, and the computational cost of rerunning the original models.

Let [μi0] and [θii] be the prior densities of μi and θi from the individual analyses. The densities [μi, θi|Yi, ϕ0] and [μi|Yi, ϕ0] are the joint and marginal individual posterior densities for (μi, θi) and μi, respectively. Let [μi|ϕ] denote the prior density for μi from our hierarchical model; we refer to this as the hierarchical prior for μi. In the hierarchical model, [μi|Y] and [μ|Y] are the hierarchical posteriors for μi and μ; for i = 1,…, K, these are the quantities that we aim to estimate.

Integrating each joint posterior density [μi, θi|Yi, ϕ0] over its nuisance parameter vector θi yields the individual marginal posterior [μi|Yi, ϕ0]. Dividing this density by the prior distribution [μi0] used in the individual model and multiplying by the proposed hierarchical prior density [μi|ϕ] is the individual contribution from each individual problem to our posterior. We multiply these individual contributions together and then multiply them by the hierarchical prior density [ϕ]. The resulting hierarchical posterior is

equation image
(3)

The ratio

equation image
(4)

is a weight function; hence, conditional on ϕ, the posterior of μi under the hierarchical model is a reweighting by (4) of the original marginal individual posterior [μi|Yi]. Unconditional on ϕ, this statement is still true, [μi|Y] is a reweighting of [μi|Yi].

The conventional Gibbs algorithm simulates the joint posterior distribution [μ, ϕ|Y] by sampling μ1,…, μK+N and ϕ sequentially from their full conditional distributions [μi|Y, ϕ, μ(i)]=[μi|Yi, ϕ], where μ(i) is μ without the i-th component, and [ϕ|Y, μ]. Instead, Equation (3) suggests the DyIRMA framework. We update μi by drawing a realization from the pregenerated MCMC samples of [μi|Yi, ϕ0] weighted by w*ϕ0i, ϕ). The reweighting occurs dynamically inside the Gibbs step and is a function of the current value of ϕ.

In detail, let μi(m) be the m-th MCMC sampled value from [μi|Yi, ϕ0] for m = 1,…, Mi, the number of MCMC samples resulting from the analysis of the dataset i. Our algorithm replaces sampling from [μi|Y, ϕ] with sampling from the weighted empirical density

equation image
(5)

where δμi(M)(·) denotes point mass at μi(m), and Wi = ∑m=1Mi w*ϕ0i(m), ϕ). For each DyIRMA iteration t, we sample μi for each i from (5) given ϕ=ϕ(t−1). We then generate α(t), β(t), Σ0(t) and σd(t) for d = 1,2 in sequential Gibbs steps from their full conditional distributions given μ(t). The samples drawn for μi(t) for i = 1,…, K are the posterior samples for our K analyses of scientific interest under the full hierarchical model.

3.4 Posterior predictive density

We also use our method to generate an informative prior for the parameter of interest μn for a future analysis with associated covariates xn=(xn1,…, xnD)′. The prior we want to calculate is the PPD, [μn|Y], for a new dataset Yn. Since μn given both ϕ and its mean g(α)+βn is independent of Y, we have

equation image
(6)

Because the PPD (6) does not have a closed form, we use a Rao–Blackwell estimate (Gelfand and Smith, 1990) to estimate (6) as

equation image
(7)

where α(t) and ϕ(t) are the t-th DyIRMA samples and T is the total number of samples after the burn-in period. The random effect βn(t) is new, we draw it from An external file that holds a picture, illustration, etc.
Object name is btp473i1.jpg, where δa is the distribution with point mass of the single point a a (Blackwell and MacQueen, 1973; Bush and MacEachern, 1996). There is no additional programming effort to compute the posterior sample average (7). All we need is to add one step to the DyIRMA approach: generate βn(t) from [βn|ϕ].

3.5 Incorporating empirical priors

When feasible to replace the vague prior for μn used in the public phylogenetic software with the Rao–Blackwell estimate of the PPD (7), it is useful to approximate the density either parametrically or non-parametrically. The software is then updated to permit a parametric or non-parametric prior. To approximate a parametric form for the target density, we select a parametric family, such as the normal, t or Beta family, that is fairly close to the target density. Often visual inspection suffices to identify a satisfactory family. When the target density is clearly different from a known parametric form, we recommend approximating [μn|Y] nonparametrically. A kernel density estimator (KDE) [μn|Y]KDE of [μn|Y] is

equation image
(8)

where Ks(·) is a density, usually continuous and symmetric and preferably continuously differentiable as well, with scale parameter s.

The posterior density of μn is [μn|Yn, Y] [proportional, variant] [Ynn][μn|Y]KDE, with the approximated prior (8) used in the public phylogenetic software. In the situation where changing the public phylogenetic software is not feasible, we can of course still update [μn|Yn] using our previously described procedure. If we have already run our software to produce [μn|Y] first, we next run the public domain software to produce MCMC samples μn(m) from [μn|Yn], m=1,…, Mn. Then [μn|Yn, Y] [proportional, variant]n|Yn][μn|Y]KDE([μn0])−1, where [μn0] is the prior used in the public domain software, we weight μn(m) with weights w(m)=[μn(m)|Y]KDE([μn(m)0])−1 giving a weighted posterior sample from [μn|Yn, Y].

3.6 Model comparison

We discuss model comparisons in terms of the new dataset Yn with parameter μn. We wish to compare the posteriors [μn|Yn] and [μn|Y, Yn] resulting from model M0 with the default prior and model Mh with the hierarchical prior, respectively. We propose two methods to evaluate which prior model provides better estimation. The first employs a Bayes factor (BF). The BF in favor of Mh against M0 for the single dataset Yn is

equation image
(9)

when using the same priors for θn in both models. Conveniently, we compute the Monte Carlo approximation of BFH0 using the M posterior samples μn(m) from the analysis based on the model [μn|Yn] with the default priors.

In addition to BFs, we propose using the posterior expected loss (PEL) to compare the two models. Consider an expanded mixture model combining both M0 and Mh where the prior is a convex mixture v0 × [μn|ϕ=ϕ0]+(1 − v0) ×[μn|Y], of the public domain software prior [μn|ϕ=ϕ0] and the hierarchical prior [μn|Y] for some choice of 0≤v0≤1. The mixture model assumes that one of the two priors is the correct prior; the parameter v0 can be interpreted as the prior probability of the public domain software prior having the correct prior. This is a family of models indexed by v0.

The resulting posterior density is a mixture v ×[μn |Yn, ϕ=ϕ0]+(1 − v) ×[μn|Yn, Y], where 0≤v≤1 is the posterior probability of the model implemented in the public domain software. The posterior probability v is a monotone non-decreasing function of v0, and can be calculated as the part of the analysis. The posterior probability v=0 when v0=0 and v=1 when v0=1. The posterior mean Evn|Yn)=v × E0n|Yn) + (1 − v) × EHn|Yn), and the PELv of the posterior mean under squared error loss is, as a function of v,

equation image
(10)

where V0n|Yn) and Vhn|Yn) are posterior variances under the models M0 and Mh, respectively.

The PELv equals Vhn|Yn) if v0 = 0, and V0n|Yn) if v0 = 1. For v0 in the interval (0, 1), the PELv is always greater than the smaller of Vhn|Yn) and V0n|Yn). Therefore, using the PEL criterion, the best model out of the family of models indexed by v0 overall occurs either at v0 = v = 0 or at v0 = v = 1 and we should prefer the model that gives us the smallest posterior variance. In our examples, the hierarchical model usually returns a smaller posterior variance and therefore a smaller PEL. If the N additional datasets are selected appropriately, the posterior resulting from the hierarchical model will be preferable to the individual posteriors for our K datasets of interest.

4 RESULTS

We illustrate our three-step approach by first collecting prior information from the 82 BAliBASE datasets to help us to examine the indel process in play with enolase gene evolution. We use bali-phy to fit the individual analyses. The default priors on the log indel opening rate μi1 and log indel extension μi2 are independent. The default prior for μi1 is a double exponential distribution, (2c)−1exp(−|(μi1b)/c|), c > 0, with mean b = −5 and SD An external file that holds a picture, illustration, etc.
Object name is btp473i2.jpg. The default prior for ψ=exp(μi2) is an exponential distribution, d−1 exp(−d−1ψ), with mean d = 10.First, we preprocess the 82 datasets to meet the input requirements for bali-phy and then generate the individual bivariate posteriors [μi|Yi]. For each analysis, we ran the MCMC sampler for 50 000 iterations and discard the first 10 000 samples as burn−in. We store every 40th sample yielding a total of 1000 realizations for each individual analysis. Then we run our K = 7 datasets of interest with the same sampling scheme. We use the samples from the independent analyses as input to our hierarchical model. Section 4.1 discusses the results from combining just the N prior analyses in a joint model. Section 4.2 presents the results of implementing a module for our empirical priors into bali-phy and using this prior to inform our examination of the enolase datasets.

4.1 Combining analyses

We fit our hierarchical model with two dataset-specific covariates, alignment length (short, medium and long) and percent identity (low, medium and high), to combine the 82 individual prior analyses. We set the hierarchical hyper-prior constants as follows. Hyper-prior mean η1=(−5, 2.3) and covariance matrix V0 = diag{4, 4}, matching the prior means and variances for μi1 and μi2 used in bali-phy. Covariates ηq = (0, 0) and Vq = diag{1, 1}, q = 2,…, 5, are finite but relatively uninformative. The inverse covariance matrix Σ−11 is a diagonal matrix with elements σ−2d, d = 1,2, that are gamma-distributed with mean 5, which is approximately the range of the individual posterior samples of μi, and variance 25, so that the prior mean is equal to the prior SD. For Σ0, a0 = 2 and R0 = diag{5,5} were chosen such that E0) is ~30% larger than the variance of the individual posterior means of μi. Finally, we set the DP concentration parameter M = 5 using a formula suggested by Liu (1996) and generate 50 000 DyIRMA iterations with the first 5000 iterations discarded as burn-in.

Figure 1a shows a bivariate scatter plot of the 82 posterior means of μi1 versus μi2 from our hierarchical model. Covariate labels open circle, open triangle and cross represent short-, medium- and long-length datasets, respectively. The horizontal line identifies the log expected indel length μi2 = 0. Almost all the open circles are below the zero line, whereas most of the cross marks are above the zero line. This indicates that the log expected length of the indels is increasing as the length of alignment increases. The posterior mean of the log rate of an indel opening may be decreasing with increasing alignment length.

Fig. 1.
(a) Scatter plot of the posterior mean of μi1 (log indel opening rate) against the posterior mean of μi2 (log expected indel length) from our hierarchical model. Here open circle (○), open triangle ([triangle, equals]) and cross mark (×) ...

Figure 1b and c presents boxplots of percent variance reduction from the individual to hierarchical posteriors of μi1 and μi2, respectively. The variances of the hierarchical posteriors are re-assuringly smaller than those of the individual posteriors; the average percent reduction is 16% (range: 0.7–54%) and 17% (range: 0.4–70%) for μi1 and μi2, respectively. We observed larger variance reduction for the short-length datasets (clear boxes) as compared with the other two length groups for both indel parameters. This demonstrates that smaller datasets successfully borrow more strength to accommodate their sparse information content. Further, Figure 1c demonstrates that for short- and medium-length datasets, higher percent identity leads to greater variance reduction in estimating μi2, also understandable because information content approaches 0 for identical sequences. The hierarchical model improves the individual inferences by borrowing strength across datasets leading to more precise inferences.

4.2 Implementing and evaluating empirical priors

It is possible to directly implement our hierarchical priors for the indel parameters in bali-phy. Our priors constructed in this section are based on the 82 individual posteriors and the Rao–Blackwellized PPD estimate as a mixture of normal distributions. We added a module for the newly generated empirical priors to the bali-phy software, replacing the original priors with this mixture. We then use the empirical priors to separately analyze the seven enolase datasets Yk, where k = 1,…, 7, and assess the relative performance of the new priors compared with the defaults.

To compare the performance, we compare the posteriors from the models M0 and Mh for each of the seven datasets. For a representative dataset, number 6, Figure 2a presents the posterior kernel density (thick dashed line) for μk1 based on the default prior density used in bali-phy (dashed-dotted line), and the posterior kernel density curve (thick solid line) based on the prior density generated using our hierarchical model (dotted line). Similarly, Figure 2b presents prior and posterior densities for the indel process parameter μk2. The variances of posterior density curves from Mh shown in Figure 2a and b are appreciably smaller than those from M0.

Fig. 2.
(a and b) Posterior kernel density curve (thick dashed line) based on the default prior (dash-dotted line) and posterior kernel density curve (thick solid line) based on the empirical prior (dotted line) for (a) μk1 and (b) μk2, where ...

Figure 2c and d presents side-by-side boxplots of the seven posterior densities of μk1 and μk2, respectively. The left-hand side clear boxplot of each pair represents the posterior density from M0 and the right−hand side (solid) represents the posterior density from Mh. The length of the solid boxes are markedly smaller than those of the clear boxes, indicating that the posterior density from Mh is more precise than that from M0. The average reduction is 60.1% (range: 37.7–86.4%) and 58.7% (range: 39.7–75.4%) for μk1 and μk2, respectively. Two datasets (5 and 7) for μk1 and three data sets for μk2 (2, 3 and 7) have unacceptably long left tails when the default priors are used. Additional figures (alignment uncertainty plots and phylogenetic trees) are available in the Supplementary Material.

Table 1 presents log BFs bk=logBFH0 in favor of the hierarchical model priors over the default priors for each enolase dataset k. Four out of seven datasets of interest have log BFs greater than 1, and the largest one b6 is 2.3. The sum of the log BF, ∑bk is not an unreasonable approximation to the global BF for all seven datasets combined; ∑bk > 3.5. Kass and Raftery (1995) suggest that log BFs between 1 and 3 are positive evidence in favor of MH, and between 3 and 5 are strong evidence and >5 are very strong evidence. We also report posterior variances under both procedures for both parameters in Table 1. The estimates have uniformly smaller standard errors under MH than under M0.

Table 1.
Evaluation of the hierarchical model and default priors in estimating indel process parameters μk1 (log indel opening rate) and μk2 (log expected indel length)

5 DISCUSSION

This article has introduced a three-step approach using a semi-parametric random effects model and DyIRMA to efficiently combine multiple complex Bayesian phylogenetic analyses. Our approach improves the analysis of interest by incorporating additional information from publicly available sequence databases, and our approach works whether or not modifying the public phylogenetic software itself is possible. In addition, the estimated PPDs generated using our approach furnish informative priors for the parameters of interest for future analyses of datasets having similar covariates.

We propose two performance evaluation methods for these priors, both of which are easy to compute. We combined each of seven datasets of interest with extensive prior information and then performed two phylogenetic analyses, one using the default priors for the indel process parameters in bali-phy and the other using the empirical prior generated using our approach. When data are sparse, our results show that careful prior choice becomes valuable and informative priors provide substantially more efficient estimates. Assuming that (i) the new datasets that researchers have are similar to the BAliBASE datasets; (ii) researchers are interested in the indel model using bali-phy or other software; and (iii) the prior model for the indel parameters is the same, then the prior generated here can be useful for analyzing their new data.

Our findings should encourage researchers to identify and gather as many sets of comparable sequence data as possible from publicly available sequence databases or research laboratories. Our approach can be applied to generate proper prior distributions of the phylogenetic parameters of interest. The priors can later be incorporated into existing software for use of others. When prior information is available, this helps researchers to produce more accurate posterior estimates of the phylogenetic parameters of interest, as compared with using the default priors. In the case where empirical priors could not be incorporated directly into existing software, our approach can still be used to yield better posterior estimates by combining analyses of interest with additional prior analyses through dynamic reweighting.

This reweighting algorithm requires calculation of weights wϕ0i, ϕ) at each iteration for all individual posterior samples; this can become computationally expensive. Alternatively, we can use a Metropolis–Hastings step (Tierney, 1994) sampling candidate μi(t*) at random from the unweighted samples. Both options perform optimally when the samples from the individual posteriors cover the entire posterior support of the μi. (Liang and Weiss, 2007) suggest another class of algorithms which approximate the individual posterior using a KDE. The KDE provides a smoothed version of the individual posterior, particularly in the tail of the density. All of these DyIRMA extensions show further promise in simulating hierarchical models over already complex phylogenetic models.

Our approach also provides suggestions to software engineers. Software will be more utile and calculations can be made easier if future software engineers allow for KDE-type priors as input into their programs and also allow for families of proper priors, including, but not limited to, normal, t and Beta priors as appropriate for individual parameters. Even if joint priors are restricted to products of independent priors, this will still allow for substantially improved inference.

Funding: National Institutes of Health (grant GM086887); Biostatistics training grant for AIDS Research (5T32AI007307 to L.-J.L.), UCLA Center for AIDS Research (to R.E.W.); National Institutes Health (grant AI28697 to R.E.W.); Alfred P. Sloan Research Fellow (M.A.S.); John Simon Guggenheim Memorial Foundation Fellow (M.A.S.).

Conflict of Interest: none declared.

Supplementary Material

[Supplementary Data]

REFERENCES

  • Alfaro M, Holder M. The posterior and the prior in Bayesian phylogenetics. Ann. Rev. Ecol. Evol. Syst. 2006;37:19–42.
  • Bapteste E, Philippe H. The potential value of indels as phylogenetic markers: position of trichomonads as a case study. Mol. Biol. Evol. 2002;19:972–977. [PubMed]
  • Bedrick EJ, et al. A new perspective on priors for generalized linear models. J. Am. Stat. Assoc. 1996;91:1450–1460.
  • Bedrick EJ, et al. Bayesian binomial regression: predicting survival at a trauma center. Am. Stat. 1997;51:211–218.
  • Blackwell D, MacQueen JB. Ferguson distributions via polya urn schemes. Ann. Stat. 1973;1:353–355.
  • Box G.EP, Tiao GC. Bayesian Inference in Statistical Analysis. New York: John Wiley & Sons; 1992.
  • Bush CA, MacEachern SN. A semiparametric Bayesian model for randomised block designs. Biometrika. 1996;83:275–285.
  • Carlin BP, Louis TA. Bayesian Methods for Data Analysis. 3. Boca Raton, FL: Chapman & Hall/CRC; 2008.
  • Carlin BP, Louis T. Empirical Bayes: past, present, and future. J. Am. Stat. Assoc. 2000;95:1286–1289.
  • Efron B. Empirical Bayes methods for combining likelihoods. J. Am. Stat. Assoc. 1996;91:538–550.
  • Escobar MD, West M. Bayesian density estimation and inference using mixtures. J. Am. Stat. Assoc. 1995;90:577–588.
  • Gelfand AE, Smith A.FM. Sampling-based approaches to calculating marginal densities. J. Am. Stat. Assoc. 1990;85:398–409.
  • Gelman A, et al. Bayesian Data Analysis. 2. Boca Raton, FL: Chapman & Hall/CRC; 2003.
  • Huelsenbeck JP, Ronquist F. MrBayes: Bayesian inference of phylogenetic trees. Bioinformatics. 2001;17:754–755. [PubMed]
  • Kass RE, Raftery AE. Bayes factors. J. Am. Stat. Assoc. 1995;90:773–795.
  • Kolaczkowski B, Thornton J. Effect of branch length uncertainty on posterior probabilities for phylogenetic hypotheses. Mol. Biol. Evol. 2007;24:2108–2118. [PubMed]
  • Lake J. The order of sequence alignment can bias the selection of tree topology. Mol. Biol. Evol. 1991;8:378–385. [PubMed]
  • Liang L, Weiss R. A hierarchical semi-parametric regression model for combining HIV-1 phylogenetic analyses using iterative reweighting algorithms. Biometrics. 2007;63:733–741. [PubMed]
  • Liu J. Nonparametric hierarchical bayes via sequential imputations. Ann. Stat. 1996;24:911–930.
  • Morris C. Parametric empirical Bayes inference: theory and application. J. Am. Stat. Assoc. 1983;78:47–59.
  • Rannala B. Identifiability of parameters in MCMC Bayesian inference of phylogeny. Syst. Biol. 2002;51:754–760. [PubMed]
  • Redelings B, Suchard M. Joint Bayesian estimation of alignment and phylogeny. Syst. Biology. 2005;54:401–418. [PubMed]
  • Redelings B, Suchard M. Incorporating indel information into phylogeny estimation for rapidly emerging pathogens. BMC Evol. Biol. 2007;7:40. [PMC free article] [PubMed]
  • Robbins H. Proceedings of the Third Berkeley Symposium on Mathematical Statistics. Vol. 1. Berkeley: University of California Press; 1956. An empirical Bayes approach to statistics; pp. 157–163.
  • Sethurman J. A constructive definition of Dirichlet priors. Stat. Sin. 1994;4:639–650.
  • Suchard M, Redelings B. BALI-Phy: simultaneous Bayesian inference of alignment and phylogeny. Bioinformatics. 2006;22:2047–2048. [PubMed]
  • Thompson J, et al. BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs. Bioinformatics. 1999;15:87–88. [PubMed]
  • Tierney L. Markov chains for exploring posterior distributions. Ann. Stat. 1994;22:1701–1762.
  • Yang Z, Rannala B. Branch-length prior influences Bayesian posterior probability of phylogeny. Syst. Biol. 2005;54:455–470. [PubMed]
  • Zwickl D, Holder M. Model parameterization, prior distributions and the general time-reversible model in Bayesian phylogenetics. Syst. Biol. 2004;54:961–965. [PubMed]

Articles from Bioinformatics are provided here courtesy of Oxford University Press