


Bioinformatics. 2009 October 1; 25(19): 2530–2536.

Published online 2009 August 6. doi: 10.1093/bioinformatics/btp473

PMCID: PMC2800350

Associate Editor: Martin Bishop

Received 2009 February 7; Revised 2009 July 28; Accepted 2009 July 28.

Copyright © The Author 2009. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org


**Motivation:** Statistical analyses of phylogenetic data culminate in uncertain estimates of underlying model parameters. Lack of additional data hinders the ability to reduce this uncertainty, as the original phylogenetic dataset is often complete, containing the entire gene or genome information available for the given set of taxa. Informative priors in a Bayesian analysis can reduce posterior uncertainty; however, publicly available phylogenetic software specifies vague priors for model parameters by default. We build objective and informative priors using hierarchical random effect models that combine additional datasets whose parameters are not of direct interest but are similar to the analysis of interest.

**Results:** We propose principled statistical methods that permit more precise parameter estimates in phylogenetic analyses by creating informative priors for parameters of interest. Using additional sequence datasets from our lab or public databases, we construct a fully Bayesian semiparametric hierarchical model to combine datasets. A dynamic iteratively reweighted Markov chain Monte Carlo algorithm conveniently recycles posterior samples from the individual analyses. We demonstrate the value of our approach by examining the insertion–deletion (indel) process in the enolase gene across the Tree of Life using the phylogenetic software BALI-PHY; we incorporate prior information about indels from 82 curated alignments downloaded from the BAliBASE database.

**Contact:** liangl@ucla.edu

**Supplementary information:** Supplementary data are available at *Bioinformatics* online.

**1 INTRODUCTION**

In an experimental study, it is often possible to collect further data to more accurately estimate model parameters. In contrast, in comparative analyses of molecular sequence data, sequences are finite and typically complete, and more data cannot be collected. Additional sequences cannot be directly included in the analysis because that changes the inference target. Decreasing estimation standard errors in phylogenetic analysis requires a radically different approach with which to inject additional information into the analysis.

Bayesian methods can improve inference accuracy by incorporating proper informative prior distributions into the analysis (Carlin and Louis, 2008; Gelman *et al.*, 2003). However, concerns often exist about the appropriateness of any given piece of prior information (Box and Tiao, 1992). The ability to generate appropriate prior distributions for phylogenetic parameters would be a useful mechanism to reduce estimation error (Alfaro and Holder, 2006).

Many phylogenetic conclusions are indeed sensitive to prior information (Rannala, 2002; Zwickl and Holder, 2004). For example, Yang and Rannala (2005) found that the posterior distribution of the phylogenetic tree is sensitive to the prior specification for internal branch lengths. When there is sensitivity, it is preferable to have priors that independent observers will recognize as encoding appropriate and relevant information. Internal branch length priors based on scientific information will minimize concern regarding this influence (Kolaczkowski and Thornton, 2007).

Subjective priors form the basis of traditional Bayesian proper prior construction. One elicits these priors from the statistician or other expert researchers (e.g. Bedrick *et al.*, 1996, 1997) or studies reported in the literature. In the case of phylogenetic analyses, prior information is often available in the form of previous analyses of similar datasets, either in the literature or from a biologist's own laboratory. For example, further sequence data from similar taxa or genes is often available.

An objective approach to proper prior construction begins by building a Bayesian hierarchical model involving the dataset(s) of interest and additional similar datasets whose analyses may not be of immediate interest. Biologists can find or construct datasets similar to their own data either from their lab or from proliferating publicly available sequence databases. Candidate databases include the benchmark alignments database (BAliBASE, Thompson *et al.*, 1999; http://bips.u-strasbg.fr/fr/Products/Databases/BAliBASE/, version 1) described in Section 2, and HIV sequence databases at the Los Alamos National Laboratory for researchers who study HIV sequences and drug resistance (http://www.hiv.lanl.gov/content/index).

For aligned or alignable sequences, biologists often use publicly available software to estimate parameters describing the evolutionary process and the unknown phylogenies that give rise to the sequences. These programs fit complex Bayesian models; running time can be substantial. Priors are relatively vague, at least as the default option, as in MrBayes (Huelsenbeck and Ronquist, 2001) and bali-phy (Suchard and Redelings, 2006). These programs are designed to fit single datasets and are not designed to fit a hierarchical model containing multiple datasets. Informative priors may or may not be easily set in the software, depending on implementation specifics.

In this article, we introduce easy-to-use methods that allow biologists to construct informative priors for particular analyses using additional information either from their own lab or public sequence databases. We assume that *K* (one or a small number) of datasets are to be analyzed using the publicly available software.

The first step involves the construction of *N* additional related datasets of sequences. These additional datasets are similar in the sense that the *K* + *N* datasets may be thought of as random samples from the same sample space of datasets (exchangeable in Bayesian parlance), possibly after adjusting for covariates. Second, we analyze each of the *N* + *K* datasets independently using the chosen software and obtain Markov chain Monte Carlo (MCMC) samples from the marginal posterior distributions of the phylogenetic model parameters of interest. We refer to each of these *N* + *K* analyses as an *individual analysis*. In the third step, we propose to fit a semiparametric Bayesian hierarchical random effects model that combines the *N* + *K* individual analyses.

Our hierarchical model exploits a mixture of Dirichlet processes (DPs) (Sethuraman, 1994; Escobar and West, 1995) for the prior of the random effects, allowing for equality in parameters across some, but not necessarily all, analyses. In effect, the hierarchical model with mixture DP prior simultaneously enables us to borrow strength from the additional analyses to improve individual inferences and protects against pooling the analysis of interest with analyses lacking information about the parameters of interest. The hierarchical model then updates the individual analysis posterior given a single dataset to a posterior given the full set of *N* + *K* datasets. Our inference tools for this step avoid rewriting the public domain software, yet jointly fit all datasets simultaneously by reusing the available MCMC output generated from the individual analyses.

This third step can deliver a convenient proper prior that may be reused to update the existing software for future analyses. We approximate the posterior predictive distribution (PPD) of the parameters of interest; this distribution is exactly the correct prior for a new analysis. When feasible, we recommend incorporating the posterior predictive distributions as an option into the public software for use in analyzing future related datasets of interest. Alternatively, we show how to update a future analysis without modifying the software or rerunning our software.

Our hierarchical prior model has several distinct features. First, parameters of interest can be univariate, bivariate or higher dimensional. Second, our approach is practicable whether or not we can make software modifications. Third, additional datasets may be directly exchangeable with the datasets of interest, or may be exchangeable after adjustment for covariates. Fourth, rather than applying the nonparametric DP prior directly to the random effects, we instead apply it to the hyperparameters of the random effects distribution for additional flexibility. Fifth, we exploit a dynamic iteratively reweighted MCMC algorithm (DyIRMA) proposed by Liang and Weiss (2007) that replaces the slow individual analysis updates with efficient resampling procedures.

Our approach is in the spirit of empirical Bayes (EB) (Carlin and Louis, 2000; Efron, 1996; Morris, 1983; Robbins, 1956) in the sense that the informative (empirical) priors are derived from data rather than from a priori assumptions. A key difference is that we start with *K* datasets of interest, then gather *N* similar datasets whose output may not be of direct interest. Thus, our approach is fully Bayesian and makes use of hierarchical models, but the primary goal is to estimate *K* random effects, rather than the more common purpose of estimating parameters of the hyperprior. Given the additional *N* datasets, the investigators may switch attention to the hyperparameter estimates, but that is not our intent; we remain focused on the *K* original analyses.

**2 DATA AND APPLICATION**

We wish to examine the insertion–deletion (indel) process in the enolase gene. Our motivating dataset consists of 37 aligned enolase sequences identified in Bapteste and Philippe (2002).

Indels are not directly observed in sequences; rather, one infers their existence through sequence alignment. Appropriate inference about where indel events occur depends on correctly specifying indel process parameters. Sequence alignment in turn depends on the tree relating the sequences; yet inference about the phylogenetic tree depends on the alignment. Considerable uncertainty exists regarding the evolutionary tree relating taxa across the Tree of Life, in turn causing difficulty in alignment. To control this uncertainty and protect against bias (Lake, 1991), we use a statistical model that jointly infers alignment and phylogeny (Redelings and Suchard, 2005). This joint framework uses a hidden Markov model (HMM) to characterize the indel process, parameterized by an indel opening rate λ and expected additional indel length, *l* = ε/(1 − ε) (Redelings and Suchard, 2007), which is one less than the mean indel length.

Our specific scientific objective is to learn about the bivariate distribution of (log λ, log l) in the enolase genes of a specific set of sequences. Unfortunately, the enolase data are sparse, causing uncertain inference for the key parameters (log λ, log l). We apply our approach to increase the precision of inference.

First, we identify multiple additional datasets that when properly analyzed provide useful information about the parameters of interest. We turn to the BAliBASE database to furnish these additional datasets. BAliBASE is a well-developed, publicly available database of curated alignments for evaluation of sequence alignment programs. The BAliBASE first reference set contains 82 reliable alignments, of 3 to 5 equidistant sequences each, where the percent identity between two sequences lies within an identified range. The identity level can be low (<25%), medium (25–35%) or high (>35%). The alignment length can be short, medium or long.

We analyze individual datasets with the software package bali-phy. bali-phy uses an MCMC algorithm to sample from the joint posterior distribution of alignment, alignment parameters, phylogenetic tree and phylogenetic parameters given a set of sequences (Suchard and Redelings, 2006). Under the joint alignment and phylogeny model, the computational complexity of fitting a single dataset is large; simultaneously fitting multiple datasets is currently impractical. Our DyIRMA algorithm side-steps this severe limitation.

To better illustrate the value of our approach and algorithm, we partitioned the 37 enolase gene sequences into seven separate datasets containing 4–7 taxa each. These seven datasets in turn provide seven similar illustrations of our approach rather than a single example. Further, this demonstrates the particular benefit of our approach for small datasets.

**3 METHODOLOGY**

For simplicity, we assume *K* datasets *Y*_{1},…, *Y*_{K} of specific interest with corresponding parameters of interest μ_{1}, …, μ_{K}. We collect an additional *N* datasets *Y*_{K+1},…, *Y*_{K+N} that we consider similar to the datasets of interest. The *N* datasets are not of interest at the current time and are collected only to increase the precision of parameters μ_{1},…, μ_{K}.

Define [*X*] and [*X*|*Z*] to be the densities of random variable *X* and the conditional density of *X* given *Z*, respectively.

Each dataset *Y*_{i}, *i* = 1,…, *K* + *N* is assumed to be drawn independently from probability density [*Y*_{i}|μ_{i}, θ_{i}, *x*_{i}], where the parameter of interest μ_{i} is a *D*-dimensional vector, θ_{i} is a vector of *nuisance parameters* (additional parameters not of interest), and *x*_{i} are covariates that describe known differences between datasets. Parameters of interest μ_{i} have different values across datasets. In our situation, *D* = 2 with μ_{i}=(log λ_{i}, log *l*_{i}). The nuisance parameters θ_{i} may not be comparable across datasets. For example, the alignment of a set of molecular sequences or the tree relating those sequences is meaningful for those sequences and is not relevant for other sequences.

We use the public domain Bayesian software bali-phy to separately analyze each *Y*_{i} with sampling model [*Y*_{i}|μ_{i}, θ_{i}, *x*_{i}] and prior [μ_{i}|ϕ = ϕ_{0}] [θ_{i}|μ_{i}], where ϕ is a vector of prior parameters; fixing ϕ = ϕ_{0} recovers the default prior in the software. To extract maximal information, we fix the alignments, but not the indel process, in the additional *N* BAliBASE datasets to their curated estimates. For the *K* datasets of interest, we have no such prior information and estimate their alignments as a part of their analyses. When analyzed independently, the *K* + *N* datasets give rise to *K* + *N* separate posteriors [μ_{i}, θ_{i}|*Y*_{i}, *x*_{i}, ϕ = ϕ_{0}], which we call the *individual posteriors*.

We now combine information from the individual posteriors using our hierarchical model to improve inference on μ_{i}. To do this, we specify a joint posterior for the entire vector μ=(μ_{1},…, μ_{K+N})′ given the complete data *Y* = (*Y*_{1},…, *Y*_{K+N}).

To pool information across datasets, we model individual μ_{i}=(μ_{i1},…, μ_{iD})′ as coming from a *D*-variate normal (MVN_{D})

μ_{i} | α, β_{i} ∼ MVN_{D}(*m*_{i}, Σ_{1}),  *m*_{id} = *x*_{id}′α_{d} + β_{id},     (1)

given a *Q*+1 vector of covariates *x*_{id}=(1, *x*_{id1},…, *x*_{idQ})′ including the intercept, unknown regression coefficients α_{d}=(α_{0d},…, α_{Qd})′, a *D* × *D* diagonal covariance matrix Σ_{1} = diag(σ_{1}^{2},…, σ_{D}^{2}) and unknown dataset-specific random effects β_{id} for each dimension *d* = 1,…, *D* of μ_{i}. In our example, we have *Q* = 4 covariates, two indicators each for sequence length (median and long) and identity (median and high). We use these covariates to predict both μ_{i1} and μ_{i2} for a total of *Q* + 1 = 5 regression coefficients α_{d} per parameter of interest. In general, each parameter can have different predictors; this adds one subscript of notational complexity and some data management burden, but presents no theoretical difficulties. The multivariate normal assumption is taken to hold possibly after transformation, as in our case where we take logs of both parameters before modeling.
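As a concrete sketch, the mean structure in (1) can be simulated as follows. The dataset counts and covariate design mirror the application (*K* = 7, *N* = 82, *Q* = 4 indicators); the coefficient and variance values are arbitrary illustrative choices, not estimates from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

D, Q = 2, 4        # two indel parameters; four indicator covariates
K, N = 7, 82       # datasets of interest and additional datasets
n = K + N

# Indicator covariates (medium length, long length, medium identity,
# high identity) plus an intercept: Q + 1 = 5 coefficients per dimension.
X = np.hstack([np.ones((n, 1)), rng.integers(0, 2, size=(n, Q))])

alpha = rng.normal(size=(Q + 1, D))         # fixed effects, columns alpha_d
beta = rng.normal(scale=0.5, size=(n, D))   # dataset-specific random effects
sigma = np.array([0.3, 0.3])                # diagonal of Sigma_1

# m_id = x_id' alpha_d + beta_id, then mu_i ~ MVN_D(m_i, Sigma_1)
m = X @ alpha + beta
mu = m + rng.normal(size=(n, D)) * sigma
```

Here the same design matrix serves both dimensions of μ_{i}, matching the application where the same covariates predict both μ_{i1} and μ_{i2}.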

To complete the hierarchical specification of (1), we merge all fixed-effects coefficients into a (*Q* + 1) × *D* matrix α with columns α_{1},…, α_{D}. Each row α_{(q)} of α corresponds to the same covariate *q* = 1,…, *Q* + 1 across the dimensions of μ_{i}. We also form β_{i}=(β_{i1},…, β_{iD}). To relax parametric assumptions on β_{i}, we envision an unknown, discrete probability distribution *P* as the prior for the β_{i} and model *P* with a DP, such that

β_{i} | *P* ∼ *P*,  *P* ∼ DP(*M* · *P*_{0}),  *P*_{0} = MVN_{D}(0, Σ_{0}),
Σ_{0} ∼ Inverse-Wishart(*a*_{0}, *R*_{0}),  σ_{d}^{−2} ∼ Gamma(*c*_{d}, *s*_{d}),  α_{(q)} ∼ MVN_{D}(η_{q}, *V*_{q}),     (2)

where *P*_{0} is the mean of the DP generating *P* with concentration parameter *M*. We fix *M*, *a*_{0}, *R*_{0}, *c*_{d}, *s*_{d}, η=(η_{0},…, η_{Q}) and *V* = (*V*_{0},…, *V*_{Q}) as a priori known constants of the hyper-prior. We estimate the unknown hyper-parameters, ϕ=(α, β, Σ_{0}, Σ_{1}) as part of the model. A fixed value ϕ_{0} of ϕ is assumed to yield the priors used in the individual analyses, so that [μ_{i}]=[μ_{i}|ϕ = ϕ_{0}].

The public domain software is used to fit the individual models [*Y*_{i}|μ_{i}, θ_{i}, *x*_{i}] with prior [μ_{i}|ϕ=ϕ_{0}] [θ_{i}|μ_{i}]. This software in turn produces posterior samples from [μ_{i}, θ_{i}|*Y*_{i}]. This model and its resulting posterior are complex functions of μ_{i}, and the posterior has no closed-form algebraic representation. We run each individual model *i* = 1,…, *K* + *N*, producing an MCMC sample from each posterior.

We do not wish to make the huge investment that would be necessary to reprogram this model and combine it with our hierarchical models (1) and (2). Furthermore, fitting a single individual model can take hours if not days; rerunning the *K* + *N* individual models jointly along with our hierarchical model could well push past the limits of computational tractability. In the not uncommon situation where all *K* + *N* analyses are of interest, for example when all sets of sequences come from the same lab, we would not want to rerun all the individual analyses, wasting the computing time already spent on each analysis. Our divide-and-conquer approach breaks the problem into the *K* + *N* individual models plus one hierarchical model that reuses the results from the individual models, yielding more manageable programming and computational effort.

To combine the results of the *K* + *N* individual posteriors, a conventional Gibbs (Gelfand and Smith, 1990) or Metropolis–Hastings (Tierney, 1994) sampling algorithm is not possible. Instead, Liang and Weiss (2007) propose DyIRMA to replace the difficult sampling of μ_{i} with a weighted sampling of the MCMC realizations of μ_{i} from the individual posteriors, eliminating the need for redundant reprogramming, and the computational cost of rerunning the original models.

Let [μ_{i}|ϕ_{0}] and [θ_{i}|μ_{i}] be the prior densities of μ_{i} and θ_{i} from the individual analyses. The densities [μ_{i}, θ_{i}|*Y*_{i}, ϕ_{0}] and [μ_{i}|*Y*_{i}, ϕ_{0}] are the joint and marginal individual posterior densities for (μ_{i}, θ_{i}) and μ_{i}, respectively. Let [μ_{i}|ϕ] denote the prior density for μ_{i} from our hierarchical model; we refer to this as the *hierarchical prior* for μ_{i}. In the hierarchical model, [μ_{i}|*Y*] and [μ|*Y*] are the *hierarchical posteriors* for μ_{i} and μ; for *i* = 1,…, *K*, these are the quantities that we aim to estimate.

Integrating each joint posterior density [μ_{i}, θ_{i}|*Y*_{i}, ϕ_{0}] over its nuisance parameter vector θ_{i} yields the individual marginal posterior [μ_{i}|*Y*_{i}, ϕ_{0}]. Dividing this density by the prior distribution [μ_{i}|ϕ_{0}] used in the individual model and multiplying by the proposed hierarchical prior density [μ_{i}|ϕ] gives the contribution of each individual problem to our posterior. We multiply these individual contributions together and then multiply them by the hierarchical prior density [ϕ]. The resulting hierarchical posterior is

[μ, ϕ | *Y*] ∝ [ϕ] ∏_{i=1}^{K+N} [μ_{i} | *Y*_{i}, ϕ_{0}] × [μ_{i} | ϕ] / [μ_{i} | ϕ_{0}].     (3)

The ratio

*w*^{*}_{ϕ_{0}}(μ_{i}, ϕ) = [μ_{i} | ϕ] / [μ_{i} | ϕ_{0}]     (4)

is a weight function; hence, conditional on ϕ, the posterior of μ_{i} under the hierarchical model is a reweighting by (4) of the original marginal individual posterior [μ_{i}|*Y*_{i}]. Unconditional on ϕ, this statement is still true: [μ_{i}|*Y*] is a reweighting of [μ_{i}|*Y*_{i}].

The conventional Gibbs algorithm simulates the joint posterior distribution [μ, ϕ|*Y*] by sampling μ_{1},…, μ_{K+N} and ϕ sequentially from their full conditional distributions [μ_{i}|*Y*, ϕ, μ_{(i)}]=[μ_{i}|*Y*_{i}, ϕ], where μ_{(i)} is μ without the *i*-th component, and [ϕ|*Y*, μ]. Instead, Equation (3) suggests the DyIRMA framework. We update μ_{i} by drawing a realization from the pregenerated MCMC samples of [μ_{i}|*Y*_{i}, ϕ_{0}] weighted by *w*^{*}_{ϕ0} (μ_{i}, ϕ). The reweighting occurs dynamically inside the Gibbs step and is a function of the current value of ϕ.

In detail, let μ_{i}^{(m)} be the *m*-th MCMC sampled value from [μ_{i}|*Y*_{i}, ϕ_{0}] for *m* = 1,…, *M*_{i}, the number of MCMC samples resulting from the analysis of the dataset *i*. Our algorithm replaces sampling from [μ_{i}|*Y*, ϕ] with sampling from the weighted empirical density

[μ_{i} | *Y*, ϕ] ≈ *W*_{i}^{−1} ∑_{m=1}^{M_{i}} *w*^{*}_{ϕ_{0}}(μ_{i}^{(m)}, ϕ) δ_{μ_{i}^{(m)}}(μ_{i}),     (5)

where δ_{μ_{i}^{(m)}}(·) denotes a point mass at μ_{i}^{(m)}, and *W*_{i} = ∑_{m=1}^{Mi} *w*^{*}_{ϕ0} (μ_{i}^{(m)}, ϕ). For each DyIRMA iteration *t*, we sample μ_{i} for each *i* from (5) given ϕ=ϕ^{(t−1)}. We then generate α^{(t)}, β^{(t)}, Σ_{0}^{(t)} and σ_{d}^{(t)} for *d* = 1,2 in sequential Gibbs steps from their full conditional distributions given μ^{(t)}. The samples drawn for μ_{i}^{(t)} for *i* = 1,…, *K* are the posterior samples for our *K* analyses of scientific interest under the full hierarchical model.
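A minimal sketch of this reweighted resampling step follows. The two prior densities are hypothetical one-dimensional stand-ins for [μ_{i} | ϕ] and [μ_{i} | ϕ_{0}]; in the actual method these are the bivariate hierarchical and default priors:

```python
import numpy as np

def dyirma_draw(samples_i, log_prior_hier, log_prior_default, rng):
    """One DyIRMA update for mu_i: resample the stored individual-analysis
    MCMC draws with weights w* = [mu_i | phi] / [mu_i | phi_0] evaluated at
    the current hyperparameter value phi."""
    logw = log_prior_hier(samples_i) - log_prior_default(samples_i)
    w = np.exp(logw - logw.max())   # stabilize before normalizing
    w /= w.sum()                    # the W_i normalization in (5)
    idx = rng.choice(len(samples_i), p=w)
    return samples_i[idx]

rng = np.random.default_rng(1)
# Pretend stored draws mu_i^(m) from one individual posterior (1-D for brevity)
samples = rng.normal(loc=-4.0, scale=1.0, size=1000)

# Hypothetical log-densities: a hierarchical prior currently concentrated
# near -5, and a diffuse default prior (normalizing constants cancel).
hier = lambda m: -0.5 * ((m + 5.0) / 0.5) ** 2
default = lambda m: -0.5 * ((m + 5.0) / 3.0) ** 2

draw = dyirma_draw(samples, hier, default, rng)
```

Because the weights depend only on prior densities evaluated at stored draws, no likelihood evaluation (and hence no rerun of the individual analysis) is needed inside the Gibbs step.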

We also use our method to generate an informative prior for the parameter of interest μ_{n} for a future analysis with associated covariates *x*_{n}=(*x*_{n1},…, *x*_{nD})′. The prior we want to calculate is the PPD, [μ_{n}|*Y*], for a new dataset *Y*_{n}. Since μ_{n} given both ϕ and its mean *g*(α)+β_{n} is independent of *Y*, we have

[μ_{n} | *Y*] = ∫ [μ_{n} | ϕ, β_{n}] [β_{n} | ϕ] [ϕ | *Y*] dβ_{n} dϕ.     (6)

Because the PPD (6) does not have a closed form, we use a Rao–Blackwell estimate (Gelfand and Smith, 1990) to estimate (6) as

[μ_{n} | *Y*] ≈ *T*^{−1} ∑_{t=1}^{T} [μ_{n} | ϕ^{(t)}, β_{n}^{(t)}],     (7)

where α^{(t)} and ϕ^{(t)} are the *t*-th DyIRMA samples and *T* is the total number of samples after the burn-in period. The random effect β_{n}^{(t)} is new; we draw it from the Pólya urn conditional [β_{n} | ϕ] ∝ *M* *P*_{0}(β_{n}) + ∑_{i=1}^{K+N} δ_{β_{i}}(β_{n}), where δ_{a} is the distribution with point mass at the single point *a* (Blackwell and MacQueen, 1973; Bush and MacEachern, 1996). There is no additional programming effort to compute the posterior sample average (7). All we need is to add one step to the DyIRMA approach: generate β_{n}^{(t)} from [β_{n}|ϕ].

When it is feasible to replace the vague prior for μ_{n} used in the public phylogenetic software with the Rao–Blackwell estimate of the PPD (7), it is useful to approximate the density either parametrically or non-parametrically. The software is then updated to permit a parametric or non-parametric prior. To approximate a parametric form for the target density, we select a parametric family, such as the normal, *t* or Beta family, that is fairly close to the target density. Often visual inspection suffices to identify a satisfactory family. When the target density is clearly different from a known parametric form, we recommend approximating [μ_{n}|*Y*] nonparametrically. A kernel density estimator (KDE) [μ_{n}|*Y*]_{KDE} of [μ_{n}|*Y*] is

[μ_{n} | *Y*]_{KDE} = *T*^{−1} ∑_{t=1}^{T} *K*_{s}(μ_{n} − μ_{n}^{(t)}),     (8)

where *K*_{s}(·) is a density, usually continuous and symmetric and preferably continuously differentiable as well, with scale parameter *s*.

The posterior density of μ_{n} is then [μ_{n}|*Y*_{n}, *Y*] ∝ [*Y*_{n}|μ_{n}][μ_{n}|*Y*]_{KDE}, with the approximated prior (8) used in the public phylogenetic software. In the situation where changing the public phylogenetic software is not feasible, we can of course still update [μ_{n}|*Y*_{n}] using our previously described procedure. If we have already run our software to produce [μ_{n}|*Y*] first, we next run the public domain software to produce MCMC samples μ_{n}^{(m)} from [μ_{n}|*Y*_{n}], *m*=1,…, *M*_{n}. Then, since [μ_{n}|*Y*_{n}, *Y*] ∝ [μ_{n}|*Y*_{n}][μ_{n}|*Y*]_{KDE}([μ_{n}|ϕ_{0}])^{−1}, where [μ_{n}|ϕ_{0}] is the prior used in the public domain software, we weight μ_{n}^{(m)} with weights *w*^{(m)}=[μ_{n}^{(m)}|*Y*]_{KDE}([μ_{n}^{(m)}|ϕ_{0}])^{−1}, giving a weighted posterior sample from [μ_{n}|*Y*_{n}, *Y*].
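This reweighting of a default-prior analysis by a KDE-approximated hierarchical prior can be sketched as follows; the densities and sample values are hypothetical placeholders, not output from bali-phy:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(2)

# Hypothetical draws mu_n^(t) from the hierarchical PPD [mu_n | Y]
ppd_draws = rng.normal(-5.0, 0.6, size=2000)
kde = gaussian_kde(ppd_draws)           # the KDE approximation [mu_n | Y]_KDE

# Hypothetical draws mu_n^(m) from the default-prior analysis [mu_n | Y_n]
post_draws = rng.normal(-4.5, 1.0, size=1000)

def default_prior_pdf(m):
    # Hypothetical stand-in for [mu_n | phi_0]; here a diffuse normal
    return np.exp(-0.5 * ((m + 5.0) / 3.0) ** 2) / (3.0 * np.sqrt(2 * np.pi))

# w^(m) = [mu_n^(m) | Y]_KDE / [mu_n^(m) | phi_0], then normalize
w = kde(post_draws) / default_prior_pdf(post_draws)
w /= w.sum()

updated_mean = np.sum(w * post_draws)   # estimate of E(mu_n | Y_n, Y)
```

The weighted sample {μ_{n}^{(m)}, *w*^{(m)}} can then be summarized (means, variances, quantiles) exactly as an ordinary MCMC sample would be.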

We discuss model comparisons in terms of the new dataset *Y*_{n} with parameter μ_{n}. We wish to compare the posteriors [μ_{n}|*Y*_{n}] and [μ_{n}|*Y*, *Y*_{n}] resulting from model *M*_{0} with the default prior and model *M*_{h} with the hierarchical prior, respectively. We propose two methods to evaluate which prior model provides better estimation. The first employs a Bayes factor (BF). The BF in favor of *M*_{h} against *M*_{0} for the single dataset *Y*_{n} is

BF_{H0} = ∫ [*Y*_{n} | μ_{n}] [μ_{n} | *Y*] dμ_{n} / ∫ [*Y*_{n} | μ_{n}] [μ_{n} | ϕ_{0}] dμ_{n} = ∫ ([μ_{n} | *Y*] / [μ_{n} | ϕ_{0}]) [μ_{n} | *Y*_{n}, ϕ_{0}] dμ_{n},     (9)

when using the same priors for θ_{n} in both models. Conveniently, we compute the Monte Carlo approximation of BF_{H0} using the *M*_{n} posterior samples μ_{n}^{(m)} from the default-prior analysis [μ_{n}|*Y*_{n}].
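A sketch of this Monte Carlo BF approximation, averaging the prior ratio over default-prior posterior draws; both prior densities are hypothetical stand-ins:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical posterior draws mu_n^(m) under the default-prior model
post_draws = rng.normal(-4.5, 1.0, size=5000)

def hier_prior_pdf(m):
    # Hypothetical empirical/KDE prior [mu_n | Y], sketched as a normal
    return np.exp(-0.5 * ((m + 5.0) / 0.6) ** 2) / (0.6 * np.sqrt(2 * np.pi))

def default_prior_pdf(m):
    # Hypothetical default prior [mu_n | phi_0], a diffuse normal
    return np.exp(-0.5 * ((m + 5.0) / 3.0) ** 2) / (3.0 * np.sqrt(2 * np.pi))

# BF_H0 ~ (1/M_n) sum_m [mu_n^(m) | Y] / [mu_n^(m) | phi_0]
bf = np.mean(hier_prior_pdf(post_draws) / default_prior_pdf(post_draws))
log_bf = np.log(bf)
```

Because the estimator reuses the existing default-prior MCMC output, no additional model fitting is required to compare the two priors.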

In addition to BFs, we propose using the posterior expected loss (PEL) to compare the two models. Consider an expanded mixture model combining both *M*_{0} and *M*_{h} where the prior is a convex mixture *v*_{0} × [μ_{n}|ϕ=ϕ_{0}]+(1 − *v*_{0}) ×[μ_{n}|*Y*], of the public domain software prior [μ_{n}|ϕ=ϕ_{0}] and the hierarchical prior [μ_{n}|*Y*] for some choice of 0≤*v*_{0}≤1. The mixture model assumes that one of the two priors is the correct prior; the parameter *v*_{0} can be interpreted as the prior probability that the public domain software prior is correct. This is a family of models indexed by *v*_{0}.

The resulting posterior density is a mixture *v* ×[μ_{n} |*Y*_{n}, ϕ=ϕ_{0}]+(1 − *v*) ×[μ_{n}|*Y*_{n}, *Y*], where 0≤*v*≤1 is the posterior probability of the model implemented in the public domain software. The posterior probability *v* is a monotone non-decreasing function of *v*_{0}, and can be calculated as part of the analysis. The posterior probability *v*=0 when *v*_{0}=0 and *v*=1 when *v*_{0}=1. The posterior mean *E*_{v}(μ_{n}|*Y*_{n})=*v* × *E*_{0}(μ_{n}|*Y*_{n}) + (1 − *v*) × *E*_{H}(μ_{n}|*Y*_{n}), and the PEL_{v} of the posterior mean under squared error loss is, as a function of *v*,

PEL_{v} = *v* *V*_{0}(μ_{n}|*Y*_{n}) + (1 − *v*) *V*_{h}(μ_{n}|*Y*_{n}) + *v*(1 − *v*) {*E*_{0}(μ_{n}|*Y*_{n}) − *E*_{H}(μ_{n}|*Y*_{n})}^{2},     (10)

where *V*_{0}(μ_{n}|*Y*_{n}) and *V*_{h}(μ_{n}|*Y*_{n}) are posterior variances under the models *M*_{0} and *M*_{h}, respectively.

The PEL_{v} equals *V*_{h}(μ_{n}|*Y*_{n}) if *v*_{0} = 0, and *V*_{0}(μ_{n}|*Y*_{n}) if *v*_{0} = 1. For *v*_{0} in the interval (0, 1), the PEL_{v} is always greater than the smaller of *V*_{h}(μ_{n}|*Y*_{n}) and *V*_{0}(μ_{n}|*Y*_{n}). Therefore, under the PEL criterion, the best model in the family indexed by *v*_{0} occurs either at *v*_{0} = *v* = 0 or at *v*_{0} = *v* = 1, and we should prefer the model with the smaller posterior variance. In our examples, the hierarchical model usually returns a smaller posterior variance and therefore a smaller PEL. If the *N* additional datasets are selected appropriately, the posterior resulting from the hierarchical model will be preferable to the individual posteriors for our *K* datasets of interest.
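The endpoint behavior of PEL_{v} described here is easy to verify numerically; the posterior means and variances below are hypothetical illustrative values:

```python
def pel(v, e0, eh, var0, varh):
    """Posterior expected loss of the mixture posterior mean under squared
    error loss, as a function of the posterior model probability v:
    v*V_0 + (1-v)*V_h + v*(1-v)*(E_0 - E_H)^2."""
    return v * var0 + (1 - v) * varh + v * (1 - v) * (e0 - eh) ** 2

# Hypothetical posterior summaries under M_0 (default) and M_h (hierarchical)
e0, var0 = -4.5, 1.00
eh, varh = -4.8, 0.40

# Endpoints recover the single-model variances ...
assert pel(0.0, e0, eh, var0, varh) == varh
assert pel(1.0, e0, eh, var0, varh) == var0
# ... while any interior v exceeds the smaller of the two variances
assert pel(0.5, e0, eh, var0, varh) > min(var0, varh)
```

The interior penalty term *v*(1 − *v*)(*E*_{0} − *E*_{H})² is what forces the optimum to one of the endpoints, which is why the comparison reduces to choosing the model with the smaller posterior variance.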

**4 RESULTS**

We illustrate our three-step approach by first collecting prior information from the 82 BAliBASE datasets to help us to examine the indel process in play with enolase gene evolution. We use bali-phy to fit the individual analyses. The default priors on the log indel opening rate μ_{i1} and log indel extension μ_{i2} are independent. The default prior for μ_{i1} is a double exponential distribution, (2c)^{−1}exp(−|(μ_{i1} − *b*)/*c*|), *c* > 0, with mean *b* = −5 and SD √2 *c*. The default prior for ψ=exp(μ_{i2}) is an exponential distribution, *d*^{−1} exp(−*d*^{−1}ψ), with mean *d* = 10.

First, we preprocess the 82 datasets to meet the input requirements for bali-phy and then generate the individual bivariate posteriors [μ_{i}|*Y*_{i}]. For each analysis, we ran the MCMC sampler for 50 000 iterations and discarded the first 10 000 samples as burn-in. We store every 40th sample, yielding a total of 1000 realizations for each individual analysis. Then we run our *K* = 7 datasets of interest with the same sampling scheme. We use the samples from the independent analyses as input to our hierarchical model. Section 4.1 discusses the results from combining just the *N* prior analyses in a joint model. Section 4.2 presents the results of implementing a module for our empirical priors into bali-phy and using this prior to inform our examination of the enolase datasets.
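The default priors and the thinning scheme can be sketched numerically. The Laplace scale *c* is not stated above, so the value used here is an assumption for illustration only:

```python
import numpy as np

rng = np.random.default_rng(4)

# bali-phy-style default priors (sketch): double exponential (Laplace) on
# the log indel opening rate mu1, and exponential with mean d = 10 on
# psi = exp(mu2). The Laplace scale c = 1.0 is an assumed value.
b, c, d = -5.0, 1.0, 10.0
mu1 = rng.laplace(loc=b, scale=c, size=100_000)
mu2 = np.log(rng.exponential(scale=d, size=100_000))

# Thinning scheme used for each individual analysis:
total_iters, burn_in, thin = 50_000, 10_000, 40
kept = (total_iters - burn_in) // thin   # 1000 stored realizations
```

Note that (50 000 − 10 000)/40 = 1000, matching the stated number of stored realizations per individual analysis.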

We fit our hierarchical model with two dataset-specific covariates, alignment length (short, medium and long) and percent identity (low, medium and high), to combine the 82 individual prior analyses. We set the hierarchical hyper-prior constants as follows. Hyper-prior mean η_{1}=(−5, 2.3) and covariance matrix *V*_{0} = diag{4, 4}, matching the prior means and variances for μ_{i1} and μ_{i2} used in bali-phy. The covariate-coefficient hyper-priors η_{q} = (0, 0) and *V*_{q} = diag{1, 1}, *q* = 2,…, 5, are proper but relatively uninformative. The inverse covariance matrix Σ^{−1}_{1} is a diagonal matrix with elements σ^{−2}_{d}, *d* = 1,2, that are gamma-distributed with mean 5, which is approximately the range of the individual posterior samples of μ_{i}, and variance 25, so that the prior mean is equal to the prior SD. For Σ_{0}, *a*_{0} = 2 and *R*_{0} = diag{5,5} were chosen such that *E*(Σ_{0}) is ~30% larger than the variance of the individual posterior means of μ_{i}. Finally, we set the DP concentration parameter *M* = 5 using a formula suggested by Liu (1996) and generate 50 000 DyIRMA iterations with the first 5000 iterations discarded as burn-in.

Figure 1a shows a bivariate scatter plot of the 82 posterior means of μ_{i1} versus μ_{i2} from our hierarchical model. Covariate labels open circle, open triangle and cross represent short-, medium- and long-length datasets, respectively. The horizontal line identifies the log expected indel length μ_{i2} = 0. Almost all the open circles are below the zero line, whereas most of the cross marks are above the zero line. This indicates that the log expected length of the indels is increasing as the length of alignment increases. The posterior mean of the log rate of an indel opening may be decreasing with increasing alignment length.

**Fig. 1.** (**a**) Scatter plot of the posterior mean of μ_{i1} (log indel opening rate) against the posterior mean of μ_{i2} (log expected indel length) from our hierarchical model; open circles (○), open triangles (△) and cross marks (×) denote short-, medium- and long-length datasets, respectively.

Figure 1b and c presents boxplots of percent variance reduction from the individual to hierarchical posteriors of μ_{i1} and μ_{i2}, respectively. The variances of the hierarchical posteriors are reassuringly smaller than those of the individual posteriors; the average percent reduction is 16% (range: 0.7–54%) and 17% (range: 0.4–70%) for μ_{i1} and μ_{i2}, respectively. We observed larger variance reduction for the short-length datasets (clear boxes) as compared with the other two length groups for both indel parameters. This demonstrates that smaller datasets successfully borrow more strength to accommodate their sparse information content. Further, Figure 1c demonstrates that for short- and medium-length datasets, higher percent identity leads to greater variance reduction in estimating μ_{i2}, also understandable because information content approaches 0 for identical sequences. The hierarchical model improves the individual inferences by borrowing strength across datasets, leading to more precise inferences.

It is possible to directly implement our hierarchical priors for the indel parameters in bali-phy. Our priors are constructed from the 82 individual posteriors, approximating the Rao–Blackwellized PPD estimate with a mixture of normal distributions. We added a module for the newly generated empirical priors to the bali-phy software, replacing the original priors with this mixture. We then use the empirical priors to separately analyze the seven enolase datasets *Y*_{k}, where *k* = 1,…, 7, and assess the relative performance of the new priors compared with the defaults.

To assess relative performance, we compare the posteriors from the models *M*_{0} and *M*_{h} for each of the seven datasets. For a representative dataset, number 6, Figure 2a presents the posterior kernel density (thick dashed line) for μ_{k1} based on the default prior density used in bali-phy (dash-dotted line), and the posterior kernel density curve (thick solid line) based on the prior density generated using our hierarchical model (dotted line). Similarly, Figure 2b presents prior and posterior densities for the indel process parameter μ_{k2}. The variances of the posterior density curves from *M*_{h} shown in Figure 2a and b are appreciably smaller than those from *M*_{0}.

(**a** and **b**) Posterior kernel density curve (thick dashed line) based on the default prior (dash-dotted line) and posterior kernel density curve (thick solid line) based on the empirical prior (dotted line) for (a) μ_{k1} and (b) μ_{k2}, where **...**

Figure 2c and d presents side-by-side boxplots of the seven posterior densities of μ_{k1} and μ_{k2}, respectively. The left-hand (clear) boxplot of each pair represents the posterior density from *M*_{0} and the right-hand (solid) boxplot represents the posterior density from *M*_{h}. The solid boxes are markedly shorter than the clear boxes, indicating that the posterior density from *M*_{h} is more precise than that from *M*_{0}. The average variance reduction is 60.1% (range: 37.7–86.4%) and 58.7% (range: 39.7–75.4%) for μ_{k1} and μ_{k2}, respectively. Two datasets (5 and 7) for μ_{k1} and three datasets (2, 3 and 7) for μ_{k2} have unacceptably long left tails when the default priors are used. Additional figures (alignment uncertainty plots and phylogenetic trees) are available in the Supplementary Material.

Table 1 presents log BFs *b*_{k} = log BF_{h0} in favor of the hierarchical model priors over the default priors for each enolase dataset *k*. Four of the seven datasets of interest have log BFs greater than 1, and the largest, *b*_{6}, is 2.3. The sum of the log BFs, ∑*b*_{k}, is not an unreasonable approximation to the global BF for all seven datasets combined; here ∑*b*_{k} > 3.5. Kass and Raftery (1995) suggest that log BFs between 1 and 3 constitute positive evidence in favor of *M*_{h}, log BFs between 3 and 5 strong evidence, and log BFs >5 very strong evidence. We also report posterior variances under both procedures for both parameters in Table 1. The estimates have uniformly smaller standard errors under *M*_{h} than under *M*_{0}.
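Combining per-dataset log BFs and mapping the total onto the Kass–Raftery scale can be sketched as follows. The per-dataset values below are illustrative placeholders chosen to be consistent with the summaries above (four values exceed 1, the largest is *b*_{6} = 2.3, and the sum exceeds 3.5); they are not the actual Table 1 entries.

```python
# Hypothetical per-dataset log Bayes factors b_k favoring the hierarchical
# priors over the defaults; illustrative values only, not those of Table 1.
log_bfs = [-0.5, 1.1, 1.2, -0.8, 1.3, 2.3, -1.0]

def kass_raftery_label(log_bf):
    """Evidence categories of Kass and Raftery (1995) for a log Bayes factor."""
    if log_bf > 5:
        return "very strong"
    if log_bf > 3:
        return "strong"
    if log_bf > 1:
        return "positive"
    return "not worth more than a bare mention"

# Summing per-dataset log BFs approximates the global log BF when the
# datasets are analyzed independently.
total = sum(log_bfs)
print(total, kass_raftery_label(total))
```

Note that the sum is only an approximation to the global BF because it treats the seven analyses as independent under both models.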

This article has introduced a three-step approach using a semiparametric random effects model and DyIRMA to efficiently combine multiple complex Bayesian phylogenetic analyses. Our approach improves the analysis of interest by incorporating additional information from publicly available sequence databases, and it works whether or not the public phylogenetic software itself can be modified. In addition, the estimated PPDs generated using our approach furnish informative priors for the parameters of interest in future analyses of datasets having similar covariates.

We propose two performance evaluation methods for these priors, both of which are easy to compute. We combined each of seven datasets of interest with extensive prior information and then performed two phylogenetic analyses, one using the default priors for the indel process parameters in bali-phy and the other using the empirical prior generated by our approach. Our results show that when data are sparse, careful prior choice becomes valuable and informative priors provide substantially more efficient estimates. Assuming that (i) the new datasets that researchers have are similar to the BAliBASE datasets; (ii) researchers are interested in the indel model using bali-phy or other software; and (iii) the prior model for the indel parameters is the same, the prior generated here can be useful for analyzing their new data.

Our findings should encourage researchers to identify and gather as many sets of comparable sequence data as possible from publicly available sequence databases or research laboratories. Our approach can be applied to generate proper prior distributions for the phylogenetic parameters of interest. The priors can later be incorporated into existing software for use by others. When prior information is available, this helps researchers produce more accurate posterior estimates of the phylogenetic parameters of interest than the default priors do. In cases where empirical priors cannot be incorporated directly into existing software, our approach can still yield better posterior estimates by combining the analysis of interest with additional prior analyses through dynamic reweighting.

This reweighting algorithm requires calculating the weights *w*_{ϕ0}(μ_{i}, ϕ) at each iteration for all individual posterior samples, which can become computationally expensive. Alternatively, we can use a Metropolis–Hastings step (Tierney, 1994), sampling candidates μ_{i}^{(t*)} at random from the unweighted samples. Both options perform best when the samples from the individual posteriors cover the entire posterior support of the μ_{i}. Liang and Weiss (2007) suggest another class of algorithms that approximate each individual posterior using a kernel density estimate (KDE). The KDE provides a smoothed version of the individual posterior, particularly in the tails of the density. All of these DyIRMA extensions show further promise for simulating hierarchical models over already complex phylogenetic models.
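The core of the reweighting idea is to recycle draws from an individual posterior by weighting each draw by the ratio of the current hierarchical-stage conditional prior to the prior used in the original analysis, then resampling according to those weights. The sketch below illustrates this under simplifying assumptions: a one-dimensional parameter, normal priors at both stages, and hypothetical numerical values throughout (the paper's actual weights and conditional priors are multivariate and change with the hyperparameters each MCMC iteration).

```python
import numpy as np

rng = np.random.default_rng(1)

def log_normal_pdf(x, mean, sd):
    """Log density of N(mean, sd^2) at x (vectorized over x)."""
    return -0.5 * np.log(2.0 * np.pi) - np.log(sd) - 0.5 * ((x - mean) / sd) ** 2

# Draws recycled from one individual posterior, originally obtained under a
# vague N(0, 10^2) prior; hypothetical values for illustration.
draws = rng.normal(-4.0, 0.5, size=2000)

# Current hierarchical-stage conditional prior N(phi_mean, phi_sd^2); in the
# dynamic algorithm these hyperparameters are updated every iteration.
phi_mean, phi_sd = -4.2, 1.0

# Importance weights proportional to (new prior) / (old prior), computed on
# the log scale for numerical stability, then normalized.
log_w = log_normal_pdf(draws, phi_mean, phi_sd) - log_normal_pdf(draws, 0.0, 10.0)
w = np.exp(log_w - log_w.max())
w /= w.sum()

# Resample one candidate mu_i for the current iteration according to the weights.
mu_i_t = rng.choice(draws, p=w)
print(mu_i_t)
```

The Metropolis–Hastings alternative mentioned above replaces the weighted resampling with a uniform draw from the recycled samples followed by an accept/reject step, trading per-iteration weight computation for occasional rejections.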

Our approach also provides suggestions to software engineers. Software will be more useful, and calculations easier, if future software engineers allow KDE-type priors as input into their programs and also allow families of proper priors, including, but not limited to, normal, *t* and Beta priors as appropriate for individual parameters. Even if joint priors are restricted to products of independent priors, this will still allow substantially improved inference.
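A KDE-type prior of the kind suggested above is simple to support: the software only needs posterior samples and a bandwidth, from which a prior density can be evaluated anywhere. The sketch below builds a one-dimensional Gaussian KDE by hand; the pooled draws and bandwidth are hypothetical illustrations, not values from the paper.

```python
import numpy as np

def kde_prior(samples, bandwidth):
    """Return a callable prior density: a Gaussian KDE over posterior samples."""
    samples = np.asarray(samples, dtype=float)

    def density(x):
        # Average of Gaussian kernels centered at each sample.
        z = (np.asarray(x, dtype=float) - samples[:, None]) / bandwidth
        return np.exp(-0.5 * z ** 2).mean(axis=0) / (bandwidth * np.sqrt(2.0 * np.pi))

    return density

rng = np.random.default_rng(2)
pooled = rng.normal(-4.0, 0.6, size=1000)  # hypothetical pooled posterior draws
prior = kde_prior(pooled, bandwidth=0.2)

vals = prior(np.array([-4.0, -3.0]))
print(vals)
```

Accepting such a callable (or the samples plus bandwidth behind it) as a prior specification would let users plug in empirical priors like ours without any change to the sampler itself.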

*Funding:* National Institutes of Health (grant GM086887); Biostatistics training grant for AIDS Research (5T32AI007307 to L.-J.L.); UCLA Center for AIDS Research (to R.E.W.); National Institutes of Health (grant AI28697 to R.E.W.); Alfred P. Sloan Research Fellow (M.A.S.); John Simon Guggenheim Memorial Foundation Fellow (M.A.S.).

*Conflict of Interest:* none declared.

- Alfaro M, Holder M. The posterior and the prior in Bayesian phylogenetics. Ann. Rev. Ecol. Evol. Syst. 2006;37:19–42.
- Bapteste E, Philippe H. The potential value of indels as phylogenetic markers: position of trichomonads as a case study. Mol. Biol. Evol. 2002;19:972–977. [PubMed]
- Bedrick EJ, et al. A new perspective on priors for generalized linear models. J. Am. Stat. Assoc. 1996;91:1450–1460.
- Bedrick EJ, et al. Bayesian binomial regression: predicting survival at a trauma center. Am. Stat. 1997;51:211–218.
- Blackwell D, MacQueen JB. Ferguson distributions via Pólya urn schemes. Ann. Stat. 1973;1:353–355.
- Box GEP, Tiao GC. Bayesian Inference in Statistical Analysis. New York: John Wiley & Sons; 1992.
- Bush CA, MacEachern SN. A semiparametric Bayesian model for randomised block designs. Biometrika. 1996;83:275–285.
- Carlin BP, Louis TA. Bayesian Methods for Data Analysis. 3. Boca Raton, FL: Chapman & Hall/CRC; 2008.
- Carlin BP, Louis T. Empirical Bayes: past, present, and future. J. Am. Stat. Assoc. 2000;95:1286–1289.
- Efron B. Empirical Bayes methods for combining likelihoods. J. Am. Stat. Assoc. 1996;91:538–550.
- Escobar MD, West M. Bayesian density estimation and inference using mixtures. J. Am. Stat. Assoc. 1995;90:577–588.
- Gelfand AE, Smith AFM. Sampling-based approaches to calculating marginal densities. J. Am. Stat. Assoc. 1990;85:398–409.
- Gelman A, et al. Bayesian Data Analysis. 2. Boca Raton, FL: Chapman & Hall/CRC; 2003.
- Huelsenbeck JP, Ronquist F. MrBayes: Bayesian inference of phylogenetic trees. Bioinformatics. 2001;17:754–755. [PubMed]
- Kass RE, Raftery AE. Bayes factors. J. Am. Stat. Assoc. 1995;90:773–795.
- Kolaczkowski B, Thornton J. Effect of branch length uncertainty on posterior probabilities for phylogenetic hypotheses. Mol. Biol. Evol. 2007;24:2108–2118. [PubMed]
- Lake J. The order of sequence alignment can bias the selection of tree topology. Mol. Biol. Evol. 1991;8:378–385. [PubMed]
- Liang L, Weiss R. A hierarchical semi-parametric regression model for combining HIV-1 phylogenetic analyses using iterative reweighting algorithms. Biometrics. 2007;63:733–741. [PubMed]
- Liu J. Nonparametric hierarchical Bayes via sequential imputations. Ann. Stat. 1996;24:911–930.
- Morris C. Parametric empirical Bayes inference: theory and application. J. Am. Stat. Assoc. 1983;78:47–59.
- Rannala B. Identifiability of parameters in MCMC Bayesian inference of phylogeny. Syst. Biol. 2002;51:754–760. [PubMed]
- Redelings B, Suchard M. Joint Bayesian estimation of alignment and phylogeny. Syst. Biol. 2005;54:401–418. [PubMed]
- Redelings B, Suchard M. Incorporating indel information into phylogeny estimation for rapidly emerging pathogens. BMC Evol. Biol. 2007;7:40. [PMC free article] [PubMed]
- Robbins H. Proceedings of the Third Berkeley Symposium on Mathematical Statistics. Vol. 1. Berkeley: University of California Press; 1956. An empirical Bayes approach to statistics; pp. 157–163.
- Sethuraman J. A constructive definition of Dirichlet priors. Stat. Sin. 1994;4:639–650.
- Suchard M, Redelings B. BALI-Phy: simultaneous Bayesian inference of alignment and phylogeny. Bioinformatics. 2006;22:2047–2048. [PubMed]
- Thompson J, et al. BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs. Bioinformatics. 1999;15:87–88. [PubMed]
- Tierney L. Markov chains for exploring posterior distributions. Ann. Stat. 1994;22:1701–1762.
- Yang Z, Rannala B. Branch-length prior influences Bayesian posterior probability of phylogeny. Syst. Biol. 2005;54:455–470. [PubMed]
- Zwickl D, Holder M. Model parameterization, prior distributions and the general time-reversible model in Bayesian phylogenetics. Syst. Biol. 2004;54:961–965. [PubMed]

Articles from Bioinformatics are provided here courtesy of **Oxford University Press**
