The inference of selection from molecular data, both along a sequence (Nielsen and Yang 1998
, Suzuki and Gojobori 1999
, Yang et al. 2000
) and over the evolutionary tree (Yang and Nielsen 2002
, Kosakovsky Pond and Frost 2005a), has been an area of active research and unrelenting debate (Suzuki and Nei 2004
, Wong et al. 2004, Nozawa et al. 2009). Selective pressures can vary over both sites and time, resulting in bursts of selection localized to a subset of sites and a small number of lineages, for example, Messier and Stewart (1997)
A class of methods, termed “branch-site” tests (Yang and Nielsen 2002
), was the first to offer a model-based phylogenetic hypothesis testing framework for deciding whether or not a lineage (or lineages) of interest had undergone adaptive change. Branch-site tests measure selective pressure by ω
, the ratio of nonsynonymous (β
) to synonymous (α
) substitution rates, and if a proportion of sites in the sequence provides statistically significant support for ω
> 1 along the lineages of interest, then episodic positive selection is inferred. The original formulation of the method suffered from high rates of false positives when the model assumptions were violated (Zhang 2004
) because the model could misidentify relaxed selective constraints as evidence of diversifying selection and was subsequently revised to address that shortcoming (Zhang et al. 2005). Typically, the lineages to be tested (“foreground” lineages) were specified a priori, until a recent extension outlined and benchmarked a sequential testing approach to examine whether any single lineage was under selection (Anisimova and Yang 2007
). These branch-site methods have been used extensively, with well over 1,000 citations to date, highlighting the interest of the evolutionary community in being able to identify instances of episodic selection. Alternative approaches to capturing variable selective pressures include the covarion models of Guindon et al. (2004) and a full Bayesian treatment in the framework of Rodrigue et al. (2010).
In the context of codon evolutionary models, the selective profile of site Ds in a multiple sequence alignment can be characterized by the collection of branch-specific ω values, (ω1,…,ωB), denoted Ωs, where B equals the total number of branches in the phylogeny. Existing branch-site models use three alignment-wide (i.e., shared by all sites) ratios ω− < ωN = 1 ≤ ω+ to model strong conservation, neutral evolution, and diversifying selection, respectively. Assuming these three ω ratios () with no further restrictions, each site can follow one of 3B possible selective profiles—the number of different ways to assign the B branches to the three different selection rate bins. However, it is unclear how to determine which of these selective profiles or, equivalently, assignments of branches to selection rate bins is the most appropriate at a given site.
FIG. 1. An illustration of episodic selection profiles at a single site with three possible regimes: negative, neutral (or nearly neutral), and diversifying selection along a branch. Panel (A) depicts the phylogeny used for discussion in the text and to carry (more ...)
One approach (Yang and Nielsen 2002
) is to model each site using only four predefined profiles regardless of the size of the phylogeny. More specifically, 1) every branch belongs either to the a priori known foreground class, which is allowed to experience diversifying selection, or the “background” class, which evolves under purifying selection or neutrally and 2) at a given site, there is no variation in selection strength (ω
) among background branches, with all foreground branches either sharing the selection strength of the background or being under shared diversifying selection (). Clearly, these options are not exhaustive: For example, neither variable strength of selection among background or foreground nor positive selection along background branches is allowed. We refer to this approach as the restricted branch-site (rBS) model because the number of selective profiles is limited to the four a priori defined scenarios. Given a 4-taxon tree (), and three selection parameters (as in ), there are 35
= 243 possible selection configurations, only four of which are accounted for by the branch-site model. The number of ω
configurations grows as KB
, where K
is the number of rate classes, thus making it unlikely that any four selection profiles chosen a priori are going to be sufficiently representative. Because there are no compelling biological reasons to expect that any two branches in the phylogenetic tree will have the same ω
at any given site, we do not expect these four predefined selective profiles to provide an adequate description of complex biological data. This model was likely motivated by the need to avoid overfitting in the case of small sample sizes; however, we argue that if branches with differing selective pressures are incorrectly assigned to the same class, likelihood ratio test (LRT)-based branch-site methods can be positively misleading. In this manuscript, we present one case where they falsely identify positive selection on a neutrally evolving lineage (Type I or false positive error), and another where they fail to detect positive selection on a lineage with ω
> 1 (Type II or false negative error). In addition, if several branches are claimed to be under positive selection by setting the foreground to one branch at a time, as is done by the sequential testing procedure of Anisimova and Yang (2007)
, this creates a logical inconsistency—when a branch is found to be under selection, the model under which this was established implies that no other branch could be under positive selection.
We introduce a new class of models in which substitution rates may vary from branch to branch and from site to site. We incorporate this variation via “random effects”—unobserved strengths of selection at sites and branches are incorporated using a discrete or a discretized parametric probability distribution. Parameters defining the distribution are estimated jointly from all sites using maximum likelihood. Random effects likelihood (REL) and complementary fixed effects likelihood (FEL) models are standard tools in statistical modeling. Both types of model have been used to allow sitewise rate variation in phylogenetic models—see Yang:1994kx, where rates over sites in a nucleotide alignment followed a discretized unit-mean gamma distribution (the now ubiquitous + Γ4
model). Nielsen and Yang (1998)
and Yang et al. (2000)
applied REL models to codon data in order to identify signatures of natural selection, whereas Kosakovsky Pond and Frost (2005b) and Massingham and Goldman (2005)
used FEL models for the same purpose. For all these models, likelihoods of individual sites are computed by Felsenstein's pruning algorithm (Felsenstein 1981
). However, as we show later, the direct application of the pruning algorithm is intractable for REL models with branchwise as well as sitewise rate variation. It is presumably for this reason that, to date, branch models (Yang 1998
, Kosakovsky Pond and Frost 2005a) have only been implemented in the FEL framework and branch-site models only as a four-category sitewise REL model. Our solution involves a simple extension of the pruning algorithm which makes it feasible to implement not only the model proposed here but also several other branch-site REL models.
The extended pruning algorithm computes the likelihood of each site, treating the selection site profile Ωs
as an unobserved variable, under the assumption that the probability of observing a substitution rate at a branch is independent of all other branches. Computationally, our algorithm is equivalent to replacing the standard Markov evolutionary model at a single phylogenetic branch with a mixture of three Markov models (one each for ω−
), where the mixing coefficients and ω
rates are inferred for each branch along with branch lengths, nucleotide substitution biases, and other alignment-wide parameters. Just like existing branch-site methods (Anisimova and Yang 2007
), we use sequential likelihood ratio testing to identify which branches support a model with episodic diversifying selection. Unlike existing methods, however, our approach is unrestricted and considers every possible site profile, thus avoiding some of the prominent issues posed by model misspecification and further allows ω
rates to vary independently from branch to branch and site to site.
Using an extensive collection of simulated sequences from Anisimova and Yang (2007)
, we perform a direct comparison of the unrestricted branch-site (uBS) model with the existing, restricted, approach (rBS) to evaluate Type I error and power. We also reinvestigate three empirical data sets that had been previously analyzed with the standard or sequential branch-site method and discover that many, but not all, of the original inferences are supported by our mixture model. Lastly, we report selective episodes not previously detected.