PMCCPMCCPMCC

Search tips
Search criteria 

Advanced

 
Logo of ploscompComputational BiologyView this ArticleSubmit to PLoSGet E-mail AlertsContact UsPublic Library of Science (PLoS)
 
PLoS Comput Biol. 2010 August; 6(8): e1000885.
Published online 2010 August 19. doi:  10.1371/journal.pcbi.1000885
PMCID: PMC2924240

CodonTest: Modeling Amino Acid Substitution Preferences in Coding Sequences

Wen-Hsiung Li, Editor

Abstract

Codon models of evolution have facilitated the interpretation of selective forces operating on genomes. These models, however, assume a single rate of non-synonymous substitution irrespective of the nature of amino acids being exchanged. Recent developments have shown that models which allow for amino acid pairs to have independent rates of substitution offer improved fit over single rate models. However, these approaches have been limited by the necessity for large alignments in their estimation. An alternative approach is to assume that substitution rates between amino acid pairs can be subdivided into An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e001.jpg rate classes, dependent on the information content of the alignment. However, given the combinatorially large number of such models, an efficient model search strategy is needed. Here we develop a Genetic Algorithm (GA) method for the estimation of such models. A GA is used to assign amino acid substitution pairs to a series of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e002.jpg rate classes, where An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e003.jpg is estimated from the alignment. Other parameters of the phylogenetic Markov model, including substitution rates, character frequencies and branch lengths are estimated using standard maximum likelihood optimization procedures. We apply the GA to empirical alignments and show improved model fit over existing models of codon evolution. Our results suggest that current models are poor approximations of protein evolution and thus gene and organism specific multi-rate models that incorporate amino acid substitution biases are preferred. We further anticipate that the clustering of amino acid substitution rates into classes will be biologically informative, such that genes with similar functions exhibit similar clustering, and hence this clustering will be useful for the evolutionary fingerprinting of genes.

Author Summary

Evolution in protein-coding DNA sequences can be modeled at three levels: nucleotides, amino acids or codons that encode the amino acids. Codon models incorporate nucleotide and amino acid information, and allow the estimation of the rate at which amino acids are replaced (An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e004.jpg) versus the rate at which they are preserved (An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e005.jpg). The An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e006.jpg ratio has been used in thousands of studies to detect molecular footprints of natural selection. A serious limitation of most codon models is the unrealistic assumption that all non-synonymous substitutions occur at the same rate. Indeed, amino acid models have consistently demonstrated that different residues are exchanged more or less frequently, depending on incompletely understood factors. We derive and validate a computational approach for inferring codon models which combine the power to investigate natural selection with data-driven amino acid substitution biases from alignments. The addition of amino acid properties can lead to more powerful and accurate methods for studying natural selection and the evolutionary history of protein-coding sequences. The pattern of amino acid substitutions specific to a given alignment can be used to compare and contrast the evolutionary properties of different genes, providing an evolutionary analog to protein family comparisons.

Introduction

Modern molecular evolution has benefited greatly from the development of a sound probabilistic framework for modeling the evolution of homologous gene sequences [1]. In particular, codon substitution models [2], [3] have facilitated the estimation of the ratio of non-synonymous to synonymous substitution rates (referred to as An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e007.jpg), which can be interpreted as an indicator of the strength and type of natural selection (see [4] or [5] for recent reviews). Codon models are fundamentally mechanistic because they use the structure of the genetic code to partition codon substitutions into classes. Initially, and in most subsequent applications of codon models, all one-nucleotide substitutions were stratified into synonymous (rate An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e008.jpg, using the notation of [2]) and non-synonymous (rate An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e009.jpg) classes. Despite several early attempts, e.g. [3], none of the widely-adopted codon models incorporated physicochemical properties of the two residues being exchanged. In contrast, most protein substitution models are derived by estimating the relative rates of amino-acid substitutions in large protein databases [6][8], and consistently report dramatic differences in the relative replacement rates of different residues.

The persisting dissonance between how codon and protein models approach amino acid substitution rates has fostered multiple recent efforts to develop what we will call multi-rate codon models (or more accurately, multi- nonsynonymous rate models), in contrast to the existing single-rate model. These models divide amino acid pairs (or codon pairs) into multiple rate categories, such that every category has its own rate which governs substitutions between the pairs in that category. In the most extreme case, every amino acid or codon pair belongs to a different category and thus has its own rate – potentially leading to a very large number of parameters that need to be estimated. Several strategies have been proposed for limiting the number of parameters in multi-rate models.

Doron-Faigenboim et al. [9] proposed to overlay existing empirically derived amino acid substitution matrices (e.g. [7] or [8]) onto single-rate codon models by weighted partitioning of the empirical rate of substitution between two protein residues. Kosiol, Holmes & Goldman [10] directly estimated all An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e010.jpg codon-to-codon substitution rates in an empirical codon model – a codon equivalent of the nucleotide GTR model [11], assuming the universal genetic code. However, this effort required a truly massive training dataset encompassing alignments from An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e011.jpg protein families of the Pandit database [12]. The resulting empirical codon model (ECM) encodes evolution patterns averaged over many proteins. However, no single empirically-derived substitution rate matrix appears to be generalizable across multiple genes and taxonomic groups, as evidenced by a plethora of specialized substitution models, e.g. for mammalian mitochondrial genomes [13], plant chloroplast genes [14], viral reverse transcriptases [15] or HIV-1 genes [16].

More mechanistic parameters can be introduced to improve biological realism of codon-models. The linear combination of amino acid properties (LCAP) model [17] expresses exchangeability of a pair of codons as an (exponentiated) linear combination of differences in five independently validated amino acid physicochemical properties. This parameterization incorporates weighting (or importance) coefficients inferred from the data to allow for differences in protein evolution between genes, shown to be significant and biologically meaningful in yeast proteins [18], and once again underscoring the utility of gene-specific evolutionary models.

All multi-rate codon models published to date have shown clear improvements in model fit over the single-rate model. However, multi-rate models in which substitutions were randomly assigned to classes easily outperform the single-rate model [19] and thus it is a poor performance benchmark. At the other extreme of model space is the full time-reversible codon model, with An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e012.jpg parameters (or An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e013.jpg, if only single nucleotide substitutions are modeled), which will certainly suffer from massive over-fitting on single gene alignments. Over-parameterization can be reduced by “smoothing”, i.e. by grouping the rates into exchangeability classes based on the physicochemical properties of amino acids [20]. However, without a rigorous model selection framework, it is difficult to ascertain how well any particular smoothing approach fits the data. To appreciate how large the space of potential models is, consider that there are are approximately An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e014.jpg possible multi-rate codon models with An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e015.jpg nonsynonymous rate classes, and approximately An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e016.jpg possible models for An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e017.jpg. Given such a large search space it is impossible to evaluate even a small fraction of possible models exhaustively, and one cannot presume that any given model or a small set of models are sufficiently representative without exploring the alternatives.

Huelsenbeck et al. [21] examined a Bayesian approach to estimate empirical amino acid substitution models in which amino acid exchangeability classes are assigned using a Dirichlet process. However, a prior distribution needs to be specified for the number of classes (An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e018.jpg = 2, 5, or 10), and mechanistic features of codon evolution are excluded. Models which combine empirical codon models and mechanistic parameters, such as An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e019.jpg and transition-transversion bias [10], have been shown to outperform the models which include only a single effect. This evidence highlights the necessity to model both mutational effects, which result in substitution preferences for particular amino acids, and selective effects, the result of fitness differences of alternate phenotypes. In this manuscript, we present an information-theoretic model selection procedure that extends the concept of ModelTest [22], formulated for nucleotide model selection, to codon models. Unlike ModelTest, which examines An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e020.jpg a priori defined models, we use a Genetic Algorithm (GA) to search the combinatorially large set of codon models (i.e. select the number of rate classes), to assign amino acid substitution rates to these classes, infer rate parameters and, finally, report a set of credible models given the data. Our group has successfully applied GAs to a variety of problems in evolutionary biology, including inference of lineage-specific selective regimes [23], detecting recombination in homologous sequence alignments [24], and model selection for paired RNA sequences [25], where the GA was able to recover biologically relevant properties and outperformed all known mechanistic models.

Using simulated data, we demonstrate that GA model selection (under a sufficiently stringent model selection criterion) is not susceptible to over-fitting, and that codon alignments of typical size contains sufficient signal to reliably allocate non-synonymous substitutions into a small number of rate classes, typically An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e021.jpg. On empirical data sets, GA-selected codon substitution models consistently outperformed published empirical and mechanistic models. In addition to selecting a single best fitting model, the GA also estimates a set of credible models for an alignment. A weighted combination of models in the credible set enable model averaged phylogenetic [26] and substitution rate matrix [25] inference and further reduces the risk of over-fitting. We anticipate that improvements in model realism will translate into improved sequence alignment, phylogeny estimation, and selection detection. Moreover, we hypothesize that the clustering of non-synonymous substitution rates into groups with the same rate parameter is shared by genes with similar biological and structural properties, and hence this clustering is informative for improving evolutionary fingerprinting of genes [27].

Methods

Model definition

Models considered in this paper assume that codon substitutions along a branch in a phylogenetic tree can be described by an appropriately parameterized continuous-time homogeneous and stationary Markov process; an assumption ubiquitous in codon-evolution literature. The substitution process is uniquely defined by the rate matrix, An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e022.jpg, whose elements An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e023.jpg denote the instantaneous substitution rate from codon An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e024.jpg to codon An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e025.jpg. Using An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e026.jpg to label the amino-acid encoded by codon An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e027.jpg, and assuming a universal genetic code with three stop codons (other codes can be handled with obvious modifications), matrix An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e028.jpg comprises 61×61 such elements, where

equation image
(1)

Here, An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e030.jpg denote equilibrium frequency parameters, An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e031.jpg denote nucleotide mutational biases, and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e032.jpg denote the substitution rates between amino acids encoded by codons An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e033.jpg and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e034.jpg. How to infer An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e035.jpg is the primary focus of this paper. We consider two different parameterizations of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e036.jpg: the GY parameterization [3], where An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e037.jpg is the equilibrium frequency of the target codon, and the MG parameterization [2], where An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e038.jpg is a nucleotide frequency parameter for the position that is being substituted (An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e039.jpg; An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e040.jpg). For the GY parameterization, we estimate codon equilibrium frequencies by their proportions in the data (the F61 estimator, 60 parameters for the universal genetic code). For the MG parameterization, we estimate the nine frequency parameters by maximum likelihood [28]. The equilibrium frequency of codon An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e041.jpg can then be computed as

equation image

where An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e043.jpg and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e044.jpg.

Finally, we set An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e045.jpg, An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e046.jpg and estimate An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e047.jpg other rates (An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e048.jpg) by maximum likelihood; this parameterization follows the MG94An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e049.jpgREV model from [29].

Inferring non-synonymous substitution rates

By varying the parametric complexity of the non-synonymous substitution rate An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e050.jpg encoding in equation (1), we can span the range of models from the single rate model (SR, current default standard, 1 non-synonymous rate parameter), to the general codon time-reversible model (REV) with each amino-acid pair substitution exchanged at its own rate. Only 75 out of 190 total amino-acid pairs can be exchanged via a single nucleotide substitution, for example An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e051.jpg and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e052.jpg are one such pair, but An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e053.jpg and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e054.jpg are not. Consequently, the REV model has An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e055.jpg non-synonymous rate parameters. The purpose of our study is to explore the model space between these two extremes, taking into account the limitations of information content in single gene alignments. Note that most existing multi-rate models can be represented with an appropriate choice of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e056.jpg in equation (1). Empirical models (e.g. ECM) replace An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e057.jpg with numerical values estimated from large training data sets, whereas mechanistic models (e.g. LCAP) assume that rates can be modeled via a function measuring differences/similarities in physicochemical properties of residues (Table 1).

Table 1
Various approaches to estimating residue-dependent non-synonymous substitution rates.

We focus on structured (or rate clustering) models: those which assume that substitution rates can be partitioned/structured into An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e076.jpg classes, where each class has a single estimated rate parameter. These structured models may be defined using amino acid similarity classes [30], but instead of adopting a priori classes of rates, we propose to infer their number and identity from the data. A structured model with An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e077.jpg substitutions (e.g. An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e078.jpg for the Universal genetic code) in An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e079.jpg classes can be represented as a vector An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e080.jpg of length An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e081.jpg, where each element is an integer between An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e082.jpg and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e083.jpg labeling the class. For example if the vector entries corresponding to An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e084.jpg, An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e085.jpg and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e086.jpg substitutions have values An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e087.jpg and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e088.jpg, then An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e089.jpg and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e090.jpg. As an analogy, the HKY85 nucleotide model [31] is a structured model with vector, An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e091.jpg, where the substitutions between 6 nucleotide pairs (indicated by a subscript) are placed into transition (1) and transversion (0) classes. Given the structure of a codon model, e.g. An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e092.jpgAn external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e093.jpg, it can be fitted to the data using standard maximum likelihood phylogenetic algorithms, e.g. as implemented in HyPhy [32]. The resulting set of rate estimates An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e094.jpg instantiate a structured model and induce a corresponding empirical model, e.g. An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e095.jpg.

Because the space of structured codon models is combinatorially large, we utilize a GA previously used to solve an analogous model selection problem for paired RNA data [25]. Parameter space is defined by two components: a discrete component which assigns pairwise non-synonymous substitutions between codons to An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e096.jpg rate classes using the structured vector described above, and a continuous component comprising a vector of branch lengths, nucleotide substitution rates, frequency parameters and non-synonymous rates An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e097.jpg. The discrete component is optimized by the GA, while the continuous component is estimated using numerical non-linear optimization procedures, given the structure of the model. We initially approximate branch lengths using the SR model and update them whenever the GA iteration improves the fitness score by more than 50 An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e098.jpg points (see below) as compared to the most recent model for which branch lengths have been estimated. Further details of the genetic algorithm are described in detail in [25], and for the sake of brevity we do not present it here.

We are left with the problem of inferring the number of rate classes An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e099.jpg. This is done by starting with An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e100.jpg and iteratively proposing to increment An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e101.jpg. For each proposal, the model with An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e102.jpg rate classes is optimized using the optimized An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e103.jpg-class model as initialization. If the proposal results in a model with a better fitness value (see below), it is accepted and a new proposal generated. The process terminates when the An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e104.jpg-class proposal does not beat the An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e105.jpg-class model.

We initially assigned a fitness value to each model using An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e106.jpg where An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e107.jpg is the sample size and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e108.jpg is the number of parameters in the model [33]. The “sample size” of a sequence alignment is difficult to quantify with a single number, since it depends on both the number of sequences in the alignment and the lengths of those sequences. We use the number of characters to approximate “sample size” to make the model selection criterion maximally conservative. While it is straightforward to count the number of estimated parameters in any given structured model, setting An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e109.jpg to that number leads to model over-fitting (results not shown), because the topological component (the assignment of rates to classes) adds further “degrees of freedom” to the model. To determine the appropriate penalty term, we conducted simulations; there is precedent for this in statistical literature on generalized information criteria (e.g. [34]). We removed the effect of phylogeny by simulating nine sets of two-sequence alignments (An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e110.jpg divergence): each set of simulations consisted of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e111.jpg replicates with between An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e112.jpg and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e113.jpg codons (in An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e114.jpg increments). The sets had 1 to 5 rate classes (Figure 1), representing rate classification problems that ranged from easy (large numerical differences between class rates, e.g. An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e115.jpg and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e116.jpg) to difficult (small numerical differences, e.g. An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e117.jpg and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e118.jpg). We constructed generating multi-rate models by assigning rates to An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e119.jpg bins randomly with equal probability. For each simulation set we plotted the difference in log likelihood (scaled by the sample size = log of characters) between the correct model (An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e120.jpg rates), and models with An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e121.jpg and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e122.jpg rates, respectively. Simulations indicated that doubling the number of parameters in the BIC penalty term ensured sufficient power, and controlled false positives for all simulation sets (Figure 1). We used this modified BIC, An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e123.jpg to assign fitness to every model examined by a GA run and select those with the lowest An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e124.jpg.

Figure 1
Simulation studies used to derive the appropriate penalty term for An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e125.jpg.

Simulated data analysis

We also simulated realistic “gene-size” alignments on An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e141.jpg and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e142.jpg taxon trees. Nucleotide frequencies were uniform (An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e143.jpg) for each position, and the nucleotide bias component was set to HKY85 with transition/transversion ratio, An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e144.jpg. We generated An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e145.jpg data sets for each An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e146.jpg:rate vector combination, under the single rate, and a fixed Random-K model (Table 2). These data allowed us to assess the performance of the model when the true underlying model was known.

Table 2
The performance of GA model selection with An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e147.jpg in estimating the number and membership of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e148.jpg rate classes as well as rate values from simulated data.

For each simulation scenario, we report the proportion of replicates An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e162.jpg for which the GA inferred the correct number of rate classes An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e163.jpg, the proportion of underfitted replicates An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e164.jpg (too few rate classes were inferred) and the proportion of overfitted replicates An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e165.jpg (too many rate classes were inferred). For the replicates where the correct number of rate classes was inferred, we computed the Rand statistic (An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e166.jpg, [35]) on the generating and inferred model structures to quantify the similarity between two clusterings rates. The Rand statistic quantifies the similarity between two clusterings (An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e167.jpg) of the same set of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e168.jpg objects and can be defined as An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e169.jpg, where An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e170.jpg is the number of objects (pairs of substitution rates) that belong to different classes in both A and B, An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e171.jpg (An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e172.jpg) is the number of objects that belong to different (same) classes in A, but the same (different) class in B, and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e173.jpg is the number of objects that belong to the same class in both A and B. Clearly, An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e174.jpg for perfect agreement (An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e175.jpg) and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e176.jpg for perfect disagreement (An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e177.jpg).

Empirical data analysis

We prepared a collection of reference empirical data sets (see Table 3), to be used for benchmarking GA, published and extreme-case models. The collection included three protein family alignments from Pandit [12] selected randomly from all alignments with An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e178.jpg taxa, a randomly selected Yeast protein alignment [18], a group M HIV-1 pol alignment [36] and an Influenza A virus (IAV) HA alignment comprising H3N2, H5N1, H2N2 and H1N1 serotypes. The latter was assembled by random selection of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e179.jpg post-2005 sequences for each serotype from the NCBI Influenza database [37]. Finally, we examined the vertebrate rhodopsin protein, recently analyzed for molecular mechanisms of phenotypic adaptation by [38]. We inferred a structured multi-rate model for each of these data sets using the genetic algorithm and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e180.jpg model fitness function defined above. A comparison of the GA-fitted model against existing models is unfair, since the former was selected among a set of candidate models using the test alignment. To confirm that GA models were generalizable, we evaluated the fit of the GA models and that of existing models for both the reference datasets, and independent test alignments for the same taxonomic groups (validation data sets). Two HIV-1 pol gene alignments were obtained for subtypes B [39] and C [40]. Subtype assignments were confirmed using the SCUEAL sub-typing tool [36], and inter- and intra-subtype recombinants were pruned from the analysis. For IAV HA we used independent alignments for serotypes H5N1 and H3N2, filtered from the NCBI Influenza database [37], and from [41], respectively.

Table 3
Empirical data set characteristics.

We fitted five reference models to each dataset: (i) the single-rate model, (ii) a Random-An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e189.jpg and a Random-An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e190.jpg model, (iii) the empirical codon model (ECM, [10]), (iv) the Linear Combination of Amino Acid Properties (LCAP) model [17], [18], and (v) the reversible (REV) model (see Table 1).

For every dataset, the corresponding GA-run was processed to obtain three different alignment-specific multi-rate models.

  1. A structured GA model (An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e191.jpg): this is the best-fitting model (with value An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e192.jpg), which defines An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e193.jpg rate clusters. The numerical values of corresponding An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e194.jpg substitution rates are inferred using maximum likelihood. This model is a direct analog of the single “best” substitution model reported by the familiar ModelTest [22] nucleotide model selection procedure.
  2. A numerical model-averaged GA model (An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e195.jpg), which is computed by weighting the numerical rate estimates from all models in the credible set using An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e196.jpg-based Akaike weights (as in [25]). Briefly, for the An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e197.jpgth model examined by the GA, we compute its evidence ratio versus the An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e198.jpg model as An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e199.jpg, which can be thought of as the probability that model An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e200.jpg is the best model to explain the data, in the sense of minimizing the Kullback-Leibler divergence from the “true” unobserved model [42]. In addition to the An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e201.jpg model, we also construct a set of credible models, i.e. all those models whose An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e202.jpg is sufficiently large (An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e203.jpg). From this credible set we compute a model averaged estimate of any parameter An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e204.jpg, by a weighted sum of the estimate under model An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e205.jpg, An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e206.jpg as An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e207.jpg, where the Akaike weight of model An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e208.jpg, An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e209.jpg is defined as An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e210.jpg. This An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e211.jpg model is an analog of an empirical substitution model (e.g. ECM), and has no rate parameters that are estimated from validation data sets. By combining information from multiple models, statistical noise may be reduced (e.g. [26]).
  3. The numerical An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e212.jpg model with the addition of a single non-synonymous substitution rate parameter (An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e213.jpg) which multiplies all non-synonymous substitution rates in the An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e214.jpg matrix. The direct analog is the An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e215.jpg model of [10], and its purpose is to add a dataset specific “adjustment” to the baseline numerical model, since the estimated parameters of the baseline numerical model are weighted over the credible set and fixed at these estimates when applied to other datasets.

We used both BIC [33] and Likelihood ratio tests, where appropriate, for model comparison. These goodness-of-fit comparisons allowed us to evaluate whether a model estimated on reference alignments yielded a significant improvement over the other models when fitted to independent alignments for the same taxonomic groups. All models were implemented with the F61 frequency parameterization, in addition to their original frequency parameterizations, because the methodology used to estimate the ECM model precluded the use of other frequency parameterizations for across-the-board comparison. Alignments and phylogenetic trees were provided for the Pandit data set. In all other cases, alignments were generated using codon alignment tools implemented in HyPhy [32]. Maximum likelihood phylogenetic trees were estimated using PhyML [43] under a GTR [44] model of nucleotide substitution and among-site rate variation modeled as a discretized gamma distribution with 4 rate-classes [45]. Empirical alignments and trees are available at http://www.hyphy.org/pubs/cms/.

Rate matrix comparisons

The entries of the substitution rate matrix An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e216.jpg can be used to estimate the expected number of substitutions per site per unit time, An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e217.jpg, and to determine the value of the time parameter (assuming all other parameters are known) An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e218.jpg which yields An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e219.jpg. Furthermore, the expression for the number of expected one-nucleotide substitutions between codons An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e220.jpg and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e221.jpg, in time An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e222.jpg, at a site is given by An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e223.jpg (the simplification is the consequence of time-reversibility). Given two amino-acid residues An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e224.jpg and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e225.jpg which can be exchanged by a single nucleotide substitution, we can further define An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e226.jpgAn external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e227.jpg, where An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e228.jpg denotes the residue encoded by codon An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e229.jpg. Consider a An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e230.jpgelement substitution spectrum vector An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e231.jpg, which describes the relative abundance or paucity of a particular type of amino-acid pair substitution under the model defined by An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e232.jpg. Given two models, An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e233.jpg and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e234.jpg, we propose to compare their similarity by computing the distance between the corresponding substitution spectrum vectors evaluated at the corresponding “normalized” times:

equation image
(2)

Any norm on the standard An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e236.jpgdimension real valued vector space can be used, but for the purposes of this paper we consider the An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e237.jpg norm, and the corresponding induced Euclidean distance metric.

Implementation

All models and data sets utilized in this study are implemented as scripts in the HyPhy Batch Language (HBL), and are be available with the current source release of HyPhy [32]. In addition, we have made the GA codon model selector available as an analysis option at http://www.datamonkey.org [46]. The GA model selection code requires an MPI cluster environment with typical runtimes of approximately 36–48 hours for an intermediate-sized alignment (50 taxa) and 32 compute nodes.

Results

Power and accuracy analysis on simulated data

Results from both two- and multi-taxon simulations (Table 2, Figure 1) indicated that An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e238.jpg controlled the rates of overfitting, defined as the proportion of replicates that overestimated the number of rate classes An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e239.jpg, An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e240.jpg. For null (single-rate model) simulations (An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e241.jpg), false positive rates were An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e242.jpg for two-taxon simulations and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e243.jpg for An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e244.jpg-taxon simulation. Neither two- nor multi-taxon simulations showed over-fitting across any simulation scenarios (Table 2). We deliberately designed the procedure to be conservative, since over-fitting is a major concern in statistical model selection. The power to select the correct number of rate classes An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e245.jpg (An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e246.jpg) behaved as expected: increasing, and eventually reaching An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e247.jpg, given sufficiently divergent sequences and well resolved rate classes (Table 2). Indeed, the limited information content of alignments where simulated rate classes are similar (i.e 3 rates of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e248.jpg), and/or where pairwise sequence divergence is low (0.2), was evident as increased model under-fitting (Table 2), An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e249.jpg. Model under-fitting was substantially reduced when information content was increased, either by boosting the disparity in rate classes, or by elevating sequence divergence and/or number of taxa (Table 2). Further evidence that the GA procedure has high power is provided by the positive association of the difference between An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e250.jpg scores of the correct model with An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e251.jpg rates, and one with An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e252.jpg-1 rates, and separation between simulated rates, pairwise sequence divergence or number of taxa (Table S1). The ability to assign individual rates to the correct group (as measured by the Rand statistic) was similarly improved, while the variance in numerical rate parameter estimates decreased, for more divergent sequences and rate classes, suggesting that the GA search procedure recaptures most of the rate class structure, given sufficient information.

Empirical data analysis

We compared the fit of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e253.jpg codon substitution models (Table 1) on An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e254.jpg empirical data sets (Table 3), spanning a range of proteins, taxonomic groups and divergence levels, using the BIC to measure goodness-of-fit. Using the GA procedure, we inferred distinct multi-rate models from An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e255.jpg of these data sets (labelled with asterisks in Table 3). The remaining An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e256.jpg alignments were used for validation such that we could determine the generalizability of two of the GA-fitted models (HIV and IAV) to other alignments from the same taxonomic groups. In An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e257.jpg cases, the GA model outperforms every other model (often by a large margin), and in An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e258.jpg cases it comes in second after the parameter rich REV model (Table 4). Note that the GA model outperforms REV in all An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e259.jpg cases under the more conservative An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e260.jpg criterion (which was used to inform the GA). Data set specific GA models consistently fit the data better than state-of-the-art empirical (ECM) and mechanistic (LCAP) models.

Table 4
Comparison of empirical model fits using BIC.

An intuitive understanding of the model selection process via the GA may be gained by thinking of it as a non-linear curve fitting problem, where the “true” curve is the unobserved distribution of biological substitution rates (Figure 2). We consider the An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e263.jpg substitution rate matrix for a codon model, extract non-synonymous rates for the An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e264.jpg above-diagonal entries which correspond to one-step non-synonymous substitutions and rank them in an increasing order to obtain monotonically increasing rate curves as shown in (Figure 2). Note that because the ratios for all substitutions between the same pair of amino-acids (of which there are An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e265.jpg pairs) are identical, this will create steps in such curves. In the case of one non-synonymous substitution rate (SR) the curve is a flat line at the estimated average non-synonymous substitution rate across all residue pairs. This is easily improved on by a random model which assigns non-synonymous substitutions randomly to one of 5 rate classes. At the other extreme lies the general time reversible models with An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e266.jpg estimated rates. Since we have no a priori reason to believe that any two non-synonymous substitution rates will be exactly the same, REV is the most biologically realistic of the models which assume time-reversibility and only single nucleotide substitutions. However, fitting the parameter rich REV model to limited data is statistically unsound. The GA-approach, instead, searches for the best (in an information theoretic sense) step-wise smoothing of the biological distribution given the data available (Figure 2).

Figure 2
Evolutionary rate estimation as “curve fitting.”

The “generalist” ECM model sacrifices gene-level resolution, in some cases so dramatically that it underperforms the single-rate model, even with the correction factor An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e268.jpg (Table 4). For instance, ECM appears to be ill suited for the analysis of viral genes. LCAP, on the other hand, performs poorly for highly divergent data sets; indeed the original validation of LCAP took place on relatively closely related yeast species [18], and the mechanistic properties assumed by the model may be insufficient in alignments spanning multiple genera and taxonomic groups. To test whether GA structured models are generalizable, we estimated two viral models: one for HIV-1 polymerase and one for human IAV hemagglutinin. We then applied each of these models (holding the inferred class structure fixed) to two additional samples of sequences from the same gene, obtained independently from the training sample. In all An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e269.jpg cases An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e270.jpg outperformed ECM, ECM+An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e271.jpg and LCAP by wide margins, lending credence to the claim that data-driven structured models recover substitutional biases that are shared by other samples shaped by similar evolutionary parameters. Curiously, for very low divergence (and low information content) intra-serotype IAV alignments, the single rate model was preferred to all other models by BIC, suggesting that there are biologically interesting alignments, which do not contain sufficient amino-acid variability to indicate the use of a multi-rate model.

As a test of protein-specificity of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e272.jpg models, we randomly selected four Pandit data sets to assess how well An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e273.jpg models inferred from unrelated proteins fitted these data (Table S2). Not surprisingly, ECM was the best model in An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e274.jpg cases, because it was derived as the best “average” protein model. LCAP topped the list in one case, but placed outside the top three in the other three cases. The GA structured models, being tailored to specific proteins, tended to differ from each other (Table S3) and did not perform well on proteins from different families. However, the GA structured models for ATP cone and Transketolase C did outperform the LCAP model in An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e275.jpg cases, which suggests some similarity between the respective protein families in those cases. This indicates the GA models fitted to different proteins may be generalizable, with the degree limited by taxonomy, protein function or both. The generalizability of GA models could further be quantified by evolutionary fingerprinting of genes [27]; see also Figure 3(b).

Figure 3
Neighbor-joining [57] trees built from matrices of pairwise substitution spectrum distances (Eq. 2) computed between different models fitted to the HIV-1 group M pol alignment, and between An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e276.jpg models inferred from different alignments.

Further analysis of GA multi-rate models

A GA search run typically examines between two- and a hundred-thousand potential models, e.g. An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e277.jpg models with An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e278.jpg to An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e279.jpg rate classes for the HIV-1 group M pol dataset. An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e280.jpg, which we compared to existing models in the previous section, is simply the single “best” model, i.e. the model that minimized the An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e281.jpg criterion among all those examined during the run. Further, we estimate the credible set of models as those models whose evidence ratio versus the best model is sufficiently large (see methods). Among An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e282.jpg models fitted to HIV-1 pol by the GA, An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e283.jpg belonged to the credible set. Given sufficient data and knowing that the true model is in the set examined by the GA, e.g. in the long 2-sequence simulations discussed above, the size of the credible set frequently shrinks to An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e284.jpg (the true model). These structured (An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e285.jpg) and model-averaged (An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e286.jpg) models can be analyzed further to draw inferences of the substitution process.

For instance, the structured An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e287.jpg model identifies which residue pairs are exchanged rarely, relative to the baseline synonymous rate. In Figures 4 and and55 we cluster the pairs of residues which have the same rate of non-synonymous substitution; residues are labelled by Stanfel class and physicochemical properties. Note that the same residue can be present as a node in multiple clusters because the GA partitions residue pairs (i.e. the rates between them), not the residues themselves. The model reveals a startling heterogeneity of substitution rates in HIV-1 pol: the single rate An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e288.jpg estimate of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e289.jpg is resolved into An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e290.jpg rate classes (Figure 4), with relative non-synonymous substitution rates ranging from An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e291.jpg (20 residue pairs) to An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e292.jpg (3 residue pairs); a similar range is revealed for other datasets (Table 3). It is remarkable that some of the non-synonymous substitutions occur at rates matching or exceeding the gene-average rate of synonymous substitutions. This can be interpreted, for instance, as lack of selective constraint on particular residue substitutions gene-wide, or evidence of directional selection when some residues are preferentially replaced with others. Regardless of how this result is interpreted, a remarkable complexity of substitution patterns is revealed by the analysis. We hypothesize that such patterns reflect complex dynamics of substitutional preferences that may be shared by multiple samples of the same genes. This hypothesis is supported (by the goodness-of-fit of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e293.jpg vs other models) on HIV-1 and IAV samples in this study (Table 4), and we are currently undertaking the GA analysis of several thousand alignments to confirm this finding.

Figure 4
Evolutionary rate clusters in structured GA models (An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e294.jpg) inferred from the HIV-1 group M pol alignment.
Figure 5
Evolutionary rate clusters in structured GA models (An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e298.jpg) inferred from the vertebrate rhodopsin protein alignment.

One of the benefits of using the An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e302.jpg model instead of REV or other models is that the former model automatically classifies all substitutions into similarity groups, supplying a data-driven analog of “conservative” or “radical” substitutions, previously defined a priori based on chemical properties of the residues, or a more sophisticated multi-property basis defined in the LCAP model. For example, the An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e303.jpg substitution rates are partitioned into seven classes in the An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e304.jpg model inferred from HIV-1 pol, and into An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e305.jpg rate classes for the An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e306.jpg model fitted to a smaller, but more divergent vertebrate rhodopsin alignment (Figure 5).

Multi-model inference is instrumental in assessing how robust the clustering assignment made by An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e307.jpg is. In Figures 4 and and5,5, we present this information by labeling individual substitution rates with their model averaged values. An examination of the numerical differences between rate estimates (for a particular amino-acid pair) obtained under An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e308.jpg and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e309.jpg can reveal ambiguities in assigning a particular rate to a class. More formally, we can compute a model averaged support for the probability that rates An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e310.jpg and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e311.jpg (for residues An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e312.jpg, An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e313.jpg) are in the same class, as described above, or that the corresponding edges An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e314.jpg and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e315.jpg are in the same component of the rate graph (Figure 4). If An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e316.jpg is a cluster defined by An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e317.jpg (with the number of nodes in An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e318.jpg, An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e319.jpg), we define the cluster affinity of an edge An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e320.jpg as the mean of the model averaged estimates of the probabilities that edge An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e321.jpg and other edges in An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e322.jpg belong to the same cluster:

equation image

If An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e324.jpg is below a certain threshold, for instance An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e325.jpg for majority rule, then cluster membership of edge An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e326.jpg is ambiguous. For example, the An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e327.jpg substitution pair with a model-averaged non-synonymous rate of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e328.jpg is one of two rate pairs with low (An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e329.jpg) cluster affinity for HIV-1 (Figure 4). Two of the inferred An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e330.jpg rate classes have non-synonymous rates of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e331.jpg and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e332.jpg, respectively, and the placement of model-averaged rate for An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e333.jpg between the two values is indicative of the alternate assignment of this substitution pair to these two rate classes among models of the credible set. A larger training data set may be able to infer an additional intermediate rate class between An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e334.jpg and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e335.jpg. While An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e336.jpg yields more robust numeric estimates of substitution rates for a single data set, An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e337.jpg has better An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e338.jpg fit on validation HIV and IAV alignments (results not shown).

The relationship between substitution rates and residue properties

The expectation that substitutions which preserve amino acid physicochemical properties occur at a lower rate than property-altering substitutions has previously been evaluated in the maximum likelihood codon model context [20], [47]. However, in published work, property-altering and property-conserving amino acid classes are defined a priori, whereas in the GA approach amino acid substitution pairs are first partitioned into classes based on rate similarity, and thereafter property preserving versus property-altering rates can be compared. The increased substitution rate of property preserving substitutions, holds largely – but not universally – for An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e339.jpg and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e340.jpg rates, as evidenced in Figures 6 and and7.7. For example, in the vertebrate rhodopsin sample, the median rate of charge-changing substitutions is significantly lower than the charge-preserving substitutions, but the two medians are not significantly different in the HIV-1 pol sample. The rates were negatively correlated (An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e341.jpg, one-sided Pearson product moment test, no multiple test correction) with An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e342.jpg out of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e343.jpg property-based distances (polarity, volume, isoelectric point and hydropathy) that form the basis of the LCAP model. However, while the broad pattern follows the expectation, the consistently better fit of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e344.jpg-based models, and the presence of strong outliers, such as An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e345.jpg and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e346.jpg in the An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e347.jpg cluster of HIV-1 rates (Figure 4), suggests that our data driven approach detects significant deviations from purely biochemical rate expectation. These deviations could be attributed to selective pressures which promote property changes, or could arise because not all biologically relevant important properties have been included into structured models.

Figure 6
Correlations of lower substitution rates and property preservation in the HIV-1 group M pol alignment.
Figure 7
Correlations of lower substitution rates and property preservation in the vertebrate rhodopsin alignment.

One benefit of our approach over the “amino acid class” models [20], [47] is that transitivity of rates (i.e. the requirement that if An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e350.jpg, and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e351.jpg are in the same rate class, then so is An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e352.jpg) is not enforced by the GA models. Because we focus on modeling single-nucleotide substitution rates only, the structure of the genetic code itself contradicts transitivity. For instance both An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e353.jpg (encoded by An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e354.jpg) An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e355.jpg An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e356.jpg and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e357.jpg are one-step substitutions, but An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e358.jpg is not. Further, since amino acid class models only estimate two non-synonymous rates (within and between classes), it is a necessary condition that non-synonymous rates which change amino acid property be shared irrespective of how much the property is being changed. For instance, substitution rates which change charge from negative to positive will be the same as those which change charge from negative/positive to uncharged. If amino acid substitutions that result in a positive charge are favored, then these transitive conditions are not representative of the substitution process. Furthermore, the amino acid class models assume all substitutions within classes occur at the same rate. This is a very strong assumption since some amino acids with the same physicochemical property class are separated by more than one nucleotide substitution, e.g. positively charged amino acids An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e359.jpg and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e360.jpg. Although we do not account for multiple nucleotide substitutions in the GA model directly (but see below), previous work has demonstrated that these occur at lower rates than single-nucleotide substitutions [9], [10], [48].

Model clustering

Using the substitution spectrum distance defined in Equation (2), it is easy to construct a hierarchical clustering of several models fitted to the same dataset, as well as between models fitted to different datasets. The former is useful to interpret how much difference in predicted substitution patterns over a unit of evolutionary time there is between different descriptions of the same data, whilst the latter naturally extends the concept of evolutionary fingerprinting of non-homologous genes [27]. For HIV-1 pol (Figure 3), An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e361.jpg and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e362.jpg models both clustered closely with the rate substitution pattern predicted by the REV model, followed by LCAP, ECM+F+An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e363.jpg, and finally – distant single rate models. The similarity between REV and GA models was especially strong for the An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e364.jpg parameterization, under which the GA models were inferred. In a between-genes model comparison (Figure 3), the two viral alignments clustered together, as did the two most divergent alignments (ATP-cone and Transketolase C).

Effects of substitution models on statistical inference

Statistical inference procedures based on phylogenetic models have varying degrees of robustness with respect to the substitution rate matrix used in the analysis. For a multi-rate model, it is intuitively clear that the types of inference that rely on “mean” rates should be minimally affected, whereas those that depend on the individual residue rates can be affected significantly. We examine several such measures inferred from two of the datasets in this study.

Branch length estimates are essentially unchanged when moving from the single-rate (SR) model to a An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e365.jpg model. On the example HIV-1 pol dataset, the total tree length changed from to An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e366.jpg (SR) expected substitutions/nucleotide to An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e367.jpg (An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e368.jpg), and the lengths of individual branches were nearly perfectly linearly correlated with linear regression slope of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e369.jpg, intercept of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e370.jpg and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e371.jpg.

Ancestral character reconstruction is considerably more sensitive to the substitution model. In the vertebrate rhodopsin data set, for example, the joint maximum likelihood ancestral reconstruction [49] under SR and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e372.jpg models differed in the number of inferred non-synonymous substitutions at An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e373.jpg sites, with more non-synonymous substitutions in An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e374.jpg cases under An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e375.jpg. At An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e376.jpg sites substitutions were mapped to a different set of branches.

Site-specific diversifying selection screens are likely to be profoundly affected by a switch from single- to multi-rate models. Consider the FEL method [50], where the SR model is fitted site-by-site and a likelihood ratio test (LRT) is used to test whether An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e377.jpg. First, because An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e378.jpg defines multiple substitution classes, one can now apply a variety of tests to see which non-synonymous rates at a given site exceed the baseline synonymous rate. To explore this approach for a An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e379.jpgrate An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e380.jpg multi-class model applied to the vertebrate rhodopsin alignment, we performed An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e381.jpg LRT tests, where we independently constrained each non-synonymous rate parameter (An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e382.jpg, Table 1) to be equal to An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e383.jpg at a site (neutral evolution in class An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e384.jpg), vs an unconstrained An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e385.jpgparameter alternative. This is analogous to performing a test for selection at a site by constraining the non-synonymous rate to be equal to the synonymous rate, and comparing the fit to the unconstrained model (FEL), except that we only place the constraint on one rate class at a time. At An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e386.jpg, the standard (SR) FEL reported An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e387.jpg (codon 54) sites as being under diversifying selection (positively selected). However, for the An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e388.jpg model, there were An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e389.jpg and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e390.jpg positively selected sites for the four substitution classes (Figure 5, increasing rate magnitude), respectively at the Bonferroni corrected An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e391.jpg of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e392.jpg. Codon 54 was selected only with the fastest rate class (An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e393.jpg), because the signal of selection is driven by a large number of An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e394.jpg substitutions. Only one codon (An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e395.jpg) was selected with two or more different tests (rate classes An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e396.jpg and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e397.jpg).

The effect of site-to-site rate variation

We remark that the effects of site-to-site rate variation and multiple non-synonymous rates appear to be largely additive, and not confounded. This is a critical observation: if the effects are confounded, then we cannot justify inferring the multi-rate model independently assuming no site-to-site rate variation, as is done in this manuscript for computational expedience. To illustrate, we fitted both a constant rate model and the general bivariate distribution [27], with and without accounting for multiple non-synonymous rate classes (Table 5). The constant rate model assumes all sites share the same rate of substitution, whereas a general bivariate distribution infers the number of site-to-site variation classes from the data [27]. These models were fitted to the vertebrate rhodopsin alignment, which exhibits extensive site to site rate heterogeneity. The An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e398.jpg inferred 4 non-synonymous rate classes for the rhodopsin alignment, whereas the single An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e399.jpg has one, resulting in three degrees of freedom for the comparison of these models. When the general bivariate model was fitted with a single An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e400.jpg or An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e401.jpg, 6 and 7 site classes were inferred, respectively, resulting in 4 degrees of freedom for the comparison of single An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e402.jpg and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e403.jpg models (3 rate and 1 site class are added to the An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e404.jpg model). The important observation is that the addition of site-to-site rate variation component resulted in a significant improvement in log likelihood scores, regardless of the underlying substitution model (single An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e405.jpg or An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e406.jpg). This suggests that by allowing multiple rate classes, we are not merely fitting variability in site-to-site selective constraints. However, as the cost of computing cores in clusters decreases, we expect that it will become practical to infer An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e407.jpg models with the site-to-site rate variation component included directly in the search procedure.

Table 5
The effects of modeling site-to-site rate variation and multiple non-synonymous rates in the vertebrate rhodopsin alignment using the MG frequency parameterization.

The effect of allowing multiple instantaneous nucleotide substitutions

Recent extensions of codon models which permit multiple instantaneous nucleotide substitutions [9], [10], [48] tend to fit the data better than their traditionally parameterized counterparts. We explored whether this observation held for An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e413.jpg models using a straightforward extension of the rate matrix in Equation (1), following the ideas of [9]. We introduce four new independently estimated parameters to model the relative rates of synonymous (An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e414.jpg) and non-synonymous (An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e415.jpg) substitutions which replace two or three nucleotides, and modulate them by the product of the corresponding nucleotide rates An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e416.jpg and the target codon frequency An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e417.jpg (assuming the GY parameterization with the F61 estimator). For instance the rate of synonymous substitution (Serine) from An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e418.jpg to An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e419.jpg is An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e420.jpg, while the rate of non-synonymous substitution An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e421.jpg (Lysine to Proline) is An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e422.jpg.

Table 6 summarizes the effect of adding multi-step substitutions to SR and An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e423.jpg models for the vertebrate rhodopsin alignment. Much as was the case for site-to-site rate variation, the effects of multiple single-step non-synonymous rates and the non-zero rates of two or three nucleotide substitutions are additive at the An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e424.jpg level, and the estimates of single-step substitution rates were minimally influenced by the presence of the multi-step component (results not shown). The An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e425.jpg model augmented to allow multi-step substitutions can be directly compared to the Mechanistic-Empirical codon (MEC) model [9] coupled with the LG [51]empirical amino-acid substitution model (selected as the best fitting empirical model using the procedure implemented on http://www.datamonkey.org. Assuming no site-to-site rate variation, BIC of the MEC model is An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e426.jpg, while that of the An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e427.jpg+multi-step model using the HKY85 nucleotide component (a direct analog to the MEC model) is An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e428.jpg, once again highlighting how strongly the substitution process in an individual gene appears to deviate from the “average” encoded by empirical protein models.

Table 6
The effects of modeling multi-nucleotide instantaneous substitutions and multiple non-synonymous rates in the vertebrate rhodopsin alignment using the F61 frequency parameterization.

The GA could be modified to search for optimal partitions among all An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e434.jpg pairs of rates, for example using the above parameterization, but as the rhodopsin example indicates, the single-step and multi-step rate rate components appear to be effectively independent. We will explore this option in future versions of the model selection GA.

Discussion

In this manuscript we have developed, validated and benchmarked a procedure to quickly and reliably infer a multi-rate model from the combinatorially large class of general time-reversible codon substitution models. Using extensive simulations, we demonstrated that our conservative An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e435.jpg model selection criterion controls over-fitting and has excellent power on data sets of biologically realistic size, inferring the exact model simulated given sufficient sequence divergence and length. We have previously argued against using the single rate model as a benchmark against which multi-rate models should be compared, since it is trivial to improve upon using a random assignment of substitutions to rate classes [19]. We reiterate this argument here, and suggest we should rather consider how well a multi-rate model approximates the REV model (Figure 2), given the limitations posed by the information content in an alignment. On a diverse collection of biological data, An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e436.jpg models consistently outperform the best-in-class empirical and mechanistic models, and match the performance of fully parameterized general time reversible models with only a few biologically relevant rate parameters (Table 4). Therefore, the An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e437.jpg provides goodness of fit matching or exceeding that of REV, with substantially fewer parameters and is thus computationally and statistically feasible for downstream analyses.

ModelTest [22] has been universally adopted to mitigate the effect of model misspecification on statistical inference from nucleotide data, and we posit that a robust codon model selection procedure, for example the one offered in this paper, will play a similar role for codon data. In the same vein as ModelTest, we infer the best model (which we term the An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e438.jpg) for an alignment, and also utilize model averaging [26] to achieve more robust estimates of biologically relevant parameters. Certain applications of codon models, such as divergence estimation, appear unaffected by the gross biological over-simplification of single-rate models, because they are only influenced by the mean of substitution rates. Others, including ancestral sequence reconstruction (e.g. for guided site directed mutagenesis, [38]), substitution mapping (e.g. for co-evolutionary analysis, [52]) and character sampling (e.g. for data augmentation modeling approaches, [53]) can see moderate effects. Applications which are tightly integrated with the substitution model and the interpretation of its parameters, such as site-by-site positive selection detection (e.g. [50], [54]), will be profoundly affected by the introduction of multiple rates. Our results strongly argue against the prospect of deriving a single “generalist” model of codon evolution, that is capable of fitting most protein alignments well. Hence we should strive to fit both gene and taxonomy specific models of codon evolution. We further hypothesize that independent alignments representing a gene or a protein family will share most of the model structure and confirm this with HIV-1 polymerase and Influenza A virus hemagglutinin examples. While significant further validation is required and is currently underway, we assert that a collection of substitution models inferred from carefully selected training datasets can provide a useful library of organism and gene-specific models to be used in inference on codon sequences. This is conceptually similar to a library of Hidden Markov profile models, inferred from seed alignments, used for detecting protein domain homology in the Pfam database [55]. In order to facilitate the process of generating gene and taxonomic specific multi-rate codon models we have implemented the GA on our free analysis webserver (http://www.datamonkey.org, [46]), and have begun to assemble a library of representative multi-rate substitution models that are needed to reduce biases in those procedures that are sensitive to model misspecification.

The inference of the multi-rate codon models should be considered more than just a necessary step for downstream applications. By examining the structure of inferred rate classes, we argue that the An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e439.jpg captures the a priori expectation that radical changes in one or more biochemical properties of a residue happen relatively infrequently, but also that a mere reliance on such data-abstract mechanistic properties misses out important gene and organism specific peculiarities of the evolutionary process. For instance the elevation of substitution rates between amino acids that do not preserve physicochemical properties may be indicative of selective pressures which promote property changes. These selective pressures are of crucial importance in understanding evolution in viruses, such as HIV-1, known to evade host immune response [56]. We anticipate that considering specific substitution types when estimating selective pressures will improve power, as demonstrated with our multi-rate FEL analysis of vertebrate rhodopsin. However, this may also increase the rate of false positives, a conjecture that can be evaluated with straightforward, but laborious simulations.

Finally, we demonstrate how simple metrics on An external file that holds a picture, illustration, etc.
Object name is pcbi.1000885.e440.jpg models inferred from different (e.g. non-homologous) alignments can be used to obtain an objective measure of similarity and disparity in substitutional preferences in different proteins and thus improve the resolution in evolutionary fingerprinting of genes [27].

Supporting Information

Table S1

Difference in mean (standard deviation) model mBIC scores for multi-taxon simulations. D is the average pairwise divergence; mBICn is the difference in model mBIC score between the model with n−1 rates and a more complex with n rates; P is the proportion of correctly identified models for 100 simulations. Positive mBIC scores indicate preference for the more complex model with n rates, i.e. mBICn = mBICn−1−mBICn.

(0.04 MB PDF)

Table S2

Randomly selected Pandit data model comparisons using BIC. In each case we fitted the ECM, LCAP and GAs models to each of four randomly selected Pandit datasets. Model ranks (BIC/difference in BIC score relative to the best model) are shown.

(0.03 MB PDF)

Table S3

Qualitative comparison of structured GA models.

(0.02 MB PDF)

Acknowledgments

We thank Associate Editor, Wen-Hsiung Li, Tal Pupko and an anonymous reviewer for insightful comments on an earlier draft of this manuscript.

Footnotes

The authors have declared that no competing interests exist.

This research was supported by the Joint DMS/NIGMS Mathematical Biology Initiative through Grant NSF-0714991, the National Institutes of Health (AI47745), and by a University of California, San Diego Center for AIDS Research/NIAID Developmental Award to WD and SLKP (AI36214). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Felsenstein J. Evolutionary trees from DNA-sequences – a maximum-likelihood approach. J Mol Evol. 1981;17:368–376. [PubMed]
2. Muse SV, Gaut BS. A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome. Mol Biol Evol. 1994;11:715–724. [PubMed]
3. Goldman N, Yang Z. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol Biol Evol. 1994;11:725–736. [PubMed]
4. Anisimova M, Kosiol C. Investigating protein-coding sequence evolution with probabilistic codon substitution models. Mol Biol Evol. 2009;26:255–271. [PubMed]
5. Delport W, Scheffler K, Seoighe C. Models of coding sequence evolution. Brief Bioinform. 2009;10:97–109. [PMC free article] [PubMed]
6. Dayhoff MO, Eck EV, Park CM. Dayhoff MO, editor. A model of evolutionary change in proteins. Atlas of protein sequence and structure, National Biomedical Research Foundation, Washington D.C., volume 5. 1972. pp. 89–99.
7. Jones D, Taylor W, Thornton J. The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci. 1992;8:275–82. [PubMed]
8. Whelan S, Goldman N. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol Biol Evol. 2001;18:691–699. [PubMed]
9. Doron-Faigenboim A, Pupko T. A combined empirical and mechanistic codon model. Mol Biol Evol. 2007;24:388–397. [PubMed]
10. Kosiol C, Holmes I, Goldman N. An empirical codon model for protein sequence evolution. Mol Biol Evol. 2007;24:1464–1479. [PubMed]
11. Tavaré S. Some probabilistic and statistical problems in the analysis of DNA sequences. Lectures on Mathematics in the Life Sciences. 1986;17:57–86.
12. Whelan S, de Bakker PIW, Quevillon E, Rodriguez N, Goldman N. Pandit: an evolution-centric database of protein and associated nucleotide domains with inferred trees. Nucleic Acids Res. 2006;34:D327–31. [PMC free article] [PubMed]
13. Adachi J, Hasegawa M. Model of amino acid substitution in proteins encoded by mitochondrial DNA. J Mol Evol. 1996;42:459–468. [PubMed]
14. Adachi J, Waddell P, Martin W, Hasegawa M. Plastid genome phylogeny and a model of amino acid substitution for proteins encoded by chloroplast DNA. J Mol Evol. 2000;50:348–358. [PubMed]
15. Dimmic MW, Rest JS, Mindell DP, Goldstein RA. rtREV: an amino acid substitution matrix for inference of retrovirus and reverse transcriptase phylogeny. J Mol Evol. 2002;55:65–73. [PubMed]
16. Nickle DC, Heath L, Jensen MA, Gilbert PB, Mullins JI, et al. HIV-specific probabilistic models of protein evolution. PLoS ONE. 2007;2:e503. [PMC free article] [PubMed]
17. Conant GC, Wagner GP, Stadler PF. Modeling amino acid substitution patterns in orthologous and paralogous genes. Mol Phylogenet Evol. 2007;42:298–307. [PubMed]
18. Conant GC, Stadler PF. Solvent exposure imparts similar selective pressures across a range of yeast proteins. Mol Biol Evol. 2009;26:1155–1161. [PubMed]
19. Delport W, Scheffler K, Muse SV, Kosakovsky Pond S. Benchmarking multi-rate codon models. PLoS One in press [PMC free article] [PubMed]
20. Sainudiin R, Wong WSW, Yogeeswaran K, Nasrallah JB, Yang Z, et al. Detecting site-specific physicochemical selective pressures: applications to the Class I HLA of the human major histocompatibility complex and the SRK of the plant sporophytic self-incompatibility system. J Mol Evol. 2005;60:315–326. [PubMed]
21. Huelsenbeck JP, Joyce P, Lakner C, Ronquist F. Bayesian analysis of amino acid substitution models. Philos Trans R Soc Lond B Biol Sci. 2008;363:3941–3953. [PMC free article] [PubMed]
22. Posada D, Crandall K. Modeltest: testing the model of DNA substitution. Bioinformatics. 1998;14:817–818. [PubMed]
23. Kosakovsky Pond SL, Frost SDW. A genetic algorithm approach to detecting lineage-specific variation in selection pressure. Mol Biol Evol. 2005;22:478–485. [PubMed]
24. Kosakovsky Pond SL, Posada D, Gravenor MB, Woelk CH, Frost SD. Automated phylogenetic detection of recombination using a genetic algorithm. Mol Biol Evol. 2006;23:1891–1901. [PubMed]
25. Kosakovsky Pond SL, Mannino FV, Gravenor MB, Muse SV, Frost SD. Evolutionary model selection with a genetic algorithm: a case study using stem RNA. Mol Biol Evol. 2007;24:159–170. [PubMed]
26. Posada D. jModelTest: phylogenetic model averaging. Mol Biol Evol. 2008;25:1253–1256. [PubMed]
27. Kosakovsky Pond S, Scheffler K, Gravenor M, Poon A, Frost S. Evolutionary fingerprinting of genes. Mol Biol Evol. 2009;27:520–536. [PMC free article] [PubMed]
28. Kosakovsky Pond S, Delport W, Muse SV, Scheffler K. Correcting the bias of empirical frequency parameter estimators in codon models. PLoS One in press [PMC free article] [PubMed]
29. Kosakovsky Pond SL, Muse SV. Site-to-site variation of synonymous substitution rates. Mol Biol Evol. 2005;22:2375–2385. [PubMed]
30. Stanfel L. A new approach to clustering the amino acids. J Theor Biol. 1996;183:195–205. [PubMed]
31. Hasegawa M, Kishino H, Yano T. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. Mol Biol Evol. 1985;21:160–174. [PubMed]
32. Kosakovsky Pond SL, Frost SDW, Muse SV. HyPhy: hypothesis testing using phylogenies. Bioinformatics. 2005;21:676–9. [PubMed]
33. Schwarz G. Estimating the dimension of a model. Ann Stat. 1978;6:461–464.
34. Atkinson AC. A note on the generalized information criterion for choice of a model. Biometrika. 1980;67:413–418.
35. Rand WM. Objective criteria for the evaluation of clustering methods. J Amer Statist Assoc. 1971;66:846–850.
36. Kosakovsky Pond SL, Posada D, Stawiski E, Chappey C, Poon AFY, et al. An evolutionary model-based algorithm for accurate phylogenetic breakpoint mapping and subtype prediction in HIV-1. PLoS Comput Biol. 2009;5:e1000581. [PMC free article] [PubMed]
37. Bao Y, Bolotov P, Dernovoy D, Kiryutin B, Zaslavsky L, et al. The influenza virus resource at the national center for biotechnology information. J Virol. 2007;82:596–601. [PMC free article] [PubMed]
38. Yokoyama S, Tada T, Zhang H, Britt L. Elucidation of phenotypic adaptations: Molecular analyses of dim-light vision proteins in vertebrates. Proc Natl Acad Sci U S A. 2008;105:13480–13485. [PubMed]
39. Brumme ZL, Brumme CJ, Heckerman D, Korber BT, Daniels M, et al. Evidence of differential HLA class I-mediated viral evolution in functional and accessory/regulatory genes of HIV-1. PLoS Pathog. 2007;3:e94. [PMC free article] [PubMed]
40. Rousseau CM, Daniels MG, Carlson JM, Kadie C, Crawford H, et al. HLA class I-driven evolution of human immunodeficiency virus type 1 subtype C proteome: immune escape and viral load. J Virol. 2008;82:6434–6446. [PMC free article] [PubMed]
41. Russell CA, Jones TC, Barr IG, Cox NJ, Garten RJ, et al. The global circulation of seasonal influenza A (H3N2) viruses. Science. 2008;320:340–346. [PubMed]
42. Burnham K, Anderson D. Model selection and multimodel inference. New York: Springer; 2003. 2nd ed. edition.
43. Guindon S, Gascuel O. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol. 2003;52:696–704. [PubMed]
44. Tavaré S. Some probabilistic and statistical problems in the analysis of DNA sequences. In: Miura RM, editor. Lectures on Mathematics in the Life Sciences. Providence, R.I.: Amer. Math. Soc; 1986. pp. 57–86.
45. Yang Z. Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol Biol Evol. 1993;10:1396–1401. [PubMed]
46. Kosakovsky Pond SL, Frost SDW. Datamonkey: rapid detection of selective pressure on individual sites of codon alignments. Bioinformatics. 2005;21:2531–2533. [PubMed]
47. Wong W, Sainudiin R, Nielsen R. Identification of physicochemical selective pressure on protein encoding nucleotide sequences. BMC Bioinformatics. 2006;7:148–158. [PMC free article] [PubMed]
48. Whelan S, Goldman N. Estimating the frequency of events that cause multiple-nucleotide changes. Genetics. 2004;167:2027–2043. [PubMed]
49. Pupko T, Pe'er I, Shamir R, Graur D. A fast algorithm for joint reconstruction of ancestral amino acid sequences. Mol Biol Evol. 2000;17:890–896. [PubMed]
50. Kosakovsky Pond SL, Frost SDW. Not so different after all: a comparison of methods for detecting amino acid sites under selection. Mol Biol Evol. 2005;22:1208–1222. [PubMed]
51. Le SQ, Gascuel O. An improved general amino acid replacement matrix. Mol Biol Evol. 2008;25:1307–20. [PubMed]
52. Poon AFY, Lewis FI, Pond SLK, Frost SDW. An evolutionary-network model reveals stratified interactions in the V3 loop of the HIV-1 envelope. PLoS Comput Biol. 2007;3:e231. [PubMed]
53. Rodrigue N, Kleinman CL, Philippe H, Lartillot N. Computational methods for evaluating phylogenetic models of coding sequence evolution with dependence between codons. Mol Biol Evol. 2009;26:1663–76. [PubMed]
54. Yang Z, Nielsen R, Goldman N, Pedersen AM. Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics. 2000;155:431–449. [PubMed]
55. Sonnhammer EL, Eddy SR, Durbin R. Pfam: a comprehensive database of protein domain families based on seed alignments. Proteins. 1997;28:405–20. [PubMed]
56. Leslie A, Pfafferott K, Chetty P, Draenert R, Addo M, et al. HIV evolution: CTL escape mutation and reversion after transmission. Nat Med. 2004;10:282–9. [PubMed]
57. Saitou N, Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol. 1987;4:406–25. [PubMed]

Articles from PLoS Computational Biology are provided here courtesy of Public Library of Science