For all three alignments, the SR model could be rejected in favor of a random multi-rate model in the majority of cases with the likelihood ratio test at
level (). For models with two rate classes, significantly improved model fit was evident in at least 43 and up to 80 of the 100 random models (15 and 66 with Bonferroni correction). Models with 5 rate classes showed significantly improved model fit for 96 to 100 of the 100 permutations. Our analysis demonstrates that given a sufficiently large alignment, effectively any random
multi-rate model with
rate classes is preferred to the SR model. This observation raises serious doubts as to the utility of a single rate model as a benchmark for model comparison.
Comparison of single rate versus random models for 3 alignments.
As an analogy, consider the family of nucleotide models, where JC69 
and the general-time reversible (GTR) model 
are representative of the two extremes of model space, where model space is defined by the number of pair-wise nucleotide substitution rates. Whilst one cannot argue against the GTR model being the most representative of the mutation process, it is seldom selected as the best fitting model. Indeed, of approximately
sequence alignments submitted to Datamonkey 
for nucleotide model selection, not a single one supported the GTR model over other models. Note that the model selection procedure in Datamonkey 
nucleotide time-reversible models. This approach is clearly infeasible for codon models, since there are
codon models with
rate classes, for example. The most frequently selected model (
of cases) was the two parameter HKY85 model 
. This does not suggest that HKY85 is the most biologically plausible, but rather the best approximation to the GTR given limited sample size.
Consequently, we should assess codon models not by whether or not they outperform the single rate model, but rather by how they measure up against the general codon model (i.e.
REV). We demonstrate using log likelihood scores. As previously shown 
, both LCAP and ECM models fit better than a single rate model, at least for the Pandit alignment (). Since SR is nested within LCAP, the improvement in log likelihood score follows by necessity. However, a glance at should convince the reader just how trivially easy it is to outperform the SR model. Comparison of models using BIC () indicates the GA to be the best model for all three alignments. As evident in comparison of log likelihoods (), BIC scores for random models also indicate improved fit over the single rate model. ECM is ranked second when fitted to one of the alignments used in ECM estimation (PF00803), yet fits rhodopsin and HIV-1 alignments worse
than the single rate model, suggesting it may be impossible to derive a generalist empirical codon model.
Figure 1 Distribution of log likelihood scores for multi-rate models where amino-acid substitutions are assigned randomly to 5 non-synonymous classes.
Comparison of empirical model fits using BIC.
When developing multi-rate models of codon evolution we should strive to not only beat the single-rate model, but also to approximate the REV model with the fewest possible parameters. Consider a multi-rate model with 5 independent rate parameters, as in LCAP. In this case we can plot the log-likelihood limits at which we reject a single-rate model (left hand dashed line in ), and at which we reject a 5-rate model in favor of the reversible model (right hand dashed line), say at
using the likelihood ratio test (
degrees of freedom in the first case,
in the second). The performance of most multi-rate codon models thus far falls between these limits, i.e.
the models improve upon the case of the single rate but can be rejected in favor of REV. We should construct multi-rate codon models that match the performance of REV in a statistical sense, with comparable likelihood scores, but with sufficiently few parameters to be computationally tractable and estimable from reasonable alignments. Only one model, the GA 
, achieves this in all cases. This model is set up so as to prevent over-parameterization, which is achieved by incrementing the number of non-synonymous rate classes,
, evaluating the fitness of a population of
-rate models using an appropriately chosen information criterion, and repeating until fitness is no longer improved with an increase in the number of rate classes. The fact that none of the
model selection analyses run on Datamonkey contained enough data to reject all simpler models in favor of a six parameter nucleotide GTR suggest that we should similarly focus our efforts in the codon space on models with a small number of rate classes, and investigate the space of candidate models thoroughly.
In conclusion, we have shown the single rate model to be a poor benchmark for model comparison, given that random models nearly always offer improved fit. We argue that the conceptual approach to codon model selection should instead focus on finding multi-rate models with a few parameters that can match the performance of REV, i.e. cannot be rejected in favor of REV, on alignments of biologically realistic size. Furthermore, our examples highlight the poor fit of “generic” empirical multi-rate models and suggest that new multi-rate models should be alignment specific. Whilst it is not advisable to fit a parameter rich REV model in practice, due to computational constraints and uncertainty in parameter estimates on small alignments, we should aim to derive the best model, given the limitations posed by the size of the alignment.