The simulation results show that under a cut-off
p-value = 0.05 the likelihood ratio test is more accurate than AIC and AICc. With the exception of the
π parameters, AIC and AICc chose overly complex models more frequently than did the backward elimination procedure. For the
π parameters, AIC and AICc chose overly simple models more frequently than did backward elimination. The difference lies in the different cut-off values that are used to penalize the more complex model. Take the heterogeneity test of
κ as an example (df = 1), the LRT statistic is defined as twice the difference in log likelihood values between a pair of nested models: Λ = 2 × (ln
L(

|
x) - ln
L(
θ0 |
x)). Based on the LRT under a significance level of 0.05, we reduce the complexity of the model if Λ is smaller than the critical value 3.84. Under AIC we choose a simpler model only when Λ < 2; hence, AIC tends toward more complex models. However when we reduce a model by more than 7 parameters (e.g. homogeneous
π's versus heterogeneous
π's, with d.f. = 9), the critical value becomes 16.92 for the LRT, which is less than the critical value of Λ under AIC, 18. In this case, AIC will select the same or simpler model than the LRT. Note that AICc compares Λ with 2 ×
k ×
n/(n-2) which is always greater than 2 ×
k used by AIC, hence AICc will always choose the same or simpler model than AIC. This property of AICc had only a small effect on the results of model selection, as AICc performed only slightly better than AIC but substantially worse than backward elimination.
LRT statistics involving parameters such as
ω appear to be asymptotically
χ2 distributed for random-effect codon models; such models employ a parametric distribution (
e.g., the
β distribution of Models M7 and M8 [
2]) to accommodate among site variability in the
ω ratio. However, Aagaard and Phillips [
8] reported that for comparisons of Yang and Swanson's [
4] models C and E (FE13 and FE1 in Figure ), the empirical distribution of LRT statistics deviated from the expected
χ2 distribution, leading to a type I error rate in excess of that specified by the level of the test. The results of our simulation study suggest this bias might affect all tests in Figure involving parameters
κ,
ω and
c. Several authors have noted that LRT statistics derived from models that employ empirical estimates of nucleotide or codon frequencies might not be well approximated by a
χ2 distribution [
8,
28]. Moreover, when Aagaard and Phillips [
8] repeated their simulation study under equal codon frequencies and computed LRT statistics by using models with frequency parameters fixed to the true values (
πi = 1/61), they found that the LRT statistics matched the
χ2 expectation. Aagaard and Phillips [
8] suggested that the approximation of codon frequencies is the source of the observed bias in the LRT.
Indeed, empirical estimates do not satisfy the requirements of LRT [
21], and consequently the backward elimination procedure. To further investigate the impact of empirical estimates on model selection, we reanalysed all the simulated datasets under the true codon frequencies;
i.e., those used to generate the data. We note that for a given dataset the empirical codon frequencies yielded higher likelihood scores than did the true frequencies. This was expected, as empirical estimation will "pick up" some of the sampling errors in each simulation replicate. We found that the accuracy and bias of backward elimination, AIC and AICc under the true codon frequencies were identical to those when empirical codon frequencies were used. This suggests that bias in the model selection procedures used here did not arise from empirical estimation of
π's alone.
There are several possible explanations for the bias of all three model selection methods in the direction of greater complexity for one of
ω,
κ or
c. One possibility is that potential non-independence among the values for different parameters means that AIC might not be a good approximation to the Kullback-Leibler divergence, and that the requirements for the
χ2 approximation might not be met for the LRT, thus the degree of freedom is not accurate. Also, backward elimination may find a local optimum solution. Under backward elimination the cut-off
p-value is subjectively decided before the tests, leading to the possibility that the procedure will stop too early and suggest an overly complex model. This phenomenon is sometimes seen in the regression context. Clearly, these issues require further attention; in the mean time we explored the possibility of tuning the cut-off
p-value in order to improve the performance of backward elimination (see also [
8]). After evaluating several cut-off values for
p, we found that a substantial improvement in performance was obtained by using
p = 0.0001. Moreover, we found that by using this cut-off, nearly all the tendency to select an overly complex model for
ω,
κ or
c was avoided, and that errors for selecting overly-simplistic models happen mostly for datasets where one of the partitions was comprised of a very small number of codon sites.
Our application of fixed-effect models to real data was encouraging, having uncovered previously unrecognized heterogeneity among
Listeria genes and among sites within the abalone sperm lysin gene. However, if the objective is only to identify individual positive selection sites within a gene, the
a priori structural information is not likely to serve as a perfect proxy for those partitions most relevant to differences in selection pressures. For example, Yang and Swanson [
4] showed that the exposed sites of lysin likely include both conserved and positively selected codon positions. Hence, averaging
ω over all sites in the exposed partition yields a reduced estimate of positive selection pressure. We note this effect was the same under the best-fit model, FE9, as under FE1 (Model E) used by Yang and Swanson [
4]. We agree with Yang and Swanson [
4] in anticipating that the power of random-effect models to test the strength and direction of selection pressure at sites within genes will be greater than fixed-effect models in most cases.
If the objective is to investigate heterogeneous evolution among genes, as in genome-scale analyses, then fixed-effect models are useful. The present set of models represents only a small step towards genome-scale evolutionary models. For example, decoupling synonymous and nonsynonymous rates, as in the random-effect model of Kosakovsky Pond and Muse [
29], would allow users freedom to model gradients in synonymous substitution rates along a genome while allowing independent variability in nonsynonymous rates among genes. Yang and Swanson [
4] made several suggestions, including the intriguing idea of enforcing a molecular clock for synonymous changes and leaving nonsynonymous changes unconstrained. We predict that joining fixed-effect codon models and data-mining methods to obtain new methods analogous to model based clustering [
30,
31] could provide extremely useful tools for genome-scale data analysis. Lastly, there is growing interest in both the performance of codon models [
32,
33] and the impact of heterogeneity among genes [
34,
35] in multi-gene phylogenetic analysis; improved ability to model among-gene heterogeneity at the codon level could improve their utility for comparing alternative phylogenomic hypotheses.