While nucleotide context-dependent models can offer large improvements in terms of model fit to the data as compared to independent evolutionary models, their performance has to be compared to the performance of both codon-based and codon partition (CP) models. Shapiro et al. [1
] state that few analyses in the literature on protein-coding sequences actually take into account the genetic code. Moreover, commonly employed model selection techniques (such as Modeltest by Posada and Crandall [2
]) exclude codon-based models from the model comparisons, presumably due to the associated computational cost. Shapiro et al. [1
] show, in alignments of 177 RNA virus genes and 106 yeast genes, that the codon-based GY94 model [10
] is the model of choice for all the yeast genes and for 67% of the virus genes. However, due the computational cost the GY94 model seems only of use in smaller datasets. Compared to the independence models (such as the frequently used GTR + Γ + I model), the CP models performed better on average for all the alignments analyzed. Such biologically motivated CP models are hence a computationally feasible alternative to codon-based models for use with protein-coding sequences, frequently outperforming standard nucleotide models.
We have shown in this paper that the computational requirements for integrating full codon models in a thermodynamic integration framework for model comparison increase drastically. While we have focused on the model of Goldman and Yang [10
] as the full codon model of choice in this paper, better performing codon models have been developed (we refer to [28
] for a review on codon substitution models). A study on full codon models is however beyond the scope of this paper, as the main goal was to show that current codon partition models can be approved upon at a computational cost below that of full codon models.
We have shown in this work that codon partition models that are extended with a context-dependence pattern for the third codon position across the entire underlying phylogenetic tree (so-called context-dependent codon partition models) significantly improve model fit compared to traditional codon partition models. While this work was mainly inspired by empirical findings [11
], other context-dependent assumptions to extend traditional codon partition models can be devised and tested, which is the subject of ongoing work. The models proposed in this paper illustrate the complex (context-dependent) substitution patterns at the third codon position, along with increased rates of evolution at that position.
The approach we have taken to model context-dependent evolution at the third codon position selects one of 16 models using the neighbouring base combination at the start of each branch, which means that for a given site the context might change at each internal node of the underlying phylogenetic tree. A limiting aspect of this approach however is that the neighbouring base combination is assumed not to change along the length of a branch. This leads to only one in three codon positions (i.e. the third codon position and its two immediate neighbours) that can undergo substitutions, which is an assumption similar to that of full codon models (i.e. only one position in the codon can undergo substitutions). Contrary to the original context-dependent model developed for non-coding sequences [13
], the evolution of the third codon position could be allowed to depend upon the neighbouring base combination at both the start and the end of each branch. However, such an assumption leads to 256 possible combinations of neighbours, instead of 16 used in this paper. It therefore seems highly unlikely that a sufficient performance increase is obtained to overthrow the penalty in model fit for the additional parameters. We have hence refrained from testing such an assumption in this paper.
Apart from drastically increasing the number of parameters to model the evolution at the third codon position more closely, as indicated in the previous paragraph, a continuous-time approximation is often used to allow the neighbouring base combinations to evolve along the length of a branch. The approach we've taken to do this partitions each branch into parts with length no greater than 0.005 [29
]. As we have shown for non-coding sequences [17
], this approach does not yield significantly differing substitution parameter estimates or ancestral root distribution estimates. However, such an approach does lead to higher log Bayes factors compared to when the branches are not split into parts. Given that this approach does not lead to significantly different parameter estimates, the increased log Bayes factor can be attributed to the more accurate approximation of the context-dependent Markov substitution process by allowing the ancestral sequences to change in between the internal speciation nodes [17
The results shown in this paper hence support the notion that four-fold degenerate sites in the atpB and rbcL protein-coding genes of land plants have a substitution process that is dependent on the composition of its neighbouring bases. In the work that inspired this paper, Morton [11
] suggested that a plausible explanation for such context dependence lies with misincorporations by either the DNA polymerase or the mismatch repair process. We refer to different papers by Morton [11
] and Morton et al. [31
] for additional plausible explanations for context dependence in the chloroplast genome and to the paper of Hawk et al. [33
] for a study on the variation in the efficiency of DNA mismatch repair at different sites in the yeast genome.
Even though we have found that the same context-dependent codon partition model performed best amongst all considered competing models for both atpB and rbcL genes, we have shown that the (context-dependent) substitution patterns at the third codon position differ greatly between both datasets. This means that, should both genes be concatenated to form a larger alignment, different context-dependent codon partition models may need to be used for each gene to perform accurate phylogenetic reconstruction for such a concatenated alignment. The substitution patterns discussed in this paper hence provide additional indications that great care needs to be taken in the analysis of concatenated genes.
As we have introduced the concept of context-dependent codon partition models in this paper, we have not yet undertaken an attempt to perform phylogenetic inference on protein-coding sequences using these models as the relationship between (increases in) model fit and the ability to accurately reconstruct phylogenetic trees is intricately complex [9
]. In order to perform an extensive phylogenetic study on the context-dependent codon partition models we have introduced in this paper, additional context-dependent codon partition models need to be programmed and compared against the models presented in this paper, which is the subject of ongoing work. Indeed, while this paper introduces a class of context-dependent codon partition models, only one type of such models is discussed (i.e. those where the evolution of the third codon position depends upon its two immediate neighbours). However, a whole range of such models can be developed, which may for example be aimed towards mimicking full codon models more closely, which was not the aim of this study. The performance of these models can then be compared to the performance of a range of full codon models, in the interest of performing a study on inferring phylogenies. Specifically, the comparison with the codon model of Muse and Gaut [34
] would be interesting, since this model, and its derivative MG-type models, relies on estimating only 12 equilibrium frequencies, whereas GY-type models estimate 61 equilibrium codon frequencies [28
]. A thorough model comparison, like the one we've performed on non-coding sequences [35
], is necessary to determine an accurate evolutionary model for phylogenetic inference.
The computational burden of our context-dependent codon partition models is, as for standard CP models, much lower than for actual codon models due to the decrease in number of parameters and the easier computation of eigenvalues and eigenvectors of the substitution matrices. Indeed, the spectral decomposition of the codon probability matrix (i.e. model) as well as the computation of its powers is considerably slower than for a nucleotide model. Computational effort in computing the eigenvalues and eigenvectors of a matrix rises as the cube of the number of rows or columns, hence the effort is 613
≈ 3,547 times greater for a codon than for a nucleotide model [36
]. Since our best-performing context-dependent codon partition model uses 20 nucleotide models to estimates its parameters, computational efforts of a codon model vis-à-vis this model are to 613
) ≈ 222 times greater. In other words,
codon partition models (context-dependent or not) replace the high dimensionality of codon models by a series (or combination) of low dimensional matrices, for which spectral decomposition requires much less computational efforts.
The computational differences listed in the above paragraph are theoretical however, since often other estimations (such as data augmentation) need to be performed alongside eigenvalue decompositions, which do not require calculation of new eigenvalues. We have hence measured the computation time for the different classes of model present in this paper and have listed the results in Table . Given the many different calculations we had to perform, two different systems were used to obtain the results. The architecture used to perform the thermodynamic integration for the codon partition models (CP), context-dependent codon partition models (CDCP) and GY94 model with equal rates is a 4-core AMD Opteron 2356 processor (B3 stepping) clocked at 2.3 Ghz, which has 2 Mb L3 cache, requires DDR2 memory and was introduced in April 2008. The architecture used for the GY94 model with among-codon rate variation is a 8-core Intel Xeon X7560 clocked at 2.27 Ghz, which has 24 Mb L3 cache, requires DDR3 memory and was introduced in March 2010. Note that our software routines are single-threaded. Given the methodology we presented in this paper, it can be concluded that in the thermodynamic integration framework the CDCP models are roughly 2 times more demanding than the existing CP models, whereas the GY94 model with equal rates requires roughly 26 times the computational demands of the existing CP models. The comparison with the GY94 model with among-codon rate variation is more difficult since a different processor architecture was used for those calculations.
Computational demands: number of iterations and computation time for the comparison of the different models in the thermodynamic integration framework.