|Home | About | Journals | Submit | Contact Us | Français|
Explicit evolutionary models are required in maximum-likelihood and Bayesian inference, the two methods that are overwhelmingly used in phylogenetic studies of DNA sequence data. Appropriate selection of nucleotide substitution models is important because the use of incorrect models can mislead phylogenetic inference. To better understand the performance of different model-selection criteria, we used 33,600 simulated data sets to analyse the accuracy, precision, dissimilarity, and biases of the hierarchical likelihood-ratio test, Akaike information criterion, Bayesian information criterion, and decision theory.
We demonstrate that the Bayesian information criterion and decision theory are the most appropriate model-selection criteria because of their high accuracy and precision. Our results also indicate that in some situations different models are selected by different criteria for the same dataset. Such dissimilarity was the highest between the hierarchical likelihood-ratio test and Akaike information criterion, and lowest between the Bayesian information criterion and decision theory. The hierarchical likelihood-ratio test performed poorly when the true model included a proportion of invariable sites, while the Bayesian information criterion and decision theory generally exhibited similar performance to each other.
Our results indicate that the Bayesian information criterion and decision theory should be preferred for model selection. Together with model-adequacy tests, accurate model selection will serve to improve the reliability of phylogenetic inference and related analyses.
Among the rigorous methods of tree reconstruction that are available, maximum likelihood (ML) and Bayesian inference (BI) have dominated phylogenetic studies in recent years [1-6]. Both methods are based on the likelihood function, which needs explicit models of evolution to capture the underlying evolutionary processes in sequence data [1-6]. For DNA sequences, the models are the evolutionary characterisation of one nucleotide being replaced by another one. Although the models are simplifications of the "true" evolutionary processes and are clearly wrong [3,7,8], they are approximations that have been widely accepted. The assumed model of nucleotide substitution can exert a significant influence on phylogenetic estimation. This is an increasingly important concern in the modern genomic era, with the growing use of multiple loci that have probably been subject to different substitution processes .
A variety of nucleotide substitution models have been devised, most of which are special cases of the general time-reversible (GTR) model in which each of the six pairwise nucleotide changes can have a distinct rate, and the frequencies of the four nucleotides are allowed to take different values . Common extensions to this model include parameters for a proportion of invariable sites (I) and for gamma-distributed rate heterogeneity among sites (Γ). In the last few years, many improvements have been explored, including models that account for differences among the three codon positions [10,11], pattern heterogeneity of the substitution process (e.g., ), among-site heterogeneity of rates (e.g., ), compositional heterogeneity among lineages (e.g., ), and site-specific rate variation through time (e.g., [15,16]).
Statistical methods are often used to identify the substitution model that best describes the given DNA sequence data. Model selection using software such as ModelTest , DT-ModSel , and jModelTest [18,19] has now become standard procedure in phylogenetic analysis [4,20]. Alternatively, model determination can be conducted using a reversible-jump Markov chain Monte Carlo approach in a Bayesian setting . This differs from past practice when model choice was conducted without statistical justification or by choosing the most parameter-rich model available [6,8]. Model selection is considered important because the use of alternative models can change the results of phylogenetic analysis. It has effects not only on the estimation of model parameters (e.g., genetic distances and branch lengths; [2,21]), but also on estimates of bootstrap support and posterior probabilities . Furthermore, misspecified models can lead to errors in phylogenetic inference, especially for trees with short internal branches [6,21-23].
One of the challenges facing researchers is how to select the most appropriate substitution model for a given dataset. There is now a range of procedures from which to choose, including the hierarchical likelihood-ratio test (hLRT) [24-26], Akaike information criterion (AIC) [27,28], Bayes factor [29-31], Bayesian information criterion (BIC) , posterior probability [29,33,34], decision theory (DT) , and the emerging approach of cross-validation [35,36]. All of these can be used to select the best-fit model from a set of candidates, but differ in specific algorithms which may ultimately give rise to differences in their performance (for further details, see Methods).
Of the four widely-used model-selection criteria in phylogenetics - the hLRT, AIC, BIC, and DT - the hLRT was once argued to be reasonably accurate and to perform better in general than the AIC and BIC, a conclusion drawn from analyses of simulated data comparing six models . However, the hLRT has been demonstrated to have several disadvantages, such as a dependence on the starting point and the path through the hierarchy of models [37,38], which undermine and limit its performance in model selection. It has been established that both the BIC and DT tend to select simpler models than the AIC [3,4,8,39], while the hLRT particularly favours complex ones [3,40]. Applying them to empirical data, Pol  found that different best-fit models were selected by the hLRT and AIC for 16 out of 18 datasets. Ripplinger and Sullivan  found that the hLRT, AIC, BIC, and DT criteria often selected different models for the same real datasets, which was similar to the results of Abdo et al.  based on simulated data using parameters estimated from a rodent mtDNA dataset; in contrast, several empirical studies found that the BIC and DT often selected the same model [4,39]. However, there is a need for a comprehensive systematic study of the performance of model-selection criteria.
Here we present a study of the performance of the four model-selection criteria hLRT, AIC, BIC, and DT. Considering the biases in model selection revealed by previous studies (as described above) and the convenience of simulated data for theoretical investigation , we conducted a total of 14 simulations of 33,600 datasets. Our investigation was limited to the 24 fundamental substitution models from the GTR family, assuming a stationary, time-reversible, and homogeneous Markov process. Based on the best-fit models selected by these criteria for these simulated datasets, we examined for each criterion the success rate of recovering simulated models (its accuracy) and the number of different models selected across replicate datasets (its precision); the dissimilarity and model biases of these criteria (see Methods for details) were also examined and compared statistically. In addition, we examined dissimilarity in analyses of datasets that were simulated under a slightly more complex model based on a simple homogeneous codon-substitution process. On the whole, our study aims to provide a comprehensive evaluation of the performance of model-selection criteria.
In the 14 simulations (Table (Table1;1; see Methods), the mean accuracy scores for the BIC and DT were higher than those for the hLRT and AIC. The ANOVA-LSD tests demonstrated that there were no significant differences for the pairs of hLRT-AIC and BIC-DT respectively; however, very significant differences existed for the other pairs such as hLRT-BIC (P < 0.01). The full results of the accuracy analysis are provided in Additional file 1.
The hLRT exhibited high accuracy in recovering some models, but unexpectedly, it was always incapable of recovering the four SYM-like models (i.e., SYM, SYM + I, SYM + Γ and SYM + I + Γ; Table Table1)1) (Figure (Figure1).1). The AIC showed moderate or low accuracy except for a few complex models (e.g., GTR + I + Γ) for which the accuracy was even as high as 1.00 in certain simulations. The accuracy of the BIC and DT differed among simulations. In most cases, they showed high accuracy in recovering almost all of the 24 models (Figure (Figure1A).1A). Compared with the other models, however, two (SYM + I + Γ and GTR + I + Γ) were only moderately recovered in simulations derived from parameter set-B; even all of the SYM-like and GTR-like models were recovered less frequently in simulation I-8 (tree topology with a height of 0.1; Figure Figure2D)2D) (Figure (Figure1B).1B). The BIC and DT exhibited similar accuracy in simulation I-4 (ultrametric tree topology of 30 taxa, 0.1 tree height; Figure Figure2D)2D) and simulation IV-1 (non-clock tree topology of 22 taxa; Figure Figure2G),2G), both recovering less than 35% of models of base + I + Γ category (i.e., JC + I + Γ, K80 + I + Γ, etc.; Table Table1)1) (Figure (Figure1C).1C). In fact, they always selected models of base + Γ category (i.e., JC + Γ, K80 + Γ, etc.). Their accuracy values were high when the value of the parameter for proportion of invariable sites (pinv) was altered from 0.25 in parameter set-A to 0.5 in parameter set-B when simulating datasets (data not shown). Notably, the hLRT and AIC in these two simulations, especially the former criterion, also showed lower accuracy in recovering models of base + I + Γ category than in the other simulations.
There were very significant differences among the precisions of the four criteria in the 14 simulations (randomized block ANOVA; P < 0.01). The precision of the hLRT was very significantly different from that of BIC and DT in certain simulations (P < 0.01), but not in others, while the precision of AIC was very significantly different from that of the other three criteria (LSD, P < 0.01) in all 14 simulations. The precision of BIC was always similar to that of DT, with P-values ranging from 0.508 to 1.000 (LSD). The full results of the analyses of precision are provided in Additional file 2.
Although small discrepancies existed, precision values of the AIC were generally higher than those of the other three in the 14 simulations (Figure (Figure3).3). Their means ranged from 7.79 to 9.75, while standard deviations were also much larger and ranged from 4.169 to 5.160 (Additional file 2). This was mainly attributed to the fact that the AIC usually selected a dozen different best-fit models for each set of 100 replicates simulated under the same conditions, but at the same time, it selected only a few for datasets simulated under SYM-like and GTR-like models. Compared with the AIC, the other three criteria selected fewer different best-fit models, and their precision values were relatively stable among datasets generated under the same simulation conditions. However, precision values of the hLRT (means ranging from 3.29 to 4.83; Additional file 2) were generally higher than those of the BIC and DT, and in some cases were very significantly different. Therefore, the BIC and DT exhibited the best precision among the four criteria - lower mean and smaller standard deviation - while that of the BIC was little better than that of DT (Additional file 2).
The percentages of one same model, two models, three models and four models being selected within each set of 100 replicate datasets were summarised, and Table Table22 shows their means in each of the 14 simulations (Table (Table2).2). It was clear that two different best-fit models were generally estimated the most frequently by these criteria, with means ranging from 46.13% to 54.75%, followed by one same model from 33.83% to 47.67%, and three models from 5.17% to 12.88%. Four different models were favoured less than 0.25% by these four criteria in the 14 simulations, and even zero in I-4, IV-1, and V-2.
Figure Figure44 illustrates the dissimilarity values of the six criterion pairs, and Additional file 3 shows the results of the ANOVA-LSD tests. In the 14 simulations, there existed very significantly different variances for different criterion pairs (P < 0.01), and also very significantly different variances for datasets simulated using the 24 different models. Dissimilarity values of the BIC-DT pair, with means ranging from 0.17% to 7.67%, were very significantly different from those of the other criterion pairs (P < 0.01). Although the dissimilarity of hLRT-AIC was less than that of certain pairs (e.g., hLRT-BIC) in datasets simulated for some models (e.g., SYM-like models; Figure Figure4),4), the means were the largest in the 14 simulations (ranging from 48.21% to 63.29%). Its values were very significantly different from those of the other pairs in seven simulations. Across the other simulations, their differences from some of the other criterion pairs were not very significant in spite of low probabilities. With simulation I-4 similar to IV-1, dissimilarity values in simulations derived from parameter set-B were similar to each other (Figure (Figure4B);4B); the other simulations derived from set-A generally resembled each other (Figure (Figure4A4A).
In the additional simulation VI, which was performed using complex codon models, means of percentages that one same model, two models, three models, and four models were selected were 8.25%, 53.83%, 36.29%, and 1.63%, respectively. Among the criterion pairs, means of dissimilarity values ranged from 3.33% for BIC-DT to 84.88% for hLRT-AIC, and dissimilarity of BIC-DT was significantly different from that of the other criterion pairs (P < 0.01).
While considering the four model categories (base, base + I, base + Γ and base + I + Γ; Table Table1),1), the results of the chi-square (χ2) homogeneity tests  demonstrated that there were significant differences in model biases among the four criteria (α = 0.05) (Table (Table3).3). The hLRT was always significantly different from the other three (α' = 0.0083 by Bonferroni correction), with a relatively small portion of base + I category recovered in the 14 simulations (Figure (Figure5).5). In contrast, the BIC was always similar to DT with high probabilities except in simulation I-1, where DT recovered fewer models of base category. For comparisons between the AIC and BIC, and between the AIC and DT, differences were generally not significant in simulations other than I-4, IV-1, and I-1. In fact, there was an even or an approximately even distribution of model categories recovered by the AIC, BIC, and DT in these simulations; but for I-4 and IV-1, models of base + I + Γ category were recovered much less by the BIC and DT than the other model categories (Figure (Figure55).
Figure Figure66 shows the distribution of 11 model categories recovered in the 14 simulations based on the number of free parameters (Table (Table1).1). In all of them, with significant differences among the four criteria on the whole, each criterion was significantly different from any other except the pair of BIC and DT (see Additional file 4). However, there were also significant differences between the BIC and DT in simulation I-1.
Although there exist cases indicating that obvious violation of model assumptions could favour the true tree [43,44] under specific conditions (e.g., oversimplified model for trees within the "Farris zone"; ), it has been clearly demonstrated that identifying the best-fit model is beneficial to phylogenetic inference and to understanding the molecular-evolutionary process. It must be acknowledged, however, that an absolute characterisation of the true evolutionary history of real data is usually impossible. So one concern for researchers, who do not have knowledge of this evolutionary history, is the accuracy of model-selection criteria to ensure that the best model can be selected from the available candidates.
In this study, based on 14 simulations of 33,600 datasets performed using known models, we simply evaluated the accuracy and precision of four model-selection criteria. Higher accuracy was broadly coincident with better precision and vice versa, which the performance of AIC best explained. Nevertheless, there were exceptions to this general pattern. Since the hLRT always selected GTR-like and TrN-like models for datasets simulated using SYM-like models, its precision values were low and its accuracy was almost zero. Generally, the high accuracy and low precision of the BIC and DT indicate that they perform better than the other two criteria. This result is robust to the influences of different simulation programs (Seq-Gen  and Mesquite ), tree topologies comprising different numbers of taxa (22, 30, and 50), and sequence lengths (300 bp, 1,000 bp, and 2,000 bp) (Table (Table1).1). However, accuracy values in certain simulations may confuse the situation to some extent, due to different simulation conditions as described below.
Between simulations derived from Parameter set-A and those derived from set-B, discrepancies in performance existed for both the BIC and DT (e.g., accuracy values for models of base + I + Γ category between I-4 and I-8) (Figure (Figure1).1). Considering that altering pinv from 0.25 in simulation I-4 and IV-1 to 0.5 can improve the recovery of models of base + I + Γ category by the BIC and DT, pinv should be an important parameter influencing accuracy.
The accuracy of the BIC and DT in simulations with a tree height of 0.1 did not support the general conclusions very well, being different from the simulations using tree heights of 0.3, 0.5, and 0.7 (Figure (Figure1).1). At the same time, this happened in the simulation with the non-clock tree topology with both short and long branches. Given the lower accuracy of the other two criteria in these two simulations, we suspect that model selection might not be very effective for data of limited information content.
Our study revealed that the means of dissimilarity values of the hLRT-AIC pair ranged from 48.21% to 63.29%, and were the largest across the 14 simulations even though they were not necessarily statistically larger than those of the other pairs. In contrast, the BIC and DT tended to select the same model, with significantly lower means of dissimilarity values ranging from 0.17% to 7.67% across the 14 simulations. Model selection for datasets simulated under complex codon models also yielded the largest dissimilarity for hLRT-AIC and the smallest for BIC-DT. Thus, we can envisage that researchers using the hLRT and AIC would frequently face a problematic situation in which these criteria would select different models. At the same time, as an extension of the BIC incorporating branch-length error , DT brings doubt as to whether estimating branch-length error, a measure of phylogenetic performance, would influence model selection.
On the whole, if one uses these four criteria to select models for given sequence data, variance in best-fit models could be encountered with the greatest possibility of two different models, followed by moderate possibilities of one same model, and three models; there is little or no chance of encountering four different best-fit models. These results are generally consistent with those of Ripplinger and Sullivan , whose results were based on 250 real sequence datasets for the criteria of hLRT, AIC, BIC, and DT. However, the results of our additional simulation VI supported a different order: two models, three models, one same model, and four models. In all respects, results of both empirical studies  and simulation studies (this study) suggest that model selection with these criteria will often produce inconsistent models, which could present a confusing situation for researchers.
When considering the four model categories, one unexpected discovery was the relatively small portion of base + I category recovered in the 14 simulations, including those simulated with the pinv of 0.5 in parameter set-B. This result indicates that model selection with the hLRT is not sensitive to the proportion of invariable sites under the condition that there is no other among-site rate heterogeneity. In comparison, there was an approximately even distribution of the four model categories by the AIC, BIC, and DT in most simulations (Figure (Figure5),5), which is consistent with the ideal even distribution given that the number of datasets simulated under models of any one category (i.e., 100 × 6 = 600) is the same as that of any other category. In a sense, we could relate the typical insensitivity of the hLRT to the proportion of variable sites with its special hierarchy of pairwise comparisons, because this parameter is the last-optimised parameter in the default hierarchy of ModelTest v.3.7 . As it has been demonstrated that the hierarchy has an influence on the performance of hLRT [8,38], future work could investigate whether different hierarchies might lead to a different success rate in recovering models of base + I category.
Nevertheless, with our two standards of classifying the 24 GTR-family models, the results all confirmed that the BIC exhibited similar model biases to DT, with only a few exceptions. This was consistent with the results of accuracy, precision, and dissimilarity, which to some extent demonstrated that results based on our simulated datasets were reasonable and consistent.
Overall, our performance analysis based on simulated datasets indicates that the BIC together with DT should be preferred for model selection in phylogenetics, although some of our results departed from this general finding owing to specific simulation settings such as values of the proportion of invariable sites. However, in view of the improvements on GTR models explored in recent years (see Background), it is possible that the results from most of our simulated data represent a poor reflection of real sequence data, which have almost certainly evolved under more complex conditions. Accordingly, we suggest here that model selection by the BIC or DT, together with model adequacy tests by parametric bootstrap [48,49] or posterior predictive distributions , might be the best approach. Further studies can be done with simulated datasets under more complex evolutionary models to understand the performance of these criteria and to enhance phylogenetic studies.
We used two different parameter sets to simulate datasets. For parameter set-A , which was mainly derived from a mitochondrial DNA analysis except for the transition/transversion ratio and the proportion of invariable sites , the settings were as follows (as appropriate for each model): base frequencies 0.35 A, 0.22 C, 0.18 G, 0.25 T; rates (relative to GT) 2.675 AC, 7.35 AG, 6.125 AT, 0.225 CG, 30.7 CT; transition/transversion (κ) 2.0; gamma shape parameter (α) 0.67256; and proportion of invariable sites (pinv) 0.25. Parameter set-B was chosen according to the settings of Posada and Crandall  (as appropriate for each model): base frequencies 0.35 A, 0.15 C, 0.25 G, 0.25 T; rates (relative to GT) 2 AC, 4 AG, 1.8 AT, 1.4 CG, 6 CT; transition/transversion (κ) 2.0; and gamma shape parameter (α) 0.5. We set the pinv as 0.5 in parameter set-B. Gamma-distributed rates in both parameter sets were modelled with four discrete categories in the simulations.
Initial simulations were conducted to explore the general performance of the model-selection criteria. First, we generated four ultrametric tree topologies of 30 taxa using the program PAML 4.1  assuming a birth-death process (speciation rate 0.1, extinction rate 0.1, sampling fraction 1.0). Tree heights (i.e., expected number of substitutions per site from the root to each tip) were 0.7, 0.5, 0.3, and 0.1 for the four trees (Figure (Figure2A,2A, Figure Figure2B,2B, Figure Figure2C,2C, and Figure Figure2D,2D, respectively). Then, for each combination of parameter set and tree topology, we used Seq-Gen 1.3.2  to simulate 100 replicate datasets for 24 fundamental models of varying complexity from the GTR family (Table (Table1,1, simulation I). A sequence length of 1,000 bp was used because it was representative of empirical sequence lengths typically used in phylogenetic studies , and was sufficient to evaluate the performance of most model-selection criteria .
Other simulations were conducted to investigate the influence of certain simulation conditions on the performance of the model-selection criteria (Table (Table1).1). With other parameters fixed, the purpose of simulation II was to test the effect of varying the number of taxa. Simulation III employed the program Mesquite 2.6  to simulate datasets to investigate the impact of different simulation programs. Simulation IV adopted one non-clock tree of 22 taxa (Figure (Figure2G):2G): following the method of Lemmon and Moriarty , the internal branches were randomly labelled from 0 to 18, and each branch was then assigned a branch length of 102x/18-3, where x was the number assigned to that branch; similarly, the lengths of the external branches (randomly numbered from 0 to 21) were given by 102x/21-3. In simulation V, sequence lengths were 300 bp and 2,000 bp, respectively, with the other simulation conditions consistent with simulation I-2.
The most widely-used program for model selection, ModelTest v.3.7 , along with DT-ModSel , were employed to select the best-fit model. Default settings were used in each program; some default settings may influence the performance of certain criteria, such as the hierarchy of pairwise comparison of models for the hLRT [8,38], but some may not, including the use of neighbour-joining (NJ) to generate a starting tree [8,39]. The 56 nested candidate models (i.e., a simpler model is one special case of a more general model), corresponding to the '7 schemes' in jModelTest [18,19], included the 24 models used in our simulations. After likelihood scores (L) under the 56 candidate models were computed by PAUP* v.4.0b10  based on NJ trees, the hLRT together with the AIC and BIC was applied to model selection using ModelTest v.3.7 ; DT-ModSel was used for DT .
For the hLRT, the pairwise likelihood ratio test is given by
where L0 is the likelihood score under the null hypothesis (simple model) and L1 is the likelihood score under the alternative hypothesis (complex model). Although this is widely accepted for testing the fit of nested candidate models in a specific sequence, there are many possible ways to traverse the hierarchy of pairwise model comparisons [8,38,53]. We used the default hierarchy in ModelTest v.3.7 . The LRT statistic approximately followed a standard χ2 distribution. However, when the null fixed parameters were at the boundary of the parameter space of the alternative model (i.e., for tests of rate homogeneity among sites and invariable sites), the mixed χ2 distribution (consisting of 50% and 50% ) was used to construct the tests [54-56]. We used 0.01 as the significance level for rejecting or failing to reject the null model.
The AIC is an asymptotically unbiased estimator of the Kullback-Leibler distance between the "true" model and the fitted model [27,57]. In contrast with the hLRT, the AIC can simultaneously compare all candidate models irrespective of their nesting status, and is defined as
We computed the AIC instead of the AICc for given models, in that the sample size n (i.e., the sequence length) for most of our simulated datasets was large enough compared with the number of parameters (K) [40,58]. The candidate model with the lowest AIC value was selected as the best-fit model.
where n is the sample size (i.e., the sequence length) and K is the number of parameters. Given equal prior probabilities of candidate models, the model yielding the smallest BIC was the one with the highest posterior probability and was selected as the best-fit one.
DT, a novel performance-based method of model selection, is an extension of the BIC and specifically incorporates branch-length error as a measure of phylogenetic performance in the course of model selection . In DT-ModSel, DT estimated all candidate models through a penalty function, which was related to the difference in branch-length estimates across models. The model with the minimal posterior penalty was selected as the best-fit model.
For each set of 100 replicates, estimated best-fit models derived from each simulation were compared with the known model under which replicate datasets were simulated. We recorded the number of times that they matched (Numbermatched), and the accuracy of each criterion was then estimated by the rate of recovery of the simulated models, i.e.,
where Numbertest was 100. The ANOVA for two-way randomised block design and LSD test were used to compare the accuracy of these four model-selection criteria.
We counted how many different best-fit models were selected by each criterion within each set of 100 replicates. We did not take account of how many times any particular model was selected among the 100 replicates, but simply interpreted the number of different models selected as an estimate of the precision of each model-selection criterion (i.e., we gave the same weight to each model selected). Therefore, the precision of these criteria for simulated models was approximately evaluated and was compared by the ANOVA for two-way randomized block design and LSD test.
To test whether the four criteria could select the same best-fit models for each replicate dataset or not, we counted the number of different models selected by the four criteria for each replicate dataset. We computed the percentages of one same model, two models, three models, and four models being chosen by the four model-selection criteria for each set of 100 replicates. In addition, we analysed the dissimilarity of two criteria based on the number of different models. For each pair of criteria, the number of instances of one same model (m) across each set of 100 replicates was counted, and their dissimilarity was computed as 
where N was equal to 100. The ANOVA for two-way randomized block design was used to compare the dissimilarity of the six pairs of criteria.
Based on the success of recovery of the true model in each simulation (Numbermatched; see the 'Accuracy' part), we employed the χ2 homogeneity test  to compare the four criteria in terms of the composition of selected models. Here, we adopted two standards to classify the 24 simulated models (Table (Table1).1). Firstly, we classified them into four categories: base (JC, K80, etc.), base + I (JC + I, K80 + I, etc.), base + Γ (JC + Γ, K80 + Γ, etc.), and base + I + Γ (JC + I + Γ, K80 + I + Γ, etc.). These four model categories were then compared (with 3 degrees of freedom). Secondly, the 24 models were classified into 11 categories according to the number of free parameters (from 0 to 10; see Table Table11 for details), and then analysed by the statistical test (with 10 degrees of freedom).
To evaluate the performance of the four criteria when the generating model is more complex than any of the 56 candidate models, we conducted another simulation (VI) as a special case using Recodon v.1.6.0 . With the GY94 codon model  crossed with the settings in parameter set-A for the 24 fundamental models of GTR family respectively, and with the same base frequencies for all codon positions (1 × 4), 100 replicate datasets were simulated for each of the 24 simulated models here (i.e., a total 2,400). Other simulation parameters were set as sequence length 999 bp, number of sequences 30, a constant nonsynonymous/synonymous rate ratio (ω = 0.016) across sites, and mutation rate 0.001. Following model selection with the four criteria as above, the dissimilarity among the criteria was examined. Other aspects of performance were not evaluated, as those analyses were not practicable or were probably affected by the fact that the true model was not represented among the candidates.
CZ, AL, AZ, and YZ conceived this study. AL, WS, AZ, and WX performed the work and the statistical analyses. HQ, WX, and SYWH participated in discussions and wrote code to facilitate computations. All authors read and approved of the final manuscript.
Table of statistics of accuracy values in the 14 simulations. The statistic variables include some derived from descriptive statistics, and the others from the ANOVA-LSD tests. The column 'LSD sig.' only shows criterion pairs that had no significant differences (α = 0.01).
Table of statistics of precision values in the 14 simulations. The column 'LSD sig.' only shows criterion pairs that had no significant differences (α = 0.01).
Table of statistics of dissimilarity values in the 14 simulations. For convenience of notation, in this table 'a' denotes the pair of hLRT and AIC; 'b', hLRT and BIC; 'c', hLRT and DT; 'd', AIC and BIC; 'e', AIC and DT; 'f', BIC and DT. It is noted that only pairs with significant differences are shown in the column 'LSD sig.' (α = 0.01).
Table of statistics of χ2 and multiple comparison tests for the 14 simulations. The 24 models were classified into 11 categories according to the number of free parameters (from 0 to 10).
We would like to thank Dr D. Posada for valuable comments and suggestions on earlier versions of this paper, and Dr Xinhai Li for advice on statistical analyses. Both Dr A. P. Vogler and Dr Ming Li provided encouragement and helped to improve the manuscript. We also appreciate helpful comments from two anonymous referees and the editors.
This project was supported mainly by the Innovation Program in the Chinese Academy of Sciences (KSCX2-YW-NF-02), grants from the National Science Foundation, China (30870268, J0930004), and partially by Public Welfare Project from the Ministry of Agriculture, China (Grant No. 200803006) to Chaodong Zhu. It was also supported by Beijing Municipal Natural Science Foundation Key Projects (No. KZ201010028028) to Aibing Zhang.