With the advent of genomic sequence data, phylogenetics is progressively switching to large-scale analyses, using many genes in parallel [
1]. Among the diverse methods that have been proposed for dealing with multigene data sets is the so-called supermatrix method [
2]. This method consists in concatenating the sequences of all available genes into one single "supergene", which is then subjected to standard phylogenetic reconstruction methods. An obvious advantage of relying on large sequences is the expected increase of the statistical support; as long as all or most of the genes included in the analysis have an evolutionary history congruent with that of their host species (i.e. in the absence of hidden paralogies or lateral gene transfers), the small amounts of phylogenetic signal contained in each gene should in principle add up and overwhelm stochastic errors, thus leading to a well-supported phylogenetic tree.
However, high statistical support does not necessarily imply that the trees obtained are correct. In some cases, in particular under poor taxon sampling [
3-
6], highly resolved trees have been proposed, which have nevertheless been followed by subsequent critical re-analyses, claiming that, however strongly supported, the trees obtained were probably wrong [
7-
9]. More generally, there are several cases where standard phylogenetic reconstruction methods yield wrong but statistically well-supported trees. These so-called systematic (as opposed to stochastic) errors have been known about for a long time in the field [
10,
11], and are expected to be also present, in fact even enhanced, in a phylogenomic context [
1,
12].
A first explanation of systematic errors in phylogenetics is that they are caused by the mutational saturation of the sequences: if some positions have undergone multiple substitutions, this will blur the phylogenetic signal, and thereby increase the probability for several species to display convergent sequence patterns (homoplasies) at those positions. Many reconstruction methods are not able to correctly identify these convergences, and will instead interpret them as shared derived characters. As a consequence, they will be misled towards reconstructing a wrong tree. A typical instance of this phenomenon, called the long branch attraction (LBA) artefact [
10], occurs when two phylogenetically distant species, evolving significanlty more rapidly than the rest of the taxa (hence have long branches), deceivingly appear as closely related in the estimated tree. Similarly, when a distant outgroup is used, a divergent species may be attracted by the long branch separating the in- and the outgroup, and thus be artefactually put at a basal position [
11].
According to this explanation, removing the most saturated sites should improve the accuracy of the reconstruction. In this direction, several methods have been proposed, for selecting less diverged sequences [
9,
13], or filtering out saturated sites [
14]. In most cases, these methods seem to bring a significant improvement. In addition, they are particularly advantageous in a phylogenomic context, where the amount of data is not limiting: fairly stringent filtering criteria can be applied, removing a large amount of data, but still leave behind a more than sufficient amount of phylogenetic signal to obtain well resolved trees [
1].
An alternative way to deal with LBA artefacts is to avoid long branches altogether [
12]. For instance, one can simply eliminate the fast-evolving taxa, and replace them by slow-evolving close relatives. This method was applied to the animal phylogeny, using 18S ribosomal RNA [
15], and led to a reappraisal of the position of nematodes. Specifically, whereas fast-evolving nematodes, such as
Caenorhabditis, would appear at the base of the group of bona fide coelomate Bilateria, a more slowly evolving one,
Trichinella, appeared within arthropods. A related method consists in 'breaking' a long branch, by adding a series of intermediate taxa thought to emerge along this branch [
16].
Altogether, a combination of a better taxon sampling and a more careful selection of sites or sequences makes it possible to converge to reliable phylogenies. And indeed, at most evolutionary scales (mammals [
17], metazoans [
9], plants [
8], eukaryotes [
18]), a consensus seems to be emerging regarding most evolutionary relationships in all these groups. On the other hand, such careful methods require significant expertise in phylogenetics. More fundamentally, they are exclusively focussed on the quality of the data, leaving open the problem of understanding why current reconstruction methods are so prone to systematic artefacts.
After all, LBA artefacts were initially described as an explanation of the statistical inconsistency of Maximum Parsimony (MP) [
10], in contrast to probabilistic methods, which are consistent in a broad range of conditions [
19,
20]. As such, LBAs were expected to disappear once more reliable methods such as the Maximum Likelihood or the Bayesian frameworks became routinely used. Yet, artefacts are also observed under these methods, especially under poor taxon sampling. In a statistical perspective, the explanation of this apparent paradox is straightforward: the consistency property assumes that the underlying model is correct. Hence, such systematic errors simply betray that current models are mis-specified [
21,
22].
Note that explaining systematic artefacts as a model violation problem, as we do now, rather than one of mutational saturation, or of taxon sampling, as we did above, are not mutually exclusive arguments. When the data are not or are weakly saturated, current models (and also the MP method), all lead to the correct topology. It is only when the data are more strongly saturated, and the taxon sampling is not sufficiently rich to reveal the true extent of saturation, that a good model becomes necessary: in such situations, only the model can correctly estimate the frequency of multiple substitutions across the alignment, thereby avoiding systematic errors. Thus, what phylogenetic artefacts betray is fundamentally a lack of robustness of current models. More specifically, it points to the inherent propensity of these models to under-estimate the true level of saturation.
Many directions have already been explored to improve phylogenetic models, by accounting for compositional biases [
23,
24], across site heterogeneities of the rates [
25,
26], or of the substitution processes [
27-
32], or by acknowledging the variation of site-specific rates with time [
33], non-independence between sites [
34-
36], etc. Some of these models have indeed resulted in improved phylogenetic inference [
13,
21]. In the present work, we will focus on site-heterogeneities of the amino-acid replacement processes, which may have a particularly strong impact on the way the model evaluates sequence saturation. A striking feature that one readily observes when working with protein alignments is the biochemical specificity observed at each site: in spite of the fact that there are 20 amino acids, only 2 to 4 distinct residues are typically observed at a given variable column, suggesting that most positions undergo repeated substitutions among a very restricted subset of the amino-acid alphabet [
37,
38]. Obviously, this pattern has a direct bearing on the expected level of homoplasy, as convergent evolution towards the same amino-acid will be all the more frequent as few amino-acids are allowed at a given site. It is therefore crucial to correctly account for this fact in models of protein evolution that are to be used for phylogenetic reconstruction.
In this direction, we have previously developed a mixture model that accounts for across site heterogeneities in the evolutionary processes [
31]. Thanks to a Dirichlet process device, implemented in a Bayesian Monte Carlo framework, this model, CAT, effectively clusters the columns of the alignment into biochemically specific categories, each of which is described by its own amino-acid profile of equilibrium frequencies. By Bayes factor evaluation, we have shown previously that CAT generally has a better fit than homogeneous models based on one single empirical substitution matrix, such as WAG [
39], JTT [
40], or even the most general site-homogeneous and time-reversible model (GTR) [
31,
41]. We now apply the CAT model to the bilaterian phylogenomic dataset of Philippe et al [
1,
9]. This dataset displays an interesting example of systematic artefact When analyzed with current models of evolution, depending on the outgroup, two highly supported, yet contradictory, phylogenetic positions are obtained for two fast-evolving phyla, nematodes and platyhelminths. This artefact offers an experimental protocol for testing alternative models of evolution. Specifically, a good model should not lead to contradictory results depending on the chosen outgroup. In the present work, we use this case study to compare the performance of CAT to that of a site-homogeneous model based on the WAG matrix [
39].