The deep phylogeny of eukaryotes is an important but extremely difficult problem of evolutionary biology. Five eukaryotic supergroups are relatively well established but the relationship between these supergroups remains elusive, and their divergence seems to best fit a “Big Bang” model. Attempts were made to root the tree of eukaryotes by using potential derived shared characters such as unique fusions of conserved genes. One popular model of eukaryotic evolution that emerged from this type of analysis is the unikont–bikont phylogeny: The unikont branch consists of Metazoa, Choanozoa, Fungi, and Amoebozoa, whereas bikonts include the rest of eukaryotes, namely, Plantae (green plants, Chlorophyta, and Rhodophyta), Chromalveolata, excavates, and Rhizaria. We reexamine the relationships between the eukaryotic supergroups using a genome-wide analysis of rare genomic changes (RGCs) associated with multiple, conserved amino acids (RGC_CAMs and RGC_CAs), to resolve trifurcations of major eukaryotic lineages. The results do not support the basal position of Chromalveolata with respect to Plantae and unikonts or the monophyly of the bikont group and appear to be best compatible with the monophyly of unikonts and Chromalveolata. Chromalveolata show a distinct, additional signal of affinity with Plantae, conceivably, owing to genes transferred from the secondary, red algal symbiont. Excavates are derived forms, with extremely long branches that complicate phylogenetic inference; nevertheless, the RGC analysis suggests that they are significantly more likely to cluster with the unikont–Chromalveolata assemblage than with the Plantae. Thus, the first split in eukaryotic evolution might lie between photosynthetic and nonphotosynthetic forms and so could have been triggered by the endosymbiosis between an ancestral unicellular eukaryote and a cyanobacterium that gave rise to the chloroplast.
The deep phylogeny of eukaryotes is an extremely difficult and controversial problem. In the early days of molecular phylogeny, up to mid-1990s, the consensus appeared to be the crown-group phylogeny, that is, a tree that consisted of the crown including animals, fungi, plants, and some groups of unicellular eukaryotes (protists) and a number of “early branching” groups of protists (Sogin 1991; Sogin et al. 1993; Sogin and Silberman 1998). The crown-group phylogeny, in other words, the basal position of many, although not all, protist groups (fig. 1A), was supported by numerous phylogenetic analyses of rRNA as well as various conserved proteins. Even more importantly, the dominant evolutionary hypothesis at the time was the so-called archezoan scenario under which different amitochondrial protists (such as diplomonads or microsporidia) were thought to represent primitive eukaryotic forms, archezoa, one of which would become the host of the (proto)mitochondrial, α-proteobacterial endosymbiont (Cavalier-Smith 1993, 1998; Patterson 1999; Roger 1999).
Subsequently, however, it was shown that all protists that were studied in sufficient detail carried organelles related to mitochondria (mitosomes, hydrogenosomes, and others) and possessed genes of apparent protomitochondrial (α-proteobacterial) descent (Dyall and Johnson 2000; Roger and Silberman 2002; Embley, van der Giezen, Horner, Dyal, Bell, and Foster 2003; Embley, van der Giezen, Horner, Dyal, and Foster 2003; van der Giezen and Tovar 2005; Embley and Martin 2006; Minge et al. 2008). Thus, the apparent indications from cell biology that protists lacking typical mitochondria were evolutionarily primitive were, effectively, invalidated. In parallel, the early branching of protists was repeatedly challenged once it became clear that many of these organisms, especially parasites, evolve at a high rate, so that their basal position in trees could be a long-branch attraction artifact (Baldauf et al. 2000, 2003). Specifically, it was shown beyond reasonable doubt, by using phylogenetic methods that are relatively robust to long-branch effects, that microsporidia (one of the groups that appeared to best fit the definition of archezoa considering their simple cellular organization) are not a basal group but rather a highly derived, rapidly evolving sister group of fungi (Keeling and McFadden 1998; Keeling and Fast 2002; Fischer and Palmer 2005). Definitive phylogenetic affinities turned out to be hard to obtain for other former “archezoa,” in part, probably, owing to their rapid evolution. Nevertheless, the two major developments, the demonstration of the nonexistence of primitive amitochondrial forms among the rapidly increasing variety of well-characterized eukaryotes and of the unreliability of the basal position of protists, together led to the effective collapse of the crown-group phylogeny of eukaryotes.
The concept of eukaryotic phylogeny that comes closest to being the current consensus maintains that there are five or, possibly, six distinct major branches, or supergroups, in the eukaryotic domain of cellular life, namely, unikonts (an assemblage that includes opisthokonts, that is, Metazoa, Fungi, and related protists, and Amoebozoa, with the latter considered a distinct supergroup in some studies), Plantae, Chromalveolata, excavates, and Rhizaria (fig. 1B) (Adl et al. 2005; Keeling et al. 2005; Keeling 2007). The “higher” eukaryotes that comprise the core of the former crown group are thus split between two supergroups, unikonts (opisthokonts) and Plantae, whereas the remaining three supergroups consist of diverse protists. The monophyly of each of the supergroups is still questioned, as exemplified by recent multigene phylogenetic analyses that employed broad taxonomic sampling and diverse methods (Philip et al. 2005; Parfrey et al. 2006; Yoon et al. 2008).
Regardless of the exact status and composition of each individual supergroup, it appears that several major branches of eukaryotes diverged in a “Big Bang”-type event, where the internal branches in the tree are extremely short, so much so that the “true” tree topology might be undecipherable (Philippe et al. 2000; Rokas et al. 2005; Rokas and Carroll 2006; Koonin 2007). Nevertheless, attempts have been made to root the tree of eukaryotes by using apparent derived shared characters (synapomorphies) along with phylogenies of highly conserved proteins. These studies led to the conclusion that the root lies between the opisthokonts (Metazoa, Choanozoa, and Fungi) and the bikonts (all groups of eukaryotes that ancestrally possess two cilia, namely, plants and most of the protists), with the position of the Amoebozoa remaining uncertain (Stechmann and Cavalier-Smith 2002) but leaning toward an affiliation with opisthokonts (Stechmann and Cavalier-Smith 2003a). The conclusion on the monophyly of the bikonts rests, primarily, on the fusion of a single pair of essential genes, those for dihydrofolate reductase (DHFR) and thymidylate synthase, purportedly, buttressed by the analysis of domain architectures and sequence-based phylogenies of some highly conserved proteins, such as myosins (Richards and Cavalier-Smith 2005).
Considering the crucial importance of the sequence of events at the earliest stages of eukaryotic evolution for understanding the emergence of the key biological features of the major groups of eukaryotes, the inference of the root position on the strength of only one or two characters, however fundamental, seems unsatisfactory, given that parallel emergence of the purported derived character, such as a gene fusion, is difficult to rule out. Indeed, independent fusions of the same pairs of genes in diverse groups of eukaryotes as well as in eukaryotes and bacteria have been demonstrated in case studies (Yanai et al. 2002; Makiuchi et al. 2007). Furthermore, reversion of an ancestral fusion via the split of the fused genes in unikonts cannot be ruled out either.
We sought to reexamine the root position in the eukaryotic tree by means of a genome-wide analysis of rare genomic changes (RGCs). Lately, the analysis of RGCs, which can be exemplified by diagnostic gene fusions, domain architectures of proteins, or features of genome architecture such as gene overlaps, has become an increasingly popular approach to the study of deep evolutionary relationships, given that these characters appear to be less prone to various artifacts than standard methods of molecular phylogeny (Rokas and Holland 2000; Iyer et al. 2004; Luo et al. 2006). Although it can be argued that RGC-based methods effectively employ parsimony and so would be prone to the same artifacts as maximum parsimony methods in sequence-based phylogenetic analysis, this would not be the case if the RGCs were free of homoplasy (parallel changes and reversals), which is the primary problem for the maximum parsimony methods. Conceivably, if the analyzed changes are indeed rare and their number is sufficiently large, the effect of homoplasy would be minimized. It should be noted that molecular phylogeny methods that employ sophisticated models of sequence evolution, usually within the maximum likelihood framework, are not without their own serious problems that are related, mostly, to model overspecification and misspecification (proverbial attempts to “fit an elephant”) (Kolaczkowski and Thornton 2004; Steel 2005; Thornton and Kolaczkowski 2005; Stefankovic and Vigoda 2007). Application of sequence-based phylogenetic methods within the phylogenomic approach not only has the potential to substantially increase the resolution power but also poses challenges owing to horizontal gene transfer as well as different optimal models of evolution for different genes (Phillips et al. 2004; Bucknam et al. 2006; Dagan and Martin 2006; Bapteste et al. 2008).
The pitfalls that are inherent in even the most advanced maximum likelihood and Bayesian methods, in particular, in the phylogenomic setting, stimulate the search for RGCs that are most suitable for phylogenetic analysis.
Recently, we introduced a new class of RGCs designated RGC_CAMs (after conserved amino acids-multiple substitutions), which are inferred from genome-scale analysis of alignments of orthologous proteins and underlying nucleotide sequence alignments (Rogozin et al. 2007a, 2007b). The RGC_CAM approach utilizes amino acid residues that are conserved through long evolutionary spans and in major organismal lineages, with the exception of a few taxa that together comprise a candidate clade. In order to minimize homoplasy, only those amino acid replacements that require 2 or 3 nt substitutions are employed for phylogenetic inference. The RGC_CAM method, combined with a procedure for rigorous statistical testing of competing phylogenetic affinities, is specifically designed for testing (rejecting) evolutionary hypotheses that are presented as unresolved trifurcations of clades. A direct estimation of the level of homoplasy among RGC_CAMs revealed a nonnegligible number of parallel changes but nevertheless showed that the method is robust for a wide range of phylogenetic problems (Rogozin et al. 2008).
We were interested in applying the RGC_CAM approach to the relationship between the eukaryotic supergroups, a fundamental problem with an obvious bearing on the rooting of the evolutionary tree of eukaryotes. The problem with using RGC_CAMs for resolving such deep evolutionary relationships is that the number of characters that support a particular clade can be quite small. Therefore, we additionally employed a relaxed version of the RGC_CAMs, denoted RGC_CAs, where the requirement for multiple substitutions is lifted, of course, at the price of increased homoplasy (Rogozin et al. 2008). The combined results of these RGC analyses seem to, effectively, refute the bikont–unikont split as the first bifurcation in the evolution of eukaryotes and instead suggest the affiliation of the major protist groups with the animal–fungi (opisthokont) clade. This result is compatible with the scenario where the acquisition of the cyanobacterial symbiont (the future chloroplast) by an ancestor of Plantae triggered the first divergence of major clades in the evolution of eukaryotes.
Each of the 716 protein alignments (488,157 sites altogether) constructed from a previously delineated set of highly conserved clusters of eukaryotic orthologous genes or eukaryotic orthologous groups (KOGs) (Koonin et al. 2004) analyzed here included orthologs from eight eukaryotic species with completely sequenced genomes: Homo sapiens (Hs), Caenorhabditis elegans (Ce), Drosophila melanogaster (Dm), Saccharomyces cerevisiae (Sc), Schizosaccharomyces pombe (Sp), Arabidopsis thaliana (At), Anopheles gambiae (Ag), and Plasmodium falciparum (Pf) (Rogozin et al. 2003). To these KOGs, probable orthologs from 66 prokaryotic genomes from the COG database (Tatusov et al. 2003) were added using a modification of the COGNITOR method (Tatusov et al. 1997). Briefly, all protein sequences from the prokaryotic genomes are compared with the protein sequences previously included in the KOGs; a protein is assigned to a KOG when two genome-specific best hits to members of the given KOG are detected. We added five prokaryotic orthologs (denoted P1, P2, P3, P4, and P5) to each KOG and required these prokaryotic orthologs to belong to three or more major prokaryotic clades (see supplementary table S1, Supplementary Material online) (Basu, Rogozin, and Koonin 2008). The requirement for the availability of five diverse prokaryotic orthologs was satisfied for 396 of the initially selected 716 KOGs. 
To the resulting mixed COG/KOGs, probable orthologs from 25 other eukaryotic genomes, namely, those of Oryza sativa (Os), Physcomitrella patens (Ppat), Chlamydomonas reinhardtii (Crei), Ostreococcus lucimarinus (Oluc), Volvox carteri (Vcar), Monosiga brevicollis (Mb), Dictyostelium discoideum (Ddis), Entamoeba histolytica (Ehis), Giardia lamblia (Glam), Leishmania braziliensis (Lbra), Leishmania infantum (Linf), Leishmania major (Lmaj), Trypanosoma brucei (Tbru), Trypanosoma cruzi (Tcru), Babesia bovis (Bbov), Cryptosporidium hominis (Chom), Cryptosporidium parvum (Cpar), Phaeodactylum tricornutum (PhTri), Phytophthora infestans (Pinf), Phytophthora ramorum (Pram), Phytophthora sojae (Psoj), Paramecium tetraurelia (Ptet), Tetrahymena thermophila (Tthe), Theileria parva (Tpar), and Trichomonas vaginalis (Tvag) were added using COGNITOR. Amino acid sequence alignments are available at the authors’ Web site at ftp://ftp.ncbi.nlm.nih.gov/pub/koonin/RGC_CAM/eukaryotic_evolution/. To minimize misalignment problems, only conserved, unambiguously aligned regions of the alignments were subject to further analysis. Specifically, we only analyzed positions surrounded by segments of protein alignments containing no insertions or deletions with a 5-amino acid window from each side.
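The position filter described above can be illustrated with a minimal sketch. It assumes that the alignment is given as a list of equal-length strings with `-` as the gap character, and that columns closer than five positions to either end of the alignment are discarded; the function name `conserved_window_columns` and the treatment of alignment ends are assumptions made for illustration, not details taken from the original pipeline.

```python
def conserved_window_columns(alignment, flank=5, gap="-"):
    """Return indices of columns whose +/- `flank` neighbourhood has no gaps.

    `alignment` is a list of aligned sequences (equal-length strings).
    Columns within `flank` positions of either alignment end are skipped
    (an assumption of this sketch).
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")

    # A column counts as gapped if any sequence has a gap in it.
    gapped = [any(row[i] == gap for row in alignment) for i in range(length)]

    keep = []
    for i in range(flank, length - flank):
        # The window covers the column itself plus `flank` columns on each side.
        if not any(gapped[i - flank : i + flank + 1]):
            keep.append(i)
    return keep
```

For two ungapped sequences of length 13, only the three central columns pass the filter; a single gap in the first column additionally removes the column whose window reaches it.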
For the purpose of phylogenetic analysis using the RGC_CAM method (Rogozin et al. 2007b), we analyzed amino acid residues that are conserved in most of the included eukaryotes, with the exception of a few species, and in the prokaryotic outgroups. The assumption is that any character shared by the included five diverse prokaryotic outgroup species and the majority of eukaryotes is the ancestral state, whereas the deviating species possess a derived state (fig. 2A). To reduce the level of homoplasy, only amino acid replacements that require 2 or 3 nt substitutions were used (Rogozin et al. 2007b). Given the rarity of multiple substitutions, such replacements are plausible RGCs (RGC_CAMs). To simplify further presentation, we use the following notation: S1 ≠ S2 = S3 means that, for a conserved amino acid position in an alignment, species S2 and S3 share the same amino acid that is different from the amino acid in the species S1. Under this notation, for example, a plasmodium-specific RGC_CAM is denoted by Pf ≠ At = Os = Sc = Sp = Hs = Dm = Ag = Ce = P1 = P2 = P3 = P4 = P5, whereas an RGC_CAM shared by the fungi and animals is denoted by Sc = Sp = Hs = Dm = Ag = Ce ≠ Pf = At = Os = P1 = P2 = P3 = P4 = P5.
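The multiple-substitution criterion and the S1 ≠ S2 = S3 notation can be made concrete with a short sketch. It assumes the standard genetic code and takes, for each pair of amino acids, the minimum number of nucleotide differences over all pairs of their codons; the original method consults the underlying nucleotide alignments, so this is only an approximation. The sketch also requires strict identity within the candidate clade and within the remaining species, which is stricter than conservation "in most" eukaryotes. The names `min_nt_substitutions` and `is_rgc_cam` are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (third base fastest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODONS_BY_AA = {}
for codon, aa in zip(("".join(c) for c in product(BASES, repeat=3)), AMINO):
    CODONS_BY_AA.setdefault(aa, []).append(codon)


def min_nt_substitutions(aa1, aa2):
    """Smallest number of nucleotide changes converting a codon of aa1 to one of aa2."""
    return min(
        sum(x != y for x, y in zip(c1, c2))
        for c1 in CODONS_BY_AA[aa1]
        for c2 in CODONS_BY_AA[aa2]
    )


def is_rgc_cam(column, candidate_clade):
    """Check one alignment column against the pattern clade != everything else.

    `column` maps species labels (e.g. "Hs", "P1") to one-letter amino acids.
    The column qualifies if all species in `candidate_clade` share one residue,
    all remaining species (eukaryotes and prokaryotic outgroups) share another,
    and the replacement needs at least 2 nucleotide substitutions.
    """
    inside = {column[s] for s in candidate_clade}
    outside = {aa for s, aa in column.items() if s not in candidate_clade}
    if len(inside) != 1 or len(outside) != 1:
        return False
    derived, ancestral = inside.pop(), outside.pop()
    if derived == ancestral:
        return False
    return min_nt_substitutions(ancestral, derived) >= 2
```

Dropping the final `>= 2` condition (that is, accepting any replacement) gives the relaxed character class in which the requirement for multiple substitutions is lifted.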
First, we estimated the branch length for each analyzed taxon in RGC_CAM units (fig. 3). For each species or group of species, we calculated the number of amino acid residues that are different from all other species (e.g., Sc = Sp ≠ At = Os = Dm = Ag = Hs = Ce = P1 = P2 = P3 = P4 = P5 for fungi).
The next step of the RGC_CAM analysis is statistical testing of phylogenetic hypotheses. We developed a test designed to resolve ambiguous phylogenetic relationships by analyzing all possible evolutionary scenarios for three lineages. In this test, the number of RGC_CAMs shared by two lineages (e.g., Sc = Sp = Hs = Dm = Ag = Ce ≠ Pf = At = Os = P1 = P2 = P3 = P4 = P5; fungi and animals—these shared RGC_CAMs are consistent with the accepted phylogeny) is used as a variable. The values of this variable for two compared alternative topologies, along with the respective branch lengths (excluding the branch that is common to both alternatives), are put in a 2 × 2 contingency table. The test is based on a null model under which, in a comparison of two alternative hypotheses, for example, H1 = ((X − Y),Z) versus H2 = ((X − Z),Y), the number of RGC_CAMs that are shared by two lineages due to chance (NXY and NXZ) is proportional to the length of the branch, the position of which differs between the compared hypotheses, that is, Y and Z, respectively, in the above example. Specifically, we examined all three pairwise comparisons for each analyzed trifurcation, that is, hypothesis H1 = ((X − Y),Z) versus hypothesis H2 = ((X − Z),Y); H1 = ((X − Y),Z) versus H3 = ((Y − Z),X); and H2 = ((X − Z),Y) versus H3 = ((Y − Z),X), using the right-tail Fisher's exact test. In this work, P12, P23, and P13 denote the P values associated with the comparison of the respective hypotheses. It should be emphasized that all numbers in the contingency tables are independent, that is, each RGC_CAM is counted only once (Rogozin et al. 2007b).
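As an illustration of the test just described, the following sketch computes a right-tail Fisher's exact P value for a 2 × 2 table using only the Python standard library. The table layout (shared RGC counts in the first row, the corresponding branch lengths in the second row) is one reading of the description above and should be treated as an assumption, as should the function names; the numbers in the usage note are invented and are not results from this study.

```python
from math import comb


def fisher_right_tail(a, b, c, d):
    """Right-tail Fisher's exact test for the table [[a, b], [c, d]].

    Returns P(top-left cell >= a) under the hypergeometric null model
    with all row and column totals fixed.
    """
    row1 = a + b
    col1 = a + c
    n = a + b + c + d
    upper = min(row1, col1)
    # math.comb returns 0 when k > n, so impossible tables contribute nothing.
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1))
    return tail / comb(n, col1)


def compare_hypotheses(n_xy, n_xz, len_y, len_z):
    """P value for an excess of X-Y shared characters relative to X-Z.

    Assumed layout: first row holds the shared-character counts for the two
    alternative groupings, second row the lengths of the branches (Y and Z)
    whose position differs between the two hypotheses.
    """
    return fisher_right_tail(n_xy, n_xz, len_y, len_z)
```

For example, `compare_hypotheses(20, 5, 100, 100)` asks whether 20 characters shared by X and Y versus 5 shared by X and Z is a larger imbalance than expected when the Y and Z branches are equally long. Repeating the call for each of the three pairs of hypotheses gives values analogous to P12, P13, and P23.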
The same approach was employed for analyses of a relaxed version of RGC_CAMs by allowing all possible amino acid replacements (as opposed to only those that require 2 or 3 nt substitutions in RGC_CAMs). We denote these characters RGC_CAs (fig. 2B). In addition, we analyzed deletions (RGC_DEL, fig. 2C) and insertions (RGC_INS, fig. 2D) surrounded by conserved fragments of protein alignments.
Four classes of RGCs were employed in this work (see Materials and Methods for details).
We first applied the RGC_CAM approach to a well-characterized case of ancient divergence of major eukaryotic lineages, namely, plants, animals, and fungi. Numerous molecular phylogenetic studies indicate that animals and fungi form a clade to the exclusion of plants (Baldauf 1999), so the existence of that clade (opisthokonts) is not seriously contested (Parfrey et al. 2006; Yoon et al. 2008).
In this case, the analyzed branches are of approximately equal lengths, that is, form a balanced tree (table 1 and fig. 3A), a situation in which the RGC analyses are most reliable (Rogozin et al. 2008). The raw number of shared RGC_CAMs was by far the greatest for the animal–fungi clade, and this excess was highly statistically significant for all combinations of plant species included in the analysis (table 1). The statistical test yielded significant P values both for the basal position of plants, that is, the animal–fungi clade (P13 and P23, table 1) and for the basal position of fungi that implies the plants–animals clade (P12, table 1). However, the support for the animals–fungi clade in most cases was much stronger (P13 and P23 < 0.0001, table 1) compared with the support for the plants–animals clade (e.g., P12 = 0.013 for the first test in table 1). The RGC_CAs yielded qualitatively similar results, with an even stronger statistical significance owing to the larger number of characters (table 1). The raw numbers of shared RGC_DEL and RGC_INS also were the largest for the animal–fungi clade (table 1). However, there were few unique insertions and deletions, and the relative level of homoplasy appeared to be much higher compared with RGC_CAMs and RGC_CAs, so that neither hypothesis received significant statistical support (table 1).
Thus, the results of this analysis of a well-established deep evolutionary relationship between major groups of eukaryotes confirm that RGC_CAMs and RGC_CAs are, in general, reliable indicators of phylogenetic affinity. Somewhat unexpectedly, we found that these characters were much more informative than indels which are more traditional markers used for deep phylogenetic analysis. Given this observation, we employed only RGC_CAMs and RGC_CAs for all analyses of uncertain phylogenetic relationships between eukaryotes.
Choanoflagellates are a group of unicellular eukaryotes that show a marked similarity to the choanocytes (feeding cells) of sponges, an observation suggesting the possibility that this group of protists, along with several apparently related groups, includes the closest living relatives of metazoans. This hypothesis is supported both by several phylogenetic analyses (Cavalier-Smith and Chao 2003; Rokas et al. 2005; Steenkamp et al. 2006) and by the analysis of the first sequenced genome of a choanoflagellate, M. brevicollis, which is remarkably complex and encodes orthologs of many signature animal proteins (King et al. 2008). We analyzed the trifurcation M. brevicollis–animals–fungi using RGC_CAMs and RGC_CAs (fig. 3B and supplementary table S2, Supplementary Material online). Clear support for the M. brevicollis–animals clade was obtained from both statistical tests and raw numbers of RGCs (supplementary table S2, Supplementary Material online). In this case, the relatively long M. brevicollis branch (unbalanced tree) did not cause problems for the RGC_CAM and RGC_CA analyses (fig. 3B and supplementary table S2, Supplementary Material online), possibly, owing to the relatively short stem branch (the branch that leads from the outgroup to the analyzed trifurcation; fig. 3B), which minimizes artifacts caused by reversals (Irimia et al. 2007; Rogozin et al. 2008).
We applied the RGC_CAM approach to a well-known case of problematic phylogeny, namely, the evolutionary positions of the slime mold D. discoideum (member of the phylum Mycetozoa or social amoebae) and E. histolytica, member of the phylum Archamoebae. Several phylogenetic analyses suggested that these distantly related amoebae formed a clade that is a sister group to the opisthokont clade, although the split between the two lineages of Amoebozoa is deep and is thought to have occurred shortly after the divergence of Amoebozoa from the opisthokonts (Bapteste et al. 2002; Song et al. 2005). We analyzed the trifurcation D. discoideum–plants–opisthokont (fig. 3C) using 278 KOG alignments (supplementary table S3, Supplementary Material online). Clear support for the D. discoideum–opisthokont clade was obtained from both the raw numbers of RGCs and statistical tests (supplementary table S3, Supplementary Material online). The analysis of the E. histolytica–plants–opisthokont trifurcation did not reveal such a clear picture, probably due to the substantial decrease in the number of analyzed genes compared with the case of D. discoideum (only 191 KOGs) and the extremely long E. histolytica branch (fig. 3D and supplementary table S3, Supplementary Material online), which could result in an excess of parallel changes and reversals (Rogozin et al. 2008). Nevertheless, despite some ambiguity in the results, both the raw numbers and the statistical tests tend to support the E. histolytica–opisthokont clade (supplementary table S4, Supplementary Material online). Thus, the results of RGC analyses with both available amoeba genomes were consistent with the monophyly of Amoebozoa and opisthokonts (together comprising the unikont supergroup), in agreement with some phylogenetic tree analyses (Baldauf et al. 2000; Stechmann and Cavalier-Smith 2003b) but not others (Parfrey et al. 2006; Yoon et al. 2008).
Given the distant and uncertain relationship between the two amoebas themselves, we examined the trifurcation opisthokonts–D. discoideum–E. histolytica (fig. 3E). The raw number of shared RGC_CAMs was the largest for the D. discoideum–E. histolytica clade (table 2). The interpretation of this result requires caution because the E. histolytica branch was extremely long, so that the resulting unbalanced tree might contain an excessive number of parallel changes and reversals (Rogozin et al. 2008). However, reversals cannot have a substantial effect because of the extremely short stem branch (Irimia et al. 2007; Rogozin et al. 2007a) (table 2), whereas the effect of parallel changes is taken into account by the employed statistical test. The results of this test indicate that the most likely tree topology is ((D. discoideum + Metazoa/Fungi) E. histolytica), that is, an Opisthokonta–Mycetozoa clade, to the exclusion of E. histolytica (Archamoebae) (table 2). We employed three different settings for this analysis whereby either animals together with fungi, or four animals, or two fungi were chosen to represent the opisthokont clade, and the results were similar for all three experiments (table 2). Thus, the RGC_CAM and RGC_CA analyses suggest that D. discoideum forms a clade with the opisthokonts, to the exclusion of E. histolytica, that is, the two amoebas, according to these results, represent distinct clades within the unikont supergroup. This conclusion contradicts the results of some of the previous phylogenetic studies (Bapteste et al. 2002; Song et al. 2005) but is compatible with the topology of the trees obtained by the analysis of domain compositions of multidomain proteins (Basu et al. 2008). It seems possible that the apparent monophyly of Mycetozoa and Archamoebae that was observed in phylogenetic analyses is a long-branch attraction artifact.
The Chromalveolata is an assemblage of numerous, diverse groups of protists that was proposed as a monophyletic supergroup by Cavalier-Smith as a refinement of the previously described kingdom Chromista (Cavalier-Smith 2002). The monophyly of Chromalveolata is not considered to be unequivocally established but is supported by several phylogenetic analyses (Baldauf et al. 2000; Harper et al. 2005). Most of the chromalveolates possess a chloroplast-related organelle (such as the apicoplast of the Apicomplexa) that is surrounded by a complex, multilayer membrane. Accordingly, it has been proposed that Chromalveolata is an ancient bikont branch that evolved via a secondary endosymbiosis with a red alga (Archibald and Keeling 2002; Cavalier-Smith 2003; Lane and Archibald 2008).
Taking advantage of the large number of sequenced genomes from diverse chromalveolates, we performed a detailed analysis of the relationship between Chromalveolata, Plantae, and opisthokonts (fig. 3F). The raw number of shared RGC_CAMs in the majority of comparisons (68 cases) was the greatest for the Chromalveolata–animals/fungi clade (supplementary table S5, Supplementary Material online), and overall, this clade received the strongest statistical support (table 3). However, in 20 comparisons, the raw number of shared RGC_CAMs was the greatest for the Chromalveolata–Plantae clade (supplementary table S5, Supplementary Material online), and there was some, albeit weaker, statistical support for this clade (table 3). The third topology, with the basal position of the Chromalveolates and a Plantae–opisthokont clade was poorly supported (table 3 and supplementary table S5, Supplementary Material online) and could be effectively ruled out.
The raw numbers of shared RGCs can be a useful addition to the statistical test (see the analysis of the plants–animals–fungi trifurcation above). However, the utility of raw numbers is hampered by large differences in branch lengths (Rogozin et al. 2008). To minimize this effect, we compared the numbers of RGC_CA(M)s that supported the Chromalveolata–opisthokonts clade or the Chromalveolata–Plantae clade for cases where the branches leading to opisthokonts and plants were of approximately equal lengths (table 4). In the substantial majority of tests, the number of RGC_CA(M)s supporting the Chromalveolata–opisthokonts clade was greater than that supporting the affiliation of chromalveolates with plants (table 4).
In this analysis, many comparisons failed to produce a statistically significant outcome (supplementary table S5, Supplementary Material online). Moreover, some Chromalveolate species, such as C. hominis and P. infestans, but not others, possess multiple RGCs supporting the monophyly of Chromalveolates and plants (supplementary table S5, Supplementary Material online). These observations might indicate that Chromalveolates have a genuine mixed heritage, with the majority of the genes sharing common ancestry with orthologs from opisthokonts but some genes being of plant origin. To test this hypothesis, we examined the affinities of multiple RGC_CAs (RGC_CAMs were not conducive to this type of analysis because there were too few genes with multiple RGC_CAMs) within the same gene, under the reasoning that, if the apparent mixed phylogenetic signal is indeed due to distinct origins of different genes of Chromalveolates and not to noise, all RGC_CAs from the same gene should point in the same direction. Altogether, 21 KOGs contained two or more RGC_CAs, and in each case, multiple RGC_CAs within the same gene supported either the Chromalveolata–Opisthokonta clade or the Chromalveolata–Plantae clade, with the sole exception of KOG100 (supplementary table S6, Supplementary Material online). A striking example is KOG2446 (Glucose-6-phosphate isomerase) that carries up to 12 RGC_CAs (depending on the combination of species), all of which support the Chromalveolata–Plantae clade. Although apparently affected by homoplasy, these results indicate that the gene complement of Chromalveolata indeed could be heterogeneous, with the majority of the genes sharing a common ancestry with orthologs from opisthokonts but some genes derived from Plantae. The presence of multiple genes of apparent red algal origin in genomes of chromalveolates has been reported (Li et al. 2006).
Taken together, these findings are compatible with the scenario under which the common ancestor of the Chromalveolata emerged as a result of engulfment of a red alga by a unikont host.
The excavates comprise a vast assemblage of diverse organisms some of which, in particular, diplomonads and parabasalids, lack typical mitochondria and accordingly were long considered “primitive” forms and promising candidates for the archezoan status (Roger 1999; Simpson 2003). Although the discovery of mitochondria-related organelles and genes of apparent mitochondrial origin invalidates the hypothesis that some of the excavates are primary amitochondrial forms, the possibility that they are “basal” eukaryotes remains attractive given that some of these organisms are among the eukaryotic forms with the simplest cellular and genomic organization. Among the 5 eukaryotic supergroups, the monophyly of excavates is, probably, most dubious, and the phylogenetic position of many excavate taxa remains uncertain (Simpson, Inagaki, and Roger 2006; Rodriguez-Ezpeleta et al. 2007). However, a recent phylogenomic analysis of 148 genes from a broad variety of eukaryotic taxa seems to provide substantial support for an excavate clade (Hampl et al. 2009).
We applied the RGC approaches to assess the phylogenetic positions of three highly diverse excavates. Giardia lamblia, a flagellated, amitochondrial protozoan parasite, is the only representative of diplomonads for which the complete genome sequence is currently available. The genome of this organism lacks many genes that are present in all other eukaryotes (Morrison et al. 2007). Accordingly, Giardia was traditionally considered one of the best candidates for a basal position in the eukaryotic tree. However, the bikont–unikont phylogeny rejects this view in favor of the affiliation of diplomonads and associated excavate taxa with the bikont branch of eukaryotes (Stechmann and Cavalier-Smith 2002; Stechmann and Cavalier-Smith 2003a; Rodriguez-Ezpeleta et al. 2007).
We analyzed the trifurcation G. lamblia–plants–opisthokonts (fig. 3G). The raw number of shared RGC_CAMs was the greatest for the plants–animals–fungi clade, as expected given the extremely long Giardia branch (fig. 3G and table 5). In this case, reversals are expected to have a substantial effect because of the long stem branch (Irimia et al. 2007; Rogozin et al. 2007a) (table 5). Thus, the trifurcation G. lamblia–plants–opisthokonts could not be unambiguously resolved using RGCs. Nevertheless, assuming that the basal position of G. lamblia is a long-branch artifact, the results of the present analysis are best compatible with the Giardia–opisthokont clade (tables 4 and 5).
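The reasoning applied to this trifurcation can be illustrated with a minimal Python sketch. It assumes that each shared character supports exactly one of the three possible two-lineage clades, and it uses an exact two-sided binomial test against equal support as a stand-in for the statistical tests actually used in this study, which are not specified in this section. The function names and the counts are hypothetical and are not taken from tables 4 or 5.

```python
from math import comb


def binomial_two_sided_p(k: int, n: int) -> float:
    """Exact two-sided binomial test of k successes in n trials against p = 0.5."""
    if n == 0:
        return 1.0
    probs = [comb(n, i) / 2**n for i in range(n + 1)]
    observed = probs[k]
    # Sum the probabilities of all outcomes no more likely than the observed one.
    return min(1.0, sum(p for p in probs if p <= observed * (1 + 1e-9)))


def resolve_trifurcation(counts, excluded=None):
    """Pick the best-supported clade among the resolutions of a trifurcation.

    counts   -- dict mapping a clade (frozenset of two lineage names) to the
                raw number of shared characters supporting it
    excluded -- optional clade to set aside, e.g. the resolution that places
                a long-branch lineage at the base (an assumption of the caller)
    Returns the best remaining clade and the p-value for the comparison of
    the two best-supported remaining clades.
    """
    candidates = {c: n for c, n in counts.items() if c != excluded}
    ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
    (best, n_best), (_, n_second) = ranked[0], ranked[1]
    return best, binomial_two_sided_p(n_best, n_best + n_second)


# Hypothetical counts for illustration only.
example_counts = {
    frozenset({"Giardia", "plants"}): 5,
    frozenset({"Giardia", "opisthokonts"}): 20,
    frozenset({"plants", "opisthokonts"}): 40,
}

# At face value, the plants-opisthokonts clade (basal Giardia) has the most support.
print(resolve_trifurcation(example_counts))

# Setting the basal resolution aside as a presumed long-branch artifact.
basal = frozenset({"plants", "opisthokonts"})
print(resolve_trifurcation(example_counts, excluded=basal))
```

Excluding the basal resolution is a modeling decision rather than a result: the sketch only shows how the two remaining clades would then be compared.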
The kinetoplastids, a distinct group of mitochondriate protists that includes such major parasites as trypanosomes and Leishmania, comprise another branch in the putative excavate supergroup (Simpson, Stevens, and Lukes 2006; Stevens 2008). We took advantage of the availability of five complete genomes from this group to examine the phylogenetic affinities of kinetoplastids using RGCs (fig. 3H). In the majority of the comparisons (30 cases), the greatest raw number of shared RGC_CAMs was seen for the Plantae–opisthokont clade, that is, the basal position of kinetoplastids (supplementary table S7, Supplementary Material online), which also received strong statistical support. However, in 25 comparisons, the raw number of shared RGC_CAMs was the largest for the kinetoplastid–opisthokont clade (supplementary table S7, Supplementary Material online), and this excess was statistically supported as well (table 6). Similarly to the case of Giardia, the kinetoplastid branch was extremely long (fig. 3H and supplementary table S7, Supplementary Material online), because of which the basal position of this group is, most likely, an artifact. Under this assumption, the present results support the kinetoplastid–opisthokont clade (tables 4 and 6).
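A tally of this kind, in which each species comparison is assigned to the clade with the greatest raw number of shared characters, can be sketched in Python as follows. This is an illustrative sketch only: the function name, the clade labels, and the counts are hypothetical and do not reproduce supplementary table S7, and treating ties as a separate category is an assumption made here.

```python
from collections import Counter


def tally_winners(comparisons):
    """Count how often each clade has the strictly greatest raw count.

    comparisons -- iterable of dicts, one per species comparison, each mapping
                   a clade label to the raw number of shared characters
    Comparisons without a unique maximum are counted under the label "tie".
    """
    tally = Counter()
    for counts in comparisons:
        top = max(counts.values())
        winners = [clade for clade, n in counts.items() if n == top]
        tally[winners[0] if len(winners) == 1 else "tie"] += 1
    return tally


# Hypothetical per-comparison counts for illustration only.
example = [
    {"plants+opisthokonts": 10, "kinetoplastids+opisthokonts": 7, "kinetoplastids+plants": 2},
    {"plants+opisthokonts": 4, "kinetoplastids+opisthokonts": 9, "kinetoplastids+plants": 1},
    {"plants+opisthokonts": 5, "kinetoplastids+opisthokonts": 5, "kinetoplastids+plants": 0},
]
print(tally_winners(example))
```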
Trichomonas vaginalis is a flagellated, amitochondrial parasitic protist that represents the parabasalids, another excavate group with an uncertain phylogenetic position (Edgcomb et al. 2001; Carlton et al. 2007). We analyzed the trifurcation T. vaginalis–plants–opisthokonta (table 7). As with the other excavates, the results, at face value, seemed to support a basal position for T. vaginalis (fig. 3I and table 7). However, assuming that this position is a long-branch artifact, the T. vaginalis–opisthokonta clade was strongly supported by both raw numbers and statistical tests (tables 4 and 7).
We employed RGCs to analyze one of the most difficult problems in the evolution of eukaryotes, the relationship between the five supergroups. At present, the best description of the radiation of the supergroups seems to be a Big Bang, a pattern that might indeed reflect rapid divergence or condensed cladogenesis, in part, driven by major events such as endosymbiosis (Philippe et al. 2000; Keeling et al. 2005; Rokas and Carroll 2006; Keeling 2007; Koonin 2007). Thus, attempts to decipher the relationships between the supergroups are important not only (and, perhaps, not so much) for establishing the true tree topology for its own sake but also for reconstructing the most likely scenario of the actual events that occurred during the early, formative stages of eukaryotic evolution.
Given the presumed rapidity of the pivotal evolutionary events at this early stage in the evolution of eukaryotes, combined with the dramatic differences in the evolutionary rates among the supergroups, definitive elucidation of the true tree topology is extremely challenging (Philippe et al. 2000). Not surprisingly, so far, despite substantial effort, traditional methods of phylogenetic analysis failed to yield a solution.
In this difficult situation, shared derived characters might offer the best chance to shed light on the early radiation of the supergroups. Attempts to implement this approach include the influential analyses of gene fusions, such as the DHFR–ThyK fusion and domain architectures, such as those of myosins (Stechmann and Cavalier-Smith 2002, 2003a; Richards and Cavalier-Smith 2005). The caveat of this type of analysis is that, with a small number of characters, ruling out homoplasy is difficult, if feasible at all. The RGCs could have an advantage because multiple, if not necessarily numerous (for deep evolutionary relationships), characters are available for analysis. In this work, we attempted both the rather traditional analysis of indels and the more recently developed classes of characters, RGC_CAMs and RGC_CAs. Somewhat unexpectedly, considering the long history of the use of indels for cladistic-type analysis (Rivera and Lake 1992; Gupta 1998; Gupta and Griffiths 2002), indels turned out to be, largely, uninformative for the elucidation of the relationships between the supergroups, whereas the RGC_CAMs and RGC_CAs seemed to carry considerable information (of course, this is not to imply that indels are not helpful in elucidating more recent evolutionary events).
Even with the use of RGCs, resolving the relationship between the supergroups remains an enormously difficult task, so perhaps, the most tangible outcome of this analysis is the rejection of certain evolutionary hypotheses. Thus, the analysis of RGC_CA(M)s allowed us to effectively rule out the basal position of Chromalveolata vis-a-vis Plantae and opisthokonts and produced evidence in favor of a Chromalveolata–opisthokonts clade as opposed to the Plantae–Chromalveolata clade that is predicted by the bikont–unikont topology of the eukaryotic tree. Notably, however, there was also a nonnegligible signal for the plant–Chromalveolata affinity that is most parsimoniously explained by the contribution of the secondary, red algal endosymbiont to the gene complement of the Chromalveolata.
For the three analyzed excavate taxa (diplomonads, kinetoplastids, and parabasalids), the basal position could not be rejected. However, in agreement with the previous conclusions based on the analysis of slowly evolving positions in conserved proteins (Philippe et al. 2000), it seems most likely that this tree topology is an artifact caused by the extremely long branches characteristic of these groups that imply large numbers of parallel changes and reversals since the divergence from other supergroups. Under the assumption that the basal position of these groups is a long-branch artifact, they all show affiliation with opisthokonts and not with Plantae.
A recent, extensive phylogenomic study suggested the existence of a “megagroup” of eukaryotes that consists of Plantae (there denoted Archaeplastida), Chromalveolata, and Rhizaria (Hampl et al. 2009). However, in addition to the usual complications that emerge in the maximum likelihood analysis of concatenated protein sequence alignments and the problems caused by the potential signal from horizontally transferred genes in Chromalveolata, the tree of Hampl et al. (2009) is unrooted, so the conclusion on the existence of the megagroup is conditioned on the root position between unikonts and bikonts (Stechmann and Cavalier-Smith 2003a). Unlike the standard phylogenetic methods, RGC approaches including RGC_CAM, their own limitations notwithstanding, are specifically geared toward the inference of the root position.
Thus, the results of the present analysis of RGCs seem to be best compatible with an unexpected phylogeny in which the first split is between Plantae, that is, primary chloroplast-containing forms and the rest of the eukaryotes (fig. 4). Although surprising in view of some of the previous inferences, this putative topology of the eukaryotic tree appears biologically plausible in that the acquisition of the cyanobacterial endosymbiont would trigger the divergence of the ancestors of Plantae from the common ancestor with the rest of the eukaryotes. Subsequently, the emergence of the Chromalveolata could have been similarly precipitated by the secondary endosymbiosis, the engulfment of a red alga.
The present results are far from being the final word on the relationship between the eukaryotic supergroups, but they are at odds with some popular hypotheses, in particular, the bikont–unikont split as the primary radiation in the history of eukaryotes. Extreme caution is necessary in drawing positive conclusions from deep phylogenetic reconstructions like this one. Nevertheless, the present findings are best compatible with the monophyly of unikonts and Chromalveolata, with excavates, possibly, joining the same major assemblage of eukaryotic taxa. Under this biologically plausible scenario, the first major split in eukaryotic evolution is between photosynthetic and nonphotosynthetic forms and would have been triggered by the endosymbiosis between an ancient heterotrophic, unicellular eukaryote and a cyanobacterium that gave rise to the chloroplast. Methodologically, the present analysis reveals the apparent advantage of RGCs based on (preferably, multiple) substitutions in otherwise highly conserved positions over indels as phylogenetic markers. Apparently, shared indels are too rare and too prone to homoplasy to be informative for resolving deep multifurcations. In addition, the results emphasize the importance of taxon sampling for RGC analysis: the availability of a diverse collection of complete genomes representing Chromalveolata yielded much more conclusive results for this supergroup than for excavates, where such sampling is currently impossible. Thus, further progress in the genomics of poorly characterized eukaryotic groups is expected to provide additional material for more conclusive reconstruction of the key events of the deep evolutionary past.
Intramural Research Program of the National Library of Medicine at National Institutes of Health/DHHS; research grant from the Natural Sciences and Engineering Research Council of Canada.
We thank Laura Katz, Masatoshi Nei, Fyodor Kondrashov, Liran Carmel, and Yuri Wolf for useful discussions. The authors declare that they have no competing interests. I.B.R. and E.V.K. incepted the study; I.B.R., M.K.B., and M.C. implemented the tests and performed data analysis; I.B.R. and E.V.K. wrote the manuscript which was read, edited, and approved by all authors.