Our multigene dataset was assembled according to a custom pipeline, as follows: (i) construction of databases made of all existing sequences for species specifically selected for their broad taxonomic distribution and availability of genomic sequences (downloaded from
http://www.ncbi.nlm.nih.gov/ and
http://amoebidia.bcm.umontreal.ca/pepdb/searches/welcome.php), (ii) BLAST searches against these databases using as queries the single-gene sequences composing our previously described multiple alignments (
Burki et al. 2007), (iii) retrieval (with a stringent
e-value cut-off at 10^−50) and addition of the new homologous copies to the existing single-gene alignments, (iv) automatic alignments using MAFFT (
Katoh et al. 2002), followed by manual inspection to extract unambiguously aligned positions, (v) testing orthology, in particular screening for possible lateral or endosymbiotic gene transfer, for each of the selected genes by performing single-gene maximum-likelihood (ML) reconstructions using
Treefinder (Whelan and Goldman (WAG) model, four gamma categories;
Jobb et al. 2004), and (vi) final concatenation of all single-gene alignments using SCaFoS (
Roure et al. 2007). Owing to the limited data for certain groups, and to maximize the number of genes per taxonomic assemblage, some lineages were represented by different closely related species, always belonging to the same genus (electronic supplementary material). Potentially interesting species with full genomes available, such as the excavates
Giardia and
Trichomonas or the red alga
Cyanidioschyzon, were excluded from our taxon sampling owing to their extreme rate of sequence evolution or their demonstrated tendency to cause systematic errors in phylogenies (
Rodríguez-Ezpeleta et al. 2007b).
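The concatenation in step (vi) can be illustrated with a minimal sketch. It is not the SCaFoS algorithm, which also handles sequence selection among closely related species; it only shows the core idea of joining single-gene alignments into one supermatrix and padding taxa that lack a gene with a missing-data symbol. The function name `concatenate`, the `?` padding character and the toy taxa and sequences are assumptions made for illustration.

```python
from typing import Dict, List

Alignment = Dict[str, str]  # taxon name -> aligned sequence


def concatenate(alignments: List[Alignment], missing: str = "?") -> Alignment:
    """Join single-gene alignments into one supermatrix.

    A taxon absent from a gene is padded with `missing` over the
    full length of that gene, so all rows keep the same length.
    """
    # Union of all taxa seen in any gene, in a stable order.
    taxa = sorted({taxon for aln in alignments for taxon in aln})
    parts: Dict[str, List[str]] = {taxon: [] for taxon in taxa}

    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within one alignment must have equal length")
        (length,) = lengths
        for taxon in taxa:
            parts[taxon].append(aln.get(taxon, missing * length))

    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}


# Toy data (hypothetical): two genes with partly overlapping taxon sets.
gene_a = {"Homo": "MKT", "Arabidopsis": "MRT"}
gene_b = {"Homo": "LLQV", "Emiliania": "LIQV"}
supermatrix = concatenate([gene_a, gene_b])
```

Padding with a missing-data symbol, rather than dropping incomplete taxa, is what allows a lineage with limited data to remain in the concatenated matrix.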
The concatenated alignment was analysed using both Bayesian inference (BI) and ML frameworks, with
Phylobayes v. 2.3 (
Lartillot & Philippe 2004) and RAxML-VI-HPC v. 2.2.3 (
Stamatakis 2006), respectively.
Phylobayes was run using the site-heterogeneous mixture CAT model and two independent Markov chains with a total length of 10 000 cycles, discarding the first 4000 points as burn-in and calculating the posterior consensus on the remaining 6000 trees. Convergence between the two chains was checked; both chains always led to the exact same tree, except for uncertainty in the order of divergence among the glaucophytes, the red algae and haptophytes+cryptomonads (HC). To reduce mixing problems of the chains, the constant sites were removed from the alignment in a subsequent analysis. In this case, convergence was much quicker, after only 5000 cycles (burn-in of 1000), and HC was unambiguously positioned as sister to the Plantae. RAxML was used in combination with the WAG amino acid replacement matrix and stationary amino acid frequencies estimated from the dataset. The best ML tree was determined with the PROTMIX implementation, in multiple inferences using 20 randomized maximum parsimony (MP) starting trees. Statistical support was evaluated with 100 bootstrap replicates. Two independent runs were performed on each replicate, each using a different starting tree (MP and the best ML tree), in order to prevent the analysis from getting trapped in a local maximum. The tree with the best log likelihood was selected for each replicate, and the 100 resulting trees were used to calculate the bootstrap proportions. To reduce the computational burden, the PROTMIX solution was chosen with 25 distinct rate categories. To minimize potential systematic errors associated with saturation and homoplasy, the fast-evolving sites were identified using PAML (
Yang 1997), given the 20 topologies obtained in the ML analysis. Sites were classified according to their mean site-wise rates and ML bootstrap values were computed from shorter concatenated alignments with sites corresponding to categories 7 and 6+7 removed.
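The two site-removal steps described above (dropping constant sites before the second Phylobayes analysis, and dropping the fastest rate categories before recomputing ML bootstrap values) both amount to filtering alignment columns. The following is a minimal sketch of that idea, not the scripts actually used. It assumes that gaps and unknown residues are ignored when deciding whether a site is constant, and that each site already carries a rate-category label derived from the PAML site-wise rates; the category values and sequences below are invented for illustration.

```python
from typing import Dict, Iterable, List

Alignment = Dict[str, str]  # taxon name -> aligned sequence


def keep_columns(aln: Alignment, keep: List[int]) -> Alignment:
    """Return a new alignment containing only the listed column indices."""
    return {taxon: "".join(seq[i] for i in keep) for taxon, seq in aln.items()}


def variable_columns(aln: Alignment, ignore: str = "-?X") -> List[int]:
    """Indices of columns with at least two distinct residues.

    Assumption: gap and unknown characters do not count as states.
    """
    n_sites = len(next(iter(aln.values())))
    keep = []
    for i in range(n_sites):
        states = {seq[i] for seq in aln.values()} - set(ignore)
        if len(states) > 1:
            keep.append(i)
    return keep


def columns_outside_categories(categories: List[int], removed: Iterable[int]) -> List[int]:
    """Indices of sites whose rate category is not in `removed`."""
    removed_set = set(removed)
    return [i for i, cat in enumerate(categories) if cat not in removed_set]


# Toy alignment and hypothetical per-site rate categories.
aln = {"t1": "MKTLA", "t2": "MRTLA", "t3": "MKSL-"}
site_categories = [1, 7, 3, 6, 7]

no_constant = keep_columns(aln, variable_columns(aln))
without_7 = keep_columns(aln, columns_outside_categories(site_categories, [7]))
without_6_7 = keep_columns(aln, columns_outside_categories(site_categories, [6, 7]))
```

Keeping the column selection separate from the column extraction means the same index list can be applied to every shorter alignment derived from the full concatenation.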