We assembled a dataset that maximized the number of taxa and proteins from available organisms with complete genome sequences of prokaryotes and selected eukaryotes. In doing so, we omitted a few taxa (e.g., Agrobacterium tumefaciens Cereon str C58 and Halobacterium sp. NRC-1) whose addition to the data set would have resulted in a substantial reduction in the total number of proteins. Data assembly began with the Clusters of Orthologous Groups of Proteins (COG) [88], which consisted of 84 proteins common to 43 species. To that initial dataset we added other species from among completed microbial genomes (NCBI; National Center for Biotechnology Information), assisted by BLAST and PSI-BLAST [89]. In total, 72 species were included in the study (54 eubacteria, 15 archaebacteria and three eukaryotes).
The species of Archaebacteria and their accession numbers are: Aeropyrum pernix K1 (NC_000854), Archaeoglobus fulgidus (NC_000917), Methanothermobacter thermoautotrophicus str. Delta H (NC_000916), Methanococcus jannaschii (NC_000909), Methanopyrus kandleri AV19 (NC_003551), Methanosarcina acetivorans str. C2A (NC_003552), Methanosarcina mazei Goe1 (NC_003901), Pyrobaculum aerophilum (NC_003364), Pyrococcus abyssi (NC_000868), Pyrococcus furiosus DSM 3638 (NC_003413), Pyrococcus horikoshii (NC_000961), Sulfolobus solfataricus (NC_002754), Sulfolobus tokodaii (NC_003106), Thermoplasma acidophilum (NC_002578), Thermoplasma volcanium (NC_002689).
The species of Eubacteria are: Aquifex aeolicus (NC_000918), Bacillus halodurans (NC_002570), Bacillus subtilis (NC_000964), Borrelia burgdorferi (NC_001318), Brucella melitensis (NC_003317, NC_003318), Buchnera aphidicola str. APS (Acyrthosiphon pisum) (NC_002528), Campylobacter jejuni (NC_002163), Caulobacter crescentus CB15 (NC_002696), Chlamydia muridarum (NC_002620), Chlamydia trachomatis (NC_000117), Chlamydophila pneumoniae CWL029 (NC_000922), Chlorobium tepidum str. TLS (NC_002932), Clostridium acetobutylicum (NC_003030), Clostridium perfringens (NC_003366), Corynebacterium glutamicum ATCC 13032 (NC_003450), Deinococcus radiodurans (NC_001263, NC_001264), Escherichia coli O157:H7 EDL933 (NC_002655), Fusobacterium nucleatum subsp. nucleatum ATCC 25586 (NC_003454), Haemophilus influenzae Rd (NC_000907), Helicobacter pylori 26695 (NC_000915), Lactococcus lactis subsp. lactis (NC_002662), Listeria innocua (NC_003212), Listeria monocytogenes EGD-e (NC_003210), Mesorhizobium loti (NC_002678), Mycobacterium leprae (NC_002677), Mycobacterium tuberculosis H37Rv (NC_000962), Mycoplasma genitalium G-37 (NC_000908), Mycoplasma pneumoniae (NC_000912), Mycoplasma pulmonis (NC_002771), Neisseria meningitidis MC58 (NC_003112), Nostoc sp. PCC7120 (NC_003272), Pasteurella multocida (NC_002663), Pseudomonas aeruginosa PAO1 (NC_002516), Ralstonia solanacearum (NC_003295), Rickettsia conorii (NC_003103), Rickettsia prowazekii (NC_000963), Salmonella enterica subsp. enterica serovar Typhi (NC_003198), Salmonella typhimurium LT2 (NC_003197), Sinorhizobium meliloti (NC_003047), Staphylococcus aureus Mu50 (NC_002758), Streptococcus pneumoniae TIGR4 (NC_003028), Streptococcus pyogenes M1 GAS (NC_002737), Streptomyces coelicolor A3(2) (NC_003888), Synechocystis PCC6803 (NC_000911), Thermoanaerobacter tengcongensis (NC_003869), Thermosynechococcus elongatus BP-1 (NC_004113), Thermotoga maritima (NC_000853), Treponema pallidum subsp. pallidum str. Nichols (NC_000919), Ureaplasma parvum serovar 3 str. ATCC 700970 (NC_002162), Vibrio cholerae O1 biovar eltor str. N16961 (NC_002505, NC_002506), Xanthomonas campestris pv. campestris str. ATCC 33913 (NC_003902), Xanthomonas axonopodis pv. citri str. 306 (NC_003919), Xylella fastidiosa 9a5c (NC_002488), Yersinia pestis (NC_003143).
The eukaryotes were Arabidopsis thaliana, Drosophila melanogaster, and Homo sapiens. Accession numbers for eukaryote proteins are presented elsewhere [90].
This dataset consisted of 60 proteins that were individually analysed as a step in orthology determination. The proteins were aligned with CLUSTALW [91]. Then phylogenetic trees of each protein were built and visually inspected. Initial trees were constructed using Minimum Evolution (ME), with MEGA version 2.1 [92]. The major criterion that we used in determining which genes to include or exclude was the monophyly of domains. We rejected genes with domains (archaebacteria and eubacteria) that were non-monophyletic, as these would be the best examples of HGT; this amounted to 61% of the genes rejected. Some other genes were omitted if there were detectable cases of HGT within a domain, such as the deep nesting of a species from one Phylum within a clade of another Phylum. Otherwise we did not eliminate genes that had a different branching order of phyla within a domain or different relationships of groups of lower taxonomic categories. Admittedly, ancient cases of HGT might be an explanation for some of those topological differences, but they are not detectable. However, we further tested the effectiveness of our criteria by examining the stability of individual protein trees, using different gamma values (α = 1, 0.5 and 0.3). We kept only the genes that were stable to such perturbations (in terms of remaining in that category of non-HGT genes). The position of eukaryotes, which varies depending on the gene, was not considered in assessing monophyly of eubacteria and archaebacteria.
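The domain-monophyly criterion was applied here by visual inspection of the trees. Purely as an illustration, the same test can be sketched in Python. The sketch assumes that each tree is available as nested tuples of taxon names (a hypothetical representation, not the MEGA output format), and all function and taxon names are invented for the example. Eukaryotes are pruned first so that their position is ignored, and the two prokaryote domains are then accepted as monophyletic if a single branch separates them.

```python
def leaf_set(tree):
    """Return the set of taxon names below a node (a leaf is a plain string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def prune(tree, drop):
    """Return the tree without the taxa in `drop` (None if nothing remains)."""
    if isinstance(tree, str):
        return None if tree in drop else tree
    kept = [p for p in (prune(child, drop) for child in tree) if p is not None]
    if not kept:
        return None
    # Collapse nodes left with a single child after pruning.
    return kept[0] if len(kept) == 1 else tuple(kept)


def clades(tree):
    """Yield the leaf set of every node in the tree."""
    yield leaf_set(tree)
    if not isinstance(tree, str):
        for child in tree:
            yield from clades(child)


def domains_monophyletic(tree, archaea, bacteria):
    """True if one branch separates archaebacteria from eubacteria.

    Taxa belonging to neither set (e.g. eukaryotes) are ignored.
    """
    others = leaf_set(tree) - set(archaea) - set(bacteria)
    pruned = prune(tree, others)
    present = leaf_set(pruned)
    arch = present & set(archaea)
    bact = present & set(bacteria)
    return any(c == arch or c == bact for c in clades(pruned))


# Hypothetical toy trees: A* = archaebacteria, B* = eubacteria, E1 = eukaryote.
archaea, bacteria = {"A1", "A2"}, {"B1", "B2", "B3"}
accepted = ((("A1", "E1"), "A2"), ("B1", ("B2", "B3")))
rejected = (("A1", "B1"), ("A2", ("B2", "B3")))
print(domains_monophyletic(accepted, archaea, bacteria))  # True
print(domains_monophyletic(rejected, archaea, bacteria))  # False
```

A check of this kind covers only the between-domain criterion; nesting of a species within another phylum, and stability under different α values, would need separate tests.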
The 32 remaining proteins were concatenated for analysis. The α parameters used during the tree building process were estimated with the program PAML (JTT+gamma model) [93]. From the concatenation, trees were constructed with ME, Maximum Likelihood (ML) [94] and Bayesian [95] methods. The phylogenies obtained with the ME, ML and Bayesian methods were similar, differing only at non-significant nodes as assessed by the bootstrap method [96], with only one significant exception: the position of M. kandleri in the Bayesian phylogeny. The sequence alignments and other supplementary data are presented elsewhere [90].
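Concatenation of per-protein alignments can be illustrated with a short Python sketch. It assumes that each alignment is held as a dictionary from taxon name to aligned sequence; the function name and the toy sequences are hypothetical. Padding absent taxa with gap characters is an assumption of the sketch, since the handling of missing data is not described here.

```python
def concatenate(alignments, gap="-"):
    """Join per-protein alignments into one concatenated alignment.

    alignments: list of dicts mapping taxon name -> aligned sequence;
    all sequences within one dict must have the same length.
    Taxa absent from a protein are padded with `gap` (an assumption).
    """
    taxa = sorted(set().union(*(aln.keys() for aln in alignments)))
    parts = {taxon: [] for taxon in taxa}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences in an alignment must have equal length")
        width = lengths.pop()
        for taxon in taxa:
            parts[taxon].append(aln.get(taxon, gap * width))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}


# Hypothetical toy alignments for two proteins.
protein_1 = {"taxonA": "MK", "taxonB": "MR"}
protein_2 = {"taxonA": "LLV", "taxonC": "LIV"}
supermatrix = concatenate([protein_1, protein_2])
print(supermatrix["taxonA"])  # MKLLV
print(supermatrix["taxonB"])  # MR---
print(supermatrix["taxonC"])  # --LIV
```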
Time estimation was conducted separately within each domain (Archaebacteria and Eubacteria) using reciprocal rooting and several calibration points. All time estimates were calculated with a Bayesian local clock approach [97] utilizing concatenated data sets of multiple proteins and a JTT+gamma model of substitution [19]. The following settings were used: numsamp (10,000), burnin (100,000), and sampfreq (100). This method permitted rates to vary on different branches, which was necessary given the known rate variation among prokaryote and eukaryote nuclear protein sequences [30]. Calibration of rate in this method was implemented by assigning constraints to nodes in the phylogeny. Five different initial settings (prior distributions) were used in each domain [see Additional file 4]. These were chosen at intervals of 0.5 Ga starting from 4.5 Ga, which is approximately the age of the Earth and Solar System, to 2.5 Ga, which is slightly before the major rise in oxygen (Great Oxidation Event; GOE) as recorded in the geologic record [32] and related to the presence of oxygenic cyanobacteria. Those constraints pertained to the ingroup root, or deepest divergence in the tree excluding the outgroup. Because of the relatively small number of duplicate genes available for rooting the tree of life, we were unable to estimate the time of the last common ancestor (the divergence of eubacteria and archaebacteria).
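The arithmetic implied by these settings can be laid out in a few lines of Python. The sketch assumes that one sample is retained every sampfreq cycles after the burn-in, which is our reading of the settings rather than something stated above. Variable names other than numsamp, burnin and sampfreq are hypothetical.

```python
numsamp, burnin, sampfreq = 10_000, 100_000, 100

# Assumption: one sample is kept every `sampfreq` cycles after burn-in.
total_cycles = burnin + numsamp * sampfreq

# Ingroup-root priors: 4.5 Ga down to 2.5 Ga in steps of 0.5 Ga.
root_priors_ga = [4.5 - 0.5 * i for i in range(5)]

# Range of the priors relative to the oldest setting.
relative_spread = (root_priors_ga[0] - root_priors_ga[-1]) / root_priors_ga[0]

print(total_cycles)              # 1100000
print(root_priors_ga)            # [4.5, 4.0, 3.5, 3.0, 2.5]
print(f"{relative_spread:.0%}")  # 44%
```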
For the archaebacterial data set, we included eukaryotes for calibration purposes because reliable calibration points were unavailable among those prokaryotes. In doing so, only proteins in which eukaryotes clustered with archaebacteria were included [30]. An outgroup was used that consisted of representatives of the major groups of eubacteria [90]. We used the fossil and molecular times (separately) of the plant-animal divergence as calibration points, for comparison. The fossil calibration was the first appearance of a representative of the plant lineage (red algae) at 1.198 ± 0.022 Ga [100]. The molecular time estimate for this divergence was 1.609 ± 0.060 Ga from a study of 143 rate-constant proteins [98]. We used the minimum and maximum bounds for these calibration times as constraints in the Bayesian analysis. Although the results of these two different calibrations are provided for comparison, our preferred calibration is the 1.2 Ga fossil calibration because it has the best justification (supporting evidence). Therefore, our summary time estimates for archaebacteria, presented in the timetree (Fig. ), use only this fossil calibration.
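As a worked illustration of how such calibration times translate into node constraints, the following Python sketch derives minimum and maximum bounds from each estimate. It assumes that the bounds are simply the central value minus and plus the quoted uncertainty; that interpretation, the function name and the dictionary keys are assumptions made for the example.

```python
def bounds(mean_ga, uncertainty_ga):
    """Return (minimum, maximum) in Ga, assuming bounds = mean -/+ uncertainty."""
    return (round(mean_ga - uncertainty_ga, 3), round(mean_ga + uncertainty_ga, 3))


# Plant-animal divergence calibrations quoted in the text (Ga).
calibrations = {
    "fossil": (1.198, 0.022),
    "molecular": (1.609, 0.060),
}

constraints = {name: bounds(mean, unc) for name, (mean, unc) in calibrations.items()}
for name, (lower, upper) in constraints.items():
    print(f"{name}: {lower:.3f} to {upper:.3f} Ga")
# fossil: 1.176 to 1.220 Ga
# molecular: 1.549 to 1.669 Ga
```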
For the eubacterial data set, we used four internal time constraints in separate analyses, all involving the origin of cyanobacteria. The first and most conservative constraint was a fixed origin (minimum and maximum bounds) at 2.3 Ga, which corresponds to the GOE. For the second constraint we used 2.3 Ga as a minimum bound, with no maximum bound. For the third constraint we used a previous molecular time estimate (2.56 Ga) for the divergence of cyanobacteria from closest living relatives among eubacteria, and fixed the minimum (2.04 Ga) and maximum (3.08 Ga) values to the 95% confidence limits of that time estimate [30]. The fourth constraint for the origin of cyanobacteria was set at 2.7 Ga (minimum constraint) based on biomarker evidence for the presence of 2α-methylhopanes [86]. We did not consider the fossil record of cyanobacteria because the earliest indisputable fossils [52] are younger (2000 Ma) than the indirect evidence (GOE) for the presence of these oxygen-producing organisms. Older fossils of cyanobacteria are known but are disputed [52]. The use of these four alternative constraints for the origin of cyanobacteria considers most of the widely discussed hypotheses but does not rule out an origin prior to 2.7 Ga. Although the results of the four different calibrations are provided for comparison, our preferred calibration is the 2.3 Ga (minimum) geologic calibration because it has the best justification (supporting evidence). Therefore, our summary time estimates for eubacteria, presented in the timetree (Fig. ), use only this geologic calibration.
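The four alternative constraints on the origin of cyanobacteria can be summarised as a small data structure. The Python sketch below is illustrative only: the class, the labels and the helper function are hypothetical, and None is used to mean that no bound was set.

```python
from typing import NamedTuple, Optional


class Constraint(NamedTuple):
    min_ga: Optional[float]  # minimum age in Ga, None if unbounded
    max_ga: Optional[float]  # maximum age in Ga, None if unbounded


# Constraints on the origin of cyanobacteria, as described in the text.
CYANOBACTERIA_CONSTRAINTS = {
    "GOE, fixed": Constraint(2.3, 2.3),
    "GOE, minimum only": Constraint(2.3, None),
    "molecular estimate, 95% limits": Constraint(2.04, 3.08),
    "biomarker, minimum only": Constraint(2.7, None),
}


def satisfies(age_ga, constraint):
    """True if a proposed node age lies within the constraint."""
    if constraint.min_ga is not None and age_ga < constraint.min_ga:
        return False
    if constraint.max_ga is not None and age_ga > constraint.max_ga:
        return False
    return True


# A hypothetical node age of 2.5 Ga against each constraint.
for label, constraint in CYANOBACTERIA_CONSTRAINTS.items():
    print(label, satisfies(2.5, constraint))

# Four constraints combined with five prior settings give 20 analyses.
n_priors = 5
n_analyses = len(CYANOBACTERIA_CONSTRAINTS) * n_priors
print(n_analyses)  # 20
```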
For each of these calibration points, all five initial settings were applied, resulting in 15 and 20 analyses for Archaebacteria and Eubacteria, respectively. The effects of the different initial settings on the analyses were found to be minimal. A 44% difference in the priors generated a maximum difference in the time estimates of only 2.7% (average of all significant nodes) in the archaebacteria (fossil calibration point) and 3.5% (average of all significant nodes) in the eubacteria (molecular calibration point) [see Additional file 5