Multiprotein phylogeny produces more robust trees than single-gene approaches, which better serve bacteriologists in interpreting biological data. A concrete example is the fully supported clade marked by a star in Fig. whose members (Pseudomonas, etc.) share a molecular biological trait that is unique among bacteria: their S19 ribosomal protein operon contains a strong mimic of a 23S rRNA fragment that likely confers autoregulation on the operon (38). This trait would have required a more complex explanation than simple vertical inheritance according to 16S rRNA trees, since to our knowledge no such trees, including our own, recover this clade. (It was recovered with weak support in our 23S rRNA tree but was again lost in the combined 16S/23S rRNA tree.)
Although performance is improved, multiprotein phylogeny is still subject to some of the same artifacts and difficulties as single-gene approaches, such as long-branch attraction and poor resolution of branches that are deep and short. Telescoping allowed data from more protein families to be applied to isolated subgroups of taxa, but this was still insufficient to resolve all nodes or solve difficulties with highly biased genomes. We identified one genome as a phylogenetic attractor (Sodalis) and one as a disruptor (Photorhabdus); full explanation of the mechanisms of these effects will require further study and targeted genomic sequencing of more members of these clades.
This phylogenetic analysis based on hundreds of proteins for over a hundred taxa strongly supports the splitting of two families (Alteromonadaceae and Pseudoalteromonadaceae) and three orders (Alteromonadales, Pseudomonadales, and Oceanospirillales). A fourth order, Thiotrichales, also appears split but with lower support. Finally, the unity of the order Chromatiales could not be established. Thus, the classification based on 16S rRNA breaks down occasionally at the family level but more frequently at the order level.
The tree assigns taxonomy to previously unclassified genomes: Reinekea. Ruthia and Candidatus Vesicomyosocius to Piscirickettsiaceae; the affiliation of Ca. Endoriftia with Methylococcales requires confirmation as more genomes become available. For four genomes (Congregibacter and the marine strains HTCC2080, HTCC2143, and HTCC2207) the closest assigned genome was Saccharophagus of the split Alteromonadaceae. Two genomes fell among the outgroup and are therefore not part of the Gammaproteobacteria. For one of these, Mariprofundus, this placement is consistent with previous analysis of its 16S rRNA sequence that placed it as a deep branch within the phylum Proteobacteria. The other newly identified non-gammaproteobacterium is Acidithiobacillus, surprisingly, since this genus has long been considered a member of Gammaproteobacteria. A recent tree (K. P. Williams, unpublished data) prepared using 173 protein families from 124 bacterium-wide genomes, chosen to represent each available bacterial order, showed the same relationships of Acidithiobacillus to other proteobacteria as depicted in Fig. , confirming that these two genera represent independent sister groups to the beta-/gammaproteobacteria clade that arose after divergence from the Alphaproteobacteria. All these observations suggest revisions of the taxonomy at multiple ranks in the class and phylum.
An early multiprotein study of the Gammaproteobacteria used over 200 proteins, but only 13 genomes were available at that time (20). Later studies have used far fewer protein families (10 to 35 proteins) and fewer representative genomes (28 to 55 genomes) than the present analysis (4). All these studies and ours agree on the basal branching pattern connecting the five orders Enterobacteriales, and Xanthomonadales. One of these nodes in particular (marked by an “X” in Fig. ) has additional support from a unique 4-amino-acid deletion in RpoB; however, our results do not agree with two other suggestions based on rare indels: (i) exclusion of Francisella from the Gammaproteobacteria and (ii) a particular split of Alteromonadales. Some of the intermingling of bacterial orders observed here had been noted in one of these studies (40), but in the other studies the taxa employed or the extent of multifurcation precluded its detection.
Nearly half of the bacterial genomes currently available are incompletely sequenced, a fraction that may increase in the future, given short-read technologies and sequencing projects that do not include a goal of closing the genome. These incomplete genomes add greatly to the taxonomic diversity available for study and are nearly as rich in protein information as the complete ones, so they are worth including in such analyses despite the minor problems they raise. However, some incomplete genomes are contaminated with DNA from additional taxonomic sources and should be rejected; an rRNA impurity test was employed here to identify some such mixed genomes. Although highly divergent rRNA alleles can occur within a single genome (41), the caution exercised here was warranted; the highly divergent rRNA allele that we found in the incomplete Azotobacter vinelandii genome, which led us to reject that genome, did not appear in the recently completed version of this genome project.
When our five multiprotein supermatrices (i) were partitioned according to the substitution matrix favored by each family or (ii) had the new LG substitution matrix applied uniformly, nearly identical trees resulted, with only two cases of crossing nodes, and these nodes were the least well supported in the whole tree. This agrees with a recent survey of many alignments and supermatrices that concluded that although model choice can affect tree topology, “it rarely affects evolutionary inferences drawn from the data because differences are mainly confined to poorly supported nodes” (27).
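The comparison of two trees built under different model schemes can be illustrated by listing the bipartitions (splits) that each tree contains and reporting those found in only one of them. The following is a minimal sketch, not the procedure used in this study; the nested-tuple tree representation, the function name `splits`, and the toy taxa A to E are assumptions made for illustration.

```python
def splits(tree):
    """Return the non-trivial bipartitions of a tree given as nested tuples.

    Leaves are strings. Each bipartition is stored as the frozenset of taxa
    on the side that does NOT contain a fixed anchor taxon, so the result
    does not depend on where the tree happens to be rooted.
    """
    def leaf_set(node):
        if isinstance(node, str):
            return frozenset([node])
        return frozenset().union(*(leaf_set(child) for child in node))

    all_taxa = leaf_set(tree)
    anchor = min(all_taxa)  # arbitrary but reproducible reference taxon
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(walk(child) for child in node))
        side = all_taxa - clade if anchor in clade else clade
        # Keep only splits with at least two taxa on each side.
        if 1 < len(side) < len(all_taxa) - 1:
            found.add(side)
        return clade

    walk(tree)
    return found


# Hypothetical toy trees standing in for the partitioned and uniform-model trees.
tree_partitioned = ((("A", "B"), "C"), ("D", "E"))
tree_uniform = ((("A", "C"), "B"), ("D", "E"))

# Splits present in exactly one of the two trees.
conflicting = splits(tree_partitioned) ^ splits(tree_uniform)
for side in sorted(conflicting, key=sorted):
    print(sorted(side))
```

Splits that appear in only one tree could then be looked up against their support values to check whether the disagreements are confined to weakly supported nodes.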
While both the jackknifing and bootstrapping approaches to determining support values remove columns from the alignment supermatrix, jackknifing does not then duplicate the remaining columns and therefore produces smaller subsamplings that can speed processing for large data sets. We prefer the random removal of whole proteins rather than single columns across all protein alignments, to better assess variation at a coarser granularity and as a double-check against possible remaining incongruity of protein families. This 50% protein jackknifing is a more stringent measure of support than the usual per-column bootstrapping, counteracting the misleadingly high support values that the latter brings to long supermatrices (23).
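The idea of removing whole proteins rather than single columns can be sketched as follows. This is an illustrative sketch under stated assumptions, not the code used in this study: the function name `protein_jackknife`, the dictionary layout of the alignments, and the gap padding for taxa missing from a family are all hypothetical choices.

```python
import random


def protein_jackknife(alignments, fraction=0.5, rng=None):
    """Build one jackknife replicate by keeping a random subset of families.

    alignments: dict mapping family name -> dict mapping taxon -> aligned
                sequence (all sequences within a family have equal length).
    fraction:   share of protein families to keep (0.5 for a 50% jackknife).
    Returns the sorted list of kept families and the concatenated supermatrix
    as a dict mapping taxon -> sequence.
    """
    rng = rng or random.Random()
    families = sorted(alignments)
    n_keep = max(1, int(len(families) * fraction))
    # Sampling without replacement: whole families are dropped, none duplicated.
    kept = sorted(rng.sample(families, n_keep))

    taxa = sorted({taxon for fam in alignments.values() for taxon in fam})
    supermatrix = {}
    for taxon in taxa:
        parts = []
        for name in kept:
            aln = alignments[name]
            width = len(next(iter(aln.values())))
            # A taxon absent from a family is padded with gap characters.
            parts.append(aln.get(taxon, "-" * width))
        supermatrix[taxon] = "".join(parts)
    return kept, supermatrix


# Hypothetical toy input: four families, two taxa, one missing entry.
toy = {
    "famA": {"t1": "MKV", "t2": "MRV"},
    "famB": {"t1": "LL", "t2": "LI"},
    "famC": {"t1": "GGSA"},
    "famD": {"t1": "P", "t2": "P"},
}
kept, replicate = protein_jackknife(toy, fraction=0.5, rng=random.Random(1))
print(kept, replicate)
```

Each replicate would be passed to the tree-building program, and the fraction of replicates that recover a given node would serve as its support value.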
Compositional attraction through amino acid bias made it difficult to infer the phylogeny of the insect endosymbiont genomes, whose composition was the most biased in our taxon set. As in most previous studies, the endosymbionts joined together in a clade, with Sodalis when present, when all were included in a simple maximum likelihood analysis. When instead the endosymbionts were tested one at a time, Buchnera consistently joined at the base of the Enterobacteriales; unity of the Buchnera genomes is strongly supported by gene order studies (4).
Blochmannia and Wigglesworthia may also derive from this point yet consistently branch as sister to Sodalis when that taxon is included, jumping past two tree nodes. Ca. Baumannia appeared with Sodalis when present and, unlike other endosymbionts, at nearby positions when absent, weakly supporting the idea that Ca. Baumannia has a unique origin (Fig. , part i). Our finding of multiple origins for the endosymbionts agrees with those of other studies that have sought to avoid compositional attraction artifacts, through the use of either nonhomogeneous substitution matrices or analysis of genomic rearrangements (4). It would appear that the pattern of an Enterobacteriales member adopting the endosymbiotic lifestyle and becoming highly A+T-rich has occurred in multiple independent cases.
Surprisingly, nearly equal subsets of Enterobacteriales proteins favored one or the other of two root positions. Massive horizontal transfer is a plausible explanation in principle, although we could not identify a particular single transfer path; the pattern could be a result of numerous transfers along multiple paths. It should be noted that our family selection method removed the genes best known for horizontal transfer, those on genomic islands. Earlier studies of a classical set of 13 gammaproteobacterial genomes that includes Escherichia, Salmonella, and Yersinia but not Sodalis, Pectobacterium, or Photorhabdus found very few of the single-copy genes with evidence of horizontal transfer. We found that omitting Photorhabdus shifted the support by protein families from a near balance for two root positions to a preponderance for one of these roots, identifying Photorhabdus as a disruptor. The ambiguities left by this study show that the Enterobacteriales present rich problems in phylogenetic reconstruction that are not resolved simply by accumulating larger genome-scale protein supermatrices. These problems are probably due both to horizontal transfer that is especially favored by close contact with diverse cohabitants in enteric environments and to the extensive genomic alterations in multiple isolated symbiotic lineages. Future analysis of these problems within the Enterobacteriales is favored by the detailed mechanistic information on mutation and gene transfer known from E. coli and relatives.
Much of the failure to fully resolve the deeper regions of the tree can probably be ascribed to stochastic accumulation of noise that obscures reconstruction of short and ancient internodes. It may be further compounded by residual cases of horizontal gene transfer that passed through our incongruence filter. Compositional bias may be the greatest challenge facing studies on such broad scales of bacterial phylogeny as this one. The Gammaproteobacteria make it clear that as clades develop nucleotide bias, they can concomitantly develop amino acid bias (33), which would be expected to produce local asymmetries in amino acid substitution rates. Simple statistical tests would have rejected 73% of the taxa from our study on the basis of bias. Thus, our data set (like other data sets of broad phylogenetic scope) violated the typical assumptions of phylogenetic reconstruction algorithms regarding homogeneity and reversibility of amino acid substitution. We were able to manage the problem in some cases by the approach of placing individual biased taxa in the absence of similarly biased attractors. Some possible future solutions may be to counter bias on a per-column basis in the supermatrix, to use mixture models (19), and to use maximum-likelihood programs that do not demand model homogeneity (9). Another promising area for improving multiprotein phylogeny is the use of rare indels (10), especially if the detection of these markers were automated and if tree building could properly weight indel data in combination with alignment data; most of these are removed in our current protocol.
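One common form of a simple compositional test is a chi-square comparison of each taxon's amino acid counts against the pooled composition of the whole supermatrix. The sketch below illustrates that general idea only; it is an assumption that a test of this kind resembles the one referred to above, and the function name, the 5% significance level, and the toy sequences are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Upper 5% point of the chi-square distribution with 19 degrees of freedom.
# It applies only when all 20 amino acids occur in the pooled data.
CHI2_CRIT_DF19_P05 = 30.144


def composition_chi2(supermatrix):
    """Return a chi-square statistic per taxon against pooled composition.

    supermatrix: dict mapping taxon -> aligned amino acid sequence.
    Gaps and ambiguous characters are ignored.
    """
    counts = {
        taxon: Counter(c for c in seq.upper() if c in AMINO_ACIDS)
        for taxon, seq in supermatrix.items()
    }
    pooled = Counter()
    for taxon_counts in counts.values():
        pooled.update(taxon_counts)
    grand_total = sum(pooled.values())
    freqs = {aa: pooled[aa] / grand_total for aa in AMINO_ACIDS}

    stats = {}
    for taxon, taxon_counts in counts.items():
        n = sum(taxon_counts.values())
        if n == 0:
            stats[taxon] = 0.0  # nothing to test for an all-gap sequence
            continue
        stats[taxon] = sum(
            (taxon_counts[aa] - n * freqs[aa]) ** 2 / (n * freqs[aa])
            for aa in AMINO_ACIDS
            if freqs[aa] > 0
        )
    return stats


# Hypothetical toy input: two taxa with opposite, extreme compositions.
toy_matrix = {"t1": "AAAA", "t2": "CCCC"}
toy_stats = composition_chi2(toy_matrix)
flagged = [t for t, s in toy_stats.items() if s > CHI2_CRIT_DF19_P05]
print(toy_stats, flagged)
```

With real supermatrices, which are long, even modest compositional differences can produce large statistics, so many taxa may exceed the critical value.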
The multiprotein approach to gammaproteobacterial phylogeny, applied here with some methodological advances, has improved resolution for this challenging group, and we anticipate that additional advances are within reach that will further improve performance of the approach. As more genomes accrue, the multiprotein approach should become a new standard, supplanting 16S rRNA as the basis for phylogenetic reconstructions (1).