The Tree of Life (TOL) is one of the dominant concepts in biology, first seen in the famous single illustration in Darwin's “Origin of Species” and now found in 21st-century undergraduate textbooks. For approximately a century, beginning with the first, tentative trees published by Haeckel in the 1860s and up to the foundation of molecular evolutionary analysis by Zuckerkandl, Pauling, and Margoliash in the early 1960s, phylogenetic trees were constructed by comparing the phenotypes of organisms. Thus, by design, every constructed tree was an “organismal” or “species” tree; that is, a tree was assumed to reflect the evolutionary history of the corresponding species. Even after the concepts and early methods of molecular phylogeny had been developed, for many years molecular phylogeny was used simply as another, perhaps particularly powerful and accurate, approach to the construction of species trees. The TOL concept remained intact, with the general belief that the TOL, at least in principle, would accurately represent the evolutionary relationships between all lineages of cellular life forms. The discovery of the universal conservation of rRNA and its use as the molecule of choice for phylogenetic analysis, pioneered by Woese and others (Pace et al., 1986; Woese, 1987), resulted in the discovery of a new domain of life, the archaea, and boosted hopes that the definitive topology of the TOL was within sight.
However, even before the era of complete genome sequencing and analysis, it had become clear that in prokaryotes some common and biologically important genes have undergone multiple exchanges between species, a process known as horizontal gene transfer (HGT); hence the idea of a “net of life” as an alternative to the TOL. Advances in comparative genomics have revealed that different genes very often have distinct tree topologies and, accordingly, that HGT appears to be the rule rather than the exception in the evolution of prokaryotes, that is, bacteria and archaea (Dagan et al., 2008; Doolittle, 1999; Gogarten and Townsend, 2005).
It seems worth mentioning some remarkable examples of massive HGT as an illustration of this key trend in the evolution of prokaryotes. The first case in point pertains to the most commonly used model organism of microbial genetics and molecular biology, the intestinal bacterium Escherichia coli. Some basic information on the genome of E. coli and other sequenced microbial genomes is available on the website of the National Center for Biotechnology Information at the National Institutes of Health (http://www.ncbi.nlm.nih.gov/sites/entrez?db=genome). The best-studied laboratory isolate of E. coli, on which most of the classic experiments of molecular biology have been performed, is known as K12. The K12 genome encompasses 4226 annotated protein-coding genes (there is always uncertainty as to the exact number of genes in a sequenced genome, for instance, because it remains unclear whether some small genes actually encode proteins; however, the estimate suffices for the present discussion). Several other sequenced genomes of laboratory E. coli strains possess about the same number of genes. In contrast, the genomes of pathogenic strains of E. coli are typically much larger, with one strain, O157:H7, encoding 5315 annotated proteins. The nucleotide sequences of the genes shared by all strains of E. coli are identical or differ by just one or two nucleotide substitutions. In stark contrast, the differences between the genomes of laboratory and pathogenic strains are concentrated in several “pathogenicity islands” that comprise up to 20% of the genome. The pathogenicity islands encompass genes typically involved in bacterial pathogenesis, such as those encoding toxins, systems for their secretion, and components of prophages. One can imagine that the pathogenicity islands were present in the ancestral E. coli genome but have been deleted in K12 and other laboratory strains. However, the gene contents of the islands differ dramatically between the pathogenic strains, so that in three-way comparisons of E. coli genomes only about 40% of the genes are typically shared. Thus, the only possible conclusion is that the pathogenicity islands spread between bacterial genomes via rampant HGT, conceivably driven by selection for survival and spread of the respective bacterial pathogens within the host organisms.
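The “three-way comparison” figure can be made concrete with a small sketch. It assumes that each genome is reduced to a set of gene (ortholog) identifiers and that the shared fraction is the genes present in all genomes divided by all distinct genes across them; the text does not state the exact definition used, so this is one plausible reading. The gene names and strain labels below are invented toy data, not real E. coli annotations.

```python
def shared_fraction(*genomes):
    """Fraction of all distinct genes (union) that occur in every genome (intersection)."""
    gene_sets = [set(g) for g in genomes]
    union = set().union(*gene_sets)
    core = set.intersection(*gene_sets)
    return len(core) / len(union) if union else 0.0


# Hypothetical toy gene sets: four shared "backbone" genes plus strain-specific
# genes standing in for island content (all identifiers are made up).
strain_a = {"g1", "g2", "g3", "g4", "labA"}
strain_b = {"g1", "g2", "g3", "g4", "toxA", "toxB", "phage1"}
strain_c = {"g1", "g2", "g3", "g4", "pilA", "phage2"}

print(shared_fraction(strain_a, strain_b, strain_c))  # 4 core genes out of 10 distinct genes
```

In this toy case the strain-specific genes do not overlap between the two larger genomes, which is the pattern that argues against simple loss from a common ancestor and in favor of independent acquisition.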
The second example involves apparent large-scale HGT across much greater evolutionary distances, namely, between the two “domains” of prokaryotes, bacteria and archaea (Pace et al., 1986; Woese, 1987). The distinction between these two domains of microbes was established by phylogenetic analysis of rRNA sequences and the sequences of other conserved genes, and has been supported by major differences between the systems of DNA replication and the membrane apparatus of the respective organisms. Comparative analysis of the first few sequenced genomes of bacteria and archaea supported the dichotomy between the two domains: most of the protein sequences encoded in bacterial genomes show the greatest similarity to homologs from other bacteria and cluster with them in phylogenetic trees, and the same pattern of evolutionary relationships is seen for archaeal proteins. However, the analysis of the first sequenced genomes of hyperthermophilic bacteria, Aquifex aeolicus and Thermotoga maritima, yielded a striking departure from this pattern: the protein sets encoded in these genomes were shown to be “chimeric,” i.e., they consist of about 80% typical bacterial proteins and about 20% proteins that appear distinctly “archaeal” by sequence similarity and phylogenetic analysis. The conclusion seems inevitable that these bacteria have acquired numerous archaeal genes via HGT. In retrospect, this finding might not appear so surprising because bacterial and archaeal hyperthermophiles coexist in the same habitats (e.g., hydrothermal vents on the ocean floor) and have ample opportunity to exchange genes. A similar chimeric genome composition, but with reversed proportions of archaeal and bacterial genes, was subsequently discovered in mesophilic archaea such as Methanosarcina.
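As a rough illustration of how a “chimeric” composition might be tallied, the sketch below assigns each protein to the domain of its most similar homolog and reports the resulting fractions. This best-hit rule is a deliberate simplification (the analyses described above also rely on phylogenetic trees), and the protein identifiers and hit assignments are hypothetical.

```python
from collections import Counter


def domain_composition(best_hit_domain):
    """Given {protein: domain of its most similar homolog}, return the fraction per domain."""
    counts = Counter(best_hit_domain.values())
    total = sum(counts.values())
    return {domain: n / total for domain, n in counts.items()}


# Hypothetical best-hit assignments for ten proteins of a bacterial genome.
best_hits = {f"p{i}": "bacteria" for i in range(8)}
best_hits.update({"p8": "archaea", "p9": "archaea"})

composition = domain_composition(best_hits)
print(composition)
```

A best hit in another domain is only a hint of transfer; differences in evolutionary rate or gaps in database coverage can produce the same signal, which is why tree-based confirmation matters.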
Beyond these and related observations made by comparative genomics of prokaryotes, HGT is thought to have been crucial in the evolution of eukaryotes as well, especially as a consequence of endosymbiotic events in which numerous genes from the genomes of the ancestors of mitochondria and chloroplasts were transferred to nuclear genomes (Embley and Martin, 2006). These findings indicate that no single gene tree (or any group of gene trees) can provide an accurate representation of the evolution of entire genomes; in other words, the results of comparative genomics indicate that a perfect TOL fully reflecting the evolution of cellular life forms does not exist. The realization that HGT is a major evolutionary phenomenon, at least among prokaryotes, led to a crisis of the TOL concept, which is often viewed as a paradigm shift in evolutionary biology (Doolittle, 1999).
Of course, the inconsistency between gene phylogenies caused by HGT, however widespread, does not alter the fact that all cellular life forms are linked by an uninterrupted tree of cell divisions (Omnis cellula e cellula, according to the famous motto of Rudolf Virchow) that goes back to the earliest stages of evolution and is violated only by endosymbiosis events that were key to the evolution of eukaryotes but not prokaryotes. Thus, the difficulties of the TOL concept in the era of comparative genomics concern the TOL as it can be derived by the phylogenetic analysis of multiple genes and genomes, an approach often denoted “phylogenomics” to emphasize that phylogenetic studies are now conducted on the scale of complete genomes. Accordingly, the claim that HGT “uproots the TOL” means that extensive HGT has the potential to completely decouple molecular phylogenies from the actual tree of cells. However, such decoupling has clear biological connotations, given that the evolutionary history of genes also describes the evolution of the encoded molecular functions. In this article, the phylogenomic TOL is discussed with this implicit understanding.
The views of evolutionary biologists on the evolving status of the TOL in the age of comparative genomics span the entire spectrum of positions: (i) persistent denial of the major importance of HGT for evolutionary biology; (ii) a “moderate” overhaul of the TOL concept; and (iii) genuine uprooting, whereby the TOL is declared obsolete (Doolittle and Bapteste, 2007). The accumulating data on diverse HGT events are quickly making the first, “anti-HGT” position plainly untenable. Under the intermediate, moderate approach, despite all the differences between the topologies of individual gene trees, the TOL still makes sense as a representation of a central trend (consensus) that, at least in principle, could be elucidated through a comprehensive comparison of trees for individual genes (Wolf et al., 2002). By contrast, under the radical “anti-TOL” view, rampant HGT eliminates the very distinction between the vertical and horizontal transmission of genetic information, so the TOL concept should be abandoned altogether in favor of some form of network representation of evolution (Doolittle and Bapteste, 2007).
This article describes some of the methods that are used to compare topologies of numerous phylogenetic trees and the results of the application of these approaches to the analysis of approximately 7000 phylogenetic trees of individual prokaryotic genes that collectively comprise the “Forest of Life” (FOL). This set of trees does gravitate to a single tree topology, suggesting that the “TOL as a central trend” concept is potentially viable.
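To give a flavor of what comparing tree topologies involves, here is a minimal sketch that uses the Robinson-Foulds distance, a standard split-based metric. It is chosen here for simplicity and is not necessarily the measure applied in this article. Trees are represented as nested tuples of leaf names (an assumption made for this example), all trees are assumed to share the same leaf set, and the four-tree “forest” is invented toy data.

```python
def leaf_set(tree):
    """All leaf names of a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Nontrivial bipartitions of the leaves induced by the internal branches."""
    all_leaves = leaf_set(tree)
    ref = min(all_leaves)  # each split is stored as the side that lacks this leaf
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(walk(child) for child in node))
        side = all_leaves - clade if ref in clade else clade
        if 1 < len(side) < len(all_leaves) - 1:  # both sides need at least two leaves
            found.add(side)
        return clade

    walk(tree)
    return found


def rf_distance(tree_a, tree_b):
    """Robinson-Foulds distance: number of splits found in exactly one of the two trees."""
    return len(splits(tree_a) ^ splits(tree_b))


def most_central(forest):
    """Index of the tree with the smallest mean distance to all other trees."""
    def mean_distance(i):
        others = [t for j, t in enumerate(forest) if j != i]
        return sum(rf_distance(forest[i], t) for t in others) / len(others)
    return min(range(len(forest)), key=mean_distance)


# Hypothetical forest of four gene trees over the same five species.
t1 = ((("A", "B"), "C"), ("D", "E"))
t2 = (("D", "E"), ("C", ("B", "A")))    # same topology as t1, written differently
t3 = ((("A", "C"), "B"), ("D", "E"))    # differs from t1 in one split
t4 = ((("A", "D"), "B"), ("C", "E"))    # shares no nontrivial split with t1
forest = [t1, t2, t3, t4]

print(rf_distance(t1, t3), most_central(forest))
```

In this toy forest the topology shared by `t1` and `t2` has the smallest mean distance to the rest, which is the sense in which a set of partly conflicting gene trees can still gravitate toward a single central topology.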