It is proposed that the pre-cellular stage of biological evolution unfolded within networks of inorganic compartments that harbored a diverse mix of virus-like genetic elements. This stage of evolution might comprise the Last Universal Cellular Ancestor (LUCA) that more appropriately could be denoted Last Universal Cellular Ancestral State (LUCAS). This scenario for the origin of cellular life recapitulates the early ideas of J. B. S. Haldane sketched in his classic 1928 essay. However, unlike in Haldane’s day, there is now considerable support for this scenario from four major lines of comparative-genomic evidence: i) lack of homology between the core components of the DNA replication systems of the two primary lines of descent of cellular life forms, archaea and bacteria, ii) distinct membrane chemistries and lack of homology between the enzymes of lipid biosynthesis in archaea and bacteria, iii) spread of several viral hallmark genes, which encode proteins with key functions in viral replication and morphogenesis, among numerous and extremely diverse groups of viruses, in contrast to their absence in cellular life forms, and iv) the extant archaeal and bacterial chromosomes appear to be shaped by accretion of diverse, smaller replicons, suggesting a continuity between the hypothetical, primordial virus stage of life’s evolution and the dynamic prokaryotic world that existed ever since. Under the viral model of pre-cellular evolution, the key components of cells including the replication apparatus, membranes, and molecular complexes involved in membrane transport and translocation originated as components of virus-like entities. The two surviving types of cellular life forms, archaea and bacteria, might have emerged from the LUCAS independently, along with, probably, numerous forms now extinct.
As numerous complete genomes from diverse walks of life become available, comparative genomics turns into a truly powerful methodology 1–4. It has the ability not only to determine which genes are conserved and which are not, but also to reconstruct the gene composition of ancestral life forms including the hypothetical Last Universal Common (Cellular) Ancestor (LUCA) – under certain assumptions, of course 5–9. The key assumption is that genes shared by many diverse extant species are most likely to be inherited from the common ancestor of these species; in particular, genes that are present in all modern cellular life forms hark back to LUCA. The number of such ubiquitous genes is very small, fewer than 60, and nearly all of them encode proteins involved in translation and the core transcription machinery 5–7. This limited repertoire of genes obviously could not provide for a viable life form, so a considerable number of genes that must have been present in LUCA were lost or displaced in some lines of descent during the subsequent evolution.
Consequently, reconstruction approaches have to be applied in order to delineate the likely gene complement of LUCA. The simplest reconstruction methods are based on the principle of evolutionary parsimony, i.e., attempt to derive the evolutionary scenario that includes the smallest number of elementary events (the most parsimonious scenario) 10–12. The set of relevant events is small: i) gene “birth”, that is, emergence of a new gene, typically, via gene duplication followed by radical divergence, ii) gene acquisition via horizontal gene transfer (HGT), iii) gene loss.
Counting these events for different scenarios and choosing the one with the minimum number of events seems to be a straightforward task. However, realization of this goal meets with hurdles at several levels. First, in order to derive the patterns of presence-absence of a gene in a set of lineages (phyletic pattern), which are used as the input for the reconstruction methods, it is necessary to robustly identify orthologous genes, i.e., genes that evolved from a single ancestor gene in the common ancestor of the compared species 13, 14. Identification of orthologs is a nontrivial task for relatively fast-evolving genes from distant species and, especially, for any genes with a history of multiple duplications and losses. Second, and more fundamentally, reliable reconstruction of the course of evolution and of the ancestral gene sets is hampered by the uncertainty associated with the relative probabilities or rates of different events, in particular, gene loss versus horizontal gene transfer. Third, even phyletic patterns based on reliably delineated sets of orthologs hardly contain all the information that is required for the evolutionary reconstruction. In principle, even a gene that is found in all modern cellular life forms might not be inherited from LUCA: its ubiquity could instead result from an HGT sweep. Fourth, reconstruction methods based on parsimony are inherently limited as they have no capability to identify ancestral genes that have been lost in all or all but one of the extant lineages. Thus, the estimates of the gene content of ancestral forms are conservative, and the extent of underestimate is uncertain. Finally, to generate evolutionary scenarios, the parsimony reconstructions rely on a particular topology of the “tree of life”. 
Even apart from the major uncertainties that are inherent in deep phylogenetic trees, any such tree at best reflects the history of a small fraction of highly conserved genes: figuratively speaking, it is “a tree of one percent” 15. Worse yet, the very adequacy of the “tree of life” concept is questionable considering the extensive HGT that is part and parcel of the evolution of prokaryotes 16, 17. A more adequate probabilistic framework, such as that provided by maximum likelihood models, is required to produce more realistic estimates but such models can be prohibitively complex, and the approach to parameter estimation is unclear. Neither is it clear how the reconstruction can be performed in a tree-independent fashion.
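The sensitivity of parsimony reconstructions to the assumed relative cost of gene loss and gene gain can be illustrated with a minimal sketch. The Python code below applies weighted (Sankoff-style) parsimony to a single phyletic pattern on a fixed binary tree. The tree, the lineage names A through H, the pattern, and the cost values are hypothetical and chosen only for illustration; gene birth and HGT are lumped into a single "gain" event, which is a simplification and not the procedure used in the cited studies.

```python
INF = float("inf")

def sankoff(node, pattern, gain_cost, loss_cost=1.0):
    """Return (min cost if the gene is absent at node, min cost if present)."""
    if isinstance(node, str):
        # Leaf: only the observed state is allowed.
        return (INF, 0.0) if pattern[node] else (0.0, INF)
    # trans[parent_state][child_state]: 0 = absent, 1 = present
    trans = ((0.0, gain_cost), (loss_cost, 0.0))
    costs = [0.0, 0.0]
    for child in node:
        child_costs = sankoff(child, pattern, gain_cost, loss_cost)
        for s in (0, 1):
            costs[s] += min(trans[s][t] + child_costs[t] for t in (0, 1))
    return tuple(costs)

# Hypothetical species tree (nested tuples) and phyletic pattern (1 = gene present).
tree = ((("A", "B"), ("C", "D")), (("E", "F"), ("G", "H")))
pattern = {"A": 1, "B": 1, "C": 1, "D": 0, "E": 1, "F": 0, "G": 0, "H": 0}

for gain_cost in (0.5, 2.0):
    absent, present = sankoff(tree, pattern, gain_cost)
    verdict = "present" if present < absent else "absent"
    print(f"gain cost {gain_cost}: absent={absent}, present={present} -> {verdict} at root")
```

With these toy inputs, a cheap gain (0.5 per event, loss fixed at 1) favors absence of the gene at the root, whereas an expensive gain (2.0) favors presence, so the same phyletic pattern supports opposite conclusions about the ancestral gene set depending on the assumed cost ratio.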
All the difficulties and uncertainties of evolutionary reconstructions notwithstanding, parsimony analyses combined with less formal attempts at reconstructing the deep past of particular functional systems leave no serious doubt that LUCA already possessed at least several hundred genes. This diverse gene complement consists of genes encoding proteins of information processing systems including not only the core structural components (e.g., a minimal set of ribosomal proteins) but also some “accessory” proteins, e.g., a considerable variety of RNA modification enzymes; numerous metabolic pathways including the central energy metabolism and the biosynthesis of amino acids, nucleotides, and some coenzymes; and some crucial membrane proteins, such as the subunits of the signal recognition particle (SRP) and the H+-ATPase 11, 18, 19. In addition, a considerable number of RNA species such as three rRNAs, tRNAs of all specificities, and the SRP 7S RNA are confidently traced back to LUCA.
However, there are also gaping holes in the reconstructed gene repertoire of LUCA. The two most important ones are: i) the absence of the central parts of the DNA replication machinery, namely, the polymerases that are responsible for the initiation (primases) and elongation of DNA replication, and for gap-filling after primer removal, and the principal DNA helicases, and ii) the absence of most enzymes of lipid biosynthesis. These proteins fail to make it into the reconstructed gene repertoire of LUCA because the respective processes in bacteria, on the one hand, and archaea, on the other hand, are catalyzed by distinct, unrelated enzymes and, in the case of membrane phospholipids, yield chemically distinct membranes (the archaeal membrane phospholipids are isoprenoid ethers of glycerol 1-phosphate whereas bacterial lipids are fatty acid esters of glycerol 3-phosphate, i.e., the lipids in the two domains differ not only in their chemical composition but also in chirality) 20–24. Thus, the reconstructed gene set of LUCA seems to display a remarkable non-uniformity in that some functional systems seem to reach elaborate complexity almost indistinguishable from that in modern organisms whereas others are rudimentary or missing. This strange picture is remarkably similar to Woese’s general concept of non-simultaneous “crystallization” of different cellular systems at the early stages of evolution 25 and prompts one to step back and take a more general view of the LUCA problem.
The year 2009 is the Darwin year when the world celebrates his 200th birthday and the 150th anniversary of On the Origin of Species 26. It also happens to be the 150th jubilee of the idea of LUCA that, to my knowledge, was clearly proposed by Darwin for the first time (the acronym itself, of course, is much younger: it was coined in 1996 at a special meeting on the last common ancestor of modern life forms 27). In the famous final passage of the Origin, Darwin wrote: “There is grandeur in this view of life, with its several powers, having been originally breathed into a few forms or into one; and that, whilst this planet has gone cycling on according to the fixed law of gravity, from so simple a beginning endless forms most beautiful and most wonderful have been, and are being, evolved” 26. In Darwin’s day, this was an incredibly bold conjecture considering that the only empirical support came from phenotypic similarities between diverse organisms (paradoxically, Darwin’s prescience might have been helped by the obscurity of microbes at the time so that he, effectively, considered multicellular organisms).
The advances of molecular biology and, later, comparative genomics forcefully vindicated Darwin’s insight. The (near)universality of the genetic code, complemented by the universal conservation of ~50 proteins involved in the core translation functions, ~30 structural RNAs, and the three core subunits of the DNA-dependent RNA polymerase 5–7, constitutes strong evidence in support of the existence of some form of LUCA. Importantly, most of these molecules show a clear-cut pattern of phylogenetic relationships, with the three domains of life (bacteria, archaea, and eukaryota) being well-separated in phylogenetic trees, and the archaeal and eukaryotic sequences showing greater similarity to each other, which suggests rooting the tree between the archaeo-eukaryotic and bacterial branches 6, 28. This rooting was supported by the phylogenetic analysis of ancient paralogous genes, namely, translation factors and membrane ATPase subunits, that are thought to derive from gene duplications antedating LUCA 29, 30.
Although it has been suggested that this tree topology is a long-branch attraction artifact and so the root position has been challenged 31–33, it appears clear that there is a substantial, even if numerically relatively small, set of genes that are not only common to all cellular life forms but also share a (largely) common history. The existence of this evolutionarily coherent gene set that is, in all likelihood, ancestral to all extant cellular life appears to, effectively, prove the existence of an ancestral state that can be reasonably denoted LUCA. The real issue, then, is not whether or not a LUCA existed but rather what it was like, that is, which features of this entity we can infer with confidence and which (so far) remain uncertain.
It seems to make sense to think of LUCA in two distinct contexts:
These two characteristics are likely to correlate but are not necessarily tightly coupled let alone deterministically linked. In principle, it is not inconceivable that LUCA was a cellular entity that was substantially simpler than any modern cell (at least, a free-living one) in terms of its genetic content or, conversely, that considerable genetic complexity evolved prior to the emergence of cellular organization (Figure 1).
All the uncertainties involved notwithstanding, it seems to be extremely likely that LUCA was fairly complex, that is, had at least about as many genes as the simplest of the modern free-living prokaryotes, namely, on the order of 1000 genes or more. Figures in this range have been inferred by all algorithmic methods for ancestral gene set reconstruction 5, 11, 12, 19. However, given the uncertainty associated with these approaches (see above), the more compelling argument for a complex LUCA is the complexity of the modern translation machinery that comprises indisputable LUCA heritage. The functioning of such an advanced translation system is predicated on commensurate metabolic capabilities including not only the pathways for the synthesis of all nucleotides and (nearly) all amino acids but also those for at least some coenzymes, e.g., S-adenosylmethionine, the cofactor of the numerous RNA methylases several of which can be traced back to LUCA with high confidence 18, 34. Furthermore, the evolutionary relationships of some translation system components imply that these proteins are products of preceding complex evolution. A case in point are the aminoacyl-tRNA synthetases (aaRS), the 20 enzymes (one for each amino acid) that are essential for translation and of which at least 18 are confidently traced back to LUCA 35, 36. The core catalytic domains of the aaRS represent two distinct classes that possess unrelated structural folds and cover 10 amino acid specificities each. Analysis of the evolutionary history of the catalytic domains of Class I aaRS indicates that they all comprise one cluster of terminal branches in the elaborate tree of the “Rossmann-like” protein domains 37, 38. Thus, the diversification of the aaRS, which was already (nearly) complete in LUCA, was preceded by complex protein evolution including the divergence of many families of enzymes.
The same argument applies to translation factors, RNA methylases, and other groups of proteins involved in translation 18. Logically, these observations clinch the case for a LUCA whose genetic complexity was, in the least, not much lower than that of simple modern prokaryotes.
However, it is far from being obvious that LUCA resembled modern prokaryotes in terms of cellular organization as well. The “uniformitarian assumption”, namely, that LUCA was a more or less regular, modern-type cell, is often accepted, effectively, by default in the discussions of early evolution, even if rarely discussed explicitly 39–41. However, any reconstruction of LUCA must account for the evolution of the features that are not immediately traceable back to the common ancestor of archaea and bacteria, the two main ones being DNA replication and membrane biogenesis (and chemistry). The uniformitarian hypotheses of LUCA would explain the lack of conservation of these key systems in one of two ways:
Specifically, with respect to membrane biogenesis, it has been proposed that LUCA had a mixed, heterochiral membrane, with the two versions with opposite chiralities emerging as a result of subsequent specialization in archaea and bacteria 24. With regard to DNA replication, a hypothesis has been developed under which one of the modern replication systems is ancestral whereas the other system evolved in viruses and subsequently displaced the original one in either the archaeal or the bacterial lineage 42.
By contrast, radical proposals on LUCA’s nature take a “what you see is what you get” approach by postulating that LUCA lacked those key features that are not homologous in extant archaea and bacteria, at least, in their modern form. The possibility that LUCA was radically different from any known cells has been brought up, originally, in the concept of “progenote”, a hypothetical, primitive entity in which the link between the genotype and the phenotype was not yet firmly established 43. In its original form, the progenote idea involves primitive, imprecise translation, a notion that is not viable given the extensive diversification of proteins prior to LUCA that is demonstrated beyond doubt by the analysis of diverse protein superfamilies (see above). More realistically, it can be proposed that the emergence of the major features of cells was substantially asynchronous 25 so that LUCA closely resembled modern cells in some ways but was distinctly “primitive” in others. The results of comparative genomics provide clues for distinguishing advanced and primitive features of LUCA. Thus, focusing on the major areas of non-homology between archaea and bacteria, it has been hypothesized that LUCA:
With respect to the DNA genome and replication, the conundrum to explain was the combination of non-homologous and conserved components of the DNA replication machinery as well as the universal conservation of the core transcription machinery. To account for this mixed pattern of conservation and diversity, it has been suggested that LUCA had a retrovirus-like replication cycle, with the conserved transcription machinery involved in the transcription of provirus-like dsDNA molecules and the conserved components of the DNA replication system playing accessory roles in this process 22. This speculative scheme combined, in the same hypothetical replication cycle, the conserved proteins that are involved in transcription and replication with proteins, such as reverse transcriptase (RT) that, among the extant life forms, are seen, primarily or exclusively, in viruses and other selfish genetic elements. The proposal formally accounts for the universal conservation of these proteins but has no direct analogy in extant genetic systems.
The other major area of non-homology between archaea and bacteria, lipid biosynthesis (along with lipid chemistry) prompted the notion of a non-cellular, although compartmentalized LUCA. Specifically, it has been proposed that LUCA might have been a diverse population of expressed genetic elements that dwelled in networks of inorganic compartments 23. A major hurdle for the models of non-membrane-bounded LUCA is that several membrane proteins and even molecular complexes, such as the proton ATPase and the signal recognition particle (SRP), are nearly universal among modern cellular life forms and, in all likelihood, were present in LUCA 45.
A more careful consideration of the “genomic” (lack of homology of the core components of the DNA replication systems in archaea and bacteria) and the “membrane” (radical difference between the phospholipids and the enzymes of lipid biosynthesis in archaea and bacteria) challenges to LUCA suggests that the two are tightly linked. A complex LUCA without a large DNA genome similar to modern bacterial and archaeal genomes could only have a genome consisting of several hundred segments of RNA (or provirus-like DNA), each several kilobases in size. This limitation is dictated by the dramatically lower stability of RNA molecules compared to DNA and is empirically supported by the fact that the largest known RNA genomes (those of coronaviruses) are ~30 kb in size 46. It has been proposed that LUCA represented a bona fide RNA cell that subsequently radiated into three major RNA cell lineages (the ancestors of bacteria, archaea and eukaryotes) in which the genome was independently replaced by DNA as a result of acquisition of the DNA replication machinery from distinct viruses 44. However, the necessity to possess hundreds of genomic RNA segments seems to raise an insurmountable obstacle for an RNA cell because a reasonable accuracy of genome partitioning into daughter cells during cell division would require elaborate mechanisms of genome segregation of a kind not found in modern prokaryotes. Otherwise, the change in the gene complement brought about by each cell division would, effectively, prevent reproduction. Those segregation mechanisms that do operate in modern bacteria (and, probably, archaea) involve pumping of dsDNA into daughter cells with the help of a specific ATPase and, probably, coevolved with large dsDNA genomes 47–50. Thus, if LUCA indeed lacked a large dsDNA genome and instead had a “collective” genome comprised of numerous RNA segments, it must have been a life form distinct from modern cells, perhaps, actually, a non-cellular one.
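The scale of the segregation problem can be made concrete with a back-of-the-envelope calculation. The Python sketch below is an illustration only: the mean gene length of 1 kb, the segment sizes, the copy numbers, and the assumption that each segment copy is distributed to either daughter independently with probability 1/2 are hypothetical simplifications and are not part of the cited models.

```python
import math

def min_segments(n_genes, gene_len_nt, max_segment_nt):
    """Smallest number of segments needed to carry n_genes genes."""
    return math.ceil(n_genes * gene_len_nt / max_segment_nt)

def p_complete(n_segments, copies):
    """Probability that a given daughter receives at least one copy of every
    segment when each copy goes to either daughter independently (p = 1/2)."""
    return (1.0 - 0.5 ** copies) ** n_segments

# ~1000 genes of ~1 kb each (assumed), packed into segments of 3 kb or 30 kb.
for segment_nt in (3_000, 30_000):
    n = min_segments(1000, 1000, segment_nt)
    print(f"segment size {segment_nt} nt -> at least {n} segments")

# Chance that a daughter inherits a complete gene set for 300 segments.
for copies in (1, 5, 10):
    print(f"{copies} copies per segment -> P(complete) = {p_complete(300, copies):.3g}")
```

Under these assumptions, random partitioning of a few hundred segments yields a complete gene complement only when every segment is present in many copies, which is consistent with the argument that an RNA cell would need a dedicated segregation mechanism.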
Another broadly discussed aspect of early life forms, including LUCA, is the rampant HGT that is often considered a pre-requisite for the evolution of complex life 51, 52. Indeed, HGT is the route of rapid innovation, and innovation was bound to be rapid at the earliest stages of life’s evolution. Moreover, it has been recently suggested and illustrated by mathematical modeling that the very universality of the genetic code might be linked to the critical role of HGT at the early phase of evolution: in the presence of extensive HGT, a single version of the code would necessarily sweep the population of ancestral life forms, whereas any organisms with deviant code would be unable to benefit from HGT and, being isolated from other organisms, would be eliminated by selection 53, 54. Analogies with the history of human civilization are obvious and, perhaps, illuminating: the existence of a lingua franca greatly accelerates progress, and conversely, isolated communities are stalled in their development and doomed to eventual extinction. Constant, extensive HGT is an intrinsic feature of the models of non-cellular, compartmentalized LUCA 45 but certainly cannot be taken for granted within the framework of the cellular LUCA models. An updated version of the non-cellular LUCA model is presented below.
Russell and coworkers proposed that networks of microcompartments that exist at both extant and ancient hydrothermal vents, and consist, primarily, of iron sulfide could be ideal habitats for early life. These inorganic compartment networks provide gradients of temperature and pH that could fuel primordial energetics, and versatile catalytic surfaces for primitive biochemistry 55, 56. These might have been the sites of prebiological and pre-cellular biological evolution, from mixtures of organic molecules to the putative, primordial RNA world to the independent escapes of archaeal and bacterial cells 23, 45. These compartments are envisaged as being inhabited by diverse populations of genetic elements, initially, segments of RNA, subsequently, larger and more complex RNA molecules encompassing one or a few protein-coding genes, and later yet, also DNA segments of gradually increasing size (Fig. 3). Notably, a computer simulation study has shown that, in the presence of the thermal gradient that inevitably exists at a hydrothermal vent, extremely high concentrations of small molecules and polymers can be reached 57, a condition that would substantially facilitate a variety of reactions including RNA ligation 58.
Thus, early life forms, likely including LUCA, are perceived as complex ensembles of genetic elements that inhabited networks of inorganic compartments 45, 59. A key feature of this model is that genetic elements with different replication and expression strategies (including replicating DNA segments) encoding distinct replication machineries would coexist within a network or even within the same compartment. Thus, the earlier, somewhat artificial scheme, in which the universally conserved components of the DNA replication machinery were implicated in a primordial, retrovirus-like replication cycle 22, might be superfluous. The model of the compartmentalized primordial gene pool implies evolution of the retrovirus-like replication cycle within the RNA-protein world and subsequent evolution of diverse DNA replication systems (Fig. 3) but does not necessarily require the components of these distinct genetic systems to function together within the same replication cycle.
This model explains the lack of homology between the membranes, membrane biogenesis systems, and the DNA replication machineries of archaea and bacteria by inferring a LUCA that did not have a single, large DNA genome and was not a membrane-bounded cell. However, under this model, the primordial, pre-cellular life forms are envisaged as “laboratories” in which various strategies of genome replication-expression as well as rudimentary forms of biogenic compartmentalization were “invented” and tried out (Fig. 3 and see below).
The central point of this scenario of life’s early evolution is the virus-like nature of the perceived pre-cellular life forms. The idea that viruses could be related to the first life forms is almost as old as virology itself. Apparently, it was first proposed by Felix d’Herelle, the discoverer of bacteriophages 60 and was incorporated and developed by J. B. S. Haldane in his classic 1928 essay on the origin of life 61. Haldane came up with the striking speculation that the first self-reproducing agents were viruses or virus-like agents and that a virus stage in life’s evolution preceded the emergence of cells. Subsequently, the concept of the primordial origin of viruses was, largely, abandoned as it became obvious that viruses were obligate intracellular parasites that depend on the host cells for most of their functions; instead, the scenarios of cell degeneration or escaped cellular genes became dominant in the thinking on the origins of viruses 62–64.
Very recently, the study of fundamental aspects of virus evolution experienced a true renaissance that led to the proliferation of hypotheses and models that revolve around the concept that viruses were important contributors to the origin and evolution of cells 42, 44, 59, 65–70. In particular, Forterre proposed the hypothesis of “three DNA cells and three DNA viruses” according to which modern-type DNA-based cells evolved when three distinct DNA viruses displaced the original RNA genomes in three cellular lineages (progenitors of bacteria, archaea, and eukaryotes, respectively); the DNA viruses themselves are thought to have evolved as parasites of these primordial RNA cells 44. However, as discussed above, RNA cells do not appear to be a viable proposition. Therefore, the alternative scenario that seems to reconcile the results of comparative genomics and the general logic of precellular evolution revives Haldane’s idea at a new level and involves evolution of diverse virus-like elements and even virus-like particles prior to the advent of modern-type cells 59.
The emergence of cells is the epitome of the problems encountered by all explanations of the evolution of complex biological structures, the crucial conundrum of biology that was first recognized and explored by Darwin in his famous discussion of the evolution of the animal eyes 26. Darwin’s solution, with some embellishments, has since become the standard scenario for the origin of complex systems: the intermediates might not be fit to perform the function of the final, complex structure but they are good enough for either a simplified version of that function or, perhaps, a distinct function that is not always easy to deduce from the present one. For the latter case, Gould coined the succinct term exaptation, that is, recruitment of a structure for a new function 71. The virus-like early stage in life’s early evolution belongs to the same family of solutions and might be the most plausible if not the only way to avoid the ultimate “irreducible complexity” trap associated with the origin of cellular organization itself.
Like all biological evolution, pre-cellular evolution was undoubtedly driven, in large part, by natural selection. Selection enters the scene with the appearance of replicating entities, initially, it is currently presumed, RNA molecules replicated by ribozymes, and subsequently, after the emergence of translation, RNA molecules replicating with the aid of proteins 72, 73. These earliest stages of evolution are beyond the scope of this discussion. It is important to note, however, that one of the central aspects of the model of a virus-like, compartmentalized, pre-cellular stage of evolution is a gradual transition from selection at the level of individual genetic elements to group selection for ensembles of such elements encoding both enzymes directly involved in replication and proteins responsible for accessory functions, such as translation and nucleic acid precursor synthesis 45, 74.
Ensembles of “selfish cooperators” could potentially evolve by two routes: i) physical joining of genetic elements and ii) compartmentalization 45. The former route is considered to be the onset of the evolution of operons including the ribosomal-RNA polymerase superoperon, the only substantially conserved feature of the genome organization between archaea and bacteria 75, 76. The compartmentalization route would depend on the evolution of virus-like particles that could harbor (relatively) stable sets of genomic segments resembling the extant RNA viruses with multipartite genomes. Unlike cells, the virions of viruses with small genomes, particularly, the nearly ubiquitous icosahedral (spherical) capsids, are simple, symmetrical structures that, in many cases, are formed by self-assembly of a single capsid protein 77–80. Thus, it is attractive to speculate that simple virus-like particles were the first form of genuine, biological compartmentalization that was important at the pre-cellular stage of evolution. In addition to the benefit of compartmentalization, virus-like particles would protect genetic elements (especially, RNA) from degradation and could be vehicles for gene transfer within and between networks of inorganic compartments.
Most of the spherical viruses with relatively complex genomes possess molecular motors for DNA or RNA packaging within the capsid 79, 81–84; at least in some cases, these machines also mediate extrusion of mRNA transcripts from the capsid 85, 86. The viral packaging and extrusion machines contain motor ATPases of at least three distinct families that seem to share a common architecture, forming hexameric channels through which DNA or RNA is actively translocated 86, 87. Notably, one of the groups of viral packaging ATPases is a branch of the FtsK-HerA superfamily that also includes prokaryotic ATPases responsible for DNA pumping into daughter cells during cell division 50 whereas another family is homologous to bacterial twitching motility ATPases (Ref. 86 and EVK, unpublished observations). In membrane-containing virions of many viruses, the packaging motors translocate the DNA or RNA across both the capsid and the lipid membrane of the virion. It is tempting to hypothesize that viral packaging machines were evolutionary precursors of the cellular pumping and motility ATPases. Moreover, the H+-ATPase/ATP synthase, the key, universal membrane enzyme and the centerpiece of modern cellular energetics, also forms a similar hexameric channel 88 and might have started out as part of the packaging/extrusion machinery in a still uncharacterized (possibly, extinct) class of virus-like agents. Indeed, a recent comparative-genomic analysis has suggested that the two major branches of membrane ATPases, F-ATPases typically found in bacteria and V-ATPases characteristic of archaea and eukaryotes, evolved from a common ancestor that functioned as a protein or RNA translocase 89. More generally, it seems an attractive possibility that primordial viral membranes were intermediate steps in the evolution of membranes that antedated the emergence of the first cellular membranes, a major challenge in terms of evolution of complexity.
Just as genome replication of virus-like agents can be viewed as the original test ground for replication strategies 42, two of which have been subsequently recruited for the two major lineages of cellular life forms, evolving virus particles might have been the “laboratory” for testing molecular devices that were later incorporated into the membranes of emerging cells (Fig. 3).
From the selection for gene ensembles, there is a direct path to selection for compartment contents such that compartments sustaining rapid replication of genetic elements would “infect” adjacent compartments and, effectively, propagate their “genomes” 45; primordial virus-like particles would have been important for this process. The pre-cellular equivalent of HGT, that is, transfer of the genetic content between compartments, is part and parcel of this model, in agreement with the general concept that rampant HGT was an essential feature of the early stages of life’s evolution 51, 53, 54. After a substantial degree of complexity had been reached through the evolution of selfish cooperators within the networks of inorganic compartments, repeated escapes of cell-like entities that combined (relatively) large DNA genomes and membranes containing transport and translocation devices (originally evolved in virus-like agents, under this model) became possible. There is no telling how many such attempts failed quickly and how many might have been initially successful but the fact is that only two, archaea and bacteria (assuming a symbiotic scenario for the origin of eukaryotes 90), or three, archaea, bacteria and eukaryotes (assuming the so-called archezoan scenario of eukaryotic origin 91) survived for extended time intervals (the scenario for the origin of eukaryotes is peripheral in this context and is outside the scope of this article). The first successful escapes of cellular life forms from the hypothetical pre-cellular pool would correspond to the “Darwinian Threshold” for cellular life postulated by Woese 51, that is, the threshold beyond which HGT would be substantially curtailed, and evolution of distinct lineages (species) of cellular organisms could take off.
Like other models of the early stages of evolution of biological complexity, and perhaps, even more explicitly, the “primordial virus world” scenario outlined here faces the problem of takeover by selfish elements 74, 92, 93. If the primordial parasites became too aggressive, they would kill off their hosts within a compartment and could survive only by infecting a new compartment (where they could be dangerous again). Devastating “pandemics” sweeping through entire networks and eventually wiping out their entire content are imaginable, and indeed, this would be the likely fate of many, if not most, primordial “organisms”. The conditions for the survival of pre-cellular life forms were, first, emergence of temperate virus-like agents that do not kill the host, and second, early invention of defense mechanisms, likely, based on RNA interference (RNAi). The ubiquity of both temperate selfish elements and RNAi-based defense systems in all major branches of cellular life 94, 95 suggests that these phenomena evolved at a very early, quite possibly, pre-cellular stage of evolution.
The primordial virus world model of pre-cellular evolution sketched here seems to offer plausible, even if, to a large extent, speculative solutions to many puzzles associated with the origin of cells. Comparative genomics of viruses and other selfish elements seems to provide substantial empirical support for this model. Considering that, under the primordial virus world scenario, the first cells emerged from a non-cellular ancestral state in multiple, independent escapes, it seems sensible to replace the acronym LUCA with LUCAS, for Last Universal Common Ancestral State.
Viruses and other selfish replicons show remarkable diversity in terms of both replication-expression strategy and genomic complexity 62, 69, 70, 96–98. The selfish replicons comprising the virus world span, roughly, the same range of genome sizes, about four orders of magnitude (from ~10² nucleotides in the smallest viroid genome to >10⁶ nucleotides in the giant mimivirus) as genomes of cellular life forms (from ~2×10⁵ nucleotides in the smallest bacterial genome to ~3×10⁹ nucleotides in mammals, some extremely large plant and animal genomes excluded). Predictably, within such a huge span of genome size, viruses show a tremendous variety of gene repertoires. In viruses with large genomes, such as poxviruses, the mimivirus or T-even bacteriophages, there are many genes with readily recognizable homologs in cellular life forms that, clearly, have been transferred from the host at a relatively late stage of viral evolution 99–101. The origins of many other viral genes remain obscure as they are present in one or more lineages of viruses but not in any sequenced genomes of cellular life forms. Conceivably, such genes are products of rapid evolution at the base of the respective viral lineages so that the traces of their origin have been obliterated.
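The "about four orders of magnitude" comparison can be checked with a line of arithmetic. The sketch below is only an illustration: it takes the four genome-size extremes quoted in the text (10^2 and 10^6 nucleotides for the virus world, 2×10^5 and 3×10^9 nucleotides for cellular life forms) and computes the base-10 logarithm of each ratio. The function and variable names are introduced here for the example, and the viral upper bound is a lower limit (">10^6"), so the viral span is, if anything, slightly underestimated.

```python
import math

# Genome-size extremes as quoted in the text (in nucleotides).
VIRAL_MIN, VIRAL_MAX = 1e2, 1e6        # smallest viroid; giant mimivirus (a lower bound)
CELLULAR_MIN, CELLULAR_MAX = 2e5, 3e9  # smallest bacterial genome; mammals


def orders_of_magnitude(smallest: float, largest: float) -> float:
    """Return log10(largest / smallest), i.e. the span in orders of magnitude."""
    return math.log10(largest / smallest)


viral_span = orders_of_magnitude(VIRAL_MIN, VIRAL_MAX)
cellular_span = orders_of_magnitude(CELLULAR_MIN, CELLULAR_MAX)

print(f"virus world:   {viral_span:.2f} orders of magnitude")
print(f"cellular life: {cellular_span:.2f} orders of magnitude")
```

Both spans come out close to four (exactly 4 for the viral range as quoted, and log10(15,000) ≈ 4.18 for the cellular range), which is the sense in which the two worlds cover roughly the same range of genome sizes.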
In addition, however, a distinct class of viral genes shows a truly remarkable distribution. These “viral hallmark genes” are shared by many groups of viruses with extremely diverse replication-expression strategies, genome sizes, and host ranges (Table 1) 59. No single hallmark gene is found in all groups of viruses but, together, the partially overlapping distribution ranges of the hallmark genes cover almost the entirety of the virus world. There are only very distant homologs of the viral hallmark genes in cellular organisms, and all viral members of the respective gene families appear to have a common origin. All hallmark genes encode proteins with central, essential roles in the replication, expression, and virion morphogenesis of the respective viruses (Table 1). The relative contribution of the hallmark genes to the gene complement of a virus strongly depends on the genome size. Viruses with small genomes, such as most of the RNA viruses, often have only a few genes, so that the hallmark genes comprise the majority 102. By contrast, in viruses with large genomes, the hallmark genes account only for a small fraction of the gene complement. Considering the broad range of genome sizes and gene contents, and the even more dramatic, qualitative difference between the replication-expression strategies (e.g., positive-strand RNA viruses contrasted to dsDNA viruses) of viruses sharing some of the hallmark genes, it is striking and certainly calls for an explanation that the life cycles of these diverse viruses center around homologous genes (such as those for the jelly-roll capsid protein or the superfamily 3 helicase involved in genome replication).
Various evolutionary scenarios accounting for the highly unusual phyletic spread of the viral hallmark genes have been examined in detail elsewhere 59. In brief, the simplest explanation for the fact that the hallmark proteins involved in viral replication and virion formation are present in a broad variety of viruses but not in any cellular life forms seems to be that the latter actually never possessed these genes. Rather, the hallmark genes, probably, antedate cells and descend directly from the primordial pool of virus-like genetic elements. Given the spread of the hallmark genes among numerous groups of extremely diverse viruses, a major corollary is that, at least, several lineages of viruses and other selfish elements with distinct genome structures and replication-expression strategies derive from the precellular stage of evolution (although the current distribution of the hallmark genes, certainly, was affected by later HGT).
The concept outlined here posits that the pre-cellular stage of life’s evolution took place within networks of inorganic compartments that hosted a diverse mix of virus-like genetic elements 45, 59. It is further proposed that these ensembles of genetic elements were the ancestral state from which cells emerged, probably, in multiple, independent escapes, only two or three of which (the ancestors of bacteria and archaea, and possibly, eukaryotes) yielded stable cellular lineages that enjoyed long-term evolutionary success. Considering this hypothetical consortial state of primordial life forms that eventually gave rise to cells, it seems reasonable to replace the acronym LUCA with LUCAS, for the Last Universal Common Ancestral State.
The viral model of cellular origin recapitulates, at a quite different stage in the development of biology, the early ideas of Haldane 61. Since 1928, when Haldane’s essay was published, the status of the model has radically changed. At this time, the support and, indeed, the incentives for this model derive from four lines of substantive comparative-genomic evidence:
Although bacterial and archaeal chromosomes are large dsDNA molecules and are relatively stable over the short scale of evolution, these genomes of cellular life forms are in an equilibrium with the mobilome, and over the longer time scale, were shaped by accretion of diverse, smaller replicons 104, 105. Thus, there seems to be a continuity between the hypothetical, primordial virus stage of life’s evolution and the dynamic prokaryotic world, the principal distinction being the additional compartmentalization that is brought about by the cellular organization and provides for the persistence of large genomes.
In addition to being compatible with multiple lines of empirical evidence, the viral model of early evolution seems to offer at least a tentative solution to the classic Darwinian challenge of the evolution of complex structures that can function only as a whole, in this case, the cell itself. This solution comes along the lines first outlined by Darwin himself 26, that is, gradual evolution of the complex organization via intermediates whose functions are different from, even if mechanistically similar to, those of the fully developed structure. Under this model, primordial functions are envisaged to evolve as parts of the life cycles of virus-like genetic elements. Within this context, the model addresses the most daunting challenge to the hypothesis of a pre-cellular LUCA(S), namely, the universal conservation of some essential membrane proteins and complexes: the ancestors of these membrane devices might have functioned within the emerging membranes of virus-like particles.
The primordial virus world model is, at least in part, refutable and, potentially, testable. A discovery of an organism with an archaeal replication system but a bacterial membrane (or vice versa) would come close to a refutation. Further study of the diversity of viruses might reveal new membrane translocation devices, for instance, packaging machines homologous to the H⁺-ATPases of cellular organisms. Such evidence would provide support for a role of viruses in the evolution of cellular membranes. Direct biochemical experiments on early evolution are inherently hard. However, this model might make them easier by splitting the Gargantuan feat of evolving a cell into more manageable steps of evolution of virus-like agents.
Valerian Dolja, Bill Martin, Tania Senkevich, and Yuri Wolf contributed to the development of various aspects of this model. I also thank the participants of the meeting on the LUCA at Fondation Les Treilles (France) in September 2007, and specifically, the organizers of the meeting, Patrick Forterre, Celine Brochier-Armanet, and Simonetta Gribaldo, for most helpful discussions during which the acronym LUCAS was coined collectively. This work was supported by the Intramural Research Program of the National Institutes of Health, National Library of Medicine.