Follicular lymphoma (FL) is an incurable cancer characterized by progressive severity of relapses. We analyzed sequence context specificity of mutations in the B cells from a large cohort of FL patients. We revealed a substantial excess of mutations within a novel hybrid nucleotide motif: the signature of the somatic hypermutation (SHM) enzyme, Activation Induced Deaminase (AID), which overlaps the CpG methylation site. This finding implies that in FL the SHM machinery acts at genomic sites containing methylated cytosine. We identified the prevalence of this hybrid mutational signature in many other types of human cancer, suggesting that AID-mediated, CpG methylation-dependent mutagenesis is a common feature of tumorigenesis.
Virus genomes are prone to extensive gene loss, gain, and exchange and share no universal genes. Therefore, in a broad-scale study of virus evolution, gene and genome network analyses can complement traditional phylogenetics. We performed an exhaustive comparative analysis of the genomes of double-stranded DNA (dsDNA) viruses by using the bipartite network approach and found a robust hierarchical modularity in the dsDNA virosphere. Bipartite networks consist of two classes of nodes, with nodes in one class, in this case genomes, being connected via nodes of the second class, in this case genes. Such a network can be partitioned into modules that combine nodes from both classes. The bipartite network of dsDNA viruses includes 19 modules that form 5 major and 3 minor supermodules. Of these modules, 11 include tailed bacteriophages, reflecting the diversity of this largest group of viruses. The module analysis quantitatively validates and refines previously proposed nontrivial evolutionary relationships. An expansive supermodule combines the large and giant viruses of the putative order “Megavirales” with diverse moderate-sized viruses and related mobile elements. All viruses in this supermodule share a distinct morphogenetic tool kit with a double jelly roll major capsid protein. Herpesviruses and tailed bacteriophages comprise another supermodule, held together by a distinct set of morphogenetic proteins centered on the HK97-like major capsid protein. Together, these two supermodules cover the great majority of currently known dsDNA viruses. We formally identify a set of 14 viral hallmark genes that comprise the hubs of the network and account for most of the intermodule connections.
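The bipartite representation described above can be illustrated with a minimal sketch. It assumes a toy genome-to-gene map; the genome names and most gene labels are hypothetical and are not data from the study. It computes only node degrees and pairwise gene sharing, not the module detection itself, which would require a dedicated community-detection method.

```python
from collections import defaultdict

# Toy genome -> gene-family map (hypothetical labels, not data from the study).
genome_genes = {
    "phage_A": {"HK97_MCP", "terminase", "portal"},
    "phage_B": {"HK97_MCP", "terminase", "integrase"},
    "herpes_C": {"HK97_MCP", "terminase", "DNA_pol"},
    "giant_D": {"DJR_MCP", "DNA_pol", "A32_ATPase"},
    "polinton_E": {"DJR_MCP", "A32_ATPase", "integrase"},
}


def gene_degrees(genome_genes):
    """Degree of each gene node: the number of genomes it is connected to."""
    degree = defaultdict(int)
    for genes in genome_genes.values():
        for gene in genes:
            degree[gene] += 1
    return dict(degree)


def shared_gene_counts(genome_genes):
    """Project the bipartite network onto genomes: genes shared by each pair."""
    names = sorted(genome_genes)
    shared = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            n = len(genome_genes[a] & genome_genes[b])
            if n:  # keep only pairs connected through at least one gene
                shared[(a, b)] = n
    return shared


if __name__ == "__main__":
    # Genes with the highest degree are candidate hubs of the network.
    for gene, deg in sorted(gene_degrees(genome_genes).items(), key=lambda x: -x[1]):
        print(gene, deg)
    print(shared_gene_counts(genome_genes))
```

In this toy input, the two capsid-protein genes link otherwise dissimilar genomes, which is the kind of intermodule connection that hub genes would provide in a real network.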
Viruses and related mobile genetic elements are the dominant biological entities on earth, but their evolution is not sufficiently understood and their classification is not adequately developed. The key reason is the characteristic high rate of virus evolution that involves not only sequence change but also extensive gene loss, gain, and exchange. Therefore, in the study of virus evolution on a large scale, traditional phylogenetic approaches have limited applicability and have to be complemented by gene and genome network analyses. We applied state-of-the-art methods of such analysis to reveal robust hierarchical modularity in the genomes of double-stranded DNA viruses. Some of the identified modules combine highly diverse viruses infecting bacteria, archaea, and eukaryotes, in support of previous hypotheses on direct evolutionary relationships between viruses from the three domains of cellular life. We formally identify a set of 14 viral hallmark genes that hold together the genomic network.
The CRISPR-Cas adaptive immune system defends microbes against foreign genetic elements via DNA or RNA-DNA interference. We characterize the Class 2 type VI-A CRISPR-Cas effector C2c2 and demonstrate its RNA-guided RNase function. C2c2 from the bacterium Leptotrichia shahii provides interference against RNA phage. In vitro biochemical analysis shows that C2c2 is guided by a single crRNA and can be programmed to cleave ssRNA targets carrying complementary protospacers. In bacteria, C2c2 can be programmed to knock down specific mRNAs. Cleavage is mediated by catalytic residues in the two conserved HEPN domains, mutations in which generate catalytically inactive RNA-binding proteins. These results broaden our understanding of CRISPR-Cas systems and suggest that C2c2 can be used to develop new RNA-targeting tools.
Bacterial genomes encode numerous homologs of Cas9, the effector protein of the type II CRISPR-Cas systems. The homology region includes the arginine-rich helix and the HNH nuclease domain that is inserted into the RuvC-like nuclease domain. These genes, however, are not linked to cas genes or CRISPR. Here, we show that Cas9 homologs represent a distinct group of nonautonomous transposons, which we denote ISC (insertion sequences Cas9-like). We identify many diverse families of full-length ISC transposons and demonstrate that their terminal sequences (particularly 3′ termini) are similar to those of IS605 superfamily transposons that are mobilized by the Y1 tyrosine transposase encoded by the TnpA gene and often also encode the TnpB protein containing the RuvC-like endonuclease domain. The terminal regions of the ISC and IS605 transposons contain palindromic structures that are likely recognized by the Y1 transposase. The transposons from these two groups are inserted either exactly in the middle or upstream of specific 4-bp target sites, without target site duplication. We also identify autonomous ISC transposons that encode TnpA-like Y1 transposases. Thus, the nonautonomous ISC transposons could be mobilized in trans either by Y1 transposases of other, autonomous ISC transposons or by Y1 transposases of the more abundant IS605 transposons. These findings imply an evolutionary scenario in which the ISC transposons evolved from IS605 family transposons, possibly via insertion of a mobile group II intron encoding the HNH domain, and Cas9 subsequently evolved via immobilization of an ISC transposon.
IMPORTANCE Cas9 endonucleases, the effectors of type II CRISPR-Cas systems, represent the new generation of genome-engineering tools. Here, we describe in detail a novel family of transposable elements that encode the likely ancestors of Cas9 and outline the evolutionary scenario connecting different varieties of these transposons and Cas9.
Microbial CRISPR-Cas systems are divided into Class 1, with multisubunit effector complexes, and Class 2, with single protein effectors. Currently, only two Class 2 effectors, Cas9 and Cpf1, are known. We describe here three distinct Class 2 CRISPR-Cas systems. The effectors of two of the identified systems, C2c1 and C2c3, contain RuvC-like endonuclease domains distantly related to Cpf1. The third system, C2c2, contains an effector with two predicted HEPN RNase domains. Whereas production of mature CRISPR RNA (crRNA) by C2c1 depends on tracrRNA, C2c2 crRNA maturation is tracrRNA-independent. We found that C2c1 systems can mediate DNA interference in a 5’-PAM-dependent fashion analogous to Cpf1. However, unlike Cpf1, which is a single-RNA-guided nuclease, C2c1 depends on both crRNA and tracrRNA for DNA cleavage. Finally, comparative analysis indicates that Class 2 CRISPR-Cas systems evolved on multiple occasions through recombination of Class 1 adaptation modules with effector proteins acquired from distinct mobile elements.
CRISPR-Cas adaptive immunity; Cas9; Cpf1; crRNA; tracrRNA; PAM; RuvC-like endonuclease; HEPN domain; computational discovery pipeline; RNA-seq
The microbial adaptive immune system CRISPR mediates defense against foreign genetic elements through two classes of RNA-guided nuclease effectors. Class 1 effectors utilize multi-protein complexes, whereas Class 2 effectors rely on single-component effector proteins such as the well-characterized Cas9. Here we report characterization of Cpf1, a putative Class 2 CRISPR effector. We demonstrate that Cpf1 mediates robust DNA interference with features distinct from Cas9. Cpf1 is a single RNA-guided endonuclease lacking tracrRNA, and it utilizes a T-rich protospacer adjacent motif. Moreover, Cpf1 cleaves DNA via a staggered DNA double-stranded break. Out of 16 Cpf1-family proteins, we identified two candidate enzymes, from Acidaminococcus and Lachnospiraceae, with efficient genome editing activity in human cells. Identifying this mechanism of interference broadens our understanding of CRISPR-Cas systems and advances their genome editing applications.
Casposons are a recently discovered group of large DNA transposons present in diverse bacterial and archaeal genomes. For integration into the host chromosome, casposons employ an endonuclease that is homologous to the Cas1 protein involved in protospacer integration by the CRISPR-Cas adaptive immune system. Here we describe the site-preference of integration by the Cas1 integrase (casposase) encoded by the casposon of the archaeon Aciduliprofundum boonei. Oligonucleotide duplexes derived from the terminal inverted repeats (TIR) of the A. boonei casposon as well as mini-casposons flanked by the TIR inserted preferentially at a site reconstituting the original A. boonei target site. As in the A. boonei genome, the insertion was accompanied by a 15-bp direct target site duplication (TSD). The minimal functional target consisted of the 15-bp TSD segment and the adjacent 18-bp sequence which comprises the 3′ end of the tRNA-Pro gene corresponding to the TΨC loop. The functional casposase target site bears clear resemblance to the leader sequence-repeat junction which is the target for protospacer integration catalyzed by the Cas1–Cas2 adaptation module of CRISPR-Cas. These findings reinforce the mechanistic similarities and evolutionary connection between the casposons and the adaptation module of the prokaryotic adaptive immunity systems.
The history of life is punctuated by evolutionary transitions which engender emergence of new levels of biological organization that involves selection acting at increasingly complex ensembles of biological entities. Major evolutionary transitions include the origin of prokaryotic and then eukaryotic cells, multicellular organisms and eusocial animals. All or nearly all cellular life forms are hosts to diverse selfish genetic elements with various levels of autonomy including plasmids, transposons and viruses. I present evidence that, at least up to and including the origin of multicellularity, evolutionary transitions are driven by the coevolution of hosts with these genetic parasites along with sharing of ‘public goods’. Selfish elements drive evolutionary transitions at two distinct levels. First, mathematical modelling of evolutionary processes, such as evolution of primitive replicator populations or unicellular organisms, indicates that only increasing organizational complexity, e.g. emergence of multicellular aggregates, can prevent the collapse of the host–parasite system under the pressure of parasites. Second, comparative genomic analysis reveals numerous cases of recruitment of genes with essential functions in cellular life forms, including those that enable evolutionary transitions.
This article is part of the themed issue ‘The major synthetic evolutionary transitions’.
evolutionary transitions; mobile genetic elements; parasites; viruses; antivirus defence; host–parasite coevolution
Specific structures in mRNA modulate translation rate and thus can affect protein folding. Using the protein structures from two eukaryotes and three prokaryotes, we explore the connections between the protein compactness, inferred from solvent accessibility, and mRNA structure, inferred from mRNA folding energy (ΔG). In both prokaryotes and eukaryotes, the ΔG value of the most stable 30 nucleotide segment of the mRNA (ΔGmin) strongly, positively correlates with protein solvent accessibility. Thus, mRNAs containing exceptionally stable secondary structure elements typically encode compact proteins. The correlations between ΔG and protein compactness are much more pronounced in predicted ordered parts of proteins compared to the predicted disordered parts, indicative of an important role of mRNA secondary structure elements in the control of protein folding. Additionally, ΔG correlates with the mRNA length and the evolutionary rate of synonymous positions. The correlations are partially independent and were used to construct multiple regression models which explain about half of the variance of protein solvent accessibility. These findings suggest a model in which the mRNA structure, particularly exceptionally stable RNA structural elements, acts as a gauge of protein co-translational folding by reducing ribosome speed when the nascent peptide needs time to form and optimize the core structure.
The wide spread of gene exchange and loss in the prokaryotic world has prompted the concept of ‘lateral genomics’ to the point of an outright denial of the relevance of phylogenetic trees for evolution. However, the pronounced congruence of the topologies of numerous gene trees, particularly those for (nearly) universal genes, translates into the notion of a statistical tree of life (STOL), which reflects a central trend of vertical evolution. The STOL can be employed as a framework for reconstruction of the evolutionary processes in the prokaryotic world. Quantitatively, however, horizontal gene transfer (HGT) dominates microbial evolution, with the rate of gene gain and loss being comparable to the rate of point mutations and much greater than the duplication rate. Theoretical models of evolution suggest that HGT is essential for the survival of microbial populations that otherwise deteriorate due to Muller’s ratchet. Apparently, at least some bacteria and archaea evolved dedicated vehicles for gene transfer that derive from selfish elements such as plasmids and viruses. Recent phylogenomic analyses suggest that episodes of massive HGT were pivotal for the emergence of major groups of organisms such as multiple archaeal phyla as well as eukaryotes. Similar analyses appear to indicate that, in addition to donating hundreds of genes to the emerging eukaryotic lineage, mitochondrial endosymbiosis severely curtailed HGT. These results shed new light on the routes of evolutionary transitions, but caution is due given the inherent uncertainty of deep phylogenies.
Horizontal gene transfer; prokaryotes; evolutionary transitions; microbial evolution; statistical tree of life
Unicellular eukaryotes and most prokaryotes possess distinct mechanisms of programmed cell death (PCD). How could an “altruistic” trait, such as PCD, evolve in unicellular organisms? To address this question, we developed a mathematical model of the virus-host co-evolution that involves interaction between immunity, PCD and cellular aggregation. Analysis of the parameter space of this model shows that under high virus load and imperfect immunity, joint evolution of cell aggregation and PCD is the optimal evolutionary strategy. Given the abundance of viruses in diverse habitats and the wide spread of PCD in most organisms, these findings imply that multiple instances of the emergence of multicellularity and its essential attribute, PCD, could have been driven, at least in part, by the virus-host arms race.
programmed cell death; host-parasite arms race; viruses; evolution of multicellularity
Germline endogenous viral elements (EVEs) genetically preserve viral nucleotide sequences useful to the study of viral evolution, gene mutation, and the phylogenetic relationships among host organisms. Here, we describe a lineage-specific, adeno-associated virus (AAV)-derived endogenous viral element (mAAV-EVE1) found within the germline of numerous closely related marsupial species. Molecular screening of a marsupial DNA panel indicated that mAAV-EVE1 occurs specifically within the marsupial suborder Macropodiformes (present-day kangaroos, wallabies, and related macropodoids), to the exclusion of other Diprotodontian lineages. Orthologous mAAV-EVE1 locus sequences from sixteen macropodoid species, representing a speciation history spanning an estimated 30 million years, facilitated compilation of an inferred ancestral sequence that recapitulates the genome of an ancient marsupial AAV that circulated among Australian metatherian fauna sometime during the late Eocene to early Oligocene. In silico gene reconstruction and molecular modelling indicate remarkable conservation of viral structure over a geologic timescale. Characterisation of AAV-EVE loci among disparate species affords insight into AAV evolution and, in the case of macropodoid species, may offer an additional genetic basis for assignment of phylogenetic relationships among the Macropodoidea. From an applied perspective, the identified AAV “fossils” provide novel capsid sequences for use in translational research and clinical applications.
Research in quantitative evolutionary genomics and systems biology led to the discovery of several universal regularities connecting genomic and molecular phenomic variables. These universals include the log-normal distribution of the evolutionary rates of orthologous genes; the power law–like distributions of paralogous family size and node degree in various biological networks; the negative correlation between a gene's sequence evolution rate and expression level; and differential scaling of functional classes of genes with genome size. The universals of genome evolution can be accounted for by simple mathematical models similar to those used in statistical physics, such as the birth-death-innovation model. These models do not explicitly incorporate selection; therefore, the observed universal regularities do not appear to be shaped by selection but rather are emergent properties of gene ensembles. Although a complete physical theory of evolutionary biology is inconceivable, the universals of genome evolution might qualify as “laws of evolutionary genomics” in the same sense “law” is understood in modern physics.
Cpf1 is an RNA-guided endonuclease of a type V CRISPR-Cas system that has been recently harnessed for genome editing. Here, we report the crystal structure of Acidaminococcus sp. Cpf1 (AsCpf1) in complex with the guide RNA and its target DNA, at 2.8 Å resolution. AsCpf1 adopts a bilobed architecture, with the RNA–DNA heteroduplex bound inside the central channel. The structural comparison of AsCpf1 with Cas9, a type II CRISPR-Cas nuclease, reveals both striking similarity and major differences, thereby explaining their distinct functionalities. AsCpf1 contains the RuvC domain and a putative novel nuclease domain, which are responsible for the cleavage of the non-target and target strands, respectively, and jointly generate staggered DNA double-strand breaks. AsCpf1 recognizes the 5′-TTTN-3′ protospacer adjacent motif by base and shape readout mechanisms. Our findings provide mechanistic insights into RNA-guided DNA cleavage by Cpf1, and establish a framework for rational engineering of the CRISPR-Cpf1 toolbox.
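As an illustration of what recognition of the 5′-TTTN-3′ protospacer adjacent motif means for target selection, the following sketch scans one strand of a DNA sequence for candidate sites. It rests on assumptions not stated in the abstract: that the protospacer lies immediately 3′ of the PAM on the scanned strand, and that a protospacer length of 23 nt is a reasonable illustrative default. The function name is hypothetical, and the reverse strand is not scanned.

```python
import re


def find_cpf1_sites(seq, protospacer_len=23):
    """Return (pam_start, pam, protospacer) for each 5'-TTTN-3' PAM on this strand.

    Assumption: the protospacer starts right after the 4-nt PAM; the default
    length of 23 nt is illustrative only.
    """
    seq = seq.upper()
    sites = []
    # A lookahead is used so that overlapping PAM occurrences are all reported.
    for match in re.finditer(r"(?=(TTT[ACGT]))", seq):
        pam_start = match.start()
        proto_start = pam_start + 4
        protospacer = seq[proto_start:proto_start + protospacer_len]
        # Skip PAMs too close to the 3' end to carry a full-length protospacer.
        if len(protospacer) == protospacer_len:
            sites.append((pam_start, match.group(1), protospacer))
    return sites


if __name__ == "__main__":
    print(find_cpf1_sites("GGTTTACGTACGG", protospacer_len=5))
```

A practical design tool would also scan the reverse complement and filter candidates for off-target matches; both steps are omitted here for brevity.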
Many surface structures in archaea including various types of pili and the archaellum (archaeal flagellum) are homologous to bacterial type IV pili systems (T4P). The T4P consist of multiple proteins, often with poorly conserved sequences, complicating their identification in sequenced genomes. Here we report a comprehensive census of T4P encoded in archaeal genomes using sensitive methods for protein sequence comparison. This analysis confidently identifies as T4P components about 5000 archaeal gene products, 56% of which are currently annotated as hypothetical in public databases. Combining results of this analysis with a comprehensive comparison of genomic neighborhoods of the T4P, we present models of organization of the 10 most abundant variants of archaeal T4P. In addition to the differentiation between major and minor pilins, these models include extra components, such as S-layer proteins, adhesins and other membrane and intracellular proteins. For most of these systems, dedicated major pilin families are identified including numerous stand-alone major pilin genes of the PilA family. Evidence is presented that secretion ATPases of the T4P and cognate TadC proteins can interact with different pilin sets. Modular evolution of T4P results in combinatorial variability of these systems. Potential regulatory or modulating proteins for the T4P are identified including KaiC family ATPases, vWA domain-containing proteins and the associated MoxR/GvpN ATPase, TFIIB homologs and multiple unrelated transcription regulators, some of which are associated with specific T4P. Phylogenomic analysis suggests that at least one T4P system was present in the last common ancestor of the extant archaea. Multiple cases of horizontal transfer and lineage-specific duplication of T4P loci were detected. Generally, the T4P of the archaeal TACK superphylum are more diverse and evolve notably faster than those of euryarchaea.
The abundance and enormous diversity of T4P in hyperthermophilic archaea present a major enigma. Apparently, fundamental aspects of the biology of hyperthermophiles remain to be elucidated.
type IV pili; archaea; evolution; comparative genomics; secretion ATPase
Robustness to destabilizing effects of mutations is thought of as a key factor of protein evolution. The connections between two measures of robustness, the relative core size and the computationally estimated effect of mutations on protein stability (ΔΔG), protein abundance and the selection pressure on protein-coding genes (dN/dS) were analyzed for the organisms with a large number of available protein structures including four eukaryotes, two bacteria and one archaeon. The distribution of the effects of mutations in the core on protein stability is universal and indistinguishable in eukaryotes and bacteria, centered at slightly destabilizing amino acid replacements, and with a heavy tail of more strongly destabilizing replacements. The distribution of mutational effects in the hyperthermophilic archaeon Thermococcus gammatolerans is significantly shifted toward strongly destabilizing replacements which is indicative of stronger constraints that are imposed on proteins in hyperthermophiles. The median effect of mutations is strongly, positively correlated with the relative core size, in evidence of the congruence between the two measures of protein robustness. However, both measures show only limited correlations to the expression level and selection pressure on protein-coding genes. Thus, the degree of robustness reflected in the universal distribution of mutational effects appears to be a fundamental, ancient feature of globular protein folds whereas the observed variations are largely neutral and uncoupled from short term protein evolution. A weak anticorrelation between protein core size and selection pressure is observed only for surface residues in prokaryotes but a stronger anticorrelation is observed for all residues in eukaryotic proteins. This substantial difference between proteins of prokaryotes and eukaryotes is likely to stem from the demonstrable higher compactness of prokaryotic proteins.
Viruses were defined as one of the two principal types of organisms in the biosphere, namely, as capsid-encoding organisms in contrast to ribosome-encoding organisms, i.e., all cellular life forms. Structurally similar, apparently homologous capsids are present in a huge variety of icosahedral viruses that infect bacteria, archaea, and eukaryotes. These findings prompted the concept of the capsid as the virus “self” that defines the identity of deep, ancient viral lineages. However, several other widespread viral “hallmark genes” encode key components of the viral replication apparatus (such as polymerases and helicases) and combine with different capsid proteins, given the inherently modular character of viral evolution. Furthermore, diverse, widespread, capsidless selfish genetic elements, such as plasmids and various types of transposons, share hallmark genes with viruses. Viruses appear to have evolved from capsidless selfish elements, and vice versa, on multiple occasions during evolution. At the earliest, precellular stage of life's evolution, capsidless genetic parasites most likely emerged first and subsequently gave rise to different classes of viruses. In this review, we develop the concept of a greater virus world which forms an evolutionary network that is held together by shared conserved genes and includes both bona fide capsid-encoding viruses and different classes of capsidless replicons. Theoretical studies indicate that selfish replicons (genetic parasites) inevitably emerge in any sufficiently complex evolving ensemble of replicators. Therefore, the key signature of the greater virus world is not the presence of a capsid but rather genetic, informational parasitism itself, i.e., various degrees of reliance on the information processing systems of the host.
Diverse eukaryotes including animals and protists are hosts to a broad variety of viruses with double-stranded (ds) DNA genomes, from the largest known viruses, such as pandoraviruses and mimiviruses, to tiny polyomaviruses. Recent comparative genomic analyses have revealed many evolutionary connections between dsDNA viruses of eukaryotes, bacteriophages, transposable elements, and linear DNA plasmids. These findings provide an evolutionary scenario that derives several major groups of eukaryotic dsDNA viruses, including the proposed order “Megavirales,” adenoviruses, and virophages from a group of large virus-like transposons known as Polintons (Mavericks). The Polintons have been recently shown to encode two capsid proteins, suggesting that these elements lead a dual lifestyle with both a transposon and a viral phase and should perhaps more appropriately be named polintoviruses. Here, we describe the recently identified evolutionary relationships between bacteriophages of the family Tectiviridae, polintoviruses, adenoviruses, virophages, large and giant DNA viruses of eukaryotes of the proposed order “Megavirales,” and linear mitochondrial and cytoplasmic plasmids. We outline an evolutionary scenario under which the polintoviruses were the first group of eukaryotic dsDNA viruses that evolved from bacteriophages and became the ancestors of most large DNA viruses of eukaryotes and a variety of other selfish elements. Distinct lines of origin are detectable only for herpesviruses (from a different bacteriophage root) and polyoma/papillomaviruses (from single-stranded DNA viruses and ultimately from plasmids). Phylogenomic analysis of giant viruses provides compelling evidence of their independent origins from smaller members of the putative order “Megavirales,” refuting the speculations on the evolution of these viruses from an extinct fourth domain of cellular life.
Polintons; Megavirales; virus evolution; capsid proteins; translation
The ancestral set of eukaryotic genes is a chimera composed of genes of archaeal and bacterial origins thanks to the endosymbiosis event that gave rise to the mitochondria and apparently antedated the last common ancestor of the extant eukaryotes. The proto-mitochondrial endosymbiont is confidently identified as an α-proteobacterium. In contrast, the archaeal ancestor of eukaryotes remains elusive, although evidence is accumulating that it could have belonged to a deep lineage within the TACK (Thaumarchaeota, Aigarchaeota, Crenarchaeota, Korarchaeota) superphylum of the Archaea. Recent surveys of archaeal genomes show that the apparent ancestors of several key functional systems of eukaryotes, the components of the archaeal “eukaryome,” such as ubiquitin signaling, RNA interference, and actin-based and tubulin-based cytoskeleton structures, are identifiable in different archaeal groups. We suggest that the archaeal ancestor of eukaryotes was a complex form, rooted deeply within the TACK superphylum, that already possessed some quintessential eukaryotic features, in particular, a cytoskeleton, and perhaps was capable of a primitive form of phagocytosis that would facilitate the engulfment of potential symbionts. This putative group of Archaea could have existed for a relatively short time before going extinct or undergoing genome streamlining, resulting in the dispersion of the eukaryome. This scenario might explain the difficulty with the identification of the archaeal ancestor of eukaryotes despite the straightforward detection of apparent ancestors to many signature eukaryotic functional systems.
The apparent ancestors of key eukaryotic features (e.g., ubiquitin signaling, RNA interference, and cytoskeletal structures) are identifiable in different Archaea. But the specific archaeal ancestor of eukaryotes remains elusive.
Prokaryotes harbor a variety of genetic replicators, including plasmids, viruses, and chromosomes, each having differing effects on the phenotype of the hosting cell. Here, we propose a classification for replicators of bacteria and archaea on the basis of their horizontal-transfer potential and the type of relationships (mutualistic, symbiotic, commensal, or parasitic) that they have with the host cell vehicle. Horizontal movement of replicators can be either active or passive, reflecting whether or not the replicator encodes the means to mediate its own transfer from one cell to another. Some replicators also have an infectious extracellular state, thus separating viruses from other mobile elements. From the perspective of the cell vehicle, the different types of replicators form a continuum from genuinely mutualistic to completely parasitic replicators. This classification provides a general framework for dissecting prokaryotic systems into evolutionarily meaningful components.
bacteria; archaea; prokaryotes; classification; replicators; cell vehicles
In a series of conceptual articles published around the millennium, Carl Woese emphasized that evolution of cells is the central problem of evolutionary biology, that the three-domain ribosomal tree of life is an essential framework for reconstructing cellular evolution, and that the evolutionary dynamics of functionally distinct cellular systems are fundamentally different, with the information processing systems “crystallizing” earlier than operational systems. The advances of evolutionary genomics over the last decade vindicate major aspects of Woese’s vision. Despite the observations of pervasive horizontal gene transfer among bacteria and archaea, the ribosomal tree of life comes across as a central statistical trend in the “forest” of phylogenetic trees of individual genes, and hence, an appropriate scaffold for evolutionary reconstruction. The evolutionary stability of information processing systems, primarily translation, becomes ever more striking with the accumulation of comparative genomic data indicating that nearly all of the few universal genes encode translation system components. Woese’s views on the fundamental distinctions between the three domains of cellular life also withstand the test of comparative genomics, although his non-acceptance of symbiogenetic scenarios for the origin of eukaryotes might not. Above all, Woese’s key prediction that understanding evolution of microbes will be the core of the new evolutionary biology appears to be materializing.
Darwinian threshold; cellular evolution; domains of life; evolutionary transitions; horizontal gene transfer; progenote
Accurate inference of orthologous genes is a pre-requisite for most comparative genomics studies, and is also important for functional annotation of new genomes. Identification of orthologous gene sets typically involves phylogenetic tree analysis, heuristic algorithms based on sequence conservation, synteny analysis, or some combination of these approaches. The most direct tree-based methods typically rely on the comparison of an individual gene tree with a species tree. Once the two trees are accurately constructed, orthologs are straightforwardly identified by the definition of orthology as those homologs that are related by speciation, rather than gene duplication, at their most recent point of origin. Although ideal for the purpose of orthology identification in principle, phylogenetic trees are computationally expensive to construct for large numbers of genes and genomes, and they often contain errors, especially at large evolutionary distances. Moreover, in many organisms, in particular prokaryotes and viruses, evolution does not appear to have followed a simple ‘tree-like’ mode, which makes conventional tree reconciliation inapplicable. Other, heuristic methods identify probable orthologs as the closest homologous pairs or groups of genes in a set of organisms. These approaches are faster and easier to automate than tree-based methods, with efficient implementations provided by graph-theoretical algorithms enabling comparisons of thousands of genomes. Comparisons of these two approaches show that, despite conceptual differences, they produce similar sets of orthologs, especially at short evolutionary distances. Synteny also can aid in identification of orthologs. Often, tree-based, sequence similarity- and synteny-based approaches can be combined into flexible hybrid methods.
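The heuristic of taking the closest homologous pairs as probable orthologs is often implemented as bidirectional (reciprocal) best hits. The sketch below shows that idea under stated assumptions: similarity scores are supplied as plain dictionaries (in practice they would come from a sequence search tool), the gene identifiers are hypothetical, and ties, score thresholds, and in-paralog handling are ignored. It is a simplified stand-in, not the specific algorithm of any method discussed above.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring subject gene.

    scores: dict {(query, subject): similarity score}.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Gene pairs (a, b) that are each other's best hit in both directions."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


if __name__ == "__main__":
    # Hypothetical scores between genes of genome A (a1..a3) and genome B (b1, b2).
    a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 150, ("a2", "b2"): 180, ("a3", "b2"): 90}
    b_vs_a = {("b1", "a1"): 200, ("b2", "a2"): 180, ("b2", "a1"): 150, ("b2", "a3"): 90}
    print(bidirectional_best_hits(a_vs_b, b_vs_a))
```

In the toy input, a3 finds b2 as its best hit, but b2 prefers a2, so a3 is left without a putative ortholog; this is the kind of case where a paralog or a lost gene might be suspected.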
homolog; ortholog; paralog; xenolog; orthologous groups; tree reconciliation; comparative genomics
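The heuristic idea of taking the closest homologous pairs as probable orthologs can be illustrated with a minimal sketch of the bidirectional best hit criterion for two genomes. This is one common instance of that family of heuristics and is not necessarily the procedure used by any particular method discussed above. The function names, the gene identifiers, and the `(query, subject, score)` input format are assumptions made for illustration; in practice the scores would come from a sequence similarity search.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical input format: (query_gene, subject_gene, similarity_score).
Hit = Tuple[str, str, float]


def best_hits(hits: Iterable[Hit]) -> Dict[str, str]:
    """Map each query gene to its single highest-scoring subject gene."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b: Iterable[Hit],
                            b_vs_a: Iterable[Hit]) -> List[Tuple[str, str]]:
    """Return gene pairs (a, b) that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Toy scores (invented for illustration only).
a_vs_b = [("a1", "b1", 200.0), ("a1", "b2", 150.0),
          ("a2", "b2", 180.0), ("a3", "b2", 90.0)]
b_vs_a = [("b1", "a1", 195.0), ("b2", "a2", 175.0), ("b2", "a3", 80.0)]

pairs = bidirectional_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

In this toy example, gene `a3` is left unpaired because its best hit `b2` prefers `a2`. Such a pattern can arise from lineage-specific duplication or gene loss, and is one of the cases where group-based or tree-based approaches would give a different answer than a strict pairwise criterion.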
Biological information encoded in genomes is fundamentally different from and effectively orthogonal to Shannon entropy. The biologically relevant concept of information has to do with ‘meaning’, i.e. encoding various biological functions with various degrees of evolutionary conservation. Apart from direct experimentation, the meaning, or biological information content, can be extracted and quantified from alignments of homologous nucleotide or amino acid sequences but generally not from a single sequence, using appropriately modified information theoretical formulae. In short, information encoded in genomes is defined vertically but not horizontally. Informally but substantially, biological information density seems to be equivalent to ‘meaning’ of genomic sequences that spans the entire range from sharply defined, universal meaning to effective meaninglessness. Large fractions of genomes, up to 90% in some plants, belong within the domain of fuzzy meaning. The sequences with fuzzy meaning can be recruited for various functions, with the meaning subsequently fixed, and also could perform generic functional roles that do not require sequence conservation. Biological meaning is continuously transferred between the genomes of selfish elements and hosts in the process of their coevolution. Thus, in order to adequately describe genome function and evolution, the concepts of information theory have to be adapted to incorporate the notion of meaning that is central to biology.
information; meaning; evolution; selfish elements
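The notion that information is defined "vertically", across an alignment of homologous sequences, can be illustrated with a minimal sketch. It uses the standard per-column information content familiar from sequence logos, that is, the maximum entropy of the alphabet minus the observed entropy of the column. This is an assumed stand-in and not necessarily the modified formulae referred to above; the function names and the toy alignment are hypothetical, and no small-sample correction is applied.

```python
import math
from collections import Counter
from typing import List


def column_information(column: str, alphabet_size: int = 4) -> float:
    """Information content (bits) of one alignment column:
    log2(alphabet_size) minus the Shannon entropy of the column.
    Gap characters ('-') are ignored."""
    residues = [c for c in column.upper() if c != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    entropy = -sum((n / total) * math.log2(n / total)
                   for n in counts.values())
    return math.log2(alphabet_size) - entropy


def alignment_information(sequences: List[str],
                          alphabet_size: int = 4) -> List[float]:
    """Per-column information content for equal-length aligned sequences."""
    if len({len(s) for s in sequences}) > 1:
        raise ValueError("aligned sequences must have equal length")
    return [column_information("".join(col), alphabet_size)
            for col in zip(*sequences)]


# Toy nucleotide alignment (invented for illustration only).
alignment = ["ACGT", "ACGA", "ACTC", "ACAG"]
profile = alignment_information(alignment)
print(profile)
```

In the toy alignment, the two invariant columns carry the maximum of 2 bits, the partly conserved third column carries 0.5 bits, and the fully variable last column carries none. Applied to a single sequence, the same calculation would trivially assign 2 bits to every position, which is why a measure of this kind is only meaningful over a sufficiently large set of homologs.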
The CRISPR-Cas system of prokaryotic adaptive immunity displays features of a mechanism for directional, Lamarckian evolution. Indeed, this system modifies a specific locus in a bacterial or archaeal genome by inserting a piece of foreign DNA into a CRISPR array, which results in acquired, heritable resistance to the cognate selfish element. A key element of the Lamarckian scheme is the specificity and directionality of the mutational process whereby an environmental cue causes only mutations that provide specific adaptations to the original challenge. In the case of adaptive immunity, the specificity of mutations is equivalent to self-nonself discrimination. Recent studies on the CRISPR mechanism have shown that the levels of discrimination can differ substantially: in some CRISPR-Cas variants, incorporation of DNA is random, and discrimination occurs by selection of cells that carry cognate inserts. In other systems, a higher level of specificity appears to be achieved via specialized mechanisms. These findings emphasize the continuity between random and directed mutations and the critical importance of evolved mechanisms that govern the mutational process.
Reviewers: This article has been reviewed by Yitzhak Pilpel, Martijn Huynen, and Bojan Zagrovic.
CRISPR-Cas; Self-nonself discrimination; Lamarckian evolution; Darwinian evolution; DNA repair
Casposons are a superfamily of putative self-synthesizing transposable elements that are predicted to employ a homolog of the Cas1 protein as a recombinase and could have contributed to the origin of the CRISPR-Cas adaptive immunity systems in archaea and bacteria. Casposons remain uncharacterized experimentally, except for the recent demonstration of the integrase activity of the Cas1 homolog, and given their relative rarity in archaea and bacteria, the original comparative genomic analysis has not provided direct indications of their mobility. Here, we report evidence of casposon mobility obtained by comparison of the genomes of 62 strains of the archaeon Methanosarcina mazei. In these genomes, casposons are variably inserted in three distinct sites, indicative of multiple recent gains and losses. Some casposons are inserted into other mobile genetic elements that might provide vehicles for horizontal transfer of the casposons. Additionally, many M. mazei genomes contain previously undetected solo terminal inverted repeats that apparently are derived from casposons and could resemble intermediates in CRISPR evolution. We further demonstrate the sequence specificity of casposon insertion and note clear parallels with the adaptation mechanism of CRISPR-Cas. Finally, besides identifying additional representatives in each of the three originally defined families, we describe a new, fourth family of casposons.
casposons; self-synthesizing transposons; CRISPR-Cas; mobile genetic elements; transposition