The question of whether adaptation follows a deterministic route largely prescribed by the environment or can proceed along a large number of alternative trajectories has engaged extensive research in recent years. Experimental evolution studies enabled by advances in high throughput techniques for genome sequencing and manipulation, along with increasingly detailed mathematical modeling of fitness landscapes, are beginning to allow quantitative exploration of the repeatability of evolutionary trajectories. It is becoming clear that evolutionary trajectories in static correlated fitness landscapes are substantially non-random, but the relative contributions of determinism and stochasticity in the evolution of specific phenotypes strongly depend on the specific conditions, particularly the magnitude of the selective pressure and the number of available beneficial mutations.
evolutionary trajectory; predictability of evolution; fitness landscape; divergence of trajectories
The recently discovered CRISPR-Cas adaptive immune system is present in almost all archaea and many bacteria. It consists of cassettes of CRISPR repeats that incorporate spacers homologous to fragments of viral or plasmid genomes that are employed as guide RNAs in the immune response, along with numerous CRISPR-associated (cas) genes that encode proteins possessing diverse, only partially characterized activities required for the action of the system. Here, we investigate the evolution of the cas genes and show that they evolve under purifying selection that is typically much weaker than the median strength of purifying selection affecting genes in the respective genomes. The exceptions are the cas1 and cas2 genes that typically evolve at levels of purifying selection close to the genomic median. Thus, although these genes are implicated in the acquisition of spacers from alien genomes, they do not appear to be directly involved in an arms race between bacterial and archaeal hosts and infectious agents. These genes might possess functions distinct from and additional to their role in the CRISPR-Cas-mediated immune response. Taken together with evidence of the frequent horizontal transfer of cas genes reported previously and with the wide-spread microscale recombination within these genes detected in this work, these findings reveal the highly dynamic evolution of cas genes. This conclusion is in line with the involvement of CRISPR-Cas in antiviral immunity that is likely to entail a coevolutionary arms race with rapidly evolving viruses. However, we failed to detect evidence of strong positive selection in any of the cas genes.
When Charles Darwin formulated the central principles of evolutionary biology in the Origin of Species in 1859 and the architects of the Modern Synthesis integrated these principles with population genetics almost a century later, the principal if not the sole objects of evolutionary biology were multicellular eukaryotes, primarily animals and plants. Before the advent of efficient gene sequencing, all attempts to extend evolutionary studies to bacteria were futile. Sequencing of the rRNA genes in thousands of microbes allowed the construction of the three-domain “ribosomal Tree of Life” that was widely thought to have resolved the evolutionary relationships between the cellular life forms. However, subsequent massive sequencing of numerous, complete microbial genomes revealed novel evolutionary phenomena, the most fundamental of these being: (1) pervasive horizontal gene transfer (HGT), in large part mediated by viruses and plasmids, that shapes the genomes of archaea and bacteria and calls for a radical revision (if not abandonment) of the Tree of Life concept, (2) Lamarckian-type inheritance that appears to be critical for antivirus defense and other forms of adaptation in prokaryotes, and (3) evolution of evolvability, i.e., dedicated mechanisms for evolution such as vehicles for HGT and stress-induced mutagenesis systems. In the non-cellular part of the microbial world, phylogenomics and metagenomics of viruses and related selfish genetic elements revealed enormous genetic and molecular diversity and extremely high abundance of viruses that come across as the dominant biological entities on earth. Furthermore, the perennial arms race between viruses and their hosts is one of the defining factors of evolution. Thus, microbial phylogenomics adds new dimensions to the fundamental picture of evolution even as the principle of descent with modification discovered by Darwin and the laws of population genetics remain at the core of evolutionary biology.
Darwin; modern synthesis; comparative genomics; tree of life; horizontal gene transfer
Collections of Clusters of Orthologous Genes (COGs) provide indispensable tools for comparative genomic analysis, evolutionary reconstruction and functional annotation of new genomes. Initially, COGs were made for all complete genomes of cellular life forms that were available at the time. However, with the accumulation of thousands of complete genomes, construction of a comprehensive COG set has become extremely computationally demanding and prone to error propagation, necessitating the switch to taxon-specific COG collections. Previously, we reported the collection of COGs for 41 genomes of Archaea (arCOGs). Here we present a major update of the arCOGs and describe evolutionary reconstructions to reveal general trends in the evolution of Archaea.
The updated version of the arCOG database incorporates 91% of the pangenome of 120 archaea (251,032 protein-coding genes altogether) into 10,335 arCOGs. Using this new set of arCOGs, we performed maximum likelihood reconstruction of the genome content of archaeal ancestral forms and gene gain and loss events in archaeal evolution. This reconstruction shows that the last Common Ancestor of the extant Archaea was an organism of greater complexity than most of the extant archaea, probably with over 2,500 protein-coding genes. The subsequent evolution of almost all archaeal lineages was apparently dominated by gene loss resulting in genome streamlining. Overall, in the evolution of Archaea as well as a representative set of bacteria that was similarly analyzed for comparison, gene losses are estimated to outnumber gene gains at least 4 to 1. Analysis of specific patterns of gene gain in Archaea shows that, although some groups, in particular Halobacteria, acquire substantially more genes than others, on the whole, gene exchange between major groups of Archaea appears to be largely random, with no major ‘highways’ of horizontal gene transfer.
The updated collection of arCOGs is expected to become a key resource for comparative genomics, evolutionary reconstruction and functional annotation of new archaeal genomes. Given that, in spite of the major increase in the number of genomes, the conserved core of archaeal genes appears to be stabilizing, the major evolutionary trends revealed here have a chance to stand the test of time.
This article was reviewed by (for complete reviews see the Reviewers’ Reports section): Dr. PLG, Prof. PF, Dr. PL (nominated by Prof. JPG).
Archaea; Orthologs; Horizontal gene transfer
Bacteria and archaea face continual onslaughts of rapidly diversifying viruses and plasmids. Many prokaryotes maintain adaptive immune systems known as clustered regularly interspaced short palindromic repeats (CRISPR) and CRISPR-associated genes (Cas). CRISPR-Cas systems are genomic sensors that serially acquire viral and plasmid DNA fragments (spacers) that are utilized to target and cleave matching viral and plasmid DNA in subsequent genomic invasions, offering critical immunological memory. Only 50% of sequenced bacteria possess CRISPR-Cas immunity, in contrast to over 90% of sequenced archaea. To probe why half of bacteria lack CRISPR-Cas immunity, we combined comparative genomics and mathematical modeling. Analysis of hundreds of diverse prokaryotic genomes shows that CRISPR-Cas systems are substantially more prevalent in thermophiles than in mesophiles. With sequenced bacteria disproportionately mesophilic and sequenced archaea mostly thermophilic, the presence of CRISPR-Cas appears to depend more on environmental temperature than on bacterial-archaeal taxonomy. Mutation rates are typically severalfold higher in mesophilic prokaryotes than in thermophilic prokaryotes. To quantitatively test whether accelerated viral mutation leads microbes to lose CRISPR-Cas systems, we developed a stochastic model of virus-CRISPR coevolution. The model competes CRISPR-Cas-positive (CRISPR-Cas+) prokaryotes against CRISPR-Cas-negative (CRISPR-Cas−) prokaryotes, continually weighing the antiviral benefits conferred by CRISPR-Cas immunity against its fitness costs. Tracking this cost-benefit analysis across parameter space reveals viral mutation rate thresholds beyond which CRISPR-Cas cannot provide sufficient immunity and is purged from host populations. These results offer a simple, testable viral diversity hypothesis to explain why mesophilic bacteria disproportionately lack CRISPR-Cas immunity. 
More generally, fundamental limits on the adaptability of biological sensors (Lamarckian evolution) are predicted.
A remarkable recent discovery in microbiology is that bacteria and archaea possess systems conferring immunological memory and adaptive immunity. Clustered regularly interspaced short palindromic repeats (CRISPR) and CRISPR-associated genes (CRISPR-Cas) are genomic sensors that allow prokaryotes to acquire DNA fragments from invading viruses and plasmids. Providing immunological memory, these stored fragments destroy matching DNA in future viral and plasmid invasions. CRISPR-Cas systems also provide adaptive immunity, keeping up with mutating viruses and plasmids by continually acquiring new DNA fragments. Surprisingly, less than 50% of mesophilic bacteria, in contrast to almost 90% of thermophilic bacteria and Archaea, maintain CRISPR-Cas immunity. Using mathematical modeling, we probe this dichotomy, showing how increased viral mutation rates can explain the reduced prevalence of CRISPR-Cas systems in mesophiles. Rapidly mutating viruses outrun CRISPR-Cas immune systems, likely decreasing their prevalence in bacterial populations. Thus, viral adaptability may select against, rather than for, immune adaptability in prokaryotes.
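The cost-benefit reasoning described above can be illustrated with a deliberately simplified, deterministic toy calculation. This is not the stochastic coevolution model used in the study. The functions and parameters below (`spacer_len`, `infection_cost`, `crispr_cost`, and the assumption that a spacer protects only if none of its positions has mutated in the virus) are hypothetical and serve only to show how a viral mutation rate threshold can arise.

```python
def spacer_match_probability(mu: float, spacer_len: int) -> float:
    """Probability that a viral target still matches a stored spacer.

    Toy assumption: each of the spacer_len positions mutates independently
    with probability mu, and a single mismatch abolishes immunity.
    """
    return (1.0 - mu) ** spacer_len


def crispr_net_advantage(mu: float, spacer_len: int,
                         infection_cost: float, crispr_cost: float) -> float:
    """Fitness benefit of immunity minus the cost of carrying CRISPR-Cas.

    infection_cost: fitness lost to infection when unprotected (hypothetical).
    crispr_cost: fitness cost of maintaining the system (hypothetical).
    """
    benefit = infection_cost * spacer_match_probability(mu, spacer_len)
    return benefit - crispr_cost


def mutation_rate_threshold(spacer_len: int,
                            infection_cost: float, crispr_cost: float) -> float:
    """Viral mutation rate above which the toy net advantage turns negative."""
    if crispr_cost >= infection_cost:
        return 0.0  # the system never pays for itself in this toy model
    return 1.0 - (crispr_cost / infection_cost) ** (1.0 / spacer_len)


if __name__ == "__main__":
    mu_star = mutation_rate_threshold(spacer_len=30, infection_cost=0.5,
                                      crispr_cost=0.05)
    print(f"toy threshold mutation rate: {mu_star:.4f}")
```

In this sketch, CRISPR-Cas carriers are favored below the threshold and disfavored above it, which mirrors the qualitative conclusion of the abstract without reproducing the authors' model.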
The 1,021,348 base pair genome sequence of the Acanthamoeba polyphaga moumouvirus, a new member of the Mimiviridae family infecting Acanthamoeba polyphaga, is reported. The moumouvirus represents a third lineage beside mimivirus and megavirus. Thereby, it is a new member of the recently proposed Megavirales order. This giant virus was isolated from cooling tower water in southeastern France but is most closely related to Megavirus chiliensis, which was isolated from ocean water off the coast of Chile. The moumouvirus is predicted to encode 930 proteins, of which 879 have detectable homologs. Among these predicted proteins, for 702 the closest homolog was detected in Megavirus chiliensis, with the median amino acid sequence identity of 62%. The evolutionary affinity of moumouvirus and megavirus was further supported by phylogenetic tree analysis of conserved genes. The moumouvirus and megavirus genomes share near perfect orthologous gene collinearity in the central part of the genome, with the variations concentrated in the terminal regions. In addition, genomic comparisons of the Mimiviridae reveal substantial gene loss in the moumouvirus lineage. The majority of the remaining moumouvirus proteins are most similar to homologs from other Mimiviridae members, and for 27 genes the closest homolog was found in bacteria. Phylogenetic analysis of these genes supported gene acquisition from diverse bacteria after the separation of the moumouvirus and megavirus lineages. Comparative genome analysis of the three lineages of the Mimiviridae revealed significant mobility of Group I self-splicing introns, with the highest intron content observed in the moumouvirus genome.
moumouvirus; mimivirus; giant virus; megavirus; Mimiviridae; Megavirales; horizontal gene transfer; viral genome; nucleo-cytoplasmic large DNA viruses
A fundamental observation of comparative genomics is that the distribution of evolution rates across the complete sets of orthologous genes in pairs of related genomes remains virtually unchanged throughout the evolution of life, from bacteria to mammals. The most straightforward explanation for the conservation of this distribution appears to be that the relative evolution rates of all genes remain nearly constant, or in other words, that evolutionary rates of different genes are strongly correlated within each evolving genome. This correlation could be explained by a model that we denoted the Universal PaceMaker (UPM) of genome evolution. The UPM model posits that the rate of evolution changes synchronously across genome-wide sets of genes in all evolving lineages. Alternatively, however, the correlation between the evolutionary rates of genes could be a simple consequence of the molecular clock (MC). We sought to differentiate between the MC and UPM models by fitting thousands of phylogenetic trees for bacterial and archaeal genes to supertrees that reflect the dominant trend of vertical descent in the evolution of archaea and bacteria and that were constrained according to the two models. The goodness of fit for the UPM model was better than the fit for the MC model, with overwhelming statistical significance, although similarly to the MC, the UPM is strongly overdispersed. Thus, the results of this analysis reveal a universal, genome-wide pacemaker of evolution that could have been in operation throughout the history of life.
A central concept of evolution is the Molecular Clock, according to which each gene evolves at a characteristic, near constant rate. Numerous studies support the Molecular Clock hypothesis in principle but also show that the clock is indeed very approximate. Genome-wide comparative analysis of phylogenetic trees described here reveals a distinct, more general feature of genome evolution that we called the Universal Pacemaker. Under this model, when the rate of evolution changes, the change occurs synchronously in many if not all genes in the evolving genome. In other words, the relative rates of gene evolution remain constant across long evolutionary spans: if a gene is slow relative to the rest of the genes in the given lineage, it is always slow, and if it evolves fast, it is always fast. We show here that the Universal Pacemaker model fits the available data much better than the traditional Molecular Clock model. These findings are compatible with the previously observed accelerations and decelerations of evolution in individual lineages but we show that synchronous, genome-wide change of evolutionary rates is a global feature of genome evolution that appears to pervade the entire history of life.
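The contrast between the two models can be written compactly. The notation below is an assumption introduced for illustration and is not taken from the text: $\ell_{i,j}$ is the length of branch $j$ in the tree of gene $i$, $r_i$ is the characteristic rate of gene $i$, $t_j$ is the time span of branch $j$, $p_j$ is the amount of "pacemaker" change accumulated along branch $j$, and $\varepsilon_{i,j}$ is a gene- and branch-specific deviation.

```latex
\begin{align}
\text{MC:}\quad  & \ell_{i,j} = r_i \, t_j \, e^{\varepsilon_{i,j}} \\
\text{UPM:}\quad & \ell_{i,j} = r_i \, p_j \, e^{\varepsilon_{i,j}}
\end{align}
```

In both forms the ratio $\ell_{i,j}/\ell_{k,j}$ reduces to $r_i/r_k$ up to the deviation terms, so relative gene rates stay constant. The difference is that $p_j$ is shared by all genes on a branch but need not be proportional to elapsed time, whereas the MC ties every branch to $t_j$. Large deviations $\varepsilon_{i,j}$ under either model correspond to the overdispersion mentioned above.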
Orthologous relationships between genes are routinely inferred from bidirectional best hits (BBH) in pairwise genome comparisons. However, to our knowledge, it has never been quantitatively demonstrated that orthologs form BBH. To test this “BBH-orthology conjecture,” we take advantage of the operon organization of bacterial and archaeal genomes and assume that, when two genes in compared genomes that are flanked by two BBH show statistically significant sequence similarity to one another, these genes are bona fide orthologs. Under this assumption, we tested whether middle genes in “syntenic orthologous gene triplets” form BBH. We found that this was the case in more than 95% of the syntenic gene triplets in all genome comparisons. A detailed examination of the exceptions to this pattern, including maximum likelihood phylogenetic tree analysis, showed that some of these deviations involved artifacts of genome annotation, whereas very small fractions represented random assignment of the best hit to one of closely related in-paralogs, paralogous displacement in situ, or even less frequent genuine violations of the BBH–orthology conjecture caused by acceleration of evolution in one of the orthologs. We conclude that, at least in prokaryotes, genes for which independent evidence of orthology is available typically form BBH and, conversely, BBH can serve as a strong indication of gene orthology.
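The triplet test described above can be sketched in a few lines. This is a minimal illustration, not the pipeline used in the study: the input format (dictionaries mapping `(query, subject)` gene pairs to similarity scores), the function names, and the gene identifiers are all assumptions. The sketch also omits the significance check on the middle genes and treats triplets as having the same orientation in both genomes.

```python
from typing import Dict, Optional, Set, Tuple

Scores = Dict[Tuple[str, str], float]  # (query gene, subject gene) -> score


def best_hits(scores: Scores) -> Dict[str, str]:
    """Map each query gene to its single highest-scoring subject gene."""
    best: Dict[str, Tuple[str, float]] = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b: Scores, b_vs_a: Scores) -> Set[Tuple[str, str]]:
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return {(a, b) for a, b in forward.items() if reverse.get(b) == a}


def middle_gene_forms_bbh(triplet_a: Tuple[str, str, str],
                          triplet_b: Tuple[str, str, str],
                          bbh: Set[Tuple[str, str]]) -> Optional[bool]:
    """Check the middle genes of two (left, middle, right) gene triplets.

    Returns None when the flanking genes are not both BBH, i.e. when the
    triplets do not qualify as syntenic in this simplified sense.
    """
    flanks_are_bbh = ((triplet_a[0], triplet_b[0]) in bbh
                      and (triplet_a[2], triplet_b[2]) in bbh)
    if not flanks_are_bbh:
        return None
    return (triplet_a[1], triplet_b[1]) in bbh


if __name__ == "__main__":
    # Hypothetical scores: a2 hits b9 slightly better than its neighbor b2.
    a_vs_b = {("a1", "b1"): 200, ("a2", "b2"): 150, ("a2", "b9"): 160, ("a3", "b3"): 300}
    b_vs_a = {("b1", "a1"): 200, ("b2", "a2"): 150, ("b9", "a2"): 160, ("b3", "a3"): 300}
    bbh = bidirectional_best_hits(a_vs_b, b_vs_a)
    print(middle_gene_forms_bbh(("a1", "a2", "a3"), ("b1", "b2", "b3"), bbh))
```

The example data mimic one of the exception classes mentioned in the text, where a closely related paralog takes the best hit away from the syntenic middle gene.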
orthology; bidirectional best hit; genome comparison; synteny
The virus-host arms race is a major theater for evolutionary innovation. Archaea and bacteria have evolved diverse, elaborate antivirus defense systems that function on two general principles: i) immune systems that discriminate self DNA from nonself DNA and specifically destroy the foreign, in particular viral, genomes, whereas the host genome is protected, or ii) programmed cell suicide or dormancy induced by infection.
Presentation of the hypothesis
Almost all genomic loci encoding immunity systems such as CRISPR-Cas, restriction-modification and DNA phosphorothioation also encompass suicide genes, in particular those encoding known and predicted toxin nucleases, which do not appear to be directly involved in immunity. In contrast, the immunity systems do not appear to encode antitoxins found in typical toxin-antitoxin systems. This raises the possibility that components of the immunity system themselves act as reversible inhibitors of the associated toxin proteins or domains as has been demonstrated for the Escherichia coli anticodon nuclease PrrC that interacts with the PrrI restriction-modification system. We hypothesize that coupling of diverse immunity and suicide/dormancy systems in prokaryotes evolved under selective pressure to provide robustness to the antivirus response. We further propose that the involvement of suicide/dormancy systems in the coupled antivirus response could take two distinct forms:
1) induction of a dormancy-like state in the infected cell to ‘buy time’ for activation of adaptive immunity; 2) suicide or dormancy as the final recourse to prevent viral spread triggered by the failure of immunity.
Testing the hypothesis
This hypothesis entails many experimentally testable predictions. Specifically, we predict that the Cas2 protein present in all cas operons is an mRNA-cleaving nuclease (interferase) that might be activated at an early stage of virus infection to enable incorporation of virus-specific spacers into the CRISPR locus or to trigger cell suicide when the immune function of CRISPR-Cas systems fails. Similarly, toxin-like activity is predicted for components of numerous other defense loci.
Implications of the hypothesis
The hypothesis implies that antivirus response in prokaryotes involves key decision-making steps at which the cell chooses the path to follow by sensing the course of virus infection.
This article was reviewed by Arcady Mushegian, Etienne Joly and Nick Grishin. For complete reviews, go to the Reviewers’ reports section.
There are no known RNA viruses that infect Archaea. Filling this gap in our knowledge of viruses will enhance our understanding of the relationships between RNA viruses from the three domains of cellular life and, in particular, could shed light on the origin of the enormous diversity of RNA viruses infecting eukaryotes. We describe here the identification of novel RNA viral genome segments from high-temperature acidic hot springs in Yellowstone National Park in the United States. These hot springs harbor low-complexity cellular communities dominated by several species of hyperthermophilic Archaea. A viral metagenomics approach was taken to assemble segments of these RNA virus genomes from viral populations isolated directly from hot spring samples. Analysis of these RNA metagenomes demonstrated unique gene content that is not generally related to known RNA viruses of Bacteria and Eukarya. However, genes for RNA-dependent RNA polymerase (RdRp), a hallmark of positive-strand RNA viruses, were identified in two contigs. One of these contigs is approximately 5,600 nucleotides in length and encodes a polyprotein that also contains a region homologous to the capsid protein of nodaviruses, tetraviruses, and birnaviruses. Phylogenetic analyses of the RdRps encoded in these contigs indicate that the putative archaeal viruses form a unique group that is distinct from the RdRps of RNA viruses of Eukarya and Bacteria. Collectively, our findings suggest the existence of novel positive-strand RNA viruses that probably replicate in hyperthermophilic archaeal hosts and are highly divergent from RNA viruses that infect eukaryotes and even more distant from known bacterial RNA viruses. These positive-strand RNA viruses might be direct ancestors of RNA viruses of eukaryotes.
Viruses with large genomes encode numerous proteins that do not directly participate in virus biogenesis but rather modify key functional systems of infected cells. We report that a distinct group of giant viruses infecting unicellular eukaryotes that includes Organic Lake Phycodnaviruses and Phaeocystis globosa virus encode predicted proteorhodopsins that have not been previously detected in viruses. Search of metagenomic sequence data shows that putative viral proteorhodopsins are extremely abundant in marine environments. Phylogenetic analysis suggests that giant viruses acquired proteorhodopsins via horizontal gene transfer from proteorhodopsin-encoding protists although the actual donor(s) could not be presently identified. The pattern of conservation of the predicted functionally important amino acid residues suggests that viral proteorhodopsin homologs function as sensory rhodopsins. We hypothesize that viral rhodopsins modulate light-dependent signaling, in particular phototaxis, in infected protists.
This article was reviewed by Igor B. Zhulin and Laksminarayan M. Iyer. For the full reviews, see the Reviewers’ reports section.
The serotonin 2C receptor (5-HT2CR)–a key regulator of diverse neurological processes–exhibits functional variability derived from editing of its pre-mRNA by site-specific adenosine deamination (A-to-I pre-mRNA editing) in five distinct sites. Here we describe a statistical technique that was developed for analysis of the dependencies among the editing states of the five sites. The statistical significance of the observed correlations was estimated by comparing editing patterns in multiple individuals. For both human and rat 5-HT2CR, the editing states of the physically proximal sites A and B were found to be strongly dependent. In contrast, the editing states of sites C and D, which are also physically close, seem not to be directly dependent but instead are linked through the dependencies on sites A and B, respectively. We observed pronounced differences between the editing patterns in humans and rats: in humans site A is the key determinant of the editing state of the other sites, whereas in rats this role belongs to site B. The structure of the dependencies among the editing sites is notably simpler in rats than it is in humans, implying more complex regulation of 5-HT2CR editing and, by inference, function in the human brain. Thus, exhaustive statistical analysis of the 5-HT2CR editing patterns indicates that the editing state of sites A and B is the primary determinant of the editing states of the other three sites, and hence the overall editing pattern. Taken together, these findings allow us to propose a mechanistic model of concerted action of ADAR1 and ADAR2 in 5-HT2CR editing. The statistical approach developed here can be applied to other cases of interdependencies among modification sites in RNA and proteins.
The serotonin receptor 2C is a key regulator of diverse neurological processes that affect feeding behavior, sleep, sexual behavior, anxiety and depression. The function of the receptor itself is regulated via so-called pre-mRNA editing, i.e. site-specific adenosine deamination in five distinct sites. The greater the number of edited sites in the serotonin receptor mRNA, the lower the activity of the receptor it encodes. Here we used the results of extensive massively parallel sequencing from human and rat brains to elucidate the dependencies among the editing states of the five sites. Despite the apparent simplicity of the problem, disambiguation of these dependencies is a difficult task that required development of a new statistical technique. We employed this method to analyse the dependencies among editing in the 5 susceptible sites of the receptor mRNA and found that the proximal, juxtaposed sites A and B are strongly interdependent, and that the editing state of these two sites is a major determinant of the editing states of the other three sites, and hence the overall editing pattern. The statistical approach we developed for the analysis of mRNA editing can be applied to other cases of multiple site modification in RNA and proteins.
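As a rough illustration of what "dependency between editing states" means, pairwise mutual information between two sites can be computed from sequenced reads. This is a generic, swapped-in measure and not the statistical technique developed in the study. The read encoding (a string of five characters, `1` for edited and `0` for unedited) and the example reads are assumptions.

```python
from collections import Counter
from math import log2
from typing import Sequence


def mutual_information(reads: Sequence[str], i: int, j: int) -> float:
    """Mutual information (in bits) between the states of sites i and j.

    Each read is a string with one character per site; 0 bits means the two
    sites look independent in the sample, higher values mean dependence.
    """
    n = len(reads)
    joint = Counter((read[i], read[j]) for read in reads)
    site_i = Counter(read[i] for read in reads)
    site_j = Counter(read[j] for read in reads)
    total = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        total += p_ab * log2(p_ab / ((site_i[a] / n) * (site_j[b] / n)))
    return total


if __name__ == "__main__":
    # Hypothetical reads: sites 0 and 1 always agree, site 4 varies freely.
    reads = ["11000", "00000", "11001", "00001"]
    print(mutual_information(reads, 0, 1))  # perfectly coupled sites
    print(mutual_information(reads, 0, 4))  # independent sites
```

A pairwise measure of this kind cannot by itself separate direct dependencies from indirect ones, such as sites C and D being linked only through A and B, which is part of why the problem calls for a more elaborate method.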
Prions are agents of analog, protein conformation-based inheritance that can confer beneficial phenotypes to cells, especially under stress. Combined with genetic variation, prion-mediated inheritance can be channeled into prion-independent genomic inheritance. Latest screening shows that prions are common, at least in fungi. Thus, there is non-negligible flow of information from proteins to the genome in modern cells, in a direct violation of the Central Dogma of molecular biology. The prion-mediated heredity that violates the Central Dogma appears to be a specific, most radical manifestation of the widespread assimilation of protein (epigenetic) variation into genetic variation. The epigenetic variation precedes and facilitates genetic adaptation through a general ‘look-ahead effect’ of phenotypic mutations. This direction of the information flow is likely to be one of the important routes of environment-genome interaction and could substantially contribute to the evolution of complex adaptive traits.
This article was reviewed by Jerzy Jurka, Pierre Pontarotti and Juergen Brosius. For the complete reviews, see the Reviewers’ Reports section.
The Nucleo-Cytoplasmic Large DNA Viruses (NCLDV) constitute an apparently monophyletic group that consists of at least 6 families of viruses infecting a broad variety of eukaryotic hosts. A comprehensive genome comparison and maximum-likelihood reconstruction of the NCLDV evolution revealed a set of approximately 50 conserved, core genes that could be mapped to the genome of the common ancestor of this class of eukaryotic viruses.
We performed a detailed phylogenetic analysis of these core NCLDV genes and applied the constrained tree approach to show that the majority of the core genes are unlikely to be monophyletic. Several of the core genes have been independently acquired from different sources by different NCLDV lineages whereas for the majority of these genes displacement by homologs from cellular organisms in one or more groups of the NCLDV was demonstrated.
A detailed study of the evolution of the genomic core of the NCLDV reveals substantial complexity and diversity of evolutionary scenarios that was largely unsuspected previously. The phylogenetic coherence between the core genes is sufficient to validate the hypothesis on the evolution of all NCLDV from a common ancestral virus although the set of ancestral genes might be smaller than previously inferred from patterns of gene presence-absence.
A novel bacteriophage infecting Staphylococcus pasteuri was isolated during a screen for phages in Antarctic soils. The phage named SpaA1 is morphologically similar to phages of the family Siphoviridae. The 42,784 bp genome of SpaA1 is a linear, double-stranded DNA molecule with 3′ protruding cohesive ends. The SpaA1 genome encompasses 63 predicted protein-coding genes which cluster within three regions of the genome, each of apparently different origin, in a mosaic pattern. In two of these regions, the gene sets resemble those in prophages of Bacillus thuringiensis kurstaki str. T03a001 (genes involved in DNA replication/transcription, cell entry and exit) and B. cereus AH676 (additional regulatory and recombination genes), respectively. The third region represents an almost complete genome (except for the short terminal segments) of a distinct bacteriophage, MZTP02. Nearly the same gene module was identified in prophages of B. thuringiensis serovar monterrey BGSC 4AJ1 and B. cereus Rock4-2. These findings suggest that MZTP02 can be shuttled between genomes of other bacteriophages and prophages, leading to the formation of chimeric genomes. The presence of a complete phage genome in the genome of other phages apparently has not been described previously and might represent a ‘fast track’ route of virus evolution and horizontal gene transfer. Another phage (BceA1) nearly identical in sequence to SpaA1, and also including the almost complete MZTP02 genome within its own genome, was isolated from a bacterium of the B. cereus/B. thuringiensis group. Remarkably, both SpaA1 and BceA1 phages can infect B. cereus and B. thuringiensis, but only one of them, SpaA1, can infect S. pasteuri. This finding is best compatible with a scenario in which MZTP02 was originally contained in BceA1 infecting Bacillus spp, the common hosts for these two phages, followed by emergence of SpaA1 infecting S. pasteuri.
In a recent BMC Evolutionary Biology article, Huiquan Liu and colleagues report two new genomes of double-stranded RNA (dsRNA) viruses from fungi and use these as a springboard to perform an extensive phylogenomic analysis of dsRNA viruses. The results support the old scenario of polyphyletic origin of dsRNA viruses from different groups of positive-strand RNA viruses and additionally reveal extensive horizontal gene transfer between diverse viruses consistent with the network-like rather than tree-like mode of viral evolution. Together with the unexpected discoveries of the first putative archaeal RNA virus and an RNA-DNA virus hybrid, this work shows that RNA viral genomics has major surprises to deliver.
See research article: http://www.biomedcentral.com/1471-2148/12/91
Archaeal and bacterial ribosomes contain more than 50 proteins, including 34 that are universally conserved in the three domains of cellular life (bacteria, archaea, and eukaryotes). Despite the high sequence conservation, annotation of ribosomal (r-) protein genes is often difficult because of their short lengths and biased sequence composition. We developed an automated computational pipeline for identification of r-protein genes and applied it to 995 completely sequenced bacterial and 87 archaeal genomes available in the RefSeq database. The pipeline employs curated seed alignments of r-proteins to run position-specific scoring matrix (PSSM)-based BLAST searches against six-frame genome translations, mitigating possible gene annotation errors. As a result of this analysis, we performed a census of prokaryotic r-protein complements, enumerated missing and paralogous r-proteins, and analyzed the distributions of ribosomal protein genes among chromosomal partitions. Phyletic patterns of bacterial and archaeal r-protein genes were mapped to phylogenetic trees reconstructed from concatenated alignments of r-proteins to reveal the history of likely multiple independent gains and losses. These alignments, available for download, can be used as search profiles to improve genome annotation of r-proteins and for further comparative genomics studies.
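Searching six-frame genome translations, rather than annotated proteins, is what lets such a pipeline find r-protein genes that gene callers missed. The translation step can be sketched as follows. This is a generic illustration using the standard genetic code, not the authors' pipeline. The function names are assumptions, and the PSSM-based BLAST search itself is not shown.

```python
from typing import Dict

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate complete codons; '*' marks stops, 'X' marks ambiguous codons."""
    return "".join(
        CODON_TABLE.get(seq[pos:pos + 3], "X")
        for pos in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq: str) -> Dict[str, str]:
    """Translations of frames +1..+3 (forward) and -1..-3 (reverse strand)."""
    forward = seq.upper()
    reverse = reverse_complement(forward)
    frames: Dict[str, str] = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(forward[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translation("ATGGCCTAA").items():
        print(frame, protein)
```

In practice the translated frames would be split at stop codons and searched with the profiles. Short genes with biased composition are exactly the ones most likely to be absent from annotation but present in one of these frames.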
Evolution of exon-intron structure of eukaryotic genes has been a matter of long-standing, intensive debate. The introns-early concept, later rebranded ‘introns first’ held that protein-coding genes were interrupted by numerous introns even at the earliest stages of life's evolution and that introns played a major role in the origin of proteins by facilitating recombination of sequences coding for small protein/peptide modules. The introns-late concept held that introns emerged only in eukaryotes and new introns have been accumulating continuously throughout eukaryotic evolution. Analysis of orthologous genes from completely sequenced eukaryotic genomes revealed numerous shared intron positions in orthologous genes from animals and plants and even between animals, plants and protists, suggesting that many ancestral introns have persisted since the last eukaryotic common ancestor (LECA). Reconstructions of intron gain and loss using the growing collection of genomes of diverse eukaryotes and increasingly advanced probabilistic models convincingly show that the LECA and the ancestors of each eukaryotic supergroup had intron-rich genes, with intron densities comparable to those in the most intron-rich modern genomes such as those of vertebrates. The subsequent evolution in most lineages of eukaryotes involved primarily loss of introns, with only a few episodes of substantial intron gain that might have accompanied major evolutionary innovations such as the origin of metazoa. The original invasion of self-splicing Group II introns, presumably originating from the mitochondrial endosymbiont, into the genome of the emerging eukaryote might have been a key factor of eukaryogenesis that in particular triggered the origin of endomembranes and the nucleus. Conversely, splicing errors gave rise to alternative splicing, a major contribution to the biological complexity of multicellular eukaryotes. 
There is no indication that any prokaryote has ever possessed a spliceosome or introns in protein-coding genes, other than relatively rare mobile self-splicing introns. Thus, the introns-first scenario is not supported by any evidence; rather, the exon-intron structure of protein-coding genes appears to have evolved concomitantly with the eukaryotic cell, and introns were a major factor of evolution throughout the history of eukaryotes. This article was reviewed by I. King Jordan, Manuel Irimia (nominated by Anthony Poole), Tobias Mourier (nominated by Anthony Poole), and Fyodor Kondrashov. For the complete reports, see the Reviewers’ Reports section.
Intron sliding; Intron gain; Intron loss; Spliceosome; Splicing signals; Evolution of exon/intron structure; Alternative splicing; Phylogenetic trees; Mobile domains; Eukaryotic ancestor
Tubulins are a family of GTPases that are key components of the cytoskeleton in all eukaryotes and are distantly related to the FtsZ GTPase that is involved in cell division in most bacteria and many archaea. Among prokaryotes, bona fide tubulins have been identified only in bacteria of the genus Prosthecobacter. These bacterial tubulin genes appear to have been horizontally transferred from eukaryotes. Here we describe tubulins encoded in the genomes of thaumarchaeota of the genus Nitrosoarchaeum, which we denote artubulins. The results of phylogenetic analysis are compatible with the origin of eukaryotic tubulins from artubulins. These findings expand the emerging picture of the origin of key components of eukaryotic functional systems from ancestral forms that are scattered among the extant archaea.
This article was reviewed by Gáspár Jékely and J. Peter Gogarten.
In eukaryotes, the CMG (CDC45, MCM, GINS) complex containing the replicative helicase MCM is a key player in DNA replication. Archaeal homologs of the eukaryotic MCM and GINS proteins have been identified but until recently no homolog of the CDC45 protein was known. Two recent developments, namely the discovery of archaeal GINS-associated nuclease (GAN) that belongs to the RecJ family of the DHH hydrolase superfamily and the demonstration of homology between the DHH domains of CDC45 and RecJ, show that at least some Archaea possess a full complement of homologs of the CMG complex subunits. Here we present the results of in-depth phylogenomic analysis of RecJ homologs in archaea.
We confirm and extend the recent hypothesis that CDC45 is the eukaryotic ortholog of the bacterial and archaeal RecJ family nucleases. At least one RecJ homolog was identified in all sequenced archaeal genomes, with the single exception of Caldivirga maquilingensis. These proteins include previously unnoticed remote RecJ homologs with inactivated DHH domain in Thermoproteales. Combined with phylogenetic tree reconstruction of diverse eukaryotic, archaeal and bacterial DHH subfamilies, this analysis yields a complex scenario of RecJ family evolution in Archaea which includes independent inactivation of the nuclease domain in Crenarchaeota and Halobacteria, and loss of this domain in Methanococcales.
The archaeal complex of a CDC45/RecJ homolog, MCM and GINS is homologous and most likely functionally analogous to the eukaryotic CMG complex, and appears to be a key component of the DNA replication machinery in all Archaea. It is inferred that the last common archaeo-eukaryotic ancestor encoded a CMG complex that contained an active nuclease of the RecJ family. The inactivated RecJ homologs in several archaeal lineages most likely are dedicated structural components of replication complexes.
This article was reviewed by Prof. Patrick Forterre, Dr. Stephen John Aves (nominated by Dr. Purificacion Lopez-Garcia) and Prof. Martijn Huynen.
For the full reviews, see the Reviewers' Comments section.