We compare the sets of experimentally validated long intergenic non-coding (linc)RNAs from human and mouse and apply a maximum likelihood approach to estimate the total number of lincRNA genes as well as the size of the conserved part of the lincRNome. Under the assumption that the sets of experimentally validated lincRNAs are random samples of the lincRNomes of the corresponding species, we estimate the total lincRNome size at approximately 40,000 to 50,000 species, at least twice the number of protein-coding genes. We further estimate that the fraction of the human and mouse euchromatic genomes encoding lincRNAs is more than twofold greater than the fraction of protein-coding sequences. Although the sequences of most lincRNAs are much less strongly conserved than protein sequences, the extent of orthology between the lincRNomes is unexpectedly high, with 60 to 70% of the lincRNA genes shared between human and mouse. The orthologous mammalian lincRNAs can be predicted to perform equivalent functions; accordingly, it appears likely that thousands of evolutionarily conserved functional roles of lincRNAs remain to be characterized.
Genome analysis of humans and other mammals reveals a surprisingly small number of protein-coding genes, only slightly over 20,000 (although the diversity of actual proteins is substantially augmented by alternative transcription and alternative splicing). Recent analysis of the mammalian genomes and transcriptomes, in particular, using the RNAseq technology, shows that, in addition to protein-coding genes, mammalian genomes encode many long non-coding RNAs. For some of these transcripts, various regulatory functions have been demonstrated, but on the whole the repertoire of long non-coding RNAs remains poorly characterized. We compared the identified long intergenic non-coding (linc)RNAs from human and mouse, and employed a specially developed statistical technique to estimate the size and evolutionary conservation of the human and mouse lincRNomes. The estimates show that there are at least twice as many human and mouse lincRNAs than there are protein-coding genes. Moreover, about two third of the lincRNA genes appear to be conserved between human and mouse, implying thousands of conserved but still uncharacterized functions.
Evolution of prokaryotes involves extensive loss and gain of genes, which lead to substantial differences in the gene repertoires even among closely related organisms. Through a wide range of phylogenetic depths, gene frequency distributions in prokaryotic pangenomes bear a characteristic, asymmetrical U-shape, with a core of (nearly) universal genes, a “shell” of moderately common genes, and a “cloud” of rare genes. We employ mathematical modeling to investigate evolutionary processes that might underlie this universal pattern. Gene frequency distributions for almost 400 groups of 10 bacterial or archaeal species each over a broad range of evolutionary distances were fit to steady-state, infinite allele models based on the distribution of gene replacement rates and the phylogenetic tree relating the species in each group. The fits of the theoretical frequency distributions to the empirical ones yield model parameters and estimates of the goodness of fit. Using the Akaike Information Criterion, we show that the neutral model of genome evolution, with the same replacement rate for all genes, can be confidently rejected. Of the three tested models with purifying selection, the one in which the distribution of replacement rates is derived from a stochastic population model with additive per-gene fitness yields the best fits to the data. The selection strength estimated from the fits declines with evolutionary divergence while staying well outside the neutral regime. These findings indicate that, unlike some other universal distributions of genomic variables, for example, the distribution of paralogous gene family membership, the gene frequency distribution is substantially affected by selection.
gene frequency distribution; steady genome model; goodness of fit; evolution mechanisms
Gene fusions can be used as tools for functional prediction and also as evolutionary markers. Fused genes often show a scattered phyletic distribution, which suggests a role for processes other than vertical inheritance in their evolution.
The evolutionary history of gene fusions was studied by phylogenetic analysis of the domains in the fused proteins and the orthologous domains that form stand-alone proteins. Clustering of fusion components from phylogenetically distant species was construed as evidence of dissemination of the fused genes by horizontal transfer. Of the 51 examined gene fusions that are represented in at least two of the three primary kingdoms (Bacteria, Archaea and Eukaryota), 31 were most probably disseminated by cross-kingdom horizontal gene transfer, whereas 14 appeared to have evolved independently in different kingdoms and two were probably inherited from the common ancestor of modern life forms. On many occasions, the evolutionary scenario also involves one or more secondary fissions of the fusion gene. For approximately half of the fusions, stand-alone forms of the fusion components are encoded by juxtaposed genes, which are known or predicted to belong to the same operon in some of the prokaryotic genomes. This indicates that evolution of gene fusions often, if not always, involves an intermediate stage, during which the future fusion components exist as juxtaposed and co-regulated, but still distinct, genes within operons.
These findings suggest a major role for horizontal transfer of gene fusions in the evolution of protein-domain architectures, but also indicate that independent fusions of the same pair of domains in distant species is not uncommon, which suggests positive selection for the multidomain architectures.
Ribosomal proteins are encoded in all genomes of cellular life forms and are, generally, well conserved during evolution. In prokaryotes, the genes for most ribosomal proteins are clustered in several highly conserved operons, which ensures efficient co-regulation of their expression. Duplications of ribosomal-protein genes are infrequent, and given their coordinated expression and functioning, it is generally assumed that ribosomal-protein genes are unlikely to undergo horizontal transfer. However, with the accumulation of numerous complete genome sequences of prokaryotes, several paralogous pairs of ribosomal protein genes have been identified. Here we analyze all such cases and attempt to reconstruct the evolutionary history of these ribosomal proteins.
Complete bacterial genomes were searched for duplications of ribosomal proteins. Ribosomal proteins L36, L33, L31, S14 are each duplicated in several bacterial genomes and ribosomal proteins L11, L28, L7/L12, S1, S15, S18 are so far duplicated in only one genome each. Sequence analysis of the four ribosomal proteins, for which paralogs were detected in several genomes, two of the ribosomal proteins duplicated in one genome (L28 and S18), and the ribosomal protein L32 showed that each of them comes in two distinct versions. One form contains a predicted metal-binding Zn-ribbon that consists of four conserved cysteines (in some cases replaced by histidines), whereas, in the second form, these metal-chelating residues are completely or partially replaced. Typically, genomes containing paralogous genes for these ribosomal proteins encode both versions, designated C+ and C-, respectively. Analysis of phylogenetic trees for these seven ribosomal proteins, combined with comparison of genomic contexts for the respective genes, indicates that in most, if not all cases, their evolution involved a duplication of the ancestral C+ form early in bacterial evolution, with subsequent alternative loss of the C+ and C- forms in different lineages. Additionally, evidence was obtained for a role of horizontal gene transfer in the evolution of these ribosomal proteins, with multiple cases of gene displacement 'in situ', that is, without a change of the gene order in the recipient genome.
A more complex picture of evolution of bacterial ribosomal proteins than previously suspected is emerging from these results, with major contributions of lineage-specific gene loss and horizontal gene transfer. The recurrent theme of emergence and disruption of Zn-ribbons in bacterial ribosomal proteins awaits a functional interpretation.
Collections of Clusters of Orthologous Genes (COGs) provide indispensable tools for comparative genomic analysis, evolutionary reconstruction and functional annotation of new genomes. Initially, COGs were made for all complete genomes of cellular life forms that were available at the time. However, with the accumulation of thousands of complete genomes, construction of a comprehensive COG set has become extremely computationally demanding and prone to error propagation, necessitating the switch to taxon-specific COG collections. Previously, we reported the collection of COGs for 41 genomes of Archaea (arCOGs). Here we present a major update of the arCOGs and describe evolutionary reconstructions to reveal general trends in the evolution of Archaea.
The updated version of the arCOG database incorporates 91% of the pangenome of 120 archaea (251,032 protein-coding genes altogether) into 10,335 arCOGs. Using this new set of arCOGs, we performed maximum likelihood reconstruction of the genome content of archaeal ancestral forms and gene gain and loss events in archaeal evolution. This reconstruction shows that the last Common Ancestor of the extant Archaea was an organism of greater complexity than most of the extant archaea, probably with over 2,500 protein-coding genes. The subsequent evolution of almost all archaeal lineages was apparently dominated by gene loss resulting in genome streamlining. Overall, in the evolution of Archaea as well as a representative set of bacteria that was similarly analyzed for comparison, gene losses are estimated to outnumber gene gains at least 4 to 1. Analysis of specific patterns of gene gain in Archaea shows that, although some groups, in particular Halobacteria, acquire substantially more genes than others, on the whole, gene exchange between major groups of Archaea appears to be largely random, with no major ‘highways’ of horizontal gene transfer.
The updated collection of arCOGs is expected to become a key resource for comparative genomics, evolutionary reconstruction and functional annotation of new archaeal genomes. Given that, in spite of the major increase in the number of genomes, the conserved core of archaeal genes appears to be stabilizing, the major evolutionary trends revealed here have a chance to stand the test of time.
This article was reviewed by (for complete reviews see the Reviewers’ Reports section): Dr. PLG, Prof. PF, Dr. PL (nominated by Prof. JPG).
Archaea; Orthologs; Horizontal gene transfer
Bacteria and archaea face continual onslaughts of rapidly diversifying viruses and plasmids. Many prokaryotes maintain adaptive immune systems known as clustered regularly interspaced short palindromic repeats (CRISPR) and CRISPR-associated genes (Cas). CRISPR-Cas systems are genomic sensors that serially acquire viral and plasmid DNA fragments (spacers) that are utilized to target and cleave matching viral and plasmid DNA in subsequent genomic invasions, offering critical immunological memory. Only 50% of sequenced bacteria possess CRISPR-Cas immunity, in contrast to over 90% of sequenced archaea. To probe why half of bacteria lack CRISPR-Cas immunity, we combined comparative genomics and mathematical modeling. Analysis of hundreds of diverse prokaryotic genomes shows that CRISPR-Cas systems are substantially more prevalent in thermophiles than in mesophiles. With sequenced bacteria disproportionately mesophilic and sequenced archaea mostly thermophilic, the presence of CRISPR-Cas appears to depend more on environmental temperature than on bacterial-archaeal taxonomy. Mutation rates are typically severalfold higher in mesophilic prokaryotes than in thermophilic prokaryotes. To quantitatively test whether accelerated viral mutation leads microbes to lose CRISPR-Cas systems, we developed a stochastic model of virus-CRISPR coevolution. The model competes CRISPR-Cas-positive (CRISPR-Cas+) prokaryotes against CRISPR-Cas-negative (CRISPR-Cas−) prokaryotes, continually weighing the antiviral benefits conferred by CRISPR-Cas immunity against its fitness costs. Tracking this cost-benefit analysis across parameter space reveals viral mutation rate thresholds beyond which CRISPR-Cas cannot provide sufficient immunity and is purged from host populations. These results offer a simple, testable viral diversity hypothesis to explain why mesophilic bacteria disproportionately lack CRISPR-Cas immunity. More generally, fundamental limits on the adaptability of biological sensors (Lamarckian evolution) are predicted.
A remarkable recent discovery in microbiology is that bacteria and archaea possess systems conferring immunological memory and adaptive immunity. Clustered regularly interspaced short palindromic repeats (CRISPR) and CRISPR-associated genes (CRISPR-Cas) are genomic sensors that allow prokaryotes to acquire DNA fragments from invading viruses and plasmids. Providing immunological memory, these stored fragments destroy matching DNA in future viral and plasmid invasions. CRISPR-Cas systems also provide adaptive immunity, keeping up with mutating viruses and plasmids by continually acquiring new DNA fragments. Surprisingly, less than 50% of mesophilic bacteria, in contrast to almost 90% of thermophilic bacteria and Archaea, maintain CRISPR-Cas immunity. Using mathematical modeling, we probe this dichotomy, showing how increased viral mutation rates can explain the reduced prevalence of CRISPR-Cas systems in mesophiles. Rapidly mutating viruses outrun CRISPR-Cas immune systems, likely decreasing their prevalence in bacterial populations. Thus, viral adaptability may select against, rather than for, immune adaptability in prokaryotes.
The 1,021,348 base pair genome sequence of the Acanthamoeba polyphaga moumouvirus, a new member of the Mimiviridae family infecting Acanthamoeba polyphaga, is reported. The moumouvirus represents a third lineage beside mimivirus and megavirus. Thereby, it is a new member of the recently proposed Megavirales order. This giant virus was isolated from a cooling tower water in southeastern France but is most closely related to Megavirus chiliensis, which was isolated from ocean water off the coast of Chile. The moumouvirus is predicted to encode 930 proteins, of which 879 have detectable homologs. Among these predicted proteins, for 702 the closest homolog was detected in Megavirus chiliensis, with the median amino acid sequence identity of 62%. The evolutionary affinity of moumouvirus and megavirus was further supported by phylogenetic tree analysis of conserved genes. The moumouvirus and megavirus genomes share near perfect orthologous gene collinearity in the central part of the genome, with the variations concentrated in the terminal regions. In addition, genomic comparisons of the Mimiviridae reveal substantial gene loss in the moumouvirus lineage. The majority of the remaining moumouvirus proteins are most similar to homologs from other Mimiviridae members, and for 27 genes the closest homolog was found in bacteria. Phylogenetic analysis of these genes supported gene acquisition from diverse bacteria after the separation of the moumouvirus and megavirus lineages. Comparative genome analysis of the three lineages of the Mimiviridae revealed significant mobility of Group I self-splicing introns, with the highest intron content observed in the moumouvirus genome.
moumouvirus; mimivirus; giant virus; megavirus; Mimiviridae; Megavirales; horizontal gene transfer; viral genome; nucleo-cytoplasmic large DNA viruses
A fundamental observation of comparative genomics is that the distribution of evolution rates across the complete sets of orthologous genes in pairs of related genomes remains virtually unchanged throughout the evolution of life, from bacteria to mammals. The most straightforward explanation for the conservation of this distribution appears to be that the relative evolution rates of all genes remain nearly constant, or in other words, that evolutionary rates of different genes are strongly correlated within each evolving genome. This correlation could be explained by a model that we denoted Universal PaceMaker (UPM) of genome evolution. The UPM model posits that the rate of evolution changes synchronously across genome-wide sets of genes in all evolving lineages. Alternatively, however, the correlation between the evolutionary rates of genes could be a simple consequence of molecular clock (MC). We sought to differentiate between the MC and UPM models by fitting thousands of phylogenetic trees for bacterial and archaeal genes to supertrees that reflect the dominant trend of vertical descent in the evolution of archaea and bacteria and that were constrained according to the two models. The goodness of fit for the UPM model was better than the fit for the MC model, with overwhelming statistical significance, although similarly to the MC, the UPM is strongly overdispersed. Thus, the results of this analysis reveal a universal, genome-wide pacemaker of evolution that could have been in operation throughout the history of life.
A central concept of evolution is Molecular Clock according to which each gene evolves at a characteristic, near constant rate. Numerous studies support the Molecular Clock hypothesis in principle but also show that the clock is indeed very approximate. Genome-wide comparative analysis of phylogenetic trees described here reveals a distinct, more general feature of genome evolution that we called Universal Pacemaker. Under this model, when the rate of evolution changes, the change occurs synchronously in many if not all genes in the evolving genome. In other words, the relative rates of gene evolution remain constant across long evolutionary spans: if a gene is slow relative to the rest of the genes in the given lineage, it is always slow, and if it evolves fast, it is always fast. We show here that the Universal Pacemaker model fits the available data much better than the traditional Molecular Clock model. These findings are compatible with the previously observed accelerations and decelerations of evolution in individual lineages but we show that synchronous, genome-wide change of evolutionary rates is a global feature of genome evolution that appears to pervade the entire history of life.
Orthologous relationships between genes are routinely inferred from bidirectional best hits (BBH) in pairwise genome comparisons. However, to our knowledge, it has never been quantitatively demonstrated that orthologs form BBH. To test this “BBH-orthology conjecture,” we take advantage of the operon organization of bacterial and archaeal genomes and assume that, when two genes in compared genomes are flanked by two BBH show statistically significant sequence similarity to one another, these genes are bona fide orthologs. Under this assumption, we tested whether middle genes in “syntenic orthologous gene triplets” form BBH. We found that this was the case in more than 95% of the syntenic gene triplets in all genome comparisons. A detailed examination of the exceptions to this pattern, including maximum likelihood phylogenetic tree analysis, showed that some of these deviations involved artifacts of genome annotation, whereas very small fractions represented random assignment of the best hit to one of closely related in-paralogs, paralogous displacement in situ, or even less frequent genuine violations of the BBH–orthology conjecture caused by acceleration of evolution in one of the orthologs. We conclude that, at least in prokaryotes, genes for which independent evidence of orthology is available typically form BBH and, conversely, BBH can serve as a strong indication of gene orthology.
orthology; bidirectional best hit; genome comparison; synteny
The virus-host arms race is a major theater for evolutionary innovation. Archaea and bacteria have evolved diverse, elaborate antivirus defense systems that function on two general principles: i) immune systems that discriminate self DNA from nonself DNA and specifically destroy the foreign, in particular viral, genomes, whereas the host genome is protected, or ii) programmed cell suicide or dormancy induced by infection.
Presentation of the hypothesis
Almost all genomic loci encoding immunity systems such as CRISPR-Cas, restriction-modification and DNA phosphorothioation also encompass suicide genes, in particular those encoding known and predicted toxin nucleases, which do not appear to be directly involved in immunity. In contrast, the immunity systems do not appear to encode antitoxins found in typical toxin-antitoxin systems. This raises the possibility that components of the immunity system themselves act as reversible inhibitors of the associated toxin proteins or domains as has been demonstrated for the Escherichia coli anticodon nuclease PrrC that interacts with the PrrI restriction-modification system. We hypothesize that coupling of diverse immunity and suicide/dormancy systems in prokaryotes evolved under selective pressure to provide robustness to the antivirus response. We further propose that the involvement of suicide/dormancy systems in the coupled antivirus response could take two distinct forms:
1) induction of a dormancy-like state in the infected cell to ‘buy time’ for activation of adaptive immunity; 2) suicide or dormancy as the final recourse to prevent viral spread triggered by the failure of immunity.
Testing the hypothesis
This hypothesis entails many experimentally testable predictions. Specifically, we predict that Cas2 protein present in all cas operons is a mRNA-cleaving nuclease (interferase) that might be activated at an early stage of virus infection to enable incorporation of virus-specific spacers into the CRISPR locus or to trigger cell suicide when the immune function of CRISPR-Cas systems fails. Similarly, toxin-like activity is predicted for components of numerous other defense loci.
Implications of the hypothesis
The hypothesis implies that antivirus response in prokaryotes involves key decision-making steps at which the cell chooses the path to follow by sensing the course of virus infection.
This article was reviewed by Arcady Mushegian, Etienne Joly and Nick Grishin. For complete reviews, go to the Reviewers’ reports section.
Many viruses infecting animals and plants share common cores of homologous genes involved in the key processes of viral replication. In contrast, genes that mediate virus – host interactions including in many cases capsid protein genes are markedly different. There are three distinct scenarios for the origin of related viruses of plants and animals: i) evolution from a common ancestral virus predating the divergence of plants and animals; ii) horizontal transfer of viruses, for example, through insect vectors; iii) parallel origin from related genetic elements. We present evidence that each of these scenarios contributed, to a varying extent, to the evolution of different groups of viruses.
There are no known RNA viruses that infect Archaea. Filling this gap in our knowledge of viruses will enhance our understanding of the relationships between RNA viruses from the three domains of cellular life and, in particular, could shed light on the origin of the enormous diversity of RNA viruses infecting eukaryotes. We describe here the identification of novel RNA viral genome segments from high-temperature acidic hot springs in Yellowstone National Park in the United States. These hot springs harbor low-complexity cellular communities dominated by several species of hyperthermophilic Archaea. A viral metagenomics approach was taken to assemble segments of these RNA virus genomes from viral populations isolated directly from hot spring samples. Analysis of these RNA metagenomes demonstrated unique gene content that is not generally related to known RNA viruses of Bacteria and Eukarya. However, genes for RNA-dependent RNA polymerase (RdRp), a hallmark of positive-strand RNA viruses, were identified in two contigs. One of these contigs is approximately 5,600 nucleotides in length and encodes a polyprotein that also contains a region homologous to the capsid protein of nodaviruses, tetraviruses, and birnaviruses. Phylogenetic analyses of the RdRps encoded in these contigs indicate that the putative archaeal viruses form a unique group that is distinct from the RdRps of RNA viruses of Eukarya and Bacteria. Collectively, our findings suggest the existence of novel positive-strand RNA viruses that probably replicate in hyperthermophilic archaeal hosts and are highly divergent from RNA viruses that infect eukaryotes and even more distant from known bacterial RNA viruses. These positive-strand RNA viruses might be direct ancestors of RNA viruses of eukaryotes.
Viruses with large genomes encode numerous proteins that do not directly participate in virus biogenesis but rather modify key functional systems of infected cells. We report that a distinct group of giant viruses infecting unicellular eukaryotes that includes Organic Lake Phycodnaviruses and Phaeocystis globosa virus encode predicted proteorhodopsins that have not been previously detected in viruses. Search of metagenomic sequence data shows that putative viral proteorhodopsins are extremely abundant in marine environments. Phylogenetic analysis suggests that giant viruses acquired proteorhodopsins via horizontal gene transfer from proteorhodopsin-encoding protists although the actual donor(s) could not be presently identified. The pattern of conservation of the predicted functionally important amino acid residues suggests that viral proteorhodopsin homologs function as sensory rhodopsins. We hypothesize that viral rhodopsins modulate light-dependent signaling, in particular phototaxis, in infected protists.
This article was reviewed by Igor B. Zhulin and Laksminarayan M. Iyer. For the full reviews, see the Reviewers’ reports section.
The serotonin 2C receptor (5-HT2CR)–a key regulator of diverse neurological processes–exhibits functional variability derived from editing of its pre-mRNA by site-specific adenosine deamination (A-to-I pre-mRNA editing) in five distinct sites. Here we describe a statistical technique that was developed for analysis of the dependencies among the editing states of the five sites. The statistical significance of the observed correlations was estimated by comparing editing patterns in multiple individuals. For both human and rat 5-HT2CR, the editing states of the physically proximal sites A and B were found to be strongly dependent. In contrast, the editing states of sites C and D, which are also physically close, seem not to be directly dependent but instead are linked through the dependencies on sites A and B, respectively. We observed pronounced differences between the editing patterns in humans and rats: in humans site A is the key determinant of the editing state of the other sites, whereas in rats this role belongs to site B. The structure of the dependencies among the editing sites is notably simpler in rats than it is in humans implying more complex regulation of 5-HT2CR editing and, by inference, function in the human brain. Thus, exhaustive statistical analysis of the 5-HT2CR editing patterns indicates that the editing state of sites A and B is the primary determinant of the editing states of the other three sites, and hence the overall editing pattern. Taken together, these findings allow us to propose a mechanistic model of concerted action of ADAR1 and ADAR2 in 5-HT2CR editing. Statistical approach developed here can be applied to other cases of interdependencies among modification sites in RNA and proteins.
The serotonin receptor 2C is a key regulator of diverse neurological processes that affect feeding behavior, sleep, sexual behavior, anxiety and depression. The function of the receptor itself is regulated via so-called pre-mRNA editing, i.e. site-specific adenosine deamination in five distinct sites. The greater the number of edited sites in the serotonin receptor mRNA, the lower the activity of the receptor it encodes. Here we used the results of extensive massively parallel sequencing from human and rat brains to elucidate the dependencies among the editing states of the five sites. Despite the apparent simplicity of the problem, disambiguation of these dependencies is a difficult task that required development of a new statistical technique. We employed this method to analyse the dependencies among editing in the 5 susceptible sites of the receptor mRNA and found that the proximal, juxtaposed sites A and B are strongly interdependent, and that the editing state of these two sites is a major determinant of the editing states of the other three sites, and hence the overall editing pattern. The statistical approach we developed for the analysis of mRNA editing can be applied to other cases of multiple site modification in RNA and proteins.
The F16L gene of vaccinia virus (VACV) is conserved in all chordopoxviruses except avipoxviruses. The crocodile poxvirus F16 protein ortholog has highly significant similarity to prokaryotic serine recombinases and contains all amino acids that comprise the catalytic site. In contrast, F16 orthologs encoded by other poxviruses show only marginally significant similarity to serine recombinases, lack essential amino acids of the active site and are most likely inactive derivatives of serine recombinases. Nevertheless, the conservation of F16L in non-avian poxviruses suggested an important function. However, a VACV mutant with the F16L gene knocked out replicated normally in dividing and quiescent cells. The F16 protein was synthesized early after infection and detected in virus cores. When expressed in infected or uninfected cells, F16 accumulated in nucleoli depending on the level of expression and confluency of cells. Evidence was obtained that F16 forms multimers, which might regulate concentration-dependent intracellular localization.
poxvirus orthologs; poxvirus prokaryotic homologs; vaccinia virus replication; poxvirus recombinase
Prions are agents of analog, protein conformation-based inheritance that can confer beneficial phenotypes to cells, especially under stress. Combined with genetic variation, prion-mediated inheritance can be channeled into prion-independent genomic inheritance. Latest screening shows that prions are common, at least in fungi. Thus, there is non-negligible flow of information from proteins to the genome in modern cells, in a direct violation of the Central Dogma of molecular biology. The prion-mediated heredity that violates the Central Dogma appears to be a specific, most radical manifestation of the widespread assimilation of protein (epigenetic) variation into genetic variation. The epigenetic variation precedes and facilitates genetic adaptation through a general ‘look-ahead effect’ of phenotypic mutations. This direction of the information flow is likely to be one of the important routes of environment-genome interaction and could substantially contribute to the evolution of complex adaptive traits.
This article was reviewed by Jerzy Jurka, Pierre Pontarotti and Juergen Brosius. For the complete reviews, see the Reviewers’ Reports section.
The Nucleo-Cytoplasmic Large DNA Viruses (NCLDV) constitute an apparently monophyletic group that consists of at least 6 families of viruses infecting a broad variety of eukaryotic hosts. A comprehensive genome comparison and maximum-likelihood reconstruction of the NCLDV evolution revealed a set of approximately 50 conserved, core genes that could be mapped to the genome of the common ancestor of this class of eukaryotic viruses.
We performed a detailed phylogenetic analysis of these core NCLDV genes and applied the constrained tree approach to show that the majority of the core genes are unlikely to be monophyletic. Several of the core genes have been independently acquired from different sources by different NCLDV lineages whereas for the majority of these genes displacement by homologs from cellular organisms in one or more groups of the NCLDV was demonstrated.
A detailed study of the evolution of the genomic core of the NCLDV reveals substantial complexity and diversity of evolutionary scenarios that was largely unsuspected previously. The phylogenetic coherence between the core genes is sufficient to validate the hypothesis on the evolution of all NCLDV from a common ancestral virus although the set of ancestral genes might be smaller than previously inferred from patterns of gene presence-absence.
Spliceosomal introns are one of the principal distinctive features of eukaryotes. Nevertheless, different large-scale studies disagree about even the most basic features of their evolution. In order to come up with a more reliable reconstruction of intron evolution, we developed a model that is far more comprehensive than previous ones. This model is rich in parameters, and estimating them accurately is infeasible by straightforward likelihood maximization. Thus, we have developed an expectation-maximization algorithm that allows for efficient maximization. Here, we outline the model and describe the expectation-maximization algorithm in detail. Since the method works with intron presence–absence maps, it is expected to be instrumental for the analysis of the evolution of other binary characters as well.
Maximum likelihood; expectation-maximization; intron evolution; ancestral reconstruction; eukaryotic gene structure
A novel bacteriophage infecting Staphylococus pasteuri was isolated during a screen for phages in Antarctic soils. The phage named SpaA1 is morphologically similar to phages of the family Siphoviridae. The 42,784 bp genome of SpaA1 is a linear, double-stranded DNA molecule with 3′ protruding cohesive ends. The SpaA1 genome encompasses 63 predicted protein-coding genes which cluster within three regions of the genome, each of apparently different origin, in a mosaic pattern. In two of these regions, the gene sets resemble those in prophages of Bacillus thuringiensis kurstaki str. T03a001 (genes involved in DNA replication/transcription, cell entry and exit) and B. cereus AH676 (additional regulatory and recombination genes), respectively. The third region represents an almost complete genome (except for the short terminal segments) of a distinct bacteriophage, MZTP02. Nearly the same gene module was identified in prophages of B. thuringiensis serovar monterrey BGSC 4AJ1 and B. cereus Rock4-2. These findings suggest that MZTP02 can be shuttled between genomes of other bacteriophages and prophages, leading to the formation of chimeric genomes. The presence of a complete phage genome in the genome of other phages apparently has not been described previously and might represent a ‘fast track’ route of virus evolution and horizontal gene transfer. Another phage (BceA1) nearly identical in sequence to SpaA1, and also including the almost complete MZTP02 genome within its own genome, was isolated from a bacterium of the B. cereus/B. thuringiensis group. Remarkably, both SpaA1 and BceA1 phages can infect B. cereus and B. thuringiensis, but only one of them, SpaA1, can infect S. pasteuri. This finding is best compatible with a scenario in which MZTP02 was originally contained in BceA1 infecting Bacillus spp, the common hosts for these two phages, followed by emergence of SpaA1 infecting S. pasteuri.
The two types of eukaryotic spliceosomal introns, U2 and U12, possess different splice signals and are excised by distinct spliceosomes. The nature of the primordial introns remains uncertain. A comparison of the amino acid distributions at insertion sites of introns that retained their positions throughout eukaryotic evolution with the distributions for human and Arabidopsis thaliana U2 and U12 introns reveals close similarity with U2 but not U12. Thus, the primordial spliceosomal introns were, most likely, U2-type.
It is proposed that the pre-cellular stage of biological evolution unraveled within networks of inorganic compartments that harbored a diverse mix of virus-like genetic elements. This stage of evolution might comprise the Last Universal Cellular Ancestor (LUCA) that more appropriately could be denoted Last Universal Cellular Ancestral State (LUCAS). This scenario for the origin of cellular life recapitulates the early ideas of J. B. S. Haldane sketched in his classic 1928 essay. However, unlike in Haldane’s day, there is now considerable support for this scenario from three major lines of comparative-genomic evidence: i) lack of homology between the core components of the DNA replication systems of the two primary lines of descent of cellular life forms, archaea and bacteria, ii) distinct membrane chemistries and lack of homology between the enzymes of lipid biosynthesis in archaea and bacteria, iii) spread of several viral hallmark genes, which encode proteins with key functions in viral replication and morphogenesis, among numerous and extremely diverse groups of viruses, in contrast to their absence in cellular life forms, iv) the extant archaeal and bacterial chromosomes appear to be shaped by accretion of diverse, smaller replicons, suggesting a continuity between the hypothetical, primordial virus stage of life’s evolution and the dynamic prokaryotic world that existed ever since. Under the viral model of pre-cellular evolution, the key components of cells including the replication apparatus, membranes, and molecular complexes involved in membrane transport and translocation originated as components of virus-like entities. The two surviving types of cellular life forms, archaea and bacteria, might have emerged from the LUCAS independently, along with, probably, numerous forms now extinct.
comparative genomics; evolution of cells; evolution of viruses; origin of membranes; viral hallmark genes
The most common form of protein-coding gene overlap in eukaryotes is a simple nested structure, whereby one gene is embedded in an intron of another. Analysis of nested protein-coding genes in vertebrates, fruit flies and nematodes revealed substantially higher rates of evolutionary gains than losses. The accumulation of nested gene structures could not be attributed to any obvious functional relationships between the genes involved and represents an increase of the organizational complexity of animal genomes via a neutral process.
In a recent BMC Evolutionary Biology article, Huiquan Liu and colleagues report two new genomes of double-stranded RNA (dsRNA) viruses from fungi and use these as a springboard to perform an extensive phylogenomic analysis of dsRNA viruses. The results support the old scenario of polyphyletic origin of dsRNA viruses from different groups of positive-strand RNA viruses and additionally reveal extensive horizontal gene transfer between diverse viruses consistent with the network-like rather than tree-like mode of viral evolution. Together with the unexpected discoveries of the first putative archaeal RNA virus and a RNA-DNA virus hybrid, this work shows that RNA viral genomics has major surprises to deliver.
See research article: http://www.biomedcentral.com/1471-2148/12/91
Archaeal and bacterial ribosomes contain more than 50 proteins, including 34 that are universally conserved in the three domains of cellular life (bacteria, archaea, and eukaryotes). Despite the high sequence conservation, annotation of ribosomal (r-) protein genes is often difficult because of their short lengths and biased sequence composition. We developed an automated computational pipeline for identification of r-protein genes and applied it to 995 completely sequenced bacterial and 87 archaeal genomes available in the RefSeq database. The pipeline employs curated seed alignments of r-proteins to run position-specific scoring matrix (PSSM)-based BLAST searches against six-frame genome translations, mitigating possible gene annotation errors. As a result of this analysis, we performed a census of prokaryotic r-protein complements, enumerated missing and paralogous r-proteins, and analyzed the distributions of ribosomal protein genes among chromosomal partitions. Phyletic patterns of bacterial and archaeal r-protein genes were mapped to phylogenetic trees reconstructed from concatenated alignments of r-proteins to reveal the history of likely multiple independent gains and losses. These alignments, available for download, can be used as search profiles to improve genome annotation of r-proteins and for further comparative genomics studies.