The nucleo-cytoplasmic large DNA viruses (NCLDV) constitute an apparently monophyletic group that consists of 6 families of viruses infecting a broad variety of eukaryotes. A comprehensive genome comparison and maximum-likelihood reconstruction of NCLDV evolution reveal a set of approximately 50 conserved genes that can be tentatively mapped to the genome of the common ancestor of this class of eukaryotic viruses. We address the origins and evolution of NCLDV.
Phylogenetic analysis indicates that some of the major clades of NCLDV infect diverse animals and protists, suggestive of early radiation of the NCLDV, possibly concomitant with eukaryogenesis. The core NCLDV genes seem to have originated from different sources including homologous genes of bacteriophages, bacteria and eukaryotes. These observations are compatible with a scenario of the origin of the NCLDV at an early stage of the evolution of eukaryotes through extensive mixing of genes from widely different genomes.
The common ancestor of the NCLDV probably evolved from a bacteriophage as a result of recruitment of numerous eukaryotic and some bacterial genes, and concomitant loss of the majority of phage genes except for a small core of genes coding for proteins essential for virus genome replication and virion formation.
Bacteriophage; Eukaryogenesis; Nucleo-cytoplasmic large DNA viruses, evolution; Phylogenetic analysis
The recent discovery of protein modification by SAMPs, ubiquitin-like (Ubl) proteins from the archaeon Haloferax volcanii, prompted a comprehensive comparative-genomic analysis of archaeal Ubl protein genes and the genes for enzymes thought to be functionally associated with Ubl proteins. This analysis showed that most archaea encode members of two major groups of Ubl proteins with the β-grasp fold, the ThiS and MoaD families, and indicated that the ThiS family genes are rarely linked to genes for thiamine or Mo/W cofactor metabolism enzymes but instead are most often associated with genes for enzymes of tRNA modification. Therefore it is hypothesized that the ancestral function of the archaeal Ubl proteins is sulfur insertion into modified nucleotides in tRNAs, an activity analogous to that of the URM1 protein in eukaryotes. Together with additional, previously described genomic associations, these findings indicate that systems for protein quality control operating at different levels, including tRNA modification that controls translation fidelity, protein ubiquitination that regulates protein degradation, and, possibly, mRNA degradation by the exosome, are functionally and evolutionarily linked.
Evolutionary binary characters are features of species or genes, indicating the absence (value zero) or presence (value one) of some property. Examples include eukaryotic gene architecture (the presence or absence of an intron in a particular locus), gene content, and morphological characters. In many studies, the acquisition of such binary characters is assumed to represent a rare evolutionary event, and consequently, their evolution is analyzed using various flavors of parsimony. However, when gain and loss of the character are not rare enough, a probabilistic analysis becomes essential. Here, we present a comprehensive probabilistic model to describe the evolution of binary characters on a bifurcating phylogenetic tree. A fast software tool, EREM, is provided, using maximum likelihood to estimate the parameters of the model and to reconstruct ancestral states (presence and absence in internal nodes) and events (gain and loss events along branches).
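The likelihood calculation behind such a model can be sketched as follows. This is a minimal illustration, assuming a plain two-state Markov process with constant gain and loss rates and a root drawn from the stationary distribution. It is not the EREM implementation, whose model details are not given here. The tree encoding, function names, and all numeric values are hypothetical.

```python
import math

def transition_matrix(gain, loss, t):
    """Return P[i][j] = Pr(state j at branch end | state i at branch start).

    Assumes gain > 0 and loss > 0 (rates of 0->1 and 1->0 changes).
    """
    total = gain + loss
    pi0, pi1 = loss / total, gain / total   # stationary absence / presence
    decay = math.exp(-total * t)
    return [[pi0 + pi1 * decay, pi1 * (1.0 - decay)],
            [pi0 * (1.0 - decay), pi1 + pi0 * decay]]

def partial_likelihoods(node, states, gain, loss):
    """Felsenstein pruning: likelihood of the leaves below `node` for each node state.

    A leaf is a string name; an internal node is a tuple
    (left_child, left_branch_length, right_child, right_branch_length).
    """
    if isinstance(node, str):
        return [1.0 if states[node] == s else 0.0 for s in (0, 1)]
    left, t_left, right, t_right = node
    result = [1.0, 1.0]
    for child, t in ((left, t_left), (right, t_right)):
        p = transition_matrix(gain, loss, t)
        below = partial_likelihoods(child, states, gain, loss)
        for s in (0, 1):
            result[s] *= p[s][0] * below[0] + p[s][1] * below[1]
    return result

def pattern_likelihood(tree, states, gain, loss):
    """Probability of one presence/absence pattern, root at stationarity (an assumption)."""
    total = gain + loss
    at_root = partial_likelihoods(tree, states, gain, loss)
    return (loss / total) * at_root[0] + (gain / total) * at_root[1]

# Hypothetical three-species tree and one observed character.
tree = (("A", 0.1, "B", 0.2), 0.3, "C", 0.4)
observed = {"A": 1, "B": 1, "C": 0}
print(pattern_likelihood(tree, observed, gain=0.5, loss=1.0))
```

Maximizing the product of such pattern likelihoods over the gain and loss rates would give maximum-likelihood estimates under this simplified model. The per-node partial likelihoods are also the starting point for reconstructing ancestral states.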
Recently a novel cell division system comprised of homologues of eukaryotic ESCRT-III (endosomal sorting complex required for transport III) proteins was discovered in the hyperthermophilic crenarchaeote Sulfolobus acidocaldarius. On the basis of this discovery, we undertook a comparative genomic analysis of the machineries for cell division and vesicle formation in Archaea. Archaea possess at least three distinct membrane remodelling systems: the FtsZ-based bacterial-type system, the ESCRT-III-based eukaryote-like system and a putative novel system that uses an archaeal actin-related protein. Many archaeal genomes encode assortments of components from different systems. Evolutionary reconstruction from these findings suggests that the last common ancestor of the extant Archaea possessed a complex membrane remodelling apparatus, different components of which were lost during subsequent evolution of archaeal lineages. By contrast, eukaryotes seem to have inherited all three ancestral systems.
Comparing the genome sequences of free-living organisms in the five eukaryotic supergroups enables predictions to be made about the genome of the last common ancestor of eukaryotes. The genome sequence of the amoeboflagellate Naegleria gruberi reported by Fritz-Laylin et al. (2010) reveals the surprising complexity of this unicellular organism and, by inference, of the last common eukaryotic ancestor.
The first congress on Viruses of Microbes took place at the Institut Pasteur in Paris, France, on 21–25 June 2010. The advances in genomics and metagenomics reported at this meeting reveal striking and unexpected complexity of the virus world. Viruses, in particular viruses that infect prokaryotes and unicellular eukaryotes, are emerging as the most abundant class of biological entities on earth and a major evolutionary and geochemical force.
Multiple constraints variously affect different parts of the genomes of diverse life forms. The selective pressures that shape the evolution of viral, archaeal, bacterial and eukaryotic genomes differ markedly, even among relatively closely related animal and bacterial lineages; by contrast, constraints affecting protein evolution seem to be more universal. The constraints that shape the evolution of genomes and phenomes are complemented by the plasticity and robustness of genome architecture, expression and regulation. Taken together, these findings are starting to reveal complex networks of evolutionary processes that must be integrated to attain a new synthesis of evolutionary biology.
Regulation of gene expression during infection of the thermophilic bacterium Thermus thermophilus (T. th.) HB8 with the bacteriophage P23-45 was investigated. Macroarray analysis revealed host transcription shut-off and identified three temporal classes of phage genes: early, middle, and late. Primer extension experiments revealed that the 5′ ends of P23-45 early transcripts are preceded by a common sequence motif that likely defines early viral promoters. T. th. HB8 RNA polymerase (RNAP) recognizes middle and late phage promoters in vitro but does not recognize early promoters. In vivo experiments revealed the presence of rifampicin-resistant RNA polymerizing activity in infected cells responsible for early transcription. The product of the P23-45 early gene 64 shows a distant sequence similarity with the largest, catalytic subunits of multisubunit RNAPs and contains the conserved metal-binding motif that is diagnostic of these proteins. We hypothesize that ORF64 encodes rifampicin-resistant phage RNAP that recognizes early phage promoters. Affinity isolation of T. th. HB8 RNAP from P23-45-infected cells identified two phage-encoded proteins: gp39 and gp76, that bind the host RNAP and inhibit in vitro transcription from host promoters, but not from middle or late phage promoters, and may thus control the shift from host to viral gene expression during infection. To our knowledge, gp39 and gp76 are the first characterized bacterial RNAP-binding proteins encoded by a thermophilic phage.
Thermus thermophilus; thermophage; phage promoters; RNA polymerase; RNAP-binding proteins
Clustered Regularly Interspaced Short Palindromic Repeats (CRISPRs) and the associated proteins (Cas) comprise a system of adaptive immunity against viruses and plasmids in prokaryotes. Cas1 is a CRISPR-associated protein that is common to all CRISPR-containing prokaryotes but its function remains obscure. Here we show that the purified Cas1 protein of Escherichia coli (YgbT) exhibits nuclease activity against single-stranded and branched DNAs including Holliday junctions, replication forks, and 5′-flaps. The crystal structure of YgbT and site-directed mutagenesis have revealed the potential active site. Genome-wide screens show that YgbT physically and genetically interacts with key components of DNA repair systems, including recB, recC and ruvB. Consistent with these findings, the ygbT deletion strain showed increased sensitivity to DNA damage and impaired chromosomal segregation. Similar phenotypes were observed in strains with deletion of CRISPR clusters, suggesting that the function of YgbT in repair involves interaction with the CRISPRs. These results show that YgbT belongs to a novel, structurally distinct family of nucleases acting on branched DNAs and suggest that, in addition to antiviral immunity, at least some components of the CRISPR-Cas system have a function in DNA repair.
Cas1; CRISPR; DNA recombination; DNA repair; nuclease; YgbT
Comparative genomics and new phylogenies of eukaryote groups suggest a scenario in which the mitochondrial endosymbiosis triggered the origin of eukaryotes.
Phylogenomics of eukaryote supergroups suggests a highly complex last common ancestor of eukaryotes and a key role of mitochondrial endosymbiosis in the origin of eukaryotes.
The rapidly accumulating genome sequence data allow researchers to address fundamental biological questions that were not even asked just a few years ago. A major problem in genomics is the widening gap between the rapid progress in genome sequencing and the comparatively slow progress in the functional characterization of sequenced genomes. Here we discuss two key questions of genome biology: whether we need more genomes, and how deep is our understanding of biology based on genomic analysis. We argue that overly specific annotations of gene functions are often less useful than the more generic, but also more robust, functional assignments based on protein family classification. We also discuss problems in understanding the functions of the remaining “conserved hypothetical” genes.
It is a common belief that all cellular life forms on earth have a common origin. This view is supported by the universality of the genetic code and the universal conservation of multiple genes, particularly those that encode key components of the translation system. A remarkable recent study claims to provide a formal, homology-independent test of the Universal Common Ancestry hypothesis by comparing the ability of a common-ancestry model and a multiple-ancestry model to predict sequences of universally conserved proteins.
We devised a computational experiment on a concatenated alignment of universally conserved proteins which shows that the purported demonstration of the universal common ancestry is a trivial consequence of significant sequence similarity between the analyzed proteins. The nature and origin of this similarity are irrelevant for the prediction of "common ancestry" by the model-comparison approach. Thus, homology (common origin) of the compared proteins remains an inference from sequence similarity rather than an independent property demonstrated by the likelihood analysis.
A formal demonstration of the Universal Common Ancestry hypothesis has not been achieved and is unlikely to be feasible in principle. Nevertheless, the evidence in support of this hypothesis provided by comparative genomics is overwhelming.
This article was reviewed by William Martin, Ivan Iossifov (nominated by Andrey Rzhetsky) and Arcady Mushegian. For the complete reviews, see the Reviewers' Report section.
The majority of mammalian genes produce multiple transcripts resulting from alternative splicing (AS) and/or alternative transcription initiation (ATI) and alternative transcription termination (ATT). Comparative analysis of the number of alternative nucleotides, isoforms, and introns per locus in genes with different types of alternative events suggests that ATI and ATT contribute to the diversity of the human and mouse transcriptomes even more than AS. There is a strong negative correlation between AS and ATI in 5′ untranslated regions (UTRs) and AS in coding sequences (CDSs) but an even stronger positive correlation between AS in CDSs and ATT in 3′ UTRs. These observations could reflect preferential regulation of distinct, large groups of genes by different mechanisms: 1) regulation at the level of transcription initiation and initiation of translation resulting from ATI and AS in 5′ UTRs and 2) posttranslational regulation by different protein isoforms. The tight linkage between AS in CDSs and ATT in 3′ UTRs suggests that variability of 3′ UTRs mediates differential translational regulation of alternative protein forms. Together, the results imply coordinate evolution of AS and alternative transcription, processes that occur concomitantly within gene expression factories.
alternative splicing; alternative transcription initiation; alternative transcription termination; gene expression factories
Phylogenetic trees of individual genes of prokaryotes (archaea and bacteria) generally have different topologies, largely owing to extensive horizontal gene transfer (HGT), suggesting that the Tree of Life (TOL) should be replaced by a “net of life” as the paradigm of prokaryote evolution. However, trees remain the natural representation of the histories of individual genes given the fundamentally bifurcating process of gene replication. Therefore, although no single tree can fully represent the evolution of prokaryote genomes, the complete picture of evolution will necessarily combine trees and nets. A quantitative measure of the signals of tree and net evolution is derived from an analysis of all quartets of species in all trees of the “Forest of Life” (FOL), which consists of approximately 7,000 phylogenetic trees for prokaryote genes including approximately 100 nearly universal trees (NUTs). Although diverse routes of net-like evolution collectively dominate the FOL, the pattern of tree-like evolution that reflects the consistent topologies of the NUTs is the most prominent coherent trend. We show that the contributions of tree-like and net-like evolutionary processes substantially differ across bacterial and archaeal lineages and between functional classes of genes. Evolutionary simulations indicate that the central tree-like signal cannot be realistically explained by a self-reinforcing pattern of biased HGT.
phylogenetic tree; horizontal gene transfer; species quartets; computer simulation
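The idea of comparing trees quartet by quartet, as in the Forest of Life analysis above, can be illustrated with a short sketch. It assumes that each tree is supplied as its set of non-trivial splits, each split given as one side of the bipartition of the leaf set. The agreement score used here (fraction of identically resolved quartets) is a generic choice for illustration and is not claimed to be the quantitative measure derived in the study. All names and the example trees are hypothetical.

```python
from itertools import combinations

def quartet_topology(splits, quartet):
    """Return the topology a tree induces on four leaves, or None if unresolved.

    `splits` is an iterable of frozensets, each holding the leaves on one side
    of an internal branch. A split that places exactly two of the four leaves
    on one side separates them from the other two.
    """
    members = set(quartet)
    for side in splits:
        inside = members & side
        if len(inside) == 2:
            return frozenset((frozenset(inside), frozenset(members - inside)))
    return None

def quartet_agreement(splits_a, splits_b, leaves):
    """Fraction of quartets resolved identically in both trees.

    Quartets left unresolved by either tree are skipped (an assumption).
    """
    same = total = 0
    for quartet in combinations(sorted(leaves), 4):
        top_a = quartet_topology(splits_a, quartet)
        top_b = quartet_topology(splits_b, quartet)
        if top_a is None or top_b is None:
            continue
        total += 1
        same += top_a == top_b
    return same / total if total else float("nan")

# Two hypothetical five-leaf trees: ((A,B),C,(D,E)) and ((A,C),B,(D,E)).
leaves = {"A", "B", "C", "D", "E"}
tree_1 = [frozenset("AB"), frozenset("DE")]
tree_2 = [frozenset("AC"), frozenset("DE")]
print(quartet_agreement(tree_1, tree_2, leaves))
```

Averaging such scores over many gene trees against a reference topology gives one rough indication of how much tree-like signal a collection of trees carries.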
New distinct versions of known protein folds provide a powerful means of protein-function prediction that complements sequence and genomic context analysis. These structures do not supplant direct biochemical experiments, but are indispensable for the complete characterization of proteins.
new variants of known folds
The arylalkylamine N-acetyltransferase (AANAT) family is divided into structurally distinct vertebrate and non-vertebrate groups. Expression of vertebrate AANATs is limited primarily to the pineal gland and retina, where it plays a role in controlling the circadian rhythm in melatonin synthesis. Based on the role melatonin plays in biological timing, AANAT has been given the moniker "the Timezyme". Non-vertebrate AANATs, which occur in fungi and protists, are thought to play a role in detoxification and are not known to be associated with a specific tissue.
We have found that the amphioxus genome contains seven AANATs, all having non-vertebrate type features. This and the absence of AANATs from the genomes of Hemichordates and Urochordates support the view that a major transition in the evolution of the AANATs may have occurred at the onset of vertebrate evolution. Analysis of the expression pattern of the two most structurally divergent AANATs in Branchiostoma lanceolatum (bl) revealed that they are expressed early in development and also in the adult at low levels throughout the body, possibly associated with the neural tube. Expression is clearly not exclusively associated with the proposed analogs of the pineal gland and retina. blAANAT activity is influenced by environmental lighting, but light/dark differences do not persist under constant light or constant dark conditions, indicating they are not circadian in nature. bfAANATα and bfAANATδ' have unusually alkaline (> 9.0) optimal pH, more than two pH units higher than that of vertebrate AANATs.
The substrate selectivity profiles of bfAANATα and δ' are relatively broad, including alkylamines, arylalkylamines and diamines, in contrast to vertebrate forms, which selectively acetylate serotonin and other arylalkylamines. Based on these features, it appears that amphioxus AANATs could play several roles, including detoxification and biogenic amine inactivation. The presence of seven AANATs in the amphioxus genome supports the view that arylalkylamine and polyamine acetylation is important to the biology of this organism and that these genes evolved in response to specific pressures related to requirements for amine acetylation.
Several recent discoveries reveal unexpected versatility of the bacterial and archaeal cytoskeleton systems that are involved in cell division and other processes based on membrane remodeling. Here we apply methods for distant protein sequence similarity detection, phylogenetic approaches, and genome context analysis to describe two previously unnoticed families of the FtsZ-tubulin superfamily. One of these families is limited in its spread to Proteobacteria whereas the other is represented in diverse bacteria and archaea, and might be the key component of a novel, multicomponent membrane remodeling system that also includes a Von Willebrand A domain-containing protein, a distinct GTPase and membrane transport proteins of the OmpA family.
This article was reviewed by Purificación López-García and Gáspár Jékely; for complete reviews, see the Reviewers Reports section.
Motivation: Identifying orthologous genes in multiple genomes is a fundamental task in comparative genomics. Construction of intergenomic symmetrical best matches (SymBets) and joining them into clusters is a popular method of ortholog definition, embodied in several software programs. Despite their wide use, the computational complexity of these programs has not been thoroughly examined.
Results: In this work, we show that in the standard approach of iteration through all triangles of SymBets, the memory scales with at least the number of these triangles, O(g^3) (where g = number of genomes), and construction time scales with the iteration through each pair, i.e. O(g^6). We propose the EdgeSearch algorithm that iterates over edges in the SymBet graph rather than triangles of SymBets, and as a result has a worst-case complexity of only O(g^3 log g). Several optimizations reduce the run-time even further in realistically sparse graphs. In two real-world datasets of genomes from bacteriophages (POGs) and Mollicutes (MOGs), an implementation of the EdgeSearch algorithm runs about an order of magnitude faster than the original algorithm and scales much better with increasing number of genomes, with only minor differences in the final results, and up to 60 times faster than the popular OrthoMCL program with a 90% overlap between the identified groups of orthologs.
Availability and implementation: C++ source code freely available for download at ftp.ncbi.nih.gov/pub/wolf/COGs/COGsoft/
Supplementary information: Supplementary materials are available at Bioinformatics online.
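The building blocks named in the Motivation and Results above, symmetrical best matches and triangles of them, can be sketched as follows. This is a minimal illustration that assumes best hits have already been computed and are stored in a dictionary. It finds triangles by intersecting the neighbor sets of each SymBet edge, which is a common graph idiom and is not claimed to reproduce EdgeSearch or the released C++ code. The data layout, function names, and gene identifiers are hypothetical.

```python
def symmetric_best_matches(best_hit):
    """Return SymBet edges as frozensets of two genes.

    `best_hit` maps (gene, other_genome) to the best-matching gene in that
    genome. A gene is a (genome, identifier) tuple, and hits are assumed to
    point only to genes of a different genome.
    """
    edges = set()
    for (gene, _target_genome), hit in best_hit.items():
        # Keep the pair only if the hit points back to the query gene.
        if best_hit.get((hit, gene[0])) == gene:
            edges.add(frozenset((gene, hit)))
    return edges

def symbet_triangles(edges):
    """Return all triangles (sets of three mutually SymBet-linked genes)."""
    neighbors = {}
    for edge in edges:
        a, b = tuple(edge)
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    found = set()
    for edge in edges:
        a, b = tuple(edge)
        # Any gene linked to both ends of the edge closes a triangle.
        for c in neighbors[a] & neighbors[b]:
            found.add(frozenset((a, b, c)))
    return found

# Hypothetical best hits among three genomes A, B and C.
a1, b1, c1, c2 = ("A", "a1"), ("B", "b1"), ("C", "c1"), ("C", "c2")
best_hit = {
    (a1, "B"): b1, (b1, "A"): a1,
    (a1, "C"): c1, (c1, "A"): a1,
    (b1, "C"): c1, (c1, "B"): b1,
    (c2, "A"): a1, (c2, "B"): b1,   # one-directional hits, not SymBets
}
edges = symmetric_best_matches(best_hit)
print(len(edges), len(symbet_triangles(edges)))
```

Merging triangles that share an edge would then yield candidate clusters of orthologs. That step is omitted here to keep the sketch short.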
Evolutionarily unrelated proteins that catalyze the same biochemical reactions are often referred to as analogous - as opposed to homologous - enzymes. The existence of numerous alternative, non-homologous enzyme isoforms presents an interesting evolutionary problem; it also complicates genome-based reconstruction of the metabolic pathways in a variety of organisms. In 1998, a systematic search for analogous enzymes resulted in the identification of 105 Enzyme Commission (EC) numbers that included two or more proteins without detectable sequence similarity to each other, including 34 EC nodes where proteins were known (or predicted) to have distinct structural folds, indicating independent evolutionary origins. In the past 12 years, many putative non-homologous isofunctional enzymes were identified in newly sequenced genomes. In addition, efforts in structural genomics resulted in a vastly improved structural coverage of proteomes, providing for definitive assessment of (non)homologous relationships between proteins.
We report the results of a comprehensive search for non-homologous isofunctional enzymes (NISE) that yielded 185 EC nodes with two or more experimentally characterized - or predicted - structurally unrelated proteins. Of these NISE sets, only 74 were from the original 1998 list. Structural assignments of the NISE show over-representation of proteins with the TIM barrel fold and the nucleotide-binding Rossmann fold. From the functional perspective, the set of NISE is enriched in hydrolases, particularly carbohydrate hydrolases, and in enzymes involved in defense against oxidative stress.
These results indicate that at least some of the non-homologous isofunctional enzymes were recruited relatively recently from enzyme families that are active against related substrates and are sufficiently flexible to accommodate changes in substrate specificity.
This article was reviewed by Andrei Osterman, Keith F. Tipton (nominated by Martijn Huynen) and Igor B. Zhulin. For the full reviews, go to the Reviewers' comments section.
Comparison of expression levels and breadth and evolutionary rates of intronless and intron-containing mammalian genes shows that intronless genes are expressed at lower levels, tend to be tissue specific, and evolve significantly faster than spliced genes. By contrast, monomorphic spliced genes that are not subject to detectable alternative splicing and polymorphic alternatively spliced genes show similar, statistically indistinguishable patterns of expression and evolution. Alternative splicing is most common in ancient genes, whereas intronless genes appear to have relatively recent origins. These results imply tight coupling between different stages of gene expression, in particular, transcription, splicing, and nucleocytosolic transport of transcripts, and suggest that formation of intronless genes is an important route of evolution of novel tissue-specific functions in animals.
alternative splicing; intronless genes; monomorphic genes; polymorphic genes; mammalian gene evolution
A long-standing assumption in evolutionary biology is that the evolution rate of protein-coding genes depends, largely, on specific constraints that affect the function of the given protein. However, recent research in evolutionary systems biology revealed unexpected, significant correlations between evolution rate and characteristics of genes or proteins that are not directly related to specific protein functions, such as expression level and protein–protein interactions. The strongest connections were consistently detected between protein sequence evolution rate and the expression level of the respective gene. A recent genome-wide proteomic study revealed an extremely strong correlation between the abundances of orthologous proteins in distantly related animals, the nematode Caenorhabditis elegans and the fruit fly Drosophila melanogaster. We used the extensive protein abundance data from this study along with short-term evolutionary rates (ERs) of orthologous genes in nematodes and flies to estimate the relative contributions of structural–functional constraints and the translation rate to the evolution rate of protein-coding genes. Together the intrinsic constraints and translation rate account for approximately 50% of the variance of the ERs. The contribution of constraints is estimated to be 3- to 5-fold greater than the contribution of translation rate.
protein evolution; structural–functional constraints; misfolding; protein abundance
Small, hydrophobic proteins whose synthesis is repressed by small RNAs (sRNAs), denoted type I toxin–antitoxin modules, were first discovered on plasmids where they regulate plasmid stability, but were subsequently found on a few bacterial chromosomes. We used exhaustive PSI-BLAST and TBLASTN searches across 774 bacterial genomes to identify homologs of known type I toxins. These searches substantially expanded the collection of predicted type I toxins, revealed homology of the Ldr and Fst toxins, and suggested that type I toxin–antitoxin loci are not spread by horizontal gene transfer. To discover novel type I toxin–antitoxin systems, we developed a set of search parameters based on characteristics of known loci including the presence of tandem repeats and clusters of charged and bulky amino acids at the C-termini of short proteins containing predicted transmembrane regions. We detected sRNAs for three predicted toxins from enterohemorrhagic Escherichia coli and Bacillus subtilis, and showed that two of the respective proteins indeed are toxic when overexpressed. We also demonstrated that the local free-energy minima of RNA folding can be used to detect the positions of the sRNA genes. Our results suggest that type I toxin–antitoxin modules are much more widely distributed among bacteria than previously appreciated.
Genomes of several yeast species contain integrated DNA copies of complete genomes or individual genes of non-retroviral double-strand RNA viruses as reported in a recent BMC Biology article by Taylor and Bruenn. The integrated virus-specific sequences are at least partially expressed and seem to evolve under pressure of purifying selection, indicating that these are functional genes. Together with similar reports on integrated copies of some animal RNA viruses, these results suggest that integration of DNA copies of non-reverse-transcribing RNA viruses might be much more common than previously thought. The integrated copies could contribute to acquired immunity to the respective viruses.