Accurate inference of orthologous genes is a pre-requisite for most comparative genomics studies, and is also important for functional annotation of new genomes. Identification of orthologous gene sets typically involves phylogenetic tree analysis, heuristic algorithms based on sequence conservation, synteny analysis, or some combination of these approaches. The most direct tree-based methods typically rely on the comparison of an individual gene tree with a species tree. Once the two trees are accurately constructed, orthologs are straightforwardly identified by the definition of orthology as those homologs that are related by speciation, rather than gene duplication, at their most recent point of origin. Although ideal for the purpose of orthology identification in principle, phylogenetic trees are computationally expensive to construct for large numbers of genes and genomes, and they often contain errors, especially at large evolutionary distances. Moreover, in many organisms, in particular prokaryotes and viruses, evolution does not appear to have followed a simple ‘tree-like’ mode, which makes conventional tree reconciliation inapplicable. Other, heuristic methods identify probable orthologs as the closest homologous pairs or groups of genes in a set of organisms. These approaches are faster and easier to automate than tree-based methods, with efficient implementations provided by graph-theoretical algorithms enabling comparisons of thousands of genomes. Comparisons of these two approaches show that, despite conceptual differences, they produce similar sets of orthologs, especially at short evolutionary distances. Synteny also can aid in identification of orthologs. Often, tree-based, sequence similarity- and synteny-based approaches can be combined into flexible hybrid methods.
homolog; ortholog; paralog; xenolog; orthologous groups; tree reconciliation; comparative genomics
The widespread exchange of genes among prokaryotes, known as horizontal gene transfer (HGT), is often considered to “uproot” the Tree of Life (TOL). Indeed, it is by now fully clear that genes in general possess different evolutionary histories. However, the possibility remains that the TOL concept can be reformulated and remain valid as a statistical central trend in the phylogenetic “Forest of Life” (FOL). This article describes a computational pipeline developed to chart the FOL by comparative analysis of thousands of phylogenetic trees. This analysis reveals a distinct, consistent phylogenetic signal that is particularly strong among the Nearly Universal Trees (NUTs), which correspond to genes represented in all or most of the analyzed organisms. Despite the substantial amount of apparent HGT seen even among the NUTs, these gene transfers appear to be distributed randomly and do not obscure the central tree-like trend.
The recently discovered CRISPR-Cas adaptive immune system is present in almost all archaea and many bacteria. It consists of cassettes of CRISPR repeats that incorporate spacers homologous to fragments of viral or plasmid genomes that are employed as guide RNAs in the immune response, along with numerous CRISPR-associated (cas) genes that encode proteins possessing diverse, only partially characterized activities required for the action of the system. Here, we investigate the evolution of the cas genes and show that they evolve under purifying selection that is typically much weaker than the median strength of purifying selection affecting genes in the respective genomes. The exceptions are the cas1 and cas2 genes that typically evolve at levels of purifying selection close to the genomic median. Thus, although these genes are implicated in the acquisition of spacers from alien genomes, they do not appear to be directly involved in an arms race between bacterial and archaeal hosts and infectious agents. These genes might possess functions distinct from and additional to their role in the CRISPR-Cas-mediated immune response. Taken together with evidence of the frequent horizontal transfer of cas genes reported previously and with the wide-spread microscale recombination within these genes detected in this work, these findings reveal the highly dynamic evolution of cas genes. This conclusion is in line with the involvement of CRISPR-Cas in antiviral immunity that is likely to entail a coevolutionary arms race with rapidly evolving viruses. However, we failed to detect evidence of strong positive selection in any of the cas genes.
When Charles Darwin formulated the central principles of evolutionary biology in the Origin of Species in 1859 and the architects of the Modern Synthesis integrated these principles with population genetics almost a century later, the principal if not the sole objects of evolutionary biology were multicellular eukaryotes, primarily animals and plants. Before the advent of efficient gene sequencing, all attempts to extend evolutionary studies to bacteria have been futile. Sequencing of the rRNA genes in thousands of microbes allowed the construction of the three- domain “ribosomal Tree of Life” that was widely thought to have resolved the evolutionary relationships between the cellular life forms. However, subsequent massive sequencing of numerous, complete microbial genomes revealed novel evolutionary phenomena, the most fundamental of these being: (1) pervasive horizontal gene transfer (HGT), in large part mediated by viruses and plasmids, that shapes the genomes of archaea and bacteria and call for a radical revision (if not abandonment) of the Tree of Life concept, (2) Lamarckian-type inheritance that appears to be critical for antivirus defense and other forms of adaptation in prokaryotes, and (3) evolution of evolvability, i.e., dedicated mechanisms for evolution such as vehicles for HGT and stress-induced mutagenesis systems. In the non-cellular part of the microbial world, phylogenomics and metagenomics of viruses and related selfish genetic elements revealed enormous genetic and molecular diversity and extremely high abundance of viruses that come across as the dominant biological entities on earth. Furthermore, the perennial arms race between viruses and their hosts is one of the defining factors of evolution. Thus, microbial phylogenomics adds new dimensions to the fundamental picture of evolution even as the principle of descent with modification discovered by Darwin and the laws of population genetics remain at the core of evolutionary biology.
Darwin; modern synthesis; comparative genomics; tree of life; horizontal gene transfer
We compare the sets of experimentally validated long intergenic non-coding (linc)RNAs from human and mouse and apply a maximum likelihood approach to estimate the total number of lincRNA genes as well as the size of the conserved part of the lincRNome. Under the assumption that the sets of experimentally validated lincRNAs are random samples of the lincRNomes of the corresponding species, we estimate the total lincRNome size at approximately 40,000 to 50,000 species, at least twice the number of protein-coding genes. We further estimate that the fraction of the human and mouse euchromatic genomes encoding lincRNAs is more than twofold greater than the fraction of protein-coding sequences. Although the sequences of most lincRNAs are much less strongly conserved than protein sequences, the extent of orthology between the lincRNomes is unexpectedly high, with 60 to 70% of the lincRNA genes shared between human and mouse. The orthologous mammalian lincRNAs can be predicted to perform equivalent functions; accordingly, it appears likely that thousands of evolutionarily conserved functional roles of lincRNAs remain to be characterized.
Genome analysis of humans and other mammals reveals a surprisingly small number of protein-coding genes, only slightly over 20,000 (although the diversity of actual proteins is substantially augmented by alternative transcription and alternative splicing). Recent analysis of the mammalian genomes and transcriptomes, in particular, using the RNAseq technology, shows that, in addition to protein-coding genes, mammalian genomes encode many long non-coding RNAs. For some of these transcripts, various regulatory functions have been demonstrated, but on the whole the repertoire of long non-coding RNAs remains poorly characterized. We compared the identified long intergenic non-coding (linc)RNAs from human and mouse, and employed a specially developed statistical technique to estimate the size and evolutionary conservation of the human and mouse lincRNomes. The estimates show that there are at least twice as many human and mouse lincRNAs than there are protein-coding genes. Moreover, about two third of the lincRNA genes appear to be conserved between human and mouse, implying thousands of conserved but still uncharacterized functions.
Evolution of prokaryotes involves extensive loss and gain of genes, which lead to substantial differences in the gene repertoires even among closely related organisms. Through a wide range of phylogenetic depths, gene frequency distributions in prokaryotic pangenomes bear a characteristic, asymmetrical U-shape, with a core of (nearly) universal genes, a “shell” of moderately common genes, and a “cloud” of rare genes. We employ mathematical modeling to investigate evolutionary processes that might underlie this universal pattern. Gene frequency distributions for almost 400 groups of 10 bacterial or archaeal species each over a broad range of evolutionary distances were fit to steady-state, infinite allele models based on the distribution of gene replacement rates and the phylogenetic tree relating the species in each group. The fits of the theoretical frequency distributions to the empirical ones yield model parameters and estimates of the goodness of fit. Using the Akaike Information Criterion, we show that the neutral model of genome evolution, with the same replacement rate for all genes, can be confidently rejected. Of the three tested models with purifying selection, the one in which the distribution of replacement rates is derived from a stochastic population model with additive per-gene fitness yields the best fits to the data. The selection strength estimated from the fits declines with evolutionary divergence while staying well outside the neutral regime. These findings indicate that, unlike some other universal distributions of genomic variables, for example, the distribution of paralogous gene family membership, the gene frequency distribution is substantially affected by selection.
gene frequency distribution; steady genome model; goodness of fit; evolution mechanisms
Bacteria and archaea face continual onslaughts of rapidly diversifying viruses and plasmids. Many prokaryotes maintain adaptive immune systems known as clustered regularly interspaced short palindromic repeats (CRISPR) and CRISPR-associated genes (Cas). CRISPR-Cas systems are genomic sensors that serially acquire viral and plasmid DNA fragments (spacers) that are utilized to target and cleave matching viral and plasmid DNA in subsequent genomic invasions, offering critical immunological memory. Only 50% of sequenced bacteria possess CRISPR-Cas immunity, in contrast to over 90% of sequenced archaea. To probe why half of bacteria lack CRISPR-Cas immunity, we combined comparative genomics and mathematical modeling. Analysis of hundreds of diverse prokaryotic genomes shows that CRISPR-Cas systems are substantially more prevalent in thermophiles than in mesophiles. With sequenced bacteria disproportionately mesophilic and sequenced archaea mostly thermophilic, the presence of CRISPR-Cas appears to depend more on environmental temperature than on bacterial-archaeal taxonomy. Mutation rates are typically severalfold higher in mesophilic prokaryotes than in thermophilic prokaryotes. To quantitatively test whether accelerated viral mutation leads microbes to lose CRISPR-Cas systems, we developed a stochastic model of virus-CRISPR coevolution. The model competes CRISPR-Cas-positive (CRISPR-Cas+) prokaryotes against CRISPR-Cas-negative (CRISPR-Cas−) prokaryotes, continually weighing the antiviral benefits conferred by CRISPR-Cas immunity against its fitness costs. Tracking this cost-benefit analysis across parameter space reveals viral mutation rate thresholds beyond which CRISPR-Cas cannot provide sufficient immunity and is purged from host populations. These results offer a simple, testable viral diversity hypothesis to explain why mesophilic bacteria disproportionately lack CRISPR-Cas immunity. More generally, fundamental limits on the adaptability of biological sensors (Lamarckian evolution) are predicted.
A remarkable recent discovery in microbiology is that bacteria and archaea possess systems conferring immunological memory and adaptive immunity. Clustered regularly interspaced short palindromic repeats (CRISPR) and CRISPR-associated genes (CRISPR-Cas) are genomic sensors that allow prokaryotes to acquire DNA fragments from invading viruses and plasmids. Providing immunological memory, these stored fragments destroy matching DNA in future viral and plasmid invasions. CRISPR-Cas systems also provide adaptive immunity, keeping up with mutating viruses and plasmids by continually acquiring new DNA fragments. Surprisingly, less than 50% of mesophilic bacteria, in contrast to almost 90% of thermophilic bacteria and Archaea, maintain CRISPR-Cas immunity. Using mathematical modeling, we probe this dichotomy, showing how increased viral mutation rates can explain the reduced prevalence of CRISPR-Cas systems in mesophiles. Rapidly mutating viruses outrun CRISPR-Cas immune systems, likely decreasing their prevalence in bacterial populations. Thus, viral adaptability may select against, rather than for, immune adaptability in prokaryotes.
A fundamental observation of comparative genomics is that the distribution of evolution rates across the complete sets of orthologous genes in pairs of related genomes remains virtually unchanged throughout the evolution of life, from bacteria to mammals. The most straightforward explanation for the conservation of this distribution appears to be that the relative evolution rates of all genes remain nearly constant, or in other words, that evolutionary rates of different genes are strongly correlated within each evolving genome. This correlation could be explained by a model that we denoted Universal PaceMaker (UPM) of genome evolution. The UPM model posits that the rate of evolution changes synchronously across genome-wide sets of genes in all evolving lineages. Alternatively, however, the correlation between the evolutionary rates of genes could be a simple consequence of molecular clock (MC). We sought to differentiate between the MC and UPM models by fitting thousands of phylogenetic trees for bacterial and archaeal genes to supertrees that reflect the dominant trend of vertical descent in the evolution of archaea and bacteria and that were constrained according to the two models. The goodness of fit for the UPM model was better than the fit for the MC model, with overwhelming statistical significance, although similarly to the MC, the UPM is strongly overdispersed. Thus, the results of this analysis reveal a universal, genome-wide pacemaker of evolution that could have been in operation throughout the history of life.
A central concept of evolution is Molecular Clock according to which each gene evolves at a characteristic, near constant rate. Numerous studies support the Molecular Clock hypothesis in principle but also show that the clock is indeed very approximate. Genome-wide comparative analysis of phylogenetic trees described here reveals a distinct, more general feature of genome evolution that we called Universal Pacemaker. Under this model, when the rate of evolution changes, the change occurs synchronously in many if not all genes in the evolving genome. In other words, the relative rates of gene evolution remain constant across long evolutionary spans: if a gene is slow relative to the rest of the genes in the given lineage, it is always slow, and if it evolves fast, it is always fast. We show here that the Universal Pacemaker model fits the available data much better than the traditional Molecular Clock model. These findings are compatible with the previously observed accelerations and decelerations of evolution in individual lineages but we show that synchronous, genome-wide change of evolutionary rates is a global feature of genome evolution that appears to pervade the entire history of life.
Evolutionary binary characters are features of species or genes, indicating the absence (value zero) or presence (value one) of some property. Examples include eukaryotic gene architecture (the presence or absence of an intron in a particular locus), gene content, and morphological characters. In many studies, the acquisition of such binary characters is assumed to represent a rare evolutionary event, and consequently, their evolution is analyzed using various flavors of parsimony. However, when gain and loss of the character are not rare enough, a probabilistic analysis becomes essential. Here, we present a comprehensive probabilistic model to describe the evolution of binary characters on a bifurcating phylogenetic tree. A fast software tool, EREM, is provided, using maximum likelihood to estimate the parameters of the model and to reconstruct ancestral states (presence and absence in internal nodes) and events (gain and loss events along branches).
Orthologous relationships between genes are routinely inferred from bidirectional best hits (BBH) in pairwise genome comparisons. However, to our knowledge, it has never been quantitatively demonstrated that orthologs form BBH. To test this “BBH-orthology conjecture,” we take advantage of the operon organization of bacterial and archaeal genomes and assume that, when two genes in compared genomes are flanked by two BBH show statistically significant sequence similarity to one another, these genes are bona fide orthologs. Under this assumption, we tested whether middle genes in “syntenic orthologous gene triplets” form BBH. We found that this was the case in more than 95% of the syntenic gene triplets in all genome comparisons. A detailed examination of the exceptions to this pattern, including maximum likelihood phylogenetic tree analysis, showed that some of these deviations involved artifacts of genome annotation, whereas very small fractions represented random assignment of the best hit to one of closely related in-paralogs, paralogous displacement in situ, or even less frequent genuine violations of the BBH–orthology conjecture caused by acceleration of evolution in one of the orthologs. We conclude that, at least in prokaryotes, genes for which independent evidence of orthology is available typically form BBH and, conversely, BBH can serve as a strong indication of gene orthology.
orthology; bidirectional best hit; genome comparison; synteny
There are no known RNA viruses that infect Archaea. Filling this gap in our knowledge of viruses will enhance our understanding of the relationships between RNA viruses from the three domains of cellular life and, in particular, could shed light on the origin of the enormous diversity of RNA viruses infecting eukaryotes. We describe here the identification of novel RNA viral genome segments from high-temperature acidic hot springs in Yellowstone National Park in the United States. These hot springs harbor low-complexity cellular communities dominated by several species of hyperthermophilic Archaea. A viral metagenomics approach was taken to assemble segments of these RNA virus genomes from viral populations isolated directly from hot spring samples. Analysis of these RNA metagenomes demonstrated unique gene content that is not generally related to known RNA viruses of Bacteria and Eukarya. However, genes for RNA-dependent RNA polymerase (RdRp), a hallmark of positive-strand RNA viruses, were identified in two contigs. One of these contigs is approximately 5,600 nucleotides in length and encodes a polyprotein that also contains a region homologous to the capsid protein of nodaviruses, tetraviruses, and birnaviruses. Phylogenetic analyses of the RdRps encoded in these contigs indicate that the putative archaeal viruses form a unique group that is distinct from the RdRps of RNA viruses of Eukarya and Bacteria. Collectively, our findings suggest the existence of novel positive-strand RNA viruses that probably replicate in hyperthermophilic archaeal hosts and are highly divergent from RNA viruses that infect eukaryotes and even more distant from known bacterial RNA viruses. These positive-strand RNA viruses might be direct ancestors of RNA viruses of eukaryotes.
Spliceosomal introns are one of the principal distinctive features of eukaryotes. Nevertheless, different large-scale studies disagree about even the most basic features of their evolution. In order to come up with a more reliable reconstruction of intron evolution, we developed a model that is far more comprehensive than previous ones. This model is rich in parameters, and estimating them accurately is infeasible by straightforward likelihood maximization. Thus, we have developed an expectation-maximization algorithm that allows for efficient maximization. Here, we outline the model and describe the expectation-maximization algorithm in detail. Since the method works with intron presence–absence maps, it is expected to be instrumental for the analysis of the evolution of other binary characters as well.
Maximum likelihood; expectation-maximization; intron evolution; ancestral reconstruction; eukaryotic gene structure
Archaeal and bacterial ribosomes contain more than 50 proteins, including 34 that are universally conserved in the three domains of cellular life (bacteria, archaea, and eukaryotes). Despite the high sequence conservation, annotation of ribosomal (r-) protein genes is often difficult because of their short lengths and biased sequence composition. We developed an automated computational pipeline for identification of r-protein genes and applied it to 995 completely sequenced bacterial and 87 archaeal genomes available in the RefSeq database. The pipeline employs curated seed alignments of r-proteins to run position-specific scoring matrix (PSSM)-based BLAST searches against six-frame genome translations, mitigating possible gene annotation errors. As a result of this analysis, we performed a census of prokaryotic r-protein complements, enumerated missing and paralogous r-proteins, and analyzed the distributions of ribosomal protein genes among chromosomal partitions. Phyletic patterns of bacterial and archaeal r-protein genes were mapped to phylogenetic trees reconstructed from concatenated alignments of r-proteins to reveal the history of likely multiple independent gains and losses. These alignments, available for download, can be used as search profiles to improve genome annotation of r-proteins and for further comparative genomics studies.
Multiple constraints variously affect different parts of the genomes of diverse life forms. The selective pressures that shape the evolution of viral, archaeal, bacterial and eukaryotic genomes differ markedly, even among relatively closely related animal and bacterial lineages; by contrast, constraints affecting protein evolution seem to be more universal. The constraints that shape the evolution of genomes and phenomes are complemented by the plasticity and robustness of genome architecture, expression and regulation. Taken together, these findings are starting to reveal complex networks of evolutionary processes that must be integrated to attain a new synthesis of evolutionary biology.
Experimental studies on enzyme evolution show that only a small fraction of all possible mutation trajectories are accessible to evolution. However, these experiments deal with individual enzymes and explore a tiny part of the fitness landscape. We report an exhaustive analysis of fitness landscapes constructed with an off-lattice model of protein folding where fitness is equated with robustness to misfolding. This model mimics the essential features of the interactions between amino acids, is consistent with the key paradigms of protein folding and reproduces the universal distribution of evolutionary rates among orthologous proteins. We introduce mean path divergence as a quantitative measure of the degree to which the starting and ending points determine the path of evolution in fitness landscapes. Global measures of landscape roughness are good predictors of path divergence in all studied landscapes: the mean path divergence is greater in smooth landscapes than in rough ones. The model-derived and experimental landscapes are significantly smoother than random landscapes and resemble additive landscapes perturbed with moderate amounts of noise; thus, these landscapes are substantially robust to mutation. The model landscapes show a deficit of suboptimal peaks even compared with noisy additive landscapes with similar overall roughness. We suggest that smoothness and the substantial deficit of peaks in the fitness landscapes of protein evolution are fundamental consequences of the physics of protein folding.
Is evolution deterministic, hence predictable, or stochastic, that is unpredictable? What would happen if one could “replay the tape of evolution”: will the outcomes of evolution be completely different or is evolution so constrained that history will be repeated? Arguably, these questions are among the most intriguing and most difficult in evolutionary biology. In other words, the predictability of evolution depends on the fraction of the trajectories on fitness landscapes that are accessible for evolutionary exploration. Because direct experimental investigation of fitness landscapes is technically challenging, the available studies only explore a minuscule portion of the landscape for individual enzymes. We therefore sought to investigate the topography of fitness landscapes within the framework of a previously developed model of protein folding and evolution where fitness is equated with robustness to misfolding. We show that model-derived and experimental landscapes are significantly smoother than random landscapes and resemble moderately perturbed additive landscapes; thus, these landscapes are substantially robust to mutation. The model landscapes show a deficit of suboptimal peaks even compared with noisy additive landscapes with similar overall roughness. Thus, the smoothness and substantial deficit of peaks in fitness landscapes of protein evolution could be fundamental consequences of the physics of protein folding.
The arms race between cellular life forms and viruses is a major driving force of evolution. A substantial fraction of bacterial and archaeal genomes is dedicated to antivirus defense. We analyzed the distribution of defense genes and typical mobilome components (such as viral and transposon genes) in bacterial and archaeal genomes and demonstrated statistically significant clustering of antivirus defense systems and mobile genes and elements in genomic islands. The defense islands are enriched in putative operons and contain numerous overrepresented gene families. A detailed sequence analysis of the proteins encoded by genes in these families shows that many of them are diverged variants of known defense system components, whereas others show features, such as characteristic operonic organization, that are suggestive of novel defense systems. Thus, genomic islands provide abundant material for the experimental study of bacterial and archaeal antivirus defense. Except for the CRISPR-Cas systems, different classes of defense systems, in particular toxin-antitoxin and restriction-modification systems, show nonrandom clustering in defense islands. It remains unclear to what extent these associations reflect functional cooperation between different defense systems and to what extent the islands are genomic “sinks” that accumulate diverse nonessential genes, particularly those acquired via horizontal gene transfer. The characteristics of defense islands resemble those of mobilome islands. Defense and mobilome genes are nonrandomly associated in islands, suggesting nonadaptive evolution of the islands via a preferential attachment-like mechanism underpinned by the addictive properties of defense systems such as toxins-antitoxins and an important role of horizontal mobility in the evolution of these islands.
Phylogenetic trees of individual genes of prokaryotes (archaea and bacteria) generally have different topologies, largely owing to extensive horizontal gene transfer (HGT), suggesting that the Tree of Life (TOL) should be replaced by a “net of life” as the paradigm of prokaryote evolution. However, trees remain the natural representation of the histories of individual genes given the fundamentally bifurcating process of gene replication. Therefore, although no single tree can fully represent the evolution of prokaryote genomes, the complete picture of evolution will necessarily combine trees and nets. A quantitative measure of the signals of tree and net evolution is derived from an analysis of all quartets of species in all trees of the “Forest of Life” (FOL), which consists of approximately 7,000 phylogenetic trees for prokaryote genes including approximately 100 nearly universal trees (NUTs). Although diverse routes of net-like evolution collectively dominate the FOL, the pattern of tree-like evolution that reflects the consistent topologies of the NUTs is the most prominent coherent trend. We show that the contributions of tree-like and net-like evolutionary processes substantially differ across bacterial and archaeal lineages and between functional classes of genes. Evolutionary simulations indicate that the central tree-like signal cannot be realistically explained by a self-reinforcing pattern of biased HGT.
phylogenetic tree; horizontal gene transfer; species quartets; computer simulation
Motivation: Identifying orthologous genes in multiple genomes is a fundamental task in comparative genomics. Construction of intergenomic symmetrical best matches (SymBets) and joining them into clusters is a popular method of ortholog definition, embodied in several software programs. Despite their wide use, the computational complexity of these programs has not been thoroughly examined.
Results: In this work, we show that in the standard approach of iteration through all triangles of SymBets, the memory scales with at least the number of these triangles, O(g3) (where g = number of genomes), and construction time scales with the iteration through each pair, i.e. O(g6). We propose the EdgeSearch algorithm that iterates over edges in the SymBet graph rather than triangles of SymBets, and as a result has a worst-case complexity of only O(g3log g). Several optimizations reduce the run-time even further in realistically sparse graphs. In two real-world datasets of genomes from bacteriophages (POGs) and Mollicutes (MOGs), an implementation of the EdgeSearch algorithm runs about an order of magnitude faster than the original algorithm and scales much better with increasing number of genomes, with only minor differences in the final results, and up to 60 times faster than the popular OrthoMCL program with a 90% overlap between the identified groups of orthologs.
Availability and implementation: C++ source code freely available for download at ftp.ncbi.nih.gov/pub/wolf/COGs/COGsoft/
Supplementary information: Supplementary materials are available at Bioinformatics online.
A long-standing assumption in evolutionary biology is that the evolution rate of protein-coding genes depends, largely, on specific constraints that affect the function of the given protein. However, recent research in evolutionary systems biology revealed unexpected, significant correlations between evolution rate and characteristics of genes or proteins that are not directly related to specific protein functions, such as expression level and protein–protein interactions. The strongest connections were consistently detected between protein sequence evolution rate and the expression level of the respective gene. A recent genome-wide proteomic study revealed an extremely strong correlation between the abundances of orthologous proteins in distantly related animals, the nematode Caenorhabditis elegans and the fruit fly Drosophila melanogaster. We used the extensive protein abundance data from this study along with short-term evolutionary rates (ERs) of orthologous genes in nematodes and flies to estimate the relative contributions of structural–functional constraints and the translation rate to the evolution rate of protein-coding genes. Together the intrinsic constraints and translation rate account for approximately 50% of the variance of the ERs. The contribution of constraints is estimated to be 3- to 5-fold greater than the contribution of translation rate.
protein evolution; structural–functional constraints; misfolding; protein abundance
In order to explore microevolutionary trends in bacteria and archaea, we constructed a data set of 41 alignable tight genome clusters (ATGCs). We show that the ratio of the medians of nonsynonymous to synonymous substitution rates (dN/dS) that is used as a measure of the purifying selection pressure on protein sequences is a stable characteristic of the ATGCs. In agreement with previous findings, parasitic bacteria, notwithstanding the sometimes dramatic genome shrinkage caused by gene loss, are typically subjected to relatively weak purifying selection, presumably owing to relatively small effective population sizes and frequent bottlenecks. However, no evidence of genome streamlining caused by strong selective pressure was found in any of the ATGCs. On the contrary, a significant positive correlation between the genome size, as well as gene size, and selective pressure was observed, although a variety of free-living prokaryotes with very close selective pressures span nearly the entire range of genome sizes. In addition, we examined the connections between the sequence evolution rate and other genomic features. Although gene order changes much faster than protein sequences during the evolution of prokaryotes, a strong positive correlation was observed between the “rearrangement distance” and the amino acid distance, suggesting that at least some of the events leading to genome rearrangement are subjected to the same type of selective constraints as the evolution of amino acid sequences.
The first bacterial genome was sequenced in 1995, and the first archaeal genome in 1996. Soon after these breakthroughs, an exponential rate of genome sequencing was established, with a doubling time of approximately 20 months for bacteria and approximately 34 months for archaea. Comparative analysis of the hundreds of sequenced bacterial and dozens of archaeal genomes leads to several generalizations on the principles of genome organization and evolution. A crucial finding that enables functional characterization of the sequenced genomes and evolutionary reconstruction is that the majority of archaeal and bacterial genes have conserved orthologs in other, often, distant organisms. However, comparative genomics also shows that horizontal gene transfer (HGT) is a dominant force of prokaryotic evolution, along with the loss of genetic material resulting in genome contraction. A crucial component of the prokaryotic world is the mobilome, the enormous collection of viruses, plasmids and other selfish elements, which are in constant exchange with more stable chromosomes and serve as HGT vehicles. Thus, the prokaryotic genome space is a tightly connected, although compartmentalized, network, a novel notion that undermines the ‘Tree of Life’ model of evolution and requires a new conceptual framework and tools for the study of prokaryotic evolution.
The database of Alignable Tight Genomic Clusters (ATGCs) consists of closely related genomes of archaea and bacteria, and is a resource for research into prokaryotic microevolution. Construction of a data set with appropriate characteristics is a major hurdle for this type of studies. With the current rate of genome sequencing, it is difficult to follow the progress of the field and to determine which of the available genome sets meet the requirements of a given research project, in particular, with respect to the minimum and maximum levels of similarity between the included genomes. Additionally, extraction of specific content, such as genomic alignments or families of orthologs, from a selected set of genomes is a complicated and time-consuming process. The database addresses these problems by providing an intuitive and efficient web interface to browse precomputed ATGCs, select appropriate ones and access ATGC-derived data such as multiple alignments of orthologous proteins, matrices of pairwise intergenomic distances based on genome-wide analysis of synonymous and nonsynonymous substitution rates and others. The ATGC database will be regularly updated following new releases of the NCBI RefSeq. The database is hosted by the Genomics Division at Lawrence Berkeley National laboratory and is publicly available at http://atgc.lbl.gov
The set of conserved eukaryotic protein-coding genes includes distinct subsets one of which appears to be most closely related to and, by inference, derived from archaea, whereas another one appears to be of bacterial, possibly, endosymbiotic origin. The “archaeal” genes of eukaryotes, primarily, encode components of information-processing systems, whereas the “bacterial” genes are predominantly operational. The precise nature of the archaeo–eukaryotic relationship remains uncertain, and it has been variously argued that eukaryotic informational genes evolved from the homologous genes of Euryarchaeota or Crenarchaeota (the major branches of extant archaea) or that the origin of eukaryotes lies outside the known diversity of archaea. We describe a comprehensive set of 355 eukaryotic genes of apparent archaeal origin identified through ortholog detection and phylogenetic analysis. Phylogenetic hypothesis testing using constrained trees, combined with a systematic search for shared derived characters in the form of homologous inserts in conserved proteins, indicate that, for the majority of these genes, the preferred tree topology is one with the eukaryotic branch placed outside the extant diversity of archaea although small subsets of genes show crenarchaeal and euryarchaeal affinities. Thus, the archaeal genes in eukaryotes appear to descend from a distinct, ancient, and otherwise uncharacterized archaeal lineage that acquired some euryarchaeal and crenarchaeal genes via early horizontal gene transfer.
archaea; eukaryotes; Euryarchaeota; Crenarchaeota; phylogenetic analysis
Gene duplication is a crucial mechanism of evolutionary innovation. A substantial fraction of eukaryotic genomes consists of paralogous gene families. We assess the extent of ancestral paralogy, which dates back to the last common ancestor of all eukaryotes, and examine the origins of the ancestral paralogs and their potential roles in the emergence of the eukaryotic cell complexity. A parsimonious reconstruction of ancestral gene repertoires shows that 4137 orthologous gene sets in the last eukaryotic common ancestor (LECA) map back to 2150 orthologous sets in the hypothetical first eukaryotic common ancestor (FECA) [paralogy quotient (PQ) of 1.92]. Analogous reconstructions show significantly lower levels of paralogy in prokaryotes, 1.19 for archaea and 1.25 for bacteria. The only functional class of eukaryotic proteins with a significant excess of paralogous clusters over the mean includes molecular chaperones and proteins with related functions. Almost all genes in this category underwent multiple duplications during early eukaryotic evolution. In structural terms, the most prominent sets of paralogs are superstructure-forming proteins with repetitive domains, such as WD-40 and TPR. In addition to the true ancestral paralogs which evolved via duplication at the onset of eukaryotic evolution, numerous pseudoparalogs were detected, i.e. homologous genes that apparently were acquired by early eukaryotes via different routes, including horizontal gene transfer (HGT) from diverse bacteria. The results of this study demonstrate a major increase in the level of gene paralogy as a hallmark of the early evolution of eukaryotes.
We describe a simple theoretical framework for identifying orthologous sets of genes that deviate from a clock-like model of evolution. The approach used is based on comparing the evolutionary distances within a set of orthologs to a standard intergenomic distance, which was defined as the median of the distribution of the distances between all one-to-one orthologs. Under the clock-like model, the points on a plot of intergenic distances versus intergenomic distances are expected to fit a straight line. A statistical technique to identify significant deviations from the clock-like behavior is described. For several hundred analyzed orthologous sets representing three well-defined bacterial lineages, the α-Proteobacteria, the γ-Proteobacteria, and the Bacillus-Clostridium group, the clock-like null hypothesis could not be rejected for ∼70% of the sets, whereas the rest showed substantial anomalies. Subsequent detailed phylogenetic analysis of the genes with the strongest deviations indicated that over one-half of these genes probably underwent a distinct form of horizontal gene transfer, xenologous gene displacement, in which a gene is displaced by an ortholog from a different lineage. The remaining deviations from the clock-like model could be explained by lineage-specific acceleration of evolution. The results indicate that although xenologous gene displacement is a major force in bacterial evolution, a significant majority of orthologous gene sets in three major bacterial lineages evolved in accordance with the clock-like model. The approach described here allows rapid detection of deviations from this mode of evolution on the genome scale.