We compare the sets of experimentally validated long intergenic non-coding (linc)RNAs from human and mouse and apply a maximum likelihood approach to estimate the total number of lincRNA genes as well as the size of the conserved part of the lincRNome. Under the assumption that the sets of experimentally validated lincRNAs are random samples of the lincRNomes of the corresponding species, we estimate the total lincRNome size at approximately 40,000 to 50,000 species, at least twice the number of protein-coding genes. We further estimate that the fraction of the human and mouse euchromatic genomes encoding lincRNAs is more than twofold greater than the fraction of protein-coding sequences. Although the sequences of most lincRNAs are much less strongly conserved than protein sequences, the extent of orthology between the lincRNomes is unexpectedly high, with 60 to 70% of the lincRNA genes shared between human and mouse. The orthologous mammalian lincRNAs can be predicted to perform equivalent functions; accordingly, it appears likely that thousands of evolutionarily conserved functional roles of lincRNAs remain to be characterized.
Genome analysis of humans and other mammals reveals a surprisingly small number of protein-coding genes, only slightly over 20,000 (although the diversity of actual proteins is substantially augmented by alternative transcription and alternative splicing). Recent analysis of the mammalian genomes and transcriptomes, in particular, using the RNAseq technology, shows that, in addition to protein-coding genes, mammalian genomes encode many long non-coding RNAs. For some of these transcripts, various regulatory functions have been demonstrated, but on the whole the repertoire of long non-coding RNAs remains poorly characterized. We compared the identified long intergenic non-coding (linc)RNAs from human and mouse, and employed a specially developed statistical technique to estimate the size and evolutionary conservation of the human and mouse lincRNomes. The estimates show that there are at least twice as many human and mouse lincRNAs than there are protein-coding genes. Moreover, about two third of the lincRNA genes appear to be conserved between human and mouse, implying thousands of conserved but still uncharacterized functions.
Clusters of localized hypermutation in human breast cancer genomes, named “kataegis” (from the Greek for thunderstorm), are hypothesized to result from multiple cytosine deaminations catalyzed by AID/APOBEC proteins. However, a direct link between APOBECs and kataegis is still lacking. We have sequenced the genomes of yeast mutants induced in diploids by expression of the gene for PmCDA1, a hypermutagenic deaminase from sea lamprey. Analysis of the distribution of 5,138 induced mutations revealed localized clusters very similar to those found in tumors. Our data provide evidence that unleashed cytosine deaminase activity is an evolutionary conserved, prominent source of genome-wide kataegis events.
This article was reviewed by: Professor Sandor Pongor, Professor Shamil R. Sunyaev, and Dr Vladimir Kuznetsov.
APOBEC; Deaminase; Mutation; Kataegis; Cancer; Diploid yeast; Hypermutation
A substantial fraction of eukaryotic proteins contains multiple domains, some of which show a tendency to occur in diverse domain architectures and can be considered mobile (or ‘promiscuous’). These promiscuous domains are typically involved in protein–protein interactions and play crucial roles in interaction networks, particularly those contributing to signal transduction. They also play a major role in creating diversity of protein domain architecture in the proteome. It is now apparent that promiscuity is a volatile and relatively fast-changing feature in evolution, and that only a few domains retain their promiscuity status throughout evolution. Many such domains attained their promiscuity status independently in different lineages. Only recently, we have begun to understand the diversity of protein domain architectures and the role the promiscuous domains play in evolution of this diversity. However, many of the biological mechanisms of protein domain mobility remain shrouded in mystery. In this review, we discuss our present understanding of protein domain promiscuity, its evolution and its role in cellular function.
mobile domain; promiscuous domain; domain network; domain architecture; domain evolution
In order to maintain visual sensitivity at all light levels, the vertebrate eye possesses a mechanism to regenerate the visual pigment chromophore 11-cis retinal in the dark enzymatically, unlike in all other taxa, which rely on photoisomerization. This mechanism is termed the visual cycle and is localized to the retinal pigment epithelium (RPE), a support layer of the neural retina. Speculation has long revolved around whether more primitive chordates, such as tunicates and cephalochordates, anticipated this feature. The two key enzymes of the visual cycle are RPE65, the visual cycle all-trans retinyl ester isomerohydrolase, and lecithin:retinol acyltransferase (LRAT), which generates RPE65’s substrate. We hypothesized that the origin of the vertebrate visual cycle is directly connected to an ancestral carotenoid oxygenase acquiring a new retinyl ester isomerohydrolase function. Our phylogenetic analyses of the RPE65/BCMO and N1pC/P60 (LRAT) superfamilies show that neither RPE65 nor LRAT orthologs occur in tunicates (Ciona) or cephalochordates (Branchiostoma), but occur in Petromyzon marinus (Sea Lamprey), a jawless vertebrate. The closest homologs to RPE65 in Ciona and Branchiostoma lacked predicted functionally diverged residues found in all authentic RPE65s, but lamprey RPE65 contained all of them. We cloned RPE65 and LRATb cDNAs from lamprey RPE and demonstrated appropriate enzymatic activities. We show that Ciona ß-carotene monooxygenase a (BCMOa) (previously annotated as an RPE65) has carotenoid oxygenase cleavage activity but not RPE65 activity. We verified the presence of RPE65 in lamprey RPE by immunofluorescence microscopy, immunoblot and mass spectrometry. On the basis of these data we conclude that the crucial transition from the typical carotenoid double bond cleavage functionality (BCMO) to the isomerohydrolase functionality (RPE65), coupled with the origin of LRAT, occurred subsequent to divergence of the more primitive chordates (tunicates, etc.) in the last common ancestor of the jawless and jawed vertebrates.
Among thousands of long non-coding RNAs (lncRNAs) only a small subset is functionally characterized and the functional annotation of lncRNAs on the genomic scale remains inadequate. In this study we computationally characterized two functionally different parts of human lncRNAs transcriptome based on their ability to bind the polycomb repressive complex, PRC2. This classification is enabled by the fact that while all lncRNAs constitute a diverse set of sequences, the classes of PRC2-binding and PRC2 non-binding lncRNAs possess characteristic combinations of sequence-structure patterns and, therefore, can be separated within the feature space. Based on the specific combination of features, we built several machine-learning classifiers and identified the SVM-based classifier as the best performing. We further showed that the SVM-based classifier is able to generalize on the independent data sets. We observed that this classifier, trained on the human lncRNAs, can predict up to 59.4% of PRC2-binding lncRNAs in mice. This suggests that, despite the low degree of sequence conservation, many lncRNAs play functionally conserved biological roles.
Spliceosomal introns are one of the principal distinctive features of eukaryotes. Nevertheless, different large-scale studies disagree about even the most basic features of their evolution. In order to come up with a more reliable reconstruction of intron evolution, we developed a model that is far more comprehensive than previous ones. This model is rich in parameters, and estimating them accurately is infeasible by straightforward likelihood maximization. Thus, we have developed an expectation-maximization algorithm that allows for efficient maximization. Here, we outline the model and describe the expectation-maximization algorithm in detail. Since the method works with intron presence–absence maps, it is expected to be instrumental for the analysis of the evolution of other binary characters as well.
Maximum likelihood; expectation-maximization; intron evolution; ancestral reconstruction; eukaryotic gene structure
It was proposed that if some mRNA characteristics resulted in a low efficiency of termination signal, an additional closely located stop codon (tandem stop codons) could be used to prevent the harmful readthrough. However, the role of tandem terminators in higher eukaryotes was not verified and remains hypothetical. In this work the sequence features of Arabidopsis thaliana and Oryza sativa mRNAs were analyzed. It was found that plant mRNAs with UGA terminator were characterized by a higher frequency of nonsense codons in the first triplet position of 3′-UTR that could result from a weak natural selection for “reserve” stop signal. Interestingly, the presence of tandem stop codons positively correlated with a specific amino acid composition in the C-terminal position of the encoded proteins. In particular, C-terminal glycine positively correlated with significantly higher frequencies of reserve terminators at the beginning positions of 3′-UTR in UGA-containing mRNAs. This finding coincides with some earlier observations concerning the role of glycine and its codons in inefficient termination of translation and recoding (e.g., 2A oligopeptide).
mRNA; Arabidopsis thaliana; Oryza sativa; stop codon; tandem terminators; readthrough
The two types of eukaryotic spliceosomal introns, U2 and U12, possess different splice signals and are excised by distinct spliceosomes. The nature of the primordial introns remains uncertain. A comparison of the amino acid distributions at insertion sites of introns that retained their positions throughout eukaryotic evolution with the distributions for human and Arabidopsis thaliana U2 and U12 introns reveals close similarity with U2 but not U12. Thus, the primordial spliceosomal introns were, most likely, U2-type.
Evolution of exon-intron structure of eukaryotic genes has been a matter of long-standing, intensive debate. The introns-early concept, later rebranded ‘introns first’ held that protein-coding genes were interrupted by numerous introns even at the earliest stages of life's evolution and that introns played a major role in the origin of proteins by facilitating recombination of sequences coding for small protein/peptide modules. The introns-late concept held that introns emerged only in eukaryotes and new introns have been accumulating continuously throughout eukaryotic evolution. Analysis of orthologous genes from completely sequenced eukaryotic genomes revealed numerous shared intron positions in orthologous genes from animals and plants and even between animals, plants and protists, suggesting that many ancestral introns have persisted since the last eukaryotic common ancestor (LECA). Reconstructions of intron gain and loss using the growing collection of genomes of diverse eukaryotes and increasingly advanced probabilistic models convincingly show that the LECA and the ancestors of each eukaryotic supergroup had intron-rich genes, with intron densities comparable to those in the most intron-rich modern genomes such as those of vertebrates. The subsequent evolution in most lineages of eukaryotes involved primarily loss of introns, with only a few episodes of substantial intron gain that might have accompanied major evolutionary innovations such as the origin of metazoa. The original invasion of self-splicing Group II introns, presumably originating from the mitochondrial endosymbiont, into the genome of the emerging eukaryote might have been a key factor of eukaryogenesis that in particular triggered the origin of endomembranes and the nucleus. Conversely, splicing errors gave rise to alternative splicing, a major contribution to the biological complexity of multicellular eukaryotes. There is no indication that any prokaryote has ever possessed a spliceosome or introns in protein-coding genes, other than relatively rare mobile self-splicing introns. Thus, the introns-first scenario is not supported by any evidence but exon-intron structure of protein-coding genes appears to have evolved concomitantly with the eukaryotic cell, and introns were a major factor of evolution throughout the history of eukaryotes. This article was reviewed by I. King Jordan, Manuel Irimia (nominated by Anthony Poole), Tobias Mourier (nominated by Anthony Poole), and Fyodor Kondrashov. For the complete reports, see the Reviewers’ Reports section.
Intron sliding; Intron gain; Intron loss; Spliceosome; Splicing signals; Evolution of exon/intron structure; Alternative splicing; Phylogenetic trees; Mobile domains; Eukaryotic ancestor
Mammalian genomes contain numerous genes for long noncoding RNAs (lncRNAs). The functions of the lncRNAs remain largely unknown but their evolution appears to be constrained by purifying selection, albeit relatively weakly. To gain insights into the mode of evolution and the functional range of the lncRNA, they can be compared with much better characterized protein-coding genes. The evolutionary rate of the protein-coding genes shows a universal negative correlation with expression: highly expressed genes are on average more conserved during evolution than the genes with lower expression levels. This correlation was conceptualized in the misfolding-driven protein evolution hypothesis according to which misfolding is the principal cost incurred by protein expression. We sought to determine whether long intergenic ncRNAs (lincRNAs) follow the same evolutionary trend and indeed detected a moderate but statistically significant negative correlation between the evolutionary rate and expression level of human and mouse lincRNA genes. The magnitude of the correlation for the lincRNAs is similar to that for equal-sized sets of protein-coding genes with similar levels of sequence conservation. Additionally, the expression level of the lincRNAs is significantly and positively correlated with the predicted extent of lincRNA molecule folding (base-pairing), however, the contributions of evolutionary rates and folding to the expression level are independent. Thus, the anticorrelation between evolutionary rate and expression level appears to be a general feature of gene evolution that might be caused by similar deleterious effects of protein and RNA misfolding and/or other factors, for example, the number of interacting partners of the gene product.
long noncoding RNA; ncRNA; RNA expression; genomic alignments; introns; RNA folding
Protein-coding genes in eukaryotes are interrupted by introns, but intron densities widely differ between eukaryotic lineages. Vertebrates, some invertebrates and green plants have intron-rich genes, with 6–7 introns per kilobase of coding sequence, whereas most of the other eukaryotes have intron-poor genes. We reconstructed the history of intron gain and loss using a probabilistic Markov model (Markov Chain Monte Carlo, MCMC) on 245 orthologous genes from 99 genomes representing the three of the five supergroups of eukaryotes for which multiple genome sequences are available. Intron-rich ancestors are confidently reconstructed for each major group, with 53 to 74% of the human intron density inferred with 95% confidence for the Last Eukaryotic Common Ancestor (LECA). The results of the MCMC reconstruction are compared with the reconstructions obtained using Maximum Likelihood (ML) and Dollo parsimony methods. An excellent agreement between the MCMC and ML inferences is demonstrated whereas Dollo parsimony introduces a noticeable bias in the estimations, typically yielding lower ancestral intron densities than MCMC and ML. Evolution of eukaryotic genes was dominated by intron loss, with substantial gain only at the bases of several major branches including plants and animals. The highest intron density, 120 to 130% of the human value, is inferred for the last common ancestor of animals. The reconstruction shows that the entire line of descent from LECA to mammals was intron-rich, a state conducive to the evolution of alternative splicing.
In eukaryotes, protein-coding genes are interrupted by non-coding introns. The intron densities widely differ, from 6–7 introns per kilobase of coding sequence in vertebrates, some invertebrates and plants, to only a few introns across the entire genome in many unicellular forms. We applied a robust statistical methodology, Markov Chain Monte Carlo, to reconstruct the history of intron gain and loss throughout the evolution of eukaryotes using a set of 245 homologous genes from 99 genomes that represent the diversity of eukaryotes. Intron-rich ancestors were confidently inferred for each major eukaryotic group including 53% to 74% of the human intron density for the last eukaryotic common ancestor, and 120% to 130% of the human value for the last common ancestor of animals. Evolution of eukaryotic genes involved primarily intron loss, with substantial gain only at the bases of several major branches including plants and animals. Thus, the common ancestor of all extant eukaryotes was a complex organism with a gene architecture resembling those in multicellular organisms. The line of descent from the last common ancestor to mammals was an uninterrupted intron-rich state that, given the error-prone splicing in intron-rich organisms, was conducive to the elaboration of functional alternative splicing.
The deaminase-like fold includes, in addition to nucleic acid/nucleotide deaminases, several catalytic domains such as the JAB domain, and others involved in nucleotide and ADP-ribose metabolism. Using sensitive sequence and structural comparison methods, we develop a comprehensive natural classification of the deaminase-like fold and show that its ancestral version was likely to operate on nucleotides or nucleic acids. Consequently, we present evidence that a specific group of JAB domains are likely to possess a DNA repair function, distinct from the previously known deubiquitinating peptidase activity. We also identified numerous previously unknown clades of nucleic acid deaminases. Using inference based on contextual information, we suggest that most of these clades are toxin domains of two distinct classes of bacterial toxin systems, namely polymorphic toxins implicated in bacterial interstrain competition and those that target distantly related cells. Genome context information suggests that these toxins might be delivered via diverse secretory systems, such as Type V, Type VI, PVC and a novel PrsW-like intramembrane peptidase-dependent mechanism. We propose that certain deaminase toxins might be deployed by diverse extracellular and intracellular pathogens as also endosymbionts as effectors targeting nucleic acids of host cells. Our analysis suggests that these toxin deaminases have been acquired by eukaryotes on several independent occasions and recruited as organellar or nucleo-cytoplasmic RNA modifiers, operating on tRNAs, mRNAs and short non-coding RNAs, and also as mutators of hyper-variable genes, viruses and selfish elements. This scenario potentially explains the origin of mutagenic AID/APOBEC-like deaminases, including novel versions from Caenorhabditis, Nematostella and diverse algae and a large class of fast-evolving fungal deaminases. These observations greatly expand the distribution of possible unidentified mutagenic processes catalyzed by nucleic acid deaminases.
Accurate estimation of the divergence time of the extant eukaryotes is a fundamentally important but extremely difficult problem owing primarily to gross violations of the molecular clock at long evolutionary distances and the lack of appropriate calibration points close to the date of interest. These difficulties are intrinsic to the dating of ancient divergence events and are reflected in the large discrepancies between estimates obtained with different approaches. Estimates of the age of Last Eukaryotic Common Ancestor (LECA) vary approximately twofold, from ~1,100 million years ago (Mya) to ~2,300 Mya.
We applied the genome-wide analysis of rare genomic changes associated with conserved amino acids (RGC_CAs) and used several independent techniques to obtain date estimates for the divergence of the major lineages of eukaryotes with calibration intervals for insects, land plants and vertebrates. The results suggest an early divergence of monocot and dicot plants, approximately 340 Mya, raising the possibility of plant-insect coevolution. The divergence of bilaterian animal phyla is estimated at ~400-700 Mya, a range of dates that is consistent with cladogenesis immediately preceding the Cambrian explosion. The origin of opisthokonts (the supergroup of eukaryotes that includes metazoa and fungi) is estimated at ~700-1,000 Mya, and the age of LECA at ~1,000-1,300 Mya. We separately analyzed the red algal calibration interval which is based on single fossil. This analysis produced time estimates that were systematically older compared to the other estimates. Nevertheless, the majority of the estimates for the age of the LECA using the red algal data fell within the 1,200-1,400 Mya interval.
The inference of a "young LECA" is compatible with the latest of previously estimated dates and has substantial biological implications. If these estimates are valid, the approximately 1 to 1.4 billion years of evolution of eukaryotes that is open to comparative-genomic study probably was preceded by hundreds of millions years of evolution that might have included extinct diversity inaccessible to comparative approaches.
This article was reviewed by William Martin, Herve Philippe (nominated by I. King Jordan), and Romain Derelle.
bilateria; opisthokonts; angiosperms; last eukaryotic common ancestor; molecular dating
Yeast DNA polymerase ε (Pol ε) is a highly accurate and processive enzyme that participates in nuclear DNA replication of the leading strand template. In addition to a large subunit (Pol2) harboring the polymerase and proofreading exonuclease active sites, Pol ε also has one essential subunit (Dpb2) and two smaller, non-essential subunits (Dpb3 and Dpb4) whose functions are not fully understood. To probe the functions of Dpb3 and Dpb4, here we investigate the consequences of their absence on the biochemical properties of Pol ε in vitro and on genome stability in vivo. The fidelity of DNA synthesis in vitro by purified Pol2/Dpb2, i.e. lacking Dpb3 and Dpb4, is comparable to the four-subunit Pol ε holoenzyme. Nonetheless, deletion of DPB3 and DPB4 elevates spontaneous frameshift and base substitution rates in vivo, to the same extent as the loss of Pol ε proofreading activity in a pol2-4 strain. In contrast to pol2-4, however, the dpb3Δdpb4Δ does not lead to a synergistic increase of mutation rates with defects in DNA mismatch repair. The increased mutation rate in dpb3Δdpb4Δ strains is partly dependent on REV3, as well as the proofreading capacity of Pol δ. Finally, biochemical studies demonstrate that the absence of Dpb3 and Dpb4 destabilizes the interaction between Pol ε and the template DNA during processive DNA synthesis and during processive 3′ to 5′exonucleolytic degradation of DNA. Collectively, these data suggest a model wherein Dpb3 and Dpb4 do not directly influence replication fidelity per se, but rather contribute to normal replication fork progression. In their absence, a defective replisome may more frequently leave gaps on the leading strand that are eventually filled by Pol ζ or Pol δ, in a post-replication process that generates errors not corrected by the DNA mismatch repair system.
The high fidelity of DNA replication is safeguarded by the accuracy of nucleotide selection by DNA polymerases, proofreading activity of the replicative polymerases, and the DNA mismatch repair system. Errors made by replicative polymerases are corrected by mismatch repair, and inactivation of the mismatch repair system results in a multiplicative increase in error rates when combined with a proofreading deficient allele of a replicative polymerase. In this study, we demonstrate that the deletion of two non-essential genes encoding for two subunits of Pol ε give an increased mutation rate due to increased synthesis by the error-prone DNA polymerase ζ. Surprisingly, there was no multiplicative increase in error rates when the mismatch repair system was inactivated. We propose that the deletion of DPB3 and DPB4 gives a defective replisome, which in turn gives increased synthesis, in part, by Pol ζ during an error-prone post-replication process that is not efficiently repaired by the mismatch repair system.
Retroposition, a leading mechanism for gene duplication, is an important process shaping the evolution of genomes. Retrogenes are also involved in the gene structure evolution as a major player in the process of intron deletion. Here, we demonstrate the role of retrogenes in intron gain in mammals. We identified one case of “intronization,” the transformation of exonic sequences into an intron, in the primate specific retrogene RNF113B and two independent “intronization” events in the retrogene DCAF12L2, one in the common ancestor of primates and rodents and another one in the rodent lineage. Intron gain resulted from the origin of new splice variants, and both genes have two transcript forms, one with retained intron and one with the intron spliced out. Evolution of these genes, especially RNF113B, has been very dynamic and has been accompanied by several additional events including parental gene loss, secondary retroposition, and exaptation of transposable elements.
intron gain; gene structure evolution; splice variant; RNF113; DCAF12
Evolutionary binary characters are features of species or genes, indicating the absence (value zero) or presence (value one) of some property. Examples include eukaryotic gene architecture (the presence or absence of an intron in a particular locus), gene content, and morphological characters. In many studies, the acquisition of such binary characters is assumed to represent a rare evolutionary event, and consequently, their evolution is analyzed using various flavors of parsimony. However, when gain and loss of the character are not rare enough, a probabilistic analysis becomes essential. Here, we present a comprehensive probabilistic model to describe the evolution of binary characters on a bifurcating phylogenetic tree. A fast software tool, EREM, is provided, using maximum likelihood to estimate the parameters of the model and to reconstruct ancestral states (presence and absence in internal nodes) and events (gain and loss events along branches).
The deep phylogeny of eukaryotes is an important but extremely difficult problem of evolutionary biology. Five eukaryotic supergroups are relatively well established but the relationship between these supergroups remains elusive, and their divergence seems to best fit a “Big Bang” model. Attempts were made to root the tree of eukaryotes by using potential derived shared characters such as unique fusions of conserved genes. One popular model of eukaryotic evolution that emerged from this type of analysis is the unikont–bikont phylogeny: The unikont branch consists of Metazoa, Choanozoa, Fungi, and Amoebozoa, whereas bikonts include the rest of eukaryotes, namely, Plantae (green plants, Chlorophyta, and Rhodophyta), Chromalveolata, excavates, and Rhizaria. We reexamine the relationships between the eukaryotic supergroups using a genome-wide analysis of rare genomic changes (RGCs) associated with multiple, conserved amino acids (RGC_CAMs and RGC_CAs), to resolve trifurcations of major eukaryotic lineages. The results do not support the basal position of Chromalveolata with respect to Plantae and unikonts or the monophyly of the bikont group and appear to be best compatible with the monophyly of unikonts and Chromalveolata. Chromalveolata show a distinct, additional signal of affinity with Plantae, conceivably, owing to genes transferred from the secondary, red algal symbiont. Excavates are derived forms, with extremely long branches that complicate phylogenetic inference; nevertheless, the RGC analysis suggests that they are significantly more likely to cluster with the unikont–Chromalveolata assemblage than with the Plantae. Thus, the first split in eukaryotic evolution might lie between photosynthetic and nonphotosynthetic forms and so could have been triggered by the endosymbiosis between an ancestral unicellular eukaryote and a cyanobacterium that gave rise to the chloroplast.
eukaryotic phylogeny; rare genomic changes; parsimony; substitutions; insertions; deletions
To probe Pol ζ functions in vivo via its error signature, here we report the properties of Saccharomyces cerevisiae Pol ζ in which phenyalanine was substituted for the conserved Leu-979 in the catalytic (Rev3) subunit. We show that purified L979F Pol ζ is 30% as active as wild-type Pol ζ when replicating undamaged DNA. L979F Pol ζ shares with wild-type Pol ζ the ability to perform moderately processive DNA synthesis. When copying undamaged DNA, L979F Pol ζ is error-prone compared to wild-type Pol ζ, providing a biochemical rationale for the observed mutator phenotype of rev3-L979F yeast strains. Errors generated by L979F Pol ζ in vitro include single-base insertions, deletions and substitutions, with the highest error rates involving stable misincorporation of dAMP and dGMP. L979F Pol ζ also generates multiple errors in close proximity to each other. The frequency of these events far exceeds that expected for independent single changes, indicating that the first error increases the probability of additional errors within 10 nucleotides. Thus L979F Pol ζ, and perhaps wild-type Pol ζ, which also generates clustered mutations at a lower but significant rate, performs short patches of processive, error-prone DNA synthesis. This may explain the origin of some multiple clustered mutations observed in vivo.
Alternative splicing (AS) in protein-coding sequences has emerged as an important mechanism of regulation and diversification of animal gene function. By contrast, the extent and roles of alternative events including AS and alternative transcription initiation (ATI) within the 5'-untranslated regions (5'UTRs) of mammalian genes are not well characterized.
We evaluated the abundance, conservation and evolution of putative regulatory control elements, namely, upstream start codons (uAUGs) and open reading frames (uORFs), in the 5'UTRs of human and mouse genes impacted by alternative events. For genes with alternative 5'UTRs, the fraction of alternative sequences (those present in a subset of the transcripts) is much greater than that in the corresponding coding sequence, conceivably, because 5'UTRs are not bound by constraints on protein structure that limit AS in coding regions. Alternative regions of mammalian 5'UTRs evolve faster and are subject to a weaker purifying selection than constitutive portions. This relatively weak selection results in over-abundance of uAUGs and uORFs in the alternative regions of 5'UTRs compared to constitutive regions. Nevertheless, even in alternative regions, uORFs evolve under a stronger selection than the rest of the sequences, indicating that some of the uORFs are conserved regulatory elements; some of the non-conserved uORFs could be involved in species-specific regulation.
The findings on the evolution and selection in alternative and constitutive regions presented here are consistent with the hypothesis that alternative events, namely, AS and ATI, in 5'UTRs of mammalian genes are likely to contribute to the regulation of translation.
Evolution of DNA polymerases, the key enzymes of DNA replication and repair, is central to any reconstruction of the history of cellular life. However, the details of the evolutionary relationships between DNA polymerases of archaea and eukaryotes remain unresolved.
We performed a comparative analysis of archaeal, eukaryotic, and bacterial B-family DNA polymerases, which are the main replicative polymerases in archaea and eukaryotes, combined with an analysis of domain architectures. Surprisingly, we found that eukaryotic Polymerase ε consists of two tandem exonuclease-polymerase modules, the active N-terminal module and a C-terminal module in which both enzymatic domains are inactivated. The two modules are only distantly related to each other, an observation that suggests the possibility that Pol ε evolved as a result of insertion and subsequent inactivation of a distinct polymerase, possibly, of bacterial descent, upstream of the C-terminal Zn-fingers, rather than by tandem duplication. The presence of an inactivated exonuclease-polymerase module in Pol ε parallels a similar inactivation of both enzymatic domains in a distinct family of archaeal B-family polymerases. The results of phylogenetic analysis indicate that eukaryotic B-family polymerases, most likely, originate from two distantly related archaeal B-family polymerases, one form giving rise to Pol ε, and the other one to the common ancestor of Pol α, Pol δ, and Pol ζ. The C-terminal Zn-fingers that are present in all eukaryotic B-family polymerases, unexpectedly, are homologous to the Zn-finger of archaeal D-family DNA polymerases that are otherwise unrelated to the B family. The Zn-finger of Polε shows a markedly greater similarity to the counterpart in archaeal PolD than the Zn-fingers of other eukaryotic B-family polymerases.
Evolution of eukaryotic DNA polymerases seems to have involved previously unnoticed complex events. We hypothesize that the archaeal ancestor of eukaryotes encoded three DNA polymerases, namely, two distinct B-family polymerases and a D-family polymerase all of which contributed to the evolution of the eukaryotic replication machinery. The Zn-finger might have been acquired from PolD by the B-family form that gave rise to Pol ε prior to or in the course of eukaryogenesis, and subsequently, was captured by the ancestor of the other B-family eukaryotic polymerases. The inactivated polymerase-exonuclease module of Pol ε might have evolved by fusion with a distinct polymerase, rather than by duplication of the active module of Pol ε, and is likely to play an important role in the assembly of eukaryotic replication and repair complexes.
This article was reviewed by Patrick Forterre, Arcady Mushegian, and Chris Ponting. For the full reviews, please go to the Reviewers' Reports section.
Evolution of protein sequences is largely governed by purifying selection, with a small fraction of proteins evolving under positive selection. The evolution at synonymous positions in protein-coding genes is not nearly as well understood, with the extent and types of selection remaining, largely, unclear. A statistical test to identify purifying and positive selection at synonymous sites in protein-coding genes was developed. The method compares the rate of evolution at synonymous sites (Ks) to that in intron sequences of the same gene after sampling the aligned intron sequences to mimic the statistical properties of coding sequences. We detected purifying selection at synonymous sites in ∼28% of the 1,562 analyzed orthologous genes from mouse and rat, and positive selection in ∼12% of the genes. Thus, the fraction of genes with readily detectable positive selection at synonymous sites is much greater than the fraction of genes with comparable positive selection at nonsynonymous sites, i.e., at the level of the protein sequence. Unlike other genes, the genes with positive selection at synonymous sites showed no correlation between Ks and the rate of evolution in nonsynonymous sites (Ka), indicating that evolution of synonymous sites under positive selection is decoupled from protein evolution. The genes with purifying selection at synonymous sites showed significant anticorrelation between Ks and expression level and breadth, indicating that highly expressed genes evolve slowly. The genes with positive selection at synonymous sites showed the opposite trend, i.e., highly expressed genes had, on average, higher Ks. For the genes with positive selection at synonymous sites, a significantly lower mRNA stability is predicted compared to the genes with negative selection. Thus, mRNA destabilization could be an important factor driving positive selection in nonsynonymous sites, probably, through regulation of expression at the level of mRNA degradation and, possibly, also translation rate. So, unexpectedly, we found that positive selection at synonymous sites of mammalian genes is substantially more common than positive selection at the level of protein sequences. Positive selection at synonymous sites might act through mRNA destabilization affecting mRNA levels and translation.
synonymous sites; nonsynonymous sites; positive selection; purifying selection; introns
The eukaryotic DNA polymerase δ (Pol δ) participates in genome replication, homologous recombination, DNA repair and damage tolerance. Regulation of the plethora of Pol δ functions depends on the interaction between the second (p50) and third (p66) non-catalytic subunits. We report the crystal structure of p50•p66N complex featuring oligonucleotide binding and phosphodiesterase domains in p50 and winged helix-turn-helix N-terminal domain in p66. Disruption of the interaction between the yeast orthologs of p50 and p66 by strategic amino acid changes leads to cold-sensitivity, sensitivity to hydroxyurea and to reduced UV mutagenesis, mimicking the phenotypes of strains where the third subunit of Pol δ is absent. The second subunits of all B family replicative DNA polymerases in archaea and eukaryotes, except Pol δ, share a three-domain structure similar to p50•p66N, raising the possibility that a portion of the gene encoding p66 was derived from the second subunit gene relatively late in evolution.
DNA polymerase δ; Pol δ; p50; p66; Pol31; Pol32; OB; Myb; phosphodiesterase; human; yeast
A widespread and highly conserved family of apparently inactivated derivatives of archaeal B-family DNA polymerases is described. Phylogenetic analysis shows that the inactivated forms comprise a distinct clade among archaeal B-family polymerases and that, within this clade, Euryarchaea and Crenarchaea are clearly separated from each other and from a small group of bacterial homologs. These findings are compatible with an ancient duplication of the DNA polymerase gene followed by inactivation and parallel loss in some of the lineages although contribution of horizontal gene transfer cannot be ruled out. The inactivated derivative of the archaeal DNA polymerase could form a complex with the active paralog and play a structural role in DNA replication.
This article was reviewed by Purificacion Lopez-Garcia and Chris Ponting. For the full reviews, please go to the Reviewers' Reports section.
The GT dinucleotide in the first two intron positions is the most conserved element of the U2 donor splice signals. However, in a small fraction of donor sites, GT is replaced by GC. A substantial enrichment of GC in donor sites of alternatively spliced genes has been observed previously in human, nematode and Arabidopsis, suggesting that GC signals are important for regulation of alternative splicing. We used parsimony analysis to reconstruct evolution of donor splice sites and inferred 298 GT > GC conversion events compared to 40 GC > GT conversion events in primate and rodent genomes. Thus, there was substantive accumulation of GC donor splice sites during the evolution of mammals. Accumulation of GC sites might have been driven by selection for alternative splicing.
This article was reviewed by Jerzy Jurka and Anton Nekrutenko. For the full reviews, please go to the Reviewers' Reports section.