Ongoing debates about the functional importance of gene duplications have recently been intensified by a heated discussion of the “ortholog conjecture” (OC). Under the OC, which is central to functional annotation of genomes, orthologous genes are functionally more similar than paralogous genes at the same level of sequence divergence. However, a recent study challenged the OC by reporting a greater functional similarity, in terms of gene ontology (GO) annotations and expression profiles, among within-species paralogs compared to orthologs. These findings were taken to indicate that functional similarity of homologous genes is primarily determined by the cellular context of the genes, rather than evolutionary history. Subsequent studies suggested that the OC appears to be generally valid when applied to mammalian evolution but that the complete picture of the evolution of gene expression also has to incorporate lineage-specific aspects of paralogy. The observed complexity of gene expression evolution after duplication can be explained through selection for gene dosage effect combined with the duplication-degeneration-complementation model. This paper discusses expression divergence of recent duplications occurring before functional divergence of proteins encoded by duplicate genes.
More than half a century after Warburg postulated his theory of the origin of cancer cells, the question of altered metabolism in cancer is again taking center stage. For several decades, the generalized picture of cancer metabolism was replaced by analysis of signaling and oncogenes in each type of cancer. However, now that a wealth of knowledge about tumor suppressors, oncogenes, and signaling pathways is available, reprogramming of cellular metabolism (e.g., an increased glycolysis-to-respiration ratio in cancer cells) has reemerged as an important element of cancer progression. To analyze the expression levels of various proteins, including metabolic enzymes, across various cancers, we used dbEST and UniGene data. We delineated a list of genes that are overexpressed in different types of cancer. We also grouped overexpressed enzymes into KEGG pathways and analyzed adjacent pathways to describe enzymatic reactions that take place in cancer cells and to identify major players that are abundant in cancer protein machinery. Glycolysis/gluconeogenesis and oxidative phosphorylation are the most abundant pathways, although several other pathways are enriched in genes from our list. Ubiquitously overexpressed genes could be marked as nonspecific cancer-associated genes when analyzing genes that are overexpressed in certain types of cancer. Thus, the list of overexpressed genes may be a useful tool for cancer research.
Mitochondria are ubiquitous membranous organelles of eukaryotic cells that evolved from an alpha-proteobacterial endosymbiont and possess a small genome that encompasses from 3 to 106 genes. Accumulation of thousands of mitochondrial genomes from diverse groups of eukaryotes provides an opportunity for a comprehensive reconstruction of the evolution of the mitochondrial gene repertoire.
Clusters of orthologous mitochondrial protein-coding genes (MitoCOGs) were constructed from all available mitochondrial genomes and complemented with nuclear orthologs of mitochondrial genes. With minimal exceptions, the mitochondrial gene complements of eukaryotes are subsets of the superset of 66 genes found in jakobids. Reconstruction of the evolution of mitochondrial genomes indicates that the mitochondrial gene set of the last common ancestor of the extant eukaryotes was slightly larger than that of jakobids. This superset of mitochondrial genes likely represents an intermediate stage following the loss and transfer to the nucleus of most of the endosymbiont genes early in eukaryote evolution. Subsequent evolution in different lineages involved largely parallel transfer of ancestral endosymbiont genes to the nuclear genome. The intron density in nuclear orthologs of mitochondrial genes typically is nearly the same as in the rest of the genes in the respective genomes. However, in land plants, the intron density in nuclear orthologs of mitochondrial genes is almost 1.5-fold lower than the genomic mean, suggestive of ongoing transfer of functional genes from mitochondria to the nucleus.
The MitoCOGs are expected to become an important resource for the study of mitochondrial evolution. The nearly complete superset of mitochondrial genes in jakobids likely represents an intermediate stage in the evolution of eukaryotes after the initial, extensive loss and transfer of the endosymbiont genes. In addition, the bacterial multi-subunit RNA polymerase that is encoded in the jakobid mitochondrial genomes was replaced by a single-subunit phage-type RNA polymerase in the rest of the eukaryotes. These results are best compatible with the rooting of the eukaryotic tree between jakobids and the rest of the eukaryotes. The land plants are the only eukaryotic branch in which the gene transfer from the mitochondrial to the nuclear genome appears to be an active, ongoing process.
Electronic supplementary material
The online version of this article (doi:10.1186/s12862-014-0237-5) contains supplementary material, which is available to authorized users.
Mitochondria; Genome evolution; Gene loss; Gene transfer; Introns; Clusters of orthologous genes
A dramatic increase in the prevalence of autism and Autistic Spectrum Disorders (ASD) has been observed over the last two decades in the USA, Europe and Asia. Given the accumulating data on the possible role of translation in the etiology of ASD, we analyzed potential effects of rare synonymous substitutions associated with ASD on mRNA stability, splicing enhancers and silencers, and codon usage.
Presentation of the hypothesis
We hypothesize that subtle impairment of translation, resulting in dosage imbalance of neuron-specific proteins, contributes to the etiology of ASD synergistically with environmental neurotoxins.
Testing the hypothesis
A statistically significant shift from optimal to suboptimal codons caused by rare synonymous substitutions associated with ASD was detected whereas no effect on other analyzed characteristics of transcripts was identified. This result suggests that the impact of rare codons on the translation of genes involved in neuron development, even if slight in magnitude, could contribute to the pathogenesis of ASD in the presence of an aggressive chemical background. This hypothesis could be tested by further analysis of ASD-associated mutations, direct biochemical characterization of their effects, and assessment of in vivo effects on animal models.
Implications of the hypothesis
It seems likely that the synergistic action of environmental hazards with genetic variations that in themselves have limited or no deleterious effects but are potentiated by the environmental factors is a general principle that underlies the alarming increase in the ASD prevalence.
This article was reviewed by Andrey Rzhetsky, Neil R. Smalheiser, and Shamil R. Sunyaev.
Synonymous mutations; Single nucleotide polymorphism; Codon usage; Splicing enhancer; Splicing silencer; mRNA secondary structure; Transcription factor binding; Neurotoxin
A substantial fraction of eukaryotic proteins contains multiple domains, some of which show a tendency to occur in diverse domain architectures and can be considered mobile (or ‘promiscuous’). These promiscuous domains are typically involved in protein–protein interactions and play crucial roles in interaction networks, particularly those contributing to signal transduction. They also play a major role in creating diversity of protein domain architecture in the proteome. It is now apparent that promiscuity is a volatile and relatively fast-changing feature in evolution, and that only a few domains retain their promiscuity status throughout evolution. Many such domains attained their promiscuity status independently in different lineages. Only recently, we have begun to understand the diversity of protein domain architectures and the role the promiscuous domains play in evolution of this diversity. However, many of the biological mechanisms of protein domain mobility remain shrouded in mystery. In this review, we discuss our present understanding of protein domain promiscuity, its evolution and its role in cellular function.
mobile domain; promiscuous domain; domain network; domain architecture; domain evolution
The rate of mutations in eukaryotes depends on a plethora of factors and cannot be derived directly from the fidelity of DNA polymerases (Pols). Replication of chromosomes containing the anti-parallel strands of duplex DNA occurs through the copying of leading and lagging strand templates by a trio of Pols α, δ and ε, with the assistance of Pol ζ and Y-family Pols at difficult DNA template structures or sites of DNA damage. The parameters of the synthesis at a given location are dictated by the quality and quantity of nucleotides in the pools, replication fork architecture, transcription status, regulation of Pol switches, and structure of chromatin. The result of these transactions is subject to surveillance and editing by DNA repair.
DNA polymerases; nucleotide pools; mutagenesis; Okazaki fragments
Aberrant activation of receptor tyrosine kinases (RTKs) is a common feature of many cancer cells. It was previously suggested that the mechanisms of kinase activation in cancer might be linked to transitions between active and inactive states. Here we estimate the effects of single and double cancer mutations on the stability of active and inactive states of the kinase domains from different RTKs. We show that singleton cancer mutations destabilize both active and inactive states; however, inactive states are destabilized more than active ones, leading to kinase activation. We show that there exists a relationship between the estimate of the oncogenic potential of a cancer mutation and kinase activation. Namely, more frequent mutations have a higher activating effect, which might allow us to predict the activating effect of the mutations from the mutation spectra. Independent evolutionary analysis of mutation spectra complements this observation and finds the same frequency threshold defining mutation hot spots. We analyze double mutations and report positive epistasis and an additional advantage of doublets with respect to cancer cell fitness. The activation mechanisms of double mutations differ from those of single mutations, and the double mutation spectrum is found to be dissimilar to the mutation spectrum of singletons.
cancer mutation; receptor tyrosine kinase; protein structure; kinase activation; mutation spectra; double mutations
Genetic information should be accurately transmitted from cell to cell; conversely, the adaptation in evolution and disease is fueled by mutations. In the case of cancer development, multiple genetic changes happen in somatic diploid cells. Most classic studies of the molecular mechanisms of mutagenesis have been performed in haploids. We demonstrate that the parameters of the mutation process are different in diploid cell populations. The genomes of drug-resistant mutants induced in yeast diploids by base analog 6-hydroxylaminopurine (HAP) or AID/APOBEC cytosine deaminase PmCDA1 from lamprey carried a stunning load of thousands of unselected mutations. Haploid mutants contained almost an order of magnitude fewer mutations. To explain this, we propose that the distribution of induced mutation rates in the cell population is uneven. The mutants in diploids with coincidental mutations in the two copies of the reporter gene arise from a fraction of cells that are transiently hypersensitive to the mutagenic action of a given mutagen. The progeny of such cells were never recovered in haploids due to the lethality caused by the inactivation of single-copy essential genes in cells with too many induced mutations. In diploid cells, the progeny of hypersensitive cells survived, but their genomes were saturated by heterozygous mutations. The reason for the hypermutability of cells could be transient faults of the mutation prevention pathways, like sanitization of nucleotide pools for HAP or an elevated expression of the PmCDA1 gene or the temporary inability of the destruction of the deaminase. The hypothesis on spikes of mutability may explain the sudden acquisition of multiple mutational changes during evolution and carcinogenesis.
Evolution and carcinogenesis are driven by mutations. Cells maintain constant mutation rates and can afford only transient mutagenesis bursts for adaptation. The nature of the mutational avalanches is not very clear. We sequenced the whole genomes of mutants induced in haploid and diploid yeast by nucleobase analog HAP and by DNA editing cytosine deaminase. Mutants selected in diploids are saturated with passenger mutations. Far fewer mutations are found in haploid mutants. Treatment with a mutagen without selection results in intermediate mutagenesis. The observed transient hypermutability of diploids under mutagenic insult helps to explain the wellspring of mutations that arise during evolution and carcinogenesis.
Kindlin-3 is a novel integrin activator in hematopoietic cells and its deficiency leads to immune problems and severe bleeding, known as LAD-III. Our current understanding of Kindlin-3 function primarily relies on analysis of animal models or cell lines.
To understand the functions of Kindlin-3 in human primary blood cells.
Here we analyze primary and immortalized hematopoietic cells obtained from a new LAD-III patient with immune problems, bleeding, a history of anemia and abnormally shaped red blood cells.
The patient’s WBCs and platelets showed defects in agonist-induced integrin activation and botrocetin-induced platelet agglutination. Primary leukocytes from this patient exhibited abnormal activation of beta1 integrin. Integrin activation defects were responsible for the observed deficiency of the botrocetin-induced platelet response. Analysis of the patient’s genomic DNA revealed a novel mutation in the kindlin-3 gene. The mutation abolished Kindlin-3 expression in primary WBCs and platelets due to abnormal splicing. Kindlin-3 is expressed in erythrocytes, and its deficiency is proposed to lead to the abnormal shape of RBCs. The patient’s immortalized WBCs expressed a truncated form of Kindlin-3 that was not sufficient to support integrin activation. Expression of Kindlin-3 cDNA in the patient’s immortalized WBCs rescued the integrin activation defects, whereas overexpression of the truncated form did not.
Kindlin-3 deficiency impairs integrin function, including activation of beta 1 integrin.
Abnormalities in GPIb-IX function in kindlin-3 deficient platelets are secondary to integrin defects.
Region of Kindlin-3 encoded by Exon 11 is crucial for its ability to activate integrins in humans.
Integrins; Kindlins; Leukocyte Adhesion Deficiency; Platelets; Red Blood Cells; White Blood Cells
We compare the sets of experimentally validated long intergenic non-coding (linc)RNAs from human and mouse and apply a maximum likelihood approach to estimate the total number of lincRNA genes as well as the size of the conserved part of the lincRNome. Under the assumption that the sets of experimentally validated lincRNAs are random samples of the lincRNomes of the corresponding species, we estimate the total lincRNome size at approximately 40,000 to 50,000 species, at least twice the number of protein-coding genes. We further estimate that the fraction of the human and mouse euchromatic genomes encoding lincRNAs is more than twofold greater than the fraction of protein-coding sequences. Although the sequences of most lincRNAs are much less strongly conserved than protein sequences, the extent of orthology between the lincRNomes is unexpectedly high, with 60 to 70% of the lincRNA genes shared between human and mouse. The orthologous mammalian lincRNAs can be predicted to perform equivalent functions; accordingly, it appears likely that thousands of evolutionarily conserved functional roles of lincRNAs remain to be characterized.
Genome analysis of humans and other mammals reveals a surprisingly small number of protein-coding genes, only slightly over 20,000 (although the diversity of actual proteins is substantially augmented by alternative transcription and alternative splicing). Recent analysis of the mammalian genomes and transcriptomes, in particular using RNA-seq technology, shows that, in addition to protein-coding genes, mammalian genomes encode many long non-coding RNAs. For some of these transcripts, various regulatory functions have been demonstrated, but on the whole the repertoire of long non-coding RNAs remains poorly characterized. We compared the identified long intergenic non-coding (linc)RNAs from human and mouse, and employed a specially developed statistical technique to estimate the size and evolutionary conservation of the human and mouse lincRNomes. The estimates show that there are at least twice as many human and mouse lincRNAs as there are protein-coding genes. Moreover, about two thirds of the lincRNA genes appear to be conserved between human and mouse, implying thousands of conserved but still uncharacterized functions.
We describe the draft genome of the microcrustacean Daphnia pulex, which is only 200 Mb and contains at least 30,907 genes. The high gene count is a consequence of an elevated rate of gene duplication resulting in tandem gene clusters. More than 1/3 of Daphnia’s genes have no detectable homologs in any other available proteome, and the most amplified gene families are specific to the Daphnia lineage. The co-expansion of gene families interacting within metabolic pathways suggests that the maintenance of duplicated genes is not random, and the analysis of gene expression under different environmental conditions reveals that numerous paralogs acquire divergent expression patterns soon after duplication. Daphnia-specific genes – including many additional loci within sequenced regions that are otherwise devoid of annotations – are the most responsive genes to ecological challenges.
Clusters of localized hypermutation in human breast cancer genomes, named “kataegis” (from the Greek for thunderstorm), are hypothesized to result from multiple cytosine deaminations catalyzed by AID/APOBEC proteins. However, a direct link between APOBECs and kataegis is still lacking. We have sequenced the genomes of yeast mutants induced in diploids by expression of the gene for PmCDA1, a hypermutagenic deaminase from sea lamprey. Analysis of the distribution of 5,138 induced mutations revealed localized clusters very similar to those found in tumors. Our data provide evidence that unleashed cytosine deaminase activity is an evolutionary conserved, prominent source of genome-wide kataegis events.
This article was reviewed by: Professor Sandor Pongor, Professor Shamil R. Sunyaev, and Dr Vladimir Kuznetsov.
APOBEC; Deaminase; Mutation; Kataegis; Cancer; Diploid yeast; Hypermutation
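The notion of localized mutation clusters in the abstract above can be illustrated with a minimal sketch. It assumes that a cluster is a run of mutations on one chromosome in which neighbouring mutations lie within a fixed distance of each other. The function name `find_mutation_clusters` and the thresholds `max_gap` and `min_mutations` are illustrative assumptions and are not taken from the study.

```python
def find_mutation_clusters(positions, max_gap=1000, min_mutations=6):
    """Return (start, end, count) for runs of closely spaced mutations.

    positions: mutation coordinates on a single chromosome.
    max_gap and min_mutations are illustrative thresholds (assumptions),
    not values reported in the source.
    """
    clusters = []
    current = []
    for pos in sorted(positions):
        # A gap larger than max_gap closes the current run.
        if current and pos - current[-1] > max_gap:
            if len(current) >= min_mutations:
                clusters.append((current[0], current[-1], len(current)))
            current = []
        current.append(pos)
    # Flush the final run.
    if len(current) >= min_mutations:
        clusters.append((current[0], current[-1], len(current)))
    return clusters


# Hypothetical coordinates: six tightly spaced mutations plus scattered ones.
toy_positions = [100, 300, 450, 900, 1200, 1500, 50000, 120000, 120100]
toy_clusters = find_mutation_clusters(toy_positions)
```

A real analysis would also have to handle multiple chromosomes and compare the observed spacing with the spacing expected for randomly distributed mutations, which this sketch does not attempt.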
In order to maintain visual sensitivity at all light levels, the vertebrate eye possesses a mechanism to regenerate the visual pigment chromophore 11-cis retinal in the dark enzymatically, unlike in all other taxa, which rely on photoisomerization. This mechanism is termed the visual cycle and is localized to the retinal pigment epithelium (RPE), a support layer of the neural retina. Speculation has long revolved around whether more primitive chordates, such as tunicates and cephalochordates, anticipated this feature. The two key enzymes of the visual cycle are RPE65, the visual cycle all-trans retinyl ester isomerohydrolase, and lecithin:retinol acyltransferase (LRAT), which generates RPE65’s substrate. We hypothesized that the origin of the vertebrate visual cycle is directly connected to an ancestral carotenoid oxygenase acquiring a new retinyl ester isomerohydrolase function. Our phylogenetic analyses of the RPE65/BCMO and NlpC/P60 (LRAT) superfamilies show that neither RPE65 nor LRAT orthologs occur in tunicates (Ciona) or cephalochordates (Branchiostoma), but both do occur in Petromyzon marinus (sea lamprey), a jawless vertebrate. The closest homologs to RPE65 in Ciona and Branchiostoma lacked predicted functionally diverged residues found in all authentic RPE65s, but lamprey RPE65 contained all of them. We cloned RPE65 and LRATb cDNAs from lamprey RPE and demonstrated appropriate enzymatic activities. We show that Ciona β-carotene monooxygenase a (BCMOa) (previously annotated as an RPE65) has carotenoid oxygenase cleavage activity but not RPE65 activity. We verified the presence of RPE65 in lamprey RPE by immunofluorescence microscopy, immunoblot and mass spectrometry. On the basis of these data we conclude that the crucial transition from the typical carotenoid double bond cleavage functionality (BCMO) to the isomerohydrolase functionality (RPE65), coupled with the origin of LRAT, occurred subsequent to divergence of the more primitive chordates (tunicates, etc.) in the last common ancestor of the jawless and jawed vertebrates.
Among thousands of long non-coding RNAs (lncRNAs), only a small subset is functionally characterized, and the functional annotation of lncRNAs on the genomic scale remains inadequate. In this study we computationally characterized two functionally different parts of the human lncRNA transcriptome based on their ability to bind the polycomb repressive complex, PRC2. This classification is enabled by the fact that, while lncRNAs as a whole constitute a diverse set of sequences, the classes of PRC2-binding and PRC2 non-binding lncRNAs possess characteristic combinations of sequence-structure patterns and, therefore, can be separated within the feature space. Based on the specific combination of features, we built several machine-learning classifiers and identified the SVM-based classifier as the best performing. We further showed that the SVM-based classifier is able to generalize to independent data sets. We observed that this classifier, trained on the human lncRNAs, can predict up to 59.4% of PRC2-binding lncRNAs in mice. This suggests that, despite the low degree of sequence conservation, many lncRNAs play functionally conserved biological roles.
Spliceosomal introns are one of the principal distinctive features of eukaryotes. Nevertheless, different large-scale studies disagree about even the most basic features of their evolution. In order to come up with a more reliable reconstruction of intron evolution, we developed a model that is far more comprehensive than previous ones. This model is rich in parameters, and estimating them accurately is infeasible by straightforward likelihood maximization. Thus, we have developed an expectation-maximization algorithm that allows for efficient maximization. Here, we outline the model and describe the expectation-maximization algorithm in detail. Since the method works with intron presence–absence maps, it is expected to be instrumental for the analysis of the evolution of other binary characters as well.
Maximum likelihood; expectation-maximization; intron evolution; ancestral reconstruction; eukaryotic gene structure
It was proposed that if some mRNA characteristics resulted in a low efficiency of the termination signal, an additional closely located stop codon (tandem stop codons) could be used to prevent harmful readthrough. However, the role of tandem terminators in higher eukaryotes has not been verified and remains hypothetical. In this work the sequence features of Arabidopsis thaliana and Oryza sativa mRNAs were analyzed. It was found that plant mRNAs with a UGA terminator were characterized by a higher frequency of nonsense codons in the first triplet position of the 3′-UTR, which could result from weak natural selection for a “reserve” stop signal. Interestingly, the presence of tandem stop codons positively correlated with a specific amino acid composition in the C-terminal position of the encoded proteins. In particular, C-terminal glycine positively correlated with significantly higher frequencies of reserve terminators at the beginning positions of the 3′-UTR in UGA-containing mRNAs. This finding coincides with some earlier observations concerning the role of glycine and its codons in inefficient termination of translation and recoding (e.g., 2A oligopeptide).
mRNA; Arabidopsis thaliana; Oryza sativa; stop codon; tandem terminators; readthrough
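The tandem-terminator measurement described in the abstract above (a stop codon in the first triplet position of the 3′-UTR) can be sketched as follows. This is a minimal illustration, assuming each transcript is supplied as a pair of strings, a coding sequence that ends with its stop codon and the 3′-UTR that follows it. The function names and the toy sequences are hypothetical and do not come from the study.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}


def _to_rna(seq):
    """Upper-case a sequence and convert DNA letters to RNA."""
    return seq.upper().replace("T", "U")


def has_tandem_stop(cds, utr3):
    """True if the first triplet of the 3'-UTR is itself a stop codon."""
    if _to_rna(cds)[-3:] not in STOP_CODONS:
        raise ValueError("coding sequence does not end with a stop codon")
    return _to_rna(utr3)[:3] in STOP_CODONS


def tandem_stop_frequency(transcripts, terminator="UGA"):
    """Fraction of transcripts ending in `terminator` that carry a
    reserve stop codon at the first 3'-UTR triplet."""
    selected = [(cds, utr) for cds, utr in transcripts
                if _to_rna(cds)[-3:] == terminator]
    if not selected:
        return 0.0
    hits = sum(has_tandem_stop(cds, utr) for cds, utr in selected)
    return hits / len(selected)


# Hypothetical toy transcripts: (coding sequence, 3'-UTR).
toy_transcripts = [
    ("AUGGGUUGA", "UAAGCC"),   # UGA terminator followed by UAA
    ("AUGGCUUGA", "GCCUAA"),   # UGA terminator, no reserve stop
    ("ATGGGTTAA", "TGACCC"),   # UAA terminator (given as DNA)
]
uga_frequency = tandem_stop_frequency(toy_transcripts, "UGA")
```

Comparing such frequencies between terminator classes, or between transcripts grouped by C-terminal amino acid, would still require a statistical test that is outside the scope of this sketch.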
The two types of eukaryotic spliceosomal introns, U2 and U12, possess different splice signals and are excised by distinct spliceosomes. The nature of the primordial introns remains uncertain. A comparison of the amino acid distributions at insertion sites of introns that retained their positions throughout eukaryotic evolution with the distributions for human and Arabidopsis thaliana U2 and U12 introns reveals close similarity with U2 but not U12. Thus, the primordial spliceosomal introns were, most likely, U2-type.
Evolution of exon-intron structure of eukaryotic genes has been a matter of long-standing, intensive debate. The introns-early concept, later rebranded ‘introns first’, held that protein-coding genes were interrupted by numerous introns even at the earliest stages of life's evolution and that introns played a major role in the origin of proteins by facilitating recombination of sequences coding for small protein/peptide modules. The introns-late concept held that introns emerged only in eukaryotes and that new introns have been accumulating continuously throughout eukaryotic evolution. Analysis of orthologous genes from completely sequenced eukaryotic genomes revealed numerous shared intron positions in orthologous genes from animals and plants and even between animals, plants and protists, suggesting that many ancestral introns have persisted since the last eukaryotic common ancestor (LECA). Reconstructions of intron gain and loss using the growing collection of genomes of diverse eukaryotes and increasingly advanced probabilistic models convincingly show that the LECA and the ancestors of each eukaryotic supergroup had intron-rich genes, with intron densities comparable to those in the most intron-rich modern genomes such as those of vertebrates. The subsequent evolution in most lineages of eukaryotes involved primarily loss of introns, with only a few episodes of substantial intron gain that might have accompanied major evolutionary innovations such as the origin of metazoa. The original invasion of self-splicing Group II introns, presumably originating from the mitochondrial endosymbiont, into the genome of the emerging eukaryote might have been a key factor of eukaryogenesis that in particular triggered the origin of endomembranes and the nucleus. Conversely, splicing errors gave rise to alternative splicing, a major contribution to the biological complexity of multicellular eukaryotes.
There is no indication that any prokaryote has ever possessed a spliceosome or introns in protein-coding genes, other than relatively rare mobile self-splicing introns. Thus, the introns-first scenario is not supported by any evidence; rather, the exon-intron structure of protein-coding genes appears to have evolved concomitantly with the eukaryotic cell, and introns were a major factor of evolution throughout the history of eukaryotes. This article was reviewed by I. King Jordan, Manuel Irimia (nominated by Anthony Poole), Tobias Mourier (nominated by Anthony Poole), and Fyodor Kondrashov. For the complete reports, see the Reviewers’ Reports section.
Intron sliding; Intron gain; Intron loss; Spliceosome; Splicing signals; Evolution of exon/intron structure; Alternative splicing; Phylogenetic trees; Mobile domains; Eukaryotic ancestor
Mammalian genomes contain numerous genes for long noncoding RNAs (lncRNAs). The functions of the lncRNAs remain largely unknown, but their evolution appears to be constrained by purifying selection, albeit relatively weakly. To gain insights into the mode of evolution and the functional range of the lncRNAs, they can be compared with the much better characterized protein-coding genes. The evolutionary rate of the protein-coding genes shows a universal negative correlation with expression: highly expressed genes are on average more conserved during evolution than the genes with lower expression levels. This correlation was conceptualized in the misfolding-driven protein evolution hypothesis, according to which misfolding is the principal cost incurred by protein expression. We sought to determine whether long intergenic ncRNAs (lincRNAs) follow the same evolutionary trend and indeed detected a moderate but statistically significant negative correlation between the evolutionary rate and expression level of human and mouse lincRNA genes. The magnitude of the correlation for the lincRNAs is similar to that for equal-sized sets of protein-coding genes with similar levels of sequence conservation. Additionally, the expression level of the lincRNAs is significantly and positively correlated with the predicted extent of lincRNA molecule folding (base-pairing); however, the contributions of evolutionary rates and folding to the expression level are independent. Thus, the anticorrelation between evolutionary rate and expression level appears to be a general feature of gene evolution that might be caused by similar deleterious effects of protein and RNA misfolding and/or other factors, for example, the number of interacting partners of the gene product.
long noncoding RNA; ncRNA; RNA expression; genomic alignments; introns; RNA folding
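The kind of rate-versus-expression comparison described in the abstract above can be illustrated with a rank correlation. The abstract does not state which correlation statistic was used, so the choice of Spearman's coefficient here is an assumption, as are the function names. The sketch uses only the standard library and assigns average ranks to ties.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks.

    Raises ZeroDivisionError if either input is constant.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical toy values: expression rises while evolutionary rate falls.
toy_expression = [1, 2, 3, 4]
toy_rate = [8, 6, 4, 2]
toy_rho = spearman(toy_expression, toy_rate)
```

For real data, an established implementation with a significance test would normally be preferred over this hand-written version.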
Protein-coding genes in eukaryotes are interrupted by introns, but intron densities widely differ between eukaryotic lineages. Vertebrates, some invertebrates and green plants have intron-rich genes, with 6–7 introns per kilobase of coding sequence, whereas most of the other eukaryotes have intron-poor genes. We reconstructed the history of intron gain and loss using a probabilistic Markov model (Markov Chain Monte Carlo, MCMC) on 245 orthologous genes from 99 genomes representing the three of the five supergroups of eukaryotes for which multiple genome sequences are available. Intron-rich ancestors are confidently reconstructed for each major group, with 53 to 74% of the human intron density inferred with 95% confidence for the Last Eukaryotic Common Ancestor (LECA). The results of the MCMC reconstruction are compared with the reconstructions obtained using Maximum Likelihood (ML) and Dollo parsimony methods. An excellent agreement between the MCMC and ML inferences is demonstrated whereas Dollo parsimony introduces a noticeable bias in the estimations, typically yielding lower ancestral intron densities than MCMC and ML. Evolution of eukaryotic genes was dominated by intron loss, with substantial gain only at the bases of several major branches including plants and animals. The highest intron density, 120 to 130% of the human value, is inferred for the last common ancestor of animals. The reconstruction shows that the entire line of descent from LECA to mammals was intron-rich, a state conducive to the evolution of alternative splicing.
In eukaryotes, protein-coding genes are interrupted by non-coding introns. The intron densities widely differ, from 6–7 introns per kilobase of coding sequence in vertebrates, some invertebrates and plants, to only a few introns across the entire genome in many unicellular forms. We applied a robust statistical methodology, Markov Chain Monte Carlo, to reconstruct the history of intron gain and loss throughout the evolution of eukaryotes using a set of 245 homologous genes from 99 genomes that represent the diversity of eukaryotes. Intron-rich ancestors were confidently inferred for each major eukaryotic group including 53% to 74% of the human intron density for the last eukaryotic common ancestor, and 120% to 130% of the human value for the last common ancestor of animals. Evolution of eukaryotic genes involved primarily intron loss, with substantial gain only at the bases of several major branches including plants and animals. Thus, the common ancestor of all extant eukaryotes was a complex organism with a gene architecture resembling those in multicellular organisms. The line of descent from the last common ancestor to mammals was an uninterrupted intron-rich state that, given the error-prone splicing in intron-rich organisms, was conducive to the elaboration of functional alternative splicing.
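Of the three reconstruction methods compared above, Dollo parsimony is the simplest to sketch. Under the Dollo assumption an intron is gained once and may then be lost independently in several lineages, so the gain is placed at the last common ancestor of all leaves that carry the intron. The tree, the node names, and the function name below are hypothetical, and the sketch is not the authors' implementation.

```python
def dollo_reconstruct(children, leaf_states, root):
    """Infer presence (1) or absence (0) of one intron at every node.

    children: dict mapping each internal node to a list of child nodes.
    leaf_states: dict mapping each leaf to 0 or 1.
    """
    count = {}

    def fill(node):
        # Post-order count of intron-bearing leaves below each node.
        kids = children.get(node, [])
        if not kids:
            count[node] = leaf_states[node]
        else:
            count[node] = sum(fill(kid) for kid in kids)
        return count[node]

    total = fill(root)
    states = {node: 0 for node in count}
    if total == 0:
        return states

    # Walk down from the root to the last common ancestor of all
    # intron-bearing leaves; the single gain is placed there.
    origin = root
    while True:
        heirs = [kid for kid in children.get(origin, [])
                 if count[kid] == total]
        if not heirs:
            break
        origin = heirs[0]

    # Below the origin, a node keeps the intron if and only if at least
    # one leaf in its subtree has it; all other nodes represent losses.
    stack = [origin]
    while stack:
        node = stack.pop()
        if count[node] > 0:
            states[node] = 1
            stack.extend(children.get(node, []))
    return states


# Hypothetical tree: root -> (A, X), X -> (Y, D), Y -> (B, C).
toy_children = {"root": ["A", "X"], "X": ["Y", "D"], "Y": ["B", "C"]}
toy_leaves = {"A": 0, "B": 1, "C": 0, "D": 1}
toy_states = dollo_reconstruct(toy_children, toy_leaves, "root")
```

In this toy case the root is reconstructed without the intron even though a probabilistic model might assign it a nonzero probability of presence, which illustrates how Dollo parsimony can yield lower ancestral intron densities than MCMC or ML.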
Editing deaminases have a pivotal role in cellular physiology. A notable member of this superfamily, APOBEC3G (A3G), restricts retroviruses, and Activation Induced Deaminase (AID) generates antibody diversity by localized deamination of cytosines in DNA. Unconstrained deaminase activity can cause genome-wide mutagenesis and cancer. The mechanisms that protect the genomic DNA from the undesired action of deaminases are unknown. Using in vitro deamination assays and expression of A3G in yeast, we show that replication protein A (RPA), the eukaryotic single-stranded DNA (ssDNA) binding protein, severely inhibits the deamination activity and processivity of A3G.
We found that mutations induced by A3G in the yeast genomic reporter are changes of a single nucleotide. This is unexpected because A3G is known to catalyze multiple deaminations upon one substrate encounter event in vitro. The addition of recombinant RPA to the oligonucleotide deamination assay severely inhibited A3G activity. Additionally, we reveal an inverse correlation between RPA concentration and the number of deaminations induced by A3G in vitro on long ssDNA regions. This resembles the “hit and run” single base substitution events observed in yeast.
Our data suggest that RPA is a plausible antimutator factor limiting the activity and processivity of editing deaminases in the model yeast system. Because of the similar antagonism of yeast RPA and human RPA with A3G in vitro, we propose that RPA plays a role in the protection of the human genome from A3G and other deaminases when they are inadvertently diverted from their natural targets. We propose a model where RPA serves as one of the guardians of the genome that protects ssDNA from the destructive processive activity of deaminases by non-specific steric hindrance.
The deaminase-like fold includes, in addition to nucleic acid/nucleotide deaminases, several catalytic domains such as the JAB domain, and others involved in nucleotide and ADP-ribose metabolism. Using sensitive sequence and structural comparison methods, we develop a comprehensive natural classification of the deaminase-like fold and show that its ancestral version was likely to operate on nucleotides or nucleic acids. Consequently, we present evidence that a specific group of JAB domains is likely to possess a DNA repair function, distinct from the previously known deubiquitinating peptidase activity. We also identified numerous previously unknown clades of nucleic acid deaminases. Using inference based on contextual information, we suggest that most of these clades are toxin domains of two distinct classes of bacterial toxin systems, namely polymorphic toxins implicated in bacterial interstrain competition and those that target distantly related cells. Genome context information suggests that these toxins might be delivered via diverse secretory systems, such as Type V, Type VI, PVC and a novel PrsW-like intramembrane peptidase-dependent mechanism. We propose that certain deaminase toxins might be deployed by diverse extracellular and intracellular pathogens, as well as endosymbionts, as effectors targeting nucleic acids of host cells. Our analysis suggests that these toxin deaminases have been acquired by eukaryotes on several independent occasions and recruited as organellar or nucleo-cytoplasmic RNA modifiers, operating on tRNAs, mRNAs and short non-coding RNAs, and also as mutators of hyper-variable genes, viruses and selfish elements. This scenario potentially explains the origin of mutagenic AID/APOBEC-like deaminases, including novel versions from Caenorhabditis, Nematostella and diverse algae and a large class of fast-evolving fungal deaminases.
These observations greatly expand the distribution of possible unidentified mutagenic processes catalyzed by nucleic acid deaminases.
Accurate estimation of the divergence time of the extant eukaryotes is a fundamentally important but extremely difficult problem owing primarily to gross violations of the molecular clock at long evolutionary distances and the lack of appropriate calibration points close to the date of interest. These difficulties are intrinsic to the dating of ancient divergence events and are reflected in the large discrepancies between estimates obtained with different approaches. Estimates of the age of Last Eukaryotic Common Ancestor (LECA) vary approximately twofold, from ~1,100 million years ago (Mya) to ~2,300 Mya.
We applied the genome-wide analysis of rare genomic changes associated with conserved amino acids (RGC_CAs) and used several independent techniques to obtain date estimates for the divergence of the major lineages of eukaryotes with calibration intervals for insects, land plants and vertebrates. The results suggest an early divergence of monocot and dicot plants, approximately 340 Mya, raising the possibility of plant-insect coevolution. The divergence of bilaterian animal phyla is estimated at ~400-700 Mya, a range of dates that is consistent with cladogenesis immediately preceding the Cambrian explosion. The origin of opisthokonts (the supergroup of eukaryotes that includes metazoa and fungi) is estimated at ~700-1,000 Mya, and the age of LECA at ~1,000-1,300 Mya. We separately analyzed the red algal calibration interval, which is based on a single fossil. This analysis produced time estimates that were systematically older than the other estimates. Nevertheless, the majority of the estimates for the age of the LECA using the red algal data fell within the 1,200-1,400 Mya interval.
The inference of a "young LECA" is compatible with the latest of previously estimated dates and has substantial biological implications. If these estimates are valid, the approximately 1 to 1.4 billion years of evolution of eukaryotes that is open to comparative-genomic study probably was preceded by hundreds of millions of years of evolution that might have included extinct diversity inaccessible to comparative approaches.
This article was reviewed by William Martin, Herve Philippe (nominated by I. King Jordan), and Romain Derelle.
bilateria; opisthokonts; angiosperms; last eukaryotic common ancestor; molecular dating
Yeast DNA polymerase ε (Pol ε) is a highly accurate and processive enzyme that participates in nuclear DNA replication of the leading strand template. In addition to a large subunit (Pol2) harboring the polymerase and proofreading exonuclease active sites, Pol ε also has one essential subunit (Dpb2) and two smaller, non-essential subunits (Dpb3 and Dpb4) whose functions are not fully understood. To probe the functions of Dpb3 and Dpb4, here we investigate the consequences of their absence on the biochemical properties of Pol ε in vitro and on genome stability in vivo. The fidelity of DNA synthesis in vitro by purified Pol2/Dpb2, i.e. lacking Dpb3 and Dpb4, is comparable to that of the four-subunit Pol ε holoenzyme. Nonetheless, deletion of DPB3 and DPB4 elevates spontaneous frameshift and base substitution rates in vivo, to the same extent as the loss of Pol ε proofreading activity in a pol2-4 strain. In contrast to pol2-4, however, the dpb3Δdpb4Δ double deletion does not lead to a synergistic increase of mutation rates with defects in DNA mismatch repair. The increased mutation rate in dpb3Δdpb4Δ strains is partly dependent on REV3, as well as the proofreading capacity of Pol δ. Finally, biochemical studies demonstrate that the absence of Dpb3 and Dpb4 destabilizes the interaction between Pol ε and the template DNA during processive DNA synthesis and during processive 3′ to 5′ exonucleolytic degradation of DNA. Collectively, these data suggest a model wherein Dpb3 and Dpb4 do not directly influence replication fidelity per se, but rather contribute to normal replication fork progression. In their absence, a defective replisome may more frequently leave gaps on the leading strand that are eventually filled by Pol ζ or Pol δ, in a post-replication process that generates errors not corrected by the DNA mismatch repair system.
The high fidelity of DNA replication is safeguarded by the accuracy of nucleotide selection by DNA polymerases, proofreading activity of the replicative polymerases, and the DNA mismatch repair system. Errors made by replicative polymerases are corrected by mismatch repair, and inactivation of the mismatch repair system results in a multiplicative increase in error rates when combined with a proofreading-deficient allele of a replicative polymerase. In this study, we demonstrate that the deletion of two non-essential genes encoding two subunits of Pol ε gives an increased mutation rate due to increased synthesis by the error-prone DNA polymerase ζ. Surprisingly, there was no multiplicative increase in error rates when the mismatch repair system was inactivated. We propose that the deletion of DPB3 and DPB4 gives a defective replisome, which in turn gives increased synthesis, in part, by Pol ζ during an error-prone post-replication process that is not efficiently repaired by the mismatch repair system.
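The distinction between a multiplicative and a non-multiplicative increase can be made concrete with a small arithmetic sketch. The function name and all rate values below are hypothetical and are not data from the study; the sketch only assumes the usual reasoning that two defects acting in series on the same errors should multiply their fold effects, whereas defects acting on independent error pools should roughly add.

```python
def expected_double_mutant_rate(wild_type, mutant_a, mutant_b, model):
    """Expected mutation rate of a double mutant from the two single-mutant rates.

    "multiplicative": both defects act in series on the same replication errors,
                      so their fold increases over wild type multiply.
    "additive":       the defects act on independent error pools, so only the
                      excess rates over wild type add up.
    """
    if model == "multiplicative":
        return wild_type * (mutant_a / wild_type) * (mutant_b / wild_type)
    if model == "additive":
        return mutant_a + mutant_b - wild_type
    raise ValueError(f"unknown model: {model}")


# Hypothetical relative rates (wild type set to 1.0), for illustration only.
WT, SUBUNIT_DEFECT, MMR_DEFECT = 1.0, 10.0, 50.0

multiplicative = expected_double_mutant_rate(WT, SUBUNIT_DEFECT, MMR_DEFECT, "multiplicative")
additive = expected_double_mutant_rate(WT, SUBUNIT_DEFECT, MMR_DEFECT, "additive")
print(multiplicative, additive)  # 500.0 59.0
```

An observed double-mutant rate close to the additive expectation, rather than the multiplicative one, is the kind of result that points to errors arising outside the pathway that mismatch repair normally corrects.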