Evolutionary binary characters are features of species or genes, indicating the absence (value zero) or presence (value one) of some property. Examples include eukaryotic gene architecture (the presence or absence of an intron in a particular locus), gene content, and morphological characters. In many studies, the acquisition of such binary characters is assumed to represent a rare evolutionary event, and consequently, their evolution is analyzed using various flavors of parsimony. However, when gain and loss of the character are not rare enough, a probabilistic analysis becomes essential. Here, we present a comprehensive probabilistic model to describe the evolution of binary characters on a bifurcating phylogenetic tree. A fast software tool, EREM, is provided, using maximum likelihood to estimate the parameters of the model and to reconstruct ancestral states (presence and absence in internal nodes) and events (gain and loss events along branches).
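To make the probabilistic treatment concrete, the following is a minimal sketch of how the likelihood of one binary character on a bifurcating tree can be computed with Felsenstein's pruning recursion. It is not EREM's model: it assumes a single homogeneous two-state Markov chain, and the parameter names `gain` and `loss`, the nested-tuple tree encoding, and the function names are all illustrative assumptions.

```python
import math

def transition_matrix(gain, loss, t):
    """2x2 transition probabilities of a two-state continuous-time Markov chain.

    State 0 = absence, state 1 = presence; gain and loss are assumed rates.
    """
    total = gain + loss
    pi1 = gain / total              # stationary probability of presence
    pi0 = loss / total              # stationary probability of absence
    decay = math.exp(-total * t)
    return [
        [pi0 + pi1 * decay, pi1 - pi1 * decay],   # starting from absence
        [pi0 - pi0 * decay, pi1 + pi0 * decay],   # starting from presence
    ]

def partial_likelihood(node, gain, loss):
    """Return [L(absent), L(present)] for the subtree rooted at node.

    A leaf is an int (0 or 1); an internal node is a tuple
    (left_child, left_branch_length, right_child, right_branch_length).
    """
    if isinstance(node, int):
        return [1.0 if node == state else 0.0 for state in (0, 1)]
    left, t_left, right, t_right = node
    result = [1.0, 1.0]
    for child, t in ((left, t_left), (right, t_right)):
        p = transition_matrix(gain, loss, t)
        below = partial_likelihood(child, gain, loss)
        for state in (0, 1):
            result[state] *= p[state][0] * below[0] + p[state][1] * below[1]
    return result

def tree_likelihood(tree, gain, loss):
    """Likelihood of the leaf pattern, with the stationary distribution at the root."""
    total = gain + loss
    at_root = partial_likelihood(tree, gain, loss)
    return (loss / total) * at_root[0] + (gain / total) * at_root[1]

if __name__ == "__main__":
    # Hypothetical three-leaf tree: the character is present in two sister leaves.
    toy_tree = ((1, 0.1, 1, 0.2), 0.3, 0, 0.4)
    print(tree_likelihood(toy_tree, gain=0.3, loss=0.7))
```

Maximizing this quantity over `gain` and `loss` across many characters would give maximum likelihood estimates under these simplified assumptions; ancestral states and events would require an additional pass down the tree.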
We compare the sets of experimentally validated long intergenic non-coding (linc)RNAs from human and mouse and apply a maximum likelihood approach to estimate the total number of lincRNA genes as well as the size of the conserved part of the lincRNome. Under the assumption that the sets of experimentally validated lincRNAs are random samples of the lincRNomes of the corresponding species, we estimate the total lincRNome size at approximately 40,000 to 50,000 lincRNA genes, at least twice the number of protein-coding genes. We further estimate that the fraction of the human and mouse euchromatic genomes encoding lincRNAs is more than twofold greater than the fraction of protein-coding sequences. Although the sequences of most lincRNAs are much less strongly conserved than protein sequences, the extent of orthology between the lincRNomes is unexpectedly high, with 60 to 70% of the lincRNA genes shared between human and mouse. The orthologous mammalian lincRNAs can be predicted to perform equivalent functions; accordingly, it appears likely that thousands of evolutionarily conserved functional roles of lincRNAs remain to be characterized.
Genome analysis of humans and other mammals reveals a surprisingly small number of protein-coding genes, only slightly over 20,000 (although the diversity of actual proteins is substantially augmented by alternative transcription and alternative splicing). Recent analysis of the mammalian genomes and transcriptomes, in particular, using the RNAseq technology, shows that, in addition to protein-coding genes, mammalian genomes encode many long non-coding RNAs. For some of these transcripts, various regulatory functions have been demonstrated, but on the whole the repertoire of long non-coding RNAs remains poorly characterized. We compared the identified long intergenic non-coding (linc)RNAs from human and mouse, and employed a specially developed statistical technique to estimate the size and evolutionary conservation of the human and mouse lincRNomes. The estimates show that there are at least twice as many human and mouse lincRNAs as there are protein-coding genes. Moreover, about two thirds of the lincRNA genes appear to be conserved between human and mouse, implying thousands of conserved but still uncharacterized functions.
We describe the draft genome of the microcrustacean Daphnia pulex, which is only 200 Mb and contains at least 30,907 genes. The high gene count is a consequence of an elevated rate of gene duplication resulting in tandem gene clusters. More than 1/3 of Daphnia’s genes have no detectable homologs in any other available proteome, and the most amplified gene families are specific to the Daphnia lineage. The co-expansion of gene families interacting within metabolic pathways suggests that the maintenance of duplicated genes is not random, and the analysis of gene expression under different environmental conditions reveals that numerous paralogs acquire divergent expression patterns soon after duplication. Daphnia-specific genes – including many additional loci within sequenced regions that are otherwise devoid of annotations – are the most responsive genes to ecological challenges.
Spliceosomal introns are one of the principal distinctive features of eukaryotes. Nevertheless, different large-scale studies disagree about even the most basic features of their evolution. In order to come up with a more reliable reconstruction of intron evolution, we developed a model that is far more comprehensive than previous ones. This model is rich in parameters, and estimating them accurately is infeasible by straightforward likelihood maximization. Thus, we have developed an expectation-maximization algorithm that allows for efficient maximization. Here, we outline the model and describe the expectation-maximization algorithm in detail. Since the method works with intron presence–absence maps, it is expected to be instrumental for the analysis of the evolution of other binary characters as well.
Maximum likelihood; expectation-maximization; intron evolution; ancestral reconstruction; eukaryotic gene structure
The two types of eukaryotic spliceosomal introns, U2 and U12, possess different splice signals and are excised by distinct spliceosomes. The nature of the primordial introns remains uncertain. A comparison of the amino acid distributions at insertion sites of introns that retained their positions throughout eukaryotic evolution with the distributions for human and Arabidopsis thaliana U2 and U12 introns reveals close similarity with U2 but not U12. Thus, the primordial spliceosomal introns were, most likely, U2-type.
Mammalian genomes contain numerous genes for long noncoding RNAs (lncRNAs). The functions of the lncRNAs remain largely unknown but their evolution appears to be constrained by purifying selection, albeit relatively weakly. To gain insights into the mode of evolution and the functional range of the lncRNAs, they can be compared with much better characterized protein-coding genes. The evolutionary rate of the protein-coding genes shows a universal negative correlation with expression: highly expressed genes are on average more conserved during evolution than the genes with lower expression levels. This correlation was conceptualized in the misfolding-driven protein evolution hypothesis according to which misfolding is the principal cost incurred by protein expression. We sought to determine whether long intergenic ncRNAs (lincRNAs) follow the same evolutionary trend and indeed detected a moderate but statistically significant negative correlation between the evolutionary rate and expression level of human and mouse lincRNA genes. The magnitude of the correlation for the lincRNAs is similar to that for equal-sized sets of protein-coding genes with similar levels of sequence conservation. Additionally, the expression level of the lincRNAs is significantly and positively correlated with the predicted extent of lincRNA molecule folding (base-pairing); however, the contributions of evolutionary rates and folding to the expression level are independent. Thus, the anticorrelation between evolutionary rate and expression level appears to be a general feature of gene evolution that might be caused by similar deleterious effects of protein and RNA misfolding and/or other factors, for example, the number of interacting partners of the gene product.
long noncoding RNA; ncRNA; RNA expression; genomic alignments; introns; RNA folding
Protein-coding genes in eukaryotes are interrupted by introns, but intron densities widely differ between eukaryotic lineages. Vertebrates, some invertebrates and green plants have intron-rich genes, with 6–7 introns per kilobase of coding sequence, whereas most of the other eukaryotes have intron-poor genes. We reconstructed the history of intron gain and loss using a probabilistic Markov model (Markov Chain Monte Carlo, MCMC) on 245 orthologous genes from 99 genomes representing the three of the five supergroups of eukaryotes for which multiple genome sequences are available. Intron-rich ancestors are confidently reconstructed for each major group, with 53 to 74% of the human intron density inferred with 95% confidence for the Last Eukaryotic Common Ancestor (LECA). The results of the MCMC reconstruction are compared with the reconstructions obtained using Maximum Likelihood (ML) and Dollo parsimony methods. An excellent agreement between the MCMC and ML inferences is demonstrated whereas Dollo parsimony introduces a noticeable bias in the estimations, typically yielding lower ancestral intron densities than MCMC and ML. Evolution of eukaryotic genes was dominated by intron loss, with substantial gain only at the bases of several major branches including plants and animals. The highest intron density, 120 to 130% of the human value, is inferred for the last common ancestor of animals. The reconstruction shows that the entire line of descent from LECA to mammals was intron-rich, a state conducive to the evolution of alternative splicing.
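For comparison with the probabilistic reconstructions, the Dollo parsimony rule can be sketched in a few lines. Dollo parsimony assumes that an intron at a given position is gained at most once and may be lost any number of times, so an ancestor is inferred to carry the intron whenever it lies on a path connecting two extant species that have it. The tree encoding (nested tuples of species names), the function names, and the toy species set below are illustrative assumptions, not the data or code of the study.

```python
def count_present(node, present, counts):
    """Post-order pass: number of intron-bearing leaves below each node."""
    if isinstance(node, str):
        counts[node] = 1 if node in present else 0
    else:
        counts[node] = sum(count_present(child, present, counts) for child in node)
    return counts[node]

def dollo_states(tree, present):
    """Infer presence (1) or absence (0) of one intron at every node.

    tree    -- nested tuples of unique leaf names, e.g. (("a", "b"), "c")
    present -- set of leaf names in which the intron is observed
    """
    counts = {}
    total = count_present(tree, present, counts)
    states = {}

    def visit(node):
        if isinstance(node, str):
            states[node] = counts[node]
            return
        children_with_intron = sum(1 for child in node if counts[child] > 0)
        # The node lies on a path between two intron-bearing leaves if it has
        # such a leaf below it and either another one outside its subtree or
        # intron-bearing leaves under at least two of its children.
        on_path = counts[node] > 0 and (counts[node] < total or children_with_intron >= 2)
        states[node] = 1 if on_path else 0
        for child in node:
            visit(child)

    visit(tree)
    return states

if __name__ == "__main__":
    toy_tree = ((("human", "mouse"), "fly"), ("yeast", "plant"))
    for node, state in dollo_states(toy_tree, {"human", "plant"}).items():
        print(node, state)
```

In this toy case a single shared position in two distant leaves forces presence at every ancestor between them, with losses assigned to the remaining lineages.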
In eukaryotes, protein-coding genes are interrupted by non-coding introns. The intron densities widely differ, from 6–7 introns per kilobase of coding sequence in vertebrates, some invertebrates and plants, to only a few introns across the entire genome in many unicellular forms. We applied a robust statistical methodology, Markov Chain Monte Carlo, to reconstruct the history of intron gain and loss throughout the evolution of eukaryotes using a set of 245 homologous genes from 99 genomes that represent the diversity of eukaryotes. Intron-rich ancestors were confidently inferred for each major eukaryotic group including 53% to 74% of the human intron density for the last eukaryotic common ancestor, and 120% to 130% of the human value for the last common ancestor of animals. Evolution of eukaryotic genes involved primarily intron loss, with substantial gain only at the bases of several major branches including plants and animals. Thus, the common ancestor of all extant eukaryotes was a complex organism with a gene architecture resembling those in multicellular organisms. The line of descent from the last common ancestor to mammals was an uninterrupted intron-rich state that, given the error-prone splicing in intron-rich organisms, was conducive to the elaboration of functional alternative splicing.
The deep phylogeny of eukaryotes is an important but extremely difficult problem of evolutionary biology. Five eukaryotic supergroups are relatively well established but the relationship between these supergroups remains elusive, and their divergence seems to best fit a “Big Bang” model. Attempts were made to root the tree of eukaryotes by using potential derived shared characters such as unique fusions of conserved genes. One popular model of eukaryotic evolution that emerged from this type of analysis is the unikont–bikont phylogeny: The unikont branch consists of Metazoa, Choanozoa, Fungi, and Amoebozoa, whereas bikonts include the rest of eukaryotes, namely, Plantae (green plants, Chlorophyta, and Rhodophyta), Chromalveolata, excavates, and Rhizaria. We reexamine the relationships between the eukaryotic supergroups using a genome-wide analysis of rare genomic changes (RGCs) associated with multiple, conserved amino acids (RGC_CAMs and RGC_CAs), to resolve trifurcations of major eukaryotic lineages. The results do not support the basal position of Chromalveolata with respect to Plantae and unikonts or the monophyly of the bikont group and appear to be best compatible with the monophyly of unikonts and Chromalveolata. Chromalveolata show a distinct, additional signal of affinity with Plantae, conceivably, owing to genes transferred from the secondary, red algal symbiont. Excavates are derived forms, with extremely long branches that complicate phylogenetic inference; nevertheless, the RGC analysis suggests that they are significantly more likely to cluster with the unikont–Chromalveolata assemblage than with the Plantae. Thus, the first split in eukaryotic evolution might lie between photosynthetic and nonphotosynthetic forms and so could have been triggered by the endosymbiosis between an ancestral unicellular eukaryote and a cyanobacterium that gave rise to the chloroplast.
eukaryotic phylogeny; rare genomic changes; parsimony; substitutions; insertions; deletions
Evolution of protein sequences is largely governed by purifying selection, with a small fraction of proteins evolving under positive selection. The evolution at synonymous positions in protein-coding genes is not nearly as well understood, with the extent and types of selection remaining, largely, unclear. A statistical test to identify purifying and positive selection at synonymous sites in protein-coding genes was developed. The method compares the rate of evolution at synonymous sites (Ks) to that in intron sequences of the same gene after sampling the aligned intron sequences to mimic the statistical properties of coding sequences. We detected purifying selection at synonymous sites in ∼28% of the 1,562 analyzed orthologous genes from mouse and rat, and positive selection in ∼12% of the genes. Thus, the fraction of genes with readily detectable positive selection at synonymous sites is much greater than the fraction of genes with comparable positive selection at nonsynonymous sites, i.e., at the level of the protein sequence. Unlike other genes, the genes with positive selection at synonymous sites showed no correlation between Ks and the rate of evolution in nonsynonymous sites (Ka), indicating that evolution of synonymous sites under positive selection is decoupled from protein evolution. The genes with purifying selection at synonymous sites showed significant anticorrelation between Ks and expression level and breadth, indicating that highly expressed genes evolve slowly. The genes with positive selection at synonymous sites showed the opposite trend, i.e., highly expressed genes had, on average, higher Ks. For the genes with positive selection at synonymous sites, a significantly lower mRNA stability is predicted compared to the genes with negative selection. 
Thus, mRNA destabilization could be an important factor driving positive selection in synonymous sites, probably, through regulation of expression at the level of mRNA degradation and, possibly, also translation rate. So, unexpectedly, we found that positive selection at synonymous sites of mammalian genes is substantially more common than positive selection at the level of protein sequences. Positive selection at synonymous sites might act through mRNA destabilization affecting mRNA levels and translation.
synonymous sites; nonsynonymous sites; positive selection; purifying selection; introns
The eukaryotic DNA polymerase δ (Pol δ) participates in genome replication, homologous recombination, DNA repair and damage tolerance. Regulation of the plethora of Pol δ functions depends on the interaction between the second (p50) and third (p66) non-catalytic subunits. We report the crystal structure of p50•p66N complex featuring oligonucleotide binding and phosphodiesterase domains in p50 and winged helix-turn-helix N-terminal domain in p66. Disruption of the interaction between the yeast orthologs of p50 and p66 by strategic amino acid changes leads to cold-sensitivity, sensitivity to hydroxyurea and to reduced UV mutagenesis, mimicking the phenotypes of strains where the third subunit of Pol δ is absent. The second subunits of all B family replicative DNA polymerases in archaea and eukaryotes, except Pol δ, share a three-domain structure similar to p50•p66N, raising the possibility that a portion of the gene encoding p66 was derived from the second subunit gene relatively late in evolution.
DNA polymerase δ; Pol δ; p50; p66; Pol31; Pol32; OB; Myb; phosphodiesterase; human; yeast
It is commonly known that transcription of eukaryotic genes is carried out by three nuclear RNA polymerases of which RNA polymerase II is thought to be solely responsible for the synthesis of mRNAs. Here we show that transcription of some mRNAs in humans and rodents is mediated by a previously unknown single-polypeptide nuclear RNA polymerase (spRNAP-IV). The spRNAP-IV is expressed from an alternative transcript of the mitochondrial RNA polymerase (mtRNAP) gene (POLRMT). The spRNAP-IV lacks 262 N-terminal amino acids of mtRNAP, including the mitochondrial-targeting signal, and localizes to the nucleus. Transcription by spRNAP-IV is resistant to the RNA polymerase II inhibitor α-amanitin but sensitive to siRNA specific for the POLRMT gene. The promoters for spRNAP-IV differ substantially from those utilized by RNAP-II, do not respond to transcriptional enhancers, and contain a common functional sequence motif.
By comparing sequences of human, mouse and rat orthologous genes, we show that in 5′-untranslated regions (5′-UTRs) of mammalian cDNAs but not in 3′-UTRs or coding sequences, AUG is conserved to a significantly greater extent than any of the other 63 nt triplets. This effect is likely to reflect, primarily, bona fide evolutionary conservation, rather than cDNA annotation artifacts, because the excess of conserved upstream AUGs (uAUGs) is seen in 5′-UTRs containing stop codons in-frame with the start AUG and many of the conserved AUGs are found in different frames, consistent with the location in authentic non-coding sequences. Altogether, conserved uAUGs are present in at least 20–30% of mammalian genes. Qualitatively similar results were obtained by comparison of orthologous genes from different species of the yeast genus Saccharomyces. Together with the observation that mammalian and yeast 5′-UTRs are significantly depleted in overall AUG content, these findings suggest that AUG triplets in 5′-UTRs are subject to the pressure of purifying selection in two opposite directions: the uAUGs that have no specific function tend to be deleterious and get eliminated during evolution, whereas those uAUGs that do serve a function are conserved. Most probably, the principal role of the conserved uAUGs is attenuation of translation at the initiation stage, which is often additionally regulated by alternative splicing in the mammalian 5′-UTRs. Consistent with this hypothesis, we found that open reading frames starting from conserved uAUGs are significantly shorter than those starting from non-conserved uAUGs, possibly, owing to selection for optimization of the level of attenuation.
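The per-triplet comparison described above can be illustrated with a small sketch that, for every triplet found in a reference sequence, reports the fraction of its occurrences that are identical at the same position in all aligned orthologs. The function name, the assumption of equal-length pre-aligned sequences with `-` as the gap character, and the toy sequences are illustrative; the study's actual alignment handling and statistical testing are not reproduced here.

```python
from collections import Counter

def triplet_conservation(aligned):
    """Fraction of conserved occurrences for each triplet in the first sequence.

    aligned -- list of equal-length aligned sequences; the first is the reference.
    A triplet occurrence counts as conserved if every other sequence has the
    same three characters at the same alignment columns.
    """
    reference = aligned[0]
    seen, conserved = Counter(), Counter()
    for i in range(len(reference) - 2):
        triplet = reference[i:i + 3]
        if "-" in triplet:          # skip windows that overlap alignment gaps
            continue
        seen[triplet] += 1
        if all(seq[i:i + 3] == triplet for seq in aligned[1:]):
            conserved[triplet] += 1
    return {triplet: conserved[triplet] / seen[triplet] for triplet in seen}

if __name__ == "__main__":
    # Hypothetical aligned 5'-UTR fragments from three species.
    toy_alignment = ["CCAUGGCAUGA", "CCAUGGCUUGA", "CGAUGGCAUGA"]
    for triplet, fraction in sorted(triplet_conservation(toy_alignment).items()):
        print(triplet, fraction)
```

Comparing the value for AUG with the distribution over the other 63 triplets, separately for 5′-UTRs, 3′-UTRs and coding sequences, would be the next step under this simplified scheme.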
Orthologous genes from distant eukaryotic species, e.g. animals and plants, share up to 25–30% intron positions. However, the relative contributions of evolutionary conservation and parallel gain of new introns to this pattern remain unknown. Here, the extent of independent insertion of introns in the same sites (parallel gain) in orthologous genes from phylogenetically distant eukaryotes is assessed within the framework of the protosplice site model. It is shown that protosplice sites are no more conserved during evolution of eukaryotic gene sequences than random sites. Simulation of intron insertion into protosplice sites with the observed protosplice site frequencies and intron densities shows that parallel gain can account for only a small fraction (5–10%) of shared intron positions in distantly related species. Thus, the presence of numerous introns in the same positions in orthologous genes from distant eukaryotes, such as animals, fungi and plants, appears to reflect mostly bona fide evolutionary conservation.
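The logic of such a simulation can be sketched as follows, under deliberately strong simplifying assumptions: both orthologs share the same set of protosplice sites, every site is equally likely to receive an intron, and the two lineages insert introns independently. The function name and all numbers are hypothetical and are not the frequencies or densities used in the study.

```python
import random

def simulate_shared_introns(n_sites, introns_a, introns_b, trials=2000, seed=0):
    """Mean number of intron positions shared purely by parallel gain.

    n_sites   -- number of protosplice sites common to both orthologs (assumed)
    introns_a -- introns inserted independently in lineage A
    introns_b -- introns inserted independently in lineage B
    """
    rng = random.Random(seed)       # fixed seed keeps the sketch reproducible
    sites = range(n_sites)
    total_shared = 0
    for _ in range(trials):
        positions_a = set(rng.sample(sites, introns_a))
        positions_b = set(rng.sample(sites, introns_b))
        total_shared += len(positions_a & positions_b)
    return total_shared / trials

if __name__ == "__main__":
    # Hypothetical gene pair: 200 protosplice sites, 20 introns per lineage.
    print(simulate_shared_introns(200, 20, 20))
```

Under these assumptions the expected number of coincidences is `introns_a * introns_b / n_sites`, so the simulated mean can be checked against that value and then compared with the number of shared positions actually observed.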
The mechanisms and evolutionary dynamics of intron insertion and loss in eukaryotic genes remain poorly understood. Reconstruction of parsimonious scenarios of gene structure evolution in paralogous gene families in animals and plants revealed numerous gains and losses of introns. In all analyzed lineages, the number of acquired new introns was substantially greater than the number of lost ancestral introns. This trend held even for lineages in which vertical evolution of genes involved more intron losses than gains, suggesting that gene duplication boosts intron insertion. However, dating gene duplications and the associated intron gains and losses based on the molecular clock assumption showed that very few, if any, introns were gained during the last ∼100 million years of animal and plant evolution, in agreement with previous conclusions reached through analysis of orthologous gene sets. These results are generally compatible with the emerging notion of intensive insertion and loss of introns during transitional epochs in contrast to the relative quiet of the intervening evolutionary spans.
Sequencing of multiple, nearly complete eukaryotic genomes creates opportunities for detecting previously unnoticed, subtle functional signals in non-coding regions. A genome-wide comparative analysis of orthologous sets of mammalian and yeast mRNAs revealed distinct patterns of evolutionary conservation at the boundaries of the untranslated regions (UTRs) and the coding region (CDS). Elevated sequence conservation was detected in ∼30 nt regions around the start codon. There seems to be a complementary relationship between sequence conservation in the ∼30 nt regions of the 5′-UTR immediately upstream of the start codon and that in the synonymous positions of the 5′-terminal 30 nt of the CDS: in mammalian mRNAs, the 5′-UTR shows a greater conservation than the CDS, whereas the opposite trend holds for yeast mRNAs. Unexpectedly, a ∼30 nt region downstream of the stop codon shows a substantially lower level of sequence conservation than the downstream portions of the 3′-UTRs. However, the sequence in this poorly conserved 30 nt portion of the 3′-UTR is non-random in that it has a higher GC content than the rest of the UTR. It is hypothesized that the elevated sequence conservation in the region immediately upstream of the start codon is related to the requirement for initiation factor binding during pre-initiation ribosomal scanning. In contrast, the poorly conserved region downstream of the stop codon could be involved in the post-termination scanning and dissociation of the ribosomes from the mRNA, which requires only the mRNA–ribosome interaction. Additionally, it was found that the choice of the stop codon in mammals, but not in yeasts, and the context in the immediate vicinity of the stop codons in both mammals and yeasts are subject to strong selection.
Thus, genome-wide analysis of orthologous gene sets allows detection of previously unrecognized patterns of sequence conservation, which are likely to reflect hidden functional signals, such as ribosomal filters that could regulate translation by modulating the interaction between the mRNA and ribosomes.
Prokaryotic genomes are considered to be ‘wall-to-wall’ genomes, which consist largely of genes for proteins and structural RNAs, with only a small fraction of the genomic DNA allotted to intergenic regions, which are thought to typically contain regulatory signals. The majority of bacterial and archaeal genomes contain 6–14% non-coding DNA. Significant positive correlations were detected between the fraction of non-coding DNA and inter- and intra-operonic distances, suggesting that different classes of non-coding DNA evolve congruently. In contrast, no correlation was found between any of these characteristics of non-coding sequences and the number of genes or genome size. Thus, the non-coding regions and the gene sets in prokaryotes seem to evolve in different regimes. The evolution of non-coding regions appears to be determined primarily by the selective pressure to minimize the amount of non-functional DNA, while maintaining essential regulatory signals, because of which the content of non-coding DNA in different genomes is relatively uniform and intra- and inter-operonic non-coding regions evolve congruently. In contrast, the gene set is optimized for the particular environmental niche of the given microbe, which results in the lack of correlation between the gene number and the characteristics of non-coding regions.
A computational method was developed for delineating connected gene neighborhoods in bacterial and archaeal genomes. These gene neighborhoods are not typically present, in their entirety, in any single genome, but are held together by overlapping, partially conserved gene arrays. The procedure was applied to comparing the orders of orthologous genes, which were extracted from the database of Clusters of Orthologous Groups of proteins (COGs), in 31 prokaryotic genomes and resulted in the identification of 188 clusters of gene arrays, which included 1001 of 2890 COGs. These clusters were projected onto actual genomes to produce extended neighborhoods including additional genes, which are adjacent to the genes from the clusters and are transcribed in the same direction, which resulted in a total of 2387 COGs being included in the neighborhoods. Most of the neighborhoods consist predominantly of genes united by a coherent functional theme, but also include a minority of genes without an obvious functional connection to the main theme. We hypothesize that although some of the latter genes might have unsuspected roles, others are maintained within gene arrays because of the advantage of expression at a level that is typical of the given neighborhood. We designate this phenomenon ‘genomic hitchhiking’. The largest neighborhood includes 79 genes (COGs) and consists of overlapping, rearranged ribosomal protein superoperons; apparent genome hitchhiking is particularly typical of this neighborhood and other neighborhoods that consist of genes coding for translation machinery components. Several neighborhoods involve previously undetected connections between genes, allowing new functional predictions. Gene neighborhoods appear to evolve via complex rearrangement, with different combinations of genes from a neighborhood fixed in different lineages.
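The idea that a neighborhood is held together by overlapping, partially conserved gene arrays can be illustrated with a union-find sketch that merges arrays sharing enough genes. This is not the published procedure: the overlap threshold `min_shared`, the function name, and the toy COG identifiers are assumptions made only for illustration, and gene order and orientation are ignored.

```python
def cluster_arrays(arrays, min_shared=2):
    """Merge gene arrays that overlap by at least min_shared genes.

    arrays -- list of gene arrays, each a list of gene (COG) identifiers
    Returns a sorted list of clusters, each a sorted list of identifiers.
    """
    parent = list(range(len(arrays)))

    def find(i):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    gene_sets = [set(array) for array in arrays]
    for i in range(len(gene_sets)):
        for j in range(i + 1, len(gene_sets)):
            if len(gene_sets[i] & gene_sets[j]) >= min_shared:
                parent[find(i)] = find(j)   # union the two clusters

    clusters = {}
    for i, genes in enumerate(gene_sets):
        clusters.setdefault(find(i), set()).update(genes)
    return sorted(sorted(genes) for genes in clusters.values())

if __name__ == "__main__":
    toy_arrays = [
        ["COG1", "COG2", "COG3"],
        ["COG2", "COG3", "COG4"],
        ["COG3", "COG4", "COG5"],
        ["COG8", "COG9"],
    ]
    for cluster in cluster_arrays(toy_arrays):
        print(cluster)
```

Note that the first and third toy arrays share only one gene yet end up in the same cluster through the second array, which mirrors how a neighborhood need not be present in its entirety in any single genome.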
During a systematic analysis of conserved gene context in prokaryotic genomes, a previously undetected, complex, partially conserved neighborhood consisting of more than 20 genes was discovered in most Archaea (with the exception of Thermoplasma acidophilum and Halobacterium NRC-1) and some bacteria, including the hyperthermophiles Thermotoga maritima and Aquifex aeolicus. The gene composition and gene order in this neighborhood vary greatly between species, but all versions have a stable, conserved core that consists of five genes. One of the core genes encodes a predicted DNA helicase, often fused to a predicted HD-superfamily hydrolase, and another encodes a RecB family exonuclease; three core genes remain uncharacterized, but one of these might encode a nuclease of a new family. Two more genes that belong to this neighborhood and are present in most of the genomes in which the neighborhood was detected encode, respectively, a predicted HD-superfamily hydrolase (possibly a nuclease) of a distinct family and a predicted, novel DNA polymerase. Another characteristic feature of this neighborhood is the expansion of a superfamily of paralogous, uncharacterized proteins, which are encoded by at least 20–30% of the genes in the neighborhood. The functional features of the proteins encoded in this neighborhood suggest that they comprise a previously undetected DNA repair system, which, to our knowledge, is the first repair system largely specific for thermophiles to be identified. This hypothetical repair system might be functionally analogous to the bacterial–eukaryotic system of translesion, mutagenic repair whose central components are DNA polymerases of the UmuC-DinB-Rad30-Rev1 superfamily, which typically are missing in thermophiles.