PMCC PMCC

Search tips
Search criteria

Advanced
Results 1-25 (1022161)

Clipboard (0)
None

Related Articles

1.  The origin of introns and their role in eukaryogenesis: a compromise solution to the introns-early versus introns-late debate? 
Biology Direct  2006;1:22.
Background
Ever since the discovery of 'genes in pieces' and mRNA splicing in eukaryotes, origin and evolution of spliceosomal introns have been considered within the conceptual framework of the 'introns early' versus 'introns late' debate. The 'introns early' hypothesis, which is closely linked to the so-called exon theory of gene evolution, posits that protein-coding genes were interrupted by numerous introns even at the earliest stages of life's evolution and that introns played a major role in the origin of proteins by facilitating recombination of sequences coding for small protein/peptide modules. Under this scenario, the absence of spliceosomal introns in prokaryotes is considered to be a result of "genome streamlining". The 'introns late' hypothesis counters that spliceosomal introns emerged only in eukaryotes, and moreover, have been inserted into protein-coding genes continuously throughout the evolution of eukaryotes. Beyond the formal dilemma, the more substantial side of this debate has to do with possible roles of introns in the evolution of eukaryotes.
Results
I argue that several lines of evidence now suggest a coherent solution to the introns-early versus introns-late debate, and the emerging picture of intron evolution integrates aspects of both views although, formally, there seems to be no support for the original version of introns-early. Firstly, there is growing evidence that spliceosomal introns evolved from group II self-splicing introns which are present, usually, in small numbers, in many bacteria, and probably, moved into the evolving eukaryotic genome from the α-proteobacterial progenitor of the mitochondria. Secondly, the concept of a primordial pool of 'virus-like' genetic elements implies that self-splicing introns are among the most ancient genetic entities. Thirdly, reconstructions of the ancestral state of eukaryotic genes suggest that the last common ancestor of extant eukaryotes had an intron-rich genome. Thus, it appears that ancestors of spliceosomal introns, indeed, have existed since the earliest stages of life's evolution, in a formal agreement with the introns-early scenario. However, there is no evidence that these ancient introns ever became widespread before the emergence of eukaryotes, hence, the central tenet of introns-early, the role of introns in early evolution of proteins, has no support. However, the demonstration that numerous introns invaded eukaryotic genes at the outset of eukaryotic evolution and that subsequent intron gain has been limited in many eukaryotic lineages implicates introns as an ancestral feature of eukaryotic genomes and refutes radical versions of introns-late. Perhaps, most importantly, I argue that the intron invasion triggered other pivotal events of eukaryogenesis, including the emergence of the spliceosome, the nucleus, the linear chromosomes, the telomerase, and the ubiquitin signaling system. This concept of eukaryogenesis, in a sense, revives some tenets of the exon hypothesis, by assigning to introns crucial roles in eukaryotic evolutionary innovation.
Conclusion
The scenario of the origin and evolution of introns that is best compatible with the results of comparative genomics and theoretical considerations goes as follows: self-splicing introns since the earliest stages of life's evolution – numerous spliceosomal introns invading genes of the emerging eukaryote during eukaryogenesis – subsequent lineage-specific loss and gain of introns. The intron invasion, probably, spawned by the mitochondrial endosymbiont, might have critically contributed to the emergence of the principal features of the eukaryotic cell. This scenario combines aspects of the introns-early and introns-late views.
Reviewers
this article was reviewed by W. Ford Doolittle, James Darnell (nominated by W. Ford Doolittle), William Martin, and Anthony Poole.
doi:10.1186/1745-6150-1-22
PMCID: PMC1570339  PMID: 16907971
2.  New Maximum Likelihood Estimators for Eukaryotic Intron Evolution 
PLoS Computational Biology  2005;1(7):e79.
The evolution of spliceosomal introns remains poorly understood. Although many approaches have been used to infer intron evolution from the patterns of intron position conservation, the results to date have been contradictory. In this paper, we address the problem using a novel maximum likelihood method, which allows estimation of the frequency of intron insertion target sites, together with the rates of intron gain and loss. We analyzed the pattern of 10,044 introns (7,221 intron positions) in the conserved regions of 684 sets of orthologs from seven eukaryotes. We determined that there is an average of one target site per 11.86 base pairs (bp) (95% confidence interval, 9.27 to 14.39 bp). In addition, our results showed that: (i) overall intron gains are ~25% greater than intron losses, although specific patterns vary with time and lineage; (ii) parallel gains account for ~18.5% of shared intron positions; and (iii) reacquisition following loss accounts for ~0.5% of all intron positions. Our results should assist in resolving the long-standing problem of inferring the evolution of spliceosomal introns.
Synopsis
When did spliceosomal introns originate, and what is their role? These questions are the central subject of the introns-early versus introns-late debate. Inference of intron evolution from the pattern of intron position conservation is vital for resolving this debate. So far, different methods of two approaches, maximum parsimony (MP) and maximum likelihood (ML), have been developed, but the results are contradictory. The differences between previous ML results are due predominantly to differing assumptions concerning the frequency of target sites for intron insertion. This paper describes a new ML method that treats this frequency as a parameter requiring optimization. Using the pattern of intron position in conserved regions of 684 clusters of gene orthologs from seven eukaryotes, the authors found that, on average, there is one target site per ~12 base pairs. The results of intron evolution inferred using this optimal frequency are more definitive than previous ML results. Since the ML method is preferred to the MP one for large datasets, the current results should be the most reliable ones to date. The results show that during the course of evolution there have been slightly more intron gains than losses, and thus they favor introns-late. These results should shed new light on our understanding of intron evolution.
doi:10.1371/journal.pcbi.0010079
PMCID: PMC1323467  PMID: 16389300
3.  Evolutionary Convergence on Highly-Conserved 3′ Intron Structures in Intron-Poor Eukaryotes and Insights into the Ancestral Eukaryotic Genome 
PLoS Genetics  2008;4(8):e1000148.
The presence of spliceosomal introns in eukaryotes raises a range of questions about genomic evolution. Along with the fundamental mysteries of introns' initial proliferation and persistence, the evolutionary forces acting on intron sequences remain largely mysterious. Intron number varies across species from a few introns per genome to several introns per gene, and the elements of intron sequences directly implicated in splicing vary from degenerate to strict consensus motifs. We report a 50-species comparative genomic study of intron sequences across most eukaryotic groups. We find two broad and striking patterns. First, we find that some highly intron-poor lineages have undergone evolutionary convergence to strong 3′ consensus intron structures. This finding holds for both branch point sequence and distance between the branch point and the 3′ splice site. Interestingly, this difference appears to exist within the genomes of green alga of the genus Ostreococcus, which exhibit highly constrained intron sequences through most of the intron-poor genome, but not in one much more intron-dense genomic region. Second, we find evidence that ancestral genomes contained highly variable branch point sequences, similar to more complex modern intron-rich eukaryotic lineages. In addition, ancestral structures are likely to have included polyT tails similar to those in metazoans and plants, which we found in a variety of protist lineages. Intriguingly, intron structure evolution appears to be quite different across lineages experiencing different types of genome reduction: whereas lineages with very few introns tend towards highly regular intronic sequences, lineages with very short introns tend towards highly degenerate sequences. Together, these results attest to the complex nature of ancestral eukaryotic splicing, the qualitatively different evolutionary forces acting on intron structures across modern lineages, and the impressive evolutionary malleability of eukaryotic gene structures.
Author Summary
The spliceosomal introns that interrupt eukaryotic genes show great number and sequence variation across species, from the rare, highly uniform yeast introns to the ubiquitous and highly variable vertebrate intron sequences. The causes of these differences remain mysterious. We studied sequences of intron branch points and 3′ termini in 50 eukaryotic species. All intron-rich species exhibit variable 3′ sequences. However, intron-poor species range from variable sequences, to uniform branch point motifs, to uniform branch point motifs in uniform positions along the intronic sequence. This is a more complex pattern than the clear relationship between intron number and 5′ intron sequence uniformity found previously. The correspondence of sequence uniformity and intron number extends to species of the green algal genus Ostreococcus, in which the single intron-rich genomic region shows far more variable intron sequences than in the otherwise intron-poor genome. We suggest that different concentrations of spliceosomal complexes may explain these differences. In addition, we report the existence of 3′ polyT tails in diverse eukaryotic protists, suggesting that this structure is ancestral. Together, these results underscore the complexity of ancestral eukaryotic splicing, the qualitatively different evolutionary forces acting on intron sequences in modern eukaryotes, and the impressive evolutionary malleability of eukaryotic genes.
doi:10.1371/journal.pgen.1000148
PMCID: PMC2483917  PMID: 18688272
4.  A Detailed History of Intron-rich Eukaryotic Ancestors Inferred from a Global Survey of 100 Complete Genomes 
PLoS Computational Biology  2011;7(9):e1002150.
Protein-coding genes in eukaryotes are interrupted by introns, but intron densities widely differ between eukaryotic lineages. Vertebrates, some invertebrates and green plants have intron-rich genes, with 6–7 introns per kilobase of coding sequence, whereas most of the other eukaryotes have intron-poor genes. We reconstructed the history of intron gain and loss using a probabilistic Markov model (Markov Chain Monte Carlo, MCMC) on 245 orthologous genes from 99 genomes representing the three of the five supergroups of eukaryotes for which multiple genome sequences are available. Intron-rich ancestors are confidently reconstructed for each major group, with 53 to 74% of the human intron density inferred with 95% confidence for the Last Eukaryotic Common Ancestor (LECA). The results of the MCMC reconstruction are compared with the reconstructions obtained using Maximum Likelihood (ML) and Dollo parsimony methods. An excellent agreement between the MCMC and ML inferences is demonstrated whereas Dollo parsimony introduces a noticeable bias in the estimations, typically yielding lower ancestral intron densities than MCMC and ML. Evolution of eukaryotic genes was dominated by intron loss, with substantial gain only at the bases of several major branches including plants and animals. The highest intron density, 120 to 130% of the human value, is inferred for the last common ancestor of animals. The reconstruction shows that the entire line of descent from LECA to mammals was intron-rich, a state conducive to the evolution of alternative splicing.
Author Summary
In eukaryotes, protein-coding genes are interrupted by non-coding introns. The intron densities widely differ, from 6–7 introns per kilobase of coding sequence in vertebrates, some invertebrates and plants, to only a few introns across the entire genome in many unicellular forms. We applied a robust statistical methodology, Markov Chain Monte Carlo, to reconstruct the history of intron gain and loss throughout the evolution of eukaryotes using a set of 245 homologous genes from 99 genomes that represent the diversity of eukaryotes. Intron-rich ancestors were confidently inferred for each major eukaryotic group including 53% to 74% of the human intron density for the last eukaryotic common ancestor, and 120% to 130% of the human value for the last common ancestor of animals. Evolution of eukaryotic genes involved primarily intron loss, with substantial gain only at the bases of several major branches including plants and animals. Thus, the common ancestor of all extant eukaryotes was a complex organism with a gene architecture resembling those in multicellular organisms. The line of descent from the last common ancestor to mammals was an uninterrupted intron-rich state that, given the error-prone splicing in intron-rich organisms, was conducive to the elaboration of functional alternative splicing.
doi:10.1371/journal.pcbi.1002150
PMCID: PMC3174169  PMID: 21935348
5.  Origin and evolution of spliceosomal introns 
Biology Direct  2012;7:11.
Evolution of exon-intron structure of eukaryotic genes has been a matter of long-standing, intensive debate. The introns-early concept, later rebranded ‘introns first’ held that protein-coding genes were interrupted by numerous introns even at the earliest stages of life's evolution and that introns played a major role in the origin of proteins by facilitating recombination of sequences coding for small protein/peptide modules. The introns-late concept held that introns emerged only in eukaryotes and new introns have been accumulating continuously throughout eukaryotic evolution. Analysis of orthologous genes from completely sequenced eukaryotic genomes revealed numerous shared intron positions in orthologous genes from animals and plants and even between animals, plants and protists, suggesting that many ancestral introns have persisted since the last eukaryotic common ancestor (LECA). Reconstructions of intron gain and loss using the growing collection of genomes of diverse eukaryotes and increasingly advanced probabilistic models convincingly show that the LECA and the ancestors of each eukaryotic supergroup had intron-rich genes, with intron densities comparable to those in the most intron-rich modern genomes such as those of vertebrates. The subsequent evolution in most lineages of eukaryotes involved primarily loss of introns, with only a few episodes of substantial intron gain that might have accompanied major evolutionary innovations such as the origin of metazoa. The original invasion of self-splicing Group II introns, presumably originating from the mitochondrial endosymbiont, into the genome of the emerging eukaryote might have been a key factor of eukaryogenesis that in particular triggered the origin of endomembranes and the nucleus. Conversely, splicing errors gave rise to alternative splicing, a major contribution to the biological complexity of multicellular eukaryotes. There is no indication that any prokaryote has ever possessed a spliceosome or introns in protein-coding genes, other than relatively rare mobile self-splicing introns. Thus, the introns-first scenario is not supported by any evidence but exon-intron structure of protein-coding genes appears to have evolved concomitantly with the eukaryotic cell, and introns were a major factor of evolution throughout the history of eukaryotes. This article was reviewed by I. King Jordan, Manuel Irimia (nominated by Anthony Poole), Tobias Mourier (nominated by Anthony Poole), and Fyodor Kondrashov. For the complete reports, see the Reviewers’ Reports section.
doi:10.1186/1745-6150-7-11
PMCID: PMC3488318  PMID: 22507701
Intron sliding; Intron gain; Intron loss; Spliceosome; Splicing signals; Evolution of exon/intron structure; Alternative splicing; Phylogenetic trees; Mobile domains; Eukaryotic ancestor
6.  Recurrent Loss of Specific Introns during Angiosperm Evolution 
PLoS Genetics  2014;10(12):e1004843.
Numerous instances of presence/absence variations for introns have been documented in eukaryotes, and some cases of recurrent loss of the same intron have been suggested. However, there has been no comprehensive or phylogenetically deep analysis of recurrent intron loss. Of 883 cases of intron presence/absence variation that we detected in five sequenced grass genomes, 93 were confirmed as recurrent losses and the rest could be explained by single losses (652) or single gains (118). No case of recurrent intron gain was observed. Deep phylogenetic analysis often indicated that apparent intron gains were actually numerous independent losses of the same intron. Recurrent loss exhibited extreme non-randomness, in that some introns were removed independently in many lineages. The two larger genomes, maize and sorghum, were found to have a higher rate of both recurrent loss and overall loss and/or gain than foxtail millet, rice or Brachypodium. Adjacent introns and small introns were found to be preferentially lost. Intron loss genes exhibited a high frequency of germ line or early embryogenesis expression. In addition, flanking exon A+T-richness and intron TG/CG ratios were higher in retained introns. This last result suggests that epigenetic status, as evidenced by a loss of methylated CG dinucleotides, may play a role in the process of intron loss. This study provides the first comprehensive analysis of recurrent intron loss, makes a series of novel findings on the patterns of recurrent intron loss during the evolution of the grass family, and provides insight into the molecular mechanism(s) underlying intron loss.
Author Summary
The spliceosomal introns are nucleotide sequences that interrupt coding regions of eukaryotic genes and are removed by RNA splicing after transcription. Recent studies have reported several examples of possible recurrent intron loss or gain, i.e., introns that are independently removed from or inserted into the identical sites more than once in an investigated phylogeny. However, the frequency, evolutionary patterns or other characteristics of recurrent intron turnover remain unknown. We provide results for the first comprehensive analysis of recurrent intron turnover within a plant family and show that recurrent intron loss represents a considerable portion of all intron losses identified and intron loss events far outnumber intron gain events. We also demonstrate that recurrent intron loss is non-random, affecting only a small number of introns that are repeatedly lost, and that different lineages show significantly different rates of intron loss. Our results suggest a possible role of DNA methylation in the process of intron loss. Moreover, this study provides strong support for the model of intron loss by reverse transcriptase mediated conversion of genes by their processed mRNA transcripts.
doi:10.1371/journal.pgen.1004843
PMCID: PMC4256211  PMID: 25474210
7.  Effects of Taxon Sampling in Reconstructions of Intron Evolution 
Introns comprise a considerable portion of eukaryotic genomes; however, their evolution is understudied. Numerous works of the last years largely disagree on many aspects of intron evolution. Interpretation of these differences is hindered because different algorithms and taxon sampling strategies were used. Here, we present the first attempt of a systematic evaluation of the effects of taxon sampling on popular intron evolution estimation algorithms. Using the “taxon jackknife” method, we compared the effect of taxon sampling on the behavior of intron evolution inferring algorithms. We show that taxon sampling can dramatically affect the inferences and identify conditions where algorithms are prone to systematic errors. Presence or absence of some key species is often more important than the taxon sampling size alone. Criteria of representativeness of the taxonomic sampling for reliable reconstructions are outlined. Presence of the deep-branching species with relatively high intron density is more important than sheer number of species. According to these criteria, currently available genomic databases are representative enough to provide reliable inferences of the intron evolution in animals, land plants, and fungi, but they underrepresent many groups of unicellular eukaryotes, including the well-studied Alveolata.
doi:10.1155/2013/671316
PMCID: PMC3647540  PMID: 23671844
8.  Sm/Lsm Genes Provide a Glimpse into the Early Evolution of the Spliceosome 
PLoS Computational Biology  2009;5(3):e1000315.
The spliceosome, a sophisticated molecular machine involved in the removal of intervening sequences from the coding sections of eukaryotic genes, appeared and subsequently evolved rapidly during the early stages of eukaryotic evolution. The last eukaryotic common ancestor (LECA) had both complex spliceosomal machinery and some spliceosomal introns, yet little is known about the early stages of evolution of the spliceosomal apparatus. The Sm/Lsm family of proteins has been suggested as one of the earliest components of the emerging spliceosome and hence provides a first in-depth glimpse into the evolving spliceosomal apparatus. An analysis of 335 Sm and Sm-like genes from 80 species across all three kingdoms of life reveals two significant observations. First, the eukaryotic Sm/Lsm family underwent two rapid waves of duplication with subsequent divergence resulting in 14 distinct genes. Each wave resulted in a more sophisticated spliceosome, reflecting a possible jump in the complexity of the evolving eukaryotic cell. Second, an unusually high degree of conservation in intron positions is observed within individual orthologous Sm/Lsm genes and between some of the Sm/Lsm paralogs. This suggests that functional spliceosomal introns existed before the emergence of the complete Sm/Lsm family of proteins; hence, spliceosomal machinery with considerably fewer components than today's spliceosome was already functional.
Author Summary
The spliceosome is a complex molecular machine that removes intervening sequences (introns) from mRNAs. It is unique to eukaryotes. Although prokaryotes have self-splicing introns, they completely lack spliceosomal introns and the spliceosome itself. Yet even the simplest eukaryotic organisms have introns and a rather complex spliceosomal apparatus. Little is known about how this amazing machine rapidly evolved in early eukaryotes. Here, we attempt to reconstruct a part of this evolutionary process using one of the most fundamental components of the spliceosome—the Sm and Lsm family of proteins. Using sequence and structure analysis as well as the analysis of the intron positions in Sm and Lsm genes in conjunction with a wealth of published data, we propose a plausible scenario for some aspects of spliceosomal evolution. In particular, we suggest that the Lsm family of genes could have been the first and the most essential component that allowed rudimentary splicing of early spliceosomal introns. Extensive duplications of Lsm genes and the later rise of the Sm gene family likely reflect a gradual increase in complexity of the spliceosome.
doi:10.1371/journal.pcbi.1000315
PMCID: PMC2650416  PMID: 19282982
9.  Nonsense-Mediated Decay Enables Intron Gain in Drosophila 
PLoS Genetics  2010;6(1):e1000819.
Intron number varies considerably among genomes, but despite their fundamental importance, the mutational mechanisms and evolutionary processes underlying the expansion of intron number remain unknown. Here we show that Drosophila, in contrast to most eukaryotic lineages, is still undergoing a dramatic rate of intron gain. These novel introns carry significantly weaker splice sites that may impede their identification by the spliceosome. Novel introns are more likely to encode a premature termination codon (PTC), indicating that nonsense-mediated decay (NMD) functions as a backup for weak splicing of new introns. Our data suggest that new introns originate when genomic insertions with weak splice sites are hidden from selection by NMD. This mechanism reduces the sequence requirement imposed on novel introns and implies that the capacity of the spliceosome to recognize weak splice sites was a prerequisite for intron gain during eukaryotic evolution.
Author Summary
The surprising observation 30 years ago that genes are interrupted by non-coding introns changed our view of gene architecture. Intron number varies dramatically among species; ranging from nine introns/gene in humans to less than one in some simple eukyarotes. Here we ask where new introns come from and how they are maintained in a population. We find that novel introns do not arise from pre-existing introns, although the mechanisms that generate novel introns remain unclear. We also show that novel introns carry only weak signals for their identification and removal, and therefore depend on nonsense-mediated decay (NMD). NMD maintains RNA quality control by degrading transcripts that have not been spliced properly. We propose that NMD shelters novel introns from natural selection. This increases the likelihood that a novel intron will rise in frequency and be maintained within a population, thus increasing the rate of intron gain.
doi:10.1371/journal.pgen.1000819
PMCID: PMC2809761  PMID: 20107520
10.  Phase distribution of spliceosomal introns: implications for intron origin 
Background
The origin of spliceosomal introns is the central subject of the introns-early versus introns-late debate. The distribution of intron phases is non-uniform, with an excess of phase-0 introns. Introns-early explains this by speculating that a fraction of present-day introns were present between minigenes in the progenote and therefore must lie in phase-0. In contrast, introns-late predicts that the nonuniformity of intron phase distribution reflects the nonrandomness of intron insertions.
Results
In this paper, we tested the two theories using analyses of intron phase distribution. We inferred the evolution of intron phase distribution from a dataset of 684 gene orthologs from seven eukaryotes using a maximum likelihood method. We also tested whether the observed intron phase distributions from 10 eukaryotes can be explained by intron insertions on a genome-wide scale. In contrast to the prediction of introns-early, the inferred evolution of intron phase distribution showed that the proportion of phase-0 introns increased over evolution. Consistent with introns-late, the observed intron phase distributions matched those predicted by an intron insertion model quite well.
Conclusion
Our results strongly support the introns-late hypothesis of the origin of spliceosomal introns.
doi:10.1186/1471-2148-6-69
PMCID: PMC1574350  PMID: 16959043
11.  Comparative genomic analysis of fungal genomes reveals intron-rich ancestors 
Genome Biology  2007;8(10):R223.
Analysis of intron gain and loss in fungal genomes provides support for an intron-rich fungus-animal ancestor.
Background
Eukaryotic protein-coding genes are interrupted by spliceosomal introns, which are removed from transcripts before protein translation. Many facets of spliceosomal intron evolution, including age, mechanisms of origins, the role of natural selection, and the causes of the vast differences in intron number between eukaryotic species, remain debated. Genome sequencing and comparative analysis has made possible whole genome analysis of intron evolution to address these questions.
Results
We analyzed intron positions in 1,161 sets of orthologous genes across 25 eukaryotic species. We find strong support for an intron-rich fungus-animal ancestor, with more than four introns per kilobase, comparable to the highest known modern intron densities. Indeed, the fungus-animal ancestor is estimated to have had more introns than any of the extant fungi in this study. Thus, subsequent fungal evolution has been characterized by widespread and recurrent intron loss occurring in all fungal clades. These results reconcile three previously proposed methods for estimation of ancestral intron number, which previously gave very different estimates of ancestral intron number for eight eukaryotic species, as well as a fourth more recent method. We do not find a clear inverse correspondence between rates of intron loss and gain, contrary to the predictions of selection-based proposals for interspecific differences in intron number.
Conclusion
Our results underscore the high intron density of eukaryotic ancestors and the widespread importance of intron loss through eukaryotic evolution.
doi:10.1186/gb-2007-8-10-r223
PMCID: PMC2246297  PMID: 17949488
12.  Metabolic modeling of endosymbiont genome reduction on a temporal scale 
This study explores the order in which individual metabolic genes are lost in an in silico evolutionary process leading from the metabolic network of Eschericia coli to that of the genome-reduced endosymbiont Buchnera aphidicola.
Simulating the reductive evolutionary process under several growth conditions, a remarkable correlation between in silico and phylogenetically reconstructed gene loss time is obtained.A gene's k-robustness (its depth of backups) is prime determinant of its loss time.In silico gene loss time is a better predictor of their actual loss times than genomic features and network properties.Simulating the reductive evolutionary process by the loss of large blocks followed by single-gene deletions, as known to occur in evolution, yields a remarkable correspondence with the phylogenetic reconstruction and the block loss reported in the literature.
An open fundamental challenge in Systems Biology is whether a genome-scale model can predict patterns of genome evolution by realistically accounting for the associated biochemical constraints. In this study, we explore the order in which individual genes are lost in an in silico evolutionary process, leading from the metabolic network of Eschericia coli to that of the endosymbiont Buchnera aphidicola.
To evaluate the in silico gene loss time, we repeated the reductive evolutionary process introduced by Pál et al (2006), denoting the in silico deletion time of a gene in a single run of the reductive evolutionary process as the number of genes deleted before its own deletion occurred. By comparing the in silico evaluations of the gene loss time to that obtained by a phylogenetic reconstruction (Figure 1), we could evaluate the ability of an in silico process to predict temporal patterns of genome reduction. Applying this procedure on a literature-based viable media, we obtained a mean Spearman's correlation of 0.46 (53% of the maximal correlation, empirical P-value <9.9e−4) between in silico and phylogenetically reconstructed loss times. In order to provide an upper bound on evolutionary necessity stemming from metabolic constraints, we searched the space of potential growth media and biomass functions via a simulated annealing search algorithm aimed at identifying an environment/biomass function that maximizes the target correlation between in silico and reconstructed loss times. Simulating the reductive evolutionary process under the growth conditions and biomass function obtained in this process, we managed to improve the correlation between in silico and reconstructed loss times to a mean Spearman's correlation of 0.54 (63% of the maximal correlation, empirical P-value <9.9e−4, Figure 3).
Examining the dependency of the predicted loss time of each gene on its intrinsic network-level properties we find a very strong inverse Spearman's correlation of −0.84 (empirical P-value <9.9e−4) between the order of gene loss predicted in silico and the k-robustness levels of the genes, the latter denoting the depth of their functional backups in the network (Deutscher et al, 2006). Moreover, in order to examine whether the relative loss time of a gene is influenced by its functional dependencies with other genes, we performed a flux-coupling analysis and identified pairs of reactions whose activities asymmetrically depend on each other, i.e., are directionally coupled (Burgard et al, 2004). We find that genes encoding reactions whose activity is needed for activating the other reaction (and not vice versa) have a tendency to be lost later, as one would expect (binomial P-value <1e−14).
To assess the scale of these results, we examined as a control how well genomic features and network properties predict the phylogenetically reconstructed gene loss times. We examined the dependency of the latter on several factors that are known be inversely correlated with the propensity of a gene to be lost (Brinza et al, 2009; Delmotte et al, 2006; Tamames et al, 2007), including the genes' mRNA levels, tAI values (Covert et al, 2004; Reis et al, 2004; Sharp and Li, 1987; Tuller et al, 2010a) and the number of partners the gene products have in a protein–protein interaction network. Remarkably, these genomic features yield considerably lower Spearman's correlation than that obtained by the in silico simulations. Moreover, multiply regressing the loss times from the phylogenetic reconstruction on the in silico gene loss time predictions and the genomic and network variables, we found that the (normalized) coefficient of the in silico predictions in the regression is much higher than those of the genomic features, further testifying to the considerable independent predictive power of the metabolic model.
Finally, simulating the evolutionary process as large block deletions at first followed by single-gene deletions as is thought to occur in evolution (Moran and Mira, 2001; van Ham et al, 2003), a remarkable correspondence with the phylogenetic reconstruction was found. Namely, we find that after a certain amount of genes are deleted from the genome, no further block deletions can occur due to the increasing density of essential genes. Notably, the maximum amount of genes that can be deleted in blocks (i.e., until no more blocks can be deleted) corresponds to the number of genes appearing in our phylogenetic reconstruction from the LCA (last common ancestor of Buchnera and E. coli) to the LCSA (last common symbiotic ancestor, nodes 1–3 in Figure 1A), as described in the literature.
A fundamental challenge in Systems Biology is whether a cell-scale metabolic model can predict patterns of genome evolution by realistically accounting for associated biochemical constraints. Here, we study the order in which genes are lost in an in silico evolutionary process, leading from the metabolic network of Eschericia coli to that of the endosymbiont Buchnera aphidicola. We examine how this order correlates with the order by which the genes were actually lost, as estimated from a phylogenetic reconstruction. By optimizing this correlation across the space of potential growth and biomass conditions, we compute an upper bound estimate on the model's prediction accuracy (R=0.54). The model's network-based predictive ability outperforms predictions obtained using genomic features of individual genes, reflecting the effect of selection imposed by metabolic stoichiometric constraints. Thus, while the timing of gene loss might be expected to be a completely stochastic evolutionary process, remarkably, we find that metabolic considerations, on their own, make a marked 40% contribution to determining when such losses occur.
doi:10.1038/msb.2011.11
PMCID: PMC3094061  PMID: 21451589
constraint-based modeling; endosymbiont; evolution; metabolism
13.  EREM: Parameter Estimation and Ancestral Reconstruction by Expectation-Maximization Algorithm for a Probabilistic Model of Genomic Binary Characters Evolution 
Advances in Bioinformatics  2010;2010:167408.
Evolutionary binary characters are features of species or genes, indicating the absence (value zero) or presence (value one) of some property. Examples include eukaryotic gene architecture (the presence or absence of an intron in a particular locus), gene content, and morphological characters. In many studies, the acquisition of such binary characters is assumed to represent a rare evolutionary event, and consequently, their evolution is analyzed using various flavors of parsimony. However, when gain and loss of the character are not rare enough, a probabilistic analysis becomes essential. Here, we present a comprehensive probabilistic model to describe the evolution of binary characters on a bifurcating phylogenetic tree. A fast software tool, EREM, is provided, using maximum likelihood to estimate the parameters of the model and to reconstruct ancestral states (presence and absence in internal nodes) and events (gain and loss events along branches).
doi:10.1155/2010/167408
PMCID: PMC2866244  PMID: 20467467
14.  Patterns of intron gain and conservation in eukaryotic genes 
Background:
The presence of introns in protein-coding genes is a universal feature of eukaryotic genome organization, and the genes of multicellular eukaryotes, typically, contain multiple introns, a substantial fraction of which share position in distant taxa, such as plants and animals. Depending on the methods and data sets used, researchers have reached opposite conclusions on the causes of the high fraction of shared introns in orthologous genes from distant eukaryotes. Some studies conclude that shared intron positions reflect, almost entirely, a remarkable evolutionary conservation, whereas others attribute it to parallel gain of introns. To resolve these contradictions, it is crucial to analyze the evolution of introns by using a model that minimally relies on arbitrary assumptions.
Results:
We developed a probabilistic model of evolution that allows for variability of intron gain and loss rates over branches of the phylogenetic tree, individual genes, and individual sites. Applying this model to an extended set of conserved eukaryotic genes, we find that parallel gain, on average, accounts for only ~8% of the shared intron positions. However, the distribution of parallel gains over the phylogenetic tree of eukaryotes is highly non-uniform. There are, practically, no parallel gains in closely related lineages, whereas for distant lineages, such as animals and plants, parallel gains appear to contribute up to 20% of the shared intron positions. In accord with these findings, we estimated that ancestral introns have a high probability to be retained in extant genomes, and conversely, that a substantial fraction of extant introns have retained their positions since the early stages of eukaryotic evolution. In addition, the density of sites that are available for intron insertion is estimated to be, approximately, one in seven basepairs.
Conclusion:
We obtained robust estimates of the contribution of parallel gain to the observed sharing of intron positions between eukaryotic species separated by different evolutionary distances. The results indicate that, although the contribution of parallel gains varies across the phylogenetic tree, the high level of intron position sharing is due, primarily, to evolutionary conservation. Accordingly, numerous introns appear to persist in the same position over hundreds of millions of years of evolution. This is compatible with recent observations of a negative correlation between the rate of intron gain and coding sequence evolution rate of a gene, suggesting that at least some of the introns are functionally relevant.
doi:10.1186/1471-2148-7-192
PMCID: PMC2151770  PMID: 17935625
15.  Analysis of Ribosomal Protein Gene Structures: Implications for Intron Evolution  
PLoS Genetics  2006;2(3):e25.
Many spliceosomal introns exist in the eukaryotic nuclear genome. Despite much research, the evolution of spliceosomal introns remains poorly understood. In this paper, we tried to gain insights into intron evolution from a novel perspective by comparing the gene structures of cytoplasmic ribosomal proteins (CRPs) and mitochondrial ribosomal proteins (MRPs), which are held to be of archaeal and bacterial origin, respectively. We analyzed 25 homologous pairs of CRP and MRP genes that together had a total of 527 intron positions. We found that all 12 of the intron positions shared by CRP and MRP genes resulted from parallel intron gains and none could be considered to be “conserved,” i.e., descendants of the same ancestor. This was supported further by the high frequency of proto-splice sites at these shared positions; proto-splice sites are proposed to be sites for intron insertion. Although we could not definitively disprove that spliceosomal introns were already present in the last universal common ancestor, our results lend more support to the idea that introns were gained late. At least, our results show that MRP genes were intronless at the time of endosymbiosis. The parallel intron gains between CRP and MRP genes accounted for 2.3% of total intron positions, which should provide a reliable estimate for future inferences of intron evolution.
Synopsis
Genes in eukaryotes are usually intervened by extra bits of DNA sequence, called introns, that have to be removed after the genes are transcribed into RNA. Why do introns exist in eukaryotic genes? What is the reason for the increased intron density in higher eukaryotes? There is much that is not known about introns. This research tries to clarify the evolutionary process by which introns arose by comparing the gene structures of two types of ribosomal proteins; one in cytoplasm and the other in mitochondria of the cell. Since cytoplasm and mitochondria are of archaeal and bacterial origin, respectively, cytoplasmic ribosomal proteins (CRPs) and mitochondrial ribosomal proteins (MRPs) are believed to diverge at the same time with the divergence of archaea and bacteria. Thus, a comparative analysis of CRP and MRP genes may reveal whether introns already existed at the last common ancestor of archaea and bacteria (introns-early) or whether they emerged late (introns-late). The results make it clear, at least, that all of the introns in MRP genes were gained during the course of eukaryotic evolution and therefore lend more support to the introns-late theory.
doi:10.1371/journal.pgen.0020025
PMCID: PMC1386722  PMID: 16518464
16.  Modeling the evolution dynamics of exon-intron structure with a general random fragmentation process 
Background
Most eukaryotic genes are interrupted by spliceosomal introns. The evolution of exon-intron structure remains mysterious despite rapid advance in genome sequencing technique. In this work, a novel approach is taken based on the assumptions that the evolution of exon-intron structure is a stochastic process, and that the characteristics of this process can be understood by examining its historical outcome, the present-day size distribution of internal translated exons (exon). Through the combination of simulation and modeling the size distribution of exons in different species, we propose a general random fragmentation process (GRFP) to characterize the evolution dynamics of exon-intron structure. This model accurately predicts the probability that an exon will be split by a new intron and the distribution of novel insertions along the length of the exon.
Results
As the first observation from this model, we show that the chance for an exon to obtain an intron is proportional to its size to the 3rd power. We also show that such size dependence is nearly constant across gene, with the exception of the exons adjacent to the 5′ UTR. As the second conclusion from the model, we show that intron insertion loci follow a normal distribution with a mean of 0.5 (center of the exon) and a standard deviation of 0.11. Finally, we show that intron insertions within a gene are independent of each other for vertebrates, but are more negatively correlated for non-vertebrate. We use simulation to demonstrate that the negative correlation might result from significant intron loss during evolution, which could be explained by selection against multi-intron genes in these organisms.
Conclusions
The GRFP model suggests that intron gain is dynamic with a higher chance for longer exons; introns are inserted into exons randomly with the highest probability at the center of the exon. GRFP estimates that there are 78 introns in every 10 kb coding sequences for vertebrate genomes, agreeing with empirical observations. GRFP also estimates that there are significant intron losses in the evolution of non-vertebrate genomes, with extreme cases of around 57% intron loss in Drosophila melanogaster, 28% in Caenorhabditis elegans, and 24% in Oryza sativa.
doi:10.1186/1471-2148-13-57
PMCID: PMC3732091  PMID: 23448166
Evolution of exon-intron structure; General random fragmentation process; Simulation
17.  Computational identification of functional introns: high positional conservation of introns that harbor RNA genes 
Nucleic Acids Research  2013;41(11):5604-5613.
An appreciable fraction of introns is thought to have some function, but there is no obvious way to predict which specific intron is likely to be functional. We hypothesize that functional introns experience a different selection regime than non-functional ones and will therefore show distinct evolutionary histories. In particular, we expect functional introns to be more resistant to loss, and that this would be reflected in high conservation of their position with respect to the coding sequence. To test this hypothesis, we focused on introns whose function comes about from microRNAs and snoRNAs that are embedded within their sequence. We built a data set of orthologous genes across 28 eukaryotic species, reconstructed the evolutionary histories of their introns and compared functional introns with the rest of the introns. We found that, indeed, the position of microRNA- and snoRNA-bearing introns is significantly more conserved. In addition, we found that both families of RNA genes settled within introns early during metazoan evolution. We identified several easily computable intronic properties that can be used to detect functional introns in general, thereby suggesting a new strategy to pinpoint non-coding cellular functions.
doi:10.1093/nar/gkt244
PMCID: PMC3675471  PMID: 23605046
18.  Metabolic network reconstruction of Chlamydomonas offers insight into light-driven algal metabolism 
A comprehensive genome-scale metabolic network of Chlamydomonas reinhardtii, including a detailed account of light-driven metabolism, is reconstructed and validated. The model provides a new resource for research of C. reinhardtii metabolism and in algal biotechnology.
The genome-scale metabolic network of Chlamydomonas reinhardtii (iRC1080) was reconstructed, accounting for >32% of the estimated metabolic genes encoded in the genome, and including extensive details of lipid metabolic pathways.This is the first metabolic network to explicitly account for stoichiometry and wavelengths of metabolic photon usage, providing a new resource for research of C. reinhardtii metabolism and developments in algal biotechnology.Metabolic functional annotation and the largest transcript verification of a metabolic network to date was performed, at least partially verifying >90% of the transcripts accounted for in iRC1080. Analysis of the network supports hypotheses concerning the evolution of latent lipid pathways in C. reinhardtii, including very long-chain polyunsaturated fatty acid and ceramide synthesis pathways.A novel approach for modeling light-driven metabolism was developed that accounts for both light source intensity and spectral quality of emitted light. The constructs resulting from this approach, termed prism reactions, were shown to significantly improve the accuracy of model predictions, and their use was demonstrated for evaluation of light source efficiency and design.
Algae have garnered significant interest in recent years, especially for their potential application in biofuel production. The hallmark, model eukaryotic microalgae Chlamydomonas reinhardtii has been widely used to study photosynthesis, cell motility and phototaxis, cell wall biogenesis, and other fundamental cellular processes (Harris, 2001). Characterizing algal metabolism is key to engineering production strains and understanding photobiological phenomena. Based on extensive literature on C. reinhardtii metabolism, its genome sequence (Merchant et al, 2007), and gene functional annotation, we have reconstructed and experimentally validated the genome-scale metabolic network for this alga, iRC1080, the first network to account for detailed photon absorption permitting growth simulations under different light sources. iRC1080 accounts for 1080 genes, associated with 2190 reactions and 1068 unique metabolites and encompasses 83 subsystems distributed across 10 cellular compartments (Figure 1A). Its >32% coverage of estimated metabolic genes is a tremendous expansion over previous algal reconstructions (Boyle and Morgan, 2009; Manichaikul et al, 2009). The lipid metabolic pathways of iRC1080 are considerably expanded relative to existing networks, and chemical properties of all metabolites in these pathways are accounted for explicitly, providing sufficient detail to completely specify all individual molecular species: backbone molecule and stereochemical numbering of acyl-chain positions; acyl-chain length; and number, position, and cis–trans stereoisomerism of carbon–carbon double bonds. Such detail in lipid metabolism will be critical for model-driven metabolic engineering efforts.
We experimentally verified transcripts accounted for in the network under permissive growth conditions, detecting >90% of tested transcript models (Figure 1B) and providing validating evidence for the contents of iRC1080. We also analyzed the extent of transcript verification by specific metabolic subsystems. Some subsystems stood out as more poorly verified, including chloroplast and mitochondrial transport systems and sphingolipid metabolism, all of which exhibited <80% of transcripts detected, reflecting incomplete characterization of compartmental transporters and supporting a hypothesis of latent pathway evolution for ceramide synthesis in C. reinhardtii. Additional lines of evidence from the reconstruction effort similarly support this hypothesis including lack of ceramide synthetase and other annotation gaps downstream in sphingolipid metabolism. A similar hypothesis of latent pathway evolution was established for very long-chain fatty acids (VLCFAs) and their polyunsaturated analogs (VLCPUFAs) (Figure 1C), owing to the absence of this class of lipids in previous experimental measurements, lack of a candidate VLCFA elongase in the functional annotation, and additional downstream annotation gaps in arachidonic acid metabolism.
The network provides a detailed account of metabolic photon absorption by light-driven reactions, including photosystems I and II, light-dependent protochlorophyllide oxidoreductase, provitamin D3 photoconversion to vitamin D3, and rhodopsin photoisomerase; this network accounting permits the precise modeling of light-dependent metabolism. iRC1080 accounts for effective light spectral ranges through analysis of biochemical activity spectra (Figure 3A), either reaction activity or absorbance at varying light wavelengths. Defining effective spectral ranges associated with each photon-utilizing reaction enabled our network to model growth under different light sources via stoichiometric representation of the spectral composition of emitted light, termed prism reactions. Coefficients for different photon wavelengths in a prism reaction correspond to the ratios of photon flux in the defined effective spectral ranges to the total emitted photon flux from a given light source (Figure 3B). This approach distinguishes the amount of emitted photons that drive different metabolic reactions. We created prism reactions for most light sources that have been used in published studies for algal and plant growth including solar light, various light bulbs, and LEDs. We also included regulatory effects, resulting from lighting conditions insofar as published studies enabled. Light and dark conditions have been shown to affect metabolic enzyme activity in C. reinhardtii on multiple levels: transcriptional regulation, chloroplast RNA degradation, translational regulation, and thioredoxin-mediated enzyme regulation. Through application of our light model and prism reactions, we were able to closely recapitulate experimental growth measurements under solar, incandescent, and red LED lights. Through unbiased sampling, we were able to establish the tremendous statistical significance of the accuracy of growth predictions achievable through implementation of prism reactions. Finally, application of the photosynthetic model was demonstrated prospectively to evaluate light utilization efficiency under different light sources. The results suggest that, of the existing light sources, red LEDs provide the greatest efficiency, about three times as efficient as sunlight. Extending this analysis, the model was applied to design a maximally efficient LED spectrum for algal growth. The result was a 677-nm peak LED spectrum with a total incident photon flux of 360 μE/m2/s, suggesting that for the simple objective of maximizing growth efficiency, LED technology has already reached an effective theoretical optimum.
In summary, the C. reinhardtii metabolic network iRC1080 that we have reconstructed offers insight into the basic biology of this species and may be employed prospectively for genetic engineering design and light source design relevant to algal biotechnology. iRC1080 was used to analyze lipid metabolism and generate novel hypotheses about the evolution of latent pathways. The predictive capacity of metabolic models developed from iRC1080 was demonstrated in simulating mutant phenotypes and in evaluation of light source efficiency. Our network provides a broad knowledgebase of the biochemistry and genomics underlying global metabolism of a photoautotroph, and our modeling approach for light-driven metabolism exemplifies how integration of largely unvisited data types, such as physicochemical environmental parameters, can expand the diversity of applications of metabolic networks.
Metabolic network reconstruction encompasses existing knowledge about an organism's metabolism and genome annotation, providing a platform for omics data analysis and phenotype prediction. The model alga Chlamydomonas reinhardtii is employed to study diverse biological processes from photosynthesis to phototaxis. Recent heightened interest in this species results from an international movement to develop algal biofuels. Integrating biological and optical data, we reconstructed a genome-scale metabolic network for this alga and devised a novel light-modeling approach that enables quantitative growth prediction for a given light source, resolving wavelength and photon flux. We experimentally verified transcripts accounted for in the network and physiologically validated model function through simulation and generation of new experimental growth data, providing high confidence in network contents and predictive applications. The network offers insight into algal metabolism and potential for genetic engineering and efficient light source design, a pioneering resource for studying light-driven metabolism and quantitative systems biology.
doi:10.1038/msb.2011.52
PMCID: PMC3202792  PMID: 21811229
Chlamydomonas reinhardtii; lipid metabolism; metabolic engineering; photobioreactor
19.  Exon definition as a potential negative force against intron losses in evolution 
Biology Direct  2008;3:46.
Background
Previous studies have indicated that the wide variation in intron density (the number of introns per gene) among different eukaryotes largely reflects varying degrees of intron loss during evolution. The most popular model, which suggests that organisms lose introns through a mechanism in which reverse-transcribed cDNA recombines with the genomic DNA, concerns only one mutational force.
Hypothesis
Using exons as the units of splicing-site recognition, exon definition constrains the length of exons. An intron-loss event results in fusion of flanking exons and thus a larger exon. The large size of the newborn exon may cause splicing errors, i.e., exon skipping, if the splicing of pre-mRNAs is initiated by exon definition. By contrast, if the splicing of pre-mRNAs is initiated by intron definition, intron loss does not matter. Exon definition may thus be a selective force against intron loss. An organism with a high frequency of exon definition is expected to experience a low rate of intron loss throughout evolution and have a high density of spliceosomal introns.
Conclusion
The majority of spliceosomal introns in vertebrates may be maintained during evolution not because of potential functions, but because of their splicing mechanism (i.e., exon definition). Further research is required to determine whether exon definition is a negative force in maintaining the high intron density of vertebrates.
Reviewers
This article was reviewed by Dr. Scott W. Roy (nominated by Dr. John Logsdon), Dr. Eugene V. Koonin, and Dr. Igor B. Rogozin (nominated by Dr. Mikhail Gelfand). For the full reviews, please go to the Reviewers' comments section.
doi:10.1186/1745-6150-3-46
PMCID: PMC2614967  PMID: 19014515
20.  Mechanisms Used for Genomic Proliferation by Thermophilic Group II Introns 
PLoS Biology  2010;8(6):e1000391.
Studies of mobile group II introns from a thermophilic cyanobacterium reveal how these introns proliferate within genomes and might explain the origin of introns and retroelements in higher organisms.
Mobile group II introns, which are found in bacterial and organellar genomes, are site-specific retroelments hypothesized to be evolutionary ancestors of spliceosomal introns and retrotransposons in higher organisms. Most bacteria, however, contain no more than one or a few group II introns, making it unclear how introns could have proliferated to higher copy numbers in eukaryotic genomes. An exception is the thermophilic cyanobacterium Thermosynechococcus elongatus, which contains 28 closely related copies of a group II intron, constituting ∼1.3% of the genome. Here, by using a combination of bioinformatics and mobility assays at different temperatures, we identified mechanisms that contribute to the proliferation of T. elongatus group II introns. These mechanisms include divergence of DNA target specificity to avoid target site saturation; adaptation of some intron-encoded reverse transcriptases to splice and mobilize multiple degenerate introns that do not encode reverse transcriptases, leading to a common splicing apparatus; and preferential insertion within other mobile introns or insertion elements, which provide new unoccupied sites in expanding non-essential DNA regions. Additionally, unlike mesophilic group II introns, the thermophilic T. elongatus introns rely on elevated temperatures to help promote DNA strand separation, enabling access to a larger number of DNA target sites by base pairing of the intron RNA, with minimal constraint from the reverse transcriptase. Our results provide insight into group II intron proliferation mechanisms and show that higher temperatures, which are thought to have prevailed on Earth during the emergence of eukaryotes, favor intron proliferation by increasing the accessibility of DNA target sites. We also identify actively mobile thermophilic introns, which may be useful for structural studies, gene targeting in thermophiles, and as a source of thermostable reverse transcriptases.
Author Summary
Group II introns are bacterial mobile elements thought to be ancestors of introns and retroelements in higher organisms. They comprise a catalytically active intron RNA and an intron-encoded reverse transcriptase, which promotes splicing of the intron from precursor RNA and integration of the excised intron into new genomic sites. While most bacteria have small numbers of group II introns, in the thermophilic cyanobacterium Thermosynechococcus elongatus, a single intron has proliferated and constitutes 1.3% of the genome. Here, we investigated how the T. elongatus introns proliferated to such high copy numbers. We found divergence of DNA target specificity, evolution of reverse transcriptases that splice and mobilize multiple degenerate introns, and preferential insertion into other mobile introns or insertion elements, which provide new integration sites in non-essential regions of the genome. Further, unlike mesophilic group II introns, the thermophilic T. elongatus introns rely on higher temperatures to help promote DNA strand separation, facilitating access to DNA target sites. We speculate how these mechanisms, including elevated temperature, might have contributed to intron proliferation in early eukaryotes. We also identify actively mobile thermophilic introns, which may be useful for structural studies and biotechnological applications.
doi:10.1371/journal.pbio.1000391
PMCID: PMC2882425  PMID: 20543989
21.  Evolutionary dynamics of U12-type spliceosomal introns 
Background
Many multicellular eukaryotes have two types of spliceosomes for the removal of introns from messenger RNA precursors. The major (U2) spliceosome processes the vast majority of introns, referred to as U2-type introns, while the minor (U12) spliceosome removes a small fraction (less than 0.5%) of introns, referred to as U12-type introns. U12-type introns have distinct sequence elements and usually occur together in genes with U2-type introns. A phylogenetic distribution of U12-type introns shows that the minor splicing pathway appeared very early in eukaryotic evolution and has been lost repeatedly.
Results
We have investigated the evolution of U12-type introns among eighteen metazoan genomes by analyzing orthologous U12-type intron clusters. Examination of gain, loss, and type switching shows that intron type is remarkably conserved among vertebrates. Among 180 intron clusters, only eight show intron loss in any vertebrate species and only five show conversion between the U12 and the U2-type. Although there are only nineteen U12-type introns in Drosophila melanogaster, we found one case of U2 to U12-type conversion, apparently mediated by the activation of cryptic U12 splice sites early in the dipteran lineage. Overall, loss of U12-type introns is more common than conversion to U2-type and the U12 to U2 conversion occurs more frequently among introns of the GT-AG subtype than among introns of the AT-AC subtype. We also found support for natural U12-type introns with non-canonical terminal dinucleotides (CT-AC, GG-AG, and GA-AG) that have not been previously reported.
Conclusions
Although complete loss of the U12-type spliceosome has occurred repeatedly, U12 introns are extremely stable in some taxa, including eutheria. Loss of U12 introns or the genes containing them is more common than conversion to the U2-type. The degeneracy of U12-type terminal dinucleotides among natural U12-type introns is higher than previously thought.
doi:10.1186/1471-2148-10-47
PMCID: PMC2831892  PMID: 20163699
22.  Comparative analysis of information contents relevant to recognition of introns in many species 
BMC Genomics  2011;12:45.
Background
The basic process of RNA splicing is conserved among eukaryotic species. Three signals (5' and 3' splice sites and branch site) are commonly used to directly conduct splicing, while other features are also related to the recognition of an intron. Although there is experimental evidence pointing to the significant species specificities in the features of intron recognition, a quantitative evaluation of the divergence of these features among a wide variety of eukaryotes has yet to be conducted.
Results
To better understand the splicing process from the viewpoints of evolution and information theory, we collected introns from 61 diverse species of eukaryotes and analyzed the properties of the nucleotide sequences relevant to splicing. We found that trees individually constructed from the five features (the three signals, intron length, and nucleotide composition within an intron) roughly reflect the phylogenetic relationships among the species but sometimes extensively deviate from the species classification. The degree of topological deviation of each feature tree from the reference trees indicates the lowest discordance for the 5' splicing signal, followed by that for the 3' splicing signal, and a considerably greater discordance for the other three features. We also estimated the relative contributions of the five features to short intron recognition in each species. Again, moderate correlation was observed between the similarities in pattern of short intron recognition and the genealogical relationships among the species. When mammalian introns were categorized into three subtypes according to their terminal dinucleotide sequences, each subtype segregated into a nearly monophyletic group, regardless of the host species, with respect to the 5' and 3' splicing signals. It was also found that GC-AG introns are extraordinarily abundant in some species with high genomic G + C contents, and that the U12-type spliceosome might make a greater contribution than currently estimated in most species.
Conclusions
Overall, the present study indicates that both splicing signals themselves and their relative contributions to short intron recognition are rather susceptible to evolutionary changes, while some poorly characterized properties seem to be preserved within the mammalian intron subtypes. Our findings may afford additional clues to understanding of evolution of splicing mechanisms.
doi:10.1186/1471-2164-12-45
PMCID: PMC3033335  PMID: 21247441
23.  Explaining evolution via constrained persistent perfect phylogeny 
BMC Genomics  2014;15(Suppl 6):S10.
Background
The perfect phylogeny is an often used model in phylogenetics since it provides an efficient basic procedure for representing the evolution of genomic binary characters in several frameworks, such as for example in haplotype inference. The model, which is conceptually the simplest, is based on the infinite sites assumption, that is no character can mutate more than once in the whole tree. A main open problem regarding the model is finding generalizations that retain the computational tractability of the original model but are more flexible in modeling biological data when the infinite site assumption is violated because of e.g. back mutations. A special case of back mutations that has been considered in the study of the evolution of protein domains (where a domain is acquired and then lost) is persistency, that is the fact that a character is allowed to return back to the ancestral state. In this model characters can be gained and lost at most once. In this paper we consider the computational problem of explaining binary data by the Persistent Perfect Phylogeny model (referred as PPP) and for this purpose we investigate the problem of reconstructing an evolution where some constraints are imposed on the paths of the tree.
Results
We define a natural generalization of the PPP problem obtained by requiring that for some pairs (character, species), neither the species nor any of its ancestors can have the character. In other words, some characters cannot be persistent for some species. This new problem is called Constrained PPP (CPPP). Based on a graph formulation of the CPPP problem, we are able to provide a polynomial time solution for the CPPP problem for matrices whose conflict graph has no edges. Using this result, we develop a parameterized algorithm for solving the CPPP problem where the parameter is the number of characters.
Conclusions
A preliminary experimental analysis shows that the constrained persistent perfect phylogeny model allows to explain efficiently data that do not conform with the classical perfect phylogeny model.
doi:10.1186/1471-2164-15-S6-S10
PMCID: PMC4240218  PMID: 25572381
perfect phylogeny; persistent perfect phylogeny; fixed-parameter complexity
24.  Using Likelihood-Free Inference to Compare Evolutionary Dynamics of the Protein Networks of H. pylori and P. falciparum 
PLoS Computational Biology  2007;3(11):e230.
Gene duplication with subsequent interaction divergence is one of the primary driving forces in the evolution of genetic systems. Yet little is known about the precise mechanisms and the role of duplication divergence in the evolution of protein networks from the prokaryote and eukaryote domains. We developed a novel, model-based approach for Bayesian inference on biological network data that centres on approximate Bayesian computation, or likelihood-free inference. Instead of computing the intractable likelihood of the protein network topology, our method summarizes key features of the network and, based on these, uses a MCMC algorithm to approximate the posterior distribution of the model parameters. This allowed us to reliably fit a flexible mixture model that captures hallmarks of evolution by gene duplication and subfunctionalization to protein interaction network data of Helicobacter pylori and Plasmodium falciparum. The 80% credible intervals for the duplication–divergence component are [0.64, 0.98] for H. pylori and [0.87, 0.99] for P. falciparum. The remaining parameter estimates are not inconsistent with sequence data. An extensive sensitivity analysis showed that incompleteness of PIN data does not largely affect the analysis of models of protein network evolution, and that the degree sequence alone barely captures the evolutionary footprints of protein networks relative to other statistics. Our likelihood-free inference approach enables a fully Bayesian analysis of a complex and highly stochastic system that is otherwise intractable at present. Modelling the evolutionary history of PIN data, it transpires that only the simultaneous analysis of several global aspects of protein networks enables credible and consistent inference to be made from available datasets. Our results indicate that gene duplication has played a larger part in the network evolution of the eukaryote than in the prokaryote, and suggests that single gene duplications with immediate divergence alone may explain more than 60% of biological network data in both domains.
Author Summary
The importance of gene duplication to biological evolution has been recognized since the 1930s. For more than a decade, substantial evidence has been collected from genomic sequence data in order to elucidate the importance and the mechanisms of gene duplication; however, most biological characteristics arise from complex interactions between the cell's numerous constituents. Recently, preliminary descriptions of the protein interaction networks have become available for species of different domains. Adapting novel techniques in stochastic simulation, the authors demonstrate that evolutionary inferences can be drawn from large-scale, incomplete network data by fitting a stochastic model of network growth that captures hallmarks of evolution by duplication and divergence. They have also analyzed the effect of summarizing protein networks in different ways, and show that a reliable and consistent analysis requires many aspects of network data to be considered jointly; in contrast to what is commonly done in practice. Their results indicate that duplication and divergence has played a larger role in the network evolution of the eukaryote P. falciparum than in the prokaryote H. pylori, and emphasize at least for the eukaryote the potential importance of subfunctionalization in network evolution.
doi:10.1371/journal.pcbi.0030230
PMCID: PMC2098858  PMID: 18052538
25.  Rapid Pathway Evolution Facilitated by Horizontal Gene Transfers across Prokaryotic Lineages 
PLoS Genetics  2009;5(3):e1000402.
The evolutionary history of biological pathways is of general interest, especially in this post-genomic era, because it may provide clues for understanding how complex systems encoded on genomes have been organized. To explain how pathways can evolve de novo, some noteworthy models have been proposed. However, direct reconstruction of pathway evolutionary history both on a genomic scale and at the depth of the tree of life has suffered from artificial effects in estimating the gene content of ancestral species. Recently, we developed an algorithm that effectively reconstructs gene-content evolution without these artificial effects, and we applied it to this problem. The carefully reconstructed history, which was based on the metabolic pathways of 160 prokaryotic species, confirmed that pathways have grown beyond the random acquisition of individual genes. Pathway acquisition took place quickly, probably eliminating the difficulty in holding genes during the course of the pathway evolution. This rapid evolution was due to massive horizontal gene transfers as gene groups, some of which were possibly operon transfers, which would convey existing pathways but not be able to generate novel pathways. To this end, we analyzed how these pathways originally appeared and found that the original acquisition of pathways occurred more contemporaneously than expected across different phylogenetic clades. As a possible model to explain this observation, we propose that novel pathway evolution may be facilitated by bidirectional horizontal gene transfers in prokaryotic communities. Such a model would complement existing pathway evolution models.
Author Summary
Many biological functions, from energy metabolism to antibiotic resistance, are carried out by biological pathways that require a number of cooperatively functioning genes. Hence, underlying mechanisms in the evolution of biological pathways are of particular interest. However, compared to the evolution of individual genes, which has been well studied, the evolution of biological pathways is far less understood. In this study, we used the abundant genome sequences available today and a novel algorithm we recently developed to trace the evolutionary history of prokaryotic metabolic pathways and to analyze how these pathways emerged. We found that the pathways have experienced significantly rapid acquisition, which would play a key role in eliminating the difficulty in holding genes during the course of pathway evolution. In addition, the emergence of novel pathways was suggested to have occurred more contemporaneously than expected across different phylogenetic clades. Based on these observations, we propose that novel pathway evolution can be facilitated by bidirectional horizontal gene transfers in prokaryotic communities. This simple model may approach the question of how biological pathways requiring a number of cooperatively functioning genes can be obtained and are the core event within the evolution of biological pathways in prokaryotes.
doi:10.1371/journal.pgen.1000402
PMCID: PMC2644373  PMID: 19266023

Results 1-25 (1022161)