Bacterial evolution is characterized by frequent gain and loss events of gene families. These events can be inferred from phyletic pattern data—a compact representation of gene family repertoire across multiple genomes. The maximum parsimony paradigm is a classical and prevalent approach for the detection of gene family gains and losses mapped on specific branches. We and others have previously developed probabilistic models that aim to account for the gain and loss stochastic dynamics. These models are a critical component of a methodology termed stochastic mapping, in which probabilities and expectations of gain and loss events are estimated for each branch of an underlying phylogenetic tree. In this work, we present a phyletic pattern simulator in which the gain and loss dynamics are assumed to follow a continuous-time Markov chain along the tree. Various models and options are implemented to make the simulation software useful for a large number of studies in which binary (presence/absence) data are analyzed. Using this simulation software, we compared the ability of the maximum parsimony and the stochastic mapping approaches to accurately detect gain and loss events along the tree. Our simulations cover a large array of evolutionary scenarios in terms of the propensities for gene family gains and losses and the variability of these propensities among gene families. Although in all simulation schemes, both methods obtain relatively low levels of false positive rates, stochastic mapping outperforms maximum parsimony in terms of true positive rates. We further studied the factors that influence the performance of both methods. We find, for example, that the accuracy of maximum parsimony inference is substantially reduced when the goal is to map gain and loss events along internal branches of the phylogenetic tree. 
Furthermore, the accuracy of stochastic mapping is reduced with smaller data sets (limited number of gene families) due to unreliable estimation of branch lengths. Our simulator and simulation results are additionally relevant for the analysis of other types of binary-coded data, such as the existence of homologous restriction sites, gaps, and introns, to name a few. Both the simulation software and the inference methodology are freely available at a user-friendly server: http://gloome.tau.ac.il/.
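As a rough illustration of what simulating a phyletic pattern under a continuous-time Markov chain can look like, the following minimal Python sketch evolves one gene family down a tree with a single gain rate and a single loss rate. It is not the simulator described above: the nested-tuple tree format, the function names, the stationary root distribution, and the use of one rate pair for all branches are assumptions made for the example.

```python
import math
import random

def transition_prob(state, t, gain, loss):
    """P(child = present | parent state, branch length t) for a two-state chain."""
    total = gain + loss
    pi1 = gain / total                    # stationary probability of presence
    decay = math.exp(-total * t)
    return pi1 + (state - pi1) * decay    # state is 0 (absent) or 1 (present)

def simulate_pattern(tree, gain, loss, rng, state=None):
    """Simulate one gene family down a tree (hypothetical format).

    A leaf is (name, branch_length); an internal node is
    ([child, ...], branch_length). Returns {leaf_name: 0 or 1}.
    """
    node, t = tree
    if state is None:
        # Assumption: the root state is drawn from the stationary distribution.
        state = 1 if rng.random() < gain / (gain + loss) else 0
    else:
        state = 1 if rng.random() < transition_prob(state, t, gain, loss) else 0
    if isinstance(node, str):
        return {node: state}
    pattern = {}
    for child in node:
        pattern.update(simulate_pattern(child, gain, loss, rng, state))
    return pattern

# Toy tree with three genomes; branch lengths are made up.
toy_tree = ([("A", 0.1), ([("B", 0.2), ("C", 0.2)], 0.1)], 0.0)
print(simulate_pattern(toy_tree, gain=0.5, loss=1.0, rng=random.Random(1)))
```

Repeating the call once per gene family gives a presence/absence matrix. Rate variability among gene families could be mimicked by drawing `gain` and `loss` from a distribution for each family, which this sketch does not do.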
phyletic pattern; stochastic mapping; maximum parsimony; evolutionary models
The problem of probabilistic inference of gene content in the last common ancestor of several extant species with completely sequenced genomes is: for each gene that is conserved in all or some of the genomes, assign the probability that its ancestral gene was present in the genome of their last common ancestor.
We have developed a family of models of gene gain and gene loss in evolution, and applied the maximum-likelihood approach that uses a phylogenetic tree of prokaryotes and the record of orthologous relationships between their genes to infer the gene content of LUCA, the Last Universal Common Ancestor of all currently living cellular organisms. The crucial parameter, the ratio of gene losses to gene gains, was estimated from the data and was higher in models that take account of the number of in-paralogs in genomes than in models that treat gene presences and absences as a binary trait.
While the numbers of genes that are placed confidently into LUCA are similar in the ML methods and in previously published methods that use various parsimony-based approaches, the identities of genes themselves are different. Most of the models of either kind treat the genes found in many existing genomes in a similar way, assigning to them high probabilities of being ancestral (“high ancestrality”). The ML models are more likely than others to assign high ancestrality to the genes that are relatively rare in the present-day genomes.
This article was reviewed by Martijn A Huynen, Toni Gabaldón and Fyodor Kondrashov.
Comparative analysis of sequenced genomes reveals numerous instances of apparent horizontal gene transfer (HGT), at least in prokaryotes, and indicates that lineage-specific gene loss might have been even more common in evolution. This complicates the notion of a species tree, which needs to be re-interpreted as a prevailing evolutionary trend, rather than the full depiction of evolution, and makes reconstruction of ancestral genomes a non-trivial task.
We addressed the problem of constructing parsimonious scenarios for individual sets of orthologous genes given a species tree. The orthologous sets were taken from the database of Clusters of Orthologous Groups of proteins (COGs). We show that the phyletic patterns (patterns of presence-absence in completely sequenced genomes) of almost 90% of the COGs are inconsistent with the hypothetical species tree. Algorithms were developed to reconcile the phyletic patterns with the species tree by postulating gene loss, COG emergence and HGT (the latter two classes of events were collectively treated as gene gains). We prove that each of these algorithms produces a parsimonious evolutionary scenario, which can be represented as mapping of loss and gain events on the species tree. The distribution of the evolutionary events among the tree nodes substantially depends on the underlying assumptions of the reconciliation algorithm, e.g. whether or not independent gene gains (gain after loss after gain) are permitted. Biological considerations suggest that, on average, gene loss might be a more likely event than gene gain. Therefore different gain penalties were used and the resulting series of reconstructed gene sets for the last universal common ancestor (LUCA) of the extant life forms were analysed. The number of genes in the reconstructed LUCA gene sets grows as the gain penalty increases. However, qualitative examination of the LUCA versions reconstructed with different gain penalties indicates that, even with a gain penalty of 1 (equal weights assigned to a gain and a loss), the set of 572 genes assigned to LUCA might be nearly sufficient to sustain a functioning organism. Under this gain penalty value, the numbers of horizontal gene transfer and gene loss events are nearly identical. This result holds true for two alternative topologies of the species tree and even under random shuffling of the tree. 
Therefore, the results seem to be compatible with approximately equal likelihoods of HGT and gene loss in the evolution of prokaryotes.
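The effect of a gain penalty on a parsimonious scenario can be sketched with a small dynamic program over the species tree. The Python example below is a generic weighted-parsimony recursion for one presence/absence pattern, not the reconciliation algorithms of the study. The nested-tuple tree, the function name, and the decision to leave the root unscored (no extra cost for emergence at the root) are assumptions made for illustration.

```python
def parsimony_cost(tree, pattern, gain_penalty=1.0, loss_penalty=1.0):
    """Return (cost if this node lacks the gene, cost if it has the gene).

    tree: a leaf name (str) or a tuple of subtrees (hypothetical format).
    pattern: {leaf_name: 0 or 1}, the phyletic pattern of one gene family.
    """
    inf = float("inf")
    if isinstance(tree, str):
        return (inf, 0.0) if pattern[tree] else (0.0, inf)
    absent, present = 0.0, 0.0
    for child in tree:
        c0, c1 = parsimony_cost(child, pattern, gain_penalty, loss_penalty)
        absent += min(c0, c1 + gain_penalty)    # gene gained on the child branch
        present += min(c1, c0 + loss_penalty)   # gene lost on the child branch
    return absent, present

tree = (("A", "B"), ("C", "D"))
pattern = {"A": 1, "B": 1, "C": 0, "D": 0}
print(parsimony_cost(tree, pattern, gain_penalty=1.0))  # (1.0, 1.0)
print(parsimony_cost(tree, pattern, gain_penalty=2.0))  # (2.0, 1.0)
```

With equal penalties, presence and absence at the root cost the same for this toy pattern. Raising the gain penalty makes presence at the root the cheaper explanation, which mirrors the observation that the reconstructed ancestral gene set grows as the gain penalty increases.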
The notion that gene loss and HGT are major aspects of prokaryotic evolution was supported by quantitative analysis of the mapping of the phyletic patterns of COGs onto a hypothetical species tree. Algorithms were developed for constructing parsimonious evolutionary scenarios, which include gene loss and gain events, for orthologous gene sets, given a species tree. This analysis shows, contrary to expectations, that the number of predicted HGT events that occurred during the evolution of prokaryotes might be approximately the same as the number of gene losses. The approach to the reconstruction of evolutionary scenarios employed here is conservative with regard to the detection of HGT because only patterns of gene presence-absence in sequenced genomes are taken into account. In reality, horizontal transfer might have contributed to the evolution of many other genes also, which makes it a dominant force in prokaryotic evolution.
Protein-coding genes in eukaryotes are interrupted by introns, but intron densities widely differ between eukaryotic lineages. Vertebrates, some invertebrates and green plants have intron-rich genes, with 6–7 introns per kilobase of coding sequence, whereas most of the other eukaryotes have intron-poor genes. We reconstructed the history of intron gain and loss using a probabilistic Markov model (Markov Chain Monte Carlo, MCMC) on 245 orthologous genes from 99 genomes representing three of the five supergroups of eukaryotes for which multiple genome sequences are available. Intron-rich ancestors are confidently reconstructed for each major group, with 53 to 74% of the human intron density inferred with 95% confidence for the Last Eukaryotic Common Ancestor (LECA). The results of the MCMC reconstruction are compared with the reconstructions obtained using Maximum Likelihood (ML) and Dollo parsimony methods. An excellent agreement between the MCMC and ML inferences is demonstrated whereas Dollo parsimony introduces a noticeable bias in the estimations, typically yielding lower ancestral intron densities than MCMC and ML. Evolution of eukaryotic genes was dominated by intron loss, with substantial gain only at the bases of several major branches including plants and animals. The highest intron density, 120 to 130% of the human value, is inferred for the last common ancestor of animals. The reconstruction shows that the entire line of descent from LECA to mammals was intron-rich, a state conducive to the evolution of alternative splicing.
In eukaryotes, protein-coding genes are interrupted by non-coding introns. The intron densities widely differ, from 6–7 introns per kilobase of coding sequence in vertebrates, some invertebrates and plants, to only a few introns across the entire genome in many unicellular forms. We applied a robust statistical methodology, Markov Chain Monte Carlo, to reconstruct the history of intron gain and loss throughout the evolution of eukaryotes using a set of 245 homologous genes from 99 genomes that represent the diversity of eukaryotes. Intron-rich ancestors were confidently inferred for each major eukaryotic group including 53% to 74% of the human intron density for the last eukaryotic common ancestor, and 120% to 130% of the human value for the last common ancestor of animals. Evolution of eukaryotic genes involved primarily intron loss, with substantial gain only at the bases of several major branches including plants and animals. Thus, the common ancestor of all extant eukaryotes was a complex organism with a gene architecture resembling those in multicellular organisms. The line of descent from the last common ancestor to mammals was an uninterrupted intron-rich state that, given the error-prone splicing in intron-rich organisms, was conducive to the elaboration of functional alternative splicing.
Character mapping on phylogenies has played an important, if not critical, role in our understanding of molecular, morphological, and behavioral evolution. Until very recently we have relied on parsimony to infer character changes. Parsimony has a number of serious limitations that hinder our understanding. Recent statistical methods have been developed that free us from these limitations, enabling us to overcome the problems of parsimony by accommodating uncertainty in evolutionary time, ancestral states, and the phylogeny.
SIMMAP has been developed to implement stochastic character mapping in a form that is useful to molecular evolutionists, systematists, and bioinformaticians. Researchers can address questions about positive selection, patterns of amino acid substitution, character association, and patterns of morphological evolution.
Stochastic character mapping, as implemented in the SIMMAP software, enables users to address questions that require mapping characters onto phylogenies using a probabilistic approach that does not rely on parsimony. Analyses can be performed using a fully Bayesian approach that is not reliant on considering a single topology, set of substitution model parameters, or reconstruction of ancestral states. Uncertainty in these quantities is accommodated by using MCMC samples from their respective posterior distributions.
Why some species become successful invaders is an important issue in invasion biology. However, limited genomic resources make it very difficult to identify candidate genes involved in invasiveness. Mikania micrantha H.B.K. (Asteraceae), one of the world's most invasive weeds, has adapted rapidly in response to novel environments since its introduction to southern China. In its genome, we expect to find outlier loci under selection for local adaptation, critical to dissecting the molecular mechanisms of invasiveness. An explorative amplified fragment length polymorphism (AFLP) genome scan was used to detect candidate loci under selection in 28 M. micrantha populations across its entire introduced range in southern China. We also estimated population genetic parameters, bottleneck signatures, and linkage disequilibrium. For a pair of binary characters, such as presence or absence of AFLP bands, the occurrence of all four character combinations is referred to as a character incompatibility. Since character incompatibility is deemed to be rare in populations with extensive asexual reproduction, a character incompatibility analysis was also performed in order to infer the predominant mating system in the introduced M. micrantha populations. Out of 483 AFLP loci examined using stringent significance criteria, 14 highly credible outlier loci were identified by Dfdist and Bayescan. Moreover, remarkable genetic variation, multiple introductions, substantial bottlenecks and character compatibility were found to occur in M. micrantha. Thus local adaptation at the genome level indeed exists in M. micrantha, and may represent a major evolutionary mechanism of successful invasion. Interactions between genetic diversity, multiple introductions, and reproductive modes contribute to increasing the capacity of adaptive evolution.
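The character incompatibility criterion described above can be checked directly on a presence/absence matrix. The Python sketch below counts incompatible pairs of loci. The function names and the input layout (one row per individual, one 0/1 score per band, no missing data) are assumptions for the example, and it does not reproduce the analysis software used in the study.

```python
from itertools import combinations

def incompatible(col_a, col_b):
    """True if all four combinations 00, 01, 10 and 11 occur across individuals."""
    return len(set(zip(col_a, col_b))) == 4

def count_incompatibilities(matrix):
    """matrix: list of individuals, each a list of 0/1 band scores."""
    columns = list(zip(*matrix))          # one tuple of scores per locus
    return sum(incompatible(a, b) for a, b in combinations(columns, 2))

bands = [
    [0, 0, 1],
    [0, 1, 1],
    [1, 0, 0],
    [1, 1, 0],
]
print(count_incompatibilities(bands))  # 2
```

In this toy matrix, loci 1 and 2 and loci 2 and 3 each show all four combinations, while loci 1 and 3 show only two, so two of the three pairs are incompatible.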
Evolutionary analysis of phyletic patterns (phylogenetic profiles) is widely used in biology, representing presence or absence of characters such as genes, restriction sites, introns, indels and methylation sites. The phyletic pattern observed in extant genomes is the result of ancestral gain and loss events along the phylogenetic tree. Here we present CoPAP (coevolution of presence–absence patterns), a user-friendly web server, which performs accurate inference of coevolving characters as manifested by co-occurring gains and losses. CoPAP uses state-of-the-art probabilistic methodologies to infer coevolution and allows for advanced network analysis and visualization. We developed a platform for comparing different algorithms that detect coevolution, which includes simulated data with pairs of coevolving sites and independent sites. Using these simulated data we demonstrate that CoPAP performance is higher than alternative methods. We exemplify CoPAP utility by analyzing coevolution among thousands of bacterial genes across 681 genomes. Clusters of coevolving genes that were detected using our method largely coincide with known biosynthesis pathways and cellular modules, thus exhibiting the capability of CoPAP to infer biologically meaningful interactions. CoPAP is freely available for use at http://copap.tau.ac.il/.
As one of the most widely used parsimony methods for ancestral reconstruction, the Fitch method minimizes the total number of hypothetical substitutions along all branches of a tree to explain the evolution of a character. Due to the extensive usage of this method, it has become a scientific endeavor in recent years to study the reconstruction accuracies of the Fitch method. However, most studies are restricted to 2-state evolutionary models and a study for higher-state models is needed since DNA sequences take the format of 4-state series and protein sequences even have 20 states.
In this paper, the ambiguous and unambiguous reconstruction accuracies of the Fitch method are studied for N-state evolutionary models. Given an arbitrary phylogenetic tree, a recurrence system is first presented to calculate iteratively the two accuracies. As the complete binary tree and the comb-shaped tree are the two extremal evolutionary tree topologies according to balance, we focus on the reconstruction accuracies on these two topologies and analyze their asymptotic properties. Then, 1000 Yule trees with 1024 leaves are generated and analyzed to simulate real evolutionary scenarios. It is known that more taxa do not necessarily increase the reconstruction accuracies under 2-state models. The result under N-state models is also tested.
In a large tree with many leaves, the reconstruction accuracies of using all taxa are sometimes less than those of using a leaf subset under N-state models. For complete binary trees, there always exists an equilibrium interval [a, b] of conservation probability, in which the limiting ambiguous reconstruction accuracy equals the probability of randomly picking a state. The value b decreases with the increase of the number of states, and it seems to converge. When the conservation probability is greater than b, the reconstruction accuracies of the Fitch method increase rapidly. The reconstruction accuracies on 1000 simulated Yule trees also exhibit similar behaviors. For comb-shaped trees, the limiting reconstruction accuracies of using all taxa are always less than or equal to those of using the nearest root-to-leaf path when the conservation probability is not less than 1/N. As a result, more taxa are suggested for ancestral reconstruction when the tree topology is balanced and the sequences are highly similar, and a few taxa close to the root are recommended otherwise.
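For readers unfamiliar with the Fitch method, its bottom-up pass can be written in a few lines. The Python sketch below handles a rooted binary tree and any number of states. The nested-tuple tree format and the function name are assumptions for the example, and the recurrence system for reconstruction accuracies described above is not implemented here.

```python
def fitch(tree, states):
    """Return (state set at this node, minimum number of changes below it).

    tree: a leaf name (str) or a pair (left_subtree, right_subtree).
    states: {leaf_name: observed state}.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    set_l, n_l = fitch(left, states)
    set_r, n_r = fitch(right, states)
    common = set_l & set_r
    if common:
        return common, n_l + n_r           # children agree: no extra change
    return set_l | set_r, n_l + n_r + 1    # children disagree: one more change

tree = (("a", "b"), ("c", "d"))
print(fitch(tree, {"a": "A", "b": "A", "c": "C", "d": "A"}))  # ({'A'}, 1)
```

The root set here contains a single state, so the reconstruction is unambiguous. When the root set holds several states, an ambiguous reconstruction picks one of them, commonly uniformly at random.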
Large-scale sequencing of genomes has enabled the inference of phylogenies based on the evolution of genomic architecture, under such events as rearrangements, duplications, and losses. Many evolutionary models and associated algorithms have been designed over the last few years and have found use in comparative genomics and phylogenetic inference. However, the assessment of phylogenies built from such data has not been properly addressed to date. The standard method used in sequence-based phylogenetic inference is the bootstrap, but it relies on a large number of homologous characters that can be resampled; yet in the case of rearrangements, the entire genome is a single character. Alternatives such as the jackknife suffer from the same problem, while likelihood tests cannot be applied in the absence of well established probabilistic models.
We present a new approach to the assessment of distance-based phylogenetic inference from whole-genome data; our approach combines features of the jackknife and the bootstrap and remains nonparametric. For each feature of our method, we give an equivalent feature in the sequence-based framework; we also present the results of extensive experimental testing, in both sequence-based and genome-based frameworks. Through the feature-by-feature comparison and the experimental results, we show that our bootstrapping approach is on par with the classic phylogenetic bootstrap used in sequence-based reconstruction, and we establish the clear superiority of the classic bootstrap for sequence data and of our corresponding new approach for rearrangement data over proposed variants. Finally, we test our approach on a small dataset of mammalian genomes, verifying that the support values match current thinking about the respective branches.
Our method is the first to provide a standard of assessment to match that of the classic phylogenetic bootstrap for aligned sequences. Its support values follow a similar scale and its receiver-operating characteristics are nearly identical, indicating that it provides similar levels of sensitivity and specificity. Thus our assessment method makes it possible to conduct phylogenetic analyses on whole genomes with the same degree of confidence as for analyses on aligned sequences. Extensions to search-based inference methods such as maximum parsimony and maximum likelihood are possible, but remain to be thoroughly tested.
Bootstrap; Jackknife; Phylogenetic reconstruction; Rearrangement; Gene order; Comparative genomics
The presence of introns in protein-coding genes is a universal feature of eukaryotic genome organization, and the genes of multicellular eukaryotes, typically, contain multiple introns, a substantial fraction of which share position in distant taxa, such as plants and animals. Depending on the methods and data sets used, researchers have reached opposite conclusions on the causes of the high fraction of shared introns in orthologous genes from distant eukaryotes. Some studies conclude that shared intron positions reflect, almost entirely, a remarkable evolutionary conservation, whereas others attribute it to parallel gain of introns. To resolve these contradictions, it is crucial to analyze the evolution of introns by using a model that minimally relies on arbitrary assumptions.
We developed a probabilistic model of evolution that allows for variability of intron gain and loss rates over branches of the phylogenetic tree, individual genes, and individual sites. Applying this model to an extended set of conserved eukaryotic genes, we find that parallel gain, on average, accounts for only ~8% of the shared intron positions. However, the distribution of parallel gains over the phylogenetic tree of eukaryotes is highly non-uniform. There are practically no parallel gains in closely related lineages, whereas for distant lineages, such as animals and plants, parallel gains appear to contribute up to 20% of the shared intron positions. In accord with these findings, we estimated that ancestral introns have a high probability of being retained in extant genomes, and conversely, that a substantial fraction of extant introns have retained their positions since the early stages of eukaryotic evolution. In addition, the density of sites that are available for intron insertion is estimated to be approximately one in seven basepairs.
We obtained robust estimates of the contribution of parallel gain to the observed sharing of intron positions between eukaryotic species separated by different evolutionary distances. The results indicate that, although the contribution of parallel gains varies across the phylogenetic tree, the high level of intron position sharing is due, primarily, to evolutionary conservation. Accordingly, numerous introns appear to persist in the same position over hundreds of millions of years of evolution. This is compatible with recent observations of a negative correlation between the rate of intron gain and coding sequence evolution rate of a gene, suggesting that at least some of the introns are functionally relevant.
As predicted by theory, traits associated with reproduction often evolve at a comparatively high speed. This is especially the case for courtship behaviour which plays a central role in reproductive isolation. On the other hand, courtship behavioural traits often involve morphological and behavioural adaptations in both sexes; this suggests that their evolution might be under severe constraints, for instance irreversibility of character loss. Here, we use a recently proposed method to retrieve data on a peculiar courtship behavioural trait, i.e. antennal coiling, for 56 species of diplazontine parasitoid wasps. On the basis of a well-resolved phylogeny, we reconstruct the evolutionary history of antennal coiling and associated morphological modifications to study the mode of evolution of this complex character system.
Our study reveals a large variation in shape, location and ultra-structure of male-specific modifications on the antennae. As for antennal coiling, we find either single-coiling, double-coiling or the absence of coiling; each state is present in multiple genera. Using a model comparison approach, we show that the possession of antennal modifications is highly correlated with antennal coiling behaviour. Ancestral state reconstruction shows that both antennal modifications and antennal coiling are highly congruent with the molecular phylogeny, implying low levels of homoplasy and a comparatively low speed of evolution. Antennal coiling is lost on two independent occasions, and never reacquired. A zero rate of regaining antennal coiling is supported by maximum parsimony, maximum likelihood and Bayesian approaches.
Our study provides the first comparative evidence for a tight correlation between male-specific antennal modifications and the use of the antennae during courtship. Antennal coiling in Diplazontinae evolved at a comparatively low rate, and was never reacquired in any of the studied taxa. This suggests that the loss of antennal coiling is irreversible on the timescale examined here, and therefore that evolutionary constraints have greatly influenced the evolution of antennal courtship in this group of parasitoid wasps. Further studies are needed to ascertain whether the loss of antennal coiling is irreversible on larger timescales, and whether evolutionary constraints have influenced courtship behavioural traits in a similar way in other groups.
Evolution of exon-intron structure of eukaryotic genes has been a matter of long-standing, intensive debate. The introns-early concept, later rebranded ‘introns first’, held that protein-coding genes were interrupted by numerous introns even at the earliest stages of life's evolution and that introns played a major role in the origin of proteins by facilitating recombination of sequences coding for small protein/peptide modules. The introns-late concept held that introns emerged only in eukaryotes and new introns have been accumulating continuously throughout eukaryotic evolution. Analysis of orthologous genes from completely sequenced eukaryotic genomes revealed numerous shared intron positions in orthologous genes from animals and plants and even between animals, plants and protists, suggesting that many ancestral introns have persisted since the last eukaryotic common ancestor (LECA). Reconstructions of intron gain and loss using the growing collection of genomes of diverse eukaryotes and increasingly advanced probabilistic models convincingly show that the LECA and the ancestors of each eukaryotic supergroup had intron-rich genes, with intron densities comparable to those in the most intron-rich modern genomes such as those of vertebrates. The subsequent evolution in most lineages of eukaryotes involved primarily loss of introns, with only a few episodes of substantial intron gain that might have accompanied major evolutionary innovations such as the origin of metazoa. The original invasion of self-splicing Group II introns, presumably originating from the mitochondrial endosymbiont, into the genome of the emerging eukaryote might have been a key factor of eukaryogenesis that in particular triggered the origin of endomembranes and the nucleus. Conversely, splicing errors gave rise to alternative splicing, a major contribution to the biological complexity of multicellular eukaryotes.
There is no indication that any prokaryote has ever possessed a spliceosome or introns in protein-coding genes, other than relatively rare mobile self-splicing introns. Thus, the introns-first scenario is not supported by any evidence but exon-intron structure of protein-coding genes appears to have evolved concomitantly with the eukaryotic cell, and introns were a major factor of evolution throughout the history of eukaryotes. This article was reviewed by I. King Jordan, Manuel Irimia (nominated by Anthony Poole), Tobias Mourier (nominated by Anthony Poole), and Fyodor Kondrashov. For the complete reports, see the Reviewers’ Reports section.
Intron sliding; Intron gain; Intron loss; Spliceosome; Splicing signals; Evolution of exon/intron structure; Alternative splicing; Phylogenetic trees; Mobile domains; Eukaryotic ancestor
The effect of alignment gaps on phylogenetic accuracy has been the subject of numerous studies. In this study, we investigated the relationship between the total number of gapped sites and phylogenetic accuracy, when the gaps were introduced (by means of computer simulation) to reflect indel (insertion/deletion) events during the evolution of DNA sequences. The resulting (true) alignments were subjected to commonly used gap treatment and phylogenetic inference methods.
(1) In general, there was a strong – almost deterministic – relationship between the amount of gap in the data and the level of phylogenetic accuracy when the alignments were very "gappy", (2) gaps resulting from deletions (as opposed to insertions) contributed more to the inaccuracy of phylogenetic inference, (3) the probabilistic methods (Bayesian, PhyML & "MLε," a method implemented in DNAML in PHYLIP) performed better at most levels of gap percentage when compared to parsimony (MP) and distance (NJ) methods, with Bayesian analysis being clearly the best, (4) methods that treat gapped sites as missing data yielded less accurate trees when compared to those that attribute phylogenetic signal to the gapped sites (by coding them as binary character data – presence/absence, or as in the MLε method), and (5) in general, the accuracy of phylogenetic inference depended upon the amount of available data when the gaps resulted from mainly deletion events, and the amount of missing data when insertion events were equally likely to have caused the alignment gaps.
When gaps in an alignment are a consequence of indel events in the evolution of the sequences, the accuracy of phylogenetic analysis is likely to improve if: (1) alignment gaps are categorized as arising from insertion events or deletion events and then treated separately in the analysis, (2) the evolutionary signal provided by indels is harnessed in the phylogenetic analysis, and (3) methods that utilize the phylogenetic signal in indels are developed for distance methods too. When the true homology is known and the amount of gaps is 20 percent of the alignment length or less, the methods used in this study are likely to yield trees with 90–100 percent accuracy.
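Coding gapped sites as binary presence/absence characters can be illustrated with a simple site-wise scheme. The Python sketch below emits one binary character for every alignment column that contains both gaps and residues. This is only one possible coding, chosen for brevity, and is not necessarily the scheme used in the study. The function name and the dictionary input format are assumptions.

```python
def gaps_to_binary(alignment, gap="-"):
    """Code gapped columns as presence (1) / absence (0) characters.

    alignment: {sequence_name: aligned sequence}, all of equal length.
    Returns {sequence_name: string of 0/1}, one character per column that
    contains at least one gap and at least one residue.
    """
    names = list(alignment)
    length = len(alignment[names[0]])
    coded = {name: [] for name in names}
    for i in range(length):
        column = [alignment[name][i] for name in names]
        n_gaps = column.count(gap)
        if 0 < n_gaps < len(names):        # skip gap-free and all-gap columns
            for name, ch in zip(names, column):
                coded[name].append("0" if ch == gap else "1")
    return {name: "".join(bits) for name, bits in coded.items()}

aln = {"s1": "AC-GT", "s2": "ACTGT", "s3": "A--GT"}
print(gaps_to_binary(aln))  # {'s1': '10', 's2': '11', 's3': '00'}
```

The resulting binary matrix can be analysed alongside the nucleotide data. A scheme that treats a multi-column gap as a single indel event would need to merge adjacent columns first, which this sketch does not attempt.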
Parsimony methods are widely used in molecular evolution to estimate the most plausible phylogeny for a set of characters. Sankoff parsimony determines the minimum number of changes required in a given phylogeny when a cost is associated with transitions between character states. Although optimizations exist to reduce the computations in the number of taxa, the original algorithm takes time O(n^2) in the number of states, making it impractical for large values of n.
In this study we introduce an optimization of Sankoff parsimony for the reconstruction of ancestral states when ultrametric or additive cost matrices are used. We analyzed its performance for randomly generated matrices, Jukes-Cantor and Kimura's two-parameter models of DNA evolution, and in the reconstruction of elongation factor-1α and ancestral metabolic states of a group of eukaryotes, showing that in all cases the execution time is significantly less than with the original implementation.
The algorithms presented here provide a fast computation of Sankoff parsimony for a given phylogeny. Problems where the number of states is large, such as reconstruction of ancestral metabolism, are particularly well suited to this optimization. Since we are reducing the computations required to calculate the parsimony cost of a single tree, our method can be combined with optimizations in the number of taxa that aim at finding the most parsimonious tree.
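To show where the quadratic cost in the number of states comes from, here is a minimal Python sketch of the original, unoptimized Sankoff recursion for one character on a fixed tree. The optimization for ultrametric or additive cost matrices introduced in the study is not implemented. The nested-tuple tree, the integer state coding, and the function name are assumptions made for the example.

```python
def sankoff(tree, leaf_states, cost):
    """Return v, where v[s] is the minimum cost of the subtree if its root is in state s.

    tree: a leaf name (str) or a tuple of subtrees.
    leaf_states: {leaf_name: state index}.
    cost: square matrix, cost[s][t] is the cost of a change from s to t.
    """
    n = len(cost)
    if isinstance(tree, str):
        inf = float("inf")
        return [0.0 if s == leaf_states[tree] else inf for s in range(n)]
    total = [0.0] * n
    for child in tree:
        child_cost = sankoff(child, leaf_states, cost)
        for s in range(n):                 # n parent states ...
            total[s] += min(cost[s][t] + child_cost[t] for t in range(n))  # ... times n child states
    return total

unit = [[0 if s == t else 1 for t in range(4)] for s in range(4)]
tree = (("a", "b"), ("c", "d"))
costs = sankoff(tree, {"a": 0, "b": 0, "c": 1, "d": 0}, unit)
print(costs, min(costs))  # [1.0, 2.0, 3.0, 3.0] 1.0
```

The nested loop over parent state `s` and child state `t` is executed once per branch, which is the O(n^2) term in the number of states. With unit costs the minimum matches the Fitch count for the same data.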
Ever since the discovery of 'genes in pieces' and mRNA splicing in eukaryotes, origin and evolution of spliceosomal introns have been considered within the conceptual framework of the 'introns early' versus 'introns late' debate. The 'introns early' hypothesis, which is closely linked to the so-called exon theory of gene evolution, posits that protein-coding genes were interrupted by numerous introns even at the earliest stages of life's evolution and that introns played a major role in the origin of proteins by facilitating recombination of sequences coding for small protein/peptide modules. Under this scenario, the absence of spliceosomal introns in prokaryotes is considered to be a result of "genome streamlining". The 'introns late' hypothesis counters that spliceosomal introns emerged only in eukaryotes, and moreover, have been inserted into protein-coding genes continuously throughout the evolution of eukaryotes. Beyond the formal dilemma, the more substantial side of this debate has to do with possible roles of introns in the evolution of eukaryotes.
I argue that several lines of evidence now suggest a coherent solution to the introns-early versus introns-late debate, and the emerging picture of intron evolution integrates aspects of both views although, formally, there seems to be no support for the original version of introns-early. Firstly, there is growing evidence that spliceosomal introns evolved from group II self-splicing introns, which are present, usually in small numbers, in many bacteria and probably moved into the evolving eukaryotic genome from the α-proteobacterial progenitor of the mitochondria. Secondly, the concept of a primordial pool of 'virus-like' genetic elements implies that self-splicing introns are among the most ancient genetic entities. Thirdly, reconstructions of the ancestral state of eukaryotic genes suggest that the last common ancestor of extant eukaryotes had an intron-rich genome. Thus, it appears that ancestors of spliceosomal introns have indeed existed since the earliest stages of life's evolution, in a formal agreement with the introns-early scenario. However, there is no evidence that these ancient introns ever became widespread before the emergence of eukaryotes; hence, the central tenet of introns-early, the role of introns in early evolution of proteins, has no support. Conversely, the demonstration that numerous introns invaded eukaryotic genes at the outset of eukaryotic evolution and that subsequent intron gain has been limited in many eukaryotic lineages implicates introns as an ancestral feature of eukaryotic genomes and refutes radical versions of introns-late. Perhaps most importantly, I argue that the intron invasion triggered other pivotal events of eukaryogenesis, including the emergence of the spliceosome, the nucleus, the linear chromosomes, the telomerase, and the ubiquitin signaling system. This concept of eukaryogenesis, in a sense, revives some tenets of the exon hypothesis, by assigning to introns crucial roles in eukaryotic evolutionary innovation.
The scenario of the origin and evolution of introns that is best compatible with the results of comparative genomics and theoretical considerations goes as follows: self-splicing introns since the earliest stages of life's evolution – numerous spliceosomal introns invading genes of the emerging eukaryote during eukaryogenesis – subsequent lineage-specific loss and gain of introns. The intron invasion, probably spawned by the mitochondrial endosymbiont, might have critically contributed to the emergence of the principal features of the eukaryotic cell. This scenario combines aspects of the introns-early and introns-late views.
This article was reviewed by W. Ford Doolittle, James Darnell (nominated by W. Ford Doolittle), William Martin, and Anthony Poole.
Batesian mimics gain protection from predation through the evolution of physical similarities to a model species that possesses anti-predator defences. This protection should not be effective in the absence of the model since the predator does not identify the mimic as potentially dangerous and both the model and the mimic are highly conspicuous. Thus, Batesian mimics should probably encounter strong predation pressure outside the geographical range of the model species. There are several documented examples of Batesian mimics occurring in locations without their models, but the evolutionary responses remain largely unidentified. A mimetic species has four alternative evolutionary responses to the loss of model presence. If predation is weak, it could maintain its mimetic signal. If predation is intense, it is widely presumed the mimic will go extinct. However, the mimic could also evolve a new colour pattern to mimic another model species or it could revert to its ancestral, less conspicuous phenotype. We used molecular phylogenetic approaches to reconstruct and test the evolution of mimicry in the North American admiral butterflies (Limenitis: Nymphalidae). We confirmed that the more cryptic white-banded form is the ancestral phenotype of North American admiral butterflies. However, one species, Limenitis arthemis, evolved the black pipevine swallowtail mimetic form but later reverted to the white-banded more cryptic ancestral form. This character reversion is strongly correlated with the geographical absence of the model species and its host plant, but not the host plant distribution of L. arthemis. Our results support the prediction that a Batesian mimic does not persist in locations without its model, but it does not go extinct either. The mimic can revert to its ancestral, less conspicuous form and persist.
character evolution; Lepidoptera; parametric bootstrap; wing pattern evolution
Parmelioid lichens form a species-rich group of predominantly foliose and fruticose lichenized fungi encompassing a broad range of morphological and chemical diversity. Using a multilocus approach, we reconstructed a phylogeny including 323 OTUs of parmelioid lichens and employed ancestral character reconstruction methods to understand the phenotypical evolution within this speciose group of lichen-forming fungi. Specifically, we were interested in the evolution of growth form, epicortex structure, and cortical chemistry. Since previous studies have shown that results may differ depending on the reconstruction method used, here we employed both maximum-parsimony and maximum-likelihood approaches to reconstruct ancestral character states. We have also implemented binary and multistate coding of characters and performed parallel analyses with both coding types to assess potential coding-based biases. We reconstructed the ancestral states for nine well-supported major clades in the parmelioid group, two higher-level sister groups and the ancestral character state for all parmelioid lichens. We found that different methods for coding phenotypical characters and different ancestral character state reconstruction methods mostly resulted in identical reconstructions but yielded conflicting inferences of ancestral states in some cases. However, we found support for the ancestor of parmelioid lichens having been a foliose lichen with a non-pored epicortex and pseudocyphellae. Our data suggest that some traits exhibit patterns of evolution consistent with adaptive radiation.
Background and Aims
The Neotropical tribe Trimezieae are taxonomically difficult. They are generally characterized by the absence of the features used to delimit their sister group Tigridieae. Delimiting the four genera that make up Trimezieae is also problematic. Previous family-level phylogenetic analyses have not examined the monophyly of the tribe or relationships within it. Reconstructing the phylogeny of Trimezieae will allow us to evaluate the status of the tribe and genera and to examine the suitability of characters traditionally used in their taxonomy.
Maximum parsimony and Bayesian phylogenetic analyses are presented for 37 species representing all four genera of Trimezieae. Analyses were based on nrITS sequences and a combined plastid dataset. Ancestral character state reconstructions were used to investigate the evolution of ten morphological characters previously considered taxonomically useful.
Analyses of nrITS and plastid datasets strongly support the monophyly of Trimezieae and recover four principal clades with varying levels of support; these clades do not correspond to the currently recognized genera. Relationships within the four clades are not consistently resolved, although the conflicting resolutions are not strongly supported in individual analyses. Ancestral character state reconstructions suggest considerable homoplasy, especially in the floral characters used to delimit Pseudotrimezia.
The results strongly support recognition of Trimezieae as a tribe but suggest that both generic- and species-level taxonomy need revision. Further molecular analyses, with increased sampling of taxa and markers, are needed to support any revision. Such analyses will help determine the causes of discordance between the plastid and nuclear data and provide a framework for identifying potential morphological synapomorphies for infra-tribal groups. The results also suggest Trimezieae provide a promising model for evolutionary research.
DNA sequences; Iridaceae; Iridoideae; morphology; Neomarica; Neotropics; phylogenetic analysis; Pseudiris; Pseudotrimezia; Trimezia; Trimezieae
Probabilistic models for sequence comparison (such as hidden Markov models and pair hidden Markov models for proteins and mRNAs, or their context-free grammar counterparts for structural RNAs) often assume a fixed degree of divergence. Ideally we would like these models to be conditional on evolutionary divergence time.
Probabilistic models of substitution events are well established, but there has not been a completely satisfactory theoretical framework for modeling insertion and deletion events.
I have developed a method for extending standard Markov substitution models to include gap characters, and another method for the evolution of state transition probabilities in a probabilistic model. These methods use instantaneous rate matrices in a way that is more general than those used for substitution processes, and are sufficient to provide time-dependent models for standard linear and affine gap penalties, respectively.
Given a probabilistic model, we can make all of its emission probabilities (including gap characters) and all its transition probabilities conditional on a chosen divergence time. To do this, we only need to know the parameters of the model at one particular divergence time instance, as well as the parameters of the model at the two extremes of zero and infinite divergence.
I have implemented these methods in a new generation of the RNA genefinder QRNA (eQRNA).
These methods can be applied to incorporate evolutionary models of insertions and deletions into any hidden Markov model or stochastic context-free grammar, in a pair or profile form, for sequence modeling.
Domains are modules within proteins that can fold and function independently and are evolutionarily conserved. Here we compared the usage and distribution of protein domain families in the free-living proteomes of Archaea, Bacteria and Eukarya and reconstructed species phylogenies while tracing the history of domain emergence and loss in proteomes. We show that both gains and losses of domains occurred frequently during proteome evolution. The rate of domain discovery increased approximately linearly in evolutionary time. Remarkably, gains generally outnumbered losses and the gain-to-loss ratios were much higher in akaryotes compared to eukaryotes. Functional annotations of domain families revealed that both Archaea and Bacteria gained and lost metabolic capabilities during the course of evolution while Eukarya acquired a number of diverse molecular functions including those involved in extracellular processes, immunological mechanisms, and cell regulation. Results also highlighted significant contemporary sharing of informational enzymes between Archaea and Eukarya and metabolic enzymes between Bacteria and Eukarya. Finally, the analysis provided useful insights into the evolution of species. The archaeal superkingdom appeared first in evolution by gradual loss of ancestral domains, bacterial lineages were the first to gain superkingdom-specific domains, and eukaryotes (likely) originated when an expanding proto-eukaryotic stem lineage gained organelles through endosymbiosis of already diversified bacterial lineages. The evolutionary dynamics of domain families in proteomes and the increasing number of domain gains are predicted to redefine the persistence strategies of organisms in superkingdoms, influence the makeup of molecular functions, and enhance organismal complexity by the generation of new domain architectures. These dynamics highlight ongoing secondary evolutionary adaptations in akaryotic microbes, especially Archaea.
Proteins are made up of well-packed structural units referred to as domains. Domain structure in proteins is responsible for protein function and is evolutionarily conserved. Here we report global patterns of protein domain gain and loss in the three superkingdoms of life. We reconstructed phylogenetic trees using domain fold families as phylogenetic characters and retraced the history of character changes along the many branches of the tree of life. Results revealed that both domain gains and losses were frequent events in the evolution of cells. However, domain gains generally outnumbered losses. This trend was consistent in the three superkingdoms. However, the rate of domain discovery was highest in akaryotic microbes. Domain gains occurred throughout the evolutionary timeline albeit at a non-uniform rate. Our study sheds light on the evolutionary history of living organisms and highlights important ongoing mechanisms that are responsible for secondary evolutionary adaptations in the three superkingdoms of life.
Reliable inference of ancestral sequences can be critical to identifying both patterns and causes of molecular evolution. Robustness of ancestral inference is often assumed among closely related species, but tests of this assumption have been limited. Here, we examine the performance of inference methods for data simulated under scenarios of codon bias evolution within the Drosophila melanogaster subgroup. Genome sequence data for multiple, closely related species within this subgroup make it an important system for studying molecular evolutionary genetics. The effects of asymmetric and lineage-specific substitution rates (i.e., varying levels of codon usage bias and departures from equilibrium) on the reliability of ancestral codon usage were investigated. Maximum parsimony inference, which has been widely employed in analyses of Drosophila codon bias evolution, was compared to an approach that attempts to account for uncertainty in ancestral inference by weighting ancestral reconstructions by their posterior probabilities. The latter approach employs maximum likelihood estimation of rate and base composition parameters. For equilibrium and most non-equilibrium scenarios that were investigated, the probabilistic method appears to generate reliable ancestral codon bias inferences for molecular evolutionary studies within the D. melanogaster subgroup. These reconstructions are more reliable than parsimony inference, especially when codon usage is strongly skewed. However, inference biases are considerable for both methods under particular departures from stationarity (i.e., when adaptive evolution is prevalent). Reliability of inference can be sensitive to branch lengths, asymmetry in substitution rates, and the locations and nature of lineage-specific processes within a gene tree. Inference reliability, even among closely related species, can be strongly affected by (potentially unknown) patterns of molecular evolution in lineages ancestral to those of interest.
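The difference between counting changes from a single best ancestral reconstruction and weighting every reconstruction by its posterior probability can be illustrated with a small sketch. It is not the estimation procedure used in the study: the posterior values are made-up inputs rather than maximum likelihood estimates, the single-best rule stands in for any method that commits to one reconstruction per site, and the function names are hypothetical.

```python
from collections import Counter


def weighted_change_counts(sites):
    """Expected counts of ancestor->descendant changes.

    sites: list of (posteriors, descendant) pairs, where posteriors maps each
    candidate ancestral state to its posterior probability at that site.
    Every candidate ancestor contributes in proportion to its posterior.
    """
    counts = Counter()
    for posteriors, descendant in sites:
        for ancestor, prob in posteriors.items():
            if ancestor != descendant:
                counts[(ancestor, descendant)] += prob
    return counts


def single_best_change_counts(sites):
    """Counts obtained by committing to the most probable ancestor per site."""
    counts = Counter()
    for posteriors, descendant in sites:
        ancestor = max(posteriors, key=posteriors.get)
        if ancestor != descendant:
            counts[(ancestor, descendant)] += 1
    return counts


# Hypothetical input: two sites with made-up posterior probabilities.
example_sites = [
    ({"C": 0.7, "T": 0.3}, "T"),  # uncertain ancestor, descendant is T
    ({"C": 1.0}, "C"),            # unambiguous site with no change
]
print(weighted_change_counts(example_sites))
print(single_best_change_counts(example_sites))
```

In the first site the weighted count records 0.7 of a C-to-T change, whereas the single-best count records a full change, which shows how ignoring uncertainty can inflate counts in the favoured direction.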
Spliceosomal introns are one of the principal distinctive features of eukaryotes. Nevertheless, different large-scale studies disagree about even the most basic features of their evolution. In order to come up with a more reliable reconstruction of intron evolution, we developed a model that is far more comprehensive than previous ones. This model is rich in parameters, and estimating them accurately is infeasible by straightforward likelihood maximization. Thus, we have developed an expectation-maximization algorithm that allows for efficient maximization. Here, we outline the model and describe the expectation-maximization algorithm in detail. Since the method works with intron presence–absence maps, it is expected to be instrumental for the analysis of the evolution of other binary characters as well.
Maximum likelihood; expectation-maximization; intron evolution; ancestral reconstruction; eukaryotic gene structure
Macrogenomic events, in which genes are gained and lost, play a pivotal evolutionary role in microbial evolution. Nevertheless, probabilistic-evolutionary models describing such events and methods for their robust inference are considerably less developed than existing methodologies for analyzing site-specific sequence evolution. Here, we present a novel method for the inference of gains and losses of gene families. First, we develop probabilistic-evolutionary models describing the dynamics of gene-family content, which are more biologically realistic than previously suggested models. In our likelihood-based models, gains and losses are represented by transitions between presence and absence, given an underlying phylogeny. We employ a mixture-model approach in which we allow both the gain rate and the loss rate to vary among gene families. Second, we use these models together with the analytic implementation of stochastic mapping to infer branch-specific events. Our novel methodology allows us to infer and quantify horizontal gene transfer (HGT) events. This enables us to rank various gene families and lineages according to their propensity to undergo gains and losses. Applying our methodology to 4,873 gene families shows that: 1) the novel mixture models describe the observed variability in gene-family content among microbes significantly better than previous models; 2) The stochastic mapping approach enables accurate inference of gain and loss events based on simulations; 3) At least 34% of the gene families analyzed are inferred to have experienced HGT at least once during their evolution; and 4) Gene families that were inferred to experience HGT are both enriched and depleted with respect to specific functional categories.
phyletic pattern; probabilistic-evolutionary models; mixture models; genome evolution; horizontal gene transfer; gene-family content
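The representation of gains and losses as transitions between absence and presence can be made concrete with a two-state continuous-time Markov chain. The sketch below uses the well-known closed form for such a chain with a single gain rate and a single loss rate; it does not reproduce the mixture models or the stochastic mapping of the study, and the function names and example rates are assumptions chosen for illustration.

```python
from math import exp


def stationary_frequencies(gain, loss):
    """Stationary probabilities (absence, presence) of the two-state chain."""
    total = gain + loss
    return loss / total, gain / total


def transition_matrix(gain, loss, t):
    """Transition probabilities over a branch of length t.

    State 0 = gene family absent, state 1 = present.
    gain is the rate of 0 -> 1, loss is the rate of 1 -> 0.
    Returns [[P00, P01], [P10, P11]].
    """
    pi0, pi1 = stationary_frequencies(gain, loss)
    decay = exp(-(gain + loss) * t)
    return [
        [pi0 + pi1 * decay, pi1 * (1.0 - decay)],
        [pi0 * (1.0 - decay), pi1 + pi0 * decay],
    ]


# Hypothetical rates: losses three times as frequent as gains.
example_gain, example_loss = 0.5, 1.5
for branch_length in (0.0, 0.1, 1.0, 10.0):
    print(branch_length, transition_matrix(example_gain, example_loss, branch_length))
```

On a zero-length branch the matrix is the identity, and on very long branches both rows approach the stationary frequencies, so the state at the start of the branch no longer matters. A mixture model in the spirit of the abstract would evaluate such matrices for several gain and loss rate categories and average the resulting likelihoods per gene family.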
The Australian scincid clade Lerista provides perhaps the best available model for studying limb reduction in squamates (lizards and snakes), comprising more than 75 species displaying a remarkable variety of digit configurations, from pentadactyl to entirely limbless conditions. We investigated the pattern and rate of limb reduction and loss in Lerista, employing a comprehensive phylogeny inferred from nucleotide sequences for a nuclear intron and six mitochondrial genes.
The inferred phylogeny reveals extraordinary evolutionary mutability of limb morphology in Lerista. Ancestral state reconstructions indicate at least ten independent reductions in the number of digits from a pentadactyl condition, with a further seven reductions proceeding independently from a tetradactyl condition derived from one of these reductions. Four independent losses of all digits are inferred, three from pentadactyl or tetradactyl conditions. These conclusions are not substantially affected by uncertainty in assumed rates of character state transition or the phylogeny. An estimated age of 13.4 million years for Lerista entails that limb reduction has occurred not only repeatedly, but also very rapidly. At the highest rate, complete loss of digits from a pentadactyl condition is estimated to have occurred within 3.6 million years.
The exceptionally high frequency and rate of limb reduction inferred for Lerista emphasise the potential for rapid and substantial alteration of body form in squamates. An absence of compelling evidence for reversals of digit loss contrasts with a recent proposal that digits have been regained in some species of the gymnophthalmid clade Bachia, possibly reflecting an influence of differing environmental and genetic contexts on the evolution of limb morphology in these clades. Future study of the genetic, developmental, and ecological bases of limb reduction and loss in Lerista promises the elucidation of not only this phenomenon in squamates, but also the dramatic evolutionary transformations of body form that have produced the extraordinary diversity of multicellular organisms.
The loss of phenotypic characters is a common feature of evolution. Cave organisms provide excellent models for investigating the underlying patterns and processes governing the evolutionary loss of phenotypic traits. The blind Mexican cavefish, Astyanax mexicanus, represents a particularly strong model for both developmental and genetic analyses as these fish can be raised in the laboratory and hybridized with conspecific surface form counterparts to produce large F2 pedigrees. As studies have begun to illuminate the genetic bases for trait evolution in these cavefish, it has become increasingly important to understand these phenotypic changes within the context of cavefish origins. Understanding these origins is a challenge. For instance, widespread convergence on similar features renders morphological characters less informative. In addition, current and past gene flow between surface and cave forms has complicated the delineation of particular cave populations.
Past population-level analyses have sought to: 1) estimate at what time in the geological past cave forms became isolated from surface-dwelling ancestors, 2) define the extent to which cave form populations originated from a common invasion (single origin hypothesis) or several invasions (multiple origin hypothesis), and 3) clarify the role of geological and climatic events in Astyanax cavefish evolution. In recent years, thanks to the combined use of morphological and genetic data, a much clearer picture has emerged regarding the origins of Astyanax cavefish.
The consensus view, based on several recent studies, is that cave forms originated from at least two distinct ancestral surface-dwelling stocks over the past several million years. In addition, each stock gave rise to multiple invasions of the subterranean biotope. The older stock is believed to have invaded the El Abra caves at least three times while the newer stock separately invaded the northern Guatemala and western Micos caves. This renewed picture of Astyanax cavefish origins will help investigators draw conclusions regarding the evolution of phenotypic traits through parallelism versus convergence. Additionally, it will help us understand how the presence of cave-associated traits in old versus young cave populations may be influenced by the time since isolation in the cave environment. This will, in turn, help to inform our broader understanding of the forces that govern the evolution of phenotypic loss.
Regressive phenotypic evolution; Cave biology; Genetics