Spliceosomal introns are one of the principal distinctive features of eukaryotes. Nevertheless, different large-scale studies disagree about even the most basic features of their evolution. In order to come up with a more reliable reconstruction of intron evolution, we developed a model that is far more comprehensive than previous ones. This model is rich in parameters, and estimating them accurately is infeasible by straightforward likelihood maximization. Thus, we have developed an expectation-maximization algorithm that allows for efficient maximization. Here, we outline the model and describe the expectation-maximization algorithm in detail. Since the method works with intron presence–absence maps, it is expected to be instrumental for the analysis of the evolution of other binary characters as well.
Maximum likelihood; expectation-maximization; intron evolution; ancestral reconstruction; eukaryotic gene structure
Parsimony methods are widely used in molecular evolution to estimate the most plausible phylogeny for a set of characters. Sankoff parsimony determines the minimum number of changes required in a given phylogeny when a cost is associated with transitions between character states. Although optimizations exist to reduce the computations in the number of taxa, the original algorithm takes time O(n^2) in the number of states n, making it impractical for large values of n.
In this study we introduce an optimization of Sankoff parsimony for the reconstruction of ancestral states when ultrametric or additive cost matrices are used. We analyzed its performance for randomly generated matrices, Jukes-Cantor and Kimura's two-parameter models of DNA evolution, and in the reconstruction of elongation factor-1α and ancestral metabolic states of a group of eukaryotes, showing that in all cases the execution time is significantly less than with the original implementation.
The algorithms here presented provide a fast computation of Sankoff parsimony for a given phylogeny. Problems where the number of states is large, such as reconstruction of ancestral metabolism, are particularly adequate for this optimization. Since we are reducing the computations required to calculate the parsimony cost of a single tree, our method can be combined with optimizations in the number of taxa that aim at finding the most parsimonious tree.
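For orientation, the baseline computation that the optimization targets can be sketched as follows. This is a minimal illustration of the standard Sankoff recursion, not the optimized algorithm introduced in the study; the function name `sankoff`, the dictionary-based tree encoding, and the example cost matrix are assumptions made for the sketch. The inner double loop over states is the source of the quadratic cost per node and child.

```python
from math import inf

def sankoff(tree, root, leaf_states, cost):
    """Return (minimum parsimony cost, per-node cost vectors).

    tree:        dict mapping each internal node to a list of its children
    leaf_states: dict mapping each leaf to its observed state index
    cost:        square matrix; cost[i][j] is the cost of changing i -> j
    """
    n = len(cost)
    vectors = {}

    def visit(node):
        if node in leaf_states:
            # A leaf is free in its observed state and impossible elsewhere.
            vec = [inf] * n
            vec[leaf_states[node]] = 0
        else:
            vec = [0] * n
            for child in tree[node]:
                child_vec = visit(child)
                # Quadratic step: for every parent state i, scan all child states j.
                for i in range(n):
                    vec[i] += min(cost[i][j] + child_vec[j] for j in range(n))
        vectors[node] = vec
        return vec

    return min(visit(root)), vectors


# Hypothetical example: tree ((A,B),C) with three states and unit costs.
unit_cost = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
example_tree = {"root": ["X", "C"], "X": ["A", "B"]}
best, vectors = sankoff(example_tree, "root", {"A": 0, "B": 1, "C": 1}, unit_cost)
```

With ultrametric or additive cost matrices, the scan over child states j is the part an optimization of this kind would aim to shorten; the sketch makes no attempt to do so.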
As one of the most widely used parsimony methods for ancestral reconstruction, the Fitch method minimizes the total number of hypothetical substitutions along all branches of a tree to explain the evolution of a character. Because the method is so widely used, its reconstruction accuracy has been studied intensively in recent years. However, most studies are restricted to 2-state evolutionary models; a study of higher-state models is needed because DNA sequences are 4-state series and protein sequences have 20 states.
In this paper, the ambiguous and unambiguous reconstruction accuracies of the Fitch method are studied for N-state evolutionary models. Given an arbitrary phylogenetic tree, a recurrence system is first presented to calculate the two accuracies iteratively. Because the complete binary tree and the comb-shaped tree are the two extremal evolutionary tree topologies with respect to balance, we focus on the reconstruction accuracies on these two topologies and analyze their asymptotic properties. Then, 1000 Yule trees with 1024 leaves are generated and analyzed to simulate real evolutionary scenarios. It is known that more taxa do not necessarily increase the reconstruction accuracies under 2-state models. The result under N-state models is also tested.
In a large tree with many leaves, the reconstruction accuracies of using all taxa are sometimes less than those of using a leaf subset under N-state models. For complete binary trees, there always exists an equilibrium interval [a, b] of conservation probability, in which the limiting ambiguous reconstruction accuracy equals the probability of randomly picking a state. The value b decreases as the number of states increases, and it seems to converge. When the conservation probability is greater than b, the reconstruction accuracies of the Fitch method increase rapidly. The reconstruction accuracies on 1000 simulated Yule trees also exhibit similar behaviors. For comb-shaped trees, the limiting reconstruction accuracies of using all taxa are always less than or equal to those of using the nearest root-to-leaf path when the conservation probability is not less than 1/N. As a result, more taxa are suggested for ancestral reconstruction when the tree topology is balanced and the sequences are highly similar, and a few taxa close to the root are recommended otherwise.
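The bottom-up pass of the Fitch method that underlies these accuracy results can be sketched as below. This is a minimal illustration for a rooted binary tree; the function name `fitch` and the tree encoding are assumptions of the sketch, and the recurrence system for the accuracies themselves is not reproduced. A root set containing a single state corresponds to an unambiguous reconstruction; a larger set is the ambiguous case.

```python
def fitch(tree, root, leaf_states):
    """Return (number of substitutions, set of most parsimonious root states).

    tree:        dict mapping each internal node to its two children
    leaf_states: dict mapping each leaf to its observed state
    """
    changes = 0

    def visit(node):
        nonlocal changes
        if node in leaf_states:
            return {leaf_states[node]}
        left, right = (visit(child) for child in tree[node])
        common = left & right
        if common:
            # The children agree on at least one state: no substitution needed.
            return common
        # Disjoint child sets: one substitution, keep the union of candidates.
        changes += 1
        return left | right

    root_set = visit(root)
    return changes, root_set


# Hypothetical example: tree ((A,B),C) with DNA states.
example_tree = {"root": ["X", "C"], "X": ["A", "B"]}
result = fitch(example_tree, "root", {"A": "a", "B": "c", "C": "c"})
```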
Macrogenomic events, in which genes are gained and lost, play a pivotal evolutionary role in microbial evolution. Nevertheless, probabilistic-evolutionary models describing such events and methods for their robust inference are considerably less developed than existing methodologies for analyzing site-specific sequence evolution. Here, we present a novel method for the inference of gains and losses of gene families. First, we develop probabilistic-evolutionary models describing the dynamics of gene-family content, which are more biologically realistic than previously suggested models. In our likelihood-based models, gains and losses are represented by transitions between presence and absence, given an underlying phylogeny. We employ a mixture-model approach in which we allow both the gain rate and the loss rate to vary among gene families. Second, we use these models together with the analytic implementation of stochastic mapping to infer branch-specific events. Our novel methodology allows us to infer and quantify horizontal gene transfer (HGT) events. This enables us to rank various gene families and lineages according to their propensity to undergo gains and losses. Applying our methodology to 4,873 gene families shows that: 1) the novel mixture models describe the observed variability in gene-family content among microbes significantly better than previous models; 2) The stochastic mapping approach enables accurate inference of gain and loss events based on simulations; 3) At least 34% of the gene families analyzed are inferred to have experienced HGT at least once during their evolution; and 4) Gene families that were inferred to experience HGT are both enriched and depleted with respect to specific functional categories.
phyletic pattern; probabilistic-evolutionary models; mixture models; genome evolution; horizontal gene transfer; gene-family content
Summary: Malin is a software package for the analysis of eukaryotic gene structure evolution. It provides a graphical user interface for various tasks commonly used to infer the evolution of exon–intron structure in protein-coding orthologs. Implemented tasks include the identification of conserved homologous intron sites in protein alignments, as well as the estimation of ancestral intron content, lineage-specific intron losses and gains. Estimates are computed either with parsimony, or with a probabilistic model that incorporates rate variation across lineages and intron sites.
Availability: Malin is available as a stand-alone Java application, as well as an application bundle for MacOS X, at the website http://www.iro.umontreal.ca/~csuros/introns/malin/. The software is distributed under a BSD-style license.
Large-scale sequencing of genomes has enabled the inference of phylogenies based on the evolution of genomic architecture, under such events as rearrangements, duplications, and losses. Many evolutionary models and associated algorithms have been designed over the last few years and have found use in comparative genomics and phylogenetic inference. However, the assessment of phylogenies built from such data has not been properly addressed to date. The standard method used in sequence-based phylogenetic inference is the bootstrap, but it relies on a large number of homologous characters that can be resampled; yet in the case of rearrangements, the entire genome is a single character. Alternatives such as the jackknife suffer from the same problem, while likelihood tests cannot be applied in the absence of well established probabilistic models.
We present a new approach to the assessment of distance-based phylogenetic inference from whole-genome data; our approach combines features of the jackknife and the bootstrap and remains nonparametric. For each feature of our method, we give an equivalent feature in the sequence-based framework; we also present the results of extensive experimental testing, in both sequence-based and genome-based frameworks. Through the feature-by-feature comparison and the experimental results, we show that our bootstrapping approach is on par with the classic phylogenetic bootstrap used in sequence-based reconstruction, and we establish the clear superiority of the classic bootstrap for sequence data and of our corresponding new approach for rearrangement data over proposed variants. Finally, we test our approach on a small dataset of mammalian genomes, verifying that the support values match current thinking about the respective branches.
Our method is the first to provide a standard of assessment to match that of the classic phylogenetic bootstrap for aligned sequences. Its support values follow a similar scale and its receiver-operating characteristics are nearly identical, indicating that it provides similar levels of sensitivity and specificity. Thus our assessment method makes it possible to conduct phylogenetic analyses on whole genomes with the same degree of confidence as for analyses on aligned sequences. Extensions to search-based inference methods such as maximum parsimony and maximum likelihood are possible, but remain to be thoroughly tested.
Bootstrap; Jackknife; Phylogenetic reconstruction; Rearrangement; Gene order; Comparative genomics
The mechanisms and evolutionary dynamics of intron insertion and loss in eukaryotic genes remain poorly understood. Reconstruction of parsimonious scenarios of gene structure evolution in paralogous gene families in animals and plants revealed numerous gains and losses of introns. In all analyzed lineages, the number of acquired new introns was substantially greater than the number of lost ancestral introns. This trend held even for lineages in which vertical evolution of genes involved more intron losses than gains, suggesting that gene duplication boosts intron insertion. However, dating gene duplications and the associated intron gains and losses based on the molecular clock assumption showed that very few, if any, introns were gained during the last ∼100 million years of animal and plant evolution, in agreement with previous conclusions reached through analysis of orthologous gene sets. These results are generally compatible with the emerging notion of intensive insertion and loss of introns during transitional epochs in contrast to the relative quiet of the intervening evolutionary spans.
Motivation: Gene duplication (D), transfer (T), loss (L) and incomplete lineage sorting (I) are crucial to the evolution of gene families and the emergence of novel functions. The history of these events can be inferred via comparison of gene and species trees, a process called reconciliation, yet current reconciliation algorithms model only a subset of these evolutionary processes.
Results: We present an algorithm to reconcile a binary gene tree with a nonbinary species tree under a DTLI parsimony criterion. This is the first reconciliation algorithm to capture all four evolutionary processes driving tree incongruence and the first to reconcile non-binary species trees with a transfer model. Our algorithm infers all optimal solutions and reports complete, temporally feasible event histories, giving the gene and species lineages in which each event occurred. It is fixed-parameter tractable, with polytime complexity when the maximum species outdegree is fixed. Application of our algorithms to prokaryotic and eukaryotic data shows that use of an incomplete event model has substantial impact on the events inferred and resulting biological conclusions.
Availability: Our algorithms have been implemented in Notung, a freely available phylogenetic reconciliation software package, available at http://www.cs.cmu.edu/~durand/Notung.
The evolution of multicellular organisms involved the evolution of specialized cell types performing distinct functions; and specialized cell types presumably arose from more generalized ancestral cell types as a result of mutational events, such as gene duplication and changes in gene expression. We used characters based on gene expression data to reconstruct evolutionary relationships among 11 types of lymphocytes by the maximum parsimony method. The resulting phylogenetic tree showed expected patterns including separation of the lymphoid and myeloid lineages; clustering together of granulocyte types; and pairing of phenotypically similar cell types such as T-helper cells type 1 and T-helper cells type 2 (Th1 and Th2). We used phylogenetic analyses of sequence data to determine the time of origin of genes showing significant expression difference between Th1 and Th2 cells. Many such genes, particularly those involved in the regulation of gene expression or activation of proteins, were of ancient origin, having arisen by gene duplication before the most recent common ancestor (MRCA) of tetrapods and teleosts. However, certain other genes with significant expression difference between Th1 and Th2 arose after the tetrapod–teleost MRCA, and some of the latter were specific to eutherian (placental) mammals. This evolutionary pattern is consistent with previous evidence that, while bony fishes possess Th1 and Th2 cells, the latter differ phenotypically in important respects from the corresponding cells of mammals. Our results support a gradualistic model of the evolution of distinctive cellular phenotypes whereby the unique characteristics of a given cell type arise as a result of numerous independent mutational changes over hundreds of millions of years.
Probabilistic evolutionary models revolutionized our capability to extract biological insights from sequence data. While these models accurately describe the stochastic processes of site-specific substitutions, single-base substitutions represent only a fraction of all the events that shape genomes. Specifically, in microbes, events in which entire genes are gained (e.g. via horizontal gene transfer) and lost play a pivotal evolutionary role. In this research, we present a novel likelihood-based evolutionary model for gene gains and losses, and use it to analyse genome-wide patterns of the presence and absence of gene families. The model assumes a Markovian stochastic process, where gains and losses are represented by the transition between presence and absence, respectively, given an underlying phylogenetic tree. To account for differences in the rates of gain and loss of different gene families, we assume among-gene family rate variability, thus allowing for more accurate description of the data. Using the Bayesian approach, we estimated an evolutionary rate for each gene family. Simulation studies demonstrated that our methodology accurately infers these rates. Our methodology was applied to analyse a large corpus of data, consisting of 4873 gene families spanning 63 species and revealed novel insights regarding the evolutionary nature of genome-wide gain and loss dynamics.
phyletic pattern; probabilistic evolutionary models; genome evolution; gene gain and loss; horizontal gene transfer; gene content
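The likelihood of a single presence/absence pattern under a two-state Markov model of gain and loss can be sketched as follows. This is a simplified illustration with one gain rate and one loss rate; it omits the among-gene-family rate variability described in the abstract, and the stationary root distribution, the function names and the tree encoding are assumptions of the sketch rather than details taken from the published model.

```python
from math import exp

def transition_matrix(gain, loss, t):
    """Two-state transition probabilities; row = start state (0 absent, 1 present)."""
    total = gain + loss
    decay = exp(-total * t)
    pi0, pi1 = loss / total, gain / total  # stationary frequencies
    return [[pi0 + pi1 * decay, pi1 * (1 - decay)],
            [pi0 * (1 - decay), pi1 + pi0 * decay]]

def pattern_likelihood(tree, root, pattern, gain, loss):
    """Likelihood of one phyletic pattern by pruning over the tree.

    tree:    dict mapping each internal node to a list of (child, branch_length)
    pattern: dict mapping each leaf to 0 (absent) or 1 (present)
    """
    def partial(node):
        if node in pattern:
            return [1.0 if s == pattern[node] else 0.0 for s in (0, 1)]
        like = [1.0, 1.0]
        for child, t in tree[node]:
            p = transition_matrix(gain, loss, t)
            child_like = partial(child)
            for s in (0, 1):
                like[s] *= sum(p[s][x] * child_like[x] for x in (0, 1))
        return like

    total = gain + loss
    root_like = partial(root)
    # Assumption: the root state is drawn from the stationary distribution.
    return (loss / total) * root_like[0] + (gain / total) * root_like[1]


# Hypothetical two-leaf tree; the four pattern likelihoods should sum to one.
example_tree = {"root": [("A", 0.3), ("B", 0.7)]}
total_probability = sum(
    pattern_likelihood(example_tree, "root", {"A": a, "B": b}, gain=0.4, loss=1.1)
    for a in (0, 1) for b in (0, 1)
)
```

A mixture over rate categories, as the abstract describes, would average this quantity over several (gain, loss) pairs weighted by their prior probabilities.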
Palaeobiogeographic reconstructions are underpinned by phylogenies, divergence times and ancestral area reconstructions, which together yield ancestral area chronograms that provide a basis for proposing and testing hypotheses of dispersal and vicariance. Methods for area coding include multi-state coding with a single character, binary coding with multiple characters and string coding. Ancestral reconstruction methods are divided into parsimony versus Bayesian/likelihood approaches. We compared nine methods for reconstructing ancestral areas for placental mammals. Ambiguous reconstructions were a problem for all methods. Important differences resulted from coding areas based on the geographical ranges of extant species versus the geographical provenance of the oldest fossil for each lineage. Africa and South America were reconstructed as the ancestral areas for Afrotheria and Xenarthra, respectively. Most methods reconstructed Eurasia as the ancestral area for Boreoeutheria, Euarchontoglires and Laurasiatheria. The coincidence of molecular dates for the separation of Afrotheria and Xenarthra at approximately 100 Ma with the plate tectonic sundering of Africa and South America hints at the importance of vicariance in the early history of Placentalia. Dispersal has also been important including the origins of Madagascar's endemic mammal fauna. Further studies will benefit from increased taxon sampling and the application of new ancestral area reconstruction methods.
ancestral areas; dispersal; historical biogeography; Mammalia; vicariance
Bacterial evolution is characterized by frequent gain and loss events of gene families. These events can be inferred from phyletic pattern data—a compact representation of gene family repertoire across multiple genomes. The maximum parsimony paradigm is a classical and prevalent approach for the detection of gene family gains and losses mapped on specific branches. We and others have previously developed probabilistic models that aim to account for the gain and loss stochastic dynamics. These models are a critical component of a methodology termed stochastic mapping, in which probabilities and expectations of gain and loss events are estimated for each branch of an underlying phylogenetic tree. In this work, we present a phyletic pattern simulator in which the gain and loss dynamics are assumed to follow a continuous-time Markov chain along the tree. Various models and options are implemented to make the simulation software useful for a large number of studies in which binary (presence/absence) data are analyzed. Using this simulation software, we compared the ability of the maximum parsimony and the stochastic mapping approaches to accurately detect gain and loss events along the tree. Our simulations cover a large array of evolutionary scenarios in terms of the propensities for gene family gains and losses and the variability of these propensities among gene families. Although in all simulation schemes, both methods obtain relatively low levels of false positive rates, stochastic mapping outperforms maximum parsimony in terms of true positive rates. We further studied the factors that influence the performance of both methods. We find, for example, that the accuracy of maximum parsimony inference is substantially reduced when the goal is to map gain and loss events along internal branches of the phylogenetic tree. 
Furthermore, the accuracy of stochastic mapping is reduced with smaller data sets (limited number of gene families) due to unreliable estimation of branch lengths. Our simulator and simulation results are additionally relevant for the analysis of other types of binary-coded data, such as the existence of homologous restriction sites, gaps, and introns, to name a few. Both the simulation software and the inference methodology are freely available at a user-friendly server: http://gloome.tau.ac.il/.
phyletic pattern; stochastic mapping; maximum parsimony; evolutionary models
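The core idea of simulating a phyletic pattern along a tree under a continuous-time Markov chain can be sketched as follows. This is not the published simulator: it assumes a single gain rate and a single loss rate, a stationary root distribution, and hypothetical names (`presence_prob`, `simulate_pattern`). It samples only the state at the end of each branch, so it does not record the individual gain and loss events along branches that an evaluation of event detection would need.

```python
import random
from math import exp

def presence_prob(state, gain, loss, t):
    """Probability that the family is present after time t, given the start state."""
    total = gain + loss
    pi1 = gain / total  # stationary probability of presence
    return pi1 + (state - pi1) * exp(-total * t)

def simulate_pattern(tree, root, gain, loss, rng):
    """Simulate one presence/absence pattern at the leaves.

    tree: dict mapping each internal node to a list of (child, branch_length)
    rng:  a random.Random instance, passed in for reproducibility
    """
    pattern = {}

    def descend(node, state):
        children = tree.get(node, [])
        if not children:
            pattern[node] = state  # leaf: record the simulated state
            return
        for child, t in children:
            child_state = int(rng.random() < presence_prob(state, gain, loss, t))
            descend(child, child_state)

    # Assumption: the root state follows the stationary distribution.
    root_state = int(rng.random() < gain / (gain + loss))
    descend(root, root_state)
    return pattern


# Hypothetical three-leaf tree with arbitrary branch lengths.
example_tree = {"root": [("X", 0.2), ("C", 0.5)], "X": [("A", 0.3), ("B", 0.3)]}
simulated = simulate_pattern(example_tree, "root", gain=0.5, loss=1.0,
                             rng=random.Random(42))
```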
Character mapping on phylogenies has played an important, if not critical, role in our understanding of molecular, morphological, and behavioral evolution. Until very recently we have relied on parsimony to infer character changes. Parsimony has a number of serious limitations that hinder this understanding. Recently developed statistical methods overcome these limitations by accommodating uncertainty in evolutionary time, ancestral states, and the phylogeny.
SIMMAP has been developed to implement stochastic character mapping in a form useful to molecular evolutionists, systematists, and bioinformaticians alike. Researchers can address questions about positive selection, patterns of amino acid substitution, character association, and patterns of morphological evolution.
Stochastic character mapping, as implemented in the SIMMAP software, enables users to address questions that require mapping characters onto phylogenies using a probabilistic approach that does not rely on parsimony. Analyses can be performed using a fully Bayesian approach that is not reliant on considering a single topology, set of substitution model parameters, or reconstruction of ancestral states. Uncertainty in these quantities is accommodated by using MCMC samples from their respective posterior distributions.
The presence of introns in protein-coding genes is a universal feature of eukaryotic genome organization, and the genes of multicellular eukaryotes, typically, contain multiple introns, a substantial fraction of which share position in distant taxa, such as plants and animals. Depending on the methods and data sets used, researchers have reached opposite conclusions on the causes of the high fraction of shared introns in orthologous genes from distant eukaryotes. Some studies conclude that shared intron positions reflect, almost entirely, a remarkable evolutionary conservation, whereas others attribute it to parallel gain of introns. To resolve these contradictions, it is crucial to analyze the evolution of introns by using a model that minimally relies on arbitrary assumptions.
We developed a probabilistic model of evolution that allows for variability of intron gain and loss rates over branches of the phylogenetic tree, individual genes, and individual sites. Applying this model to an extended set of conserved eukaryotic genes, we find that parallel gain, on average, accounts for only ~8% of the shared intron positions. However, the distribution of parallel gains over the phylogenetic tree of eukaryotes is highly non-uniform. There are, practically, no parallel gains in closely related lineages, whereas for distant lineages, such as animals and plants, parallel gains appear to contribute up to 20% of the shared intron positions. In accord with these findings, we estimated that ancestral introns have a high probability to be retained in extant genomes, and conversely, that a substantial fraction of extant introns have retained their positions since the early stages of eukaryotic evolution. In addition, the density of sites that are available for intron insertion is estimated to be, approximately, one in seven basepairs.
We obtained robust estimates of the contribution of parallel gain to the observed sharing of intron positions between eukaryotic species separated by different evolutionary distances. The results indicate that, although the contribution of parallel gains varies across the phylogenetic tree, the high level of intron position sharing is due, primarily, to evolutionary conservation. Accordingly, numerous introns appear to persist in the same position over hundreds of millions of years of evolution. This is compatible with recent observations of a negative correlation between the rate of intron gain and coding sequence evolution rate of a gene, suggesting that at least some of the introns are functionally relevant.
Phylogenetic approaches to inferring ancestral character states are becoming increasingly sophisticated; however, the potential remains for available methods to yield strongly supported but inaccurate ancestral state estimates. The consistency of ancestral states inferred for two or more characters affords a useful criterion for evaluating ancestral trait reconstructions. Ancestral state estimates for multiple characters that entail plausible phenotypes when considered together may reasonably be assumed to be reliable. However, the accuracy of inferred ancestral states for one or more characters may be questionable where combined reconstructions imply implausible phenotypes for a proportion of internal nodes. This criterion for assessing reconstructed ancestral states is applied here in evaluating inferences of ancestral limb morphology in the scincid lizard clade Lerista. Ancestral numbers of digits for the manus and pes inferred assuming the models that best fit the data entail ancestral digit configurations for many nodes that differ fundamentally from configurations observed among known species. However, when an alternative model is assumed for the pes, inferred ancestral digit configurations are invariably represented among observed phenotypes. This indicates that a suboptimal model for the pes (and not the model providing the best fit to the data) yields accurate ancestral state estimates.
ancestral state; Bayesian inference; Lerista; limb reduction; Squamata
Visualising the evolutionary history of a set of sequences is a challenge for molecular phylogenetics. One approach is to use undirected graphs, such as median networks, to visualise phylogenies where reticulate relationships such as recombination or homoplasy are displayed as cycles. Median networks contain binary representations of sequences as nodes, with edges connecting those sequences differing at one character; hypothetical ancestral nodes are invoked to generate a connected network which contains all most parsimonious trees. Quasi-median networks are a generalisation of median networks which are not restricted to binary data, although phylogenetic information contained within the multistate positions can be lost during the preprocessing of data. Where the history of a set of samples contain frequent homoplasies or recombination events quasi-median networks will have a complex topology. Graph reduction or pruning methods have been used to reduce network complexity but some of these methods are inapplicable to datasets in which recombination has occurred and others are procedurally complex and/or result in disconnected networks.
We address the problems inherent in construction and reduction of quasi-median networks. We describe a novel method of generating quasi-median networks that uses all characters, both binary and multistate, without imposing an arbitrary ordering of the multistate partitions. We also describe a pruning mechanism which maintains at least one shortest path between observed sequences, displaying the underlying relations between all pairs of sequences while maintaining a connected graph.
Application of this approach to 5S rDNA sequence data from sea beet produced a pruned network within which genetic isolation between populations by distance was evident, demonstrating the value of this approach for exploration of evolutionary relationships.
Ancestral sequence reconstruction is essential to a variety of evolutionary studies. Here, we present the FastML web server, a user-friendly tool for the reconstruction of ancestral sequences. FastML implements various novel features that differentiate it from existing tools: (i) FastML uses an indel-coding method, in which each gap, possibly spanning multiple sites, is coded as binary data. FastML then reconstructs ancestral indel states assuming a continuous time Markov process. FastML provides the most likely ancestral sequences, integrating both indels and characters; (ii) FastML accounts for uncertainty in ancestral states: it provides not only the posterior probabilities for each character and indel at each sequence position, but also a sample of ancestral sequences from this posterior distribution, and a list of the k-most likely ancestral sequences; (iii) FastML implements a large array of evolutionary models, which makes it generic and applicable for nucleotide, protein and codon sequences; and (iv) a graphical representation of the results is provided, including, for example, a graphical logo of the inferred ancestral sequences. The utility of FastML is demonstrated by reconstructing ancestral sequences of the Env protein from various HIV-1 subtypes. FastML is freely available for all academic users and is available online at http://fastml.tau.ac.il/.
Probabilistic models for sequence comparison (such as hidden Markov models and pair hidden Markov models for proteins and mRNAs, or their context-free grammar counterparts for structural RNAs) often assume a fixed degree of divergence. Ideally we would like these models to be conditional on evolutionary divergence time.
Probabilistic models of substitution events are well established, but there has not been a completely satisfactory theoretical framework for modeling insertion and deletion events.
I have developed a method for extending standard Markov substitution models to include gap characters, and another method for the evolution of state transition probabilities in a probabilistic model. These methods use instantaneous rate matrices in a way that is more general than those used for substitution processes, and are sufficient to provide time-dependent models for standard linear and affine gap penalties, respectively.
Given a probabilistic model, we can make all of its emission probabilities (including gap characters) and all its transition probabilities conditional on a chosen divergence time. To do this, we only need to know the parameters of the model at one particular divergence time instance, as well as the parameters of the model at the two extremes of zero and infinite divergence.
I have implemented these methods in a new generation of the RNA genefinder QRNA (eQRNA).
These methods can be applied to incorporate evolutionary models of insertions and deletions into any hidden Markov model or stochastic context-free grammar, in a pair or profile form, for sequence modeling.
Costly structures need to represent an adaptive advantage in order to be maintained over evolutionary times. Contrary to many other conspicuous shell ornamentations of gastropods, the haired shells of several Stylommatophoran land snails still lack a convincing adaptive explanation. In the present study, we analysed the correlation between the presence/absence of hairs and habitat conditions in the genus Trochulus in a Bayesian framework of character evolution.
Haired shells appeared to be the ancestral character state, a feature most probably lost three times independently. These losses were correlated with a shift from humid to dry habitats, indicating an adaptive function of hairs in moist environments. It had been previously hypothesised that these costly protein structures of the outer shell layer facilitate the locomotion in moist habitats. Our experiments, on the contrary, showed an increased adherence of haired shells to wet surfaces.
We propose the hypothesis that the possession of hairs facilitates the adherence of the snails to their herbaceous food plants during foraging when humidity levels are high. The absence of hairs in some Trochulus species could thus be explained as a loss of the potential adaptive function linked to habitat shifts.
Visual patterns in animals may serve different functions, such as attracting mates and deceiving predators. If a signal is used for multiple functions, the opportunity arises for conflict among the different functions, preventing optimization for any one visual signal. Here we investigate the hypothesis that spatial separation of different visual signal functions has occurred in Bicyclus butterflies. Using phylogenetic reconstructions of character evolution and comparisons of evolutionary rates, we found dorsal surface characters to evolve at higher rates than ventral characters. Dorsal characters also displayed sex-based differences in evolutionary rates more often than did ventral characters. Thus, dorsal characters corresponded to our predictions of mate signalling while ventral characters appear to play an important role in predator avoidance. Forewing characters also fit a model of mate signalling, and displayed higher rates of evolution than hindwing characters. Our results, as well as the behavioural and developmental data from previous studies of Bicyclus species, support the hypothesis that spatial separation of visual signal functions has occurred in Bicyclus butterflies. This study is the first to demonstrate, in a phylogenetic framework, that spatial separation of signals used for mate signalling and those used for predator avoidance is a viable strategy to accommodate multiple signal functions. This signalling strategy has important ramifications on the developmental evolution of wing pattern elements and diversification of butterfly species.
eyespot; modularity; Nymphalidae; likelihood; Bicyclus; wing patterns
Phylogenetic networks are models of evolution that go beyond trees, incorporating non-tree-like biological events such as recombination (or more generally reticulation), which occur either in a single species (meiotic recombination) or between species (reticulation due to lateral gene transfer and hybrid speciation). The central algorithmic problems are to reconstruct a plausible history of mutations and non-tree-like events, or to determine the minimum number of such events needed to derive a given set of binary sequences, allowing one mutation per site. Meiotic recombination, reticulation and recurrent mutation can cause conflict or incompatibility between pairs of sites (or characters) of the input. Previously, we used “conflict graphs” and “incompatibility graphs” to compute lower bounds on the minimum number of recombination nodes needed, and to efficiently solve constrained cases of the minimization problem. Those results exposed the structural and algorithmic importance of the non-trivial connected components of those two graphs.
In this paper, we more fully develop the structural importance of non-trivial connected components of the incompatibility and conflict graphs, proving a general decomposition theorem (first presented in Gusfield and Bansal 2005) for phylogenetic networks. The decomposition theorem depends only on the incompatibilities in the input sequences, and hence applies to phylogenetic networks of all types, and to any phenomenon that causes pairwise incompatibilities. More generally, the proof of the decomposition theorem exposes a maximal embedded tree structure that exists in the network when the sequences cannot be derived on a perfect phylogenetic tree. This extends the theory of perfect phylogeny in a natural and important way. The proof is constructive and leads to a polynomial-time algorithm to find the unique underlying maximal tree structure. We next examine and fully solve the major open question from Gusfield and Bansal (2005): Is it true that for every input there must be a fully decomposed phylogenetic network that minimizes the number of recombination nodes used, over all phylogenetic networks for the input? We previously conjectured that the answer is yes. In this paper we show that the answer is no, both for the case that only single-crossover recombination is allowed, and also for the case that unbounded multiple-crossover recombination is allowed. The latter case also resolves a conjecture recently stated in Huson and Klopper (2007) in the context of general reticulation networks. Although the conjecture from Gusfield and Bansal (2005) is disproved in general, we show that the answer to the conjecture is yes in several natural special cases, and establish necessary combinatorial structure that counterexamples to the conjecture must possess. We also show that counterexamples to the conjecture are rare (for the case of single-crossover recombination) in simulated data.
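To make the notion of pairwise incompatibility concrete, the following is a minimal sketch of how an incompatibility graph and its non-trivial connected components could be computed for binary sequences. It assumes the standard four-gamete test as the incompatibility criterion (two sites conflict when all four combinations 00, 01, 10 and 11 occur); the function names `incompatible` and `nontrivial_components` are illustrative and do not come from the paper.

```python
from itertools import combinations


def incompatible(col_a, col_b):
    """Four-gamete test (assumed criterion): two binary sites are
    incompatible if all four pairs 00, 01, 10, 11 are observed."""
    return len(set(zip(col_a, col_b))) == 4


def nontrivial_components(sequences):
    """Return the connected components of the incompatibility graph
    that contain at least two sites, as sorted lists of site indices."""
    n_sites = len(sequences[0])
    columns = [[seq[i] for seq in sequences] for i in range(n_sites)]

    # Build the graph: one node per site, one edge per incompatible pair.
    adjacency = {i: set() for i in range(n_sites)}
    for i, j in combinations(range(n_sites), 2):
        if incompatible(columns[i], columns[j]):
            adjacency[i].add(j)
            adjacency[j].add(i)

    # Depth-first search, skipping isolated (trivial) sites.
    seen, components = set(), []
    for start in range(n_sites):
        if start in seen or not adjacency[start]:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(sorted(component))
    return components
```

For the four sequences `"000"`, `"010"`, `"100"`, `"110"`, sites 0 and 1 fail the four-gamete test against each other while site 2 is constant, so the sketch reports a single non-trivial component containing sites 0 and 1.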
Molecular Evolution; Phylogenetic Networks; Perfect Phylogeny; Ancestral Recombination Graph; Recombination; Gene-Conversion; SNP
Little is known about the patterns of intron gain and loss or the relative contributions of these two processes to gene evolution. To investigate the dynamics of intron evolution, we analyzed orthologous genes from four filamentous fungal genomes and determined the pattern of intron conservation. We developed a probabilistic model to estimate the most likely rates of intron gain and loss giving rise to these observed conservation patterns. Our data reveal the surprising importance of intron gain. Between about 150 and 250 gains and between 150 and 350 losses were inferred in each lineage. We discuss one gene in particular (encoding 1-phosphoribosyl-5-pyrophosphate synthetase) that displays an unusually high rate of intron gain in multiple lineages. It has been recognized that introns are biased towards the 5′ ends of genes in intron-poor genomes but are evenly distributed in intron-rich genomes. Current models attribute this bias to 3′ intron loss through a poly-adenosine-primed reverse transcription mechanism. Contrary to standard models, we find no increased frequency of intron loss toward the 3′ ends of genes. Thus, recent intron dynamics do not support a model whereby 5′ intron positional bias is generated solely by 3′-biased intron loss.
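As an illustration of what a probabilistic gain and loss model evaluates, here is a minimal sketch that computes the likelihood of one intron presence-absence pattern on a fixed tree using the standard pruning recursion for a two-state character. It is not the model of the study: a single gain probability and a single loss probability shared by all branches, the nested-tuple tree encoding, the `root_present` prior, and the names `partial_likelihoods` and `pattern_likelihood` are all simplifying assumptions.

```python
def partial_likelihoods(tree, pattern, gain, loss):
    """Return (L0, L1): probability of the observed leaves below this node
    given that the intron is absent (L0) or present (L1) at the node.

    tree    -- a leaf name (str) or a tuple of subtrees (assumed encoding)
    pattern -- dict mapping leaf name to 0 (absent) or 1 (present)
    gain    -- per-branch probability of absent -> present (assumed shared)
    loss    -- per-branch probability of present -> absent (assumed shared)
    """
    if isinstance(tree, str):
        return (1.0, 0.0) if pattern[tree] == 0 else (0.0, 1.0)
    l0 = l1 = 1.0
    for child in tree:
        c0, c1 = partial_likelihoods(child, pattern, gain, loss)
        l0 *= (1.0 - gain) * c0 + gain * c1   # parent lacks the intron
        l1 *= loss * c0 + (1.0 - loss) * c1   # parent carries the intron
    return l0, l1


def pattern_likelihood(tree, pattern, gain, loss, root_present=0.5):
    """Likelihood of the pattern, averaging over the unknown root state."""
    l0, l1 = partial_likelihoods(tree, pattern, gain, loss)
    return (1.0 - root_present) * l0 + root_present * l1
```

Maximising the product of such likelihoods over all intron positions, with respect to the gain and loss parameters, is the general idea behind estimating "most likely rates"; lineage-specific rates would require one parameter pair per branch.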
A comparative study of four fungal genomes reveals the patterns of intron gain and loss over several hundred million years of evolution
Comparative analysis of sequenced genomes reveals numerous instances of apparent horizontal gene transfer (HGT), at least in prokaryotes, and indicates that lineage-specific gene loss might have been even more common in evolution. This complicates the notion of a species tree, which needs to be re-interpreted as a prevailing evolutionary trend, rather than the full depiction of evolution, and makes reconstruction of ancestral genomes a non-trivial task.
We addressed the problem of constructing parsimonious scenarios for individual sets of orthologous genes given a species tree. The orthologous sets were taken from the database of Clusters of Orthologous Groups of proteins (COGs). We show that the phyletic patterns (patterns of presence-absence in completely sequenced genomes) of almost 90% of the COGs are inconsistent with the hypothetical species tree. Algorithms were developed to reconcile the phyletic patterns with the species tree by postulating gene loss, COG emergence and HGT (the latter two classes of events were collectively treated as gene gains). We prove that each of these algorithms produces a parsimonious evolutionary scenario, which can be represented as mapping of loss and gain events on the species tree. The distribution of the evolutionary events among the tree nodes substantially depends on the underlying assumptions of the reconciliation algorithm, e.g. whether or not independent gene gains (gain after loss after gain) are permitted. Biological considerations suggest that, on average, gene loss might be a more likely event than gene gain. Therefore different gain penalties were used and the resulting series of reconstructed gene sets for the last universal common ancestor (LUCA) of the extant life forms were analysed. The number of genes in the reconstructed LUCA gene sets grows as the gain penalty increases. However, qualitative examination of the LUCA versions reconstructed with different gain penalties indicates that, even with a gain penalty of 1 (equal weights assigned to a gain and a loss), the set of 572 genes assigned to LUCA might be nearly sufficient to sustain a functioning organism. Under this gain penalty value, the numbers of horizontal gene transfer and gene loss events are nearly identical. This result holds true for two alternative topologies of the species tree and even under random shuffling of the tree. Therefore, the results seem to be compatible with approximately equal likelihoods of HGT and gene loss in the evolution of prokaryotes.
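The effect of the gain penalty can be illustrated with a generic weighted-parsimony (Sankoff-style) recursion on a presence-absence pattern. This is a stand-in sketch, not one of the reconciliation algorithms described above: the nested-tuple tree encoding, the unit loss cost, the rule that presence at the root costs one gain (emergence), and the names `subtree_costs` and `scenario_cost` are assumptions made for illustration.

```python
INF = float("inf")


def subtree_costs(tree, pattern, gain_penalty=1.0, loss_cost=1.0):
    """Return (cost_absent, cost_present): the minimum total cost of gain
    and loss events below this node for each possible state of the node.

    tree    -- a leaf name (str) or a tuple of subtrees (assumed encoding)
    pattern -- dict mapping leaf name to 0 (gene absent) or 1 (present)
    """
    if isinstance(tree, str):
        return (0.0, INF) if pattern[tree] == 0 else (INF, 0.0)
    absent = present = 0.0
    for child in tree:
        c0, c1 = subtree_costs(child, pattern, gain_penalty, loss_cost)
        absent += min(c0, c1 + gain_penalty)  # child gains the gene
        present += min(c1, c0 + loss_cost)    # child loses the gene
    return absent, present


def scenario_cost(tree, pattern, gain_penalty=1.0, loss_cost=1.0):
    """Cost of the cheapest scenario; presence at the root is charged one
    gain, treating the emergence of the gene as a gain (assumption)."""
    absent, present = subtree_costs(tree, pattern, gain_penalty, loss_cost)
    return min(absent, present + gain_penalty)
```

On the tree `(("A", "B"), ("C", "D"))` with the gene present only in A and C, a gain penalty of 1 gives a cost of 2, which two independent gains achieve. With a gain penalty of 3 the cheapest scenario costs 5: one emergence at the root followed by two losses. This mirrors the observation that more genes are assigned to the ancestor as the gain penalty increases.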
The notion that gene loss and HGT are major aspects of prokaryotic evolution was supported by quantitative analysis of the mapping of the phyletic patterns of COGs onto a hypothetical species tree. Algorithms were developed for constructing parsimonious evolutionary scenarios, which include gene loss and gain events, for orthologous gene sets, given a species tree. This analysis shows, contrary to expectations, that the number of predicted HGT events that occurred during the evolution of prokaryotes might be approximately the same as the number of gene losses. The approach to the reconstruction of evolutionary scenarios employed here is conservative with regard to the detection of HGT because only patterns of gene presence-absence in sequenced genomes are taken into account. In reality, horizontal transfer might have contributed to the evolution of many other genes also, which makes it a dominant force in prokaryotic evolution.
Since Darwin's Origin of Species, reconstructing the Tree of Life has been a goal of evolutionists, and tree-thinking has become a major concept of evolutionary biology. Practically, building the Tree of Life has proven to be tedious. Too few morphological characters are useful for conducting conclusive phylogenetic analyses at the highest taxonomic level. Consequently, molecular sequences (genes, proteins, and genomes) likely constitute the only useful characters for constructing a phylogeny of all life. For this reason, tree-makers expect a lot from gene comparisons. The simultaneous study of the largest number of molecular markers possible is sometimes considered to be one of the best solutions in reconstructing the genealogy of organisms. This conclusion is a direct consequence of tree-thinking: if gene inheritance conforms to a tree-like model of evolution, sampling more of these molecules will provide enough phylogenetic signal to build the Tree of Life. The selection of congruent markers is thus a fundamental step in simultaneous analysis of many genes.
Heat map analyses were used to investigate the congruence of orthologues in four datasets (archaeal, bacterial, eukaryotic and alpha-proteobacterial). We conclude that we simply cannot determine if a large portion of the genes have a common history. In addition, none of these datasets can be considered free of lateral gene transfer.
Our phylogenetic analyses do not support tree-thinking. These results have important conceptual and practical implications. We argue that representations other than a tree should be investigated in this case because a non-critical concatenation of markers could be highly misleading.
Motivation: Correlated events of gains and losses enable inference of co-evolution relations. The reconstruction of the co-evolutionary interactions network in prokaryotic species may elucidate functional associations among genes.
Results: We developed a novel probabilistic methodology for the detection of co-evolutionary interactions between pairs of genes. Using this method we inferred the co-evolutionary network among 4593 Clusters of Orthologous Genes (COGs). The number of co-evolutionary interactions substantially differed among COGs. Over 40% were found to co-evolve with at least one partner. We partitioned the network of co-evolutionary relations into clusters and uncovered multiple modular assemblies of genes with clearly defined functions. Finally, we measured the extent to which co-evolutionary relations coincide with other cellular relations such as genomic proximity, gene fusion propensity, co-expression, protein–protein interactions and metabolic connections. Our results show that co-evolutionary relations only partially overlap with these other types of networks. Our results suggest that the inferred co-evolutionary network in prokaryotes is highly informative towards revealing functional relations among genes, often showing signals that cannot be extracted from other network types.
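The idea of scoring correlated gain and loss events can be sketched with a far simpler stand-in than the probabilistic method itself: a Pearson correlation between two per-branch event vectors. The encoding (+1 for a gain, -1 for a loss, 0 for no event on a branch) and the function name `event_correlation` are assumptions for illustration only; the actual method infers events probabilistically rather than taking them as given.

```python
from math import sqrt


def event_correlation(events_a, events_b):
    """Pearson correlation between two equal-length per-branch event
    vectors (assumed encoding: +1 gain, -1 loss, 0 no event).
    Returns 0.0 when either vector has no variation."""
    n = len(events_a)
    mean_a = sum(events_a) / n
    mean_b = sum(events_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(events_a, events_b))
    var_a = sum((a - mean_a) ** 2 for a in events_a)
    var_b = sum((b - mean_b) ** 2 for b in events_b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)
```

Two genes gained and lost on the same branches score close to 1, while genes whose events fall on unrelated branches score near 0. A pairwise score of this kind, computed over all gene pairs and thresholded, would yield a network that can then be partitioned into clusters.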
Availability and implementation: Available under GPL license as open source.
Supplementary data are available at Bioinformatics online.