Molecular phylogenetics and phylogenomics are subject to noise from horizontal gene transfer (HGT) and bias from convergence in macromolecular compositions. Extensive variation in size, structure and base composition of alphaproteobacterial genomes has complicated their phylogenomics, sparking controversy over the origins and closest relatives of the SAR11 strains. SAR11 are highly abundant, cosmopolitan aquatic Alphaproteobacteria with streamlined, A+T-biased genomes. A dominant view holds that SAR11 are monophyletic and related to both Rickettsiales and the ancestor of mitochondria. Other studies dispute this, finding evidence of a polyphyletic origin of SAR11 with most strains distantly related to Rickettsiales. Although careful evolutionary modeling can reduce bias and noise in phylogenomic inference, entirely different approaches may be useful to extract robust phylogenetic signals from genomes. Here we develop simple phyloclassifiers from bioinformatically derived tRNA Class-Informative Features (CIFs), features predicted to target tRNAs for specific interactions within the tRNA interaction network. Our tRNA CIF-based model robustly and accurately classifies alphaproteobacterial genomes into one of seven undisputed monophyletic orders or families, despite great variability in tRNA gene complement sizes and base compositions. Our model robustly rejects monophyly of SAR11, classifying all but one strain as Rhizobiales with strong statistical support. Yet remarkably, conventional phylogenetic analysis of tRNAs classifies all SAR11 strains identically as Rickettsiales. We attribute this discrepancy to convergence of SAR11 and Rickettsiales tRNA base compositions. Thus, tRNA CIFs appear more robust to compositional convergence than tRNA sequences generally. Our results suggest that tRNA-CIF-based phyloclassification is robust to HGT of components of the tRNA interaction network, such as aminoacyl-tRNA synthetases. We explain why tRNAs are especially advantageous for prediction of traits governing macromolecular interactions from genomic data, and why such traits may be advantageous in the search for robust signals to address difficult problems in classification and phylogeny.
If gene products work well in the networks of foreign cells, their genes may transfer horizontally between unrelated genomes. What factors dictate the ability to integrate into foreign networks? Different RNAs and proteins must interact specifically in order to function well as a system. For example, tRNA functions are determined by the interactions they have with other macromolecules. We have developed ways to predict, from genomic data alone, how tRNAs distinguish themselves to their specific interaction partners. Here, as proof of concept, we built a robust computational model from these bioinformatic predictions in seven lineages of Alphaproteobacteria. We validated our model by classifying hundreds of diverse alphaproteobacterial taxa and tested it on eight strains of SAR11, a phylogenetically controversial group that is highly abundant in the world's oceans. We found that different strains of SAR11 are more distantly related, both to each other and to mitochondria, than widely believed. We explain conflicting results about SAR11 as an artifact of bias created by the variability in base contents of alphaproteobacterial genomes. While this bias affects tRNAs too, our classifier appears unexpectedly robust to it. More broadly, our results suggest that traits governing macromolecular interactions may be more faithfully vertically inherited than the macromolecules themselves.
Recently, various evolution-related journals adopted policies to encourage or require archiving of phylogenetic trees and associated data. Such attention to practices that promote sharing of data reflects rapidly improving information technology, and rapidly expanding potential to use this technology to aggregate and link data from previously published research. Nevertheless, little is known about current practices, or best practices, for publishing trees and associated data so as to promote re-use.
Here we summarize results of an ongoing analysis of current practices for archiving phylogenetic trees and associated data, current practices of re-use, and current barriers to re-use. We find that the technical infrastructure is available to support rudimentary archiving, but the frequency of archiving is low. Currently, most phylogenetic knowledge is not easily re-used due to a lack of archiving, lack of awareness of best practices, and lack of community-wide standards for formatting data, naming entities, and annotating data. Most attempts at data re-use seem to end in disappointment. Nevertheless, we find many positive examples of data re-use, particularly those that involve customized species trees generated by grafting to, and pruning from, a much larger tree.
The technologies and practices that facilitate data re-use can catalyze synthetic and integrative research. However, success will require engagement from various stakeholders including individual scientists who produce or consume shareable data, publishers, policy-makers, technology developers and resource-providers. The critical challenges for facilitating re-use of phylogenetic trees and associated data, we suggest, include: a broader commitment to public archiving; more extensive use of globally meaningful identifiers; development of user-friendly technology for annotating, submitting, searching, and retrieving data and their metadata; and development of a minimum reporting standard (MIAPA) indicating which kinds of data and metadata are most important for a re-useable phylogenetic record.
Evolution; Phylogeny; Data sharing; Bioinformatics; Phyloinformatics; Standards
In the past decade, transcriptome data have become an important component of many phylogenetic studies. They are a cost-effective source of protein-coding gene sequences, and have helped projects grow from a few genes to hundreds or thousands of genes. Phylogenetic studies now regularly include genes from newly sequenced transcriptomes, as well as publicly available transcriptomes and genomes. Implementing such a phylogenomic study, however, is computationally intensive, requires the coordinated use of many complex software tools, and includes multiple steps for which no published tools exist. Phylogenomic studies have therefore been manual or semiautomated. In addition to taking considerable user time, this makes phylogenomic analyses difficult to reproduce, compare, and extend. In addition, methodological improvements made in the context of one study often cannot be easily applied and evaluated in the context of other studies.
We present Agalma, an automated tool that constructs matrices for phylogenomic analyses. The user provides raw Illumina transcriptome data, and Agalma produces annotated assemblies, aligned gene sequence matrices, a preliminary phylogeny, and detailed diagnostics that allow the investigator to make extensive assessments of intermediate analysis steps and the final results. Sequences from other sources, such as externally assembled genomes and transcriptomes, can also be incorporated in the analyses. Agalma is built on the BioLite bioinformatics framework, which tracks provenance, profiles processor and memory use, records diagnostics, manages metadata, installs dependencies, logs version numbers and calls to external programs, and enables rich HTML reports for all stages of the analysis. Agalma includes a small test data set and a built-in test analysis of these data. In addition to describing Agalma, we here present a sample analysis of a larger seven-taxon data set. Agalma is available for download at https://bitbucket.org/caseywdunn/agalma.
Agalma allows complex phylogenomic analyses to be implemented and described unambiguously as a series of high-level commands. This will enable phylogenomic studies to be readily reproduced, modified, and extended. Agalma also facilitates methods development by providing a complete modular workflow, bundled with test data, that will allow further optimization of each step in the context of a full phylogenomic analysis.
Transcriptomes; Assembly; Phylogenetics; Homology; Workflow; Pipeline
Molecular sequence data have become the standard in modern day phylogenetics. In particular, several long-standing questions of mammalian evolutionary history have been recently resolved thanks to the use of molecular characters. Yet, most studies have focused on only a handful of standard markers. The availability of an ever increasing number of whole genome sequences is a golden mine for modern systematics. Genomic data now provide the opportunity to select new markers that are potentially relevant for further resolving branches of the mammalian phylogenetic tree at various taxonomic levels.
The EnsEMBL database was used to determine a set of orthologous genes from 12 available complete mammalian genomes. As targets for possible amplification and sequencing in additional taxa, more than 3,000 exons of length > 400 bp have been selected, among which 118, 368, 608, and 674 are respectively retrieved for 12, 11, 10, and 9 species. A bioinformatic pipeline has been developed to provide evolutionary descriptors for these candidate markers in order to assess their potential phylogenetic utility. The resulting OrthoMaM (Orthologous Mammalian Markers) database can be queried and alignments can be downloaded through a dedicated web interface .
The importance of marker choice in phylogenetic studies has long been stressed. Our database centered on complete genome information now makes possible to select promising markers to a given phylogenetic question or a systematic framework by querying a number of evolutionary descriptors. The usefulness of the database is illustrated with two biological examples. First, two potentially useful markers were identified for rodent systematics based on relevant evolutionary parameters and sequenced in additional species. Second, a complete, gapless 94 kb supermatrix of 118 orthologous exons was assembled for 12 mammals. Phylogenetic analyses using probabilistic methods unambiguously supported the new placental phylogeny by retrieving the monophyly of Glires, Euarchontoglires, Laurasiatheria, and Boreoeutheria. Muroid rodents thus do not represent a basal placental lineage as it was mistakenly reasserted in some recent phylogenomic analyses based on fewer taxa. We expect the OrthoMaM database to be useful for further resolving the phylogenetic tree of placental mammals and for better understanding the evolutionary dynamics of their genomes, i.e., the forces that shaped coding sequences in terms of selective constraints.
The constant accumulation of sequence data poses new computational and methodological challenges for phylogenetic inference, since multiple sequence alignments grow both in the horizontal (number of base pairs, phylogenomic alignments) as well as vertical (number of taxa) dimension. Put aside the ongoing controversial discussion about appropriate models, partitioning schemes, and assembly methods for phylogenomic alignments, coupled with the high computational cost to infer these, for many organismic groups, a sufficient number of taxa is often exclusively available from one or just a few genes (e.g., rbcL, matK, rDNA). In this paper we address scalability of Maximum-Likelihood-based phylogeny reconstruction with respect to the number of taxa by example of several large nested single-gene rbcL alignments comprising 400 up to 3,491 taxa. In order to test the effect of taxon sampling, we employ an appropriately adapted taxon jackknifing approach. In contrast to standard jackknifing, this taxon subsampling procedure is not conducted entirely at random, but based on drawing subsamples from empirical taxon-groups which can either be user-defined or determined by using taxonomic information from databases. Our results indicate that, despite an unfavorable number of sequences to number of base pairs ratio, i.e., many relatively short sequences, Maximum Likelihood tree searches and bootstrap analyses scale well on single-gene rbcL alignments with a dense taxon sampling up to several thousand sequences. Moreover, the newly implemented taxon subsampling procedure can be beneficial for inferring higher level relationships and interpreting bootstrap support from comprehensive analysis.
RAxML; phylogenetic inference; many taxon analyses; taxon jackknifing
A novel result of the current research is the development and implementation of a unique functional phylogenomic approach that explores the genomic origins of seed plant diversification. We first use 22,833 sets of orthologs from the nuclear genomes of 101 genera across land plants to reconstruct their phylogenetic relationships. One of the more salient results is the resolution of some enigmatic relationships in seed plant phylogeny, such as the placement of Gnetales as sister to the rest of the gymnosperms. In using this novel phylogenomic approach, we were also able to identify overrepresented functional gene ontology categories in genes that provide positive branch support for major nodes prompting new hypotheses for genes associated with the diversification of angiosperms. For example, RNA interference (RNAi) has played a significant role in the divergence of monocots from other angiosperms, which has experimental support in Arabidopsis and rice. This analysis also implied that the second largest subunit of RNA polymerase IV and V (NRPD2) played a prominent role in the divergence of gymnosperms. This hypothesis is supported by the lack of 24nt siRNA in conifers, the maternal control of small RNA in the seeds of flowering plants, and the emergence of double fertilization in angiosperms. Our approach takes advantage of genomic data to define orthologs, reconstruct relationships, and narrow down candidate genes involved in plant evolution within a phylogenomic view of species' diversification.
Understanding the genetic and genomic basis of plant diversification has been a major goal of evolutionary biologists since Darwin first pondered his “abominable mystery,” the rapid diversification of the angiosperms in the fossil record. We develop and deploy a functional phylogenomic approach that helps identify genes and biological processes putatively involved in species diversification. We assembled a matrix of 22,833 orthologs from 150 species to reconstruct seed plant phylogenetic relationships and to identify gene sets with a unique evolutionary signal. Our analysis of overrepresented biological processes in these sets narrowed down possible genetic mechanisms underlying plant adaptation and diversification. The phylogenetic relationships we uncovered support the hypothesis that gnetophytes are closely related to the rest of the gymnosperms at the base of the living seed plants. We also found that genes involved in post-transcriptional silencing via RNA interference (RNAi)—increasingly important in understanding plant evolution—are significantly represented early in angiosperm and gymnosperm divergence, with an apparent loss of specific classes of small interfering RNAs (siRNA) in gymnosperms. Our functional phylogenomic approach can be applied to any taxa with available sequences to enhance our knowledge of the evolutionary processes underlying biodiversity in general.
We present a statistical graphical model to infer specific molecular function for unannotated protein sequences using homology. Based on phylogenomic principles, SIFTER (Statistical Inference of Function Through Evolutionary Relationships) accurately predicts molecular function for members of a protein family given a reconciled phylogeny and available function annotations, even when the data are sparse or noisy. Our method produced specific and consistent molecular function predictions across 100 Pfam families in comparison to the Gene Ontology annotation database, BLAST, GOtcha, and Orthostrapper. We performed a more detailed exploration of functional predictions on the adenosine-5′-monophosphate/adenosine deaminase family and the lactate/malate dehydrogenase family, in the former case comparing the predictions against a gold standard set of published functional characterizations. Given function annotations for 3% of the proteins in the deaminase family, SIFTER achieves 96% accuracy in predicting molecular function for experimentally characterized proteins as reported in the literature. The accuracy of SIFTER on this dataset is a significant improvement over other currently available methods such as BLAST (75%), GeneQuiz (64%), GOtcha (89%), and Orthostrapper (11%). We also experimentally characterized the adenosine deaminase from Plasmodium falciparum, confirming SIFTER's prediction. The results illustrate the predictive power of exploiting a statistical model of function evolution in phylogenomic problems. A software implementation of SIFTER is available from the authors.
New genome sequences continue to be published at a prodigious rate. However, unannotated sequences are of limited use to biologists. To computationally annotate a hypothetical protein for molecular function, researchers generally attempt to carry out some form of information transfer from evolutionarily related proteins. Such transfer is most successfully achieved within the context of phylogenetic relationships, exploiting the comprehensive knowledge that is available regarding molecular evolution within a given protein family. This general approach to molecular function annotation is known as phylogenomics, and it is the best method currently available for providing high-quality annotations. A drawback of phylogenomics, however, is that it is a time-consuming manual process requiring expert knowledge. In the current paper, the authors have developed a statistical approach—referred to as SIFTER (Statistical Inference of Function Through Evolutionary Relationships)—that allows phylogenomic analyses to be carried out automatically.
The authors present the results of running SIFTER on a collection of 100 protein families. They also validate their method on a specific family for which a gold standard set of experimental annotations is available. They show that SIFTER annotates 96% of the gold standard proteins correctly, outperforming popular annotation methods including BLAST-based annotation (75%), GOtcha (89%), GeneQuiz (64%), and Orthostrapper (11%). The results support the feasibility of carrying out high-quality phylogenomic analyses of entire genomes.
Genome level analyses have enhanced our view of phylogenetics in many areas of the tree of life. With the production of whole genome DNA sequences of hundreds of organisms and large-scale EST databases a large number of candidate genes for inclusion into phylogenetic analysis have become available. In this work, we exploit the burgeoning genomic data being generated for plant genomes to address one of the more important plant phylogenetic questions concerning the hierarchical relationships of the several major seed plant lineages (angiosperms, Cycadales, Gingkoales, Gnetales, and Coniferales), which continues to be a work in progress, despite numerous studies using single, few or several genes and morphology datasets. Although most recent studies support the notion that gymnosperms and angiosperms are monophyletic and sister groups, they differ on the topological arrangements within each major group.
We exploited the EST database to construct a supermatrix of DNA sequences (over 1,200 concatenated orthologous gene partitions for 17 taxa) to examine non-flowering seed plant relationships. This analysis employed programs that offer rapid and robust orthology determination of novel, short sequences from plant ESTs based on reference seed plant genomes. Our phylogenetic analysis retrieved an unbiased (with respect to gene choice), well-resolved and highly supported phylogenetic hypothesis that was robust to various outgroup combinations.
We evaluated character support and the relative contribution of numerous variables (e.g. gene number, missing data, partitioning schemes, taxon sampling and outgroup choice) on tree topology, stability and support metrics. Our results indicate that while missing characters and order of addition of genes to an analysis do not influence branch support, inadequate taxon sampling and limited choice of outgroup(s) can lead to spurious inference of phylogeny when dealing with phylogenomic scale data sets. As expected, support and resolution increases significantly as more informative characters are added, until reaching a threshold, beyond which support metrics stabilize, and the effect of adding conflicting characters is minimized.
Camellia is an economically and phylogenetically important genus in the family Theaceae. Owing to numerous hybridization and polyploidization, it is taxonomically and phylogenetically ranked as one of the most challengingly difficult taxa in plants. Sequence comparisons of chloroplast (cp) genomes are of great interest to provide a robust evidence for taxonomic studies, species identification and understanding mechanisms that underlie the evolution of the Camellia species.
The eight complete cp genomes and five draft cp genome sequences of Camellia species were determined using Illumina sequencing technology via a combined strategy of de novo and reference-guided assembly. The Camellia cp genomes exhibited typical circular structure that was rather conserved in genomic structure and the synteny of gene order. Differences of repeat sequences, simple sequence repeats, indels and substitutions were further examined among five complete cp genomes, representing a wide phylogenetic diversity in the genus. A total of fifteen molecular markers were identified with more than 1.5% sequence divergence that may be useful for further phylogenetic analysis and species identification of Camellia. Our results showed that, rather than functional constrains, it is the regional constraints that strongly affect sequence evolution of the cp genomes. In a substantial improvement over prior studies, evolutionary relationships of the section Thea were determined on basis of phylogenomic analyses of cp genome sequences.
Despite a high degree of conservation between the Camellia cp genomes, sequence variation among species could still be detected, representing a wide phylogenetic diversity in the genus. Furthermore, phylogenomic analysis was conducted using 18 complete cp genomes and 5 draft cp genome sequences of Camellia species. Our results support Chang’s taxonomical treatment that C. pubicosta may be classified into sect. Thea, and indicate that taxonomical value of the number of ovaries should be reconsidered when classifying the Camellia species. The availability of these cp genomes provides valuable genetic information for accurately identifying species, clarifying taxonomy and reconstructing the phylogeny of the genus Camellia.
Camellia; Chloroplast genome; Phylogenetic relationships; Genomic structure; Taxonomic identification
Polyamine oxidase enzymes catalyze the oxidation of polyamines and acetylpolyamines. Since polyamines are basic regulators of cell growth and proliferation, their homeostasis is crucial for cell life. Members of the polyamine oxidase gene family have been identified in a wide variety of animals, including vertebrates, arthropodes, nematodes, placozoa, as well as in plants and fungi. Polyamine oxidases (PAOs) from yeast can oxidize spermine, N1-acetylspermine, and N1-acetylspermidine, however, in vertebrates two different enzymes, namely spermine oxidase (SMO) and acetylpolyamine oxidase (APAO), specifically catalyze the oxidation of spermine, and N1-acetylspermine/N1-acetylspermidine, respectively. Little is known about the molecular evolutionary history of these enzymes. However, since the yeast PAO is able to catalyze the oxidation of both acetylated and non acetylated polyamines, and in vertebrates these functions are addressed by two specialized polyamine oxidase subfamilies (APAO and SMO), it can be hypothesized an ancestral reference for the former enzyme from which the latter would have been derived.
We analysed 36 SMO, 26 APAO, and 14 PAO homologue protein sequences from 54 taxa including various vertebrates and invertebrates. The analysis of the full-length sequences and the principal domains of vertebrate and invertebrate PAOs yielded consensus primary protein sequences for vertebrate SMOs and APAOs, and invertebrate PAOs. This analysis, coupled to molecular modeling techniques, also unveiled sequence regions that confer specific structural and functional properties, including substrate specificity, by the different PAO subfamilies. Molecular phylogenetic trees revealed a basal position of all the invertebrates PAO enzymes relative to vertebrate SMOs and APAOs. PAOs from insects constitute a monophyletic clade. Two PAO variants sampled in the amphioxus are basal to the dichotomy between two well supported monophyletic clades including, respectively, all the SMOs and APAOs from vertebrates. The two vertebrate monophyletic clades clustered strictly mirroring the organismal phylogeny of fishes, amphibians, reptiles, birds, and mammals. Evidences from comparative genomic analysis, structural evolution and functional divergence in a phylogenetic framework across Metazoa suggested an evolutionary scenario where the ancestor PAO coding sequence, present in invertebrates as an orthologous gene, has been duplicated in the vertebrate branch to originate the paralogous SMO and APAO genes. A further genome evolution event concerns the SMO gene of placental, but not marsupial and monotremate, mammals which increased its functional variation following an alternative splicing (AS) mechanism.
In this study the explicit integration in a phylogenomic framework of phylogenetic tree construction, structure prediction, and biochemical function data/prediction, allowed inferring the molecular evolutionary history of the PAO gene family and to disambiguate paralogous genes related by duplication event (SMO and APAO) and orthologous genes related by speciation events (PAOs, SMOs/APAOs). Further, while in vertebrates experimental data corroborate SMO and APAO molecular function predictions, in invertebrates the finding of a supported phylogenetic clusters of insect PAOs and the co-occurrence of two PAO variants in the amphioxus urgently claim the need for future structure-function studies.
All characters and trait systems in an organism share a common evolutionary history that can be estimated using phylogenetic methods. However, differential rates of change and the evolutionary mechanisms driving those rates result in pervasive phylogenetic conflict. These drivers need to be uncovered because mismatches between evolutionary processes and phylogenetic models can lead to high confidence in incorrect hypotheses. Incongruence between phylogenies derived from morphological versus molecular analyses, and between trees based on different subsets of molecular sequences has become pervasive as datasets have expanded rapidly in both characters and species. For more than a decade, evolutionary relationships among members of the New World bat family Phyllostomidae inferred from morphological and molecular data have been in conflict. Here, we develop and apply methods to minimize systematic biases, uncover the biological mechanisms underlying phylogenetic conflict, and outline data requirements for future phylogenomic and morphological data collection. We introduce new morphological data for phyllostomids and outgroups and expand previous molecular analyses to eliminate methodological sources of phylogenetic conflict such as taxonomic sampling, sparse character sampling, or use of different algorithms to estimate the phylogeny. We also evaluate the impact of biological sources of conflict: saturation in morphological changes and molecular substitutions, and other processes that result in incongruent trees, including convergent morphological and molecular evolution. Methodological sources of incongruence play some role in generating phylogenetic conflict, and are relatively easy to eliminate by matching taxa, collecting more characters, and applying the same algorithms to optimize phylogeny. The evolutionary patterns uncovered are consistent with multiple biological sources of conflict, including saturation in morphological and molecular changes, adaptive morphological convergence among nectar-feeding lineages, and incongruent gene trees. Applying methods to account for nucleotide sequence saturation reduces, but does not completely eliminate, phylogenetic conflict. We ruled out paralogy, lateral gene transfer, and poor taxon sampling and outgroup choices among the processes leading to incongruent gene trees in phyllostomid bats. Uncovering and countering the possible effects of introgression and lineage sorting of ancestral polymorphism on gene trees will require great leaps in genomic and allelic sequencing in this species-rich mammalian family. We also found evidence for adaptive molecular evolution leading to convergence in mitochondrial proteins among nectar-feeding lineages. In conclusion, the biological processes that generate phylogenetic conflict are ubiquitous, and overcoming incongruence requires better models and more data than have been collected even in well-studied organisms such as phyllostomid bats.
adaptive convergence; incongruence; gene trees; partitioned likelihood support; phylogeny; Phyllostomidae; saturation; species trees
Molecular phylogenetics and phylogenomics have greatly revised and enriched the fungal systematics in the last two decades. Most of the analyses have been performed by comparing single or multiple orthologous gene regions. Sequence alignment has always been an essential element in tree construction. These alignment-based methods (to be called the standard methods hereafter) need independent verification in order to put the fungal Tree of Life (TOL) on a secure footing. The ever-increasing number of sequenced fungal genomes and the recent success of our newly proposed alignment-free composition vector tree (CVTree, see Methods) approach have made the verification feasible.
In all, 82 fungal genomes covering 5 phyla were obtained from the relevant genome sequencing centers. An unscaled phylogenetic tree with 3 outgroup species was constructed by using the CVTree method. Overall, the resultant phylogeny infers all major groups in accordance with standard methods. Furthermore, the CVTree provides information on the placement of several currently unsettled groups. Within the sub-phylum Pezizomycotina, our phylogeny places the Dothideomycetes and Eurotiomycetes as sister taxa. Within the Sordariomycetes, it infers that Magnaporthe grisea and the Plectosphaerellaceae are closely related to the Sordariales and Hypocreales, respectively. Within the Eurotiales, it supports that Aspergillus nidulans is the early-branching species among the 8 aspergilli. Within the Onygenales, it groups Histoplasma and Paracoccidioides together, supporting that the Ajellomycetaceae is a distinct clade from Onygenaceae. Within the sub-phylum Saccharomycotina, the CVTree clearly resolves two clades: (1) species that translate CTG as serine instead of leucine (the CTG clade) and (2) species that have undergone whole-genome duplication (the WGD clade). It places Candida glabrata at the base of the WGD clade.
Using different input data and methodology, the CVTree approach is a good complement to the standard methods. The remarkable consistency between them has brought about more confidence to the current understanding of the fungal branch of TOL.
The entire evolutionary history of life can be studied using myriad sequences generated by genomic research. This includes the appearance of the first cells and of superkingdoms Archaea, Bacteria, and Eukarya. However, the use of molecular sequence information for deep phylogenetic analyses is limited by mutational saturation, differential evolutionary rates, lack of sequence site independence, and other biological and technical constraints. In contrast, protein structures are evolutionary modules that are highly conserved and diverse enough to enable deep historical exploration.
Here we build phylogenies that describe the evolution of proteins and proteomes. These phylogenetic trees are derived from a genomic census of protein domains defined at the fold family (FF) level of structural classification. Phylogenomic trees of FF structures were reconstructed from genomic abundance levels of 2,397 FFs in 420 proteomes of free-living organisms. These trees defined timelines of domain appearance, with time spanning from the origin of proteins to the present. Timelines are divided into five different evolutionary phases according to patterns of sharing of FFs among superkingdoms: (1) a primordial protein world, (2) reductive evolution and the rise of Archaea, (3) the rise of Bacteria from the common ancestor of Bacteria and Eukarya and early development of the three superkingdoms, (4) the rise of Eukarya and widespread organismal diversification, and (5) eukaryal diversification. The relative ancestry of the FFs shows that reductive evolution by domain loss is dominant in the first three phases and is responsible for both the diversification of life from a universal cellular ancestor and the appearance of superkingdoms. On the other hand, domain gains are predominant in the last two phases and are responsible for organismal diversification, especially in Bacteria and Eukarya.
The evolution of functions that are associated with corresponding FFs along the timeline reveals that primordial metabolic domains evolved earlier than informational domains involved in translation and transcription, supporting the metabolism-first hypothesis rather than the RNA world scenario. In addition, phylogenomic trees of proteomes reconstructed from FFs appearing in each of the five phases of the protein world show that trees reconstructed from ancient domain structures were consistently rooted in archaeal lineages, supporting the proposal that the archaeal ancestor is more ancient than the ancestors of other superkingdoms.
Cymbidium orchids, including some 50 species, are the famous flowers, and they possess high commercial value in the floricultural industry. Furthermore, the values of different orchids are great differences. However, species identification is very difficult. To a certain degree, chloroplast DNA sequence data are a versatile tool for species identification and phylogenetic implications in plants. Different chloroplast loci have been utilized for evaluating phylogenetic relationships at each classification level among plant species, including at the interspecies and intraspecies levels. However, there is no evidence that a short sequence can distinguish all plant species from each other in order to infer phylogenetic relationships. Molecular markers derived from the complete chloroplast genome can provide effective tools for species identification and phylogenetic resolution.
The complete nucleotide sequences of eight individuals from a total of five Cymbidium species’ chloroplast (cp) genomes were determined using Illumina sequencing technology of the total DNA via a combination of de novo and reference-guided assembly. The length of the Cymbidium cp genome is about 155 kb. The cp genomes contain 123 unique genes, and the IR regions contain 24 duplicates. Although the genomes, including genome structure, gene order and orientation, are similar to those of other orchids, they are not evolutionarily conservative. The cp genome of Cymbidium evolved moderately with more than 3% sequence divergence, which could provide enough information for phylogeny. Rapidly evolving chloroplast genome regions were identified and 11 new divergence hotspot regions were disclosed for further phylogenetic study and species identification in Orchidaceae.
Phylogenomic analyses were conducted using 10 complete chloroplast genomes from seven orchid species. These data accurately identified the individuals and established the phylogenetic relationships between the species. The results reveal that phylogenomics based on organelle genome sequencing lights the species identification—organelle-scale “barcodes”, and is also an effective approach for studying whole populations and phylogenetic characteristics of Cymbidium.
Chloroplast genome; Phylogenomics; Species identification; Organelle-scale barcodes; Phylogeny; Divergence hotspot
Molecular phylogenetics relies on accurate identification of orthologous sequences among the taxa of interest. Most orthology inference programs available for use in phylogenomics rely on small sets of pre-defined orthologs from model organisms or phenetic approaches such as all-versus-all sequence comparisons followed by Markov graph-based clustering. Such approaches have high sensitivity but may erroneously include paralogous sequences. We developed PhyloTreePruner, a software utility that uses a phylogenetic approach to refine orthology inferences made using phenetic methods. PhyloTreePruner checks single-gene trees for evidence of paralogy and generates a new alignment for each group containing only sequences inferred to be orthologs. Importantly, PhyloTreePruner takes into account support values on the tree and avoids unnecessarily deleting sequences in cases where a weakly supported tree topology incorrectly indicates paralogy. A test of PhyloTreePruner on a dataset generated from 11 completely sequenced arthropod genomes identified 2,027 orthologous groups sampled for all taxa. Phylogenetic analysis of the concatenated supermatrix yielded a generally well-supported topology that was consistent with the current understanding of arthropod phylogeny. PhyloTreePruner is freely available from http://sourceforge.net/projects/phylotreepruner/.
phylogenomic; orthology; paralogy; gene tree
Phylogenetic hypotheses based on complete genome data are presented for the Gammaproteobacteria family Vibrionaceae. Two taxon samplings are presented: one including all those taxa for which the genome sequences are complete in terms of arrangement (chromosomal location of fragments; 19 taxa) and one for which the genome sequences contain multiple contigs (44 taxa). Analyses are presented under the Maximum Parsimony and Maximum Likelihood optimality criteria for total evidence datasets, the two chromosomes separately, and individual analyses of locally collinear blocks. Three of the genomes included in the 44 taxon dataset, those of Vibrio gazogenes, Salinivibrio costicola, and Aliivibrio logei have been newly sequenced and their genome sequences are documented here.
Phylogenetic results for the 19-taxon datasets show similar levels of collinear subset of dataset incongruence as a previous study of 22 taxa from the sister family Shewanellaceae, while also echoing the strong phylogenetic performance of random subsets of data also shown in this study. Phylogenetic results for both the 19-taxon and 44-taxon datasets corroborate previous hypotheses about the placement of Photobacterium and Aliivibrio within Vibrionaceae and also highlight problems with how Photobacterium is delimited and indicate that it likely should be dissolved into Vibrio to produce a phylogenetic taxonomy. The 19-taxon and 44-taxon trees based on the large chromosome are congruent for the majority of taxa that are present in both datasets. Analyses of the 44-taxon sampling based on the second, small chromosome are quite different from those based on the large chromosome, which is not surprising given the dramatically divergent nature of the small chromosome and the difficulty in postulating primary homologies.
The phylogenetic analyses presented here represent the most comprehensive genome-level phylogenetic analyses in terms of taxa and data. Based on the availability of genome data for many bacterial species on GenBank, many other bacterial groups would also be amenable to similar genome-scale phylogenetic analyses even when present in multiple contigs. The result that collinear subsets of data are incongruent with the concatenated dataset and with each other while random data subsets show very little incongruence echoes the result of previous work on Shewanellaceae. The 44-taxon phylogenetic analysis presented here thus represents the future of phylogenomic analyses in scope and complexity.
Understanding the evolutionary relationships of all eukaryotes on Earth remains a paramount goal of modern biology, yet analyzing homologous sequences across 1.8 billion years of eukaryotic evolution is challenging. Many existing tools for identifying gene orthologs are inadequate when working with heterogeneous rates of evolution and endosymbiotic/lateral gene transfer. Moreover, genomic-scale sequencing, which was once the domain of large sequencing centers, has advanced to the point where small laboratories can now generate the data needed for phylogenomic studies. This has opened the door for increased taxonomic sampling as individual research groups have the ability to conduct genome-scale projects on their favorite non-model organism.
Here we present some of the tools developed, and insights gained, as we created a pipeline that combines data-mining from public databases and our own transcriptome data to study the eukaryotic tree of life. The first steps of a phylogenomic pipeline involve choosing taxa and loci, and making decisions about how to handle alleles, paralogs and non-overlapping sequences. Next, orthologs are aligned for analyses including gene tree reconstruction and concatenation for supermatrix approaches. To build our pipeline, we created scripts written in Python that integrate third-party tools with custom methods. As a test case, we present the placement of five amoebae on the eukaryotic tree of life based on analyses of transcriptome data. Our scripts available on GitHUb and may be used as-is for automated analyses of large scale phylogenomics, or adapted for use in other types of studies.
Analyses on the scale of all eukaryotes present challenges not necessarily found in studies of more closely related organisms. Our approach will be of relevance to others for whom existing third-party tools fail to fully answer desired phylogenetic questions.
The relationships among all diploid genome types of the rice genus were clarified using 142 single-copy genes
The completion of rice genome sequencing has made rice and its wild relatives an attractive system for biological studies. Despite great efforts, phylogenetic relationships among genome types and species in the rice genus have not been fully resolved. To take full advantage of rice genome resources for biological research and rice breeding, we will benefit from the availability of a robust phylogeny of the rice genus.
Through screening rice genome sequences, we sampled and sequenced 142 single-copy genes to clarify the relationships among all diploid genome types of the rice genus. The analysis identified two short internal branches around which most previous phylogenetic inconsistency emerged. These represent two episodes of rapid speciation that occurred approximately 5 and 10 million years ago (Mya) and gave rise to almost the entire diversity of the genus. The known chromosomal distribution of the sampled genes allowed the documentation of whole-genome sorting of ancestral alleles during the rapid speciation, which was responsible primarily for extensive incongruence between gene phylogenies and persisting phylogenetic ambiguity in the genus. Random sample analysis showed that 120 genes with an average length of 874 bp were needed to resolve both short branches with 95% confidence.
Our phylogenomic analysis successfully resolved the phylogeny of rice genome types, which lays a solid foundation for comparative and functional genomic studies of rice and its relatives. This study also highlights that organismal genomes might be mosaics of conflicting genealogies because of rapid speciation and demonstrates the power of phylogenomics in the reconstruction of rapid diversification.
Reconstructing the evolutionary relationships of species is a major goal in biology. Despite the increasing number of completely sequenced genomes, a large number of phylogenetic projects rely on targeted sequencing and analysis of a relatively small sample of marker genes. The selection of these phylogenetic markers should ideally be based on accurate predictions of their combined, rather than individual, potential to accurately resolve the phylogeny of interest. Here we present and validate a new phylogenomics strategy to efficiently select a minimal set of stable markers able to reconstruct the underlying species phylogeny. In contrast to previous approaches, our methodology does not only rely on the ability of individual genes to reconstruct a known phylogeny, but it also explores the combined power of sets of concatenated genes to accurately infer phylogenetic relationships of species not previously analyzed. We applied our approach to two broad sets of cyanobacterial and ascomycetous fungal species, and provide two minimal sets of six and four genes, respectively, necessary to fully resolve the target phylogenies. This approach paves the way for the informed selection of phylogenetic markers in the effort of reconstructing the tree of life.
The Escherichia coli species contains a variety of commensal and pathogenic strains, and its intraspecific diversity is extraordinarily high. With the availability of an increasing number of E. coli strain genomes, a more comprehensive concept of their evolutionary history and ecological adaptation can be developed using phylogenomic analyses. In this study, we constructed two types of whole-genome phylogenies based on 34 E. coli strains using collinear genomic segments. The first phylogeny was based on the concatenated collinear regions shared by all of the studied genomes, and the second phylogeny was based on the variable collinear regions that are absent from at least one genome. Intuitively, the first phylogeny is likely to reveal the lineal evolutionary history among these strains (i.e., an evolutionary phylogeny), whereas the latter phylogeny is likely to reflect the whole-genome similarities of extant strains (i.e., a similarity phylogeny).
Within the evolutionary phylogeny, the strains were clustered in accordance with known phylogenetic groups and phenotypes. When comparing evolutionary and similarity phylogenies, a concept emerges that Shigella may have originated from at least three distinct ancestors and evolved into a single clade. By scrutinizing the properties that are shared amongst Shigella strains but missing in other E. coli genomes, we found that the common regions of the Shigella genomes were mainly influenced by mobile genetic elements, implying that they may have experienced convergent evolution via horizontal gene transfer. Based on an inspection of certain key branches of interest, we identified several collinear regions that may be associated with the pathogenicity of specific strains. Moreover, by examining the annotated genes within these regions, further detailed evidence associated with pathogenicity was revealed.
Collinear regions are reliable genomic features used for phylogenomic analysis among closely related genomes while linking the genomic diversity with phenotypic differences in a meaningful way. The pathogenicity of a strain may be associated with both the arrival of virulence factors and the modification of genomes via mutations. Such phylogenomic studies that compare collinear regions of whole genomes will help to better understand the evolution and adaptation of closely related microbes and E. coli in particular.
Phylogenomics; Collinearity; Pathogenicity; Escherichia coli; Adaptation
Phylogenomic approaches to the resolution of inter-species relationships have become well established in recent years. Often these involve concatenation of many orthologous genes found in the respective genomes followed by analysis using standard phylogenetic models. Genome-scale data promise increased resolution by minimising sampling error, yet are associated with well-known but often inappropriately addressed caveats arising through data heterogeneity and model violation. These can lead to the reconstruction of highly-supported but incorrect topologies. With the aim of obtaining a species tree for 18 species within the ascomycetous yeasts, we have investigated the use of appropriate evolutionary models to address inter-gene heterogeneities and the scalability and validity of supermatrix analysis as the phylogenetic problem becomes more difficult and the number of genes analysed approaches truly phylogenomic dimensions. We have extended a widely-known early phylogenomic study of yeasts by adding additional species to increase diversity and augmenting the number of genes under analysis. We have investigated sophisticated maximum likelihood analyses, considering not only a concatenated version of the data but also partitioned models where each gene constitutes a partition and parameters are free to vary between the different partitions (thereby accounting for variation in the evolutionary processes at different loci). We find considerable increases in likelihood using these complex models, arguing for the need for appropriate models when analyzing phylogenomic data. Using these methods, we were able to reconstruct a well-supported tree for 18 ascomycetous yeasts spanning about 250 million years of evolution.
Molecular systematics occupies one of the central stages in biology in the genomic era, ushered in by unprecedented progress in DNA technology. The inference of organismal phylogeny is now based on many independent genetic loci, a widely accepted approach to assemble the tree of life. Surprisingly, this approach is hindered by lack of appropriate nuclear gene markers for many taxonomic groups especially at high taxonomic level, partially due to the lack of tools for efficiently developing new phylogenetic makers. We report here a genome-comparison strategy to identifying nuclear gene markers for phylogenetic inference and apply it to the ray-finned fishes – the largest vertebrate clade in need of phylogenetic resolution.
A total of 154 candidate molecular markers – relatively well conserved, putatively single-copy gene fragments with long, uninterrupted exons – were obtained by comparing whole genome sequences of two model organisms, Danio rerio and Takifugu rubripes. Experimental tests of 15 of these (randomly picked) markers on 36 taxa (representing two-thirds of the ray-finned fish orders) demonstrate the feasibility of amplifying by PCR and directly sequencing most of these candidates from whole genomic DNA in a vast diversity of fish species. Preliminary phylogenetic analyses of sequence data obtained for 14 taxa and 10 markers (total of 7,872 bp for each species) are encouraging, suggesting that the markers obtained will make significant contributions to future fish phylogenetic studies.
We present a practical approach that systematically compares whole genome sequences to identify single-copy nuclear gene markers for inferring phylogeny. Our method is an improvement over traditional approaches (e.g., manually picking genes for testing) because it uses genomic information and automates the process to identify large numbers of candidate makers. This approach is shown here to be successful for fishes, but also could be applied to other groups of organisms for which two or more complete genome sequences exist, which has important implications for assembling the tree of life.
Members of the Roseobacter clade which play a key role in the biogeochemical cycles of the ocean are diverse and abundant, comprising 10–25% of the bacterioplankton in most marine surface waters. The rapid accumulation of whole-genome sequence data for the Roseobacter clade allows us to obtain a clearer picture of its evolution.
In this study about 1,200 likely orthologous protein families were identified from 17 Roseobacter bacteria genomes. Functional annotations for these genes are provided by iProClass. Phylogenetic trees were constructed for each gene using maximum likelihood (ML) and neighbor joining (NJ). Putative organismal phylogenetic trees were built with phylogenomic methods. These trees were compared and analyzed using principal coordinates analysis (PCoA), approximately unbiased (AU) and Shimodaira–Hasegawa (SH) tests. A core set of 694 genes with vertical descent signal that are resistant to horizontal gene transfer (HGT) is used to reconstruct a robust organismal phylogeny. In addition, we also discovered the most likely 109 HGT genes. The core set contains genes that encode ribosomal apparatus, ABC transporters and chaperones often found in the environmental metagenomic and metatranscriptomic data. These genes in the core set are spread out uniformly among the various functional classes and biological processes.
Here we report a new multigene-derived phylogenetic tree of the Roseobacter clade. Of particular interest is the HGT of eleven genes involved in vitamin B12 synthesis as well as key enzynmes for dimethylsulfoniopropionate (DMSP) degradation. These aquired genes are essential for the growth of Roseobacters and their eukaryotic partners.
Although the overwhelming majority of genes found in angiosperms are members of gene families, and both gene- and genome-duplication are pervasive forces in plant genomes, some genes are sufficiently distinct from all other genes in a genome that they can be operationally defined as 'single copy'. Using the gene clustering algorithm MCL-tribe, we have identified a set of 959 single copy genes that are shared single copy genes in the genomes of Arabidopsis thaliana, Populus trichocarpa, Vitis vinifera and Oryza sativa. To characterize these genes, we have performed a number of analyses examining GO annotations, coding sequence length, number of exons, number of domains, presence in distant lineages, such as Selaginella and Physcomitrella, and phylogenetic analysis to estimate copy number in other seed plants and to demonstrate their phylogenetic utility. We then provide examples of how these genes may be used in phylogenetic analyses to reconstruct organismal history, both by using extant coverage in EST databases for seed plants and de novo amplification via RT-PCR in the family Brassicaceae.
There are 959 single copy nuclear genes shared in Arabidopsis, Populus, Vitis and Oryza ["APVO SSC genes"]. The majority of these genes are also present in the Selaginella and Physcomitrella genomes. Public EST sets for 197 species suggest that most of these genes are present across a diverse collection of seed plants, and appear to exist as single or very low copy genes, though exceptions are seen in recently polyploid taxa and in lineages where there is significant evidence for a shared large-scale duplication event. Genes encoding proteins localized in organelles are more commonly single copy than expected by chance, but the evolutionary forces responsible for this bias are unknown.
Regardless of the evolutionary mechanisms responsible for the large number of shared single copy genes in diverse flowering plant lineages, these genes are valuable for phylogenetic and comparative analyses. Eighteen of the APVO SSC single copy genes were amplified in the Brassicaceae using RT-PCR and directly sequenced. Alignments of these sequences provide improved resolution of Brassicaceae phylogeny compared to recent studies using plastid and ITS sequences. An analysis of sequences from 13 APVO SSC genes from 69 species of seed plants, derived mainly from public EST databases, yielded a phylogeny that was largely congruent with prior hypotheses based on multiple plastid sequences. Whereas single gene phylogenies that rely on EST sequences have limited bootstrap support as the result of limited sequence information, concatenated alignments result in phylogenetic trees with strong bootstrap support for already established relationships. Overall, these single copy nuclear genes are promising markers for phylogenetics, and contain a greater proportion of phylogenetically-informative sites than commonly used protein-coding sequences from the plastid or mitochondrial genomes.
Putatively orthologous, shared single copy nuclear genes provide a vast source of new evidence for plant phylogenetics, genome mapping, and other applications, as well as a substantial class of genes for which functional characterization is needed. Preliminary evidence indicates that many of the shared single copy nuclear genes identified in this study may be well suited as markers for addressing phylogenetic hypotheses at a variety of taxonomic levels.
Subterranean mammals have been of great interest for evolutionary biologists because of their highly specialized traits for the life underground. Owing to the convergence of morphological traits and the incongruence of molecular evidence, the phylogenetic relationships among three subfamilies Myospalacinae (zokors), Spalacinae (blind mole rats) and Rhizomyinae (bamboo rats) within the family Spalacidae remain unresolved. Here, we performed de novo transcriptome sequencing of four RNA-seq libraries prepared from brain and liver tissues of a plateau zokor (Eospalax baileyi) and a hoary bamboo rat (Rhizomys pruinosus), and analyzed the transcriptome sequences alongside a published transcriptome of the Middle East blind mole rat (Spalax galili). We characterize the transcriptome assemblies of the two spalacids, and recover the phylogeny of the three subfamilies using a phylogenomic approach.
Approximately 50.3 million clean reads from the zokor and 140.8 million clean reads from the bamboo ratwere generated by Illumina paired-end RNA-seq technology. All clean reads were assembled into 138,872 (the zokor) and 157,167 (the bamboo rat) unigenes, which were annotated by the public databases: the Swiss-prot, Trembl, NCBI non-redundant protein (NR), NCBI nucleotide sequence (NT), Gene Ontology (GO), Cluster of Orthologous Groups (COG), and Kyoto Encyclopedia of Genes and Genomes (KEGG). A total of 5,116 nuclear orthologous genes were identified in the three spalacids and mouse, which was used as an outgroup. Phylogenetic analysis revealed a sister group relationship between the zokor and the bamboo rat, which is supported by the majority of gene trees inferred from individual orthologous genes, suggesting subfamily Myospalacinae is more closely related to subfamily Rhizomyinae. The same topology was recovered from concatenated sequences of 5,116 nuclear genes, fourfold degenerate sites of the 5,116 nuclear genes and concatenated sequences of 13 protein coding mitochondrial genes.
This is the first report of transcriptome sequencing in zokors and bamboo rats, representing a valuable resource for future studies of comparative genomics in subterranean mammals. Phylogenomic analysis provides a conclusive resolution of interrelationships of the three subfamilies within the family Spalacidae, and highlights the power of phylogenomic approach to dissect the evolutionary history of rapid radiations in the tree of life.
Spalacidae; Phylogenomics; Transcriptome; Mitochondrial genome; Subterranean rodents