Genome-wide comparison of phylogenetic trees is becoming an increasingly common approach in evolutionary genomics, and a variety of approaches for such comparison have been developed. In this article we present several methods for comparative analysis of large numbers of phylogenetic trees. To compare phylogenetic trees taking into account the bootstrap support for each internal branch, the Boot-Split Distance (BSD) method is introduced as an extension of the previously developed Split Distance (SD) method for tree comparison. The BSD method implements the straightforward idea that comparison of phylogenetic trees can be made more robust by treating tree splits differentially depending on the bootstrap support. Approaches are also introduced for detecting tree-like and net-like evolutionary trends in the phylogenetic Forest of Life (FOL), i.e., the entirety of the phylogenetic trees for conserved genes of prokaryotes. The principal method employed for this purpose includes mapping quartets of species onto trees to calculate the support of each quartet topology and so to quantify the tree and net contributions to the distances between species. We describe the application of these methods to the analysis of the FOL and the results obtained. These results support the concept of the Tree of Life (TOL) as a central evolutionary trend in the FOL as opposed to the traditional view of the TOL as a ‘species tree’.
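The idea of weighting splits by bootstrap support can be illustrated with a minimal Python sketch. The abstract does not give the BSD formula, so the weighting used in `boot_split_distance` is an assumption chosen only so that it reduces to the plain split distance when every support equals 1; the function names, the toy trees and the support values are all hypothetical.

```python
def canonical_split(side, taxa):
    """Name a bipartition by the side that lacks the alphabetically first taxon."""
    side = frozenset(side)
    return frozenset(taxa) - side if min(taxa) in side else side


def split_distance(splits_a, splits_b):
    """Fraction of splits that occur in only one of the two trees."""
    a, b = set(splits_a), set(splits_b)
    total = len(a) + len(b)
    return len(a ^ b) / total if total else 0.0


def boot_split_distance(support_a, support_b):
    """Assumed weighting (not the published formula): each split counts
    in proportion to its bootstrap support, given as a value in [0, 1]."""
    unshared = set(support_a) ^ set(support_b)
    mismatch = sum(support_a.get(s, 0.0) + support_b.get(s, 0.0) for s in unshared)
    total = sum(support_a.values()) + sum(support_b.values())
    return mismatch / total if total else 0.0


taxa = "ABCDE"
# Tree 1: ((A,B),C,(D,E)); tree 2: ((A,C),B,(D,E)). Supports are made-up values.
tree1 = {canonical_split("AB", taxa): 0.2, canonical_split("DE", taxa): 1.0}
tree2 = {canonical_split("AC", taxa): 0.4, canonical_split("DE", taxa): 0.9}

sd = split_distance(tree1, tree2)        # half of the splits disagree
bsd = boot_split_distance(tree1, tree2)  # disagreement rests on weak splits
```

In this toy case the two conflicting splits have low support, so the weighted distance is smaller than the unweighted one, which is the behaviour the abstract describes qualitatively.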
Forest of life; tree of life; phylogenomic methods; tree comparison; map of quartets
The widespread exchange of genes among prokaryotes, known as horizontal gene transfer (HGT), is often considered to “uproot” the Tree of Life (TOL). Indeed, it is by now fully clear that genes in general possess different evolutionary histories. However, the possibility remains that the TOL concept can be reformulated and remain valid as a statistical central trend in the phylogenetic “Forest of Life” (FOL). This article describes a computational pipeline developed to chart the FOL by comparative analysis of thousands of phylogenetic trees. This analysis reveals a distinct, consistent phylogenetic signal that is particularly strong among the Nearly Universal Trees (NUTs), which correspond to genes represented in all or most of the analyzed organisms. Despite the substantial amount of apparent HGT seen even among the NUTs, these gene transfers appear to be distributed randomly and do not obscure the central tree-like trend.
Proteorhodopsin phototrophy is expected to have considerable impact on the ecology and biogeochemical roles of marine bacteria. However, the genetic features contributing to the success of proteorhodopsin-containing bacteria remain largely unknown. We investigated the genome of Dokdonia sp. strain MED134 (Bacteroidetes) for features potentially explaining its ability to grow better in light than darkness. MED134 has a relatively high number of peptidases, suggesting that amino acids are the main carbon and nitrogen sources. In addition, MED134 shares with other environmental genomes a reduction in gene copies at the expense of important ones, like membrane transporters, which might be compensated by the presence of the proteorhodopsin gene. The genome analyses suggest Dokdonia sp. MED134 is able to respond to light at least partly due to the presence of a strong flavobacterial consensus promoter sequence for the proteorhodopsin gene. Moreover, Dokdonia sp. MED134 has a complete set of anaplerotic enzymes likely to play a role in the adaptation of the carbon anabolism to the different sources of energy it can use, including light or various organic matter compounds. In addition to promoting growth, proteorhodopsin phototrophy could provide energy for the degradation of complex or recalcitrant organic matter, survival during periods of low nutrients, or uptake of amino acids and peptides at low concentrations. Our analysis suggests that the ability to harness light potentially makes MED134 less dependent on the amount and quality of organic matter or other nutrients. The genomic features reported here may well be among the keys to a successful photoheterotrophic lifestyle.
Archaeal and bacterial ribosomes contain more than 50 proteins, including 34 that are universally conserved in the three domains of cellular life (bacteria, archaea, and eukaryotes). Despite the high sequence conservation, annotation of ribosomal (r-) protein genes is often difficult because of their short lengths and biased sequence composition. We developed an automated computational pipeline for identification of r-protein genes and applied it to 995 completely sequenced bacterial and 87 archaeal genomes available in the RefSeq database. The pipeline employs curated seed alignments of r-proteins to run position-specific scoring matrix (PSSM)-based BLAST searches against six-frame genome translations, mitigating possible gene annotation errors. As a result of this analysis, we performed a census of prokaryotic r-protein complements, enumerated missing and paralogous r-proteins, and analyzed the distributions of ribosomal protein genes among chromosomal partitions. Phyletic patterns of bacterial and archaeal r-protein genes were mapped to phylogenetic trees reconstructed from concatenated alignments of r-proteins to reveal the history of likely multiple independent gains and losses. These alignments, available for download, can be used as search profiles to improve genome annotation of r-proteins and for further comparative genomics studies.
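The six-frame translation step mentioned above can be sketched in a few lines of Python. This is only an illustration of the translation itself, assuming the standard genetic code; it does not reproduce the PSSM-based BLAST search or any other part of the pipeline, and the function names are hypothetical.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built from the conventional TCAG ordering.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; codons with ambiguous bases become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Map frame labels (+1..+3, -1..-3) to conceptual translations."""
    seq = seq.upper()
    frames = {}
    for strand, dna in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(dna[offset:])
    return frames


frames = six_frame_translations("ATGGCC")
```

Searching translations of all six frames rather than annotated proteins is what allows short r-protein genes to be found even when the gene annotation missed them.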
Antigenic drift in the influenza A virus hemagglutinin (HA) is responsible for seasonal reformulation of influenza vaccines. Here, we address an important and largely overlooked issue in antigenic drift: how do the number and location of glycosylation sites affect HA evolution in man? We analyzed the glycosylation status of all full-length H1 subtype HA sequences available in the NCBI influenza database. We devised the “flow index” (FI), a simple algorithm that calculates the tendency for viruses to gain or lose consensus glycosylation sites. The FI predicts the predominance of glycosylation states among existing strains. Our analyses show that while the number of glycosylation sites in the HA globular domain does not influence the overall magnitude of variation in defined antigenic regions, variation focuses on those regions unshielded by glycosylation. This supports the conclusion that glycosylation generally shields HA from antibody-mediated neutralization, and implies that fitness costs in accommodating oligosaccharides limit virus escape via HA hyperglycosylation.
Influenza A virus is highly susceptible to neutralizing antibodies specific for the viral hemagglutinin glycoprotein (HA), and is easily controlled by standard vaccines. Influenza A virus remains an important human pathogen, however, due to its ability to rapidly evade antibody responses. This process, termed antigenic drift, is due to the accumulation of amino acid substitutions that modify HA antigenic sites recognized by neutralizing antibodies. In this study, we perform bioinformatic analysis on thousands of influenza A virus isolates to better understand the influence of N-linked glycosylation on antigenic drift. HA from human IAV isolates can accommodate up to 6 oligosaccharides in its globular domain. We show that for H1, H2, and to a somewhat lesser extent H3 HAs, the number of glycosylation sites in the globular domain does not greatly modify the total degree of variation in antigenic sites, but rather focuses variation on sites whose access to antibodies is unaffected by glycosylation. Our findings imply that glycosylation protects HA from antibody neutralization, but functional impairment limits the number of oligosaccharides that HA can accommodate.
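Counting potential N-linked glycosylation sites in a protein sequence can be sketched as a scan for the consensus sequon N-X-S/T, where X is any residue except proline. The Python sketch below is an illustration only: the function name, the optional region bounds and the toy peptide are hypothetical, and the actual boundaries of the HA globular domain are not given here.

```python
import re

# Zero-width lookahead so that overlapping sequons (e.g. "NNTS") are all found.
SEQUON = re.compile(r"(?=N[^P][ST])")


def sequon_positions(protein, start=0, end=None):
    """Return 1-based positions of the Asn in every N-X-[S/T] sequon (X != Pro).

    `start` and `end` are 0-based slice bounds for restricting the scan to a
    region of interest; a sequon must lie entirely inside that region.
    """
    region = protein.upper()[start:end]
    return [start + match.start() + 1 for match in SEQUON.finditer(region)]


toy_peptide = "MNGSANPTNNTS"  # made-up sequence, not an HA fragment
sites = sequon_positions(toy_peptide)
```

The `NPT` motif in the toy peptide is skipped because proline at the middle position is excluded from the consensus. A sequon is a necessary but not a sufficient condition for glycosylation, so counts obtained this way are potential sites only.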
Phylogenetic trees of individual genes of prokaryotes (archaea and bacteria) generally have different topologies, largely owing to extensive horizontal gene transfer (HGT), suggesting that the Tree of Life (TOL) should be replaced by a “net of life” as the paradigm of prokaryote evolution. However, trees remain the natural representation of the histories of individual genes given the fundamentally bifurcating process of gene replication. Therefore, although no single tree can fully represent the evolution of prokaryote genomes, the complete picture of evolution will necessarily combine trees and nets. A quantitative measure of the signals of tree and net evolution is derived from an analysis of all quartets of species in all trees of the “Forest of Life” (FOL), which consists of approximately 7,000 phylogenetic trees for prokaryote genes including approximately 100 nearly universal trees (NUTs). Although diverse routes of net-like evolution collectively dominate the FOL, the pattern of tree-like evolution that reflects the consistent topologies of the NUTs is the most prominent coherent trend. We show that the contributions of tree-like and net-like evolutionary processes substantially differ across bacterial and archaeal lineages and between functional classes of genes. Evolutionary simulations indicate that the central tree-like signal cannot be realistically explained by a self-reinforcing pattern of biased HGT.
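The quartet analysis can be illustrated with a small Python sketch. Each unrooted quartet of species has three possible resolved topologies, and a tree supports one of them if some split of the tree places two of the four species on each side. The representation below (a tree as a pair of its taxon set and one side of each internal split), the function names and the three toy trees are assumptions made for illustration, not the published implementation.

```python
from collections import Counter
from itertools import combinations


def quartet_topology(taxa, splits, quartet):
    """Return the topology a tree induces on a quartet, e.g. {{A,B},{C,D}},
    or None if the tree lacks one of the species or leaves the quartet unresolved."""
    q = frozenset(quartet)
    if not q <= frozenset(taxa):
        return None
    for side in splits:
        inside = q & side
        if len(inside) == 2:  # two species on each side of this split
            return frozenset((inside, q - inside))
    return None


def quartet_support(forest, species):
    """Count, for every quartet, how many trees support each topology."""
    support = {}
    for quartet in combinations(sorted(species), 4):
        votes = Counter()
        for taxa, splits in forest:
            topology = quartet_topology(taxa, splits, quartet)
            if topology is not None:
                votes[topology] += 1
        support[quartet] = votes
    return support


taxa = frozenset("ABCDE")
forest = [
    (taxa, [frozenset("CDE"), frozenset("DE")]),  # ((A,B),C,(D,E))
    (taxa, [frozenset("BDE"), frozenset("DE")]),  # ((A,C),B,(D,E))
    (taxa, [frozenset("CDE"), frozenset("DE")]),  # ((A,B),C,(D,E))
]
support = quartet_support(forest, taxa)
```

A quartet whose votes concentrate on one topology, such as A, B, D, E in this toy forest, behaves in a tree-like way, whereas votes spread over several topologies indicate conflicting signal.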
phylogenetic tree; horizontal gene transfer; species quartets; computer simulation
The Relative Codon Deoptimization Index (RCDI) was developed by Mueller et al. (2006) as a measure of codon deoptimization that compares the codon usage of a gene with the codon usage of a reference genome.
RCDI/eRCDI is a web application server that calculates the Relative Codon Deoptimization Index and a new expected value for the RCDI (eRCDI). The RCDI is used to estimate the similarity of the codon frequencies of a specific gene in comparison to a given reference genome. The eRCDI is determined by generating random sequences with similar G+C and amino acid composition to the input sequences and may be used as an indicator of the significance of the RCDI values. RCDI/eRCDI is freely available at http://genomes.urv.cat/CAIcal/RCDI.
This web server will be a useful tool for genome analysis, to understand host-virus phylogenetic relationships or to infer the potential host range of a virus and its replication strategy, as well as in experimental virology to ease the step of gene design for heterologous protein expression.
Comparative genomics has revealed extensive horizontal gene transfer among prokaryotes, a development that is often considered to undermine the 'tree of life' concept. However, the possibility remains that a statistical central trend still exists in the phylogenetic 'forest of life'.
A comprehensive comparative analysis of a 'forest' of 6,901 phylogenetic trees for prokaryotic genes revealed a consistent phylogenetic signal, particularly among 102 nearly universal trees, despite high levels of topological inconsistency, probably due to horizontal gene transfer. Horizontal transfers seemed to be distributed randomly and did not obscure the central trend. The nearly universal trees were topologically similar to numerous other trees. Thus, the nearly universal trees might reflect a significant central tendency, although they cannot represent the forest completely. However, topological consistency was seen mostly at shallow tree depths and abruptly dropped at the level of the radiation of archaeal and bacterial phyla, suggesting that early phases of evolution could be non-tree-like (Biological Big Bang). Simulations of evolution under compressed cladogenesis or Biological Big Bang yielded a better fit to the observed dependence between tree inconsistency and phylogenetic depth for the compressed cladogenesis model.
Horizontal gene transfer is pervasive among prokaryotes: very few gene trees are fully consistent, making the original tree of life concept obsolete. A central trend that most probably represents vertical inheritance is discernible throughout the evolution of archaea and bacteria, although compressed cladogenesis complicates unambiguous resolution of the relationships between the major archaeal and bacterial clades.
The Codon Adaptation Index (CAI) was first developed to measure the synonymous codon usage bias for a DNA or RNA sequence. The CAI quantifies the similarity between the synonymous codon usage of a gene and the synonymous codon frequency of a reference set.
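The calculation can be sketched in Python as follows. Each codon receives a relative adaptiveness equal to its count in the reference set divided by the count of the most frequent synonymous codon, and the CAI of a gene is the geometric mean of these values over its codons. The sketch assumes the standard genetic code, skips Met, Trp and stop codons, and replaces zero counts with a pseudocount of 0.5; the pseudocount value, the function names and the toy sequences are assumptions for illustration and are not taken from the CAIcal implementation.

```python
import math
from collections import Counter

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def codons(seq):
    """Split a coding sequence into complete codons (RNA input is accepted)."""
    seq = seq.upper().replace("U", "T")
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def relative_adaptiveness(reference_seqs, pseudocount=0.5):
    """Weight of each codon relative to its most used synonym in the reference set."""
    counts = Counter(c for s in reference_seqs for c in codons(s) if c in CODON_TABLE)
    weights = {}
    for aa in set(AMINO) - set("*MW"):  # single-codon families carry no information
        family = [c for c, a in CODON_TABLE.items() if a == aa]
        adjusted = {c: counts[c] or pseudocount for c in family}  # assumed pseudocount
        top = max(adjusted.values())
        for c in family:
            weights[c] = adjusted[c] / top
    return weights


def cai(seq, weights):
    """Geometric mean of the codon weights over the scored codons of a gene."""
    logs = [math.log(weights[c]) for c in codons(seq) if c in weights]
    return math.exp(sum(logs) / len(logs)) if logs else float("nan")


# Toy reference set: Leu is encoded by CTG three times and CTA once.
weights = relative_adaptiveness(["CTGCTGCTGCTA", "AAAAAG"])
cai_optimal = cai("CTGAAA", weights)
cai_suboptimal = cai("CTAAAA", weights)
```

A gene that uses only the preferred codon of each family scores 1, and every use of a rarer synonym lowers the geometric mean.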
We describe here CAIcal, a web server that includes a complete set of utilities related to the CAI. The server provides useful features, such as the calculation and graphical representation of the CAI along either an individual sequence or a protein multiple sequence alignment translated to DNA. The automated calculation of CAI and its expected value is also included as one of the CAIcal tools. The software can also be downloaded free of charge as a standalone application for local use.
The CAIcal server provides a complete set of tools to assess codon usage adaptation and to help in genome annotation.
This article was reviewed by Purificación López-García, Dan Graur, Rob Knight and Shamil Sunyaev.
The Codon Adaptation Index (CAI) is a measure of the synonymous codon usage bias for a DNA or RNA sequence. It quantifies the similarity between the synonymous codon usage of a gene and the synonymous codon frequency of a reference set. Extreme values in the nucleotide or in the amino acid composition have a large impact on differential preference for synonymous codons. It is therefore essential to define the limits for the expected value of CAI on the basis of sequence composition in order to properly interpret the CAI and provide statistical support to CAI analyses. Though several freely available programs calculate the CAI for a given DNA sequence, none of them corrects for compositional biases or provides confidence intervals for CAI values.
The E-CAI server is a web application that calculates an expected value of CAI for a set of query sequences by generating random sequences with G+C and amino acid content similar to those of the input. An executable file, a tutorial, a Frequently Asked Questions (FAQ) section and several examples are also available. To exemplify the use of the E-CAI server, we have analysed the codon adaptation of human mitochondrial genes that encode a subunit of the mitochondrial respiratory chain (excluding those genes that lack a prokaryotic orthologue) and are encoded in the nuclear genome. It is assumed that these genes were transferred from the proto-mitochondrial to the nuclear genome and that their codon usage was then ameliorated.
The E-CAI server provides a direct threshold value for discerning whether the differences in CAI are statistically significant or whether they are merely artifacts that arise from internal biases in the G+C composition and/or amino acid composition of the query sequences.
The highly expressed genes database (HEG-DB) is a genomic database that includes the prediction of which genes are highly expressed in prokaryotic complete genomes under strong translational selection. The current version of the database contains general features for almost 200 genomes under translational selection, including the correspondence analysis of the relative synonymous codon usage for all genes, and the analysis of their highly expressed genes. For each genome, the database contains functional and positional information about the predicted group of highly expressed genes. This information can also be accessed using a search engine. Among other statistical parameters, the database also provides the Codon Adaptation Index (CAI) for all of the genes using the codon usage of the highly expressed genes as a reference set. The ‘Pathway Tools Omics Viewer’ from the BioCyc database enables the metabolic capabilities of each genome to be explored, particularly those related to the group of highly expressed genes. The HEG-DB is freely available at http://genomes.urv.cat/HEG-DB.
OPTIMIZER is an on-line application that optimizes the codon usage of a gene to increase its expression level. Three methods of optimization are available: the ‘one amino acid–one codon’ method, a guided random method based on a Monte Carlo algorithm, and a new method designed to maximize the optimization with the fewest changes in the query sequence. One of the main features of OPTIMIZER is that it makes it possible to optimize a DNA sequence using pre-computed codon usage tables from a predicted group of highly expressed genes from more than 150 prokaryotic species under strong translational selection. These groups of highly expressed genes have been predicted using a new iterative algorithm. In addition, users can use, as a reference set, a pre-computed table containing the mean codon usage of ribosomal protein genes and, as a novelty, the tRNA gene-copy numbers. OPTIMIZER is accessible free of charge at http://genomes.urv.es/OPTIMIZER.
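The simplest of the three strategies, 'one amino acid-one codon', can be sketched in Python: every codon of the query is replaced by the synonymous codon that is most frequent in the reference set. The sketch assumes the standard genetic code; the function names and the toy usage table are hypothetical, and the Monte Carlo and minimal-change methods are not shown.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def best_codons(usage):
    """Pick the most frequent codon for each amino acid in a usage table."""
    best = {}
    for codon, count in usage.items():
        aa = CODON_TABLE[codon]
        if aa not in best or count > usage[best[aa]]:
            best[aa] = codon
    return best


def optimize_one_codon(seq, usage):
    """Replace each codon by the preferred synonym; the protein is unchanged.

    Codons whose amino acid is absent from the usage table, codons with
    ambiguous bases, and a trailing partial codon are left as they are.
    """
    seq = seq.upper()
    best = best_codons(usage)
    end = len(seq) - len(seq) % 3
    out = []
    for i in range(0, end, 3):
        codon = seq[i:i + 3]
        out.append(best.get(CODON_TABLE.get(codon), codon))
    return "".join(out) + seq[end:]


# Hypothetical codon counts standing in for a table from highly expressed genes.
usage = {"CTG": 10, "CTA": 1, "AAA": 2, "AAG": 5}
optimized = optimize_one_codon("CTAAAAATG", usage)
```

Because only synonymous substitutions are made, the encoded protein is identical before and after optimization; the cost of this strategy is that it can change many positions of the query at once.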
Using information from several metabolic databases, we have built our own metabolic database containing 434 pathways and 1157 different enzymes. We have used this information to construct a dendrogram that demonstrates the metabolic similarities between 282 species. The resulting species distribution and the clusters defined in the tree show a certain taxonomic congruence, especially in recent relationships between species. This dendrogram is another representation of the tree of life, based on metabolism that may complement the trees constructed by other methods. For example, the metabolic dissimilarity we demonstrate between Symbiobacterium thermophilum (previously defined as Actinobacteria) and the other Actinobacteria species, and the metabolic similarity between S. thermophilum and Clostridia, combined with other evidence, suggest that S. thermophilum may be re-classified as
metabolic pathways; enzymes; dendrogram; taxonomy; species