This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Orthologous genes are frequently presumed to perform similar functions. However, outside of model organisms, this is rarely tested. One means of inferring changes in function is to look for changes in the level of gene conservation and selective constraint. Here we compare levels of gene conservation across three bacterial groups to test for changes in gene functionality.
The level of gene conservation for different orthologous genes is highly correlated across clades, even for highly divergent groups of bacteria. These correlations do not arise from broad differences in gene functionality (e.g. informational genes vs. metabolic genes), but instead seem to result from very specific differences in gene function. Furthermore, these functional differences appear to be maintained over very long periods of time.
These results suggest that even over broad time scales, most bacterial genes are under a nearly constant level of purifying selection, and that bacterial evolution is thus dominated by selective and functional stasis.
We are interested in whether the functional importance of orthologous genes changes across bacterial taxa. We pose a simple question: if a gene plays an important role for the functioning of a bacterium, is the orthologue of that gene in a distantly related bacterium also particularly important? To measure functional importance, we look at how strongly genes are conserved over time. If a group of orthologous genes are well conserved across different taxa, this implies that strong purifying selection is acting to maintain these genes. Similarly, if a group of orthologous genes are lost quickly across different taxa, this implies that purifying selection is acting only weakly to maintain these genes. If the strength of purifying selection for individual genes does not change across bacterial groups (i.e. if there is a correlation in the level of orthologue conservation), this implies that the functional importance of most orthologous genes does not change quickly. On the other hand, if there is little correlation in the level of gene conservation across bacterial groups, this implies that the functional importance of many orthologues changes often, perhaps because of differences in the genetic backgrounds of organisms (e.g. compensating mechanisms at other loci).
Here we show that between three bacterial clades, the levels of conservation for specific orthologues are highly correlated. This correlation remains even when examining subgroups of functionally similar genes, such as genes involved in ribosome function. This suggests that despite large differences in genetic background, the strength of selection acting to maintain any specific orthologue remains approximately constant, and that most genes maintain their specific functionality over long periods of time.
We used stochastic character mapping to calculate a measure of gene conservation that accounts for phylogenetic relatedness between taxa. Briefly, for each protein-coding gene in E. coli K12 W3110, we determined whether an orthologous gene was present or absent in every other bacterium with a fully sequenced genome (447 other genomes in total). Together with information on the phylogenetic history, these data were used to calculate a parameter for each orthologue that reflects the rate (probability per unit time) at which the orthologue will be lost or gained along a branch (see Additional file 1). Because this parameter value is mostly determined by how quickly an orthologue is lost over time, we term this parameter the rate of orthologue loss (ROL). Low ROL values imply that along a branch, there is a low probability of that orthologue being lost. High ROL values imply that along a branch, there is a high probability that the orthologue will be lost. For each orthologue, one ROL was calculated for all branches in a clade.
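To make the idea of a per-branch loss or gain rate concrete, the sketch below uses a generic two-state continuous-time Markov chain (orthologue present or absent). This is an illustrative assumption only: the model actually fitted by the stochastic character mapping software may differ, and the function name, the example rates, and the branch length are all hypothetical.

```python
import math

def transition_probs(loss_rate, gain_rate, branch_length):
    """Two-state continuous-time Markov chain (present <-> absent).

    Returns (P(present -> absent), P(absent -> present)) after evolving
    along a branch of the given length. Rates are per unit time.
    """
    total = loss_rate + gain_rate
    if total == 0:
        # No change is possible if both rates are zero.
        return 0.0, 0.0
    # Fraction of the way the chain has relaxed toward its stationary state.
    relaxed = 1.0 - math.exp(-total * branch_length)
    p_loss = (loss_rate / total) * relaxed
    p_gain = (gain_rate / total) * relaxed
    return p_loss, p_gain

# Hypothetical rates: a well-conserved orthologue versus a poorly conserved one.
p_loss_conserved, _ = transition_probs(loss_rate=0.05, gain_rate=0.01, branch_length=1.0)
p_loss_labile, _ = transition_probs(loss_rate=2.0, gain_rate=0.01, branch_length=1.0)
```

Under this simplified model, a higher loss rate translates directly into a higher probability that the orthologue is absent at the end of a branch, which is the intuition behind reading a high ROL as weak conservation.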
All genomes were downloaded from the NCBI database in May of 2007 (Additional file 2), and a phylogeny was constructed using a concatenated set of 73 conserved orthologues (Additional files 3 and 4). The program SIMMAP was used to calculate all ROL values (Additional file 5), and gene functional classes were divided using MultiFun. For detailed materials and methods, see Additional file 1.
We tested whether the ROL values for specific orthologues change between different clades of bacteria. We calculated the ROL values for orthologous genes in three bacterial groups: the γ- and β-proteobacteria; the α-proteobacteria, which are the sister clade of the γ-β-proteobacteria and diverged approximately 2.5 billion years ago; and the Bacilli-Mollicutes clade, which diverged from the γ-β-proteobacteria just over three billion years ago. For all of these clades, we found that the ROL values for orthologous genes were highly correlated (Fig. 1; r2 = 0.673, Pearson's ρ = 0.756 for γ-β-proteobacteria versus α-proteobacteria; r2 = 0.488, Pearson's ρ = 0.628 for γ-β-proteobacteria versus Bacilli-Mollicutes; all data are listed in Additional file 5).
A simple explanation for the high correlation between ROL values is that it is driven by differences in gene essentiality: essential genes will be strongly conserved, while nonessential genes will be weakly conserved. To test this, we divided the genes into essential and nonessential groups, based on the experimental results from two recent studies that used either E. coli K12 MG1655 or E. coli K12 BW25113. We disregarded any discrepancies in essentiality annotation between the two studies and focused only on those genes for which they agree on the classification of essentiality. We found that even when excluding orthologues that are classified as essential in E. coli, the correlations remained very high (Fig. 1; r2 = 0.539 and r2 = 0.332, respectively).
A second explanation for the high correlation between ROL values is that it is driven by differences between functional classes of genes. For example, informational orthologues may be highly conserved, whereas genes involved in metabolic functions may be less conserved. To test this hypothesis, we calculated the correlation coefficients for ROL values of orthologues within single functional classes of genes as delineated by MultiFun (see Additional file 1). We found that within MultiFun classes, ROL values between bacterial groups were again highly correlated, even when considering only nonessential genes. The r2 values between γ-β-proteobacteria and α-proteobacteria varied from 0.740 (for information transfer genes related to DNA, MultiFun class 2.1) to 0.260 (for carbon utilization genes, class 1.1) (Fig. 2). The r2 values between γ-β-proteobacteria and Bacilli-Mollicutes varied from 0.600 (for structural genes in the ribosome, MultiFun class 6.6) to 0.020 (for structural genes responsible for surface antigens, class 6.3) (Fig. 2). Together, these data suggest that ROL values remain constant over long stretches of time, on the order of billions of years, and that this constancy is driven neither by broad differences in gene functionality nor by differences in gene essentiality, but by specific differences in gene function.
We have assumed above that the level of gene conservation reflects the strength of purifying selection acting on a gene: well-conserved genes are under strong purifying selection, while less conserved genes experience only weak purifying selection. Here we test this assumption by asking how well our measure of gene conservation, ROL, corresponds to growth phenotypes, which we know to be under selection. Specifically, if the deletion of a gene causes lethality even under benign laboratory conditions, then the loss of this gene is almost certainly lethal in the natural environment, and the gene is thus under strong purifying selection. We first ask, then, how well ROL values correlate with annotations of gene essentiality. The ROL values for essential and nonessential genes are shown in Fig. 3A. On average, genes that have been classified as essential in E. coli K12 have a dramatically lower ROL than nonessential genes.
To quantify the relationship between ROL and essentiality, we used a receiver operator characteristic (ROC) curve. This curve describes the relationship between the fraction of false positives and the fraction of true positives when using ROL to discriminate between essential and nonessential genes. One means of quantifying this relationship is by calculating the area under the ROC curve (the AUC), which is equivalent to the probability that a randomly chosen essential gene will have a lower ROL than a randomly chosen nonessential gene. If ROL were perfectly predictive of gene essentiality, the AUC would be 1.0; the AUC for this analysis was 0.947 (Fig. 3B), which strongly suggests that ROL values do reflect the strength of purifying selection acting on a gene.
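The probabilistic reading of the AUC given above can be computed directly by comparing every essential gene with every nonessential gene. The following is a minimal sketch with made-up ROL values; the function name is hypothetical, and counting ties as half a win is a common convention assumed here, not something stated in the text.

```python
def auc_lower_rol_is_essential(essential_rol, nonessential_rol):
    """Probability that a random essential gene has a lower ROL than a
    random nonessential gene (ties count as half)."""
    wins = 0.0
    for e in essential_rol:
        for n in nonessential_rol:
            if e < n:
                wins += 1.0
            elif e == n:
                wins += 0.5
    return wins / (len(essential_rol) * len(nonessential_rol))

# Hypothetical ROL values, for illustration only.
essential = [0.1, 1.5]
nonessential = [1.0, 2.0]
auc = auc_lower_rol_is_essential(essential, nonessential)  # 3 of 4 pairs ranked correctly
```

The pairwise loop is quadratic in the number of genes, which is acceptable for a few thousand genes; a rank-based formula would give the same value more efficiently.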
We also asked whether ROL values and the quantitative effects of gene deletions are correlated. Using data on the growth yield of deletion mutants in rich media, we found a small but highly significant relationship between a gene's ROL value and the growth yield of that deletion strain (Fig. S2; r2 = 0.0628, p < 0.0001; Spearman's ρ = 0.127, p < 0.0001). Again, this suggests that ROL values reflect the strength of purifying selection acting on a gene.
We have shown that ROL values for specific orthologues are correlated over broad evolutionary distances, and that these correlations remain strong even within specific functional classes of genes and for genes that are not essential for cellular viability. In other words, the constancy of the level of gene conservation across bacterial orders seems to result from specific differences in gene function. The strength of the correlations we find here is of similar magnitude to that found in a previous study of correlations between protein evolutionary rates within the Chlamydiaceae. Notably, the Chlamydiaceae are far more closely related than the clades considered here, so a high correlation should not be surprising. However, we have also considered selection on a more general scale (gene presence versus gene absence), which likely increases the strength of the correlations. Interestingly, for some orthologues, ROL values have changed considerably across taxonomic groups (we show three examples in Figs. 4 and 5). We propose that these genes have changed in functional importance, resulting in either increased or decreased purifying selection.
Some essential E. coli genes have orthologues that are consistently lost at high rates among other γ-β-proteobacteria, α-proteobacteria, and Bacilli-Mollicutes, contrary to the high level of conservation expected for essential genes. This is not due to these genes only being essential in E. coli and nonessential in other taxa. In Table 1 we show a list of genes that are essential in E. coli K12 and which have high ROL values (greater than 2.4 in all three bacterial groups studied; Fig. 1), together with data from an empirical study of gene essentiality in the γ-proteobacterium Acinetobacter baylyi. Of nine genes with an orthologue in Acinetobacter, eight are also essential in Acinetobacter. This suggests, surprisingly, that some genes, despite being essential, are lost frequently, and is consistent with the view that compensation at other sites in the genome may occur even for "essential" functions.
Many of the essential genes that are lost at high rates are recent innovations. Considering those genes that are essential in E. coli K12 but are lost at high rates from other γ-β-proteobacteria, 44% (18 out of 41) have a distribution restricted to the γ-β-proteobacteria and are thus likely to be relatively recent additions to the genomic repertoire. In contrast, of the essential genes with low ROL values (less than 2.4), only 0.9% (2 out of 222) are restricted to the γ-β-proteobacteria. Previous work has shown that recently acquired genes tend to be incorporated at the edge of the cellular network. Such peripheral genes may thus be more easily removed from the genome, with fewer interactions to compensate.
These results confirm and extend previous studies that have investigated the relationship between essentiality and gene conservation [11-13]. However, here we have used a phylogenetically corrected measure of gene conservation (ROL). Additionally, we have found that the ability of orthologue conservation to predict gene essentiality is far higher than has previously been realized, most likely due to the lower accuracy of earlier datasets. Finally, we have shown for the first time a correlation between gene conservation and quantitative measures of deletion phenotypes (growth yield, Fig. S2).
Our metric of gene conservation, which takes into account phylogenetic history, provides a considerable improvement over simpler measures such as the fraction of taxa that retain a specific orthologue (retention). Using retention to predict essentiality yields an AUC of 0.937, meaning that essential genes are incorrectly ranked higher than nonessential genes 6.3% of the time. Using ROL, the misclassified fraction is reduced to 5.3%, a reduction of 16% in the error rate. ROL has the additional advantage of being based on a specific evolutionary model, which itself may provide biological insights, for example into the relative rates of gene loss versus horizontal transfer (i.e. the ratio of gene loss versus gene gain in lineages).
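The arithmetic behind this comparison can be checked in a few lines. The sketch below defines retention as the fraction of taxa carrying the orthologue and derives the relative error reduction from the two AUC values quoted above; the function names and the presence/absence vector are hypothetical.

```python
def retention(presence):
    """Fraction of taxa that retain the orthologue (1 = present, 0 = absent)."""
    return sum(presence) / len(presence)

def misranked_fraction(auc):
    """Fraction of essential/nonessential pairs ranked in the wrong order."""
    return 1.0 - auc

def relative_error_reduction(auc_old, auc_new):
    """Relative drop in the misranked fraction when moving to the new measure."""
    old = misranked_fraction(auc_old)
    new = misranked_fraction(auc_new)
    return (old - new) / old

# Hypothetical presence/absence pattern across four taxa.
example_retention = retention([1, 1, 0, 1])

# AUC values quoted in the text: 0.937 for retention, 0.947 for ROL.
reduction = relative_error_reduction(0.937, 0.947)  # roughly 0.16
```

Note that retention ignores the phylogeny entirely: four closely related taxa that all carry a gene count the same as four distantly related ones, which is the limitation ROL is meant to address.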
Finally, we note that high-throughput experimental assessments of gene essentiality are prone to both false positive and false negative results (i.e. annotating a nonessential gene as essential and vice versa). The level of agreement on essentiality between the two most recent studies of gene essentiality [5,6] is similar to the level of agreement between both studies and ROL (all are between 94% and 95%), and far greater than between the first experimental study of gene essentiality and the latter two experimental studies. This suggests that ROL may be a valid and useful means of cross-validating experimental studies in order to find genes likely to be false positives or false negatives, which could then be reexamined.
ROC: receiver operator characteristic; AUC: area under the ROC curve; ROL: rate of orthologue loss; W3110: E. coli K12 W3110.
The authors declare that they have no competing interests.
OKS and MA conceived of the study. OKS performed the bioinformatic and phylogenetic analyses and drafted the manuscript. MA edited the manuscript.
Detailed materials and methods. Detailed materials and methods are outlined here, including the methods used to build the phylogeny and the method used to infer ROL values. Three supplementary figures are also included, showing the phylogeny, the relationship between ROL and deletion strain growth phenotype, and the sensitivity of ROL values to changes in the ratio of gene loss to gene gain.
Bacterial and Archaeal genomes used to construct orthologue sets. All taxa used in the phylogenetic analysis are listed here; the taxon names are shown as written in the NCBI database.
Universal orthologues used to construct the distance-based phylogeny. The universally distributed genes used to construct the phylogeny are listed here; the top row indicates the E. coli orf. Below the E. coli orf, the orthologous reading frame in each genome is listed.
Full amino acid alignment used to construct phylogeny. This Phylip format file shows the full amino acid alignment used to construct the distance based phylogeny. The taxa names are abbreviated; the full names of each taxon are listed in Additional file 2.
Essentiality annotations and ROL values for all E. coli genes. The data here show the annotations of essentiality and non-essentiality, as well as the ROL values calculated for each orf in E. coli. The data are listed in tab-delimited columns, as follows: E. coli orf name; Blattner number used by PEC; Blattner number used by the Keio study; whether the gene is annotated as essential by the PEC study (1 = essential, 2 = nonessential, 4 = unknown); whether the gene is annotated as essential by the Keio study; the ROL value calculated when using all 448 bacterial taxa; the ROL when using only γ-β-proteobacteria (NA: orthologues were present in fewer than 10% of the taxa); the ROL when using only α-proteobacteria; the ROL when using only Bacilli-Mollicutes.
We thank the Theoretical Biology group at ETH Zurich for discussions. Funding was provided by the Roche Research Foundation and the Novartis Foundation (to OKS), and the Swiss National Science Foundation (to MA).