Sequence substitution levels of orthologs and gene duplication
Orthologous protein sequence pairs were identified for human and mouse as described under Methods. Protein sequences were aligned, and the resulting amino acid sequence alignments were used to guide the alignment of nucleotide coding sequences (CDSs). These alignments were used to estimate sequence divergence levels (substitutions per site) for amino acids as well as for non-synonymous (dN) and synonymous (dS) CDS substitutions. For pairs of human-mouse orthologs, within-genome sequence comparisons were used to classify them as duplicates or singletons, based on whether or not they had detectable paralogs, and the average sequence divergence levels for these two classes of genes were compared. The classification of orthologous pairs as duplicates and singletons was done using three criteria: 1 – presence of paralogs in the human genome alone, 2 – presence of paralogs in the mouse genome alone, and 3 – presence of paralogs in human or mouse. For all three classification criteria, the average ortholog amino acid substitution levels for duplicates were substantially (and statistically highly significantly) lower than those of singletons (Table and Figure ).
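The three classification criteria described above can be illustrated with a minimal sketch. The record layout, field names and numbers below are hypothetical and serve only to show how one set of orthologous pairs is split into duplicates and singletons under each criterion before the class averages are compared; this is not the pipeline used in the study.

```python
from statistics import mean

# Hypothetical records, one per human-mouse orthologous pair.
# aa_dist: amino acid substitutions per site between the two orthologs
# human_paralogs / mouse_paralogs: number of detectable within-genome paralogs
orthologs = [
    {"aa_dist": 0.10, "human_paralogs": 2, "mouse_paralogs": 1},
    {"aa_dist": 0.15, "human_paralogs": 0, "mouse_paralogs": 3},
    {"aa_dist": 0.30, "human_paralogs": 0, "mouse_paralogs": 0},
    {"aa_dist": 0.25, "human_paralogs": 0, "mouse_paralogs": 0},
]

# The three criteria: paralogs in human only considered, in mouse only
# considered, or in either genome.
CRITERIA = {
    "human": lambda o: o["human_paralogs"] > 0,
    "mouse": lambda o: o["mouse_paralogs"] > 0,
    "human_or_mouse": lambda o: o["human_paralogs"] > 0 or o["mouse_paralogs"] > 0,
}


def mean_distances(records, is_duplicate):
    """Return (mean distance of duplicates, mean distance of singletons)."""
    dup = [r["aa_dist"] for r in records if is_duplicate(r)]
    sing = [r["aa_dist"] for r in records if not is_duplicate(r)]
    return mean(dup), mean(sing)  # raises StatisticsError if a class is empty


results = {name: mean_distances(orthologs, rule) for name, rule in CRITERIA.items()}
for name, (dup_mean, sing_mean) in results.items():
    print(f"{name}: duplicates={dup_mean:.3f} singletons={sing_mean:.3f}")
```

In a real analysis the class means would be accompanied by confidence intervals and a significance test, as reported in the table and figure cited above.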
Evolutionary distance (substitution level) comparisons between duplicates and singletons
Figure 1 Average substitution levels, with 95% confidence intervals, for orthologous human-mouse sequence pairs with paralogs (duplicates – light gray bars) and with no paralogs (singletons – dark gray bars). The x-axis labels indicate comparisons ...
It is a formal, albeit unlikely, possibility that these differences in sequence diversity between the duplicate and singleton classes are due to different mutation pressures. To control for this possibility, the ratio of dN/dS was taken as an approximate measure of selective constraint and compared between the duplicate and singleton classes. As with the amino acid substitution levels, dN/dS is substantially lower for duplicates than for singletons (Table and Figure ). The dN/dS ratio is considered to be an indicator of the mode and strength of the selection operating during evolution of a gene [9]. Thus, the finding that, in comparisons between human and mouse orthologs, duplicates, on average, have lower dN/dS values than singletons strongly suggests that the former are subject to stronger purifying selection than the latter.
A formal possibility exists that the observed differences between the evolutionary rates of duplicates and singletons were due solely to the presence of extremely rapidly evolving gene pairs of potentially mis-identified orthologs (Methods; Figure ). To examine the possible contribution of this effect, all human-mouse gene pairs with dS > 2 standard deviations (sd) from the mean were removed from consideration, and the differences in average substitution levels between duplicates and singletons were re-calculated. This did not result in any appreciable differences from the original results (Figure ) that were obtained using a cut-off of 3 sd (for the 2 sd cut-off, the amino acid gamma distance was 0.18 for duplicates and 0.29 for singletons, and the dN/dS ratio was 0.14 for duplicates and 0.20 for singletons).
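The outlier control can be sketched as a simple filter on dS. The sketch below assumes that "dS > k sd from the mean" means a dS value exceeding the mean by more than k sample standard deviations; the function name, record layout and values are hypothetical.

```python
from statistics import mean, stdev


def filter_by_ds(pairs, k):
    """Keep ortholog pairs whose dS is at most mean + k * sd (assumed reading)."""
    ds_values = [p["dS"] for p in pairs]
    cutoff = mean(ds_values) + k * stdev(ds_values)
    return [p for p in pairs if p["dS"] <= cutoff]


# Hypothetical dS values: nine typical pairs and one suspiciously divergent pair.
pairs = [{"dS": v} for v in (0.5, 0.6, 0.55, 0.65, 0.6, 0.5, 0.55, 0.6, 0.6, 3.0)]

kept_3sd = filter_by_ds(pairs, 3)  # looser cut-off
kept_2sd = filter_by_ds(pairs, 2)  # stricter cut-off used as the control
print(len(kept_3sd), len(kept_2sd))
```

With these toy values the divergent pair survives the 3 sd cut-off but is removed at 2 sd, which is the situation the control is meant to probe; the duplicate and singleton averages would then be recomputed on the filtered set.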
Figure 5 Ortholog identification control. a – The symmetrical best BLAST hits approach may mis-identify orthologs in rare cases where there is an ancient gene duplication followed by differential loss of paralogs. b – The dS distributions before ...
It is also formally possible that the differences in the rates of evolution between duplicates and singletons are due to the difficulty in the detection of paralogs of rapidly evolving genes, which would result in erroneous classification of such genes as singletons. To control for this potential bias, dS values between duplicate human genes and their most closely related paralogs were determined. Duplicate genes were then re-classified as singletons if dS for a pair of human paralogs was greater than dS between each of the respective genes and its mouse ortholog. Under this procedure, only recent paralogs were classified as duplicates. The substitution rate differences between these duplicates and the resulting set of "pseudo-singletons" were re-calculated. This procedure did not result in any qualitative change in the results; in fact, the magnitude of the difference between duplicate and singleton amino acid substitution rates slightly increased (from -0.11 in Figure to -0.14). Thus, the differences in the evolutionary rates between duplicates and singletons did not seem to be due to a detection bias.
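The re-classification rule can be written as a short predicate. This is a sketch under the assumption that "greater than dS between each of the respective genes and its mouse ortholog" means the paralog dS must exceed both ortholog dS values; the field names are hypothetical.

```python
def reclassify(gene):
    """Return 'duplicate' or 'pseudo-singleton' for a gene with a detected paralog.

    gene["ds_paralog"]          : dS between the gene and its closest human paralog
    gene["ds_ortholog"]         : dS between the gene and its mouse ortholog
    gene["ds_paralog_ortholog"] : dS between the paralog and its own mouse ortholog
    """
    predates_speciation = (
        gene["ds_paralog"] > gene["ds_ortholog"]
        and gene["ds_paralog"] > gene["ds_paralog_ortholog"]
    )
    return "pseudo-singleton" if predates_speciation else "duplicate"


# Hypothetical examples: a recent duplication and an old one.
recent = {"ds_paralog": 0.20, "ds_ortholog": 0.55, "ds_paralog_ortholog": 0.60}
ancient = {"ds_paralog": 1.40, "ds_ortholog": 0.55, "ds_paralog_ortholog": 0.60}

print(reclassify(recent), reclassify(ancient))
```

Only genes that remain in the duplicate class after this step are treated as recent duplicates when the rate difference is recomputed.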
Orthologous protein sequence pairs were identified and aligned for two more pairs of eukaryotes and for three pairs of prokaryotes, all with complete genome sequences, and the pairwise sequence alignments were used to determine amino acid substitution (evolutionary) levels for duplicate versus singleton orthologs. For both additional eukaryotic species pairs (insects and yeasts), the average ortholog amino acid substitution levels for duplicates were substantially (and statistically highly significantly) lower than those for singletons (Table and Figure ). The same qualitative pattern was seen for the prokaryotic species comparisons, with the duplicate class showing consistently lower average amino acid substitution rates (Table and Figure ). However, the differences were far less pronounced than in the case of eukaryotes, and in only one case was the average rate difference between duplicates and singletons marginally statistically significant (Table ).
Average amino acid substitution levels, with 95% confidence intervals, for orthologous pairs with paralogs (duplicates – light gray bars) and with no paralogs (singletons – dark gray bars). Species comparisons are shown on the x-axis.
A similar relationship between gene duplication and evolutionary diversity was observed when the amino acid substitution levels between orthologs were considered with respect to the number of detectable paralogs for a given ortholog. For all three eukaryotic comparisons, there are statistically significant negative correlations between the number of amino acid substitutions per site and the number of paralogs (Table ); in other words, proteins with more paralogs tend to evolve more slowly between species than proteins with fewer paralogs. However, the magnitudes of these correlations are slight (Table ). The effect was even less pronounced for the prokaryotic comparisons; while the correlations between the sequence diversity levels and the number of paralogs were all negative, the magnitudes of these correlations were quite small and none of them was statistically significant (Table ).
Correlation between ortholog substitution levels and the number of paralogs
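The kind of correlation summarized in this table can be sketched with a plain Pearson coefficient. The choice of Pearson's r is an assumption made for illustration (the text does not state which coefficient was used), and the data values are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: number of detectable paralogs per ortholog and the
# ortholog amino acid distance (substitutions per site).
paralog_counts = [1, 2, 3, 5, 8]
aa_distances = [0.30, 0.26, 0.27, 0.20, 0.15]

r = pearson_r(paralog_counts, aa_distances)
r_squared = r * r
print(f"r = {r:.3f}, r^2 = {r_squared:.3f}")
```

A negative r corresponds to the pattern described above, in which proteins with more paralogs tend to show fewer substitutions between species; r squared gives the fraction of variance accounted for, which is the quantity compared between tables.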
More striking than the relationship between evolutionary sequence diversity and the number of paralogs was the correlation between the amino acid substitution levels between orthologs and those between the most closely related paralogs. For each orthologous pair with detectable paralogs, the amino acid distances between orthologs were plotted against the distances between one of the orthologs and its most closely related paralog. For all three eukaryotic comparisons, there was a highly significant positive correlation between the two sequence divergence levels (Table ). Thus, orthologs that evolve relatively slowly between species tend to have more closely related paralogs within genomes, and orthologs that evolve more rapidly have less closely related paralogs. The r² values for these relationships were about an order of magnitude greater than those for the comparisons between sequence diversity and the number of paralogs (compare Tables and ). As in the previous cases, the relationship between ortholog and paralog amino acid substitution levels is not nearly as strong for the prokaryotes as it is for eukaryotes (Table ). Nevertheless, the connection between gene duplication and sequence evolution of orthologs in prokaryotes is most evident in this comparison, with two out of the three correlations being statistically significant (Table ).
Correlation between ortholog substitution levels and the substitution levels between the most closely related paralogs
Age of duplications and substitution levels
One advantage of the comparison of orthologs rather than paralogs is that the time of divergence is the same, namely, the time of speciation, for all analyzed orthologous pairs. Paralogous pairs, in contrast, will often have diverged via duplication at different times. In the case of orthologous proteins then, differences in substitution levels are primarily due to differences in the strength of purifying selection, whereas the apparent differences in substitution levels for paralogous protein pairs are additionally affected by differences in the time of duplication. This distinction is relevant to the comparison of evolutionary rates between orthologs versus evolutionary rates between closest paralogs [22]. As described above, there is a strong positive correlation between these rates. This correlation could indicate that proteins that are strongly conserved between species are also strongly conserved within genomes, or it could mean that proteins that are strongly conserved between species tend to have more recent duplicates in the genome. In an attempt to distinguish between these two explanations for the positive correlation between ortholog and paralog sequence divergence, gene duplications were partitioned into approximate isotemporal classes. For example, all-against-all sequence comparisons were performed for human, mouse and Fugu rubripes (Fugu), and the results were used to partition duplications into three evolutionary classes (Figure ): 1 – duplications that occurred along the human lineage (after the human – mouse divergence), 2 – duplications that occurred along the mammalian lineage (after the divergence between Fugu and the human – mouse lineage), and 3 – relatively ancient duplications that occurred along the lineage that leads to all three species (before their divergence). The same procedure was also used to partition duplications along three yeast evolutionary lineages (Figure ).
Mapping of lineage-specific expansions to individual branches of phylogenetic trees. Shown for vertebrates (a) and yeasts (b).
Once this partitioning was complete, the ortholog versus closest paralog amino acid substitution levels were analyzed independently for each of the three classes of duplicates. This procedure has the effect of normalizing (to a degree) the time of duplication so that only proteins encoded by genes that duplicated along the same evolutionary lineage are compared. When this was done, the correlations between ortholog and paralog amino acid substitution levels within each isotemporal class of duplicates became even stronger than those seen for the pooled data (Table ). This result strongly suggests that the same functional constraints govern a gene's evolution after speciation and after duplication.
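The per-class analysis amounts to grouping the duplicates by the lineage on which the duplication occurred and computing the ortholog versus paralog correlation inside each group. The sketch below assumes a Pearson coefficient and uses hypothetical class labels, field names and distances.

```python
from collections import defaultdict
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def correlation_by_class(records):
    """Correlate ortholog and closest-paralog distances within each class."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["lineage"]].append((rec["ortholog_dist"], rec["paralog_dist"]))
    return {
        lineage: pearson_r([o for o, _ in pts], [p for _, p in pts])
        for lineage, pts in groups.items()
    }


# Hypothetical records: (lineage class, ortholog distance, closest paralog distance).
raw = [
    ("human", 0.1, 0.05), ("human", 0.2, 0.10), ("human", 0.3, 0.20),
    ("mammalian", 0.1, 0.30), ("mammalian", 0.2, 0.40), ("mammalian", 0.3, 0.60),
    ("ancient", 0.1, 0.80), ("ancient", 0.2, 0.90), ("ancient", 0.3, 1.20),
]
records = [
    {"lineage": cls, "ortholog_dist": o, "paralog_dist": p} for cls, o, p in raw
]

per_class = correlation_by_class(records)
for lineage, r in per_class.items():
    print(f"{lineage}: r = {r:.3f}")
```

Because paralog distances differ systematically between classes, pooling all records mixes the effect of duplication age with that of selective constraint; computing r within each class removes most of the age component.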
Correlation between ortholog substitution levels and the substitution levels between the most closely related paralogs for lineage specific expansions (Figure 3)
Acceleration versus deceleration of gene duplicate's evolution
Both theoretical and empirical studies have previously pointed to an acceleration of sequence substitution following gene duplication [5
This is thought to be due to either a relaxation of purifying selection or the action of positive, diversifying selection (or perhaps both). For instance, when pairs of paralogs were compared to pairs of orthologs that have similar levels of protein divergence, it was shown that the paralogs had higher dN/dS values [6]. This was taken as evidence for a relaxation of selection immediately after gene duplication. Consistent with this notion, two recent studies have shown that members of duplicate pairs often evolve at significantly different rates after duplication and that the more rapidly evolving duplicates have elevated dN/dS [7]. In light of these observations, it seems surprising that we found strong evidence here that orthologs with duplicates evolve more slowly than singletons. Indeed, the evolutionary history of those orthologs that have duplicates would seem to include a period of accelerated evolution after gene duplication, whereas orthologs without duplicates are unlikely to have experienced such an acceleration. Thus, everything else being equal, duplicates would be expected to evolve faster than singletons.
To investigate this apparent contradiction, we identified triplets of genes which included a single mouse gene and a pair of human paralogs that evolved via a duplication subsequent to the human-mouse divergence (Figure ). The dN and dS values for the human paralogs in such gene sets were compared to the average dN and dS values for the mouse gene and each of its human co-orthologs. Comparisons between the human paralogs showed significantly higher average dN/dS ratios (t-test, P < 5 × 10⁻⁵) than the human-mouse ortholog comparisons (Figure ). This pattern holds across a series of increasingly stringent cut-offs based on the level of dS between paralogs (Figure ). The same pattern was also seen in a reciprocal comparison, when levels of dN and dS for human and mouse orthologs were compared to levels of dN and dS for mouse paralogs (data not shown). For the most closely related paralogs, the dN/dS ratio of paralogs was ~3-fold greater than the dN/dS ratio for the same paralogs and their single ortholog in another species (Figure ), which is remarkably close to the value determined previously with a different approach [6]. The magnitude of the difference declined for more distant paralogs (Figure ), in accord with the notion that the acceleration of evolution occurs immediately after duplication [5
Figure 4 Post-duplication relaxation of purifying selection in paralogs. a – Schematic illustrating the rationale for the comparison of dN/dS for human-mouse orthologs versus human paralogs. dN/dS levels were averaged for sets of proteins, related as shown, ...
These observations suggest, consistent with previous findings, that paralogs do indeed experience a post-duplication period of accelerated evolution, which is apparently due to the relaxation of purifying selection. These results stand in stark contrast to the finding that orthologs with duplicates are more evolutionarily conserved than orthologs with no duplicates. It seems that there are two countervailing forces at work on the sequence evolution of duplicate genes: i) acceleration of substitution between paralogs caused by relaxation of purifying selection after duplication, and ii) relative reduction of substitution rate for genes with duplicates compared to singletons, which is predicated upon the stronger functional constraints affecting the former. The post-duplication acceleration has the effect of mitigating the sequence divergence differences between duplicates and singletons. This makes the differences in substitution levels that are observed between these two classes of orthologs even more notable.
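The triplet comparison that underlies these observations can be sketched as follows. Each triplet holds one mouse gene and two human paralogs; the paralog dN/dS is set against the dN/dS obtained from the averaged dN and dS of the two human-mouse comparisons. The data layout, function names and values are hypothetical, and the dS cut-off helper is only one possible way to express the "increasingly stringent cut-offs" mentioned in the text.

```python
from statistics import mean


def triplet_ratios(triplet):
    """Return (paralog dN/dS, ortholog dN/dS) for one gene triplet.

    Each entry is a (dN, dS) tuple: 'para' for human paralog 1 vs paralog 2,
    'orth1' and 'orth2' for the mouse gene vs each human co-ortholog.
    """
    para_dn, para_ds = triplet["para"]
    orth_dn = (triplet["orth1"][0] + triplet["orth2"][0]) / 2
    orth_ds = (triplet["orth1"][1] + triplet["orth2"][1]) / 2
    return para_dn / para_ds, orth_dn / orth_ds


def mean_ratios(triplets, max_paralog_ds):
    """Average both ratios over triplets whose paralog dS is below a cut-off."""
    kept = [t for t in triplets if t["para"][1] <= max_paralog_ds]
    pairs = [triplet_ratios(t) for t in kept]
    return mean(p for p, _ in pairs), mean(o for _, o in pairs)


# Hypothetical triplets: a very recent duplication and an older one.
triplets = [
    {"para": (0.03, 0.10), "orth1": (0.06, 0.60), "orth2": (0.06, 0.60)},
    {"para": (0.08, 0.50), "orth1": (0.07, 0.60), "orth2": (0.05, 0.60)},
]

recent_para, recent_orth = mean_ratios(triplets, max_paralog_ds=0.2)
all_para, all_orth = mean_ratios(triplets, max_paralog_ds=1.0)
print(recent_para / recent_orth, all_para / all_orth)
```

In this toy data the excess of the paralog ratio over the ortholog ratio is largest when only the most recent duplication is retained, mirroring the qualitative trend described above.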
Functional distribution of duplicated genes
Taken together, the measurements of sequence diversity reported here as well as previous observations and theoretical arguments [5] suggest that the fate of duplicated genes depends greatly on their functional utility. Selection probably does continue to operate on the products of gene duplication but only in cases when the duplicates contribute substantially to organismic fitness. In order to further assess the validity of this notion, functional distributions of orthologs with and without duplicates were examined. The database of eukaryotic orthologous groups of proteins [24] was used to classify eukaryotic proteins into four broad functional categories: 1 – information storage and processing, 2 – signaling and other cellular processes (such as protein folding, degradation and trafficking), 3 – metabolism and 4 – poorly characterized. These distributions were then compared for the two classes of orthologs, those that possess duplicates and those that do not (Table ). The distributions of the observed numbers of proteins in each category were compared using a χ² test where the expected numbers were calculated based on the functional distribution for all proteins. For all three eukaryotic comparisons, the functional distributions of the proteins in the two classes – duplicates versus singletons – were shown to be significantly different (Table ). In almost all cases, the difference was most pronounced for the poorly characterized functional category; there are far fewer poorly characterized proteins among duplicates than expected. By contrast, the set of singletons is enriched in poorly characterized proteins. Thus, duplicates that are retained and conserved during evolution are enriched for proteins with known functions, particularly proteins that function in signaling and other cellular processes. Apparently at odds with this general pattern is the fact that duplicates have fewer information storage and processing proteins than expected, while singletons have more than expected. This could be due to the fact that many proteins involved in translation, transcription and replication function as multi-subunit complexes (e.g., the ribosome and RNA polymerase holoenzyme) such that duplication of the genes for individual subunits could lead to a dominant negative effect that would be selected against [26
χ² test of the functional distribution of eukaryotic orthologs (duplicates versus singletons)
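The test summarized in this table can be sketched as a goodness-of-fit calculation in which the expected counts for one class are derived from the category proportions among all proteins. All counts below are hypothetical; the category order follows the four functional categories named in the text.

```python
def chi_square_statistic(observed, reference_counts):
    """Chi-square statistic for observed counts against reference proportions.

    observed         : counts per functional category for one class (e.g. duplicates)
    reference_counts : counts per category for all proteins, used only as proportions
    """
    total_observed = sum(observed)
    total_reference = sum(reference_counts)
    statistic = 0.0
    for obs, ref in zip(observed, reference_counts):
        expected = total_observed * ref / total_reference
        statistic += (obs - expected) ** 2 / expected
    return statistic


# Category order: information storage and processing, signaling and other
# cellular processes, metabolism, poorly characterized. Hypothetical counts.
all_proteins = [300, 300, 200, 200]
duplicates = [100, 180, 90, 30]

stat = chi_square_statistic(duplicates, all_proteins)
degrees_of_freedom = len(duplicates) - 1
CRITICAL_VALUE_DF3_P05 = 7.815  # chi-square critical value for 3 df at P = 0.05

print(f"chi-square = {stat:.2f} with {degrees_of_freedom} df")
print("significant at 0.05:", stat > CRITICAL_VALUE_DF3_P05)
```

The per-category terms also show which categories drive the difference; in this toy example the largest contributions come from the signaling and poorly characterized categories.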
General discussion and conclusions
After the submission of the current work, we became aware of a very recent, independent study that reached the same major conclusion as we do here, namely that duplicate genes are, on average, more evolutionarily conserved than singletons in eukaryotes [27]. The analytical approach employed by Davis and Petrov was conceptually similar to ours in that it involved the comparison of ortholog substitution levels for genes designated as duplicates or singletons. However, an important difference between the two studies is that the approach of Davis and Petrov involved the characterization of genes as duplicates or singletons in one pair of species, Caenorhabditis elegans and S. cerevisiae, and the estimation of substitution levels in another pair of species, D. melanogaster and A. gambiae. This allowed for an estimate of substitution levels independent of the effects of gene duplication, whereas the substitution levels analyzed here were affected by duplication. We believe that it was important, as is done here, to demonstrate on the same dataset that the evolution of duplicated genes is shaped by the interplay of two opposing effects: the initial increase in substitution rate after gene duplication, and the generally lower evolutionary rate of duplicates compared to singletons.
The results reported here point to two opposing trends in the evolution of duplicate genes. For the analyzed eukaryotic species, there is a clear relationship between gene duplication and the sequence divergence of orthologs: duplicates tend to evolve more slowly, on average, than singletons. Two recent studies reported conflicting observations on the relative rates of evolution of duplicates and singletons. Yang, Gu, and Li performed a comparison of S. cerevisiae-C. albicans orthologs with and without duplicates and found that the former, on average, evolved more slowly than the latter, in qualitative agreement with the results described here [23]. In contrast, Nembaware and coworkers analyzed the evolutionary rates of human paralogs with varying levels of divergence and found that, in a human versus mouse comparison, a particular class of paralogs with intermediate divergence evolved significantly faster than singletons [22]. It remains unclear what caused this difference in conclusions. However, the statistical significance and robustness of the lower level of substitutions in duplicates compared to singletons, which was observed for all compared genome pairs (albeit to a much lower extent in prokaryotes than in eukaryotes) in the present study, strongly suggest that duplicates indeed tend to evolve more slowly than singletons.
The finding that, on average, duplicates are more evolutionarily conserved than singletons is probably explained by the fact that the duplicates that are retained by selection are of greater functional utility than those that are lost after gene duplication. Thus, the selective pressure acting on the sequences of duplicated genes is, on average, greater than that affecting the sequences of singletons. The difference in the functional distributions between duplicated and non-duplicated genes is consistent with this notion. Apparently, genes that encode proteins with domains that are already widely employed in various cellular processes are more likely to contribute to the functional diversification of an organism via gene duplication than are genes encoding proteins with more limited functional utility.
However, in accord with the previous findings [5], we also demonstrate here a substantial acceleration of sequence substitution immediately after gene duplication. Thus, the observation that, when orthologs are considered, duplicates tend to evolve more slowly than singletons is somewhat paradoxical. If one or more members of a set of paralogous genes experience a period of accelerated evolution, one might expect that, everything else being equal, this would have the effect of elevating the substitution levels between those genes and their orthologs above those characteristic of singletons. However, the results described here indicate that genes with duplicates are "more equal" than singletons in that the former, on average, are subject to more stringent purifying selection than the latter, presumably due to the relatively greater functional utility manifest in the increased likelihood of duplication fixation.
The relationship between duplication and ortholog sequence evolution also seems to be at odds with the fact that a considerable number of essential proteins, e.g., components of the core machineries of translation and transcription, do not have any paralogs but nevertheless evolve slowly. In contrast, some large multigene families, such as the immunoglobulins, encode proteins that evolve rapidly [28]. It appears that these two classes of proteins are exceptional: the former are subunits of stoichiometric complexes whose duplication is discouraged by selection due to the deleterious effects of imbalance [26], whereas the latter are adaptive lineage-specific expansions of paralogous families evolving under positive selection [30]. These well known exceptions to the general pattern reported here seem to render the relationship between gene duplication and ortholog substitution levels, which are averages based on comparisons of thousands of proteins, even more striking.