Genes in the human genome vary in their evolutionary age. A considerable proportion of human genes (e.g., ~10%, even only considering “strict orthologs” with unambiguous one-to-one relationships [
Berglund et al. 2008]) can be detected in the yeast genome, implying that they originated before the human and yeast lineages diverged more than 1.5 billion years ago. On the other hand, the human genome contains a small fraction of genes found in only one or a few closely related species, such as mammal- or primate-specific genes (e.g.,
morpheus [
Johnson et al. 2001] and
SPANX [
Kouprina et al. 2004]). Recent bioinformatics analysis revealed 270 primate-specific and 364 mammal-specific genes; some of them may have originated de novo (
Toll-Riera et al. 2009;
Toll-Riera, Castelo, et al. 2009). Indeed, there is increasing experimental evidence for the emergence of new genes from noncoding mammalian genomic regions (
Heinen et al. 2009;
Knowles and McLysaght 2009).
We have classified human/chimp genes based on the breadth and the depth of their phylogenetic distributions in 11 eukaryotic genomes using three related but distinct metrics that quantify the breadth (LS), the depth (PL), and the rate of GL (
Krylov et al. 2003;
Alba and Castresana 2005;
Cai, Woo, et al. 2006). We confirmed that younger and less broadly distributed genes evolve at distinctly higher rates than older and more broadly distributed genes (
Domazet-Loso and Tautz 2003;
Daubin and Ochman 2004;
Alba and Castresana 2005;
Wang et al. 2005;
Cai, Woo, et al. 2006;
Kuo and Kissinger 2008). This pattern is very pronounced: for instance, the correlation coefficient between one of the measures of phylogenetic breadth and depth (LS) and the rate of protein evolution between humans and chimps (Ka or Ka/Ks) is higher than 0.5. Another illustration of the strength of this signal is that human/chimp genes that cannot be detected in the mouse genome and beyond have been evolving approximately 4 times faster between humans and chimps than the human/chimp genes whose presence can be detected all the way to yeast. In addition, this effect is robust to variation in gene expression levels, the existence of paralogs, the relative abundance of CpG sites, the GC content of genomic regions, and classes of gene function (i.e., GO annotations). The age of a gene or the breadth of its phylogenetic distribution is thus one of the best predictors of its rate of evolution (
Alba and Castresana 2005;
Cai, Woo, et al. 2006).
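The kind of correlation described above can be illustrated with a minimal sketch. It assumes a rank (Spearman) correlation, which the text does not specify, and it uses made-up LS classes and Ka/Ks values purely for illustration; the function names are hypothetical and do not come from the original analysis.

```python
from statistics import mean

def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical data: one LS class and one Ka/Ks value per gene.
ls_class = [1, 1, 2, 3, 5, 7, 9, 10]
ka_ks = [0.05, 0.08, 0.07, 0.15, 0.20, 0.18, 0.45, 0.60]
print(round(spearman(ls_class, ka_ks), 3))
```

Ranks are used here because LS classes are ordinal, so a rank-based coefficient does not depend on how the classes are numbered.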
The fast evolution of genes that have a restricted phylogenetic distribution raises the possibility that even old and broadly distributed but fast-evolving genes might be misclassified as young and lineage specific because their orthologs cannot be detected in distant species (
Elhaik et al. 2006). Fortunately, this entirely reasonable concern appears not to generate severe ascertainment problems in practice.
Alba and Castresana (2007) simulated the evolution of protein-coding genes using the same overall evolutionary rates and the same among-site rate heterogeneity as observed in mammalian genes. They found that Blast could detect practically all genes in this analysis all the way to the level of divergence between yeast and mammals. This is probably because even fast-evolving proteins tend to contain some conserved segments. These conserved segments, even if they are fairly short, can still be detected by the local alignment algorithm of Blast. One of our phylogenetic measures, PL, depends exclusively on Blast to determine gene age and should be reliable based on the simulations of
Alba and Castresana (2007). One of the other measures, LS, should be at least as sensitive as PL and thus should not be affected severely either. We provided two additional lines of evidence that our results are not artifactual. First, we split the genes into two groups based on their rate of evolution between humans and chimps. We were able to detect faster evolution of younger and more narrowly distributed genes within each group and, most importantly, within the group of slowly evolving genes. The second line of evidence is based on the GL measure. This measure classifies genes by the number of losses detected across the phylogeny and is restricted to genes that can be detected in the most distant taxa, in our case human/chimp and yeast. In the case of GL, all human/chimp genes can be detected in yeast, making it very unlikely that the apparent absence of these genes in much more closely related lineages is due to a failure of detection rather than to their true absence.
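The first robustness check described above (splitting genes by rate and looking for the age effect within each half) can be sketched as follows. This is only an illustration under stated assumptions: the median is assumed as the split point, genes are reduced to a hypothetical young/old label, and the rates are invented; none of these details are given in the text.

```python
from statistics import mean, median

def stratified_age_effect(genes):
    """Compare young and old genes within slow and fast rate groups.

    genes: list of (rate, is_young) pairs, one per gene.
    Returns {group: (mean rate of young genes, mean rate of old genes)}
    for each group that contains both young and old genes.
    """
    # Assumption: split at the median rate (the text says only "two groups").
    cutoff = median(rate for rate, _ in genes)
    groups = {
        "slow": [g for g in genes if g[0] <= cutoff],
        "fast": [g for g in genes if g[0] > cutoff],
    }
    result = {}
    for name, members in groups.items():
        young = [rate for rate, is_young in members if is_young]
        old = [rate for rate, is_young in members if not is_young]
        if young and old:
            result[name] = (mean(young), mean(old))
    return result

# Hypothetical (rate, is_young) values for eight genes.
genes = [
    (0.05, False), (0.08, False), (0.10, True), (0.12, True),
    (0.30, False), (0.35, False), (0.50, True), (0.60, True),
]
for group, (young_mean, old_mean) in stratified_age_effect(genes).items():
    print(group, round(young_mean, 3), round(old_mean, 3))
```

If the age effect were only a detection artifact driven by fast evolvers, one would not expect young genes to stand out within the slow group, which is why that group is the informative one.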
The faster protein evolution of younger or more narrowly distributed genes must be due to changes in the way natural selection operates on mutations in these genes. It is not due to differences in mutation rates because the patterns of evolution at synonymous sites in younger genes are indistinguishable from those in older genes. In addition, these patterns are robust to the variation in GC content across the human/chimp genomes, which in principle could generate spurious signals. But what are these changes in natural selection? There are two possibilities that are not mutually exclusive: 1) younger genes can be subject to weaker selective constraint (weaker purifying selection) and/or 2) younger genes are subject to positive selection more frequently.
We have used genome-wide SNP data in humans and divergence data between human and chimp to demonstrate that at least the first possibility is true. Younger and less broadly distributed genes are subject to substantially less selective constraint. The weaker constraint is evident in the higher density and higher population frequencies of nSNPs in younger genes. In fact, nSNPs in the youngest genes segregate at the same frequencies as sSNPs, whereas the frequency of nSNPs is substantially reduced in the older genes. These results hold for each of the three SNP data sets that we used, namely the Applera, dbSNP, and Perlegen data sets. In addition, we observed a clear anticorrelation between the fraction of nearly neutral mutations and gene age; that is, the younger the genes, the higher the proportion of new mutations in them that are nearly neutral. The pattern is strong: the proportion can increase as much as 4-fold from the oldest genes to the youngest genes (see Results). One reason for the weaker selective constraint in younger and less broadly distributed genes is that these genes might be less functionally important or at least less consistently important than older and more broadly distributed genes. Consider a gene that can be found in the genomes of yeast and humans and in every taxon in between. It is clear that such a gene is not only old but also has a very low probability of loss due to inactivating mutations. This implies that inactivating mutations in such genes are consistently strongly deleterious, most likely because such genes perform important or even essential functions. In such genes, as surmised by
Wilson et al. (1977), even subtle amino acid mutations would tend to lead to sufficiently strong deleterious effects to be noticed by natural selection. In contrast, a substantial proportion of younger genes and especially genes with patchy phylogenetic distributions either have been lost in some lineages or at least we have no evidence that they cannot be lost. Indeed, given that genes are formed all the time by a variety of mechanisms while the number of genes within genomes does not continuously increase, we can surmise that a substantial proportion of younger genes are destined to be lost over relatively short periods of time (see also
Wolf et al. 2009). This means that for many of the younger genes even null mutations are not always strongly deleterious. It is not surprising then that such genes show weaker selective constraint against more subtle amino acid-changing mutations. We emphasize that the gene age effect should be taken as a prior when studying the fitness effects of mutations in genes. Our analysis has been restricted to human genes; however, the patterns we found should be applicable to other species, especially given that a general birth-and-death model has been found applicable to genes in multiple lineages (
Wolf et al. 2009).
We used two approaches (DoFE [
Eyre-Walker and Keightley 2009] and MKPRF [
Sawyer and Hartl 1992;
Bustamante et al. 2002,
2005]) to test the second possibility, namely that younger genes experience a higher rate of positive selection. Using DoFE, we estimated ωA for each LS class of genes. We detected no correlation between LS and the corresponding ωA across LS classes, providing no evidence of a higher prevalence of positive selection in younger genes. However, using MKPRF, we did find some evidence that there were proportionally more genes showing signs of positive selection (γ > 0) in younger age classes. The proportion of genes with a positive γ goes from ~1–2% in the oldest genes to ~6–12% in the more lineage-specific genes (LS groups 7 through 10). Because this result can be biased by the higher prevalence of slightly deleterious nSNPs in the older genes, we reran the analysis either after eliminating rare (<15%) SNPs (
Fay et al. 2001) or after subsampling nSNPs in different LS categories to match that in the youngest and the least biased gene category. Furthermore, MKPRF results might be affected by the choice of a different prior and the use of different models (hierarchical vs. nonhierarchical) (
Li et al. 2008). In all these additional analyses, MKPRF results continue to suggest that a higher proportion of younger genes exhibit signs of positive selection. The inconsistent results produced by the two methods emphasize that the evidence for a higher rate of adaptive evolution among younger genes is tentative. Among other things, the difficulty of detecting the difference could be due to a weak genome-wide signal of positive selection associated with human protein-coding genes in general.
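To make the role of the rare-SNP filter concrete, here is a minimal sketch based on the classic McDonald-Kreitman style estimator of the fraction of adaptive substitutions, alpha = 1 - (Ds * Pn) / (Dn * Ps). This is a deliberately simpler stand-in and is not DoFE or MKPRF; the function name, the counts, and the SNP frequencies are all hypothetical, and the 15% cutoff simply mirrors the threshold quoted above.

```python
def mk_alpha(dn, ds, nonsyn_freqs, syn_freqs, min_freq=0.0):
    """Estimate alpha = 1 - (Ds * Pn) / (Dn * Ps).

    dn, ds: counts of nonsynonymous and synonymous fixed differences.
    nonsyn_freqs, syn_freqs: population frequencies of nSNPs and sSNPs.
    min_freq: SNPs below this frequency are dropped before counting,
              which removes many slightly deleterious rare variants.
    """
    pn = sum(1 for f in nonsyn_freqs if f >= min_freq)
    ps = sum(1 for f in syn_freqs if f >= min_freq)
    if dn == 0 or ps == 0:
        raise ValueError("alpha is undefined when Dn or Ps is zero")
    return 1.0 - (ds * pn) / (dn * ps)

# Hypothetical counts and frequencies for one class of genes.
dn, ds = 40, 100
nonsyn = [0.02, 0.05, 0.20, 0.40]
syn = [0.10, 0.20, 0.30, 0.50, 0.60]

print(mk_alpha(dn, ds, nonsyn, syn))                 # all SNPs
print(mk_alpha(dn, ds, nonsyn, syn, min_freq=0.15))  # rare SNPs removed
```

In this toy case the estimate rises once rare SNPs are removed, because rare nonsynonymous variants inflate Pn relative to Ps. That is the direction of bias the filtering step is meant to reduce.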
Nevertheless, a higher rate of adaptation in young genes would be consistent with the idea that lineage-specific genes may drive morphological specification, enabling organisms to adapt to changing conditions (
Khalturin et al. 2009) and also with the observation that young genes tend to be less functionally important. Fisher’s geometric model of adaptation predicts that small phenotypic changes should have a higher probability of being advantageous (
Fisher 1939) (but see [
Kimura 1983;
Orr 2002]). If mutations in younger genes tend to have more subtle phenotypic effects, then such effects would be both less likely to be deleterious and more likely to be adaptive. In this way, older, indispensable proteins would form the conserved, ancient, unchanging core of functionality of the cell and the organism, whereas the newly added and patchily distributed genes would not only contribute directly to genic and functional diversity among lineages but also disproportionately underlie their continuous adaptation to environmental changes. Furthermore, if adaptation preferentially takes place in young and lineage-specific genes while deleterious mutations preferentially land in ancient and shared genes, then the ways in which organisms fail would resemble each other more than the ways in which they adapt. A case in point is that most human genes with known disease-causing mutations do tend to be old (
Domazet-Loso and Tautz 2008;
Cai et al. 2009). This is good news for the investigation of human disease through the investigation of even distantly related animal models.