This study presents the first attempt to systematically identify adaptive gene losses in the human genome since the common ancestor of euarchontoglires, approximately 75 Mya. Using losses of well-established genes as the proxy for adaptive gene losses, we focused on identifying a class of pseudogenes that were once functional and retained this function through tens of millions of years of evolution. We confidently identified 26 losses of well-established genes, including 16 that were not previously known in the literature. The highlight of this analysis is the ability to automatically detect losses of genes bearing no significant homology to any functional paralog in the human genome. Their functional precursors had an ancient origin, but enough evolutionary time has elapsed to erase any significant homology with other genes in the human genome. These genes functioned for hundreds of millions of years and were silenced only recently, within the past 75 My.
It has been proposed that the majority of pseudogenes are either dead-on-arrival [58] or inactivated quickly after duplication [27]. Therefore, it is not surprising that we have identified far fewer pseudogenes than the thousands identified by previous whole-genome analyses that aimed to catalog unprocessed pseudogenes in the human genome [30, 31, 36, 64]. We overlapped our results with two well-known pseudogene databases: the Yale pseudogene database, composed mostly of computational predictions [64], and the VEGA pseudogene collection, compiled by manual curation [65]. We found limited overlap between the losses identified in with both pseudogene sets (Table S4). Only two out of 31 annotated, zero out of 14 hypothetical, and five out of 27 ORs were found by all three analyses. Neither database has GULO, Cardiotropin 2, or many others listed in (see note b). A recent genome scan identified 67 human-specific gene losses, including 36 ORs [58]. Excluding ORs, only one out of the six human-specific gene losses identified in , Gpr33, was also discovered in that study [58]. Another possible overlap is Ugt2b1, which belongs to a tandem cluster of Ugt2B genes on Chromosome 4. The limited overlap in part reflects the difference in methodology used to identify the pseudogenes, but also makes apparent that none of these methods in its present state is able to produce the complete set of losses of genes with ancient origins. It also confirms that we have identified some unprocessed pseudogenes derived from functional precursors of ancient origin, where evolution has erased any significant homology to their current functional paralogs.
The gene loss candidates shown in are by no means a complete list of losses of well-established genes in the human lineage during the past 75 My. The TransMap gene model prediction methodology is not perfect; many factors can introduce prediction errors, including uncertainties in sequence alignments, errors generated by the gene model prediction and evaluation procedures, and evolutionary changes of the gene structures across mammalian species (Text S1). For example, the well-known human-specific loss of CMAH (a CMP-sialic acid hydroxylase) [66] was not found by this analysis because the strict TransMap gene model predictions excluded a valid CMAH gene model in the dog genome that featured a noncanonical GC-AG splice junction. However, the use of an outgroup genome and the mRNA filter make the analysis far more likely to produce false negatives than false positives. Several other factors also contribute to this incompleteness. First, our method using human–mouse–dog comparison relied upon well-defined mouse genes to seed the search and valid dog predictions for outgroup confirmation. Problems in either one will return a false negative result. Our analysis missed MYH16 because it is not in mouse RefSeq, which could be due to an independent loss or a misannotation. We further investigated its absence and found that the MYH16 syntenic region is not present in the mouse genome, indicating an independent loss in mouse via genomic deletion. Our analysis required a valid conceptual translation in the dog genome, which may fail to occur due to TransMap prediction errors, sequencing gaps, or an independent loss in dog. However, the chance of producing a valid mapping would increase if multiple outgroups, such as the opossum genome [67] or a computationally reconstructed ancestral genome [68], were used and the resultant gene loss predictions were combined. For example, the previously documented human-specific loss of Htr5b [69] can be identified using a reconstructed boreoeutherian genome as the outgroup (Haussler lab, unpublished data). Our analysis can also be improved by extending our seed mRNAs to include those from other species and by using multiple outgroup genomes. For example, using chimpanzee MYH16 mRNA as a seed could have found this pseudogene in human.
Our analysis may not identify human polymorphic gene losses. For example, the human-specific loss of CASP12 [58, 59] was not identified by our analysis because the latest human genome assembly (NCBI release 36) has the functional allele. Several other human polymorphic losses were also missed by our analysis for the same reason [70, 71]. These polymorphic null alleles are potentially crucial to human diseases, e.g., CASP12 in sepsis and CCR5 in HIV infection. Incorporating human EST and mRNA information, as was done by Hahn et al. [70, 71], or the human SNP dataset [72], could help our method identify human polymorphic gene losses. Overlapping those alleles with human disease loci, such as those documented in the OMIM database [73] or identified by genetic association studies, might lead to the identification of new human disease-associated genes. Another factor that may cause the method to overlook gene losses is related to segmental duplication. After a gene is duplicated, both the ancestral copy (the copy in the original genomic context) and the daughter copy (the copy duplicated in the new genomic context) are equally subject to degenerative mutations. Because our analysis evaluates gene status based on the ancestral copy, a silenced daughter copy will not be identified by our method. However, this type of false negative is quite limited in our results because it only applies when a segmental duplication occurred after the boreoeutherian common ancestor. Treating the daughter copy in the same way as the ancestral copy will solve this problem, except in the case of a tandem segmental duplication, where it is difficult to distinguish the ancestral copy from the daughter copy.
Among the 26 losses of well-established genes, six were found to have been lost independently in the human and old world monkey lineages (numbers 8, 11, 12, 13, 15, 25 in ). This can be interpreted as a confirmation of adaptive evolution, if we believe that a common selection pressure forced these genes to be lost in separate clades. Other known independent losses such as Caspase15 and Gpr33 seem to confirm this hypothesis [74, 75]. An alternative interpretation is that the gene function is no longer needed, such as the loss of GULO in guinea pigs and humans [40]. However, it is also quite probable that the original loss did not occur independently in different lineages, but rather that a common mutation missed by the analysis occurred earlier in a shared ancestor and inactivated the gene. This might have been a mutation in a noncoding region, or a mutation that was erased by secondary mutations such as genomic deletions. For example, a prior, noncoding mutation in any of the six cases we found could have disrupted the transcription, translation, or regulatory signals of the gene in the common ancestor of old world monkeys and apes, rendering the gene effectively inactive at the time that these lineages split. Since the gene would no longer be under selective pressure to maintain its integrity, secondary ORF-disrupting mutations could follow, occurring independently in the separate lineages, as observed by our analysis.
To identify genes that are truly lost, we have focused on regions lacking any reported mRNA evidence, including in cell lines derived from cancer cells. A large number of candidates with differential mutational status in the human and dog gene predictions (918 out of 1,008) were filtered out because they overlap with some mRNA evidence in humans. The majority of these are likely to be TransMap prediction errors (Text S1, Table S1). However, some pseudogenes still generate transcripts if the transcription signal is intact, and these would be overlooked by our method. An example of a transcribed pseudogene in the human genome that appears on this list is CATSPER2 (chr15: 41815434–41825788), represented by GenBank mRNAs BC066967 and BC047442. The mammalian gene collection annotates it as a transcribed pseudogene. If a pseudogene is transcribed and spliced, its mRNA transcript with ORF-disrupting mutations (e.g., a premature stop codon) is targeted and degraded by the cell's RNA surveillance pathway of nonsense-mediated decay [76], although this process may not be complete. Only with time will these pseudogenes be completely silenced at the level of transcription. In addition, studies have shown that occasionally a pseudogene, like Makorin1, is not only transcribed but also plays a vital biological role in stabilizing the mRNA of its homologous coding gene [77]. Thus it is difficult to prove that a transcribed pseudogene is completely nonfunctional.
Theories of molecular evolution suggest three outcomes for new genes arising from gene duplication: degeneration due to functional redundancy, evolution into a new function, or function sharing by both copies [27]. The expected time that elapses before a gene is inactivated is thought to be relatively short [27]. Lynch and Conery estimated the half-life of a new duplicate to be around 16 My in the human lineage [27, 78, 79]. Using this estimate, after our cutoff of 50 My, 11% of redundant genes caused by duplications are expected to be intact by chance. After 60 My (the shortest estimation that passes the cutoff in ), only 7.5% will be left. Twenty-six candidates in are classified as losses of established genes using the 50 My cutoff, and many have an estimated functional period after duplication that is much longer than 50 My. This suggests that they are likely to have evolved independent functions before pseudogenization and are thus likely to be true losses of well-established genes. In addition, our method used the lower-bound estimation of the functional time length for this classification. Although the higher-bound estimations for four candidates (PFPL, ABCA14, LOC344492, BC018465 in ) satisfy the 50 My cutoff, their lower-bound estimations do not. As complete genome sequences for additional mammals become available in the future, the timing of duplication and pseudogenization can be greatly refined, potentially classifying some of these four candidates as losses of established genes as well.
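The survival percentages for redundant duplicates can be reproduced by treating pseudogenization as simple exponential decay with the 16 My half-life. The Python sketch below only illustrates that arithmetic under this assumption; the function name `surviving_fraction` and its parameters are hypothetical and are not part of the analysis pipeline.

```python
def surviving_fraction(elapsed_my: float, half_life_my: float = 16.0) -> float:
    """Fraction of redundant duplicates expected to remain intact by chance.

    Assumes simple exponential decay: after each half-life, half of the
    remaining duplicates have been inactivated. Times are in millions of
    years (My). The 16 My default is the half-life estimate cited in the text.
    """
    if half_life_my <= 0:
        raise ValueError("half_life_my must be positive")
    if elapsed_my < 0:
        raise ValueError("elapsed_my must be non-negative")
    return 0.5 ** (elapsed_my / half_life_my)


if __name__ == "__main__":
    # 50 My is the classification cutoff; 60 My is the shortest estimate
    # that passes the cutoff, as stated in the text.
    for elapsed in (50, 60):
        print(f"{elapsed} My: {surviving_fraction(elapsed):.1%}")
```

With a 16 My half-life this gives roughly 11% at 50 My and roughly 7.4% at 60 My, in line with the approximate values quoted in the text.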
It is nontrivial to determine whether the losses we have found were truly adaptive. It is very likely that neutral losses at dispensable loci account for a subset of our results. For example, GULO, a vitamin C biosynthesis gene, is thought to have been lost in primates because primates have an ample dietary supply of ascorbic acid, reducing or removing the selective pressure that maintains this gene. In general, it is difficult to differentiate between neutral loss due to removal of selective pressure, as proposed by the “use it or lose it” hypothesis, and positively selected adaptive loss, as proposed by the “less is more” hypothesis, without knowing the gene's precise biological functions. Given our current knowledge of human genes, identifying the losses of established genes seems to be the best strategy in the search for more ancient (before 250 Mya) adaptive gene losses on a genomic scale. The resulting list is a much more enriched set of candidates.
In summary, our analysis identified a set of losses that are highly enriched for well-established genes in the human genome against a large background of pseudogenes. Expanding these results to include genes and genomes from the entire mammalian clade will generate a more accurate and comprehensive picture of adaptive gene losses in human evolution. From a theoretical standpoint, it will provide insight into the role that loss of functional genes plays in evolutionary adaptation [4]. The method presented here can also be generalized to discover gene losses in other organisms on a genomic scale.