|Home | About | Journals | Submit | Contact Us | Français|
Although our knowledge of the genes and genomes of extinct organisms is improving as a result of progress in sequencing ancient DNA, the transcriptomes of extinct organisms remain inaccessible, owing to the rapid degradation of messenger RNA after death. We provide empirical evidence that gene expression levels in the reproductive tissues of mice and during early mouse development correlate highly with the rate of inherited retroposition: the source of processed pseudogenes in the genome. Thus, processed pseudogenes might serve as fossilized footprints of the expression of their parent genes, shedding light on ancient transcriptomes that could provide significant insights into the evolution of gene expression.
Sequencing the genomes of ancient extinct organisms is a ‘dream come true’ for evolutionary geneticists. With the unraveling of whole or partial mitochondrial and nuclear genome sequences of extinct species, including the woolly mammoth [1-3] and Neanderthal [4-6], many important evolutionary events, from human origins to past epidemics and crop domestications, can be dissected to an unprecedented resolution. For example, Neanderthal genome sequences have and will continue to inform us about the potential genetic contribution of Neanderthals to the contemporary human gene pool [4-7] and will enable the identification of human-unique genetic changes that occurred in the past few hundred thousand years of human evolution.
Research in evolutionary developmental biology repeatedly demonstrated the importance of gene expression changes in phenotypic evolution [8-10]. However, owing to the instability of message RNA (mRNA), its extraction from ancient organisms is beyond reach. Although prediction of gene expression levels and patterns from DNA sequences has been attempted in unicellular model organisms , it is not yet possible in complex multicellular organisms, particularly if the prediction is based on ancient DNA sequences that are incomplete or inaccurate. Although it is theoretically possible to infer the expression of a gene in an ancestral organism using the expression data from extant organisms , such inferences are likely to be unreliable, because gene expression evolves rapidly [13-15]. Here we explore the possibility that processed pseudogenes provide information about past gene expression levels and present an analytical framework for retrieving such information.
The genomes of many eukaryotes, including humans, are ‘littered’ with pseudogenes – defunct relatives of known genes that have lost their protein-coding ability. Many pseudogenes originate through germline retroposition, where the mRNA of a gene is reverse-transcribed to complementary DNA and reinserted into the genome at random . Most of these re-inserted retrogenes lack promoters and are ‘dead-on-arrival’. They gradually accumulate open-reading-frame (ORF)-disrupting mutations, and are eventually recognizable as processed pseudogenes (Pψs). Pψs have been used for archiving splice variants  and for studying rates and patterns of neutral mutations . In principle, the number of Pψs of a given parent gene depends on the abundance of the mRNA of the parent gene in the germline, which is determined jointly by the rates of mRNA production and degradation [19, 20]. Indeed, a large proportion of Pψs are from housekeeping genes, which generally are highly transcribed and have stable mRNAs . However, a recent comparison between the expression level of a gene with the number of its ‘Pψ offspring’ failed to find gene expression level to be a dominant determinant of the rate of retroposition . Here we hypothesize that the present-day expression level of a gene should be best correlated with the number of its young Pψ offspring, not all Pψ offspring, because a gene's expression in the germline might not have been constant during evolution. Because retroposition occurring at earlier stages of development (i.e. embryogenesis) are more likely than that in later stages to impact the germline, we examined mouse expression data from a variety of tissues at different developmental stages (see supplementary material online). Our datasets include the comprehensive catalog of mouse pseudogenes (www.pseudogene.org) and 23 mouse microarray expression datasets from Gene Expression Omnibus (www.ncbi.nlm.nih.gov/geo/) (Table S1).
As a proxy for Pψ age, we calculated the nucleotide sequence divergence (d) between each of 5,315 Pψs and their respective parent genes in mouse, at third codon positions, using Kimura's 2-parameter distance . We then binned the pseudogenes in increments of d =0.05. The age distribution of Pψs in our dataset (Figure 1a) reflects the combined effects of an exponential decay of nonfunctional sequences in the genome and an accrual of pseudogene characteristics over time. The former effect is due to the gradual deletion of nonfunctional sequences from the genome and gradual divergence of pseudogenes from their parent genes beyond recognition. The latter effect occurs because very recently formed Pψs are under-detected due to the lack of ORF-disrupting substitutions (also below). The age distribution of Pψs is also highly influenced by the cellular reverse transcriptase activity in the past, which in mammals is determined by the activity of long interspersed element 1 (LINE1) . We correlated parent gene expression in various wild-type tissues at various developmental stages with the number of Pψ offspring of different age groups. We found the correlations to be greater for expression levels in reproductive and embryonic tissues than for those in somatic tissues (e.g. lung, kidney, liver, and neural tissues) across all Pψ age categories (Figure 1b; see also Figure S1 and Table S1 in the online supplementary material) When considering all Pψs younger than d (see earlier definition of d), we observed the correlation to reach its maximum at d =0.25 (Figure S1). In theory, the lower the d value, the higher the correlation. In reality, however, very recently formed Pψs are under-detected (see above). Based on the rates of point and insertion/deletion mutations in mammals, we calculated that retropseudogenes should be detectable when d =0.1 (see Supplementary Methods). The likely reason that the correlation is not higher for d =0.1 than d =0.25 is that the former group contains much fewer genes than the latter group. We divided all Pψs into three broad age groups that may be referred to as young (d <0.25), middle (d =0.25-0.65), and old (d >0.65) Pψs. It is apparent that the correlation is much higher for the young group than for the middle and old groups (Figure 1b). The highest correlation was found between the parent gene expression levels during embryonic gonad development and the number of Pψ offspring with d <0.25 (Pearson's R =0.54, P <10-115; Spearman's ρ =0.38, P <10-50; Figure 1c). For this dataset, one can indeed use the number of Pψs to predict the expression levels of their parent genes to some extent. For example, expressions of parent genes with 5 Pψs are significantly higher than those with 4 Pψs (P =0.003, one-tail Mann-Whitney U test; see additional results in Table S2). Note that because microarray-based measurement of gene expression is imprecise, particularly for comparisons among genes as in the present case , the actual correlation between parent gene expression and Pψ offspring number is expected to be even greater. These results confirm that expression levels of parent genes affect the rate of retroposition and is consistent with the understanding that inheritable retroposition occurs in germlines or in early embryos before the separation of germlines. The relatively high correlation (R ≈0.3) for some somatic tissues is probably caused by the presence of genes in the data set (especially housing-keeping genes) with correlated expression between somatic tissues and germlines.
Given the strong correlation between the number of recent Pψs and the current expression levels of their parent genes, Pψs of different ages can be treated as fossilized footprints to infer past expression levels of parent genes in embryonic or reproductive tissues, which might provide insights into the evolution of gene expression. For example, overall, 62%, 28%, and 10% of mouse Pψs fall into young (d <0.25), middle (d =0.25-0.65), and old (d >0.65) age groups, respectively (Figure 2a). By contrast, mouse Hmgb2 (encoding the high mobility group B 2 protein), which is important for male gonad development and spermatogenesis , gave rise to 65 Pψs, of which 16 (25%), 8 (12%), and 41 (63%) belong to the three age groups, respectively. This age distribution is significantly different from the genomic distribution (χ2 test, P <10-45) and shows a decrease in the expression of Hmgb2 (compared with the genomic average) in the evolution leading to mouse (Figure 2a). By examining all parent genes in a likewise fashion, we observed that 59 (3.2%) mouse and 37 (1.9%) human parent genes differ significantly from their respective genomic averages in age distribution at a false discovery rate of 1%, presumably resulting from evolutionary changes in expression level in embryonic tissues or germlines (Figure 2b; Table S3). A similar approach can be used to study the evolutionary trajectory of expression divergence between duplicate genes.
Here we showed strong correlation between the number of recent Pψs and the current expression levels of their parent genes in mouse. Because this phenomenon is a result of the retroposition rate being determined by the abundance of mRNA, we predict that similarly strong correlations also exist in other organisms, although it is not feasible to validate this prediction at this time, owing to the lack of appropriate gene expression data from germlines or during early development. We illustrated the utility of Pψs in analyzing gene expression evolution. The overall accuracy of similar analyses in any extant or ancient genome depends on three important factors: identification of Pψs, identification of the parent genes of Pψs, and the estimation of Pψ ages. Identification of Pψs is not a trivial task and it is usually executed by searching in a sequenced genome for protein-coding-gene homologs that contain ORF-disrupting substitutions and lack introns. By design, this approach is limited to Pψs of intron-containing parent genes, due to the difficulty in distinguishing between processed and non-processed pseudogenes of intron-less parent genes (~10% in mouse ). Furthermore, because it takes on average a few million years for ORF-disrupting mutations to occur and get fixed in Pψs (of humans and related primates) , very young Pψs will be under-detected, hampering the study of parent gene expression in the very recent past. Other factors, such as the E-value cutoff used in the BLAST search of homologous sequences, also affect the identification of Pψs. For example, very old Pψs are hard to detect even with loose E-value cutoffs, simply because these Pψs are too different in DNA sequence from their parent genes. But, as long as the same criteria are applied to all Pψs, the bias for individual parent genes is expected to be minimal.
Currently, Pψs are assigned to their functional parent genes by BLAST searches and are assumed to have arisen from their parent genes through retroposition. However, it is possible that a Pψ is secondarily generated by simple duplication of another Pψ, rather than by retroposition of its functional parent gene. In theory, the number of duplication-generated Pψs is expected to be proportional to the number of retroposition-generated Pψs. Thus, treating all Pψs as retroposition-derived should not bias the age distribution of a parent gene relative to that of the genomic average. In reality, however, segmental duplication rate varies widely across the genome and it might be necessary to differentiate between duplication-generated and retroposition-generated Pψs. For this purpose, phylogenetic analysis will be useful, as the sister clade of each duplication-generated Pψ would not include the functional parent gene. Another complication in identifying the parent gene of a Pψ is that the parent gene might no longer exist, for two reasons. First, the parent gene might have been lost during evolution, which will lead to the identification of a wrong parent gene or no parent gene, depending on whether the true parent gene has a paralog in the genome. Second, a parent gene might have been duplicated to generate two (or more) daughter genes since the birth of the Pψ that is under investigation. In this case, BLAST can assign the Pψ to either daughter gene, depending on which one evolves more slowly, although the true parent gene should be the common ancestor of these daughter genes and can be identified through phylogenetic analysis.
We measured the age of a Pψ by its nucleotide sequence divergence d from its functional parent gene at the parent gene's third codon positions. This approach is applicable to both retroposition-generated and duplication-generated Pψs. However, it could lead to an underestimation of the ages of Pψs of more conserved parent genes than those of less conserved parent genes. This is because, even at third codon positions, a substantial fraction of mutations are nonsynonymous (~31% if there is no transition/transversion bias), which are less likely to be fixed in a conserved parent gene than in a less conserved one. Thus, for two Pψs of the same true age, the one derived from the more conserved parent gene tends to look younger than the one derived from the less conserved parent gene. This bias could be alleviated if one calculates d using only four-fold degenerate sites of parent genes, but such d estimates might have larger variances owing to smaller numbers of usable sites. Because transcribed but untranslated regions (UTRs) of parent genes are also frequently present in Pψs, they may be used for estimating d as well. However, because UTRs of parent genes are subject to stronger functional constraints than four-fold degenerate sites , they are less ideal for Pψ age estimation.
We used the Pψ information to infer relative changes of gene expression during evolution. Is it possible to infer the absolute gene expression level in the past? We found that even among parent genes with the same number of young Pψs, their log2-transformed microarray expression levels still vary substantially, with a standard deviation that often reaches a quarter of the mean (Table S2). Thus, it is probably difficult to infer precisely the absolute expression level of a gene in the past from the number of its Pψs.
These caveats notwithstanding, our results point to a fascinating observation that the Pψ record embedded in the genome of an organism provides a uniquely informative glimpse of past gene expression levels, which complements the knowledge gained from sequencing ancient extinct genomes to provide a more comprehensive picture of evolution. A recent study suggested that gene expression differences between species are generally smaller at the protein level than at the mRNA level . Our approach can infer mRNA expression changes, but not protein expression changes. Future research should improve the reliability of the retrieval of ancient transcriptome information and apply it to the study of evolution.
We thank members of the Zhang laboratory and three anonymous reviewers for valuable comments. This work was supported by research grants from the National Institutes of Health to J.Z.
Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.