SDs, paralogs and pseudogenes are all products of duplication events in the genome and hence are closely related. We have performed an integrative analysis of these elements, which provides additional information about the formation of pseudogenes and sheds light on the underlying genomic processes. We present a rigorous classification scheme based on the presence of pseudogenes in SDs.
It is known that genes are enriched in SDs, with ~17.8% of genes located in SDs (11). We find that ~44.4% of duplicated pseudogenes and ~21.1% of processed pseudogenes are located in SDs. The alignments of the SD regions that contain them reveal the ‘true parents’ of the pseudogenes and allow annotation of duplicated–processed pseudogenes: pseudogenes that result from the duplication of other processed pseudogenes. Kim et al. (5) reported that processed pseudogenes show a significant association with SDs and, from the presence of highly similar processed pseudogenes at SD junctions, concluded that processed pseudogenes may have contributed to SD formation in some cases. This might partly explain the enrichment of processed pseudogenes in SDs observed in the current analysis. However, the high enrichment of processed pseudogenes in SDs supports our view that most processed pseudogenes located in SDs are in fact the result of duplication events: mostly of other processed pseudogenes (duplicated–processed) and in some instances (case SPDi (B)) of parent genes (previously misassigned as processed pseudogenes).
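The percentages above come down to an interval-containment count, which can be sketched as follows. This is a minimal illustration, not the pipeline used in the analysis: the function names, the half-open `(chrom, start, end)` coordinate convention, and the rule that an element counts as "located in SDs" only when it lies entirely inside a merged SD interval are all assumptions made for the example.

```python
from bisect import bisect_right
from collections import defaultdict


def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]


def fraction_in_sds(elements, sds):
    """Fraction of elements lying entirely within a merged SD interval.

    elements, sds: lists of (chrom, start, end) with half-open coordinates.
    Full containment is an assumed criterion, not one stated in the text.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in sds:
        by_chrom[chrom].append((start, end))
    merged = {c: merge_intervals(iv) for c, iv in by_chrom.items()}
    starts = {c: [s for s, _ in iv] for c, iv in merged.items()}

    inside = 0
    for chrom, start, end in elements:
        if chrom not in merged:
            continue
        # Index of the last merged SD interval starting at or before `start`.
        i = bisect_right(starts[chrom], start) - 1
        if i >= 0 and end <= merged[chrom][i][1]:
            inside += 1
    return inside / len(elements) if elements else 0.0
```

Running the same function separately on genes, duplicated pseudogenes and processed pseudogenes would give the three fractions being compared. A partial-overlap criterion would change the counts, so the containment rule should be matched to whatever definition the annotation actually uses.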
When the pseudogene and parent gene align with each other within the two larger SD regions (case SPDi), we compare the number of nucleotide substitutions per site in the pseudogene with that in the SD region containing it. This comparison indicates that most pseudogenes were likely disabled at roughly the same time as the original duplication and have been evolving neutrally since then. We note that this conclusion applies to pseudogenes in this category (case SPDi) even if the pseudogene is the direct result of duplication of another pseudogene. In such cases, the first pseudogene, formed by duplication of the segment containing the parent gene, likely started evolving at a neutral rate of nucleotide substitution after the disablement event; a second SD event then gave rise to a second pseudogene that continued to evolve neutrally. Hence, a similar number of nucleotide substitutions per site in the second pseudogene (relative to the parent) and in the larger SD region containing it (relative to the SD region containing the parent) indicates that the initial disablement (in the first pseudogene) likely occurred at the same time as the initial duplication event.
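The comparison described above can be illustrated with a short sketch that estimates substitutions per site from a pairwise alignment. The text does not say which substitution model was used; the Jukes-Cantor correction below is an assumption chosen for simplicity, and the function name is hypothetical.

```python
import math


def substitutions_per_site(seq_a, seq_b):
    """Jukes-Cantor estimate of substitutions per site for two aligned sequences.

    Columns containing a gap or any non-ACGT character are skipped.
    The choice of model is an assumption, not taken from the paper.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        mismatches += a != b
    if compared == 0:
        raise ValueError("no comparable sites")
    p = mismatches / compared  # observed proportion of differing sites
    if p >= 0.75:
        return math.inf  # the correction is undefined at saturation
    return -0.75 * math.log(1 - 4 * p / 3)
```

Applying the function once to the pseudogene/parent alignment and once to the alignment of the two enclosing SD regions yields the two values being compared. Similar values are what the argument above reads as disablement occurring close to the time of duplication.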
We find that, even though the enrichment of pseudogenes in SDs is not due to the presence of olfactory receptor (OR) pseudogenes, 98 out of 300 (~32.7%) OR pseudogenes that were classified as processed pseudogenes by PseudoPipe are found in SDs. ORs form the largest mammalian gene superfamily and it is estimated that ~63% of them are actually non-functional duplicated pseudogenes (29). OR genes consist of single protein-coding exons, so the absence of introns, a hallmark of processed pseudogenes, is uninformative for them, and the classification of OR pseudogenes as processed or duplicated by PseudoPipe can be unreliable. We think that most OR pseudogenes previously classified as processed by PseudoPipe are in fact the result of duplication events, and we have now changed their annotation to duplicated pseudogenes. Indeed, 62 out of 98 OR pseudogenes in SDs fall under case SPDi (B), where the parent gene and pseudogene align with each other in SD pairs.
We note that although the SD data provide pair-wise information, we extract from these data the entire set of regions in which a particular segment is duplicated. Additionally, since the pair-wise SD data do not provide information about the directionality of duplications, we use a set of ancestral loci obtained previously in a separate study by Jiang et al. (17). We find that, amongst parent genes and pseudogenes found in SDs, a higher percentage of parent genes (of duplicated pseudogenes) than of pseudogenes is located on these ancestral loci of SDs.
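One way to go from pair-wise SD records to the full set of copies of a segment is to treat each pair as an edge and collect connected components. The sketch below uses a union-find structure for this. It is only an illustration under a strong simplifying assumption: each region carries a stable identifier shared across all pairs in which it appears, whereas real SD coordinates overlap only partially and would need to be reconciled first. The function name is hypothetical.

```python
def group_sd_pairs(pairs):
    """Group region identifiers into families linked by pair-wise SD records.

    pairs: iterable of (region_a, region_b) hashable identifiers.
    Returns a list of sets, one per connected component.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for region in list(parent):
        groups.setdefault(find(region), set()).add(region)
    return list(groups.values())
```

Each resulting set contains every region reachable through a chain of pair-wise records. The grouping says nothing about which copy is ancestral, which is why a separate set of ancestral loci is still needed for directionality.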
A limitation of our current analysis is its dependence on annotated SDs, which correspond to relatively recent duplications in the genome (≥90% sequence identity). For instance, some processed pseudogenes that are not located in SDs could be the result of older duplication events of other processed pseudogenes, but we are unable to label them as duplicated–processed based on the current analysis. However, we note that the classification scheme presented in this article does not fundamentally rely on the SD definition and can be applied to a different set of SDs obtained using a lower sequence similarity criterion. We have demonstrated that this classification scheme integrates knowledge of the entire length of sequence copied during the duplication events (the SD regions) with the pseudogene data, and yields significant additional insight that cannot be obtained solely from the sequence homology between parent genes and pseudogenes.
It is interesting to note that while SDs are hotspots for various kinds of structural variation such as insertions, deletions and inversions (4), both genes and pseudogenes are enriched in these regions. It is possible that the genes and pseudogenes located in SDs are strongly correlated with the variations between individuals. Hence, in the search for polymorphic genes or polymorphic pseudogenes (genes that are functional in certain populations and rendered non-functional in others) (30), the genes and pseudogenes located in SD regions would perhaps be the best candidates for further investigation. With individual genomics data becoming available at an unprecedented rate, the variability of these pseudogenes in different populations would be the focus of future studies (31).