Pseudogenes are DNA sequences that resemble genes encoding functional proteins but are presumed to be nonfunctional because of mutations and truncation by premature stop codons. In this study, we focus on the largest family of pseudogenes, the processed pseudogenes of ribosomal proteins (RPs). Previous in silico studies have shown that the human genome contains thousands of processed RP pseudogenes, although there is only one functional gene for each of the 80 human RPs, with the exception of three functional RP retrotransposons [1]. The availability of numerous whole-genome sequences gives us an opportunity to carry out a comparative analysis of these pseudogenes in various organisms.
Processed pseudogenes are formed by reverse transcription of processed mRNA and integration of the resulting cDNA into the genome. In humans, this integration has been shown to be mediated by L1 transposons, and this is believed to be the primary mechanism by which processed pseudogenes are generated [6]. We chose to focus on RP pseudogenes because they constitute the largest family of pseudogenes (approximately 2000 RP processed pseudogenes). RP genes are constitutively expressed at reasonably stable levels. In addition, RP sequences are very highly conserved among species, which enables us to trace the lineages of their pseudogenes easily [7]. The large dataset of RP pseudogenes, in conjunction with several completely sequenced genomes, allows us to identify orthologous RP pseudogenes in syntenic regions.
Sakai et al. estimate that processed pseudogenes are formed at a rate of about 1-2% per gene per million years, based on an analysis of processed pseudogenes in the human and mouse genomes. Gene duplications occur at a predicted rate of 0.9% per gene per million years in the human genome and are believed to be an important resource for genome evolution. Therefore, they suggest that processed pseudogenes might also play a role in increasing genome diversity, similar to duplication events.
To date, there has been no systematic, large-scale evaluation of processed pseudogenes in syntenic regions. Although a study on kinases indicated that processed pseudogenes are not conserved between human and mouse, that study was based on a very small sample of about 100 kinase pseudogenes [9]. Suyama et al. identified and annotated genes and duplicated pseudogenes under the assumption that processed pseudogenes will not be found in syntenic regions. However, there is no a priori reason to expect this. In fact, many studies have identified transcribed processed pseudogenes, both by in silico methods and by targeted experimental analyses. Harrison et al. analyzed expressed sequence tag (EST) and microarray expression data and compiled a list of about 200 processed pseudogenes that are transcribed in the human genome. The ENCODE consortium experimentally validated the transcription of some pseudogenes. They annotated 201 pseudogenes in the ENCODE regions, two-thirds of which were processed. At least a fifth of the 201 pseudogenes were shown to be transcribed, based on pseudogene-specific RACE (rapid amplification of cDNA ends) analyses combined with tiling microarray data and high-throughput sequencing [12]. Recently, two studies have shown that processed pseudogenes regulate gene expression by means of the RNA interference pathway in mouse oocytes [13]. Another study has shown that some ABC transporter pseudogenes are transcriptionally active and that the expression of an ABC transporter protein is regulated by the expression of its pseudogene in the human genome [15]. Thus, processed pseudogenes are emerging as interesting elements in the genomic landscape that are potentially functional.
An elegant study showed that a small number of pseudogenes with high sequence identity to the parent protein are conserved between human and mouse [16]. The authors suggest that the conservation of sequence in such pseudogenes, despite their being 70 million years old (the time of human-mouse divergence), implies a functional role for them. Based on expression evidence and the fact that these conserved sequences are found in syntenic regions between human and mouse, they catalogued a set of 20 pseudogenes that could potentially be functional. The 20 pseudogenes included only two processed pseudogenes that are conserved between human and mouse. The large family of RP processed pseudogenes and the availability of whole-genome sequences of many organisms allow us to perform a comprehensive and systematic comparative analysis of RP processed pseudogenes in syntenic regions. It is conceivable that some of them would be conserved across species if they were biologically relevant. RP pseudogenes present a specific problem in that they are often mistakenly annotated as genes because of their very high sequence similarity to the parent protein. Here, we use the method developed to identify RP pseudogenes [1], which is elaborated in the Materials and methods section.
For this study, we identified processed RP pseudogenes in four genomes (human, chimpanzee, mouse and rat) using an automated pipeline [17]. We investigated the degree to which processed RP pseudogenes are conserved among the four species. Although a significant number of papers have addressed the global synteny between human, chimpanzee, mouse and rat based on DNA sequence alignments, comprehensive data on detailed local synteny are lacking [18]. To obtain well-defined syntenic regions, we defined them as sequences conserved in position between orthologous gene pairs. This is similar to methods used by others, in which synteny has been derived from local gene orthology [10