Pseudogenes are defined as genes that have lost their ability to produce a functional protein. Although such relics have been identified in all genomes, the number and persistence of pseudogenes vary greatly among species: in humans, the estimated number of pseudogenes ranges from 10,000 to 20,000 [1, 2], while in Drosophila, only 110 pseudogenes (or 1 pseudogene per 130 genes) were identified [3]. Pseudogenes are hypothesized to arise by gene duplication, including retrotransposition, during which a retrotransposase mediates the integration of a transcript into the genome [4] (see Additional file 1). Since they are redundant with the genes from which the transcript originated (hereafter termed parent genes) and are integrated without a promoter into random locations in the genome, the products of retrotransposition events are likely to be nonfunctional and to accumulate disabling mutations faster than functional genes. In such cases, they are termed retrotransposed pseudogenes or processed pseudogenes. In general, an acceleration of evolutionary rates has been measured immediately following duplication and used to explain functional diversification such as subfunctionalization, neofunctionalization and pseudogenization [5, 6].
Limited effort has been put into whole-genome identification of pseudogenes in plants, and, although whole-genome, segmental and tandem duplications have played a large role in the evolution of plant genomes [7, 8], most of the literature has focused on the more readily identifiable retrotransposed pseudogenes. The Arabidopsis Information Resource (TAIR) has released the annotation of 859 pseudogenes in TAIR8, which were presumably the result of a manual annotation effort [9]. Studies in rice (Oryza sativa ssp. indica) and Arabidopsis have focused on chimeric genes originating from the recruitment of additional exons by retrotransposed genes. As by-products of these analyses, Wang et al. [10] found 337 retrotransposed genes containing at least one frameshift mutation in rice, and Zhang et al. [11] reported 22 in Arabidopsis. A separate effort using more liberal criteria identified 411 retrotransposed genes in Arabidopsis, 376 of which were disabled due to frameshifts or premature stop codons [12].
The majority of studies on pseudogenes focus on the identification of gene relics in intergenic regions and not among annotated protein-coding genes. This is sufficient for highly curated genomes in which pseudogenes have already been annotated. However, an increasing number of genomes are annotated in an automated or semi-automated fashion and rely partially on ab initio gene finders, which typically do not predict pseudogenes. The Osa1 Genome Annotation (of Oryza sativa ssp. japonica cv. Nipponbare) consists of gene predictions made by the ab initio gene finder FGENESH and improved through the incorporation of transcript evidence [13]. Despite the availability of expression datasets in the form of Expressed Sequence Tags (ESTs), full-length cDNAs, Massively Parallel Signature Sequencing (MPSS) tags, Serial Analysis of Gene Expression (SAGE), and proteomic datasets, over 40% of the non-transposable element (non-TE)-related rice genes are not currently supported by transcript evidence. The ab initio gene-prediction software FGENESH was chosen for rice due to its combination of high sensitivity (78%) and specificity (76%) at the exon level [14]. Despite this high performance, FGENESH is likely to circumvent premature stop codons, or frameshift mutations leading to premature stop codons, in otherwise long open reading frames (ORFs) by adding introns or by terminating the ORF prematurely. Therefore, not only does FGENESH not predict pseudogenes, but it may also predict an interrupted ORF where a pseudogene is more likely. Rice pseudogenes annotated by experts and deposited in the Osa1 Community Annotation project are evidence of this issue. A comparison of 72 pseudogenes annotated by community annotators against the Osa1 Release 4 gene annotation revealed that these pseudogenes had either been entirely "missed" by the Osa1 automated pipeline (30 pseudogenes), been misannotated (incorrect structures were invoked to circumvent stop codons or frameshifts; 25 pseudogenes), or been annotated as genes (17 pseudogenes) [15]. These results suggest that a whole-genome approach to the identification of pseudogenes in the rice gene complement would improve the quality of the annotation.
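To make concrete the kind of disablement that a gene finder may step around, the following minimal sketch scans a coding sequence for in-frame stop codons that occur before the final codon. It is an illustration only and is not part of the Osa1 pipeline or of FGENESH: the function name `find_premature_stops` is hypothetical, the sequence is assumed to start in frame 0, and the three stop codons of the standard genetic code are assumed.

```python
# Illustrative sketch only; names and assumptions are ours, not the pipeline's.
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)


def find_premature_stops(cds: str) -> list[int]:
    """Return 0-based nucleotide offsets of in-frame stop codons
    that occur before the last complete codon of the sequence."""
    seq = cds.upper()
    n_codons = len(seq) // 3  # trailing partial codon, if any, is ignored
    hits = []
    # The last complete codon is skipped: a stop there is the normal terminator.
    for i in range(n_codons - 1):
        codon = seq[3 * i:3 * i + 3]
        if codon in STOP_CODONS:
            hits.append(3 * i)
    return hits


if __name__ == "__main__":
    # ATG AAA TAG CCC TAA: the TAG at offset 6 interrupts the frame.
    print(find_premature_stops("ATGAAATAGCCCTAA"))
```

A gene model whose structure was adjusted to skip such a position, for instance by placing an intron over it, would pass this check on its predicted CDS, which is why the comparison with a parent gene described later is needed.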
Pseudogene detection methods rely on the alignment of genes to intergenic regions for the identification of a pseudogene-parent pair. The characteristics of the pseudogenes are further determined based on a global alignment of each pseudogene to its respective parent [16-18]. The success of this type of approach is inherently dependent on the quality of the annotation for the organism in question, as it assumes that the structure of the parent gene is accurately predicted [2]. Yao et al. [19] used a different strategy: human genes and pseudogenes were identified by ranking the alignments of ESTs, mRNAs, and proteins based on identity and coverage. Models created exclusively from non-top-ranking alignments (i.e., non-cognate evidence) were labeled as non-transcribed pseudogenes, while models with cognate transcript(s) but a frameshifted cognate protein were designated as transcribed pseudogenes. This approach produced a set of pseudogenes with 75 to 80% overlap with manually curated pseudogenes. An important advantage of this strategy is that it obviates the need for a predetermined set of functional models. However, the authors also demonstrated that, in the case of the human genome (~20,000 genes), a minimum of 5 million ESTs is necessary to avoid over-predicting pseudogenes, a number far greater than what is currently available for rice.
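The evidence-ranking logic attributed to Yao et al. can be sketched roughly as follows. This is a simplified reading of the description above, not their implementation: the `Alignment` record, the function names, the use of an (identity, coverage) tuple as the ranking key, and the "gene" fallback label are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    evidence_id: str      # EST, mRNA or protein accession (hypothetical)
    kind: str             # "est", "mrna" or "protein"
    locus: str            # genomic locus the evidence aligns to
    identity: float       # fraction of identical positions
    coverage: float       # fraction of the evidence sequence aligned
    frameshift: bool = False


def cognate_loci(alignments):
    """Map each evidence sequence to the locus of its top-ranked alignment.
    Ranking by (identity, coverage) is an assumption of this sketch."""
    best = {}
    for a in alignments:
        key = (a.identity, a.coverage)
        if a.evidence_id not in best or key > best[a.evidence_id][0]:
            best[a.evidence_id] = (key, a.locus)
    return {ev: locus for ev, (_, locus) in best.items()}


def classify_locus(locus, alignments):
    """Label a locus from the evidence aligned to it."""
    cognate = cognate_loci(alignments)
    here = [a for a in alignments if a.locus == locus]
    if not here:
        return "no evidence"
    top = [a for a in here if cognate[a.evidence_id] == locus]
    if not top:
        # Supported only by non-cognate (non-top-ranking) evidence.
        return "non-transcribed pseudogene"
    has_transcript = any(a.kind in ("est", "mrna") for a in top)
    shifted_protein = any(a.kind == "protein" and a.frameshift for a in top)
    if has_transcript and shifted_protein:
        return "transcribed pseudogene"
    return "gene"
```

Because every label depends on which locus receives the top-ranked alignment of each evidence sequence, sparse evidence leaves many true genes with only non-cognate support, which is consistent with the over-prediction reported for small EST collections.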
We blended the two methods described above by using only fully supported rice models to identify pseudogenes among a set of rice genes with features potentially indicative of pseudogenes, hereafter termed Genes with Pseudogene Features (GPFs) (see Additional file 2). The pseudogene features assessed were i) lack of alignment to an EST or cDNA (possibly indicating lack of expression), ii) long untranslated regions (UTRs), iii) short coding sequences (CDS), iv) a downstream poly-A tail, and v) for genes in segmentally duplicated regions, differing protein length or number of exons between the duplicated genes, or lack of a paralog and a single-exon gene model structure. Parent-derived models were constructed by aligning all fully supported gene models (i.e., gene models with full-length cDNA transcript support) to the genomic sequence of GPFs. A total of 1,439 pseudogenes, each aligning over at least 70% of the parent and containing disablement(s) (frameshifts and/or premature stop codons), were identified in the rice gene complement. We characterized the pseudogenes, identified their most likely origin, investigated their ancestral function, and validated our method by comparing our results to previously identified pseudogenes in rice.
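The final selection criterion stated above (alignment over at least 70% of the parent plus at least one disablement) can be expressed as a short filter. The sketch below is a hypothetical illustration under stated assumptions, not the code used in this study: the `ParentAlignment` record and its fields are invented for the example, and retaining the qualifying parent with the highest coverage for each GPF is our own simplification.

```python
from dataclasses import dataclass

MIN_PARENT_COVERAGE = 0.70  # threshold stated in the text


@dataclass
class ParentAlignment:
    gpf_id: str                    # Gene with Pseudogene Features
    parent_id: str                 # fully supported gene model
    parent_length: int             # parent length, in alignment units
    aligned_parent_positions: int  # parent positions covered by the alignment
    frameshifts: int
    premature_stops: int

    @property
    def parent_coverage(self) -> float:
        return self.aligned_parent_positions / self.parent_length

    @property
    def disablements(self) -> int:
        return self.frameshifts + self.premature_stops


def is_pseudogene(aln: ParentAlignment) -> bool:
    """Apply the two criteria: sufficient parent coverage and a disablement."""
    return aln.parent_coverage >= MIN_PARENT_COVERAGE and aln.disablements > 0


def call_pseudogenes(alignments):
    """For each GPF, keep the qualifying alignment with the highest parent
    coverage (the tie-breaking rule is an assumption of this sketch)."""
    best = {}
    for aln in alignments:
        if not is_pseudogene(aln):
            continue
        current = best.get(aln.gpf_id)
        if current is None or aln.parent_coverage > current.parent_coverage:
            best[aln.gpf_id] = aln
    return best
```

A GPF that aligns well to a parent but carries no frameshift or premature stop codon is left as a gene by this filter, which matches the requirement that a disablement be present.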