Pseudogenes are copies of genes that cannot produce a protein. They can be detected from disruptions to their apparent coding sequence, caused by frameshifts and premature stop codons. They are classed as either processed pseudogenes (made by reverse transcription from an mRNA) or duplicated pseudogenes, arising from duplication in the genomic DNA and subsequent disablement. Historically, there is anecdotal evidence that the fruit fly (Drosophila melanogaster) has few pseudogenes. Investigators have linked this to a high deletion rate of genomic DNA, for which there is evidence from genetic experiments on genome size. Here, we apply a homology-based pipeline that was developed previously to identify pseudogenes in other eukaryotic genomes, to the fruit fly, so as to derive the first complete survey of its pseudogene population. We find approximately 100 pseudogenes, with at least a sixth of these as candidate processed pseudogenes. This gives a much lower proportion of pseudogenes (compared with the size of the proteome) than in the genomes of other eukaryotes for which data are available (human, nematode and budding yeast). Closest matching proteins to Drosophila pseudogenes are significantly longer than the average protein in its proteome (up to ~60% more than the average protein’s length), in contrast to the situation in the three other eukaryotic genomes. This may be due to the persistence of fragments of longer genes. In the fly pseudogene population, we found most pseudogenes for serine proteases (which are more abundant in the Drosophila lineage compared with the other eukaryotes), immunoglobulin-motif-containing proteins and cytochromes P450. Data on the sequences and positions of the putative pseudogenes are available at: http://www.pseudogene.org/fly. 
The detection of a small number of pseudogenes in the Drosophila genome and the higher mean length for the closest matching proteins to pseudogenes (possibly because remnants of genes encoding longer proteins are more likely to persist) are further evidence for a high deletion rate of genomic DNA in the fruit fly. The data are useful for molecular evolution study in Drosophila.
Pseudogenes are copies of genes that do not produce a full-length functional protein chain. Their apparent protein-coding sequences are disrupted by frameshifts and premature stop codons as evolution progresses (1–3). They occur in two forms: first, duplicated pseudogenes arise from duplications of a gene (or an exon) that become disabled and subsequently are degraded; secondly, processed pseudogenes arise from reverse transcription of a messenger RNA and reintegration of the resultant cDNA into genomic DNA (1–3). The latter type of pseudogene can arise as a by-product of LINE-1 retrotransposition in humans (4). Surveys have recently been performed on the pseudogene populations of budding yeast, nematode and chromosomes 21 and 22 for human, with a further analysis of over 2000 ribosomal-protein pseudogenes in the whole human genome (5–8). The procedures derived in these papers have been applied to the Drosophila genome in the present study to derive an initial overview of the pseudogene population of this fly. Here, we report the detection of about 100 putative pseudogenes in the Drosophila genome, and present analysis of some of their characteristics, such as the length of their matching proteins and their most common functional groupings.
We applied procedures for detection of pseudogenes based on the identification of protein homology in the genomic DNA that is disabled by frameshifts or premature stop codons; these procedures have been described in detail previously (7). As for our study of human chromosomes 21 and 22, we ensured that we minimized the number of disabled extensions like those observed for known genes [see methods of ref. (7) for the complete procedure; an extension length minimum of 24 residues was found to be suitable]. We used Releases 1 and 2 of the Drosophila genome and the accompanying annotations (9). We disregarded any sequences that may have arisen from disabled copies of transposable elements (10). As before, we assigned as candidate processed pseudogenes, any sequences that (i) are of substantial length (>70% of the length of the closest matching protein sequence) and that have no obvious introns, or (ii) have evidence of polyadenylation and no obvious introns (7). Evidence of polyadenylation is defined as a discernible canonical AATAAA polyadenylation signal followed within 50 nucleotides by a region of elevated polyadenine content (≥30 adenines in a 50 nucleotide stretch), within 1000 nucleotides from the end of the detected homology (7). Drosophila transcripts have a greater tendency than transcripts of the other eukaryotes to use the canonical AATAAA polyadenylation signal (11). We have re-mapped the pseudogene annotations onto the recent Release 3 of the fly genome.
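The criteria for candidate processed pseudogenes described above can be illustrated with a minimal Python sketch. It is not the pipeline of ref. (7): the function names, the input representation (the genomic sequence immediately 3' of the detected homology, on the strand of the pseudogene) and the readings of 'within 50 nucleotides' (the adenine-rich window starts no more than 50 nucleotides after the signal) and 'within 1000 nucleotides' (both signal and window lie in the first 1000 nucleotides) are assumptions made for illustration only.

```python
def has_polya_evidence(downstream, signal="AATAAA", max_gap=50,
                       window=50, min_a=30, search_limit=1000):
    """Return True if `downstream` shows evidence of polyadenylation.

    `downstream` is assumed to be the sequence 3' of the detected homology.
    Evidence = a canonical signal followed, within `max_gap` nucleotides,
    by a `window`-nucleotide stretch containing at least `min_a` adenines,
    all within the first `search_limit` nucleotides (an assumed reading).
    """
    seq = downstream.upper()[:search_limit]
    pos = seq.find(signal)
    while pos != -1:
        sig_end = pos + len(signal)
        # Slide the window start over every offset allowed after the signal.
        for start in range(sig_end, sig_end + max_gap + 1):
            win = seq[start:start + window]
            if len(win) < window:
                break  # ran off the end of the searched region
            if win.count("A") >= min_a:
                return True
        pos = seq.find(signal, pos + 1)  # try the next signal occurrence
    return False


def is_candidate_processed(homology_len, protein_len, has_introns, polya):
    """Apply criteria (i) and (ii): no obvious introns, and either
    >70% of the closest matching protein's length or polyadenylation evidence.
    Lengths are assumed to be in residues."""
    if has_introns:
        return False
    return homology_len > 0.7 * protein_len or polya
```

For example, a short intronless fragment would be assigned as a candidate only if `has_polya_evidence` returns True for its downstream sequence, whereas a fragment covering more than 70% of its closest matching protein would qualify on length alone.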
In addition, we examined existing annotations for fly pseudogenes downloaded from the FLYBASE website (http://www.flybase.org). We found 10 previously reported pseudogenes that are in euchromatic DNA, that are not obviously associated with a transposable element and whose sequences were available. However, once we set aside those that do not occur in the sequenced fly strain or that are truncations (and would not be detected by our procedures), we are left with only three existing annotations [two cytochrome P450 pseudogenes and one α-esterase pseudogene (9,12)], each of which is recovered in our study.
InterPro motif families (13) were assigned to pseudogenes by transferring annotations from the closest matching Drosophila protein. Lists of matches for Drosophila proteins were downloaded from the InterPro proteome analysis website (http://www.ebi.ac.uk/proteome). Similarly, Gene Ontology (GO) annotations for function (downloaded from http://www.geneontology.org) were also transferred (14).
We found 110 pseudogenes in the Drosophila genome, which is about one for every 130 proteins encoded in the genome. This proportion is much lower than in the other eukaryotic genomes for which studies on pseudogene populations have been completed (Table 1). For example, in the single-celled budding yeast (Saccharomyces cerevisiae) there are over 220 pseudogenic ORFs, which is about one for every 30 encoded proteins (5). In human, our surveys have shown that there may be one duplicated and one processed pseudogene for every four genes (7). A recent paper detailing comparative analysis of the genomes of Anopheles gambiae and D.melanogaster describes detection of 176 pseudogenes in Drosophila by searching for disabled protein homology; however, our methods are more conservative, as we disregard any disabled homology fragments that look like disabled extensions to known genes (such as might arise in the last exon of a gene) (see Materials and Methods) (15); also, we disregard any pseudogenic copies of proteins from transposable elements (10). On a related note, we recently found that the fly has more decayed remnants of genes than other sequenced eukaryotes that are undetectable by standard gene prediction and sequence alignment procedures (16).
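To make the comparison of proportions concrete, the following short Python sketch uses only the ratios quoted in the text. The dictionary keys and the helper name are illustrative, and the human figure assumes that the statement 'one duplicated and one processed pseudogene for every four genes' means two pseudogenes per four genes; the values in Table 1 remain the authoritative ones.

```python
# Pseudogenes per encoded protein (or gene), taken from the ratios in the text.
density = {
    "fly": 1 / 130,              # about one pseudogene per 130 proteins
    "budding yeast": 1 / 30,     # about one pseudogenic ORF per 30 proteins
    "human (chr 21, 22)": 2 / 4  # assumed reading: 1 duplicated + 1 processed per 4 genes
}


def fold_difference(organism, reference="fly"):
    """How many times denser pseudogenes are in `organism` than in `reference`."""
    return density[organism] / density[reference]


# Proteome size implied by 110 pseudogenes at one per 130 proteins (rough figure).
implied_fly_proteins = 110 * 130

for name in density:
    print(f"{name}: {fold_difference(name):.1f}x the fly density")
```

Under these assumptions the budding yeast proportion is roughly four times that of the fly, and the human proportion is larger still by more than an order of magnitude.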
Processed pseudogenes do not have introns (as they are derived from messenger RNA transcripts), and, if recently integrated into the genome, have detectable characteristic features such as a polyadenine tail with an upstream polyadenylation signal (3,7). We examined the fly pseudogenes for evidence of being processed (Table 1). About one-sixth (19/110) of the Drosophila pseudogenes have no obvious introns and both a polyadenylation signal and a downstream polyadenine-rich stretch in the genomic DNA, and up to a third of the pseudogenes (34/110) have some evidence of processing (see Materials and Methods and Table 1 for details). There are six pseudogenic copies of single-exon genes that could be either processed or duplicated pseudogenes. The only previously well-documented evidence of processing in Drosophila is an alcohol dehydrogenase retrosequence, which is part of the gene jing-wei in many Drosophila species (but not melanogaster) (17), and was originally identified as an anomalously conserved processed pseudogene (18,19). Our data show that processed pseudogenes are comparatively rare in the fruit fly genome (Table 1), indicating either a low rate of generation, or a high rate of deletion from the genome. Indeed, our procedures could be over-assigning pseudogenes as processed, particularly in situations where the pseudogene fragment is too small to discern the original intron–exon boundaries, so the figure of 34/110 pseudogenes as processed should be considered an upper bound.
The pseudogene population and its subpopulation of candidate processed pseudogenes appear to be dispersed randomly along the chromosomes (Fig. 1) [as for genes, there are no notable large-scale gradients or clusterings in their positioning (9), although we must emphasize that we only have a small population]. However, there appears to be clustering of pseudogenes within 2 Mb of either side of the 16 Mb of pericentromeric heterochromatin on chromosome 2 (see 2L and 2R in Fig. 1). Such large blocks of heterochromatin are also seen around the X- and third-chromosome centromeres. These clusterings comprise 16 pseudogenes, of which eight were judged to be candidate processed pseudogenes [two of these are homologous to parts of the protein Osa (20)]; of the others, two are homologous to a retroviral reverse transcriptase [InterPro motif IPR000477 (13)]. This pericentromeric area may be a ‘cold-spot’ for genomic DNA deletion.
We calculated the mean length of the closest matching proteins for pseudogenes of the genomes of budding yeast, nematode worm and human (chromosomes 21 and 22 only). We compared this with the same data for the Drosophila genome pseudogenes (Table 1). In Drosophila, coding sequences that give rise to pseudogenes tend to be rather longer than the average coding sequence, in contrast to the situation in other organisms. Specifically, we found that closest matching proteins for pseudogenes tend to be ~60% longer than the average Drosophila protein (Table 1). [Their mean length reduces to ~20% longer than average when seven outlying matching proteins of >3000 residues are excluded (Table 1 footnote).] This length observation may arise because remnants of longer genes can persist for longer in the genomic DNA than shorter genes, and withstand very high deletion rates of genomic DNA in Drosophila (21).
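The length comparison above amounts to a ratio of two means, optionally recomputed after excluding very long outliers. A minimal Python sketch follows; the function name and its parameters are hypothetical, and the numbers in the usage note are toy values chosen only to show the calculation, not data from this study.

```python
from statistics import mean


def percent_longer(matching_lengths, proteome_mean_length, max_length=None):
    """Percentage by which the mean length of the closest matching proteins
    exceeds the proteome-wide mean protein length.

    If `max_length` is given, matching proteins longer than it are excluded
    first (for example, 3000 residues for the outlier check in the text).
    """
    kept = [n for n in matching_lengths
            if max_length is None or n <= max_length]
    return 100.0 * (mean(kept) / proteome_mean_length - 1.0)
```

With toy lengths of 400, 600, 800 and 5000 residues and a proteome mean of 500, `percent_longer` gives 240% for the full set and 20% once the 5000-residue protein is excluded with `max_length=3000`, which shows how strongly a few long outliers can inflate the mean.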
There is evidence from experiments investigating genome size that Drosophila has a very high genomic DNA deletion rate (21,22). This has traditionally been thought to be the reason that few Drosophila pseudogenes have been discovered in the past (23). There is some evidence that the underlying deletion rate of genomic DNA is also high in nematodes (24,25); however, some gene families, particularly types of G-protein-coupled receptor (GPCR), appear to acquire and ‘use up’ more novel duplications, resulting in a lowered net rate of deletion of pseudogenes in the nematode genome. There is also a marginally significant difference in the same comparison for the budding yeast pseudogene data (Table 1 footnote); however, in this case, the proteins that are closest matches to pseudogenes tend to be somewhat shorter than the average protein encoded by the genome. This finding may be related to the high concentration of pseudogenes and homologs of pseudogenes near the telomeres of the budding yeast genome (5).
InterPro motifs (13) and GO function categories (14) were mapped onto the fruit fly pseudogenes via annotations for their closest matching protein sequences. The top-ranking motifs and functions are listed in Tables 2 and 3.
The most common InterPro motifs are for serine proteases (there are multiple GO function category designations for these enzymes as well, as ‘serine-type endopeptidase’), and immunoglobulin-like domain motifs. The serine proteases are types of proteins that are very abundant in the fly, but are very rare in the nematode worm (Caenorhabditis elegans), budding yeast (S.cerevisiae) and the weed Arabidopsis thaliana, and of intermediate abundance in the human proteome (see InterPro website: http://www.ebi.ac.uk/interpro). The S1 class of proteases (Table 2) is thought to have roles in digestion, the complement cascade and in various signaling pathways in the fly (9). This finding continues the theme of pseudogenes tending to occur for lineage-specific or lineage-expanded classes of proteins, observed previously for other eukaryotes (1). Interestingly, there are also multiple pseudogenes for the cytochromes P450 (Table 3), which are proteins that have a ‘broad’ substrate specificity. In other organisms, classes of proteins that have a ‘breadth’ of substrate specificity, or ‘binding diversity’, have many pseudogenes, such as the chemoreceptors in the nematode (6), and the immunoglobulins and olfactory receptors in human (7,26). The fact that the cytochromes P450 were not ‘counted’ in the InterPro motif listings demonstrates the utility of combining different methods (here both GO function categories and InterPro motifs) to characterize the functional role of sequences.
Notably, we find only one GPCR pseudogene, which contrasts with the situation in the nematode worm and in the human genome, where several hundred such pseudogenes are found (6,26). This may be because of a fundamental difference in the organization of GPCR genes in the fly; they are distributed among many loci in small clusters of one, two or three genes (9), whereas in the nematode and in human, there are large arrays of dozens of genes with interspersed pseudogenes (6,26). Also, we detect no ribosomal-protein pseudogenes; in contrast, processed pseudogenes from transcripts for these proteins are abundant and ubiquitous in the human genome, suggesting that appropriate reverse transcriptase specificity is not as available or as potent in the fly (7,8). There are two assignments of candidate processed pseudogenes each for serine proteases and for immunoglobulin-like domains; removing them from the counts does not change the identity of the 10 most common domains (Table 2).
We have completed an initial survey of the pseudogene population in the Drosophila genome. We find about 100 pseudogenes, with at least one-sixth of these as candidate processed pseudogenes. Two features of the fly pseudogene population arguably arise from a comparatively high genomic DNA deletion rate in the fly, relative to the rate of duplication of genes and gene parts: (i) there is a comparatively small number of putative pseudogenes (Table 1), relative to the genomes of other eukaryotes; (ii) closest matching proteins to pseudogenes appear to be rather longer than the average protein sequence in the proteome. Finally, most pseudogenes occur for serine proteases (which are relatively abundant in the Drosophila lineage, compared with the other eukaryotes), immunoglobulin motif-containing proteins and cytochromes P450. Data relating to this paper are available at http://www.pseudogene.org, including chromosomal positions and protein sequences with disablements. We have re-mapped our annotations onto Release 3 of the genome, and are currently honing our methods for pseudogene detection in Drosophila with consideration of underlying substitution rates in the DNA, and other concepts. Our fruit fly data further add to the picture of evolution of the size and diversity of eukaryotic proteomes. In the human genome, there seems to be a clear correlation between the numbers of processed pseudogenes and the amount of non-coding DNA on a chromosome [(8); Z.Zhang, unpublished data]. However, as can be seen in Table 1 [and refs (1,7,27)], no obvious relationship has yet emerged between the size of a pseudogene population, the size of a proteome, and the amount of coding and non-coding DNA in genomes as whole entities. Detailed analysis of many more genomes will further help in deconvoluting the forces that shape these populations of sequences.