Derivation of transcribed pseudogene annotations (TPAs) in the human genome
To identify transcribed pseudogenes, transcript sequences from the Unigene, RefSeq and H-InvDB databases were mapped onto the human genome and were examined for overlap with pseudogene annotations. These pseudogene annotations were taken from the VEGA http://vega.sanger.ac.uk/
websites (see Methods
for details). We pooled these datasets with re-mappings of: (i)
'disrupted mRNAs' (dmRNAs) [16
], and (ii)
transcribed processed pseudogenes [14
], from previous analyses [14
]. Overlap of the transcripts with pseudogenes was verified through using the positions of mid-sequence disablements (frameshifts and premature stop codons) as 'anchors'. That is, at least one disablement position was required to occur in both the genomic DNA and the transcript sequence (see Methods
for further details).
We found that ~11.5% (1750/15000) of human pseudogenes are transcribed (after correcting for pseudogene annotation overlaps within and between the various data sets) [see Additional file 1
]. Table summarises the numbers of transcribed pseudogene annotations (TPAs) in different categories and data sets. The number of processed pseudogenes that were identified to be transcribed is 3-4 times higher than in our previous analysis [14
]. Interestingly, in humans, regardless of category, only a small fraction of TPAs are transcribed completely in the antisense direction (~3-6%). Such a finding of significant avoidance of antisense transcription (Table ) is surprising, especially for retropseudogenes. Retrotransposed sequences are inserted back into the genomic DNA irrespective of the position of existing local promoters. Thus, one would expect equal numbers of sense and antisense transcripts. However, the above finding indicates a general selection pressure against antisense transcribed pseudogenes, thus generally limiting the possibilities for complementary hybridization with transcripts and RNA elements from homologous genes.
Percentages of TPAs in human and mouse.
We performed a similar survey in the mouse genome for TPAs. Surprisingly, in the mouse genome, we found a very low percentage of TPAs, in comparison to the human genome (<2%) (P << 0.001 for the likelihood of the number in mouse, given the human percentage as an expectation, using binomial statistics). This is despite these two animals having similar amounts of pseudogene annotation data (Table ), and transcript data (203,785 transcript sequences in total for human, and 203,550 for mouse). This indicates that transcribed pseudogenes are rarer in mice than in humans.
Identification of orthologous pseudogenes in mammals
Transcription of pseudogenes per se
does not necessarily indicate functionality. It has been shown that transcriptional activation at a particular genomic locus has a ripple effect on the neighboring loci [21
]. It is therefore possible that many transcribed pseudogenes arise simply because of this. However, of the various identified human TPAs in our present study, those that are evolutionarily conserved across mammals due to natural selection are more likely to be biologically functional. Therefore, we set out to identify a list of such orthologous transcribed pseudogenes that have conserved in ≥2 of our panel of mammals.
Certain gene families are known to spawn large numbers of pseudogenes. Examples include olfactory receptors [22
], ribosomal-protein genes [23
], human thioredoxin and glutaredoxin [24
], ABC transporters [25
], and heat shock proteins [26
]. In such cases, identifying orthologs in other mammals using the standard bi-directional best-hit procedure is problematic, since the rates of sequence evolution may vary in different lineages and genomic regions. Furthermore, such a procedure does not work well for pseudogenes, since these sequences are not evolving like protein-coding genes, which are under strong purifying selection. Because of this, the best match obtained using blastp
to a pseudogene query is expectedly the parent protein-coding gene or a pseudogene recently evolved from the parent gene. Thus, the standard bi-directional best-hit procedure alone is not sufficient. Therefore, here, we have used synteny information between two organisms to pin-point pseudogene orthologs. We have used synteny maps along with homology-based searches to identify conserved orthologs in five mammals (rhesus monkey, mouse, rat, dog and cow) (see Methods
for details). We identified a set of 68 human TPAs that are conserved in at least two of these mammals, representing potentially functional candidates (see Additional file 2
: Table S1). In general, although approximately half (742/1750) of the human TPAs are syntenically conserved in rhesus monkey, only 3% are syntenically conserved in mouse. This suggests that a large number of human transcribed pseudogenes are primate-specific.
A multiple sequence alignment of orthologous sequences for an example taken from Additional file 2
: Table S2, is shown in Figure . The corresponding phylogenetic tree is drawn in Figure . This example is a human pseudogene named 'urn:lsid:pseudogene.org:9606.Pseudogene:4346
', that is homologous to human ADP-ribose pyrophosphatase. In this case, one can see clearly that disablements at several positions in the alignment are conserved in divergent species, and parsimoniously would be assigned in the ancestral sequence. Also, in this phylogenetic tree, dog clusters closer to primates, than rodents do; this may be due to variance in local genomic mutation rates. Interestingly, we find that a significantly higher number of human transcribed pseudogenes were conserved in dog, compared to in mouse (Fisher's exact test, P
-value: 0.0086) (Figure ). There is some debate regarding whether human is phylogenetically closer to rodents than to dog, although most data analysis indicates a rodent-primate grouping [27
Figure 1 Multiple sequence alignment and phylogenetic analysis of a human transcribed pseudogene that has orthologs in the ≥2 other mammals. (a) Multiple sequence alignment of conceptually-translated ortholog sequences (urn:lsid:pseudogene.org:9606.Pseudogene:4346 (more ...)
Figure 2 Number of shared orthologs between human and other mammals. (a) For the 68 conserved TPAs (orthologs in > = 2 mammals); (b) for all conserved TPAs between human and the other mammals. The shared number of cases between dog and human is highlighted (more ...)
Conserved transcribed pseudogenes are overrepresented on chromosome X
It is noteworthy that the highest number of conserved TPAs is on the human chromosome X (13 out of the total of 68; Figure and Additional file 2
: Table S2), followed by 11 cases on chromosome 6. There is also a significant over-representation of human conserved TPAs on these chromosomes after normalizing for the chromosome size and dosage in gametes (χ2
test, d.f. = 1, P
-value < 10-3
). Furthermore, it is chromosome X that is consistently and most significantly overrepresented in the whole population of TPAs, and in the datasets of TPAs that are conserved in monkey and in mouse (calculations not shown). This finding is in line with the observation that over the course of evolution there has been some extensive gene trafficking to/from the mammalian X chromosome [30
Figure 3 A scatter plot showing the number of conserved TPAs on each human chromosome versus the chromosome size (in Mb). The chromosome X is circled. The collective size of the human genome, excluding chromosome X, is 3,098,124,053; and the size of the chromosome (more ...)
Analysis of TPAs for selection to maintain them
Human TPAs from the VEGA data set, that have syntenically-conserved orthologs in rhesus monkey, were analyzed for significant selection pressure to maintain them. This was assessed through comparison of the nucleotide percentage sequence identity between orthologs, with the highest nucleotide percentage sequence identity for the immediately flanking genomic regions, as illustrated in Figure . We chose the VEGA data set for this analysis, since the genomic coordinates of this pseudogene annotation data set are more precisely annotated.
Figure 4 A schematic representation of the procedure for identifying syntenic noncoding (flanking) regions. Flanking regions (10000 nts) are scanned in a sliding window, equal to the length of conceptual human pseudogene transcript), by globally aligning it to (more ...)
Our analysis indicated that TPA orthologs have higher sequence homology in comparison to their syntenic flanking regions. Average sequence identities among different syntenic regions in human and rhesus monkey are as follows: 75.0% (s.d. 12.6%) in the 5' areas, 81.0% in the 3' areas (s.d. 13.5%), and 86.7% (s.d. 9.6%) in the TPA sequences. The difference in the sequence identities between pseudogenes and the respective flanking regions is statistically significant (Wilcoxon signed rank test, P-value: < 5e-50). In the majority of cases (~86%, 293/341), the percentage sequence identity between orthologous TPA sequences is greater than that of the flanking regions. This suggests that significant selection pressure exists to maintain them. We note that similar analysis comparing the human and chimpanzee genomes is not informative because the species are too similar. Also, comparisons of the human genome with the other mammals in our panel are not informative either, because the appropriate regions cannot be aligned accurately or significantly.
Conserved TPAs tend to be GC rich
We examined whether there existed any sequence feature that distinguished conserved TPAs from the rest of the human pseudogene population. A positive finding would indicate that these pseudogenes are not evolving neutrally. It has been widely observed that genes tend to be GC-rich in comparison to non-transcribed genomic segments [31
]. Here, we examined whether TPAs and annotations of non-transcribed pseudogenes showed any difference in the GC contents relative to their flanking regions (GCdiff
). Pseudogene populations from a variety of organisms have been shown to relax to the composition of intergenic DNA over evolutionary time [32
]. Here, the GC content of neutrally-evolving pseudogenes would be expected to relax to that of the background genomic GC content. Interestingly, GC content calculations revealed that 84% (327/391) of the human TPAs derived from the VEGA pseudogene data set, that are conserved in rhesus monkey, have GC content greater than their flanking regions (Table ). This compares to 74% (5289/7154) for the non-transcribed cases (Table ). This difference is statistically significant (χ2
test, d.f. = 1, P
-value < 10-4
). A similar result is obtained if we examine the whole population of TPAs, or also if we just examine transcription of conserved processed
pseudogenes. There is however no such significant differences for transcribed pseudogenes of the 'nonprocessed' and 'unclassified' pseudogene categories (Table ). This shows that there is a greater tendency for the conserved TPAs to be GC-rich than for non-transcribed cases, and that this tendency arises primarily because of transcription of processed pseudogenes. This finding on GC content is further evidence that such transcribed pseudogenes are not evolving neutrally. Such GC trends can be explained by selection for transcriptional efficiency, as noted above [31
Proportions of pseudogenes that have GC contents higher relative to their flanking regions in different categories of VEGA annotated human pseudogenes.
We checked whether the age of pseudogenes could be causing the GC content differences noticed above. To do this, we looked at the GCdiff in the following (VEGA) subsets of TPAs, i.e.: (i) TPAs unique to humans; (ii) TPAs conserved in rhesus monkey only; (iii) TPAs conserved in more divergent mammals such as mouse, rat, dog and cow. We found that 82.5% (381/462) of set (i) have GCdiff > 0 (i.e., GC content of pseudogene greater than that of the flanking region). Similar percentages were observed in the other classes: 84.1% (317/377) in set (ii), and 77.78% (21/27) in set (iii). There was no statistically significant difference in the GCdiff between any two of the classes (χ2 test, P-value > 0.55), suggesting that age of pseudogenes does not have any influence on the observed GC content differences.
Ka/Ks trends for TPAs that are conserved in rhesus monkey
We decided to assess the genomic evidence for a lack of protein-coding ability in the human TPAs that are syntenically conserved in rhesus monkey. We compared TPA characteristics to the characteristics of two other groupings: (i) known human protein-coding genes with orthologs in rhesus monkey; (ii) populations of simulated sequences that are randomly mutating without coding-sequence selection pressures. The human TPA sequences are used as starting sequences for these simulations. The protocol for these simulations is described in 'Methods: Ka/Ks ratio calculations'.
The ratio of non-synonymous to synonymous substitution rates (Ka/Ks) provides a measure of selection pressure for protein-coding ability on nucleotide sequences. Classically, values significantly <<1.0 indicate purifying selection, whereas sequences without coding ability theoretically yield values near 1.0. We examined the trends for Ka/Ks in the population of human TPAs that are conserved syntenically in the rhesus monkey. Ka/Ks was calculated using the PAML package (as described in Methods). The distribution of Ka/Ks is shown for TPAs, split into two groups, those that have a disrupted protein domain of known three-dimensional structure (TPADD, blue bar, Figure ), and those that do not (TPANDD, red bar, Figure ).
Figure 5 Distributions of Ka/Ks for TPADDs (blue bar), and TPANDDs (red bar). Also shown are: the distribution of Ka/Ks for simulated sequences that have randomly mutated without coding-sequence selection pressures (green curve; derived as described in Methods (more ...)
In addition, we calculated the distribution of Ka/Ks values for sequences that are randomly mutating without coding-sequence selection pressures. These sequences were generated using the simulation protocol described in 'Methods: Ka/Ks ratio calculations'
, using the human TPAs as starting sequences (green curve, Figure ). The Ka/Ks distribution for these simulated sequences does not peak at ~1.0, as would be classically expected. This is likely due to some inaccuracy in modeling the expected frequency for the different possible nucleotide substitutions, which varies for different genomic areas [3
]. The distribution for TPADD
s peaks in the range 0.6 to 1.0. This peak is similar for the randomly-mutating sequences (Figure ). For the TPANDD
s, the peak is at lower Ka/Ks values (0.4-0.6).
As a further comparison, we have calculated the Ka/Ks curve for orthologous pairs of protein-coding genes from the rhesus monkey and the human (blue curve, Figure ). Clearly, these protein-coding sequences behave very differently from the TPAs, with a substantial mode in the range 0.0 to 0.2. In summary, these Ka/Ks trends indicate that the substitution patterns in the TPAs generally behave like non-protein-coding sequences, and not like protein-coding ones. This is despite the overall significant conservation relative to surrounding intergenic genomic DNA that was discussed in the previous section.
Analysis of the ratio of non-synonymous to synonymous substitution rates (Ka/Ks) relative to orthologous TPAs in dog and in mouse
To gain a more complete picture, we also examined Ka/Ks values for TPAs that are conserved in two more divergent species, the dog and the mouse. We compared Ka/Ks values for orthologous TPA pairs (termed Ka/KsΨ-ortho), with the corresponding Ka/Ks values for their parent genes (Ka/Ksparent-ortho) (Figure ). These were calculated for human/dog (Figure ), and human/mouse comparisons (Figure ). For human/dog comparisons, the substantial majority (83%) have Ka/KsΨ-ortho > Ka/Ksparent-ortho, whereas for human/mouse all of the pseudogene pairs have larger Ka/Ks values than their corresponding parent pairs.
Figure 6 Scatter plots showing Ka/Ks ratio comparisons between TPA sequences and their respective orthologous parental protein coding genes for: (a) human/dog comparisons, (b) human/mouse comparisons. Ka/Ks values for TPAs, that are significantly less than values (more ...)
The Ka/Ks results suggest that these transcribed pseudogenes are relaxing to higher Ka/Ks values, since origination from their parents. But why do they not have Ka/Ks values of ~1.0? We suggest that this is chiefly because: (i)
there may be some inaccuracy in modelling the expected frequency for the different possible nucleotide substitutions, which varies for different genomic areas (as noted in the previous section); (ii)
in some cases, the transcribed pseudogenes were originally protein-coding, and became disabled subsequently in multiple lineages; (iii)
some of them maintain an imprint of the original coding sequence because of selection pressure for regulation of homologous genes via
antisense interference (e.g.
, through genesis of siRNAs); (iv)
selection pressures on non-synonymous codon substitution rates in protein-coding genes, may have relaxed in the pseudogenes, contributing to an apparent relative increase in Ks; (v)
it is also possible that some of these sequences are currently protein-coding, and have evolved through multiple coding-sequence disablements, as discussed previously [4
To examine these data more closely, we calculated whether the Ka/KsΨ-ortho values are significantly less than would be expected for mutation without coding-sequence selection pressures (using the simulational analysis described in the Methods section). Several cases with such significant values (that may indicate purifying selection typical of protein-coding sequences), are observed (represented by circles in the Figure plots). These Ka/Ks values (that apparently indicate protein-coding ability) may arise for the reasons listed in the preceding paragraph.
In addition, we examined whether the TPAs contain a protein domain of known three-dimensional structure, that is disabled by a frameshift or a premature stop codon (denoted 'TPADDs'; see Methods for details of annotation of such domains). The TPADDs are indicated by unfilled symbols in parts (a) and (b) of Figure . Interestingly, in the human-dog comparisons, there are three cases of TPA orthologous pairs that have such a disabled protein domain, despite Ka/Ks values that indicate apparent purifying selection. These sequences are thus of 'intermediate' character, i.e., they have evidence of both protein-coding ability and pseudogenicity.
Antisense homologies of human pseudogenes to other full-length human cDNAs
Transcribed pseudogenes can regulate the expression of other genes by RNA interference mechanisms. For example, antisense transcribed RNA from a NOS pseudogene regulates the expression of neural nitric oxide synthase (nNOS) protein through formation of RNA duplex [34
]. Therefore, we investigated how many of the TPAs have antisense homology to the annotated full-length human cDNAs (E
-value < 1e-10
and alignment length > = 100 nucleotides). A small proportion (8.3%, 69/828) of the human (VEGA) pseudogenic transcripts have either complete or partial, but significant, reverse complement (antisense) homology to human cDNAs. Of these, 63 have short length strong antisense homology to human cDNAs (alignment length > = 20, mismatches < = 2). However, there is no significant association of such antisense homologies to pseudogene transcription, since non-transcribed pseudogenes have similar levels of antisense homology (7.65%, χ2
-value = 0.5).
Out of the identified 68 human conserved TPAs, 3 have antisense homology to human cDNAs (E
-value < 1e-10
, 5 if alignment length > = 50 nucleotides is considered) (Table ). These are cases that may generate small interfering RNAs (siRNAs) that could potentially regulate the expression levels of their homologous genes. Pseudogenes have been implicated in the negative regulation of parental genes (for a review, see ref. [2
]) and in the Dicer
-mediated generation of small RNAs [18
]. It would be interesting to verify experimentally whether these pseudogene transcripts can indeed generate small interfering RNAs, through the action of Dicer
Human conserved TPAs that have antisense homology to human full length cDNAs.
Small RNA mappings to pseudogenes
Transcribed pseudogenes can also regulate the transcription of genes by producing siRNAs that have antisense homology [18
]. Due to unavailability of genome-wide human siRNA data, we used the siRNA data for the mouse genome from Tam et al
] and Watanabe et al
] to check how many of the small RNAs mapped to mouse transcribed pseudogenes that we identified. Interestingly, 24 out of 136 (17.6%) mouse TPAs had siRNA mappings compared to ~1% (178/18168) of the total mouse pseudogenes. The above difference is statistically significant (P
-value < 0.05, using normal statistics for the distribution of the mean number of transcribed pseudogenes in a sample of 136 cases). This demonstrates that transcribed pseudogenes are significantly likely to generate siRNAs in mouse. For comparison, in Arabidopsis thaliana
, ~40% of 572 pseudogenes have small RNA mappings [36