A pipeline was devised to detect pseudogenic exons (ΨEs) in the immediate chromosomal milieu of genes (Figure 1; see Methods for details). A ΨE is defined as an exon copy whose coding ability is compromised by a frameshift or a premature stop codon. Such frameshifts and stop codons are the most obvious indicators of coding-sequence decay. The designated parent exon for a ΨE is the most similar exon in the surrounding annotated gene structure. In addition, we annotated duplicated exons (DEs) in the transcripts from each gene, as described in Methods.
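The ΨE definition above (a frameshift or a premature stop codon relative to the parent exon) can be illustrated with a minimal sketch. It is not the pipeline described in Methods. It assumes that a pairwise nucleotide alignment of the exon copy against its parent is already available, that the parent row starts in reading frame 0, and that any gap run whose length is not a multiple of three counts as a frameshift. The function name `find_disablements` and the returned keys are hypothetical.

```python
import re

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_disablements(parent_aln: str, copy_aln: str) -> dict:
    """Report frameshifts and premature stops in an exon copy aligned to its parent.

    Both arguments are rows of one pairwise nucleotide alignment ('-' is a gap).
    Assumption: the parent row begins in reading frame 0.
    """
    if len(parent_aln) != len(copy_aln):
        raise ValueError("aligned rows must have equal length")
    parent_aln, copy_aln = parent_aln.upper(), copy_aln.upper()

    # A gap run whose length is not a multiple of 3 shifts the reading frame.
    frameshifts = [
        (name, m.start(), len(m.group()))
        for name, row in (("parent", parent_aln), ("copy", copy_aln))
        for m in re.finditer(r"-+", row)
        if len(m.group()) % 3 != 0
    ]

    # Walk the alignment codon by codon in the parent's frame and flag
    # stop codons that are present in the copy but not in the parent.
    cols = [i for i, base in enumerate(parent_aln) if base != "-"]
    premature_stops = []
    for k in range(0, len(cols) - 2, 3):
        codon_cols = cols[k:k + 3]
        parent_codon = "".join(parent_aln[i] for i in codon_cols)
        copy_codon = "".join(copy_aln[i] for i in codon_cols)
        if copy_codon in STOP_CODONS and parent_codon not in STOP_CODONS:
            premature_stops.append(k // 3)  # 0-based codon index in the parent

    return {
        "frameshifts": frameshifts,
        "premature_stops": premature_stops,
        "is_disabled": bool(frameshifts or premature_stops),
    }


# Toy example: the third codon of the copy has mutated from TGG to TGA.
result = find_disablements("ATGGCCTGGAAA", "ATGGCCTGAAAA")
```

An exon copy with neither kind of disablement would be treated as an ordinary duplicated exon under this sketch.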
Pipeline annotation of DEs and ΨEs. The pipeline annotation is summarized.
We focused on four mammalian genome assemblies with high (>7X) coverage (human, cow, mouse and rat) to analyze the extent of the occurrence of ΨEs. We tested for significant trends in the distribution of ΨEs across a variety of properties. In particular, we focused on assessing the peculiarities of the ΨEs in comparison to the general population of duplicated exons. We analyzed the following: (i) divergence from designated parent exons; (ii) association with protein families; (iii) association with Gene Ontology functional categories; (iv) position of ΨEs with respect to the intron-exon structure of the gene; (v) participation in alternative splicing; and (vi) coding-sequence selection pressures, as judged by Ka/Ks values.
Table 1 summarizes the distribution of ΨEs. Strikingly, ΨEs occur at a consistent level across all of the mammalian genomes studied. The annotation pipeline identified between ~300 and ~600 ΨEs per genome. These ΨEs occur in 0.4–1.0% of genes, at a frequency of 1.3–2.0 ΨEs per affected gene. In addition, we identified ~4000–7000 duplicated exons (DEs) within the annotated genes of each of the four studied mammals (Table 1). A substantial fraction (~12–22%) of the ΨEs are located on the strand opposite to the putative parent gene (Table ), indicating that some form of inversion was involved in their generation.
(i) Divergence from designated parent exons
We analyzed the distribution of percentage sequence identity between the ΨEs and their respective designated parent exons. These distributions were compared to an equivalent distribution for DEs (Figure 2), obtained by comparing each DE to its most similar exon within the same gene. For both DEs and ΨEs, the distributions generally have a mode at 40–50% (Figure 2). Therefore, ΨEs are not unusually divergent from their parents, in terms of protein sequence identity, relative to DEs in general.
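As a small illustration of the quantity being compared, percentage sequence identity can be computed from a pairwise protein alignment as follows. This is a sketch under one stated assumption: identity is taken over alignment columns in which neither sequence has a gap. The text does not specify the denominator used in the study, and the function name `percent_identity` is hypothetical.

```python
def percent_identity(row_a: str, row_b: str) -> float:
    """Percentage identity between two rows of a pairwise protein alignment.

    Assumption: columns containing a gap ('-') in either row are ignored,
    so the denominator is the number of residue-to-residue columns.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    pairs = [(a, b) for a, b in zip(row_a.upper(), row_b.upper())
             if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


# Toy alignment: 4 identical residues out of 5 gap-free columns.
identity = percent_identity("MKT-LV", "MRTALV")
```

Other conventions (for example, dividing by the length of the shorter sequence) give different values for the same alignment, so the choice should be fixed before distributions are compared.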
Figure 2 Distributions of protein sequence identity for DEs and ΨEs. These curves are for the data sets listed in Table 1. There are four panels for each of the four mammals analysed, labelled with the binomial species name.
In addition, we examined distributions of Ks values for those exons that align to their designated parent exons with ≥ 70% amino-acid sequence identity (to avoid consideration of sequences with codon saturation) (Figure 3). Although evidence has recently been uncovered indicating that Ks values are under selection in mammals [8], they can still be used comparatively to assess age trends in populations of sequences. In general, there is a notable tendency towards very young exon duplications, with a peak in the Ks distributions for all species in the interval 0.00–0.10, both for ΨEs and for duplicated exons in general. Interestingly, a sizeable fraction of ΨEs also appear to be derived from anciently duplicated exons (i.e., 30–60% have Ks > 1.4); such exons were likely duplicated earlier in vertebrate evolution and became disabled later, during mammalian speciation.
Distributions of Ks for DEs and ΨEs. These curves are for the data sets listed in Table 1. The DE curve is green, and the ΨE curve is red. The bin label x is for all values such that, x-0.1 < value ≤ x.
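The bin-labelling convention used in these captions (label x covers all values with x − width < value ≤ x) corresponds to right-closed bins. The sketch below shows one way to assign values to such bins; the function names and the small floating-point tolerance are assumptions made for illustration, and the Ks values are invented.

```python
import math
from collections import Counter


def bin_label(value: float, width: float) -> float:
    """Return the label x of the right-closed bin with x - width < value <= x.

    A tiny tolerance guards against floating-point error for values that
    sit exactly on a bin edge (e.g. 0.3 / 0.1 is slightly below 3).
    """
    index = math.ceil(value / width - 1e-9)
    return round(index * width, 10)


def histogram(values, width: float) -> Counter:
    """Count values per right-closed bin, keyed by bin label."""
    return Counter(bin_label(v, width) for v in values)


# Invented Ks values binned with width 0.1, as in the Ks distributions.
counts = histogram([0.02, 0.10, 0.11, 1.45], 0.1)
```

With this convention a value of exactly 0.10 falls in the bin labelled 0.1, not in the bin labelled 0.2.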
The distribution of exon sizes of DEs has medians in the range ~40–50 amino acid residues (Figure 4, Additional File 1). However, ΨEs are substantially longer than DEs in general (median values in the range 70–110 amino acid residues, with broader distributions) (Figure 4). This trend towards larger size for ΨEs arises chiefly from the exon sizes of the specific gene families that tend to produce large numbers of ΨEs, such as the zinc-finger-containing (ZFC) genes (see Additional File 2 and the protein family section below). In aggregate, the majority of the ΨEs (> ~75%) are at least half the length of their designated parents, and ~55% are between 0.9 and 1.1 times their parents' length (Figure 5). A small percentage (6–13%) of the ΨEs are marginally longer than their parent exons (Figure 5); this is potentially because of neutrally occurring insertions arising after duplication [9].
Figure 4 Distributions of size (in nucleotides) for DEs and ΨEs. These curves are for the data sets listed in Table 1. The DE curve is green, and the ΨE curve is red. The bin label x is for all values such that, x-10 < value ≤ x.
Distributions of fraction of length of parent exon for ΨEs. The bin label x is for all values such that, x-10.0 < value ≤ x.
(ii) Association with protein families
Some gene families spawn large numbers of pseudogenes. Examples include olfactory receptors [10], ribosomal-protein genes [11], ABC transporters [12], and heat shock proteins [13]. We noted previously that the gene families with the most non-processed pseudogenes tend to be involved in some form of interaction with the environment [1], through roles in immunity [14], chemosensation [1], or small-molecule transport [12]. Such gene families can also be linked to recent segmental duplications in mammals [16]. Here, we examined which protein domain families are the most common in the ΨE and DE data sets (Additional File 2). The counts reported there are the number of exons with at least one copy of each protein domain considered. Exons containing zinc-finger domains and immunoglobulin-like domains are consistently among the five most abundant for both ΨEs and DEs. Genes for zinc-finger-containing (ZFC) proteins have undergone lineage-specific expansions over the course of mammalian evolution, so decaying ZFC exons are an expected consequence of this, and could perform regulatory roles as part of transcribed pseudogenes [17]. Transcribed pseudogenes have recently been shown to regulate the expression of homologous genes through the formation of small interfering RNAs [18]. Immunoglobulin-like domains are used in many proteins that are involved in various aspects of immunity, and have previously been noted to generate large numbers of pseudogenes [14]. The most notable difference between ΨEs and DEs in general is that ΨEs containing EGF-like (epidermal growth factor-like) domains rarely arise, whereas exons with these domains are consistently abundant among DEs (significant difference, P < 0.05, binomial statistics; Additional File 2). EGF-like domains have expanded greatly in number over the course of mammalian evolution, and are found (with a small number of exceptions) either in the extracellular part of transmembrane proteins or in secreted proteins [20].
(iii) Association with Gene Ontology functional categories
We used Gene Ontology (GO) functional classification to assess which functional associations are the most common for ΨEs (Table ). A pairwise comparison between lists of genes was performed to identify over-represented terms according to various criteria, for ΨEs and for DEs generally. In this analysis, we studied only the human, mouse and rat genomes, since these are the genomes with extensive GO functional annotation. Of specific interest are the GO terms that are over-represented in ΨEs compared to DEs (Table ). Significant over-representation is calculated using a Fisher's exact test with P' < 0.05, and a correction to P' for multiple hypothesis testing [22].
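The over-representation test described above can be sketched as a one-sided Fisher's exact test on a 2×2 table (genes with and without a given GO term, in the ΨE list versus the DE list). The example below uses only the Python standard library. The counts are invented, the function names are hypothetical, and the Bonferroni adjustment is an assumption chosen for simplicity, since the text does not name the correction that was applied.

```python
from math import comb


def fisher_exact_greater(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact test for over-representation.

    2x2 table:            with term   without term
        ΨE gene list          a            b
        DE gene list          c            d
    Returns P(X >= a) under the hypergeometric null.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    tail = sum(
        comb(col1, k) * comb(n - col1, row1 - k)
        for k in range(a, min(row1, col1) + 1)
    )
    return tail / comb(n, row1)


def bonferroni(p_values):
    """Bonferroni adjustment (an assumed choice of correction)."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Invented counts for a single GO term.
p = fisher_exact_greater(3, 1, 1, 3)   # equals 17/70
adjusted = bonferroni([p, 0.01, 0.5])
```

A term would be reported as over-represented in ΨEs only if its adjusted value stays below the 0.05 threshold.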
Most common Gene Ontology functional categories †
The top ten GO terms for DEs and for ΨEs do not differ greatly from each other in any of the species studied. However, each organism has distinct significant over-representations of GO terms. In the human genome, 'Ion binding' and 'Nucleic acid binding' are significantly over-represented in ΨEs compared to DEs (Table ). This over-representation appears to be chiefly due to ZFC transcription factors, which are obvious candidates for regulation through unproductive splicing and translation, or through the formation of regulatory transcribed pseudogenes. In mouse, 'receptor activity' is significantly over-represented in ΨEs compared to DEs, as is 'transferase activity' in rat. These results indicate that different types of gene have undergone pseudogenic exon formation in recent evolutionary time in each of these three organisms.
(iv) Position of ΨEs with respect to the intron-exon structure of the annotated gene
In general, the majority of ΨEs are located within the 5' half of the genes in every studied genome (P < 0.01, using χ² tests; Table ). This scenario suggests that proteins tend to become more complex through addition of exons to the 5' termini of their encoding genes. These exons could be inefficiently spliced and would therefore appear in only a few transcripts, while they may be selected against if they disrupt the normal gene function [23]. Interestingly, the ΨEs are significantly 5' of their parents in rat (Table ). We suggest that this is due to lineage-specific activity related to specific gene families (Additional File 2).
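A χ² test of whether ΨEs fall preferentially in the 5' half of genes can be sketched as a goodness-of-fit test with one degree of freedom. The example assumes a null hypothesis of equal expected counts in the two halves, which the text does not state explicitly. The counts are invented and the function name is hypothetical. For one degree of freedom, the P value follows from the complementary error function, so no external library is needed.

```python
from math import erfc, sqrt


def chi2_two_halves(n_five_prime: int, n_three_prime: int):
    """Chi-square goodness-of-fit test against a 50:50 split (1 d.f.).

    Assumption: under the null, a ΨE is equally likely to lie in the
    5' half or the 3' half of its gene.
    """
    total = n_five_prime + n_three_prime
    expected = total / 2
    chi2 = ((n_five_prime - expected) ** 2 + (n_three_prime - expected) ** 2) / expected
    # For 1 d.f., P(chi-square > x) = erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


# Invented counts: 70 ΨEs in the 5' half, 30 in the 3' half.
chi2, p_value = chi2_two_halves(70, 30)
```

A different null (for example, expected counts proportional to the intronic sequence available in each half) would change the expected values but not the form of the test.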
Position of ΨEs in relation to their parents
A key issue in examining the distribution of stop codons in ΨEs is whether they would produce transcripts that are susceptible to nonsense-mediated decay (NMD). We searched for individual stop codons in the ΨEs that would lead to NMD targeting (Table ). The number of such stop codons in ΨEs is significantly smaller than expected by chance (P < 0.01, using a χ² test) in human and cow, but not in the two rodent genomes. The expected distribution in this case is calculated from the total size of the gene introns divided appropriately, given the position of the stop codons in each ΨE. This indicates a selection pressure in human and cow against the positioning of individual stop codons in ΨEs in places that would cause NMD. It has been shown that alternative splicing can be coupled to NMD to regulate the expression of other transcripts from a gene [25]. This mechanism has been dubbed regulated unproductive splicing and translation. There may therefore be a selection pressure against placement of stop-codon-bearing exons in some genes, so that they are not affected by this mechanism.
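Whether a given stop codon would be expected to trigger NMD is often judged with a positional heuristic: a stop codon lying more than about 50 nucleotides upstream of the last exon-exon junction is predicted to elicit decay. The sketch below encodes that heuristic. It is an assumption introduced for illustration, not necessarily the criterion applied in this analysis. The function name, the 50-nucleotide default, and the coordinate convention (0-based position of the first base of the stop codon in the spliced transcript) are all hypothetical choices.

```python
def predicts_nmd(exon_lengths, stop_start: int, min_distance: int = 50) -> bool:
    """Apply the positional heuristic for nonsense-mediated decay.

    exon_lengths: exon lengths (nt) in 5'->3' order for the spliced transcript.
    stop_start:   0-based transcript coordinate of the first base of the stop codon.
    Returns True if the stop codon ends more than `min_distance` nt upstream
    of the last exon-exon junction.
    """
    if len(exon_lengths) < 2:
        return False  # a single-exon transcript has no exon-exon junction
    last_junction = sum(exon_lengths[:-1])  # first base of the final exon
    stop_end = stop_start + 3
    return last_junction - stop_end > min_distance


# Invented transcript with exons of 100, 120 and 80 nt (last junction at 220).
early_stop = predicts_nmd([100, 120, 80], stop_start=60)    # far upstream
late_stop = predicts_nmd([100, 120, 80], stop_start=200)    # within 50 nt
```

Stop codons in the final exon, or close to the last junction, are not predicted to trigger NMD under this heuristic.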
We manually curated the human ΨE data to search for unexpected positional distributions in genes. In human, forty-five ΨEs were found embedded in an untranslated region (UTR). These UTR-embedded ΨEs are not highly conserved. Only eight of them are also found in chimp and rhesus (four in each species), and none of them is shared by all three primate species. None of the embedded ΨEs is conserved in a non-primate species (cow, dog, mouse or rat). This is despite syntenic conservation, in a non-primate species, of 28 of the 45 genes involved in the embedding, as judged by manual comparison in the UCSC Genome Browser [26]. It is possible that these UTR-embedded ΨEs are remnants of overlapping gene arrangements. The manner of overlap for overlapping gene pairs changes very dynamically over evolution; for example, only 95 out of 255 human overlapping gene pairs were reported to be conserved as overlapping pairs in the mouse genome [27].
(v) Participation in alternative splicing
Alternative splice products containing premature stop codons can be degraded through nonsense-mediated decay (NMD), and consequently cause altered expression of protein-coding transcripts through changes in abundance of splicing factors [7]. We examined whether any ΨEs have been annotated as part of alternative splicing events. To do this, we cross-referenced the 'splicing event' annotation of the ASD alternative splicing database [28] with our ΨE list from the human genome. Of the 284 human genes that harbour a ΨE in their genomic milieu, 101 are present in the ASD alternative splicing database. Of these, we found 22 genes (entailing 59 transcripts) with evidence of transcription of a ΨE as an alternative exon. Analyzing the alternatively spliced forms in detail, we found four cases of an unusual splicing topology (Additional File 3). These four human ΨEs can be differentially spliced in a topologically novel manner, in which one portion of a ΨE is recruited in one splice form, while a different portion of it takes part in another splice form (Additional File 3).
(vi) Ka/Ks analysis
Ka/Ks (i.e., the normalized ratio of non-synonymous to synonymous codon site substitution rates) is a measure of selection on coding sequences; values < 1.0 can indicate purifying selection, whereas values ~1.0 are theoretically expected under neutral evolution. Values significantly > 1.0 indicate positive selection over the whole of a sequence. We examined Ka/Ks values for the different populations of ΨEs and DEs. Ka/Ks values were calculated for all exon alignments with amino-acid sequence identity > 70%, to avoid consideration of saturated nucleotide sequences [2]. In general, the DEs exhibit a mode in the range 0.00–0.25 for Ka/Ks, indicating a tendency to purifying selection (Figure 6). In contrast, the ΨE populations do not exhibit such a mode, instead peaking in the range 0.25–0.75 (Figure 6). We have previously observed such a Ka/Ks peak for pseudogenic transcripts captured by transposons [29], and for processed pseudogenes [3]. This is to be expected for endemic populations of neutrally evolving sequences when they are compared with their putative parent sequences. The reasons for such Ka/Ks values < 1.0 may include: (i) continued purifying selection on the putative parent sequence; (ii) an original protein-coding phase for the present-day ΨE. Interestingly, ~30% of ΨE cases have Ka/Ks values > 1.5, which indicates that they may have undergone positive selection before becoming disabled.
Histograms of Ka/Ks for DEs and ΨEs. The DE histogram is green, and the ΨE histogram is red. The bin label x is for all values such that, x-0.25 < value ≤ x.