Criteria for Identifying ψmRNAs
We identified ψmRNAs by aligning the FANTOM3 cDNA sequences against all known proteins (from all organisms) in the Swiss-Prot database [17
], and retaining alignments with frameshifts and/or internal stop codons. The alignments were performed by the program FASTX, which translates the cDNA in all three frames, and allows alignments to switch between these frames with forward and reverse frameshifts [18
]. FASTX also estimates the statistical significance of each alignment in terms of an E
-value. These E
-values are very accurate, unless the sequences have unusual monomer compositions [19
]. Therefore we filtered low complexity and tandem repeat sequences using the programs PSEG and XNU. We retained alignments with E
≤ 0.01, meaning that spurious matches to unrelated proteins are expected for 1% of the cDNAs. For each cDNA, only the alignment to its closest Swiss-Prot homologue (top FASTX hit) was considered.
It is essential to demonstrate that these ψmRNAs are real biological transcripts rather than experimental artefacts. Reading frame disruptions often indicate sequencing errors in the cDNA. To exclude these cases, we required independent confirmation of disruptions from the mouse genome sequence. Taking the cDNA-to-genome alignments produced by the FANTOM3 Consortium, we checked that the protein-aligned region of cDNA aligned to the genome without any gaps, except for introns, defined as gaps in the cDNA only of size 15 nt or more and flanked by the standard splice sequences GT and AG. This criterion ensures that frameshifts within the protein–cDNA alignment have no corresponding gaps in the cDNA–genome alignment; so if the frameshifts are due to sequencing errors, the same error must be present in both the cDNA and the genome sequence, which is extremely unlikely. In addition, internal stop codons were counted only if they aligned to identical genomic sequences. Thus, these transcripts contain reading frame disruptions that are not caused by sequencing error.
Although these cDNAs are confirmed by the genome sequence, it might still be argued that they represent intronic or untranscribed sequences, from erroneous cloning of pre-mRNA or DNA. However, many of the ψmRNAs have exon–exon junctions within the protein-aligned region (), which is not consistent with this type of artefact.
A further potential source of error is that some Swiss-Prot proteins may be erroneous translations of frameshifted or noncoding nucleotide sequences. In fact, this is why we used only the manually curated Swiss-Prot database, since we frequently encountered this sort of error when using the more extensive TrEMBL set. In addition, we excluded proteins with the keyword “hypothetical protein”, which is defined as “predicted protein for which there is no experimental evidence that it is expressed in vivo”. Nevertheless, the remaining proteins are likely to include a small percentage of errors.
As a final filter, we eliminated cDNAs that map to the mitochondrial genome and whose only disruptions are internal TGA stop codons, since TGA encodes tryptophan in the mitochondrion. As might be expected, there are no mitochondrial ψmRNAs after applying this criterion.
Number of ψmRNAs
The pipeline described above indicates that 10,679 out of 102,801 FANTOM3 cDNAs are ψmRNAs (Table S1
; format described at http://song.sourceforge.net/gff3.shtml
). Since we used an arbitrary E
-value threshold of 0.01, it is worth examining how the number of ψmRNA predictions varies with E
-value cutoff. reveals a discontinuity around E
. There are 4,746 ψmRNAs with E
, the bulk of which are multi-exon (unprocessed), as might be expected since unprocessed pseudogenes are more likely than processed pseudogenes to possess upstream promoters and be transcribed, owing to the different mechanisms by which they are formed. At E
there is a rapid increase in the number of single-exon ψmRNAs, most of which overlap transposon sequences (). These transposon-associated predictions are discussed further below. Clearly, the number of ψmRNAs we report depends sensitively on the E
-value cutoff. This is expected because ψmRNAs have diverged from their protein-coding homologues by varying degrees, and there comes a point where the sequence similarity is not statistically significant. It is more accurate to say that the number of detectable
ψmRNAs is around 10,000. Finally, of course, neither the FANTOM3 transcriptome set nor the reference Swiss-Prot set is complete, so almost certainly additional ψmRNAs remain to be discovered, e.g., we will miss ψmRNAs that are not similar to any known protein.
Number of ψmRNAs as a Function of Alignment E-Value Cutoff
The FANTOM collection includes groups of cDNAs that come from the same genomic locus. When ψmRNAs with shared exonic nucleotides are clustered (the transcriptional unit criterion [3
]), they reduce to 8,515 clusters. Alternatively, when ψmRNAs with identical genomic mappings of their protein homology segments are clustered, they reduce to 9,583 clusters. cDNAs within such a cluster may differ outside the protein homology region, which could affect their biological behaviour, e.g., a downstream splice site could promote nonsense-mediated decay (NMD). In the following analyses, we either use all 10,679 ψmRNAs or the 9,583 ψmRNA clusters when focusing especially on the protein homology segments. In any case, the ~10,000 ψmRNAs are mostly distinct.
Categories of ψmRNAs
We sought to understand ψmRNAs better by classifying them based on how they are produced and other basic properties (). In 570 cases, all the frame disruptions were associated with large unspliced insertions (≥15 nucleotides) in the transcript relative to their homologous proteins: a signature of intron retention. Some of these cases might be transcribed pseudogenes with large insertion mutations, but in all the cases we investigated manually there were alternative transcripts that spliced out the inserted region. Intron retention is difficult to analyse because it is hard to tell whether we have captured incompletely processed pre-mRNA or genuine splice variants. Presumably these 570 cases include some artefactual, immature sequences and some genuine ψmRNAs.
In a further 831 ψmRNAs, all the frame disruptions were associated with either splice junctions or unspliced insertions as above. In these cases the frame disruptions could be avoided by altering the splicing pattern, and in all cases that we investigated manually there were indeed splice variants that avoided the frame disruptions. So these ψmRNAs are disrupted splice variants of protein-coding genes. The disruptions arise in various ways. Frameshifts are caused by the use of out-of-frame alternative donor and acceptor splice sites, or by skipping or inclusion of exons whose length is not a multiple of three. Internal stop codons arise from alternative splice sites that cause extra genomic sequence to be incorporated in the exons, or from inclusion of facultative exons. Since the FANTOM cDNA collection is biased against finding multiple variants of the same gene [20
], the proportion of ψmRNAs formed through splice variation is an underestimate.
The remaining ψmRNAs include 2,608 that have splice junctions within the protein-aligned region, and 6,670 that do not. The former presumably derive from unprocessed pseudogenes, whereas the latter may come from processed pseudogenes or single exons of unprocessed pseudogenes. For the majority of single-exon ψmRNAs, the protein homology segment overlaps a transposon sequence. Overall, a large majority of ψmRNAs in this list derive from processed and unprocessed pseudogenes, although this might simply reflect the FANTOM bias against splice variants.
Popular ψmRNAs and Transposons
Some Swiss-Prot proteins have multiple ψmRNA homologues. lists the top ten proteins with the most homologues among the clustered ψmRNAs. For all these proteins, the region of the ψmRNA aligned to the protein usually overlaps a particular type of transposon, indicated in the table. Some of these proteins are components of active LINE elements and endogenous retroviruses: the genome contains numerous inactive, decaying copies of such elements, so it is not too surprising that many transcripts contain disabled homologues of these proteins. These are bona fide pseudogenes. Other cases appear to have a converse history, where ancestrally noncoding sequences have been incorporated into a protein-coding region. For example, the mouse Jak3 gene has one splice variant that incorporates an Alu element within the protein-coding segment, and the second coding exon of the mouse Nedd4 gene overlaps a B2 SINE. So these proteins are (genuinely) homologous to many noncoding transcripts that contain Alu and B2 elements. Although it seems strange to call SINE-containing transcripts ψmRNAs, and they can certainly be set aside as a special case, they arguably do fit the definition of a ψmRNA since they contain noncoding sequence that is homologous to protein-coding sequence.
Top Ten Proteins with Most ψmRNA Homologues
We also examined which categories of protein are overrepresented among ψmRNA homologues, compared to FANTOM mRNAs. Swiss-Prot entries include manually curated keywords that categorise the proteins according to functional, structural, and other criteria. Among the clustered ψmRNAs, 119 are homologous to ribosomal proteins (p
= 3 × 10−9
), 206 to G-protein-coupled receptors (p
), and 158 to polyproteins (p
= 4 × 10−47
). There are many ribosomal protein and G-protein-coupled receptor pseudogenes [21
], and so their overrepresentation among ψmRNAs is not surprising. Polyproteins are cleaved to produce several functional polypeptides: their overrepresentation here suggests that some ψmRNAs may encode truncated polyproteins that generate a subset of the polypeptides.
Potential Truncated Proteins and NMD
Transcripts with reading frame disruptions may be translatable into partial protein sequences. For example, the section between the start codon (if it is present) and the first frame disruption might be translated. If the first disruption is a frameshift, a stop codon will usually follow soon after since on average three out of 64 codons are stops in noncoding frames. In some cases there are frame disruptions near the beginning of the transcript–protein alignment, followed by a long, undisrupted region at the 3′ end with an alternative or internal in-frame start codon. In these cases it is tempting to speculate that the long 3′ region is translated, although if the start codon is present it is perhaps more plausible that a short peptide is translated from the start of the aligned region. We considered both possibilities for the ψmRNA predictions. We emphasise that our ψmRNA set does not include every splice variant that encodes a truncated protein: it includes such splice variants only if they contain out-of-frame protein-coding sequence.
We identified translatable open reading frames (ORFs) in the ψmRNAs that have some in-frame overlap with the protein-aligned region. Sometimes more than one such ORF is present. We chose either the ORF with the maximal number of codons aligned in-frame to protein residues (A), or the ORF whose in-frame overlap is earliest (most upstream) (B). Almost a quarter of ψmRNAs (2,372) have no ORF with in-frame overlap with the transcript–protein alignment. On the other hand, about a third of predicted ψmRNAs (3,557) have ORFs of 100 aa or more that overlap some protein residues in-frame, and for approximately 10% of predicted ψmRNAs, the ORF covers greater than 90% of the aligned protein (1,178 ψmRNAs using maximal ORFs and 1,069 ψmRNAs using earliest ORFs). These mostly translatable cases may well encode functional proteins, representing benign protein evolution by slight changes in the start or end of translation. There are several possibilities for the intermediate cases: they may encode functional, truncated proteins such as dominant negatives, they may be untranslated, or they may undergo accidental translation into non-functional proteins.
Truncated ORFs of ψmRNAs and Potential for NMD
NMD is a phenomenon whereby mRNA molecules with nonsense (premature termination) codons undergo rapid degradation. However, the stop codons must lie at least 50–55 nt upstream of an intron. Interestingly, it has been proposed that NMD has a universal role in proofreading mRNA, protecting the cell against potentially toxic dominant negative truncated proteins. Such dominant negatives could arise for example when a ligand-binding domain is preserved but the signal transduction domain is absent [23
]. It has also been shown that in the absence of NMD, the cell overexpresses transcripts arising from retroviral and retroposed elements [24
]. The FANTOM dataset is expected to contain few NMD transcripts, as these are rapidly degraded and, therefore, not likely to be selected for during cloning and full-length sequencing. We assessed how many ψmRNAs may be subject to NMD by checking for premature stop codons 55 nt or more upstream of a splice junction. Supposing that either the maximal ORFs or earliest ORFs are utilised, a fairly small minority of ψmRNAs appear subject to NMD: 1,228 and 2,077, respectively (). In fact, the maximal and earliest ORFs are distinct in only 2,185 cases; in these cases 1,042 earliest ORFs and 193 maximal ORFs satisfy the NMD criterion. This large discrepancy suggests that if a ψmRNA is translated, the maximal ORF is more likely to be utilised.
The opal codon TGA usually encodes a translation stop but occasionally encodes the rare amino acid selenocysteine. It has been reported that the human selenoproteome consists of 25 selenoproteins [25
]. Thus, ψmRNA predictions that have internal TGA codons as their only disruptions might actually encode selenoproteins.
To assess the impact of selenoproteins on our dataset, we plotted the number of ψmRNA predictions from the clustered set that have varying numbers of internal TGA codons exclusively and no other reading frame disruptions (). As a control, we also plotted numbers of ψmRNAs with exclusively TAA or TAG disruptions. There are consistently more cDNAs with TGA disruptions only than TAA only or TAG only: often 2-fold more. As a further control, we counted numbers of internal TAA, TAG, and TGA codons in ψmRNAs with more than one type of internal stop codon (but no frameshifts). These results (2,463 TAA, 2,107 TAG, 2,709 TGA) were used to estimate the expected numbers of TGA-only ψmRNAs, assuming internal stop codons occur randomly and independently in these proportions. These expected numbers are always less than the observed numbers (), and in total there are around 300 more TGA-only cases than expected.
An Excess Number of ψmRNAs Have TGA Stop Codons as the Only Reading Frame Disruption
The most extreme case has ten internal TGA codons and no other frame disruptions (), a very unlikely occurrence if the three types of stop codon are utilised randomly and independently. In fact, this gene encodes a well-characterised selenoprotein, selenoprotein P, thought to function in selenium homeostasis and oxidant defence. Twenty other cases align to known selenoproteins, each with one TGA codon. The two cases with seven and eight TGA codons have borderline FASTX E-values, and might represent novel selenoproteins. An alternative explanation of repeated stop codons, tandem repeats, is unlikely since tandem repeats were filtered using XNU. These results suggest that mice, and by extension perhaps humans, possess many more than 25 selenoproteins.
Selenoproteins often contain a particular secondary structure (SECIS) in the 3′ UTR, so we used the program SECISearch to search for these in the ψmRNAs [25
]. We found no enrichment in predicted SECIS elements in the ψmRNAs with opal codons once we removed RNAs encoding known selenoproteins. The progam has a high false negative rate (28%), so this analysis does not rule out the possibility that many of the TGA-containing ψmRNAs may actually encode novel selenoproteins with the SECIS motif, or that an alternative selenoprotein insertion motif is used.
Nuclear Mitochondrial Pseudogenes
Nuclear mitochondrial pseudogenes (numts) are disabled copies of mitochondrially encoded genes in the nuclear genome. The difference in genetic code (TGA is a tryptophan codon in the mitochondrion but a stop codon in the nucleus) means that numts are often dead on arrival. The clustered ψmRNA set includes 55 transcribed numts. Two of these, FANTOM clones 9330154B14 and C530050P18, have FASTX E < 10−10 and are clearly real numts: they align cleanly to nuclear chromosomes but not to the mitochondrial genome. The remainder have FASTX E ≥ 0.0001 and are thus borderline cases. Aligning these cDNAs to the mitochondrial genome at the nucleotide level did not clarify the situation: four have BLASTN E ≤ 0.01, 13 have E ≤ 0.1, and 49 have E ≤ 1. They are presumably a mixture of ancient numts and spurious alignments. All but three of these numts have disruptions other than TGA stop codons, so they do not explain the excess TGA-only cases observed in .
The clustered ψmRNA predictions include 159 cases with compensatory frameshifts. In these cases the transcript–protein alignment undergoes multiple frameshifts but ends up in the same frame in which it started, and the transcript is translatable in this frame without internal stop codons. These cases should arguably be set aside from the ψmRNA list since they may well encode full-length proteins. However, these transcripts are interesting because their frameshifted portions would encode amino acid sequences completely different from their Swiss-Prot homologues. Manual inspection revealed that the frameshifts sometimes occur in low-similarity regions of the alignment, and are thus unreliable, but other alignments are unambiguous because they have close to 100% identity across their entire length. The number of out-of-frame codons is usually less than 20, but the largest reliable case has 57 out-of-frame codons (). Our method imposes an artificial lower bound on the number of out-of-frame codons, because the alignment score penalty incurred by two frameshifts cannot be compensated by a short run of intervening matches. Although we have ruled out sequencing errors in the FANTOM cDNAs, compensatory frameshifts might be attributed to sequencing errors in the transcript from which the homologous Swiss-Prot protein was derived. Compensatory frameshifts may be a source of large- and small-scale changes in protein evolution.
Number of Out-of-Frame Codons in Transcripts with Compensatory Frameshifts
The compensatory frameshifts are almost always caused by insertion and deletion mutations, but in at least two cases they arise from alternative splicing (see Discussion
). In these cases one section of protein-coding sequence simultaneously encodes two different amino acid sequences in alternative reading frames.
Evolution of Expressed Pseudogenes
We examined the evolutionary forces acting on expressed pseudogenes by estimating nonsynonymous and synonymous substitution ratios (dn and ds) for a subset of 132 ψmRNAs. As expected, these sequences exhibit a shift towards neutral evolution (towards a dn/ds value of one) in comparison with a control set of mouse protein-coding genes, although the distribution of dn/ds values is heterogeneous (). Thus, expressed pseudogenes have a mean dn/ds of 0.809 versus 0.603 for the control set for pairs with dn < 2 and ds < 2 (Wilcoxon rank sum test, p = 3.13 × 10−6), or 0.839 and 0.53, respectively, for all pairs (p = 4.139 × 10−10). Naturally, dn/ds describes only the mode of coding sequence change, and lack of protein-coding conservation does not preclude the possibility of function as noncoding RNA. The distribution of ds values for the expressed pseudogenes is similar to that for the control set of paralogues (), with a high proportion of pairs with ds < 0.2 and a long tail of pairs with higher ds values. Assuming the molecular clock, i.e., a similar linear relationship between time and ds, this suggests that expressed pseudogenes persist in the genome over long evolutionary timescales, similarly to protein-coding paralogues.
Ratios of Synonymous and Nonsynonymous Substitution in Expressed Pseudogenes
To establish the degree of human/mouse conservation, we searched for human orthologues of murine expressed pseudogenes using BLAST. As a positive control, and to differentiate from alignments generated by family members, the Swiss-Prot entries paired with the expressed pseudogenes were used. shows an overlay cumulative plot of E-values obtained for FANTOM clones (line) and paired Swiss-Prot sequences (circles). Both the expressed pseudogenes and the paired Swiss-Prot sequences have a similar total number of alignments (395 and 401, respectively), suggesting that non-specific alignments with family members were reported. However, as is evident from , at lower E-values the cumulative curves strongly diverge, implying that differentiation between the pseudo-orthologue and intact orthologue is feasible. Thus, 55 out of 88 expressed pseudogenes with reported alignments had the lowest E-value alignment on a different human contig than did the corresponding Swiss-Prot entry. These FANTOM clones were designated as putatively conserved expressed pseudogenes. This implies significant overall conservation of at least 39% (51 out of 132 expressed pseudogenes). As expected, putative conserved pairs were older than than the remaining set of expressed pseudogenes (mean ds of 1.68 versus 1.01, Wilcoxon rank sum test, p = 0.002). They also had a lower mean dn/ds of 0.62 versus 0.99 (Wilcoxon rank sum test, p = 0.039). However, dn/ds also decreases with age in the control set of paralogues, with means of 0.53 and 0.65 in matched age groups.
Similarity between Mouse Expressed Pseudogenes and Human Genomic Sequences
Expression and Promoter Characteristics of ψmRNAs
The locations, shapes, and expression patterns of mouse promoters have been revealed by massive sequencing of CAGE tags (approximately 20-nt sequence tags from the 5′ ends of transcripts) (P. Carninci, A. Sandelin, B. Lenhard, S. Katayama, K. Shimokawa, et al., unpublished data). The overall expression levels of ψmRNAs, measured by numbers of associated CAGE tags, are not significantly higher or lower than those of non-ψmRNA FANTOM transcripts. ψmRNAs are significantly (p
= 7 × 10−7
) associated with the BR shape class of promoters, which initiate transcription over a broad region and tend to overlap CpG islands (P. Carninci, A. Sandelin, B. Lenhard, S. Katayama, K. Shimokawa, et al., unpublished data). They are also significantly (p
= 0.0006) associated with the regionally biased expression class of promoters, where sub-regions of the promoter have distinct tissue specificities (H. Kawaji, S. Katayama, A. Sandelin, C. Kai, J. Kawai, et al., unpublished data). ψmRNA promoters have a significant enrichment relative to other promoters for at least 50 transcription-factor-binding motifs from the TRANSFAC database, associated with 39 transcription factors (Table S2
). A large fraction of these transcription factors are nerve system specific and pancreatic beta cell specific (Table S3
). Thus ψmRNAs have distinctive promoter and expression patterns, suggesting that they may occupy specific functional niches.