Retrocopies with strong evidence of expression
To determine how many retrocopies are potentially functional, we used BLASTZ to align all human mRNAs to the human genome, which resulted in several hundred thousand alignments. These matches were then scored for a set of features that indicate the likelihood of recent retroposition, including the number of processed introns; the absence of conserved splice sites; breaks in orthology with mouse, dog, and rhesus monkey; the presence, position, and length of the poly(A) tail; and sequence similarity and fraction of the parent mRNA that is represented in the retrogene (see Methods for full description). From this we obtained a set of 12,801 candidates that are likely retroposed copies of intron-containing parent genes. In order to set our score threshold, we compared our set to the manually curated Vega set of processed pseudogenes (retrocopies of mRNAs that may or may not be functional). When we found disagreements between the sets, we either improved our feature set or discovered problems with the Vega annotation. This resulted in improvements of both. In order to determine if the retrocopies are expressed, we looked for overlap with mRNA or EST evidence. Following filtering to eliminate 6,413 cases without mRNA or EST evidence, we found that 6,287 retrocopies showed evidence of expression by at least one EST or one mRNA [See Additional File 1]. For our analysis we used more stringent requirements for expression (see below) than previous work. We chose not to use Ka/Ks analysis to look for evidence of natural selection, mostly due to the short length of retrogene sequences. Many of those could be functional parts of genes; however, for more recent events, final proof might be difficult.
Major categories of retrocopy contributions
To evaluate the types of events that led to new functional gene candidates or modifications of existing genes, and to reduce the possibility that a given transcript resulted from genomic priming, we increased the stringency for evidence of expression and examined in more detail a reduced set of 726 cases that overlapped at least five ESTs and one mRNA or were annotated as a gene in RefSeq or UCSC KnownGenes (derived from Swiss-Prot).
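The two levels of expression evidence used here can be sketched as simple filters. This is a minimal illustration, not the pipeline used in the study: the `Retrocopy` record and its field names are hypothetical, and the stringent rule is read as "(at least five ESTs and one mRNA) or annotated as a gene", which is one interpretation of the sentence above.

```python
from dataclasses import dataclass

@dataclass
class Retrocopy:
    # Hypothetical record; field names are assumptions for illustration.
    name: str
    est_count: int        # ESTs overlapping the retrocopy
    mrna_count: int       # mRNAs overlapping the retrocopy
    annotated_gene: bool  # annotated in RefSeq or UCSC KnownGenes

def has_any_expression(rc: Retrocopy) -> bool:
    # Permissive criterion: at least one EST or one mRNA.
    return rc.est_count >= 1 or rc.mrna_count >= 1

def has_strong_expression(rc: Retrocopy, min_ests: int = 5, min_mrnas: int = 1) -> bool:
    # Stringent criterion, read as (ESTs and mRNA) or gene annotation.
    transcript_support = rc.est_count >= min_ests and rc.mrna_count >= min_mrnas
    return transcript_support or rc.annotated_gene

# Toy candidates, invented for the example.
candidates = [
    Retrocopy("rc_a", 0, 0, False),
    Retrocopy("rc_b", 1, 0, False),
    Retrocopy("rc_c", 7, 1, False),
    Retrocopy("rc_d", 2, 0, True),
]
expressed = [rc.name for rc in candidates if has_any_expression(rc)]
strong = [rc.name for rc in candidates if has_strong_expression(rc)]
print(expressed)
print(strong)
```

Keeping the thresholds as parameters makes it easy to see how the size of the reduced set would respond to a stricter or looser cutoff.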
We examined, in detail, all cases that were not purely duplication events, specifically those events that exhibited evidence of exon acquisition (see Methods): 1) cases with multiple coding exons, and 2) cases that showed evidence of contributions from retrocopy ORFs in the antisense direction. In general, we found that the retroposition events can be described predominantly by the following three categories: Type I: exon acquisition, in which part of the retrocopy was incorporated into an existing gene transcript, in particular, in which a portion of the retrocopy could potentially serve as a protein coding exon. Type II: retropositional gene duplication, in which apparently no pre-existing host gene at the site of insertion was altered. Type II events, in order to be functional, would require recruitment of resident regulatory elements at the site of insertion, such as promoters and/or enhancers, and the process may have been accompanied by intron generation, for example, to reduce the size of 5' UTRs [24]. Finally, Type III retrocopy events occur, in contrast to Type I and II events, when the retrocopy contributed a sequence that is largely out-of-frame, derived from a UTR, or in the opposite orientation with respect to the retrocopy's parent. Other flanking DNA sequences, including those derived from other transposed elements (SINEs, LINEs, endogenous retroviruses, and DNA transposons), may also be co-opted into the structure of these ab initio gene candidates. Hence, the Type III genes, if functional, have a protein sequence that is mostly novel.
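The three categories can be read as a small decision rule. The sketch below is only an illustration of that reading: the `RetrocopyEvent` fields and the 0.5 in-frame cutoff are hypothetical, and the cases in the study were assigned by manual inspection rather than by a rule of this kind.

```python
from dataclasses import dataclass

@dataclass
class RetrocopyEvent:
    # Hypothetical summary of one manually inspected case.
    modifies_host_gene: bool   # retrocopy sequence is spliced into a pre-existing gene
    sense_to_parent: bool      # transcribed in the same orientation as the parent mRNA
    from_parent_utr: bool      # coding contribution derives from a parent UTR
    in_frame_fraction: float   # share of contributed coding sequence in the parent's frame

def classify(ev: RetrocopyEvent, in_frame_cutoff: float = 0.5) -> str:
    # Type I: an existing host gene acquired an exon from the retrocopy.
    if ev.modifies_host_gene:
        return "Type Ia" if ev.sense_to_parent else "Type Ib"
    # Type III: contribution is largely out-of-frame, UTR-derived, or antisense.
    largely_novel = (
        not ev.sense_to_parent
        or ev.from_parent_utr
        or ev.in_frame_fraction < in_frame_cutoff  # cutoff is an assumption
    )
    return "Type III" if largely_novel else "Type II"

print(classify(RetrocopyEvent(True, False, False, 0.0)))
print(classify(RetrocopyEvent(False, True, False, 0.95)))
print(classify(RetrocopyEvent(False, True, False, 0.1)))
```

Testing host-gene modification first mirrors the text: Type I is defined by the effect on an existing gene, whereas Types II and III are distinguished by how much of the parent's protein sequence is retained.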
Comparison with other datasets
We found good agreement between our candidate set of transcribed retrocopies and the major transcribed retrocopy datasets produced by other groups [7]. Of the 223 transcribed retrocopies reported by Harrison [45] we agreed with 189 cases. Randomly selected cases that were missing from our set were cases that relied solely on scarce EST data and fell below our threshold. Of the ten randomly selected cases in our set that were missing from the Harrison set, 30% were present in the Kaessmann set [7]. Of the remaining cases not present in our set or Kaessmann's, 20% have weak expression evidence but were nevertheless classified as retrocopies by Harrison [See Additional File 1]. In contrast, we found many examples in the HOPPSIGEN dataset [47] that were not present in our dataset or Kaessmann's set. A random sample of ten were all found to be either segmental duplications or inactive LINE elements that we assume were false positives in their data. Kaessmann reported 1,080 expressed retrocopies with at least one EST [7]. We agreed with 936 of these. Most of the cases missing from our set had mitochondrial, immunoglobulin, or zinc finger genes as parent genes. These were systematically excluded from our dataset because they are frequently generated by a different mechanism, i.e., segmental duplication. We reported 936 cases that were not present in Kaessmann's set. Most were due to the smaller starting gene set that his pipeline used and exclusion of parental UTRs from the analysis.
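For orientation, the agreement counts quoted above can be expressed as fractions of each published set. The snippet uses only the counts given in the text; the helper name `agreement` is ours.

```python
def agreement(shared: int, reported: int) -> float:
    # Fraction of a published dataset that is also present in our set.
    return shared / reported

harrison = agreement(189, 223)    # Harrison: 189 of 223 cases
kaessmann = agreement(936, 1080)  # Kaessmann: 936 of 1,080 cases

print(f"Harrison:  {harrison:.1%}")   # 84.8%
print(f"Kaessmann: {kaessmann:.1%}")  # 86.7%
```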
Although the functional potential of Type II retrogenes was discussed early on [23] and overwhelmingly substantiated over the past 10 years [7], only a few cases of Type I exon acquisition [49] and of de novo gene evolution [50] have been reported. The significant number of Type I and Type III events that we report demonstrates the extent of the contribution of retrocopies to the evolutionary processes that test, reject, and retain novel amino acid encoding sequence space. All the retrogene candidates fall along a continuum from a large degree of similarity (Type II) to little similarity (Type III) to the original sequences in their respective parent genes. Many of the putative, novel retrogenes, potentially encoding proteins with no similarities to other existing proteins, may have been missed by methods relying on protein alignments, as protein-based screening methods cannot find antisense insertions and also are not able to align UTR regions of retrogenes. Protein alignment methods miss Type III retrogenes entirely. The retrocopies involved in generation of Type III gene candidates made relatively small, but potentially 'seeding', contributions to the formation of novel genes. Of course, it must be emphasized that most of the Type II and Type III mRNA retrocopy-derived "retrogenes" described in this study are putative genes for which no proteins have as yet been documented. While some of these new transcripts may code for proteins, others may serve as non-protein coding RNAs, possibly involved in cellular regulation [14] or in chromatin remodeling [53
Overview of types of events whose expression was strongly supported
Of the 726 candidate retrocopies whose expression was supported by many transcripts, 624 were composed of single protein coding exons and 102 contained multiple protein coding exons. The 102 cases came from a set of manually examined cases that overlapped known genes with more than one exon (about 500 cases), single exon cases transcribed in the reverse orientation based on EST and mRNA evidence (32 cases), and a random sample of the other single exon cases that slipped through our initial screen due to alternative splicing (see Table ). We compared: 1) the phylogenetic conservation of the ORF in the various species listed in Methods, 2) the relative contribution of the retrocopy to the new gene, 3) the relative contribution of the host gene (where applicable), 4) contribution from other types of transposed elements, 5) whether the retrocopy inserted in the sense or antisense orientation, and 6) the parent ORF and the retrocopy ORF, looking for frameshifts and mutations. Conclusions based on phylogenetic analysis are to be treated with caution, as the non-human primate sequences contain enough errors to erroneously indicate the lack or presence of phylogenetic conservation in predicted ORFs.
Description and distribution of expressed retrocopy events
Type I – genes modified by exon acquisition
Of all cases with strong evidence of expression that we inspected, we identified 5% as being potential gene fusions, or exon-acquisition events. It is generally assumed that inserted retrocopies decay without affecting the structure of the host gene. However, we found several examples in which part of a retrocopy ORF integrated into the host gene (Figure , categories 1–4, 6; Table ), and often led to alternative mRNA splicing (Figure , categories 1, 4, 5). We cannot be sure of the duration of time between the retrotransposition event and the start of alternative splicing. The new splice sites were either fortuitously present in the ORF of the retrocopy, or they arose subsequent to the integration by base changes over time. We did not observe splice sites in retrogenes that coincided with the splice sites in the parent gene. This is not surprising, as important intronic parts of splice sites are removed from the processed mRNA templates prior to retroposition. Therefore, Figure shows generic splice sites as dotted white vertical lines that do not coincide with splice sites used in the novel gene context. The six categories in Figure are defined as follows: 1) Part of protein coding sequence from parent is used as alternatively spliced exon of the host gene. 2) Retrocopy contributes new 3' exon to host gene (mostly in-frame, magenta, and partially out-of-frame, dark red, with respect to parent gene). 3) In-frame contribution (magenta) combined with out-of-frame contribution (dark red) form a new N-terminal encoding region. A short 5' UTR (medium size bar, dark red) has been generated from the ORF of the retrocopy.
Figure 1 Categories of Type I retrocopy events. A. Examples of Type Ia exon acquisitions contributed by "same orientation" of retrocopies (in magenta or dark red) with respect to host gene (light blue); not drawn to scale, splice events are marked by angled lines.
Type I retrocopy-exon acquisition events
We also found several examples in which a putative novel exon had been exapted from an ORF, but the reading frame is now different from that of the parent gene (Figure , category 4), which results in a shorter transcript. Also, putative novel exons were exapted entirely from sequences that correspond to UTRs of the parent gene, here alternatively spliced (Figure , category 5). In other instances insertion of a retrocopy and exaptation of an exon from that sequence triggered recruitment of an additional exon entirely from intronic space. For example, in Figure , category 6, the retrocopy contributes in-frame protein coding region (magenta) combined with unannotated intergenic sequence (dark blue) to form a new N-terminal encoding exon for the host gene. In turn, portions of the intronless retrocopy's protein coding region became an intronic sequence in the host gene (overlaid in grey). An interesting example of retrocopy mediated domain shuffling is the CTGLF1 gene (Figure category 2), which started as a cyclin gene, and then had three domains (PH, ArfGap and Ankyrin) contributed by insertion of a CENTG2-derived retrocopy. The mouse version of this gene, AK132782, has only a cyclin domain and represents the ancestral form before the retrocopy insertion. These observations underscore the fact that natural selection exapts novel sequence space in addition to slowly modifying existing sequence space.
Type I exon acquisition in reverse orientation
While at least one group has reported on the existence of sense retrocopy integrations into existing genes, with coding contributions [7], this is the first report of mRNA retrocopy integrations in the antisense orientation. The existence of this category of retrocopy events, if functional, supports the idea that natural selection has no preference with respect to the origin of novel sequences. In this category, novel exons were recruited from retrocopies that inserted into or adjacent to host genes in the opposite orientation to the retrocopy's parent gene. As in the case of Type Ia 'sense' retrocopies, the splice sites in the 'antisense' retrocopies, of course, do not correspond to those present in the parent genes. Of the Type Ib examples that we manually inspected, polypyrimidine tracts inserted by the retrocopy – used for recognition of splice sites – were frequently derived from antisense oligopurine tracts in the parent gene. These sequences are often rich in codons for lysine, glutamic acid, and glycine, as well as certain codons of arginine in the parent gene. A few Type Ib examples are described in detail below.
1) Internal exon (dark red) added to host gene in the opposite orientation relative to the parent of the retrocopy. For example, the BRCA1 gene has an alternatively spliced internal cassette exon (potentially encoding 22 aa) contributed by RPL21 in the antisense direction (Figure , example 1; Table ). The insertion occurred after the New World monkey split and the reading frame is open in chimpanzee, orangutan, and rhesus monkey.
2) Internal exon added to host gene triggered recruitment of an additional protein coding exon from formerly intronic sequence (dark blue). For example, RORA acquired an internal cassette (encoding 25 aa – PDB structures 1N83 and 1S0X and Swiss-Prot P35398) from an antisense retrocopy of CYCS. Interestingly, a second exon (encoding 27 aa) appeared in conjunction with the retrocopy-derived exon, apparently derived from an intronic sequence, that maintains the frame of the gene (Figure , example 2). The open reading frame is maintained in orangutan, rhesus monkey, and marmoset, but there is an early in-frame stop in chimpanzee (confirmed by our re-sequencing, unpublished) – an example in which one lineage did not retain such an innovation (see discussion).
3) Recruitment of a 3' exon including novel ORF and 3' UTR generated from ORF of the retrocopy; it extends the ORF of the host gene (Figure , example 3). Examples are SCP2, HLA-F, and KIAA0415 with potentially functional variants that have alternatively spliced 3' ends that are derived from antisense retrocopy insertions of RRAS2, RPL23, and FLJ10324, respectively. The SCP2 variant that includes the retrocopy-derived exon has a shorter transcript (potentially encoding 338 aa instead of 547 aa) and is only present in chimpanzee and human. In HLA-F the insertion generated a longer ORF (potentially encoding 442 aa instead of 362 aa; Figure , example 3). Importantly, the retro-derived variants of SCP2 and HLA-F are reviewed NCBI RefSeq genes.
4) Portion of the retrocopy contributes a potentially alternatively spliced protein coding exon in conjunction with a novel protein coding exon generated from intergenic sequence (dark blue). For C6orf148, we detected two mRNA variants; the first, depicted in Figure (example 4a), presumably also represents the ancestral status prior to the retrocopy insertion. The second putative variant has an alternative, upstream promoter, a new first exon from an unknown source, and a second protein coding exon (Figure , example 4b) derived from the EIF3S6 retrocopy in reverse orientation. Surprisingly, the third putative coding exon in the second variant is also longer than the corresponding N-terminal coding region in the original variant. Part of the EIF3S6 UTR was potentially exapted as a protein coding sequence (example 4a). Both splice forms and open reading frames coexist in chimpanzee and rhesus monkey.
5) Alternative splicing or alternative translation after retrocopy insertion. The first putative protein coding exon becomes one of the 5' UTR exons (light blue); a second 5' UTR exon is recruited from unknown sequences. The first putative protein coding exon (dark red) is recruited from the retrocopy (Figure , example 5). As in the aforementioned examples, DENN1B and HK1 also exhibit mRNA variants with and without their respective retrocopies. Their promoters are shared by both variants and both have putative alternative translation starts. Interestingly, in both cases the version with retrocopy contribution does not start transcription in the first exon, but instead, includes a second UTR exon before splicing to the retrocopy-derived putative antisense coding exon. The next protein coding exon is shared by both variants. The ORF for HK1 is open in chimpanzee, orangutan, and rhesus monkey. DENN1B has valid ORFs only in human and chimpanzee, but the retrocopy-derived portion is present with disruptions in orangutan and rhesus monkey.
6) Two 5' UTR exons, intron, and N-terminal encoding exon are recruited from the protein coding region of the retrocopy. For example, one variant of CSMD3 is expressed in the brain and contains a sequence encoding a potential N-terminal 79 aa exon (Figure , example 6a). The other putative variant of CSMD3 is expressed in testes (based on mRNA and EST evidence), and instead of the 79 aa exon uses part of an antisense RPL18-derived retrocopy; the event might also have led to use of a new promoter for the gene (Figure , example 6b). In addition, a 5' UTR exon and a small intron were derived from the protein coding region of the retrocopy. Only the human version of the retrocopy-containing putative variant has an intact ORF.
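The observation above that polypyrimidine tracts in antisense (Type Ib) retrocopies often derive from oligopurine tracts in the parent gene follows directly from base pairing, and a short sketch makes this concrete. The codon string below is invented for illustration; it uses purine-only codons for lysine (AAA, AAG), glutamic acid (GAA, GAG), glycine (GGA), and arginine (AGG).

```python
# Complement table for uppercase DNA.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    # Sequence read from the opposite strand, 5' to 3'.
    return seq.upper().translate(COMPLEMENT)[::-1]

def is_oligopurine(seq: str) -> bool:
    return all(base in "AG" for base in seq)

def is_polypyrimidine(seq: str) -> bool:
    return all(base in "CT" for base in seq)

# Hypothetical stretch of parent coding sequence: Lys Glu Gly Lys Glu Arg.
codons = ["AAG", "GAA", "GGA", "AAA", "GAG", "AGG"]
sense = "".join(codons)
antisense = reverse_complement(sense)

print(sense, is_oligopurine(sense))
print(antisense, is_polypyrimidine(antisense))
```

Because A and G pair with T and C, any purine-only coding stretch reads as a pyrimidine-only stretch on the opposite strand, which is the kind of sequence recognized upstream of a 3' splice site.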
Type II duplication events
Of the Type II events, 60 contained one or more 5' and/or 3' untranslated exons acquired along with regulatory elements derived from the flanking region of the insertion site [see Additional File 2, categories 2 and 3] as predicted previously [24]. A few examples in which introns also arose in flanking UTRs were reported recently [7]. In our numerous examples, we found no indication that the UTR introns came from the parent gene. Occasionally, a 5' or 3' exon recruited from the locus provided not only a UTR, but also the first or last protein coding exon [see Additional File 3]. We also observed shorter and longer N- or C-terminal encoding parts of genes in separate lineages [see Additional File 2, categories 4–8,10,11], representing one of the mechanisms that might explain the frayed ends of many protein homologs, even orthologs [54]. In addition, we found cases where an intron arose within the coding region after retroposition [see Additional File 2, categories 8–11, Additional File 3]. These and the aforementioned examples (categories 2 and 3) underscore the notion that even intron-containing genes, especially those with large exons and relatively intron-impoverished with respect to their parent genes, can be derived from retroposition. Similarly, one or several introns of a gene can be lost by recombination with the corresponding retrocopies [55
Type III novel gene candidates
In a small fraction of the cases (16) we examined, a putative new gene with no known homologs included a retrocopy (usually only part thereof) that inserted into the genome and possibly provided protein coding sequence either 1) out-of-frame (Figure , Additional File 4A) or 2) antisense with respect to the parent gene (Figure , Additional File 4B–E, Table and Additional File 3
Figure 2 Novel protein-sequence space generated by parts of retrocopies combined with other transposons or unusual events. For each part of the figure, the spliced parent mRNA is shown first (before retroposition) and the resulting gene(s) are shown below.
Type III novel retrogenes that are out of frame or reverse sense with respect to the parent gene
Examples are briefly described as follows: A) The novel candidate gene FLJ25758 was generated from MARK2-derived retrogene featuring protein coding sequence completely out-of-frame with respect to the parent. There is a potential LTR contribution of promoter and the remaining UTR sequences of FLJ25758 gene were derived from flanking sequence shown as blue bars. B) The novel candidate gene FLJ40504 was generated from a KRT18-derived retrocopy out-of-frame with respect to the parent, the remainder of the protein coding sequence was derived from flanking region. C) The first protein coding exon of candidate gene MGC34774 was derived from a MIR element (yellow); the second exon from intergenic region (blue) and the final protein coding exon from the 5' UTR and ORF of L13A-derived retrocopy out-of-frame (dark red). D) The novel gene candidate C15orf21 is processed into three alternatively spliced transcripts, two of which use retroposed sequences as first protein coding exon (yellow). The second protein coding exon, in part from unknown sequences, was fused to an HMG14-derived retrocopy completely out-of-frame. E) The RPL7A-derived retrocopy contributes half of the ORF in-frame (magenta) and half out-of-frame (dark red) yielding novel gene candidate Q9BR82. Upstream exons were contributed by repeats (yellow). F) CT47 gene was generated, in part, from a NAPL1-derived retrocopy, which contributed the C-terminal encoding region out-of-frame. Most of the ORF (blue) arose from intergenic sequence. G) Two alternatively spliced mRNAs originated from an RCN1-derived retrocopy (3' UTR and ORF) in reverse orientation. H) LTRs contributed UTR and first protein coding exons (yellow); other exons were derived from intergenic sequence (blue); candidate gene C20orf91 contains part of a LOC16236-derived retrocopy in opposite orientation. 
I) Two events, retroposition of SNX9 followed by a second retroposition of SNAG1 or segmental duplication formed the intron containing candidate gene FLJ25328 in reverse orientation to SNAG1, presumably with generation of two introns out of ancestral ORF sequences. J) A novel gene candidate FLJ13355 (2 splice variants) was formed from a C18orf24-derived retrocopy (retrocopy ORF contributed 5' UTR and retrocopy 5' UTR contributed N-terminal encoding region). The second, larger exon including a large part of ORF was contributed by the internal promoter and 5' UTR region of a LINE element (yellow). K) An IQCK-derived retrocopy contributed the protein coding exon in reverse orientation (dark red) yielding novel gene candidate FLJ32895. Downstream UTR exons were derived from Alu elements and intergenic sequences.
In summary: Although there is no evidence that a protein is produced, the size of the putative new ORFs ranged from 81 to 259 aa in human, and seven maintained open reading frames in human, chimpanzee and orangutan. Only four cases also had open reading frames in rhesus monkey and only two also in marmoset (Table ). Surprisingly, all but three had multiple putative exons, lending further weight to the notion that we were not observing random transcription. While fusions between existing genes and mobile elements have been described [56], we also observed exons that were generated, in conjunction with the retrocopy, by other types of transposed elements and/or unannotated sequences. For example, most of the CT47 gene [57], which has evidence of protein coding sequence (Swiss-Prot Q5JQC4), arose initially from unannotated sequence (see below), was amplified by segmental duplication, and has a valid ORF in human, chimpanzee, orangutan, rhesus macaque, and marmoset. Other cases in which a gene arose from unannotated sequences have been described in flies [50]. Out of the 16 primate cases, seven had putative protein coding regions that included repetitive elements [shown in yellow tall boxes, Additional File 4A]. One was composed of a chimeric fusion of two retrocopies [see Additional File 4A, Additional File 3
Functional categories of source genes
We looked at GO annotations of the parent genes that spawned retrocopies. For both the expressed and non-expressed sets of retrocopies we found no statistically significant enrichment from a normally distributed set.
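The text does not state which statistical test was applied to the GO annotations. As a point of reference only, a common way to ask whether a GO term is over-represented among parent genes is a one-sided hypergeometric test; the sketch below assumes that choice, and the counts in the example are invented.

```python
from math import comb

def hypergeom_upper_tail(k: int, K: int, n: int, N: int) -> float:
    """P(X >= k) when n genes are drawn without replacement from N genes,
    K of which carry the GO term of interest."""
    total = comb(N, n)
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total

# Invented toy numbers: 20 genes in the background, 5 with the term,
# 4 parent genes sampled, 2 of them carrying the term.
p = hypergeom_upper_tail(k=2, K=5, n=4, N=20)
print(f"p = {p:.4f}")
```

In a real analysis the p-values would be computed per GO term and corrected for multiple testing before any term is called enriched.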