Genome-wide survey of transcripts containing transposed elements
To evaluate the effect of TEs on the human and mouse transcriptomes, we calculated the total number of TEs in both genomes, the number of TEs in introns, and the number of TEs that are present within mRNA molecules. To this end, we downloaded EST and cDNA alignments, as well as annotations of repetitive elements, for the human and mouse genomes from the University of California, Santa Cruz (UCSC) genome browser (hg17 and mm6, respectively) [28], and analyzed them for TE insertions (see Materials and methods, below). Our analysis of the numbers of TEs in the human and mouse genomes is summarized in Tables 1 and 2, respectively. There are approximately 3.9 and 3.1 million copies of TEs in the human and mouse genomes, respectively. The most abundant TE families within the human genome are Alu and L1 elements, with almost 1.1 million and 800,000 copies, respectively. The most abundant TE families in the mouse genome are L1 (800,000 copies) and B1 (500,000 copies).
| Table 1. TE effect on the human transcriptome |
| Table 2. TE effect on the mouse transcriptome |
Next, we examined the number of TEs in introns. It is interesting to note that all families of TEs have a tendency to reside within intronic regions. Between 44% and 66% of TE insertions are located within intronic sequences. Alu in humans and B4 in mice have the highest ratio of insertions within introns (66%), whereas L1 and LTR both in human and mouse have the lowest percentage of copies within introns (58% in human and 56% in mouse for L1, 44% in human and 52% in mouse for LTR). L1 and LTR exhibit a biased insertion in the antisense orientation relative to the mRNA within intronic sequences in both human and mouse: 185,428 and 96,718 L1 repeats were inserted in the antisense and sense orientations in human, respectively; 113,862 and 68,101 L1 repeats in mouse; 96,654 and 39,804 LTRs in human; and 101,001 and 55,689 LTRs in mouse. No such bias was detected in SINEs, or in L2, CR1, and DNA repeats. This shows a tendency toward insertion or fixation of all TEs into intronic sequences.
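The orientation bias described above can be made concrete with a few lines of arithmetic. The following is a minimal sketch that uses only the intronic L1 and LTR counts quoted in the text; the dictionary layout and the function name are illustrative choices, not part of the original analysis. In the absence of any orientation bias, the antisense fraction would be expected to lie near 0.5.

```python
# Intronic insertion counts quoted in the text, stored as (antisense, sense).
INTRONIC_COUNTS = {
    ("human", "L1"): (185_428, 96_718),
    ("mouse", "L1"): (113_862, 68_101),
    ("human", "LTR"): (96_654, 39_804),
    ("mouse", "LTR"): (101_001, 55_689),
}


def antisense_fraction(antisense, sense):
    """Fraction of oriented intronic copies that lie antisense to the mRNA."""
    return antisense / (antisense + sense)


# One fraction per (species, family) pair; 0.5 would mean no orientation bias.
fractions = {
    key: antisense_fraction(*counts) for key, counts in INTRONIC_COUNTS.items()
}

for (species, family), frac in fractions.items():
    print(f"{species:5s} {family:3s} antisense fraction = {frac:.3f}")
```

All four fractions lie well above one-half, which is the bias the text reports for L1 and LTR elements.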
Did all transposed elements families undergo exonization, and do they all have the same exonization level?
TEs present in ESTs/cDNAs were separated into those located within annotated genes (according to the knownGene list in UCSC; see Materials and methods, below) and those that were not mapped to known genes. The latter were considered to reside in non-protein-coding genes (see Materials and methods, below).
We then examined exonization of TEs, that is, the creation of an internal exon in which a TE forms either part of or the entire exon sequence. All TE families in both human and mouse can undergo exonization (Tables 1 and 2, respectively; the two right-most columns). We found a much higher level of TE exonization in the human transcriptome than in the mouse transcriptome. We calculated the exonization level (LE) as the percentage of intronic TEs that were exonized (also see Materials and methods, below). In humans, 0.12% of the TEs exonized within protein-coding genes (1,824 TE exonizations out of 1,452,916 TEs in introns) and 0.18% of the TEs exonized within non-protein-coding genes (1,653 out of 896,471). In contrast, in the mouse transcriptome we found a 0.06% rate of exonization within protein-coding genes (506 out of 888,768) and 0.08% (722 out of 942,164) within non-protein-coding genes. The higher level of exonization in human compared with that in mouse is significant even after normalization for the relative EST/cDNA coverage (7.9 million transcripts in human versus 4.7 million transcripts in mouse; a ratio of 1.7). That is, even if we multiply the exonization level of mouse by 1.7, there is still significantly higher exonization in the human genome (χ², Fisher's exact test; P < 10⁻²⁹ [degrees of freedom = 1] for protein-coding genes and P < 10⁻¹⁹ [degrees of freedom = 1] for non-protein-coding genes, for a multiplication by 1.7 of the exonization level within the mouse genome).
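The exonization level and the coverage normalization described above reduce to simple ratios. The sketch below recomputes them from the counts quoted in the text; the variable and function names are illustrative assumptions, and the code reproduces only the percentages and the 1.7-fold scaling, not the significance tests reported by the authors.

```python
# Counts quoted in the text, stored as (exonized TEs, intronic TEs).
COUNTS = {
    ("human", "protein-coding"): (1_824, 1_452_916),
    ("human", "non-protein-coding"): (1_653, 896_471),
    ("mouse", "protein-coding"): (506, 888_768),
    ("mouse", "non-protein-coding"): (722, 942_164),
}

# Human/mouse EST/cDNA coverage ratio quoted in the text (7.9M vs 4.7M transcripts).
COVERAGE_RATIO = 1.7


def exonization_level(exonized, intronic):
    """LE: percentage of intronic TEs that were exonized."""
    return 100.0 * exonized / intronic


levels = {key: exonization_level(*counts) for key, counts in COUNTS.items()}

for gene_class in ("protein-coding", "non-protein-coding"):
    human = levels[("human", gene_class)]
    mouse = levels[("mouse", gene_class)]
    # Scale the mouse level up to compensate for its lower transcript coverage.
    mouse_scaled = mouse * COVERAGE_RATIO
    print(
        f"{gene_class}: human LE = {human:.3f}%, mouse LE = {mouse:.3f}%, "
        f"mouse LE x {COVERAGE_RATIO} = {mouse_scaled:.3f}%"
    )
```

Even after the mouse values are scaled by 1.7, the human exonization level remains the higher of the two in both gene classes, in line with the comparison made in the text.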
When the dataset was further reduced to exons for which at least two ESTs/cDNAs confirmed the exonization, we again observed a higher exonization level within the human genome: 0.05% exonization in human in both protein-coding and non-protein-coding genes, versus 0.03% and 0.02% in mouse protein-coding and non-protein-coding genes, respectively (χ²; P < 10⁻¹⁶ [degrees of freedom = 1] for protein-coding genes and P < 10⁻²² [degrees of freedom = 1] for non-protein-coding genes; see Additional data file 1). The importance of long non-protein-coding RNA was recently demonstrated in human transcripts [29]. We therefore present an example of an exonization within a non-protein-coding gene (Additional data file 5). The fact that more than 50% of our data are supported by only one item of EST/cDNA evidence raises questions regarding the fidelity of the spliceosome (see Discussion, below).
Several TE families are present in both the human and mouse genomes, including MIR, L1, L2, CR1 (L3), LTR, and DNA repeats; thus, we might expect a substantial number of orthologous TE exons (exonization of the same TE in the orthologous human and mouse genes) in these families. However, only six TE exons were found to be orthologous, of which four are exonizations of MIR elements and two are exonizations of DNA repeats. It is doubtful that these represent independent insertion events in the two lineages, because MIR and DNA repeats were active in common ancestors of all mammals, and because independent insertion into precisely the same locus is very rare. We therefore suggest that these MIR and DNA repeats were inserted in a common mammalian ancestor. These exons could either result from independent exaptation in the separated lineages or occur as a result of one exaptation event in the human-mouse common ancestor.
Do all TEs have the same exonization potential? That is, do all intronic TEs exhibit the same probability of acquiring mutations that subsequently lead the splicing machinery to select them as internal exons? Our analysis reveals that the majority of TE families exhibit similar exonization capabilities, at around 0.07% in both human and mouse (meaning that 0.07% of the intronic TEs exonized). Statistical analysis indicated that there was no difference in the level of exonization of MIR, L1, L2, CR1, and DNA repeats within the human genome (χ² = 5.25; P = 0.26 [degrees of freedom = 4]), although LTR exonization in human was higher than that of other SINEs, LINEs, and DNA repeats, but still substantially lower than that of Alu. Similarly, there were no differences in exonization level between B1, B2, B4, ID, MIR, L1, L2, and CR1 within the mouse genome (χ² = 10; P = 0.18 [degrees of freedom = 7]), and LTR and DNA repeats exhibited a slightly lower level of exonization in mouse. An exceptional case was the Alu exonization level, which was almost three times higher than that of all other TE families, with more than 0.2% of its intronic copies being exonized (all χ² test values are listed in Additional data file 2). In addition, no differences were found in exonization level between human and mouse for MIR, L2, and CR1. Interestingly, L1 exonization levels were higher in human than in mouse, and there was also a higher exonization level of LTR and DNA repeats in human compared with mouse. However, the L1 populations were different between the human and mouse genomes (Additional data file 7), and the LTR and DNA populations were very heterogeneous. The mouse LTR population was rich in the younger retroviral class II (ERVK), in which almost no exonization was detected.
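For readers who wish to check how a χ² statistic maps onto a P value, the following is a minimal sketch using only the Python standard library. It relies on the standard closed-form expressions for the upper tail of the χ² distribution with integer degrees of freedom; the function name is an illustrative choice, and this is not the software used in the original analysis.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable with integer df >= 1."""
    if x < 0 or df < 1:
        raise ValueError("x must be non-negative and df must be a positive integer")
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) times a finite sum of (x/2)^i / i!
        terms = sum(half ** i / math.factorial(i) for i in range(df // 2))
        return math.exp(-half) * terms
    # Odd df: complementary error function plus a finite sum of half-integer powers
    terms = sum(
        half ** (i - 0.5) / math.gamma(i + 0.5)
        for i in range(1, (df - 1) // 2 + 1)
    )
    return math.erfc(math.sqrt(half)) + math.exp(-half) * terms


# Human comparison quoted in the text: chi-square = 5.25 with 4 degrees of freedom.
p_human = chi2_sf(5.25, 4)
print(f"P = {p_human:.2f}")
```

For the human comparison, the statistic of 5.25 with four degrees of freedom gives a P value of about 0.26, matching the value quoted in the text.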
In summary, these findings indicate that the Alu sequence is a better substrate for the exonization process, as compared with all other TE families. The higher level of exonization for Alu could be due to many 'unproductive' Alu exonizations, which were 'weeded out' in older exonizations. However, our comparison of TE families that were inserted into the genome at around the same time as Alu (L1 in human and B1, B2, and B4 in mouse) and which exhibited a much lower level of exonization than that of Alu probably indicates that Alu is a much better sequence for the exonization process than the others.
Do transposed element exonizations have tissue specificity and cancer characteristics?
To examine TE exons that may be spliced differently among tissues, we used a previously developed bioinformatics approach for identifying tissue-specific exons [30]. We found 74 exons in human and 18 exons in mouse that putatively undergo tissue-specific splicing. In human, 41 exons belong to Alu, seven are MIR exons, seven are L1 exons, two are L2 exons, one is a CR1 exon, ten are LTR exons, and seven are DNA exons. In mouse, five are B1 exons, four are MIR exons, one is a B4 exon, one is an L1 exon, one is an L2 exon, and six are LTR exons. (All of these exons are listed in Additional data file 13; the SINE, LINE, LTR, and DNA exons with a tissue specificity score above 95 are listed in Additional data file 10 [parts B and C].)
A bioinformatics approach to identifying exons that changed their splicing regulation in cancer is described by Xu and Lee [31]. We used this approach to analyze our data. We identified 36 such exons in human and 10 in mouse (listed in Additional data file 13). We further filtered our data to search for exons that were intronic within normal tissues and recognized as exons only within cancerous tissues, and hence can serve as potential markers for cancer diagnostics. Six such exons were found in six different genes (ACAD9, YY1AP, KUB3, AMPK, NEL-like 1, and active BCR-related gene), and all of them were primate-specific Alu exons (Additional data file 10 [part A]). All exons were found within the coding sequence (CDS): in YY1AP, NEL-like 1, and active BCR-related gene they introduce a stop codon, whereas in ACAD9 and KUB3 they cause frame shifts. Only the Alu exon in AMPK did not have a deleterious effect on the protein (it did not introduce a stop codon or cause a frame shift), and it was not found to introduce a known protein domain. With the exception of the NEL-like 1 gene, in which the isoform skipping the Alu exon (that is, the ancestral isoform) could not be detected within cancerous tissues, the ancestral isoform was present within the cancerous tissue in all genes, so that the exonization probably leads only to a reduction in the concentration of the ancestral isoform. In one of these genes, namely ACAD9, we experimentally observed exonization in two ovarian cancer cell lines, but not in mRNA extracted from seven nonovarian cell lines (Additional data file 12).
Can we detect exonized transposed elements that are not alternatively spliced?
The 1,824 human and 506 mouse TE exons can affect the transcriptomes in many different ways. In our data, 94% of the exonizations in human and 88% of the exonizations in mouse generated an internal cassette exon (Figure ; as was also reported elsewhere [3-5]). In the rest of the cases, the exonization formed alternative 5' splice sites (5'ss), alternative 3' splice sites (3'ss), or constitutively spliced exons. The numbers of the different splice forms of the TE exons in human and mouse are shown in Figure . In the majority of cases, the alternative 5'ss or 3'ss is generated when an exon is alternatively elongated as a result of an alternative 5'ss or 3'ss selection within the TE (Figure and , respectively). Also, in 3.1% and 5.7% of the human and mouse TE exonizations, respectively, the exons are detected in silico as constitutively spliced. In most of these cases (71%) the constitutively spliced exons were found in the untranslated region (UTR), and in 12.2% of the cases the constitutively spliced exon resided within the CDS and was 'divisible by 3' (preserving the reading frame; also termed symmetrical). In the rest of the cases, in which the exonization is within the CDS and is not 'divisible by 3', the gene encodes a hypothetical protein.
Exon 2 of the DMWD gene originated from exonization of a MIR element. This exon is highly conserved within the mammalian class. Figure shows the alignment of the exon among the human, chimpanzee, rhesus, mouse, rat, dog, and cow orthologs. The divergence of that exon relative to the consensus MIR sequence is high (about 25%). However, following exonization the exon is highly conserved among the species. This implies that once the exon has undergone exaptation and acquired a function, purifying selection prevents the accumulation of mutations. The high level of protein conservation (Figure ) suggests that exaptation occurred before the human, mouse, rat, dog, and cow lineages split.
Of the four orthologous MIR exons, two were selected for experimental validation. One was selected to show the conserved alternative splicing pattern between human and mouse, and the other to show the conserved constitutive splicing pattern between human and mouse. In addition, an Alu exon was chosen randomly from all constitutively spliced Alu exons found in our analysis. Figure shows the validation of the splicing pattern of these three exons. The first exon originates from MIR, is conserved between human and mouse, and is alternatively spliced in both species (exon 2 of the DMWD gene; Figure , lanes 1 and 2); the second also originates from MIR and is conserved between human and mouse, but it is constitutively spliced (exon 5 of the MYT1L gene; Figure , lanes 3 and 4); and the third is an Alu exon, which is constitutively spliced (exon 3 of the FAM55C gene; Figure , lane 5). This reverse transcription polymerase chain reaction (RT-PCR) analysis confirms that, under the above conditions and within the examined tissues, we can detect only one isoform, which contains the exonization. This observation cannot exclude the possibility that these exons are alternatively spliced within other tissues or under different conditions.
Transposed element insertion into last and first exons of the untranslated region
Furthermore, our analysis shows that the influence of TEs on the transcriptome is not limited to the creation of new internal exons from intronic TEs (exonization); TEs can also modify the mRNA by being inserted within the first or last exon of a gene. The insertion causes an elongation of the first/last exons, which are usually part of the UTR, or an activation of an alternative intron (termed intronization; Figure , respectively). The analysis of the number of TE insertions within the first or last exon in human and mouse was done on UCSC annotated genes for which a consensus mRNA sequence exists. We searched for TE insertions within the first and last exons of 19,480 human and 16,776 mouse genes that are listed as known genes in the UCSC genome browser. In human annotated genes, the average lengths of the first and last exons are 464.6 base pairs (bp) and 1,300 bp, respectively. In mouse genes, the first exon has an average length of 392.7 bp and the last exon an average length of 1,189 bp. Our analysis revealed that 3,686 TEs were inserted within the first exon and 10,541 TEs within the last exon in the human transcriptome. In the mouse transcriptome, 1,932 and 7,847 TEs were inserted into the first and last exons, respectively (Figure ). On average, the human transcriptome is significantly enriched with TEs: 3.5% and 7.6% of the first and last exons in human coding genes contain TE insertions, as compared with 0.4% and 1.7% of first and last exons in mouse coding genes (Mann-Whitney; first exon P = 0 and last exon P = 0). About one-third of all TE insertions within the human first and last exons belong to Alu (35.3%), although Alu elements comprise only 27.9% of TEs within the human genome (χ²; P < 10⁻⁹ [degrees of freedom = 1]). When normalizing for the differences in length of the first and last exons, there is no bias for TE insertion within either the first or the last exon of the gene.
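The length normalization mentioned above can be illustrated with a rough, back-of-the-envelope calculation. The sketch below divides the number of TE insertions by the total exonic length, approximated as the number of genes multiplied by the mean exon length; this approximation and all names in the code are assumptions made for illustration and do not reproduce the authors' actual normalization procedure.

```python
# Values quoted in the text.
GENES = {"human": 19_480, "mouse": 16_776}
MEAN_EXON_LENGTH_BP = {
    ("human", "first"): 464.6,
    ("human", "last"): 1300.0,
    ("mouse", "first"): 392.7,
    ("mouse", "last"): 1189.0,
}
TE_INSERTIONS = {
    ("human", "first"): 3_686,
    ("human", "last"): 10_541,
    ("mouse", "first"): 1_932,
    ("mouse", "last"): 7_847,
}


def insertions_per_kb(species, position):
    """TE insertions per kilobase of exon, assuming total length = genes x mean length."""
    total_bp = GENES[species] * MEAN_EXON_LENGTH_BP[(species, position)]
    return 1000.0 * TE_INSERTIONS[(species, position)] / total_bp


human_first = insertions_per_kb("human", "first")
human_last = insertions_per_kb("human", "last")
print(f"human first exon: {human_first:.3f} insertions/kb")
print(f"human last exon:  {human_last:.3f} insertions/kb")
```

Under this approximation the human first and last exons carry nearly the same density of insertions (roughly 0.41 and 0.42 per kilobase), even though the raw counts differ almost threefold.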
Alu element insertion generates new introns
We found four cases in which the insertion of the Alu element into the last exon of the gene was involved in the activation of an alternative intron (called intron retention) within the 3'-UTR of the gene (primate-specific intron gain events). Here, new splice sites were introduced within the last exon of the gene. These events occurred within the SS18L1, PDZD7, C14orf111, and CWF19L1 genes (illustrated in Figure ).
In the SS18L1 gene, in which the Alu was inserted in the sense orientation, three mutations within the Alu sequence activated a 5'ss, whereas the 3'ss and the polypyrimidine tract (PPT) were contributed by the conserved area of the exon. In the CWF19L1 gene, the last exon is conserved within the mammalian class. Two Alus were inserted into that exon, one in the sense orientation and the other in the antisense orientation, and the 5'ss and 3'ss were contributed by the antisense AluJo and by the sense AluSx, respectively (shown in Figure ). Examination of the splicing pattern of this exon in human and mouse by RT-PCR revealed that the exon is constitutively spliced in mouse (Figure , lane 3). However, in human, the same analysis on normal kidney tissue detected two RNA products: an intron retention isoform (upper PCR products; Figure , lanes 1 and 2) and a spliced product using the 3' and 5' splice sites within the Alus (Figure , lane 1, lower PCR product). See Figure for a graphical illustration of these splice sites and Figure for their location along the exonic sequence. The spliced intron is flanked by a canonical 5'ss of the 'GC' type and a noncanonical 3'ss of 'tg' instead of 'ag' (see Figure ). The identity of these splice sites was confirmed by sequencing and was also supported by 12 cDNAs/ESTs, indicating that the same noncanonical splice site is used in all cases (for the list of these cDNAs/ESTs, see Additional data file 8). We currently cannot explain how the splicing machinery selects a noncanonical splice site, although it was shown previously that a 'tg' splice site can serve as a functional 3'ss [32,33]. Additionally, it may be related to RNA editing, because of the formation of dsRNA between the sense and antisense Alu (see, for example, the report by Lev-Maor and coworkers [16]). This hypothesis is supported by the detection of potential deviations between the genomic sequence and some of the cDNAs in the flanking exonic sequences. However, further analysis is needed to understand this phenomenon fully.
With regard to the last two genes exhibiting intronization, C14orf111 and PDZD7, the last exon is not conserved within mammals. In the C14orf111 gene the last exon comprises an L1, three Alu elements, and an LTR insertion. The retained intron is spliced using a 3'ss and a 5'ss that are found within the Alu sequences (GenBank accessions BC08600 and BX248271 confirm the splicing of the intron, and BX647810 confirms the unspliced intron). In the PDZD7 gene there were two Alu insertions. Both the 3'ss and the 5'ss are found within the Alu sequence (GenBank accession BC029054 confirms the splicing of the intron and AK026862 confirms the unspliced intron). All of these cases are within the last exon of the gene, within the 3'-UTR. The intronizations generate an alternative intron; that is, both the Alu-containing (unspliced) and the spliced forms are present in the mRNA.
Short interspersed nuclear elements tend to exonize in the antisense orientation
Our dataset shows that Alu and MIR have a statistically significant bias toward exonization in their antisense orientation, relative to the direction of the mRNA, in the human transcriptome. Additionally, B1, MIR, B2, and B4 are biased toward antisense exonization in the mouse transcriptome (see Tables 3 and 4, columns 2 and 3). We attribute this phenomenon to the fact that, in most cases, SINE elements contain a polyA tail at the end of their sequence. In the antisense direction, this polyA becomes a polypyrimidine tract that facilitates exonization [4,5]. LINEs and DNA repeats in both human and mouse do not exhibit a preferential exonization orientation (the greater number of L1 exonizations in the antisense orientation is caused by the biased insertion of L1 in the antisense direction within introns, and not by a preferential exonization in the antisense orientation). LTRs exhibit a biased exonization in their sense orientation in both human and mouse (for χ² test P values, see Additional data file 3).
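The link between a polyA tail and a polypyrimidine tract follows directly from reverse complementation, as the short sketch below illustrates. The sequence used is a hypothetical toy fragment with an artificial 15-nucleotide polyA tail, not a real Alu or SINE consensus, and the function names are illustrative.

```python
# Translation table for complementing DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def longest_pyrimidine_run(seq):
    """Length of the longest uninterrupted run of C/T bases."""
    best = current = 0
    for base in seq.upper():
        current = current + 1 if base in "CT" else 0
        best = max(best, current)
    return best


# Hypothetical toy SINE-like fragment: an arbitrary body followed by a polyA tail.
toy_sine = "GGCCGGGCGCGGTGGCTCACG" + "A" * 15

# What the spliceosome would "see" if the element sits antisense to the mRNA.
antisense = reverse_complement(toy_sine)

print("sense    :", toy_sine)
print("antisense:", antisense)
print("longest pyrimidine run (sense)    :", longest_pyrimidine_run(toy_sine))
print("longest pyrimidine run (antisense):", longest_pyrimidine_run(antisense))
```

In the antisense orientation the polyA tail is read as a run of T residues at the start of the element, where it can act as the polypyrimidine tract upstream of a 3'ss.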
| Table 3. Architecture of the newly recruited exons in the human genome |
| Table 4. Architecture of the newly recruited exons in the mouse genome |
Alu, L1, and long terminal repeat have the highest capability to contribute a whole exon
An exonization can occur if the TE contributes only a 5'ss or only a 3'ss to the exon, or if both the 5'ss and the 3'ss are intrinsic to the TE (entire exon). We divided our TE exon dataset into three groups: those that contributed a whole exon, those that contributed only a 5'ss, and those that contributed only a 3'ss (Tables 3 and 4, columns 4 to 6, respectively). In 66% of exonized Alu and LTR and 68% of exonized L1 elements in the human transcriptome, the whole exon is contributed by the TE. In the mouse transcriptome, 75% of exonized L1 and 67% of exonized LTR are entire exons. In contrast, all other TE exonizations contribute a complete exon in approximately 40% of the cases, rates that are significantly lower than those for Alu, L1, and LTR (χ²; P < 10⁻³ [degrees of freedom = 6] for human and P = 0.05 [degrees of freedom = 5] for mouse). The reason for the high level of Alu exonization is the low number of mutations needed to activate potent splice sites [4,5], as well as the presence of enhancers and silencers that were previously reported to reside within the Alu consensus sequence [34]. This observation suggests that Alu, L1, and LTR TEs have greater potential to be recognized by the spliceosome machinery, and that many copies of these TEs probably serve as 'pseudo-exons' (intronic Alu sequences containing a putative 5'ss and a polypyrimidine tract-3'ss that are one mutation away from exonization) within introns of protein-coding genes [4,5].
Do transposed element exonizations enter with the same probability in all parts of the mRNA?
A new exon resulting from TE exonization can reside within either the CDS or the UTR. When inserted within the UTR, the exon can introduce an alternative start-of-coding sequence or it can enlarge the UTR. The numbers of exonizations within the different parts of the mRNA for different TEs are summarized in Tables 5 and 6 (columns 2 to 4) for human and mouse data, respectively. More than 32% of all exonized TEs in both human and mouse are inserted within the UTRs. To check whether exonization has a bias toward insertion in the UTR or the CDS, we estimated the fraction of the UTR and CDS within human and mouse genes, based on the annotations of 19,480 and 16,776 human and mouse genes, respectively (see Materials and methods, below). In human, the average gene length is 59,186 nucleotides, of which 79% and 21% are CDS and UTR sequences, respectively. In mouse, the average gene length is 49,101 nucleotides, of which 73% and 27% are CDS and UTR sequences, respectively. Our results revealed a statistically significant bias for exonization of new TE exons in the UTR, as compared with CDS regions, for Alu, MIR, L1, and L2 in human and for B1, MIR, B2, B4, L1, and L2 in mouse (for χ² test P values, see Additional data file 4). This UTR bias is probably related to selection against exonization in the CDS.
| Table 5. The effects of exonization on human protein coding regions |
| Table 6. The effects of exonization on mouse protein coding regions |
How many transposed element exons potentially contribute to proteome diversity?
The majority of exonizations in our dataset inserted an in-frame stop codon within the CDS (in 61% to 84% of the cases); in 9% to 24% of the cases they caused a shift in the reading frame. Therefore, between 81% and 93% of the exons were potentially harmful because they produced a truncated protein. Only a small fraction, between 7% and 19%, neither possessed an in-frame stop codon nor generated a frame shift, and thus potentially contributed a new function to an existing protein (Tables 5 and 6, columns 5 to 7). When these exons were searched against PROSITE [35,36], 54 out of the 93 Alu exons, three out of seven MIR exons, six out of 16 L1 exons, five out of eight L2 exons, and none of 17 LTR and DNA exons were found to add a new protein domain (Additional data file 9). Overall, 68 exons out of 141 (48%) exhibited a hit against a domain in PROSITE [36], reducing the number of domain-contributing TE exonizations to 4.3% (in mouse, only one hit against a domain in PROSITE was found). Thus, our results show that a small fraction of relatively young exonized TEs has the potential to contribute to protein functionality. However, we cannot rule out the possibility that the TE exons that do not add a new protein domain also contribute to proteome complexity by inserting into an existing protein domain. Such is the case for exon 8 of ADAR2, which is an Alu exonization; this Alu exon, which was inserted into the deaminase domain, creates a twofold difference in the specific editing activity of the gene product [12].
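The three outcomes distinguished above (in-frame stop codon, frame shift, or frame-preserving insertion) can be expressed as a simple classification rule. The following is a minimal sketch under stated assumptions: the function name, the `phase` parameter, and the choice to test for a stop codon before testing for a frame shift are illustrative, codons that span the exon boundaries are ignored, and the toy sequences are hypothetical rather than taken from the dataset.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def classify_cds_exon(exon_seq, phase=0):
    """Classify the effect of a new exon inserted within a CDS.

    phase is the number of leading bases (0, 1, or 2) that complete a codon
    started in the upstream exon; those bases are skipped when codons are read.
    Codons spanning the exon boundaries are ignored in this sketch.
    """
    seq = exon_seq.upper()
    # Read complete codons in the incoming reading frame.
    codons = [seq[i:i + 3] for i in range(phase, len(seq) - 2, 3)]
    if any(codon in STOP_CODONS for codon in codons):
        return "in-frame stop"
    if len(seq) % 3 != 0:
        return "frame shift"
    return "frame-preserving"


# Hypothetical toy exons, for illustration only.
print(classify_cds_exon("ATGGCCTAAGGA"))  # contains TAA in frame
print(classify_cds_exon("ATGGCCGG"))      # length not divisible by 3
print(classify_cds_exon("ATGGCCGGA"))     # symmetrical, no stop codon
```

Only exons falling into the last category would be candidates for the PROSITE domain search described in the text.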
Do new exons resulting from transposed element exonizations differ in their characteristics from conserved alternatively spliced cassette exons?
We next examined the characteristics of these new exons resulting from TE exonization. Conserved alternatively spliced exons are under selective forces different from those in constitutively spliced exons [11,37]. These exons contain weaker 5'ss (ΔG), are shorter than constitutively spliced exons, and have a high inclusion level with respect to the new Alu exons [3]. Therefore, we examined these characteristics among the different exons that originated from TEs and compared the findings with those for 596 and 44,732 alternatively skipped exons and constitutive exons conserved between human and mouse, respectively.
The TE exons have a low inclusion level, with an average of 19.17 ± 26.2% in human and 26.51 ± 31.8% in mouse, the inclusion level of human TE exons being significantly lower (Mann-Whitney; P < 10⁻⁶; Figure ). Both values are significantly lower than the 64.39 ± 31.1% of conserved alternatively spliced exons (Mann-Whitney; P = 0 and P < 10⁻⁶⁶, respectively). The TE exons are, on average, 143.4 ± 118.4 bp long in human and 133.6 ± 75.1 bp in mouse (Figure ). They are therefore significantly longer than conserved alternative exons, in which the average length is 97.7 ± 56.7 nucleotides (Mann-Whitney; P < 10⁻³⁸ and P = 0, respectively), and similar to conserved constitutive exons, in which the average is 132.4 ± 49.9 nucleotides (Mann-Whitney; P = 0.7 and P = 0.8, respectively). In addition, the TE exons have a very weak 5'ss relative to alternatively spliced exons: their average U1/5'ss strength (ΔG) is -4.87 ± 2.26 kcal/mol for human exons and -4.88 ± 2.26 kcal/mol for mouse exons (Figure ). This is significantly weaker than in the conserved alternative exons, whose ΔG is -5.62 ± 1.9 kcal/mol (Mann-Whitney; P < 10⁻⁹ and P < 10⁻⁶, respectively). Conserved alternatively spliced exons have already been shown to have a significantly weaker U1/5'ss strength than constitutively spliced exons [11]. In humans, the TE exons that originated from Alu have the lowest inclusion level and the weakest 5'ss, and are the shortest, as compared with all other exonized TEs. In addition, exonized Alus have low divergence from the consensus sequence, meaning that not many mutations are needed for their exonization. In contrast, the MIR exons have the strongest 5'ss and the highest inclusion level in both human and mouse, and these exons are also the most diverged among SINE exons (with respect to the consensus sequence) in both human and mouse (Figure ). The high inclusion level could be explained by the fact that the MIR element has one major 5'ss (Figure ) that contains an almost canonical 5'ss sequence (CTA/gtaagt). This is also consistent with the finding that the MIR exons contribute the highest level of constitutive exonization in both human and mouse (Figure ) and have a relatively high inclusion level (Figure ).