Transposable elements (TEs) are major sources of new exons in higher eukaryotes. Almost half of the human genome is derived from TEs, and many types of TEs have the potential to exonize. In this work, we conducted a large-scale analysis of human exons derived from mammalian-wide interspersed repeats (MIRs), a class of old TEs which was active prior to the radiation of placental mammals. Using exon array data of 328 MIR-derived exons and RT–PCR analysis of 39 exons in 10 tissues, we identified 15 constitutively spliced MIR exons, and 15 MIR exons with tissue-specific shift in splicing patterns. Analysis of RNAs from multiple species suggests that the splicing events of many strongly included MIR exons have been established before the divergence of primates and rodents, while a small percentage result from recent exonization during primate evolution. Interestingly, exon array data suggest substantially higher splicing activities of MIR exons when compared with exons derived from Alu elements, a class of primate-specific retrotransposons. This appears to be a universal difference between exons derived from young and old TEs, as it is also observed when comparing Alu exons to exons derived from LINE1 and LINE2, two other groups of old TEs. Together, this study significantly expands current knowledge about exonization of TEs. Our data imply that with sufficient evolutionary time, numerous new exons could evolve beyond the evolutionary intermediate state and contribute functional novelties to modern mammalian genomes.
One of the most intriguing questions in evolutionary biology is the origin of evolutionary novelties. Evolution can create new gene functions via different processes, many of which have been investigated in detail [e.g. the creation of new genes (1), gene duplication (2–4), divergence of protein-coding sequences (5) and evolution of transcriptional regulation (6,7)]. In recent years, comparative analyses of exon–intron structures of orthologous genes from multiple species have revealed frequent creation of new exons during the evolution of higher eukaryotes [reviewed by (8,9)]. A variety of molecular mechanisms, such as exaptation of transposable elements (TE; 10,11), exon duplication (12,13) and de novo exonization from intronic regions (14), can add new exons to evolutionarily ancient genes. Lee and colleagues (14) analyzed the multiple alignment of 17 vertebrate genomes and identified thousands of human exons that were created during primate evolution. Other studies also reported high incidences of exon creation events in primate and rodent genomes (10,15). The vast majority of newly created exons are alternatively spliced with low transcript inclusion levels in expressed sequence tag (EST) sequences, suggesting that they are non-functional evolutionary intermediates (8,9). Nevertheless, certain new exons may have acquired functions during evolution. A well-known example is the new exon (exon 8) of human gene ADAR2 (adenosine deaminase, RNA-specific, B1). This exon was derived from a primate-specific Alu retrotransposon. It inserts a 40-amino acid peptide segment into the catalytic domain of ADAR2, altering the enzymatic activity of the protein product (16). However, despite a growing list of anecdotal reports for regulatory and functional roles of new exons (9), genome-wide analyses of new exons were mostly based on EST data (10,11,14,15,17) and few identified exonization events have been subjected to detailed experimental characterization. 
Thus, the evolution and impact of new exons remain poorly understood.
TEs are major sources of new exons in higher eukaryotes (9). Almost half of the human genome is derived from TEs (18), and many types of TEs have the potential to exonize (11). For example, exonization of Alu elements has been studied extensively (19–21). Alu is a primate-specific TE that belongs to the SINE (Short Interspersed Nuclear Element) family (22). It was created from the fusion of two 7SL RNA Alu monomers approximately 60 million years ago. It is the most abundant class of TEs in the human genome, with over one million copies occupying approximately 10% of the genomic DNA (18). As Alu contains several sites that resemble consensus splice site signals, intronic Alus have the potential to be recruited into transcripts of their host genes as new exons (19). Prior EST-based studies indicate that all Alu exons are alternatively spliced, and most of them have low transcript inclusion levels (20). We recently performed a large-scale analysis of Alu exons, using high-density exon array data of 330 exons and detailed RT–PCR analysis of 38 exons in 10 human tissues (23). Our study revealed surprisingly diverse splicing patterns of Alu exons in human tissues. We identified a limited number of Alu exons that were constitutively spliced in a broad range of tissues, as well as Alu exons that were spliced in a tissue-specific manner, suggesting that certain Alu exons may have acquired functional and regulatory roles during primate evolution. Strikingly, of the 26 Alu exons with high transcript inclusion levels according to RT–PCR data, there was 4-fold enrichment in exons derived from AluJ, the most ancient Alu subfamily in the human genome (23). This observation suggests that evolutionary time may be a major factor in the functional establishment of new exons. 
In fact, in one gene (p75TNFR) for which an Alu exonization event was characterized in detail across the primate phylogeny, a series of nucleotide changes occurred during a period of approximately 33 million years between the initial Alu insertion event and the emergence of a functional alternative exon (24).
In this work, we analyzed human exons derived from a much older class of TEs, mammalian-wide interspersed repeats (MIRs). MIRs were actively transposed prior to the radiation of placental mammals (approximately 130 million years ago) (25). The human genome has approximately 368 000 copies of MIRs, some of which overlap with exons of protein-coding genes (11,18,26). Previous studies generated contradictory results on the exonization level of MIR elements in the human genome. Krull and colleagues (26,27) conducted phylogenetic analyses of five MIR exons and four Alu exons using DNAs and RNAs from various mammalian and primate species. Their results suggest that the splicing signals of Alu exons acquired in ancestral primate genomes are often lost in descendent primate lineages (27). In contrast, the splicing signals of MIR exons tend to be conserved over a long period of evolutionary time, suggesting stable exonization and potential acquisition of functional properties by MIR exons (26). However, since the studies of Krull et al. focused on a small number of exons, it is unclear whether their observations can be generalized to the entire classes of Alu- and MIR-derived exons. On the other hand, in a genome-wide analysis of exonized TEs in the human genome, Sela and colleagues (11) found that exonization of intronic Alus was far more frequent than exonization of intronic MIRs. One thousand and sixty intronic Alus (0.2% of all intronic Alus) overlap with human exons, when compared with 181 intronic MIRs (0.08% of all intronic MIRs). However, the analysis of Sela et al. was based on exon–intron structures inferred from sequence data (i.e. mRNAs and ESTs). It was difficult to distinguish functional exonization events from exons that were incorporated into transcripts owing to rare errors of the splicing machinery (28,29). 
To better understand the evolutionary and functional significance of exonized MIR elements, in the present study, we performed a large-scale analysis of MIR-derived exons in primate genomes. Here, we report our results from exon array analysis of 328 MIR exons, RT–PCR analysis of 39 MIR exons in 10 human tissues, and phylogenetic analysis of nine MIR exons in RNAs from primates and rodents.
We collected a list of 328 MIR-derived exons in the human genome, using annotations from the UCSC Genome Browser database (30) and Affymetrix human Exon 1.0 array (see details in Materials and Methods). For the purpose of comparison, we also collected 330 Alu-derived exons [the list of Alu exons analyzed in (23)], as well as 13 103 constitutive exons in the human genome.
We compared the splicing signals of MIR exons, Alu exons and constitutive exons. For each exon, we scored its 5′ and 3′ splice sites using consensus splice site models in MAXENT (31). The average 5′ splice site score of MIR-derived exons was 7.71, lower than that of constitutive exons (8.49; P = 0.001, Wilcoxon test) but significantly higher than that of Alu-derived exons (6.78; P = 4.7e−7; Fig. 1A). The same trend was observed for 3′ splice sites: the average 3′ splice site scores of MIR exons, Alu exons and constitutive exons were 7.82, 6.43 and 8.70, respectively (Fig. 1B). This trend is consistent with the observation of Sela et al. (11) on an independent data set of TE-derived exons. We also calculated the density of exonic splicing enhancers (ESEs) in these three classes of exons. ESEs are short RNA sequence motifs that promote exon recognition during splicing (32). Using a set of 238 ESE hexamers from Fairbrother et al. (33), we calculated the average ESE density in each class of exons as the proportion of total exon length covered by ESEs. The average ESE density was 0.32 in MIR exons, which was similar to that of constitutive exons (0.34, P = 0.15) and significantly higher than that of Alu exons (0.16, P < 2.2e−16; Fig. 1C).
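The ESE density described above (the proportion of exon length covered by ESE hexamers) can be illustrated with a minimal Python sketch. This is not the code used in the study: the function name `ese_density` is hypothetical, the two hexamers are placeholders rather than members of the 238-hexamer set (33), and computing the density per exon (rather than pooling all exons of a class) is an assumption.

```python
def ese_density(exon_seq, ese_hexamers):
    """Fraction of exon positions covered by at least one ESE hexamer.

    Overlapping hexamer matches are counted once per covered position.
    """
    seq = exon_seq.upper()
    if not seq:
        return 0.0
    covered = [False] * len(seq)
    # Slide a 6-nt window along the exon and mark every matching window.
    for i in range(len(seq) - 5):
        if seq[i:i + 6] in ese_hexamers:
            for j in range(i, i + 6):
                covered[j] = True
    return sum(covered) / len(seq)


# Placeholder hexamers for illustration only (NOT the published ESE set).
toy_hexamers = {"GAAGAA", "AAGAAG"}
toy_exon = "TTGAAGAAGTT"  # hypothetical 11-nt sequence
print(ese_density(toy_exon, toy_hexamers))
```

In this toy case the two overlapping matches cover positions 3 to 9 of the 11-nt sequence, so the density is 7/11. Averaging such per-exon values over a class of exons would give a class-level density comparable in form to the values reported in Fig. 1C.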
To investigate whether the difference in splicing signals between MIR exons and Alu exons is correlated with differences in their splicing profiles in human tissues, we analyzed a public Affymetrix Exon 1.0 array data set of 11 human tissues (breast, cerebellum, heart, kidney, liver, muscle, pancreas, prostate, spleen, testes, thyroid, with three replicates per tissue) (34). The Affymetrix human Exon 1.0 array is a high-density exon array platform designed for genome-wide analysis of pre-mRNA splicing, with approximately four probes per exon for well-annotated and predicted exons in the human genome (35,36). In our previous study of Alu-derived exons, we developed an exon-to-gene correlation metric to identify ‘correlated exons’ whose exon array probe intensities correlated strongly with the estimated expression levels of their host genes across a broad range of tissues [see details in Materials and Methods and (23)]. A strong correlation between exon signals and gene expression levels suggests stable inclusion of the exon in the transcript products. In this study, using the same correlation metric we identified 50 ‘correlated exons’ from a total of 328 MIR exons (15.2%). In contrast, of the 330 Alu exons analyzed in our previous work, 20 were ‘correlated exons’ (6.1%; Fig. 2). The percentage of correlated MIR exons was significantly higher than the percentage of correlated Alu exons (15.2% versus 6.1%, P-value = 1.3e−4; two-sided Fisher exact test). It should be noted that 15.2% is expected to be an underestimate for MIR exons with stable transcript inclusion, as noise in exon array probe intensities could even obscure the exon-to-gene correlation of constitutive exons (37). Nevertheless, as we applied the same analysis to MIR exons and Alu exons, these results indicate that a significantly larger fraction of MIR exons are spliced at substantial levels in human tissues.
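The comparison of correlated-exon fractions (50 of 328 MIR exons versus 20 of 330 Alu exons) uses a two-sided Fisher exact test. The following Python sketch shows one standard way to compute such a test with the standard library only; the function name `fisher_exact_two_sided` is hypothetical, and the two-sided convention used here (summing all tables no more probable than the observed one) is an assumption about how the reported P-value was obtained.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables (with the same
    margins) that are no more probable than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    p = sum(prob(x) for x in range(lo, hi + 1)
            if prob(x) <= p_obs * (1 + 1e-7))
    return min(p, 1.0)


# Rows: MIR exons, Alu exons. Columns: correlated, not correlated.
p_value = fisher_exact_two_sided(50, 328 - 50, 20, 330 - 20)
print(f"two-sided Fisher exact P = {p_value:.2e}")
```

Under these assumptions the result should be of the same order of magnitude as the P-value reported in the text (1.3e−4).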
Exon array data provide valuable information about an exon's splicing profile, but cannot reveal its absolute transcript inclusion levels in individual tissues. For example, owing to variations in microarray probe affinity (38,39), we cannot determine whether a correlated exon is constitutively spliced, or alternatively spliced with similar transcript inclusion levels in various tissues. Microarray noise also makes it difficult to identify the tissue-specific splicing patterns of certain correlated exons.
In order to uncover the detailed splicing patterns of correlated MIR exons, we analyzed 39 exons in 10 human tissues by RT–PCR. In selecting exons for RT–PCR analysis, we skipped exons for which RT–PCR primer design was difficult, e.g. MIR exons adjacent to very short terminal exons or MIR exons with complex alternative splicing events at adjacent exons. For each exon tested by RT–PCR, we first compared the observed sizes of PCR products to the expected sizes of all potential exon inclusion and skipping isoforms. In case of any ambiguity, all potential exon inclusion and skipping PCR products were confirmed by sequencing.
Our RT–PCR analysis identified MIR exons with a broad range of splicing patterns in human tissues (see Table 1; gel pictures of all 39 MIR exons analyzed are shown in Supplementary Material, Fig. S1). Prior to our analysis, two constitutively spliced MIR exons had been reported in the literature (11,26). For example, a MIR-derived exon in MYT1L was shown to be constitutively spliced in a human neuronal cell line and mouse brain (11). In our work, we found this exon to be constitutively spliced in all tissues where its host gene was expressed (Fig. 3A). We identified a total of 15 MIR exons as constitutively spliced in all surveyed tissues. These data indicate that numerous MIR exons have acquired strong splicing signals that result in 100% transcript inclusion. It should be noted that in six of the 15 constitutive MIR exons, the MIR exon was present in multiple PCR products (see Supplementary Material, Fig. S1), owing to alternative splicing of flanking exon(s) as revealed by sequencing of PCR products. In addition to these constitutive exons, we identified three alternatively spliced MIR exons that were included in the major isoform products in all tissues (see Fig. 3B for an example of NUMB), and two exons that were included at medium levels in all tissues (see Fig. 3C for an example of TGFBR2).
We also identified 15 MIR exons showing tissue-specific shifts in their transcript inclusion levels (see Supplementary Material, Fig. S1). In three genes (TTLL6, WSCD2, IL16), the MIR exons were skipped almost completely in most tissues, but showed a modest increase in transcript inclusion levels in a restricted set of tissues. For example, TTLL6 (tubulin tyrosine ligase-like family, member 6) is an apoptosis-related gene with preferential expression in testis. Previously, analyses of TTLL6 coding sequences from representative non-human primate species and worldwide human populations indicated strong positive selection on TTLL6 during recent primate and human evolution (40). Chen and colleagues (40) suggested that TTLL6 might contribute to the adaptive evolution of human male reproduction. In our RT–PCR analysis, an MIR-derived exon in TTLL6 had a detectable exon inclusion isoform as the minor transcript product in testis (Fig. 3D). Real-time qPCR analysis of this exon using isoform-specific primers indicated an inclusion level of approximately 20% in testis RNA (data not shown). Moreover, the splice sites of this exon were probably created during recent human evolution. Alignment of the human exon to its orthologous regions from non-human primates indicated that the ‘GT’ donor splice site downstream of the human exon was created by a human-specific nucleotide substitution, which converted an ‘AT’ dinucleotide in the common ancestor of human and chimpanzee to the ‘GT’ consensus splice site in the human gene. In the future, it would be interesting to examine the potential functional significance of this human-specific MIR exonization event in TTLL6.
Several exons showed a strong switch between exon inclusion and skipping forms across various tissues. An example is the tissue-specific MIR exon in MICAL2 (microtubule-associated monooxygenase, calponin, and LIM domain containing 2), a gene involved in the regulation of cytoskeleton and repulsive neuronal guidance (41). The MIR exon is in the coding region of MICAL2 and encodes an in-frame 21-amino acid peptide within its protein product. RT–PCR analysis identified tissue-specific transcript inclusion levels: the exon was spliced as a constitutive exon in liver and testis, as the major splice form in heart, and as the minor splice form in cerebellum, muscle and thyroid (Fig. 3E). In HORMAD1 (HORMA domain containing 1), the MIR exon had comparable amounts of inclusion and skipping forms in cerebellum and spleen, while in testis, the exon inclusion form was the predominant isoform (Supplementary Material, Fig. S1). In CHRNA1 (cholinergic receptor, nicotinic, alpha 1), the MIR exon was spliced at medium levels in muscle, prostate and testis, but had a much higher level of exon skipping in cerebellum and spleen (Fig. 3F). The skipping isoform of this exon encodes a functional subunit of the acetylcholine receptor (AchR), while the exon inclusion isoform encodes a non-functional protein (42). Thus, the increased skipping of this MIR exon in cerebellum is expected to result in an increased ratio of functional to non-functional AchR subunits. Although the physiological consequence of this tissue-specific splicing event is unknown, it could be a regulated process. In fact, a recent study shows that skipping of this MIR exon is mediated by binding of the splicing regulator hnRNP H to an intronic UGGG motif upstream of the exon (43). A mutation within this UGGG motif, which significantly enhanced exon inclusion, was identified in a patient with congenital myasthenic syndrome (43).
Together, these 15 exons represent the first examples of experimentally validated tissue-specific MIR exons in the human genome. As tissue-specific regulation of alternative splicing is a strong indicator of functional alternative splicing events (44), it is reasonable to speculate that these MIR exons may be involved in the fine-tuning of mRNA levels or protein function in a tissue-specific manner.
Of the 39 correlated MIR exons that we analyzed experimentally, 12 (30.8%) were coding-region exons that did not introduce premature termination codons (PTCs; Table 1). This percentage is nearly twice that of coding-region Alu exons in our previous study (3/19, 15.8%) (23). Gotea and colleagues (45) hypothesized that exons created from old TEs are more likely to develop coding potential and contribute to the proteome, while the roles of exons created from young TEs (e.g. Alus) are mostly regulatory [e.g. regulation of steady-state mRNA levels by induction of mRNA nonsense-mediated decay; also see (46)]. Despite the lack of statistical power in our analysis owing to the small number of Alu and MIR exons, the trend observed in our data set is consistent with the hypothesis of Gotea et al.
There was a significant enrichment of MIR exons in the 5′-UTR compared with the 3′-UTR. This trend is consistent with the observation by Zhang et al. (10) in a genome-wide analysis of species-specific exons, as well as the observation in our recent study of Alu-derived exons (23). Zhang et al. (10) hypothesized that creation of new internal exons in the 3′-UTR may be strongly selected against, because of the high likelihood of inducing mRNA nonsense-mediated decay. It should also be noted that these studies focused on the creation of internal spliced exons from TEs. The impact of TEs on 3′ terminal exons was not investigated in our work. In fact, a recent study by Lee et al. (47) suggested a role of TEs in creating new polyadenylation sites at the 3′-end of mammalian genes.
Our data on MIR exons and Alu exons suggest that a larger percentage of exonized MIR elements have high transcript inclusion levels. There are two possible explanations for this observation. First, the consensus sequence of MIR elements may contain strong splicing signals, making MIR elements more prone to exonization. Second, because MIR is a much older class of TEs, MIR-derived exons have had more evolutionary time to accumulate additional splicing regulatory elements that strengthen their transcript inclusion levels.
To assess the potential contribution from putative splice sites within MIR elements, we scanned the consensus sequences of MIR, AluJo (the oldest Alu subfamily) and AluSx (the most abundant Alu subfamily in the human genome) for existence of splice site signals. The consensus sequences of MIR, AluJo and AluSx were downloaded from Repbase (48). It has been demonstrated that both MIR and Alu tend to exonize from antisense orientation (11,26). For example, of the 39 ‘correlated exons’ analyzed by RT–PCR, 24 (61.5%) were exonized from the antisense strand of MIR. Therefore, we performed sliding window scans of antisense strands of MIR, AluJo and AluSx to identify high-scoring putative 5′ and 3′ splice sites. In each window, we calculated splice site scores using consensus splice site models in MAXENT (31). We found that the antisense consensus sequence of MIR had much stronger putative splice sites when compared with AluJo and AluSx (see Supplementary Material, Fig. S2). The antisense consensus sequence of MIR had two extremely strong 5′ splice sites scored at 9.65 and 8.78, respectively. Likewise, we found a high-scoring 3′ splice site on antisense MIR with a MAXENT score of 9.95. In contrast, the potential splice sites on antisense AluJo and AluSx were much weaker. The strongest putative 5′ and 3′ splice sites were scored at 2.44 and 6.53 on antisense AluJo, and at 4.28 and 5.92 on antisense AluSx. The existence of strong putative splice sites on antisense MIR could make MIR elements more favorable substrates for exonization.
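The sliding-window scan of the antisense strand can be sketched in Python as follows. This is an illustration only, not the pipeline used in the study: the sequence is a made-up fragment rather than a Repbase consensus, all function names are hypothetical, and `toy_donor_score` is a crude placeholder that merely checks for a GT dinucleotide instead of applying the MAXENT models (31). The 9-nt window (3 exonic plus 6 intronic positions) is the usual input length for a 5′ splice site model and is treated here as an assumption.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the antisense (reverse-complement) strand of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def scan_windows(seq, width, score_fn):
    """Yield (start, window, score) for every full-width window along seq."""
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        yield start, window, score_fn(window)


def toy_donor_score(window):
    """Placeholder scorer (NOT MAXENT): 1.0 if the 9-mer has GT right
    after its first three (exonic) positions, otherwise 0.0."""
    return 1.0 if window[3:5].upper() == "GT" else 0.0


# Hypothetical sense-strand fragment; a real scan would use a consensus
# sequence and a real splice site model in place of toy_donor_score.
sense = "TTTTTTACTTT"
antisense = reverse_complement(sense)
best = max(scan_windows(antisense, 9, toy_donor_score), key=lambda hit: hit[2])
print(antisense, best)
```

Keeping the scoring function as a parameter means the same scan could be run with a 5′ model or a 3′ model simply by changing the window width and the scorer.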
However, most intronic MIRs do not exonize, and the splice sites of many exonized MIRs are not derived from high-scoring putative splice sites within the MIR consensus sequence. Twenty-four of the ‘correlated exons’ analyzed by RT–PCR were from antisense strand of MIRs, among which 15 were spliced as constitutive or major alternative splice form in at least certain tissues. For each of these 15 ‘strongly included’ exons from antisense MIR, we used the alignment of the exon sequence to the antisense consensus MIR to determine the origin of splice sites. In three constitutive exons (ST3GAL1, FXYD4, CSF3R), both 5′ and 3′ splice sites were derived from the high-scoring putative splice sites within antisense MIR. In six exons, either 5′ or 3′ splice site (but not both) was derived from the high-scoring putative splice sites. In the remaining six exons, including three constitutive exons (TYK2, MTX1, MTA2), none of the splice sites was derived from the high-scoring putative splice sites. Taken together, of the 30 splice sites in 15 ‘strongly included’ exons derived from antisense MIR, 12 (40%) were from high-scoring putative splice sites within the consensus MIR. These data imply that the strong putative splice site signals within consensus MIR are not the sole contributor to the strong splicing activities of exonized MIR elements. It is likely that the older age of MIR also plays an important role, as it provides more opportunities for newly exonized MIRs to strengthen splicing signals and/or acquire valid open reading frames through a series of subsequent evolutionary changes.
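The 40% figure follows directly from the three groups of exons listed above. As a quick check of the arithmetic (variable names are illustrative only):

```python
# (number of exons, splice sites per exon derived from high-scoring
# putative sites in the antisense MIR consensus)
groups = [
    (3, 2),  # both 5' and 3' splice sites from the consensus
    (6, 1),  # either the 5' or the 3' splice site, but not both
    (6, 0),  # neither splice site from the consensus
]

n_exons = sum(n for n, _ in groups)              # 15 exons
total_sites = 2 * n_exons                        # 30 splice sites
from_consensus = sum(n * k for n, k in groups)   # 12 splice sites
fraction = from_consensus / total_sites

print(f"{from_consensus}/{total_sites} = {fraction:.0%}")
```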
To investigate whether exons derived from other old TEs also have stronger splicing activities when compared with Alu exons, we analyzed exon array data of human exons derived from LINE1 and LINE2 elements (see Materials and Methods). Of 271 LINE1-derived exons, 38 (14.0%) were correlated with gene expression levels. Similarly, of 175 LINE2-derived exons, 25 (14.3%) were correlated with gene expression levels. These percentages were similar to the percentage of ‘correlated exons’ among MIR-derived exons (15.2%) but significantly higher than the percentage among Alu-derived exons (6.1%; Fig. 2).
Based on the RT–PCR results in 10 tissues, we divided the 39 experimentally analyzed MIR exons into two distinct sub-categories. The first category consisted of 26 ‘strongly included’ exons that were used as constitutive or major alternative splice forms in at least certain tissues. The second category consisted of 13 ‘weakly included’ exons that were always used as medium or minor alternative splice forms or had no detectable exon inclusion isoform in any tissue.
We found that the splice sites of ‘strongly included’ MIR exons were much more conserved across species than the splice sites of ‘weakly included’ MIR exons. Of the 52 splice sites in 26 ‘strongly included’ exons, 32 were conserved between human and mouse genomes according to the human–mouse pairwise genome alignment. In contrast, of the 26 splice sites in 13 ‘weakly included’ exons, only eight were conserved between human and mouse, a statistically significant difference (32/52 versus 8/26; P = 0.016, two-sided Fisher exact test). Moreover, the splice sites of these ‘strongly included’ exons were much stronger. The average 5′ splice site score of the 26 ‘strongly included’ exons was 9.16, compared with 6.55 for the 13 ‘weakly included’ exons (P = 0.002, two-sided Wilcoxon test). Similarly, the average 3′ splice site scores of ‘strongly included’ exons and ‘weakly included’ exons were 8.58 and 5.60, respectively, although the difference was not statistically significant (P = 0.12).
To directly assess the evolutionary conservation of MIR exon splicing, we selected nine MIR exons (Table 2) and analyzed the splicing patterns of their orthologous regions in fibroblast cell lines of three non-human primates (chimpanzee, rhesus macaque and marmoset) and mouse kidney. To ensure that the fibroblast cell line is representative of other tissues, in this analysis we focused on MIR exons with consistent splicing patterns across human tissues. Also, to simplify the interpretation of PCR products from non-human primates, we restricted our analysis to MIR exons flanked by constitutive exons at both 5′ and 3′ ends in human genes. If RT–PCR of fibroblast cells yielded no PCR products, we would also attempt RT–PCR of kidney RNAs of non-human primates.
In six constitutively spliced MIR exons we analyzed, the splicing pattern of the human exon was highly conserved across species (Table 2). For example, the MIR exons of five genes (SRRM2, CUL2, MTX1, MTA2, TYK2) were constitutively spliced in all species (Supplementary Material, Fig. S3). In IL6ST (interleukin 6 signal transducer), the MIR exon was constitutively spliced in human, chimpanzee, rhesus fibroblasts and mouse kidney (Fig. 4A), suggesting that the exon was constitutively spliced before the divergence of primates and rodents (approximately 90 million years ago). In marmoset, we could not design RT–PCR primers because the sequences of flanking exons were unavailable in the low-coverage marmoset genome assembly. Nevertheless, our RT–PCR analysis using primers designed for human exons detected both exon inclusion and skipping forms in marmoset RNA, with the exon inclusion form being the predominant splice form (Fig. 4A).
The other three exons we analyzed were alternatively spliced, with high or medium transcript inclusion levels in human tissues. These exons showed various patterns of evolutionary conservation of their splicing profiles. The MIR exon of NUMB had conserved alternative splicing profiles, with at least medium inclusion levels in all species we analyzed (Fig. 4B). In TGFBR2, the splicing pattern of the MIR exon was conserved in chimpanzee, marmoset and mouse, while the exon inclusion form was almost completely lost in rhesus macaque (Fig. 4C). In ARNTL, phylogenetic analysis suggested a gradual increase in the inclusion level of the MIR exon during primate evolution—the MIR exon was completely skipped in mice, included at a low level in marmosets and included at a medium level in rhesus macaques, chimpanzees and humans (Fig. 4D).
Taken together, these data suggest that for the majority of ‘strongly included’ MIR exons in the human genome, their splicing profiles in human tissues were present in the common ancestor of primates and rodents. The strong conservation of their splicing patterns in individual primate lineages implies that these exons are functionally important and that their splicing patterns have been subject to negative selection during primate evolution, as suggested by Krull and colleagues (26).
In this manuscript, we describe a large-scale analysis of MIR-derived exons in primate genomes. Our results significantly expand current knowledge about exonization of MIR elements. Prior to our study, two constitutively spliced MIR exons had been reported in the literature (11,26). In this work, we discovered and confirmed constitutive splicing of 15 MIR exons in a broad range of human tissues. We further analyzed a subset of these constitutive MIR exons in other species, and found that their constitutive splicing events were highly conserved between primates and rodents (Table 2). We also identified 15 MIR exons with tissue-specific shifts in splicing patterns. The diverse splicing profiles of exonized MIR elements suggest that these exons can influence the regulation of gene expression or protein function in many ways, either constitutively or in a tissue-specific manner. Together, these experimental results provide many candidates for detailed functional studies of TE-derived exons.
We expect that the complete set of MIR exons with functional and/or regulatory roles is considerably larger than what we have identified in this study. Some strongly included MIR exons may have been missed by our analysis owing to noise and artifacts in their exon array probe signals. It is also possible that some functional MIR exons are preferentially spliced in tissues or developmental states that were not investigated in this work.
Our phylogenetic analysis of MIR exons in DNA and RNA of primate and rodent species indicates that exonization can occur at any evolutionary time point. For the majority of constitutively spliced MIR exons, the exonization processes were likely completed prior to the divergence of primates and rodents (Table 2). In ARNTL (Fig. 4D), the MIR element was exonized during primate evolution. In TTLL6, the splice sites of a testis-specific MIR exon were not present in any non-human primates, suggesting that its splicing signal was acquired very recently during human evolution. It is possible that such newly created MIR exons are still evolving towards specified functions.
The comparison between MIR exons and Alu exons raises an interesting question of why exonized MIR elements tend to have stronger splicing signals and higher transcript inclusion levels. While several high-scoring putative splice sites are present in the consensus sequence of MIR, they cannot fully account for the difference in splicing levels of MIR and Alu exons, as only a minor fraction of splice sites of ‘strongly included’ MIR exons are derived from high-scoring putative splice sites within the consensus MIR. Moreover, splice sites alone are insufficient for proper exon recognition, as other splicing regulatory motifs (such as exonic/intronic splicing enhancers) also play important roles in pre-mRNA splicing (32). A key distinction between MIR and Alu is their evolutionary age. MIR was active prior to the radiation of placental mammals (approximately 130 million years ago), while Alu was created during primate evolution (approximately 60 million years ago). Previous phylogenetic studies of several TE-derived exons revealed that a series of independent evolutionary events were typically required for an intronic TE to evolve into a functional exon (24,27). For example, Singer and colleagues reconstructed the entire evolutionary history of the creation and functional establishment of an Alu-derived exon, which serves as the alternative first exon of tumor-necrosis factor receptor type 2 (p75TNFR) (24). Sequencing and phylogenetic analysis of orthologous DNAs of 13 primates suggests that the Alu element was inserted approximately 40–58 million years ago. Shortly after the Alu insertion, an A-to-G substitution created a start codon. However, a functional coding exon was not created until approximately 25 million years ago, when a C-to-T substitution created a splice donor site and a 7-bp deletion within the exonic region created a valid open reading frame. 
Considering the sequence changes required for turning an intronic TE into a functional exon, the older evolutionary age of MIR provides more opportunities for such sequence changes to occur, which could eventually lead to stronger splicing activities and higher protein-coding potentials of MIR-derived exons. In fact, exons derived from other classes of old TEs (such as LINE1 and LINE2) also had higher correlation with host genes' expression levels when compared with Alu-derived exons (Fig. 2). This further argues for the importance of evolutionary time in successful exonization of TEs.
Our work has important implications for understanding the evolution of new exons. The initial creation of new exons in functional genes (e.g. via exonization of TEs, or de novo exonization from intronic regions) is generally considered detrimental, as new exons could disrupt functional elements within the protein products or cause frame-shifting and/or mRNA nonsense-mediated decay (9,49). A number of studies have shown that the majority of newly created exons in mammalian genomes are alternatively spliced, and most have low transcript inclusion levels according to EST and microarray data (10,15,50,51). Based on these observations, Modrek and Lee proposed that alternative splicing plays a major role in facilitating the creation and establishment of new exons. It allows functional genes to test new spliced isoforms (often produced at low levels), while keeping the majority of the original gene products intact (51). This evolutionary model describes the initial step of exon creation (i.e. relaxation of negative selection against exon creation via alternative splicing of the new exon). However, the subsequent evolution and eventual fate of new exons are far less understood. It is likely that most new exons are evolutionary intermediates dispensable for the organism, so the weak splicing signals of new exons can be lost again. On the other hand, over time a small percentage of new exons may acquire additional mutations that lead to stronger splicing regulatory signals and acquisition of functional properties. Our large-scale analysis of MIR exons in human tissues suggests that with sufficient evolutionary time, numerous new exons could evolve beyond the evolutionary intermediate state to acquire functional or regulatory roles. The phylogenetic conservation in the splice site signal and RNA splicing pattern of ‘strongly included’ MIR exons indicates strong functional constraints on these exons during recent evolution.
These functional new exons could experience distinct modes of selection pressure over their entire evolutionary history: weak negative selection or even positive selection after the initial creation of the new exon, followed by strong negative selection against loss of its splicing pattern once functional novelties have been successfully established.
We downloaded a public Affymetrix Exon 1.0 array data set on 11 human tissues (breast, cerebellum, heart, kidney, liver, muscle, pancreas, prostate, spleen, testes, thyroid) (34), with three replicates per tissue (http://www.affymetrix.com/support/technical/sample_data/exon_array_data.affx).
We compiled a list of Exon array probesets targeting exonized MIR elements. The locations of MIR elements in the human genome were downloaded from RepeatMasker annotation of the UCSC Genome Browser database (30). The locations of internal exons (i.e. exons flanked by both 5′ and 3′ exons) in human genes were taken from the UCSC KnownGenes database (30). This database combines transcript annotations from multiple sequence databases (30). To eliminate long exonic regions likely resulting from intron retention events, we removed probesets whose probe selection regions were over 250 bp as in Lee et al. (52). We then defined an exon as MIR-derived if the MIR element covered at least 25 bp of the exon and over 50% of the total length of the exon. We collected 363 exon array probesets targeting such MIR-derived exons. Since microarray probes targeting MIR repeats may cross-hybridize to off-target transcripts, we used a conservative approach to identify and remove individual probes showing abnormal intensities (see ‘Analysis of Exon array data’ below). After probe filtering, we collected a final list of 328 exon array probesets, with at least three reliable probes in each probeset to infer the splicing profiles of MIR-derived exons.
We used the same procedure to collect 330 Alu-derived exons, 271 LINE1-derived exons and 175 LINE2-derived exons. We also collected 13 103 high-confidence constitutive exons in the human genome from Lin et al. (23).
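The coverage criterion used above to call an exon TE-derived (at least 25 bp of overlap and over 50% of the exon length) can be sketched as follows. This is a minimal illustration rather than the pipeline used in this study: the function names, the use of 0-based half-open coordinates, and the simplification that each exon is compared against a single TE annotation on the same chromosome are all assumptions made for the example.

```python
def overlap_bp(exon_start, exon_end, te_start, te_end):
    """Bases shared by two half-open intervals [start, end) on one chromosome."""
    return max(0, min(exon_end, te_end) - max(exon_start, te_start))


def is_te_derived(exon, te, min_overlap_bp=25, min_fraction=0.5):
    """Return True if the TE covers >= min_overlap_bp bases of the exon
    and strictly more than min_fraction of the exon's total length.

    exon and te are (start, end) tuples; the coordinate convention and the
    one-TE-per-exon comparison are assumptions of this sketch.
    """
    exon_len = exon[1] - exon[0]
    if exon_len <= 0:
        return False
    shared = overlap_bp(exon[0], exon[1], te[0], te[1])
    return shared >= min_overlap_bp and shared / exon_len > min_fraction


if __name__ == "__main__":
    # Hypothetical coordinates: a 100-bp exon with 80 bp inside a repeat.
    print(is_te_derived((100, 200), (120, 400)))
```

Both thresholds are exposed as parameters so that the same check can be reused for the Alu, LINE1 and LINE2 exon sets.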
For each exon, we scored its 5′ and 3′ splice sites using consensus splice site models in MAXENT (31). For 5′ splice sites, we analyzed three nucleotides in exons and six nucleotides in introns. For 3′ splice sites, we analyzed three nucleotides in exons and 20 nucleotides in introns. We also calculated the density of ESEs using 238 ESEs from Fairbrother et al. (33). For each exon, the ESE density was calculated as the number of nucleotides covered by ESEs, divided by the total length of the exon.
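The splice site windows and the ESE density defined above can be illustrated with a short sketch. It does not call MAXENT itself; it only extracts the sequence windows that such a scoring model would take as input (3 exonic plus 6 intronic nucleotides for the 5′ splice site, 20 intronic plus 3 exonic nucleotides for the 3′ splice site) and computes the fraction of exonic nucleotides covered by at least one ESE motif. The function names, the plus-strand 0-based half-open coordinates and the toy motif in the usage example are assumptions for illustration, not part of the original analysis.

```python
def splice_site_windows(seq, exon_start, exon_end):
    """Return (three_prime_window, five_prime_window) for a plus-strand exon.

    seq is the genomic sequence; the exon occupies seq[exon_start:exon_end].
    3' splice site window: 20 intronic nt followed by the first 3 exonic nt.
    5' splice site window: last 3 exonic nt followed by 6 intronic nt.
    """
    if exon_start < 20 or exon_end + 6 > len(seq) or exon_end - exon_start < 3:
        raise ValueError("not enough flanking sequence for both windows")
    three_prime = seq[exon_start - 20:exon_start + 3]
    five_prime = seq[exon_end - 3:exon_end + 6]
    return three_prime, five_prime


def ese_density(exon_seq, ese_motifs):
    """Fraction of exon nucleotides covered by at least one ESE motif.

    Overlapping motif occurrences are counted once per nucleotide.
    """
    seq = exon_seq.upper()
    if not seq:
        return 0.0
    motifs = {m.upper() for m in ese_motifs}
    covered = [False] * len(seq)
    for k in {len(m) for m in motifs}:
        for i in range(len(seq) - k + 1):
            if seq[i:i + k] in motifs:
                for j in range(i, i + k):
                    covered[j] = True
    return sum(covered) / len(seq)


if __name__ == "__main__":
    # Toy input: one hypothetical hexamer motif, not taken from reference (33).
    print(ese_density("GAAGAAGTTT", {"GAAGAA"}))
```

Marking covered positions rather than counting motif hits keeps the density between 0 and 1 even when motif occurrences overlap.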
Briefly, we first predicted the background intensities of individual exon array probes, using a sequence-specific linear model (53,54) trained from ‘genomic’ and ‘anti-genomic’ background probes on the Exon 1.0 array (55). For every probe, the predicted background intensity was an estimate of the amount of non-specific hybridization to the probe. This background intensity was subtracted from the observed probe intensity before downstream analyses (54). Second, for each gene we used a correlation-based iterative probe selection algorithm to construct robust estimates of overall gene expression levels, independent of splicing patterns of individual exons (56). Third, as oligonucleotide probes for TE-derived exons may be more likely to cross-hybridize than typical exon array probes, we used a set of criteria to identify and remove individual probes with abnormal probe intensities, for example probes that cross-hybridize to off-target transcripts. These criteria have been described in detail previously (23).
For each exon, we calculated the Pearson correlation coefficient of individual probes' intensities with the overall gene expression levels in 11 tissues [estimated from all exons of a gene, see (54,56)]. As in (23), we defined a probe to be ‘correlated’ with gene expression levels if the Pearson correlation coefficient was above 0.6. We defined an exon to be ‘correlated’ if it had at least three probes correlated with gene expression levels.
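The ‘correlated’ exon criterion above can be expressed as a short sketch. It assumes that each probe is represented by a list of background-corrected intensities, one per tissue, and that the gene-level expression estimates are supplied in the same tissue order; the function names are illustrative. Returning 0.0 for a constant profile is a convention chosen here so that an uninformative probe is never counted as correlated, and is not taken from the original analysis.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # convention of this sketch: constant profile -> uncorrelated
    return sxy / sqrt(sxx * syy)


def exon_is_correlated(probe_profiles, gene_profile, r_cutoff=0.6, min_probes=3):
    """True if at least min_probes probes have Pearson r above r_cutoff
    with the gene-level expression profile across tissues."""
    n_correlated = sum(
        1 for profile in probe_profiles if pearson(profile, gene_profile) > r_cutoff
    )
    return n_correlated >= min_probes


if __name__ == "__main__":
    # Hypothetical intensities across five tissues (the study used 11).
    gene = [1.0, 2.0, 3.0, 4.0, 5.0]
    probes = [[2, 4, 6, 8, 10], [1.1, 2.0, 3.2, 3.9, 5.1], [0, 1, 2, 3, 5]]
    print(exon_is_correlated(probes, gene))
```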
Total RNA samples from 10 human tissues were purchased from Clontech (Mountain View, CA, USA). RNAs of three tissues (liver, pancreas and spleen) were from single donors. RNAs of the other seven tissues were pooled from multiple donors (cerebellum, 24 donors; heart, 3 donors; kidney, 14 donors; skeletal muscle, 7 donors; prostate, 32 donors; testis, 39 donors; thyroid, 64 donors). Single-pass cDNA was synthesized using the High-Capacity cDNA Reverse Transcription Kit (Applied Biosystems, Foster City, CA, USA) according to the manufacturer's instructions. For each tested MIR-derived exon, we designed a pair of forward and reverse PCR primers at flanking constitutive exons using PRIMER3 (57). Primer sequences are described in Supplementary Material, Table S1. Two micrograms of total RNA was used for each 20 µl cDNA synthesis reaction. For each candidate MIR exonization event, 1 µl of cDNA was used for amplification in a 25 µl PCR reaction. PCR reactions were run for 40 cycles in a Bio-Rad thermocycler with an annealing temperature of 62°C. The reaction products were resolved on 2% TAE/agarose gels. Candidate DNA fragments corresponding to exon inclusion and exon skipping forms were cloned for sequencing using the Zero Blunt TOPO PCR Cloning Kit (Invitrogen, Carlsbad, CA, USA).
Primary fibroblast cell cultures of chimpanzee, rhesus macaque and marmoset were purchased from Coriell Institute for Medical Research (Camden, NJ, USA). Frozen chimpanzee, rhesus macaque and marmoset kidneys were generously provided by Southwest National Primate Research Center (San Antonio, TX, USA). Human fibroblast cell culture was provided by Steven Moore (University of Iowa, IA, USA). Mouse kidney was collected from one wild-type male C57BL/6.
RNA was prepared using TRIzol (Invitrogen) according to the manufacturer's instructions. Single-pass cDNA was synthesized using the High-Capacity cDNA Reverse Transcription Kit (Applied Biosystems). RT–PCR analysis was performed using primers designed from flanking exons of the target exon in chimpanzee, rhesus macaque, marmoset and mouse genomes. For any non-human primate in which the sequences of flanking exons were unavailable in the low-coverage genome assembly, we used RT–PCR primers designed for flanking exons in the human genome. Primer sequences are described in Supplementary Material, Table S2.
We thank David Eichmann, Ben Rogers and the University of Iowa Institute for Clinical and Translational Science (NIH grant UL1 RR024979) for computer support. This study used biological materials obtained from the Southwest National Primate Research Center, which is supported by NIH-NCRR grant P51 RR013986. This work was supported by the National Institutes of Health (1R01HG004634), Roy J. Carver Trust, and a research startup fund from the University of Iowa.
We thank Peter Stoilov, Song Liu and Russ Carstens for discussions and comments on this manuscript. We thank Mary Jo Aivaliotis, Jerilyn Pecotte and Anne Tye for assistance.
Conflict of Interest statement. None declared.