|Home | About | Journals | Submit | Contact Us | Français|
Human internal exons have an average size of 147nt, and most are <300nt. This small size is thought to facilitate exon definition. A small number of large internal exons have been identified and shown to be alternatively spliced. We identified 1115 internal exons >1000nt in the human genome; these were found in 5% of all protein-coding genes, and most were expressed and translated. Surprisingly, 40% of these were expressed at levels similar to the flanking exons, suggesting they were constitutively spliced. While all of the large exons had strong splice sites, the constitutively spliced large exons had a higher ratio of splicing enhancers/silencers and were more conserved across mammals than the alternatively spliced large exons. We asked if large exons contain specific sequences that promote splicing and identified 38 sequences enriched in the large exons relative to small exons. The consensus sequence is C-rich with a central invariant CA dinucleotide. Mutation of these sequences in a candidate large exon indicated that these are important for recognition of large exons by the splicing machinery. We propose that these sequences are large exon splicing enhancers (LESEs).
A major point of regulation during eukaryotic gene expression is the splicing of exons to yield a contiguous transcript that can be translated to a functional protein. The splicing machinery has to efficiently recognize small exons flanked by much larger introns many thousands of times during each cell cycle. In different tissues different exons are included, resulting in alternative splicing that yields a greater diversity of proteins (1). In humans, 90% of genes are alternatively spliced (2,3). Splicing regulates both the levels of gene expression and tissue-specific expression (4), with splicing deregulation often leading to disease (5).
Early studies of gene architecture revealed that most internal exons range from 50 to 200nt, while terminal exons can be much longer (6,7). In humans, the mean size of internal exons is 147nt, and nearly all appear to be <300nt (8). The conserved size of internal exons in many species suggests it may be an important factor in splicing. To test this hypothesis, exons were artificially expanded. Expansion of exon size >300nt diminishes splicing in most but not all cases (9–11).
A necessary step in correct pre-mRNA splicing is the identification of exons and introns by the recognition of splice sites. These recognition events are driven by splice site sequence complementarity to corresponding snRNAs that are part of splicing snRNPs. Although in lower eukaryotes this process is facilitated by the recognition of splice sites across a small intron, it is hypothesized that in higher eukaryotes splicing is facilitated by the recognition of splice sites across an exon (12,10).
Berget and coworkers proposed the exon definition model, which invokes the necessary communication of 3′- and 5′-splice site complexes across an exon for efficient splicing (10,13). This model is based, in part, on observations that a downstream 5′-splice site enhances splicing at the upstream 3′-splice site of the same exon (10,14,15). This model was supported by mutations in either splice site that caused the skipping of the internal exon (16,17). The requirement for communication across an exon and the apparent dearth of large exons in the genome led to the model that splice site recognition is more efficient when exons are small, and exon inclusion decreases with increasing exon size (13). To date, examples of internal exons >500nt have all been found to be alternatively spliced (18–21).
The communication of snRNP complexes across an exon is influenced by multiple sequences within both exons and introns that either promote or prevent the inclusion of a given exon (22–26). Exonic splicing enhancers (ESE) bind serine/arginine rich-proteins (SR proteins) to recruit the splicing machinery to nearby splice sites, thereby leading to inclusion of the exon (27). Similarly, other enhancer sequences present in adjacent introns called intronic splicing enhancers (ISE) and suppressors in exons and introns called ESS and ISS influence the splicing of a given exon by binding proteins that interact with complexes necessary for proper splicing (28). These sequences, along with splice site strength, local RNA structure, nucleosome density and rate of pre-mRNA synthesis, influence exon recognition (12,29–33).
We found a total of 1115 internal exons >1000nt in the human genome, and these were present in 1040 different genes (Refseq database). To investigate whether these large exons were frequently skipped, we analyzed 120 RNA-seq datasets deposited in the public sequence read archive (SRA). Surprisingly, we found that 42% of the large exons were expressed at levels similar to their corresponding upstream and downstream exons, indicating large exon inclusion. However, in general large exons were more frequently alternatively spliced than a random set of internal exons <250nt. Further analysis indicated many of these large exons had strong splice sites and many ESEs. Most of these large exons are evolutionarily conserved in placental mammals, and constitutive large exons are more conserved than alternative large exons. Furthermore, we identified 38 sequences enriched in large exons that promote their recognition and splicing and propose that they are large exon splicing enhancers (LESEs).
All the annotated genes from the human genome (hg19, February 2009) were extracted from the knownGene table in the UCSC genome browser. Exon and intron lengths were calculated using the exon start and exon end coordinates for each gene. Any redundant entries, pseudogenes and single exon genes were removed from the dataset. This yielded a total of 22539 genes with more than one exon. These 22539 genes were used in the following analysis. The entire sequence of each of these large exon-containing genes was extracted from the human genome. These sequences were used to build a Bowtie index, and this index was used to align all the RNA-seq datasets and ribosome footprint datasets using Bowtie (34). All Bowtie alignments were carried out with the following parameters to decrease false alignments, n=0, best and l=30.
In order to analyze large exon expression, 120 datasets were downloaded from the SRA database at NCBI. All of the datasets had been sequenced on one of the Illumina sequencing platforms. Each RNA-seq dataset was aligned to the large exon Bowtie index (34). The Bowtie output was then analyzed using custom scripts to calculate reads per exon of every exon in a large exon-containing gene. The reads per exon were then converted to reads per kilobase per million reads (RPKM) values by dividing the number of reads by the size of the exon in kb.
To investigate the distribution of RNA-seq reads across a large exon, the alignment position output of Bowtie was used to calculate the number of times each nucleotide is represented by an RNA-seq read. This value was then plotted against position in the large exon.
In order to analyze the presence of wild-type and alternate splice junctions, all possible splice junctions sequences were generated using the annotated splice sites. Fifty nucleotides from the end of an upstream exon were fused to 50nt from the beginning of the downstream exon. These sequences were used to create a splice junction Bowtie index, which was then used to align all RNA-seq datasets.
In order to investigate whether a given large exon was being translated, we aligned ribosome footprints from HeLa cells (35) to the large exon Bowtie index. Similar to the RNA-seq analysis, the alignment position output of Bowtie was used to calculate the number of times each nucleotide was translated.
To classify whether a large exon was under expressed relative to upstream or downstream exons, we calculated fold change of large exon expression relative to an upstream or downstream exon, whichever was greater. This fold change was then binned to five different exon inclusion indexes as follows, 0: gene not expressed, 1: 0<large exon<0.25, 2: 0.25<large exon<0.50, 3: 0.50<large exon<0.75, 4: 0.75 < large exon<1.25, 5: large exon >1.25. The average exon inclusion index was computed for all the datasets in which a particular large exon-containing gene was expressed.
Splice sites for all exons were extracted using custom scripts, and splice strengths were calculated using the maximum entropy model of MaxEntScan (36). The maximum entropy score for both 5′- and 3′-splice sites were used for further analysis.
In order to investigate the presence of internal splice sites in a large exon, all 9-mers starting at every nucleotide were generated for 5′-splice site analysis; 23-mers starting at all nucleotides were generated for 3′-splice site analysis. The strength of all of these sequences was calculated using MaxEntScan (36).
RESCUE–ESE sequences were obtained from Fairbrother et al. (24) and ESS sequences were obtained from Wang et al. (37). The number of ESE and ESS sequences were calculated and then scaled by the size of the exon in kb.
Conservation scores generated by phastCons (38) for primates and placental mammals were downloaded from the UCSC genome browser. Conservation scores for each nucleotide for the entire large exon gene were extracted for further analysis. We then extracted the conservation score for 500nt in the upstream intron, the first 500nt in the large exon, the last 500nt in the large exon, and 500nt in the downstream intron for every constitutive and alternative large exon dataset. These data were then averaged for the constitutive and alternative large exons and plotted against nucleotide position.
In order to identify large exon-enriched sequence motifs, we identified hexamers that were enriched in large exons compared with small exons. As in previous studies determining ESEs and ESSs (24), we focused on counting the number of times any given hexamer appeared in a dataset. We calculated the frequency of occurrence for all 4096 hexamers in all large exons, all internal small exons (<300nt) and all introns of multi-exonic genes. We then computed for each hexamer the difference between the frequency of a hexamer in large exons () and small exons (). The mean and standard deviations for this distribution (ΔLS=−) was computed. Large-exon enriched hexamers were determined using false discovery rate statistics (39). All hexamers were ranked based on Z-scores (number of standard deviations from mean). The probability that a given hexamer could be enriched by chance was calculated using inverse cumulative normal distribution. All hexamers that had a false discovery rate of <5% were deemed large exon-enriched hexamers.
In order to identify specific motifs in the large exon-enriched sequences, a dissimilarity index was calculated for all pairs of hexamers, similar to that of Fairbrother et al. (24). The sequences were then clustered using standard average hierarchical clustering as implemented in Python module SciPy v0.11.dev. Sequences for each cluster were aligned using ClustalW (40), and motif logos were rendered using enoLOGOS (41).
Exons 6–8 of the JARID2 gene along with 500nt of flanking internal introns were cloned into the pcDNA3.1(+) vector. For all 18 hexamer sequences in the JARID2 large exon 7, the internal 4nt were mutated to complementary nucleotides, and the mutated exon was synthesized as three gBlock fragments (IDT). The mutant large exon was cloned to replace the wild-type large exon in the mini-gene construct. All the original splice sites were maintained and each construct was verified by sequencing. Four micrograms of the constructs was transfected into HEK293T cells using Lipofectamine 2000 (Invitrogen). Total RNA was isolated 36h post-transfection using RNA-Bee (Tel-Test), and 1μg of total RNA was reverse transcribed using MMLV-RT (Promega). A vector-specific primer was used in this reverse transcription reaction to avoid amplification from endogenous JARID2 sequences. Splicing was analyzed by polymerase chain reaction (PCR), using primers to detect splice junctions between exons 6 and 7, 7 and 8 and 6 and 8 (skipping of exon 7). Ten percent of the reverse transcription reaction was used in each PCR reaction, and reactions were stopped after 21 cycles, previously shown to be in the linear range by real-time PCR. The PCR reactions were run on 1.5% agarose gels, and bands quantitated using GeneTools (Syngene, Inc).
Previous analyses of internal exon size in the human genome reported a mean size of 147nt and most exons were <300nt (8). This is consistent with the exon definition model of splicing that proposed an upper limit of 300–500nt for efficient splicing (9–11). In this analysis we searched the annotated human genome (hg19) for internal exons >1000nt. We found 1153 annotated internal exons >1000nt in length with a median size of 1503nt (Figure 1A). Although we did not analyze intermediate-sized exons, we observed 2226 exons between 500 and 1000nt in length. The remaining 98.5% of internal exons were <500nt long. Although these 1153 large (>1000nt) internal exons are a minor fraction (0.5%) of all internal exons, they are present in 1040 different genes making up 4.7% of all human protein-coding genes.
There are a small number of human genes with very large internal exons (>10000nt); all but one of these genes are in the mucin family of proteins that play a role in protecting the epithelium from the harsh external environment (42). The largest internal exon is exon 3 of human mucin 16 (MUC16), which is >21000nt. Other genes in the mucin family with internal exons >10000nt include MUC4 (12708nt), MUC12 (14889nt), MUC17 (12219nt) and MUC5B (10893nt). The gene with the largest number of internal large exons is titin (TTN) which has five large exons; its largest exon is >17000nt in length.
Although the majority of internal exons are small, the terminal exons of genes do not have any size restrictions (6,7). In fact the biggest terminal exon in the human genome seen in our analysis is 22000nt, and the average size of all terminal exons in the human genome is ~1000nt. Since terminal exons are so large, we investigated whether large internal exons could have once been terminal exons during an earlier time in evolution, although it is also possible that larger exons in internal positions other than second/penultimate position are more detrimental to organism fitness. If so, we hypothesized that these large exons would be enriched as either the second exon or the penultimate exon in the gene. Consistent with this, we found that 447 (38.7%) of the large exons are either the second exon or the penultimate exon, suggesting these genes may have picked up an additional exon during the course of evolution (Figure 1B). Since there are an average of 13 exons in a human gene, a random distribution of internal large exon positions would result in 15% at exon 2 or the penultimate exon. Of these 447 large exons, 288 are the second exon and the remaining 219 are the penultimate exon of their corresponding genes. The remaining number of large exons are spread across various exon positions.
Interestingly, only 12.5% (36/288) of genes with large exons in the second position had a start codon within the first exons (data not shown), which suggested that many of these new terminal exons are non-coding regulatory elements. Of the remaining genes with large exons in the second position, 52% of the start codons were in the large exon, while another 27% of the start codons were in the third exon (data not shown).
We asked whether the genes with large exons expressed proteins with particular functions. We found a strong enrichment of cytoskeleton and microtubule-associated proteins (21%), in comparison to 10% of the genes without large exons. Another large subset encoded nuclear proteins (27.8%) in comparison to 21% of genes without large exons. In addition, 35% of the genes with large exons had unknown functions, compared with 25% of genes without large exons. Thus, the genes with large exons represented a subset of all cellular functions.
We next wanted to look at the expression of large exons. A Bowtie index (34) was constructed using large exon gene sequences from RNA-seq data obtained from 120 datasets downloaded from the SRA database at NCBI. Analysis of the 1153 large exons showed that 1040 (out of 1078 expressed exons) were fairly evenly transcribed across the large exon, like the GT3C4 exon shown in Figure 2A (RNA-seq reads shown in green). This exon was also translated in HeLa cells, as shown by the ribosome footprints in Figure 2A (bottom, red). We also looked for the presence of internal splice sites and found several, but these were weaker than the splice sites flanking this exon (Figure 2A).
In contrast, three of the annotated large exons resembled the EFHC1 exon shown in Figure 2B. The RNA-seq reads were much more abundant at both ends of the exon (~200nt regions) than in the center; however, the entire exon was expressed in a fraction of cell types. There were strong splice sites flanking these higher RNA-seq reads, suggesting that they are two independent exons in some cell types. We think this is an example of an intron that is retained in some cells but not in most. These were removed from our dataset.
Lastly, we observed 38 annotated large exons that appeared to have transcripts and ribosome footprints representing only a portion of the exon. An example of this is shown in Figure 2C. These exons usually had a splice site flanking the RNA-seq reads, leading us to believe that these are misannotated large exons that are really small exons; they were removed from our dataset before further analysis.
In order to further assess the expression of large exons and large exon-containing genes, the sequences in 120 RNA-seq datasets were aligned to our Bowtie index of large exon-containing genes. The absolute number of reads aligning to each exon was extracted, and RPKM for each exon was calculated with respect to the total number of reads mapping to the entire gene. The ratio of the RPKM for the large exon relative to the RPKM of the upstream or downstream exon (whichever was larger) was calculated and binned in our large exon inclusion index. (0: gene not expressed, 1: 0<large exon<0.25, 2: 0.25<large exon<0.50, 3: 0.50<large exon<0.75, 4: 0.75<large exon<1.25, 5: large exon>1.25). The average exon inclusion index for each large exon was calculated over the 120 datasets. The internal large exons were only analyzed when both upstream and downstream exons were expressed.
Surprisingly, 42% of the internal large exons had an average exon inclusion index >4 (Figure 3). This means these exons were retained at around the same levels as the flanking exons, so they are considered constitutive exons. Another 37.5% of large exons had an average exon inclusion index between 3 and 4, which means they were present at levels between 50% and 75% relative to the flanking exons. About 15% of the large exons were expressed <50% of the time (bins 1 and 2). Around 5% of large exon-containing genes were not expressed in all 120 datasets at levels required for this analysis (Figure 3). Based on the exon definition hypothesis and previous data, the majority of these large internal exons would be expected to be excluded from transcripts. However, only 166 of the 1040 large exons have an average bin <3, meaning they were excluded more often than included in the mature transcript.
In contrast, an analysis of 2200 random internal exons <250nt showed a different distribution. Seventy percent of these small internal exons had an average exon inclusion index >4; that is they are constitutively spliced with respect to upstream and downstream exons. In addition, 25% of these had an average inclusion index between 3 and 4 and therefore were present 50–75% of the time in the mature transcript (Figure 3). This indicates that large exons are skipped more often than small internal exons.
The majority of large exons were expressed over a diverse range in different datasets and cell types. Although most of the genes had a maximum exon inclusion index of 5.0, many of them also had a minimum exon inclusion index of 1.0 (Supplementary Figure S1). This means that many of these large exons were expressed at the same levels as flanking exons in some datasets, while they were highly underrepresented in other datasets, indicative of exon skipping.
In order to make sure that these large exons were actually part of contiguous transcripts, we analyzed the presence of splice junctions with upstream and downstream exons. We constructed in silico all possible splice junctions between exons based on annotated splice sites. This was used to make a Bowtie index for alignments. For constitutive large exons, the wild-type splice junction with the upstream and downstream exon was the predominant junction observed (data not shown). Similarly, when the large exon was alternatively spliced, and the exon inclusion index was low, there were more alternative splice junctions fusing the upstream exon with other downstream exons or there was a complete absence of the wild-type splice junction (data not shown). There was very little correlation (Spearman rho=0.003) between the size of a large exon and the degree of inclusion of the exon in the transcript (Supplementary Figure S2). This analysis indicates that the splicing machinery is capable of splicing large exons, and exclusion or inclusion of large exons is probably determined by other cis- and trans-acting factors. We did not find any correlation between large exon inclusion and the total length of the mRNA (data not shown).
To determine whether large human exons or the flanking introns were conserved in mammals and if constitutive large exons (average exon inclusion index>4) were more conserved than alternative large exons (average exon inclusion index<4), we downloaded the UCSC genome browser phastCons (38) conservation scores for primates and all mammals and extracted the conservation score for every nucleotide in large exon-containing genes. We then extracted the conservation score for 500nt in the upstream intron, the first 500nt in the large exon, the last 500nt in the large exon and 500nt in the downstream intron for every large exon in our dataset. These data were then averaged for the constitutive and alternative large exons and plotted against nucleotide position.
As illustrated in Figure 4, the conservation score across the large exons (average phastCons score=0.6) was much higher than that in the flanking introns (average phastCons score=0.2), supporting the idea that these sequences are indeed exons. Also, constitutive large exons were more conserved than alternative large exons. The largest difference in conservation between these datasets was observed at the 3′-end (5′-splice site) of the large exon, although the constitutive exons were more conserved throughout.
One factor that could explain the inclusion of large exons is the size of the flanking intron. Previous work indicated that decreasing the size of an upstream intron to around 250nt led to inclusion of a large exon that would otherwise be excluded (10). Of our 1040 large exons, 66 exons were flanked by upstream and downstream introns <500nt. Another 220 large exons had either an upstream or downstream intron <500nt. This corresponds to around 27.5% (286 of 1040 exons) of large exons having at least one flanking intron <500nt. Interestingly, 27.6% of all internal exons were also flanked by at least one intron <500nt. These data suggest that there was no enrichment for small introns in the large exon dataset, indicating that intron size probably does not play a big role in inclusion of large exons.
We next investigated whether both upstream and downstream small introns could influence large exon inclusion. A total of 31 large exons had both flanking introns <250nt, and 28 of these large exons had an average large exon inclusion index of >4.5, demonstrating they are constitutively spliced (Supplementary Figure S3). In contrast to previous data where a small upstream intron was sufficient to promote inclusion of an artificial large exon (10), our analysis indicated that both flanking introns needed to be <250nt to promote inclusion. We did not see an effect on inclusion if only one of the flanking introns was <250nt. Furthermore, the remaining large exons with flanking introns <1000nt did not seem to show enrichment for any particular exon inclusion index.
Since the ends of the constitutive large exons were more conserved than alternative large exons, we asked if constitutive large exons have stronger splice sites. Splice sites are unique sequences that bind splicesomal snRNPs by sequence complementarity during splicing. Previous data have shown that a weak 5′- or 3′-splice site leads to exclusion of the exon from the transcript, when all other factors are controlled. To determine whether large exon inclusion is determined by splice site strength, we extracted all 5′- and 3′-splice sites of large exons and calculated strength using the maximum entropy model of MaxEntScan (36).
The distribution of splice site strength of large exons was similar to that of all internal exons, indicating that there was no selection to make large exon splice sites stronger to compensate for size (Supplementary Figure S4). In Figure 5, the average exon inclusion index is depicted in terms of a heat map with darker colors indicative of a higher index and lighter colors indicative of a lower index. As illustrated in the plot, some large exons with weak splice sites (maximum entropy score of <2.5) have a high average exon inclusion index, and conversely, some exons with very good splice sites have a low exon inclusion index (Figure 5).This indicates that strong splice sites are not the sole determinant of exon inclusion.
Since constitutive large exons were slightly more conserved than alternative large exons, we asked if there were specific sequences in the exons that might influence their inclusion. ESEs are short sequences that have been shown to increase splicing efficiency and exon retention. It is hypothesized that ESEs are bound by splicing enhancer proteins that help bridge the two splicing complexes at either end of the exon. Previous analysis has identified 238 6-nt RESCUE-ESEs (24) that were enriched in constitutively spliced exons with weak splice sites and absent in introns. We investigated whether there is an increased number of ESEs in large exons to promote their inclusion.
The absolute number of ESEs in all internal exons and all large exons was calculated and then normalized for size (Figure 6A and C). All the large exons had at least 1 ESE/kb; however, 10% of small exons did not have any ESEs. The distribution of ESEs in large exons (Figure 6C) was not significantly different than the distribution of ESEs in all internal exons (Figure 6A), indicating that there was no enrichment of ESEs in large exons to compensate for size. Furthermore, we saw no enrichment of ESEs in large exons with weak splice sites that were spliced efficiently (data not shown). We also did not find any enrichment of ISEs in the introns within 200nt of either splice site of these large exons (data not shown).
We wanted to investigate why some large exons with strong 5′- and 3′-splice sites had a very low average exon inclusion index. Apart from having positive sequence elements, exons also contain negative sequence elements called ESSs. These are sequences that are bound by proteins including hnRNPs that inhibit the splicing of the exon (43). First, we investigated whether large exons had a different distribution of ESSs compared with all internal exons. The distribution of ESSs in large exons is not very different than ESSs in all internal exons except at the lower end of the distribution (Figure 6B and D). Surprisingly, there are fewer large exons with very low numbers of ESSs. Next, we calculated the standard deviation from the mean of all large exons with good splice sites to see if those with low splicing were enriched for ESSs. However, there was no enrichment of ESSs in these large exons compared with all other large exons (data not shown).
The inclusion of internal exons is probably influenced by the ratio of ESEs to ESSs. We wanted to see if the ratio of ESEs to ESSs could determine the inclusion of a large exon. Similar to internal small exons, large exons have more ESEs than ESSs. On average there are four times more ESEs than ESSs in large exons (avg. ESEs: 108/kb in large exon, avg. ESSs: 28/kb in large exon) and only 45 of the large exons had more ESSs than ESEs. Interestingly this difference is almost identical to what is observed in small exons. Small exons on average have 110 ESEs/kb and 28 ESS/kb once again indicating that these large exons are indeed exons.
We next investigated whether the ratio of ESEs to ESSs may explain the inclusion of large exons. As illustrated in Figure 6E, some constitutive large exons have a much higher ratio (>5) of ESEs to ESSs in comparison to alternative large exons. Constitutive large exons (inclusion index>4) have an average of 4.25 times more ESEs than ESSs. Alternative large exons that have an inclusion index of >3 have 3.7 times more ESEs than ESSs and the rest of the alternative large exons have 3.2 times more ESEs than ESSs. This suggests that the ratio of ESEs to ESSs may be an important determining factor in large exon inclusion (Spearman rho=0.15).
Although there is a modest correlation between the ratio of ESEs/ESSs and the splicing of large exons, we wondered whether there were specific sequences or motifs that identified large exons and promoted their splicing. Since previous analysis to identify splicing enhancers and silencers used hexamers as their basic unit, we decided to use the same approach to try to identify hexamers that were enriched in the large exon dataset. We calculated the frequency of occurrence of all 4096 possible hexamers in the large exon dataset () and, in all internal small exons of <300nt () (Figure 7A). We then calculated the difference between the large exon frequency and small exon frequency, ΔLS. Using a false discovery rate of 5%, we identified 41 (1.0%) hexamers that were enriched in the large exon dataset (Supplementary Table S1). This represented all hexamers that were >3.2 SD from the mean of the ΔLS distribution and are plotted in red (Figure 7A). Three of these hexamers were previously identified as ESEs (24). We then asked whether any of the remaining 38 large-exon enriched hexamers were representative of intron enriched hexamers. None of the 38 large-exon-enriched hexamers were enriched in introns (Supplementary Figure S5), which means the probability of identifying any of these 38 large exon-enriched hexamers in introns is no greater than random. We also identified 12 sequences that were underrepresented in large exons in comparison to small exons (data not shown).
These 38 enriched sequences were present in all large exons, although constitutive large exons had, on average, a greater number of them (Spearman rho=0.3, Figure 7B). In order to identify specific motifs that were present in these 38 hexamers, we used standard average hierarchical clustering based on a dissimilarity index to cluster these sequences (Figure 7C). This analysis showed that all 38 hexamers were related to each other with a dissimilarity index of <3. All 38 hexamers were then aligned by ClustalW to calculate a frequency matrix, which was then used to calculate a motif (Figure 7C). The motif is C-rich and there is a central CA dinucleotide that is nearly invariant in all the hexamers. We propose that these sequences may play a role in enhancing large exon splicing.
We then analyzed the presence of these 38 large exon-enriched sequences in the constitutive dataset (average inclusion index>4) and the alternative dataset (average inclusion index<3) to try to identify constitutive splicing-enriched hexamers. Similar to previous analysis, the frequency of the 38 hexamers was calculated for both the constitutive large exon dataset and the alternative large exon dataset (data not shown). Constitutive large exon-enriched hexamers were determined based on false discovery rate statistics. This analysis indicated that there were no significantly enriched hexamers in the constitutive large exon dataset compared with the alternative dataset. This may indicate that large exon-enriched hexamers only identify and enhance the splicing of all large exons and do not play a role in alternative splicing of the large exon. The constitutive and alternative nature of a large exon may be determined by previously analyzed features.
To investigate whether this large exon-specific motif plays a role in large exon definition, we mutated the large exon-enriched hexamers in a candidate exon, the 1039-nt exon 7 of the JARID2 gene. This exon is a constitutively spliced in 293 cells. We constructed a mini-gene containing JARID2 exons 6–8 and truncated (to 1kb) flanking introns (Figure 8A). The JARID2 large exon had a total of 18 different large exon-enriched hexamers. We synthesized a mutant large exon in which the internal 4nt in each hexamer were mutated. This construct was cloned into the mini-gene to create a JARID2-Mutant mini-gene. Both wild-type and mutant constructs were transfected into HEK293T cells. Total RNA was extracted 36h later, and each exon–exon junction was assayed using PCR. We assayed the WT (6–7, 7–8) exon–exon junctions as well as the alternative (6–8) exon–exon junction (Figure 8B).
The JARID2-WT construct spliced as expected with 98% of expressed mRNA demonstrating wild-type splice junctions and 2% showing skipping of exon 7 (6–8 splice junction). This is consistent with this large exon being constitutive in 293 cells and demonstrated that the JARID2-WT mini-gene splices like the endogenous JARID2 pre-mRNA. The ratio between the two wild-type junctions (6–7/7–8) was 0.95 (Figure 8B), which also indicated that all the splice sites were recognized appropriately and spliced with almost equal efficiency.
Next, we asked whether the JARID2-Mutant construct would be spliced properly. The ratio between the two wild-type splice junctions in the JARID-Mutant construct was, 24 to 1, which was very different from that in the JARID2-WT (Figure 8B). While we observed efficient splicing of exon 6 to exon 7, the removal of the downstream intron was reduced (Figure 8B). In addition, we observed a small increase in the relative amount of skipping of exon 7 in the mutant construct. This suggested that the large mutant exon is not recognized as an exon, preventing its 5′-ss from being recognized and spliced properly. The large exon and the retained downstream intron appear to be seen as part of the 3′ terminal exon of this minigene construct (data not shown). This supports the idea that these hexamer sequences promote the identification of large exons and function as LESEs.
The majority of human exons are small, and it is thought that the splicing complexes at either end of the exon communicate with each other to define the exon and accurately splice (10). This is consistent with the exon definition model, which predicts that any large exon present in the human genome would be skipped (or alternatively spliced) during the splicing of the pre-mRNA transcript (13). However, we were surprised to find that 5% of all human mRNAs have at least one large internal exon >1000nt, and 42% of these were constitutively spliced. Our data indicate that the splicing machinery is fairly elastic and is capable of splicing large internal exons efficiently (11). In the cases where the large exon is excluded, this may be a mechanism to control the level of wild-type gene expression.
We observed a high degree of conservation of the large exons across 31 different mammalian species and a much lower degree of conservation of the adjacent introns (Figure 4). Interestingly, the constitutive exons were more conserved than the alternative exons especially at both ends. Because of the increased conservation at both ends of the large exon, we hypothesized that there maybe an interaction between the ends to bring the splice sites closer together. Using UNAfold (44) we analyzed the number of predicted interactions between the first 200nt and last 200nt of large exons in the lowest energy structure of the entire large exon (data not shown). We did not find any difference in total folding energy (data not shown) or any increase in interactions between the ends when compared with a sequence that was randomized with the same dinucleotide frequency (data not shown). Although we did not observe any structural elements in these conserved regions, it will be interesting to study this region further to determine whether regulatory proteins might bind in these areas. We also did not observe a difference between the nucleosome occupancy of small exons and large exons (Supplementary Figure S6), which is consistent with a previous analysis that looked at nucleosome density of exons>500nt (45). The average number of nucleosome footprints per size was not higher in large exons and the distribution of the nucleosomes with respect to splice sites was also similar.
We asked whether large exon inclusion can be explained by determinants involved in regulating the splicing of smaller exons. Although none of the variables alone seemed to correlate with the distribution of large exon expression, each variable probably contributes to the inclusion of the large exon. For example, many of the large exons with short flanking introns (<250nt) had a high average exon inclusion index (Supplementary Figure S2). Similarly, many large exons with strong splice sites also had a high average exon inclusion index (Figure 5).
Another variable that directly influences the inclusion and exclusion of any internal exons are the levels of SR and hnRNP proteins in each of the cell types (27). Since constitutive large exons seem to have a higher ratio of ESEs to ESSs (Figure 6E), the competition between these two protein families could determine the inclusion of a large exon. It would be interesting to identify the presence or absence of SR or hnRNP proteins when a particular large exon is included or excluded. A RIP-seq analysis of these proteins could explain more about the inclusion of many of these large exons.
Furthermore, we identified 38 LESEs and one motif that are enriched in large exons and may potentially regulate their inclusion or exclusion in an mRNA (Figures 7 and and8).8). We have made mini-gene constructs (Figure 8A), which will allow us to test which proteins are involved in large exon splicing. We analyzed these RNA sequences for potential protein-binding sites, using the RNA-binding protein data base (46). This analysis identified RBMX/hRNPG (CCCG,CCAC,CCAG) as a binding protein in 36 of the sequences and SFRS1 (AGGA) and SFRS9 (AGGAG) as a binding protein in two of the hexamers. Interestingly, all three of these proteins have been identified as splicing regulators, influencing many genes as in the case of SFRS1 (47–49). It is possible that these proteins, amongst others, could bind and identify large exons and promote their inclusion.
Supplementary Data are available at NAR Online: Supplementary Table 1 and Supplementary Figures 1–6.
Funding for open access charge: National Institutes of Health [R01CA103891 to K.L.B.].
Conflict of interest statement. None declared.
The authors thank Dr Nicholas Ingolia for review of this article and valuable suggestions regarding statistical calculations for hexamers. The authors also thank Dr David Zappulla, Dr Johanna Withers and Dr Thomas O’Hare for suggestions and review of this article.