The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact journals.permissions/at/oxfordjournals.org
Exonic splicing enhancers (ESEs) are pre-mRNA cis-acting elements required for splice-site recognition. We previously developed a web-based program called ESEfinder that scores any sequence for the presence of ESE motifs recognized by the human SR proteins SF2/ASF, SRp40, SRp55 and SC35 (http://rulai.cshl.edu/tools/ESE/). Using ESEfinder, we have undertaken a large-scale analysis of ESE motif distribution in human protein-coding genes. Significantly higher frequencies of ESE motifs were observed in constitutive internal protein-coding exons, compared with both their flanking intronic regions and with pseudo exons. Statistical analysis of ESE motif frequency distributions revealed a complex relationship between splice-site strength and increased or decreased frequencies of particular SR protein motifs. Comparison of constitutively and alternatively spliced exons demonstrated slightly weaker splice-site scores, as well as significantly fewer ESE motifs, in the alternatively spliced group. Our results underline the importance of ESE-mediated SR protein function in the process of exon definition, in the context of both constitutive splicing and regulated alternative splicing.
Processing of pre-mRNA is a fundamental aspect of gene regulation. Most eukaryotic genes comprise multiple relatively short exons that are separated by much longer introns. The basic mechanism of splicing involves exon recognition via the 5′ and 3′ splice sites and branch site at or near the intron ends, and the precise removal of intronic sequences and ligation of exons, generating mature mRNA (1). However, accurate exon definition by the spliceosome is complicated by the presence of numerous intronic pseudo exons flanked by sequences that conform to the splice-site consensus motifs at least as well as those utilized by many true exons (2). The additional information required for exon definition is contained at least partly in cis-acting regulatory enhancer and silencer sequences (3).
Exonic splicing enhancers (ESEs) participate in both alternative and constitutive splicing, and many of them act as binding sites for members of the SR protein family (4,5). The SR proteins are a family of related proteins that share a conserved domain structure. They have one or two copies of an RNA-recognition motif (RRM) followed by a C-terminal domain that is highly enriched in arginine/serine dipeptides (RS domain) (6). The RRMs mediate substrate recognition via sequence-specific RNA binding, whereas the RS domain is thought to be involved mainly in protein–protein interactions, but apparently also in protein–RNA interactions (7,8). Exon definition may occur through ESE-bound SR proteins recruiting components of the splicing machinery through their RS domains (9,10), and/or by antagonizing the action of nearby splicing silencer elements (11).
It has been estimated that at least 15% of point mutations that give rise to human genetic diseases cause RNA splicing defects (12). These mutations exert their effects upon the standard consensus intronic splice sites, and normally result in exon skipping, or less commonly in the creation of an ectopic splice site or activation of a cryptic splice site (12). The effects of exonic point mutations are less well understood. Until recently, it was normally assumed that nonsense mutations produce truncated protein isoforms or in some cases target the mRNA for destruction, whereas missense mutations were thought to identify amino acids that are important for protein structure or function. Translationally silent mutations were normally classified as polymorphisms and considered neutral. The generality of these assumptions is now being challenged, in part through the analysis of the mRNAs produced from mutant alleles, and this analysis is leading to the re-classification of a number of exonic mutations and to the realization that an even higher proportion of mutations affect splicing (3). One possible explanation for the effects of such mutations is that they interfere with the function of exonic regulatory sequences. Indeed, recent data implicate ESE inactivation by point mutations as a significant cause of genetic disease (13–26).
Several groups have employed functional systematic evolution of ligands by exponential enrichment (SELEX) for the purpose of identifying sequences that can function as ESEs. Functional SELEX, both in vivo (27) and in vitro (28–30), has led to the discovery that a diverse array of both purine-rich and non-purine-rich sequences can act as ESEs. A further refinement of functional SELEX allowed the identification of sequence motifs that can act as ESEs in response to specific SR proteins (31,32). The motifs identified are short (6–8 nt), degenerate and sometimes partially overlap. The frequencies of the individual nucleotides at each position were used to derive score matrices that can be used to predict the location of SR protein-specific putative ESEs (31,32). The nucleotide-frequency matrices are available in a web-based program called ESEfinder (33). Previously, the matrices were used to examine a limited set of exon sequences for the presence of ESE motifs. Exonic high-score motifs were often found to be clustered and also to be enriched in regions with known natural enhancers (31,32). In addition, the motifs were found to be present at a higher density within exons, compared with introns. The predictive power of ESEfinder has been demonstrated through the observation that a number of disease-associated point mutations that result in exon skipping reduce high-score motifs to below threshold values (13,14,17,20,22,24–26). Conversely, a mutation that activates a cryptic 5′ splice site through increased SC35 binding to an ESE is consistent with the ESE scores predicted by ESEfinder (34).
Ab initio computational approaches to identify ESE motifs have recently been developed. RESCUE-ESE (35,36) identified putative ESE motifs by comparing hexanucleotide frequencies in constitutive exons with weak versus strong splice sites. Sequences preferentially associated with weak splice sites were clustered into several families and demonstrated to possess enhancer activity when functionally tested. A similar approach compared octamer frequencies from internal non-coding exons versus unspliced pseudo exons and the 5′-untranslated regions (5′-UTRs) of intronless genes, to identify putative regulatory sequences involved in splicing (37). This approach led to the discovery of both functional enhancer and silencer sequences.
We have undertaken a large-scale analysis of SR-protein-dependent ESE motif frequencies in the human genome using ESEfinder. A thorough survey of ESE prevalence was warranted, in light of the high percentage of mutations that cause genetic diseases through aberrant splicing. In addition, a genome-wide survey of ESE motifs in protein-coding genes can give an indication of their importance in constitutive and alternative splicing, and their overall contribution to exon definition and splice-site selection.
The EnsMart search engine (38) was used to retrieve human genomic sequence from Ensembl (version 24) (39). A set of 63218 constitutively spliced internal protein-coding exons, plus 100 nt each of flanking upstream and downstream intronic sequence, was derived from a total of 12216 genes. Constitutive exons were defined from genes having definitive annotation in the NCBI Reference Sequence (RefSeq) collection, whose transcripts demonstrated no evidence of alternative splicing. Protein-coding exons were derived by BLAST searching of exons against cDNA sequences, allowing the elimination of non-coding and partially coding exons. We also created a database of 2620 alternatively spliced (cassette) exons from RefSeq genes with multiple transcripts, by mapping exons from these genes to their respective genomic coordinates. For comparison with the alternative exons, we created a set of 2880 constitutive exons selected to have a similar length distribution (same mean and standard deviation of exon lengths). A database of 20580 repeat-free intronic pseudo exons was kindly provided by Dr Lawrence Chasin (37). Sequence databases are available upon request.
ESE motif scores were calculated using the position weight matrices available in ESEfinder version 2.0 (http://rulai.cshl.edu/tools/ESE/) (33). The default threshold values from the program were used. For the purposes of this study, we considered only above-threshold (high-score) ESE motifs as being significant. These thresholds were defined previously as the median of the highest score for each sequence in a set of randomly chosen 20 nt sequences from the starting pool used for the functional SELEX experiments (33). Note that the motif scores for different SR proteins are not directly comparable (33). Shuffled exonic and intronic sequences were generated using the EMBOSS Shuffleseq program (http://emboss.sourceforge.net/apps/shuffleseq.html).
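The scoring step described above can be illustrated with a minimal sketch. The matrix and threshold below are hypothetical toy values chosen only for illustration; they are not the ESEfinder matrices or thresholds, and the use of a strict "greater than" comparison for "above-threshold" is an assumption.

```python
# Hypothetical 3-position score matrix (NOT an ESEfinder matrix).
# Real ESEfinder matrices are 6-8 positions wide.
TOY_MATRIX = [
    {"A": 1.0, "C": -1.0, "G": 0.5, "T": -2.0},
    {"A": -1.0, "C": 2.0, "G": 0.0, "T": -1.0},
    {"A": 0.5, "C": -0.5, "G": 1.5, "T": -1.0},
]
TOY_THRESHOLD = 2.5  # illustrative value only


def high_score_motifs(seq, matrix, threshold):
    """Return (start, score) for every window scoring above the threshold."""
    width = len(matrix)
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        # Skip windows containing ambiguous bases such as N.
        if any(base not in "ACGT" for base in window):
            continue
        # Window score = sum of the per-position scores of its bases.
        score = sum(matrix[i][base] for i, base in enumerate(window))
        if score > threshold:
            hits.append((start, score))
    return hits


def motifs_per_nt(seq, matrix, threshold):
    """High-score motif count divided by sequence length (ESEs/nt)."""
    return len(high_score_motifs(seq, matrix, threshold)) / len(seq)
```

Because only above-threshold windows are counted, every hit contributes equally to the frequency regardless of how far its score exceeds the threshold, in line with the treatment of high-score motifs in this study.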
Splice-site scores were calculated using score matrices derived from the exon-finding program MZEF (40). The matrices are based on position-dependent triplet-frequency preferences for real versus pseudo splice sites in the window (−15, +3) for 3′ splice sites and (−3, +8) for 5′ splice sites.
Bootstrap sampling was used to determine the level of significance for the differences in average ESE motif frequencies between exons and their flanking introns, and exons and pseudo exons. The mean ESEs/nt from random selections of 10000 sequences from the exon, intron and pseudo exon groups were sampled and compared 5000 times to derive P-values. ESE motif frequency distributions were compared by quantile–quantile analysis, and median values were compared by the two-sample t-test. The significance of the overlap between motifs recognized by ESEfinder and RESCUE-ESE or the putative ESEs of Zhang and Chasin was defined by Fisher's exact test. Statistical tests with P-values <0.01 were deemed significant.
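One way to picture the bootstrap comparison is the sketch below. It assumes that sampling is done with replacement and that the P-value is the fraction of iterations in which the first group's mean is not greater than the second's; the text does not specify either detail, so both are assumptions, and the function name is hypothetical.

```python
import random


def bootstrap_p_value(group_a, group_b, sample_size, n_iter, seed=0):
    """One-sided bootstrap P-value for mean(group_a) > mean(group_b).

    Each iteration draws sample_size values with replacement from each
    group and compares the two sample means.
    """
    rng = random.Random(seed)
    not_greater = 0
    for _ in range(n_iter):
        mean_a = sum(rng.choices(group_a, k=sample_size)) / sample_size
        mean_b = sum(rng.choices(group_b, k=sample_size)) / sample_size
        if mean_a <= mean_b:
            not_greater += 1
    return not_greater / n_iter
```

With the values given in the text, a call would take the form `bootstrap_p_value(exon_freqs, intron_freqs, sample_size=10000, n_iter=5000)`, where `exon_freqs` and `intron_freqs` are hypothetical lists of per-sequence ESEs/nt.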
To date, most studies of ESE function have concentrated on their role in alternative splicing, although functional ESEs are also present in constitutive exons (13,14,41,42). Important questions remain unanswered, including the extent to which ESEs participate in the process of constitutive splicing. A large-scale analysis of ESE motif distribution in both exons and introns would give some indication of their functional relevance to splicing events of this nature.
We created a database of 63218 constitutively spliced internal protein-coding exons of lengths ≥100 nt from 12216 human genes. To standardize for differences in exon length, we created composite 100 nt exon sequences consisting of 25 nt from each end plus 50 nt from the center. To ensure that the exons were constitutively spliced, sequences were collected from single-transcript genes. Exonic sequences plus 100 nt each of flanking upstream and downstream intronic sequences were retrieved from Ensembl. For comparison, we calculated ESE motif frequencies from a database of 20580 repeat-free intronic pseudo exons (37) also standardized to 100 nt, plus 100 nt each of 5′- and 3′-flanking sequences. ESEfinder scores sequences for the presence of motifs matching the SELEX-derived consensus for four SR proteins: SF2/ASF, SRp40, SRp55 and SC35 (33). We calculated high-score ESE motif frequencies occurring at each position in consecutive windows of 10 nt. For the purposes of this study, all above-threshold values for a given motif were considered to be equivalent.
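The construction of the composite 100 nt exon can be sketched as follows. The placement of the central 50 nt window when the exon length is odd (rounding the start position down) is an assumption, as the text does not state it.

```python
def composite_exon(exon, end_len=25, center_len=50):
    """Build a fixed-length composite: 5' end + central window + 3' end."""
    total = 2 * end_len + center_len
    if len(exon) < total:
        raise ValueError("exon is shorter than the composite length")
    # Start of the central window; integer division rounds down for odd lengths.
    center_start = (len(exon) - center_len) // 2
    center = exon[center_start:center_start + center_len]
    return exon[:end_len] + center + exon[-end_len:]
```

For an exon of exactly 100 nt the three pieces tile the sequence, so the composite is the exon itself; for longer exons the regions between the ends and the center are omitted.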
The ESE motif frequency distributions (ESEs/10 nt) were plotted separately for each SR protein (Figure 1). Points were plotted at the central position of the high-score motif. ESE motif frequencies were higher within exons than in the flanking intronic sequences for all four SR proteins. Sharp peaks and troughs at the exon/intron borders are a consequence of the conserved splice-site sequences. To avoid the contribution of the splice-site consensus motifs, we calculated the mean ESE motif frequencies (ESEs/nt) at the exact center of the exons and each of the flanking intronic regions (50 nt upstream of the 3′ splice-site, and 50 nt downstream of the 5′ splice-site) (Table 1). A bootstrap sampling strategy of the mean ESE motif frequencies revealed that the higher density of ESE motifs in exons than in introns was statistically significant for all four SR proteins, and the P-values were all <0.001 for comparisons with both upstream and downstream flanks. ESE motif frequencies were approximately constant within exons. By comparison, ESE motif frequencies in pseudo exons were significantly lower than in authentic exons for three of the four SR proteins (Table 1) (P-values <0.002 for SF2/ASF, <0.01 for SC35, <0.006 for SRp40 and <0.02 for SRp55). The frequencies of ESE motifs in intronic pseudo exons were similar to the frequencies found in the other intronic regions analyzed. As a control, we shuffled the exonic and intronic sequences, maintaining the nucleotide composition, and scored the resulting sequences with ESEfinder. The frequency of ESE motifs in the shuffled exonic sequences decreased for all four of the SR proteins, whereas the frequencies in shuffled intronic sequences were higher than in the real intronic sequences (data not shown). This provides further evidence for the functionality of the ESEfinder motifs when present at exonic locations.
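The shuffling control can be illustrated with a short stand-in for the EMBOSS Shuffleseq program. This sketch only permutes the nucleotides of one sequence, which preserves mononucleotide composition; it is not the Shuffleseq implementation.

```python
import random


def shuffle_sequence(seq, seed=None):
    """Return a random permutation of seq with identical base composition."""
    rng = random.Random(seed)
    bases = list(seq)
    rng.shuffle(bases)  # in-place permutation of the nucleotides
    return "".join(bases)
```

Scoring the shuffled sequences with the same matrices and thresholds then gives the motif frequency expected from nucleotide composition alone.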
We observed a wide variation in the absolute numbers of ESE motifs per exon when we analyzed the complete exons in our constitutive exon database (Figure 2). The exons ranged in size from 100 nt to ~6 kb, and there was a modal frequency of 14 ESE motifs per exon. Interestingly, a small number of exons (158) contained no ESE motifs, although it should be emphasized that the current version of ESEfinder searches for high-score motifs for only 4 of the ~10 SR proteins.
It has been postulated that one function of ESEs is the recruitment of spliceosomal components to weak 5′ or 3′ splice sites (43). Therefore, it is possible that exons with weak splice sites will have elevated frequencies of ESEs. This property was one of the criteria used to identify ESE motifs by RESCUE-ESE (35). We chose to investigate this hypothesis in the context of constitutive splicing, to eliminate as far as possible any complications arising from mechanisms regulating alternative splicing.
We calculated the 5′ and 3′ splice-site values for each exon in our constitutive exon database. We then ranked the exons as strong (top 15%) or weak (bottom 15%) for 5′ and 3′ splice sites independently. ESEfinder was used to calculate high-score ESE motifs from the four groups of exons. The number of high-score ESE motifs was divided by exon length to give ESEs/nt, and the frequency distributions were plotted as number of exons versus ESEs/nt. The ESE motif frequency distributions of the exons with strong and weak 3′ splice sites were compared, as were the distributions of the exons with strong and weak 5′ splice sites, by quantile–quantile analysis (Supplementary Figure 1). This type of analysis determines if two datasets come from populations with a common distribution. If the strong and the weak splice-site score exons have the same distribution of ESE motifs, then the points will fall approximately on the 45° reference line. Departure from the 45° reference line, either below or above, indicates higher ESEs/nt in exons with strong or weak splice sites, respectively. Differences in the ESE motif frequency distributions were observed between exons with weak and strong splice sites, for both 5′ and 3′ splice sites, for some of the SR proteins. A summary of the data is shown in Table 2. The correlation of ESE frequencies with splice-site strength reveals a complicated relationship. For most of the comparisons, there are no significant differences between exons with strong versus weak splice sites. However, exons with weak 5′ splice sites and exons with weak 3′ splice sites have more SRp55 and SF2/ASF motifs, respectively. In contrast, exons with strong 5′ splice sites have significantly more SRp40 motifs than their weak splice-site counterparts, and exons with strong 3′ splice sites have significantly more SC35 and SRp40 motifs.
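The quantile–quantile comparison can be sketched as below, with strong splice-site exons on the x-axis and weak splice-site exons on the y-axis, so that points below the 45° line indicate higher ESEs/nt in the strong group. The number of quantiles (20) and the function names are assumptions made for illustration.

```python
from statistics import quantiles


def qq_points(strong, weak, n=20):
    """Pair matching quantiles of two samples: x = strong, y = weak."""
    return list(zip(quantiles(strong, n=n), quantiles(weak, n=n)))


def side_of_reference_line(points):
    """Count points above (y > x) and below (y < x) the 45-degree line."""
    above = sum(1 for x, y in points if y > x)
    below = sum(1 for x, y in points if y < x)
    return above, below
```

If the two groups share a distribution, the counts above and below the line should be small and roughly balanced; a consistent excess on one side corresponds to the departures summarized in Table 2.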
We further classified our constitutive exon dataset into strong exons possessing both strong 5′ and 3′ splice sites (top 15%), or weak exons possessing both weak 5′ and 3′ splice sites (bottom 15%). Quantile–quantile analysis (Figure 3) of the ESE motif frequency distributions of these two groups of exons revealed significant differences in ESE motif prevalence for three of the SR proteins: exons with strong splice sites have more SC35 and SRp40 motifs, whereas exons with weak splice sites have more SRp55 motifs (Table 3). Therefore, there does not appear to be a simple correlation between ESE motif frequencies and splice-site strengths. When we combined the output of all four matrices for the exon datasets in Tables 2 and 3, the differences between strong and weak exons were averaged out (data not shown). Our observations with the individual matrices suggest a potential role for a subset of the motifs and corresponding SR proteins in the recognition of exons associated with weak splice sites.
Alternative splicing events have previously been documented to be associated with weak splice sites (44), traditionally on a single transcript basis. Such a correlation is limited by the lack of large-scale analyses. One recent report analyzed relatively large datasets of both 5′ and 3′ splice site scores from constitutive and alternative exons from a number of different species, and found consistently higher scores for the constitutive exons (45). However, the link between splice-site score and alternative splicing remains unclear, and may not reflect a simple relationship. The results of our studies of splice-site score and ESE motif frequencies in constitutive exons led us to investigate the corresponding frequencies in alternative exons, and their correlation with alternative splicing events. There are several forms of alternative splicing [reviewed in (46)], and for simplicity we chose to investigate the most common one, namely exon skipping/inclusion.
We created a database of 2620 skipped internal protein-coding exons from RefSeq genes with multiple transcripts, and scored them with ESEfinder. High-score ESE motifs were divided by exon length to give ESEs/nt. This analysis was repeated on a set of 2880 constitutive exons selected to have a similar length distribution (same mean and standard deviation of exon lengths). ESE motif frequency distributions were derived and compared by quantile–quantile analysis (Figure 4). Departure from the 45° reference line, either below or above, indicates higher ESEs/nt in constitutive or skipped exons, respectively. Scoring for all four SR proteins combined revealed that ESE motif frequencies were significantly lower in skipped compared with constitutive exons, with median values of 0.1466 and 0.1605 ESEs/nt, respectively (two-sample t-test, P < 0.00001). The same result was obtained when the exons were scored for individual SR proteins. For example, skipped and constitutive exons scored for SF2/ASF motifs had median values of 0.0384 and 0.0421 ESEs/nt, respectively (P < 0.0001).
The observation that skipped exons had significantly fewer ESE motifs than constitutive exons led us to examine the ESE motif frequency distribution in the flanking intronic regions of skipped exons. We used ESEfinder to score 100 nt each of flanking upstream and downstream intronic sequence. Mean ESE motif frequencies (ESEs/nt) at the exact center of the skipped exons and each of their flanking introns (50 nt upstream of the 3′ splice-site, and 50 nt downstream of the 5′ splice-site) were calculated (Table 4). Bootstrap resampling of the mean frequencies demonstrated that only the SF2/ASF motifs were significantly higher in the skipped exons compared with their flanking introns (P < 0.001 for comparison with upstream intron, P < 0.003 for comparison with downstream intron).
Calculation of the splice-site scores using position weight matrices (40) revealed that the skipped exons had significantly weaker splice sites than the constitutively spliced exons. The mean values with standard deviations were 84.2 ± 2.25 and 83.7 ± 2.7 for constitutive and skipped 3′ splice sites, respectively, and 46.93 ± 1.7 and 46.67 ± 2.09 for constitutive and skipped 5′ splice sites, respectively. These values were significantly different (all P-values <0.01) when analyzed by both parametric (one sample t-test) and non-parametric (Wilcoxon rank test) statistical methods. It should be noted that although the mean splice-site scores are significantly different, the distributions of splice-site scores for both exon types are very similar (Figure 5), and that splice-site scores alone are insufficient to identify an exon as one that is alternatively spliced.
Two recent reports employed ab initio computational methods to predict sequences that have ESE activity. RESCUE-ESE, developed by the Burge laboratory (35), identified 238 hexamers preferentially associated with constitutive exons with weak splice sites, whereas the methodology of Zhang and Chasin (37) identified octamers overrepresented in non-protein-coding exons compared with the 5′-UTR of intronless genes and pseudo exons. Both groups tested a number of candidate motifs and demonstrated enhancer function in transfected cells. Although these two methods and ESEfinder differ substantially, there may be some overlap in the sequences they recognize as putative ESEs.
The functional SELEX-derived consensus motifs are a hexamer for SRp55, heptamers for SF2/ASF and SRp40, and an octamer for SC35 (33). Because the sequences identified by RESCUE-ESE are hexamers, we expanded each RESCUE-ESE motif by the addition of either 1 (for SF2/ASF and SRp40) or 2 (for SC35) nt, and scored the resulting sequences with ESEfinder. As a control, we calculated the number of all possible ESEfinder high-score motifs for each SR protein. Of all 16384 heptamers, 678 (4.1%) are high-score SF2/ASF motifs; 669 (4.1%) of all heptamers are high-score SRp40 motifs; 2599 (4.0%) of all 65536 octamers are high-score SC35 motifs; and 133 (3.2%) of all 4096 hexamers are high-score SRp55 motifs. Using these percentages, we then calculated the expected number of ESEfinder high-score motifs from a complete random sample of all possible oligonucleotide sequences equal in length to the test set of RESCUE-ESE sequences. For example, for SF2/ASF and SRp40 (heptamer consensus motifs), the addition of 1 nt at either the beginning or the end of the RESCUE-ESE hexamers results in (4 × 238) × 2 = 1904 sequences. We then calculated the expected number of ESEfinder high-score motifs from 1904 random heptamers. The results of the comparison (Table 5) indicate that the sequences recognized as ESE motifs by RESCUE-ESE and ESEfinder do not overlap beyond what is expected by chance.
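The arithmetic behind the heptamer control can be reproduced with a short sketch. The hexamer `GAAGAA` is an arbitrary example, not necessarily a member of the RESCUE-ESE set, and the function names are hypothetical; the counts 238, 678 and 16384 are taken from the text.

```python
def extend_by_one(hexamer):
    """All heptamers made by adding one nucleotide to either end.

    Duplicates are kept, matching the (4 x 238) x 2 count in the text.
    """
    return [b + hexamer for b in "ACGT"] + [hexamer + b for b in "ACGT"]


def expected_high_score(n_sequences, n_high_score, n_possible):
    """Expected high-score motifs among n_sequences random oligomers."""
    return n_sequences * n_high_score / n_possible


# 238 hexamers, each giving 8 extended heptamers.
n_test = 238 * len(extend_by_one("GAAGAA"))
# 678 of the 4**7 = 16384 possible heptamers are high-score SF2/ASF motifs.
expected_sf2 = expected_high_score(n_test, 678, 4 ** 7)
```

The observed number of high-score motifs among the extended RESCUE-ESE sequences is then compared with this chance expectation.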
A similar strategy was employed to investigate the extent of overlap between the sequences recognized by ESEfinder and the 2069 putative ESEs (PESEs) identified by Zhang and Chasin (37). The PESEs were downloaded (http://www.columbia.edu/cu/biology/faculty/chasin/xz3/octamers.txt) and high-score ESE motifs calculated with ESEfinder. As a control, we calculated the expected number of high-score ESEfinder motifs from a random sample of all possible oligonucleotide sequences equal in length to the test set of sequences. For example, there are two possible heptamers contained within any one octamer; therefore, for SF2/ASF and SRp40, we calculated the expected number of high-score ESEfinder motifs from 2 × 2069 = 4138 random heptamers. The results for SC35, SRp40 and SRp55 (Table 6) reveal that high-score ESE motifs for these three proteins are not enriched within the PESE set. However, there are significantly more SF2/ASF motifs (Table 6) than would be expected by chance (P < 0.00001, Fisher's exact test), supporting the conclusion that there is some overlap between the sequences identified as ESE motifs by these two very different methods.
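The size of the random control set follows from counting the shorter windows contained in each octamer, as in this small sketch. The octamer shown is an arbitrary example rather than an entry from the PESE list.

```python
def subwords(seq, width):
    """All contiguous windows of the given width within seq."""
    return [seq[i:i + width] for i in range(len(seq) - width + 1)]


octamer = "ACGTACGT"  # arbitrary example octamer
heptamers = subwords(octamer, 7)  # two heptamers per octamer
n_random_heptamers = len(heptamers) * 2069  # size of the heptamer control set
```

The same function gives three hexamers per octamer when called with a width of 6.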
The importance of cis-regulatory sequences for accurate splice-site recognition and exon definition is well documented. However, most experimental studies to date have focused on the regulation of single splicing events. A more global understanding of pre-mRNA splicing requires some knowledge of the distribution of both splicing enhancers and silencers. Using ESEfinder (33), we have undertaken a large-scale genomic analysis in an attempt to uncover relationships between ESE motif frequencies and splicing regulation. Many of the experimental studies of ESE function have involved examination of their role in the regulation of alternative splicing, and as such little is known about their functional relevance to the process of constitutive splicing. Our studies implicate ESE participation in the regulation of both constitutive and alternative splicing.
Previously, the SR protein-specific matrices utilized by ESEfinder were used to search a limited set of genomic sequences for ESE motifs, which were found to occur more frequently in exons versus introns (31,32,47). We have greatly expanded these initial observations, and demonstrated a significant enrichment for ESE motifs in >60000 internal constitutive protein-coding human exons. The motifs identified by the RESCUE-ESE technique (35) and the PESEs of Zhang and Chasin (37) also occur more frequently in exons versus introns. ESEfinder motif frequencies within exons were approximately constant, supporting the hypothesis that ESEs function to activate splicing from varying distances from the splice sites, an observation also made for the exonic distribution of PESEs (37). In addition, constant ESE motif frequencies along exons may be a consequence of the ability of single enhancer motifs to influence recognition of both 3′ and 5′ splice sites (43,48,49). The functional SELEX experiments used to derive the ESEfinder matrices were dependent upon the ability of sequences to enhance splicing of a 3′ terminal exon (31,32). However, numerous studies have implicated ESE motifs identified by ESEfinder in the splicing of internal exons (13–17,20–26,34) and our new data support the conclusion that these ESE motifs play a role in the splicing of internal exons, in addition to terminal exons.
ESE motif frequencies for three of the four SR proteins were significantly higher in exons versus pseudo exons, supporting a role for ESEs in exon definition, and consistent with previous studies of genomic ESE motif distributions (37,47). Zhang and Chasin (37) found fewer PESEs in the same set of pseudo exons that we analyzed with ESEfinder, but identification of the PESE motifs was conditional on their overrepresentation in exons versus pseudo exons. Therefore, the observation that the PESE motifs were more frequent in a second test set of exons versus pseudo exons was a logical expectation (37). The functional SELEX experiments used to derive the ESEfinder motifs imposed no such a priori criteria; therefore, the fact that these motifs are present at significantly higher frequencies in exons versus pseudo exons supports the conclusion that they are involved in exon definition. In addition, there is evidence supporting a role for silencers in the suppression of pseudo exon splicing: a subset of pseudo exons with a relatively high frequency of ESEfinder motifs was found to have increased frequencies of elements capable of silencing splicing (47); and Zhang and Chasin (37) also observed overrepresentation of putative exonic splicing silencers in pseudo exons.
Experimental evidence demonstrated a role for ESEs in constitutive splicing (13,14,41,42), a function supported by our bioinformatic analysis. One ascribed function of ESEs is facilitating the recognition of suboptimal splice sites. Indeed, improving weak 3′ splice-site polypyrimidine tracts negates the enhancer requirement for a number of substrates (50,51). However, there is no evidence that all exons with weak splice sites have an increased dependence upon ESEs. Our comparison of ESE motif frequencies in constitutive exons with weak and strong splice sites implicates ESE involvement in splice-site recognition of all exons. We observed significant differences in some ESE motif frequencies when constitutive exons with strong and weak 3′ or 5′ splice sites were compared independently, or when exons with both strong 3′ and 5′ splice sites were compared with their counterparts with weak sites. However, there was not a simple relationship between splice-site score and ESE motif frequency, as in some instances exons with strong splice sites were found to contain more ESE motifs. In addition, when we repeated this analysis using Zhang and Chasin's PESEs, we observed no difference in the frequency of PESEs in exons with weak splice sites compared with those with strong splice sites (data not shown). It remains possible that weak splice sites tend to be associated with stronger ESEs, rather than with an increased number of ESEs, although it is known that multiple ESEs in the same exon act additively (52). This hypothesis remains to be tested, and will require a more quantitative version of ESEfinder.
A recent survey revealed an increase in the number of ESE motifs identified by RESCUE-ESE in the vicinity of the splice sites of constitutive exons (53). We only observed this trend with SF2/ASF and SRp55 motifs in exons with weak 3′ and 5′ splice sites, respectively. As described above, ESE motifs for some of the SR proteins are actually higher in exons with strong splice sites. These differences in ESE motif distributions may be a consequence of the very different methods used in their identification. The motifs identified as putative enhancers by RESCUE-ESE were constrained by the requirement to be enriched in constitutive exons with weak splice sites, whereas the sequences identified by functional SELEX were selected by their ability to activate exon inclusion in the presence of a particular SR protein. It is possible that RESCUE-ESE identified a set of enhancer sequences involved in the recognition of a restricted set of exons, and that ESEfinder recognizes enhancers involved in a more general aspect of exon definition.
Alternative splicing serves to greatly expand the proteome, with one recent report estimating that up to 74% of multiexon human genes are alternatively spliced (54). ESEs, and the SR proteins that bind them, have well defined roles in regulating the process of alternative splicing [reviewed in (1,4,5,44,55)]. A commonly held assumption states that exons that undergo alternative splicing have weaker splice sites, by comparison with those that are constitutively spliced. Our previous analysis of a limited set of alternatively spliced exons supported this assumption (56). In addition, a recent report found significantly higher splice-site scores for constitutive versus alternative exons in five species, including humans (45). We derived large datasets of constitutive and alternatively spliced (included or skipped) protein-coding human exons, and again demonstrated that alternatively spliced exons as a set have significantly weaker splice-site scores. However, the splice-site score distributions are surprisingly similar and largely overlapping, such that the splice-site scores alone are not sufficient to define a given exon as constitutive or alternative.
Intriguingly, we found that skipped exons have significantly fewer ESE motifs than constitutively spliced exons. In addition, skipped exons, unlike those that are constitutively spliced, do not have increased ESE motif frequencies in comparison with their flanking intronic regions, except for one of the four SR proteins tested, SF2/ASF. Zhang and Chasin (37) likewise reported finding fewer PESEs in alternative exons compared with constitutive exons, and a comparable number or slightly fewer RESCUE-ESE motifs were observed in skipped exons (35). One can speculate that fewer ESEs per exon may result in less efficient exon definition, and subsequently lead to exon skipping. However, this remains a hypothesis that will require appropriate experimental validation. Two recent publications (57,58) reported significant conservation of the flanking intronic regions of alternatively spliced exons, perhaps implying a function for intronic motifs in the control of alternative exon definition.
ESE motif identification by functional SELEX and by the computational methods of RESCUE-ESE or Zhang and Chasin's octamer analysis relies upon different methodologies. However, the motifs identified share some commonalities, namely overrepresentation in exons versus introns, and in constitutive versus alternatively skipped exons. Interestingly, our analysis revealed that the ESE motifs recognized by ESEfinder and RESCUE-ESE do not significantly overlap. Nevertheless, experimental data demonstrated the ability of both methods to define functional enhancers (31,32,35), and as described above, these differences may arise at least in part from the constraint of association with weak splice sites inherent in RESCUE-ESE. Over 80% of the RESCUE-ESE hexamers are found in the collection of PESEs (37). However, in contrast to the analysis of RESCUE-ESE motif distribution (53), there was no increase in PESE frequency near the splice sites (37). This difference may be due to differences in the exonic databases analyzed, or it may be a consequence of a small subset of the RESCUE-ESE motifs accounting for the observed increase near splice sites. Our scoring of Zhang and Chasin's PESEs with ESEfinder revealed no enrichment for high-score SC35, SRp40 or SRp55 motifs. However, we did find an increase over the expected number of SF2/ASF motifs within the PESE group, indicating some overlap between the two methods. It should be noted that our analysis is limited to four SR proteins, and it is highly probable that both the set of RESCUE-ESE hexamers and the PESE octamers contain enhancer sequences recognized by other SR and non-SR proteins, though these methods do not identify the factors responsible for motif recognition.
ESEfinder scores sequences for the presence of putative enhancers, and we emphasize that experimental validation is required for definitive proof that any given motif is a bona fide ESE in its natural context. Other factors may influence the ESE potential of any given motif. These include sequence context, e.g. the presence of nearby silencers, secondary structure effects and tissue-specific splicing factor concentrations. Experimental efforts are underway to refine the original matrices. Future improvements will include experimental refinement of threshold values, and additional SR protein-specific matrices.
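To make the notion of matrix-based scoring against a threshold concrete, the following is a minimal sketch of how a sequence could be scanned with a position weight matrix. It is an illustration only and not the ESEfinder implementation: the four-position matrix `TOY_MATRIX`, the cutoff `TOY_THRESHOLD`, the function names, and the choice to count scores at or above the cutoff as hits are all assumptions made for the example, and the values are not taken from any SR protein matrix.

```python
from typing import Dict, List, Tuple

# Hypothetical 4-position score matrix (one dict per motif position).
# These numbers are invented for illustration; they are NOT ESEfinder values.
TOY_MATRIX: List[Dict[str, float]] = [
    {"A": 1.0, "C": -1.0, "G": 0.5, "U": -2.0},
    {"A": -1.0, "C": 1.5, "G": 0.0, "U": -1.0},
    {"A": 0.5, "C": -0.5, "G": 1.0, "U": -1.5},
    {"A": 1.0, "C": 0.0, "G": -1.0, "U": -0.5},
]

# Hypothetical cutoff; real thresholds would be set per matrix.
TOY_THRESHOLD = 2.0


def score_window(window: str, matrix: List[Dict[str, float]]) -> float:
    """Sum the per-position scores of one window of motif width."""
    return sum(column[base] for column, base in zip(matrix, window))


def find_motifs(
    seq: str, matrix: List[Dict[str, float]], threshold: float
) -> List[Tuple[int, str, float]]:
    """Slide the matrix along seq; return (offset, window, score) for hits.

    Assumption: a window counts as a hit when its score is >= threshold.
    """
    rna = seq.upper().replace("T", "U")  # accept DNA or RNA input
    width = len(matrix)
    hits = []
    for i in range(len(rna) - width + 1):
        window = rna[i:i + width]
        if any(base not in "ACGU" for base in window):
            continue  # skip windows containing ambiguous bases such as N
        s = score_window(window, matrix)
        if s >= threshold:
            hits.append((i, window, s))
    return hits


def motifs_per_100nt(
    seq: str, matrix: List[Dict[str, float]], threshold: float
) -> float:
    """Motif frequency normalized to sequence length (hits per 100 nt)."""
    if not seq:
        return 0.0
    return 100.0 * len(find_motifs(seq, matrix, threshold)) / len(seq)


if __name__ == "__main__":
    for offset, window, s in find_motifs("TTACGATT", TOY_MATRIX, TOY_THRESHOLD):
        print(offset, window, s)
```

A hit reported by such a scan indicates only that a window resembles the matrix; as noted above, it says nothing about nearby silencers, secondary structure or factor concentrations, which is why a high score cannot by itself establish that a motif functions as an ESE.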
Supplementary Data is available at NAR Online.
We thank Lawrence Chasin for kindly providing the pseudo exon database. This work was supported by NIH grants GM42699 to A.R.K. and HG01696/CA88351 to M.Q.Z., and by a postdoctoral fellowship from the US Army Medical Research and Matériel Command to P.J.S. The Open Access publication charges for this article were waived by Oxford University Press.
Conflict of interest statement. None declared.