Alternative splicing is an important regulatory mechanism for many species, allowing them to generate multiple variant proteins from the same primary transcript. To predict the complete protein complement of any eukaryote, we need to detect alternative splice sites and put them together in the correct combinations. Algorithmic approaches to splice site prediction have relied mainly on the consensus patterns found at the boundaries between protein coding and non-coding regions [1]. However, the sequence conservation found at the splice site junctions is not strong enough to accurately differentiate between introns and exons [2]. Additional sequences, residing at variable distances from splice sites, have been shown to function as cis-acting factor binding sites that regulate splicing either in vivo or in vitro. Although such splicing regulators have been identified in both exons and introns, exonic splicing regulators (ESRs) are generally better characterized, and are probably more common [3, 4]. Such ESRs either enhance or suppress the utilization of both 5' and 3' splice sites. Much attention has been given to exonic splicing enhancers (ESEs), which promote the inclusion (as opposed to skipping) of the exons in which they reside. The first ESEs to be characterized were short, purine-rich motifs containing repeated GAR (GAA or GAG) trinucleotides, but subsequently many other sequences have been shown to have enhancer activity [5, 6].
In animals, many exonic splicing enhancers are bound and activated by one or more of several related splicing factors known as SR proteins. The relationship between sequence-specific binding by SR proteins and the activation of splicing by exonic splicing enhancers is complex and incompletely understood. Although only a dozen or so splicing events have been shown to be enhancer-dependent, the existence of exonic splicing enhancers within constitutively spliced exons [7], the frequency of ESE motifs [8] and the absolute requirement for SR proteins by in vitro splicing systems suggest that ESEs are ubiquitous, and required for all splicing events. It is estimated that as many as 15–20% of randomly appearing 20-mers contain a splicing enhancer [3], and computational methods have predicted hundreds of ESE motifs [9, 10]. Thus, it appears likely that many sequences can affect splicing. What is clear is that the motifs recognized by SR proteins are short (8 or fewer nucleotides) and degenerate [6, 11, 12].
Several computational approaches have been undertaken to find the motifs characteristic of these splicing regulatory elements. In a recent study, Goren and colleagues [13] introduced a computational method that identifies ESRs based on conservation of wobble positions between orthologous human and mouse exons. Their method identified 285 putative ESRs, from which a sample of ten elements was shown experimentally to induce different levels of regulatory effects on alternative splicing. RESCUE-ESE, another computational approach, identifies potential ESEs based on the hypothesis that exons with weak splice sites are more likely to require ESE activity for splicing [9]. The original study identified 283 exonic hexamers that were significantly enriched both in human exons relative to introns and in exons with weak splice sites relative to exons with strong splice sites; in vivo tests of these hexamers confirmed ESE activity. In another study, Zhang and Chasin [10] predicted human ESR motifs by comparing the frequency of 8-mers in internal noncoding exons versus unspliced pseudo exons and 5' UTRs of transcripts of intronless genes.
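The two-way enrichment criterion described above for RESCUE-ESE can be illustrated with a minimal sketch. It is not the published implementation: the function names (`kmer_counts`, `enrichment_z`, `candidate_eses`), the two-proportion z-score used as the enrichment statistic, and the cutoff `z_cut=2.5` are all assumptions made here for illustration, and the input sets (exons, introns, exons with weak or strong splice sites) are assumed to be supplied as plain sequence strings.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seqs, k=6):
    """Count every overlapping k-mer made only of A, C, G, T."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(len(s) - k + 1):
            kmer = s[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    return counts


def enrichment_z(counts_a, counts_b):
    """Two-proportion z-score of each k-mer's frequency in set A versus set B.

    This statistic is an illustrative choice, not necessarily the one used
    in the original study.
    """
    n_a = sum(counts_a.values())
    n_b = sum(counts_b.values())
    if n_a == 0 or n_b == 0:
        return {}
    scores = {}
    for kmer in set(counts_a) | set(counts_b):
        p_a = counts_a[kmer] / n_a
        p_b = counts_b[kmer] / n_b
        pooled = (counts_a[kmer] + counts_b[kmer]) / (n_a + n_b)
        se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
        scores[kmer] = (p_a - p_b) / se if se > 0 else 0.0
    return scores


def candidate_eses(exons, introns, weak_exons, strong_exons, k=6, z_cut=2.5):
    """Return k-mers enriched in exons vs introns AND in weak vs strong exons."""
    z_ei = enrichment_z(kmer_counts(exons, k), kmer_counts(introns, k))
    z_ws = enrichment_z(kmer_counts(weak_exons, k), kmer_counts(strong_exons, k))
    return sorted(
        kmer for kmer in z_ei
        if z_ei[kmer] > z_cut and z_ws.get(kmer, 0.0) > z_cut
    )
```

Requiring both scores to exceed the cutoff is what distinguishes this criterion from a simple exon-versus-intron composition test: a hexamer that is merely typical of coding sequence would pass the first comparison but not necessarily the second.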
Previous computational work on detecting ESEs has focused almost exclusively on mammalian species. There are compelling reasons to believe that ESEs play an important role in plants as well. Early research on plant pre-mRNA splicing emphasized the role of AU-rich or U-rich sequences within introns [14, 15]. These U-rich sequence elements play important roles in intron definition, and plants lack the very large introns that are associated with the need for exon definition [16]. On the other hand, a number of reports describe a role for exon sequences in the selection of plant splice sites [17-19]. SR proteins, the mediators of ESE activity in vertebrates, are highly conserved in plants [20, 21]. This pattern of conservation includes reactivity with the monoclonal antibody mAb104 [22] and extends to function. A mixture of Arabidopsis SR proteins [23], and atRSZp22 in particular [24], can complement SR-deficient mammalian splicing extracts. Furthermore, plant SR proteins can influence splice site choice in mammalian nuclear extracts [25], and can regulate alternative splicing in planta [26, 27].
The focus of this study is a new computational approach to identifying ESE motifs in the model plant Arabidopsis thaliana, and their use in improving splice site prediction accuracy. First, we apply an approach similar to RESCUE-ESE to identify putative ESE hexamers in the flanking ends of a large set of known internal exons from Arabidopsis. Then we use a Gibbs sampling program called ELPH to identify statistically conserved motifs representing these hexamers in our data. Finally, we show how these motifs can be used to improve splice site prediction. A significant improvement in specificity is obtained by incorporating the hexamer motifs into two leading splice site prediction programs, GeneSplicer [28] and SpliceMachine [29].
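For readers unfamiliar with Gibbs sampling for motif discovery, the general idea can be sketched as follows. This is a generic site sampler written for illustration only and makes no claim about how ELPH is implemented; the function names, the pseudocount, the iteration count and the one-motif-per-sequence assumption are all hypothetical choices. Input sequences are assumed to be uppercase, to contain only A, C, G and T, and to be at least `width` long.

```python
import random

ALPHABET = "ACGT"


def build_pwm(seqs, starts, width, skip=None, pseudo=1.0):
    """Column-wise base frequencies of the current motif instances.

    The sequence at index `skip` (if any) is left out, as required when
    resampling that sequence's motif position.
    """
    counts = [{b: pseudo for b in ALPHABET} for _ in range(width)]
    for idx, (seq, start) in enumerate(zip(seqs, starts)):
        if idx == skip:
            continue
        for j in range(width):
            counts[j][seq[start + j]] += 1
    pwm = []
    for col in counts:
        total = sum(col.values())
        pwm.append({b: col[b] / total for b in ALPHABET})
    return pwm


def background(seqs, pseudo=1.0):
    """Overall base frequencies across all sequences."""
    counts = {b: pseudo for b in ALPHABET}
    for seq in seqs:
        for base in seq:
            counts[base] += 1
    total = sum(counts.values())
    return {b: counts[b] / total for b in ALPHABET}


def gibbs_sample(seqs, width, iterations=200, seed=0):
    """Sample one motif start per sequence; return (starts, final PWM)."""
    rng = random.Random(seed)
    bg = background(seqs)
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        for i, seq in enumerate(seqs):
            pwm = build_pwm(seqs, starts, width, skip=i)
            # Weight each candidate start by its motif/background likelihood ratio.
            weights = []
            for start in range(len(seq) - width + 1):
                w = 1.0
                for j in range(width):
                    base = seq[start + j]
                    w *= pwm[j][base] / bg[base]
                weights.append(w)
            starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts, build_pwm(seqs, starts, width)
```

Because the sampler is stochastic, different seeds can converge to different motifs; in practice several runs are usually compared, and the resulting position weight matrix is what a downstream splice site predictor would score candidate exon flanks against.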