RNA-Seq reads provide information about how exons are connected, which can be exploited in the investigation of AS. RNA-Seq also measures the expression levels of transcripts and their isoforms more accurately, and across a much broader dynamic range, than other methods such as microarrays (20). Capitalizing on these two advantages of RNA-Seq, we identified ASEs from the mouse RNA-Seq data set (19) and calculated the expression levels of the isoforms of the genes containing the selected ASEs. This enabled us to construct reliable positive and negative data sets for SREs and then to employ a powerful discriminative approach to identify enhancers and silencers regulating alternative splicing. We chose the RNA-Seq data for three mouse tissues (19) rather than the more comprehensive RNA-Seq data for 15 human tissues and cell lines (1) for two reasons. First, unlike the human RNA-Seq data (1), the mouse RNA-Seq data (19) have not previously been used to predict SREs. Second, as demonstrated in (19), RNA-Seq reads generated with a protocol using RNA fragmentation provide more uniform coverage along transcripts than those generated with a protocol using cDNA fragmentation (1); thus, the mouse RNA-Seq data allow the expression level of each isoform of each gene to be calculated more accurately.
As shown in (16–18), a discriminative approach using reliable positive and negative data can significantly increase the power of detecting motifs that are over-represented in the positive data set relative to the negative data set, without increasing the false positive rate. However, most computational methods for identifying SREs do not employ a discriminative approach. These include the methods used to identify RESCUE-ESEs from constitutively spliced exons (8) and tissue-specific SREs from microarray data (12), as discussed in the ‘Introduction’ section. Similar to the method used to identify RESCUE-ESEs, intronic sequences flanking constitutively spliced exons were used as background data to identify brain-specific SREs (14). The putative ESEs and ESSs (PESEs/PESSs) were identified by comparing the frequencies of octamers in constitutively spliced non-protein-coding exons with those in a negative control set comprising pseudo exons and the 5′ untranslated regions of intronless genes (9). Although this negative set may be more reliable than the one used in identifying RESCUE-ESEs, it may not be as reliable as the negative data in our method, for the following reason. Pseudo exons are good negative sequences for identifying ESEs because they are never spliced. The ASEs in our exclusion set are likewise not spliced in a given tissue or under a given condition, but they are spliced in other tissues or under other conditions. This is a stronger indication that the ASEs in our exclusion set lack the ESEs that assist the splicing of the ASEs in the positive data. Similar arguments hold for the other enhancers and silencers. In the identification of ESSs from pseudo exons (10), constitutively spliced exons and their flanking intronic regions were used as the negative data set, which is again not as reliable as the ASEs and their flanking intronic regions in our inclusion set, because these ASEs can also be skipped under different conditions.
Another advantage of our discriminative approach is that it can identify both common and tissue-specific SREs. This is an important feature because both tissue-specific splicing factors and tissue-specific expression of constitutive splicing factors may play a role in regulating alternative splicing. If we used constitutively spliced exons as the negative data, as in (8), we would not only lose detection power, as shown in the ‘Results’ section, but also miss the common SREs present in constitutively spliced exons. As a side note, similar to the method used to identify PESEs/PESSs (9), our method does not suffer from sequence biases such as codon or CpG bias, because our positive and negative data sets have similar sequence composition. If a sequence is abundant in both the inclusion and exclusion sets, our discriminative approach generally will not predict it as an SRE, whereas a non-discriminative approach will likely predict it to be both an enhancer and a silencer, which is a contradictory and confusing result. Conversely, if an SRE is abundant in both the data set from which we try to identify it and the background data set, a non-discriminative approach cannot identify it, whereas our discriminative approach using a negative data set is very likely to do so.
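The behaviour described above can be illustrated with a minimal sketch of a discriminative enrichment test. It is an assumption made for illustration only: the one-sided Fisher exact test, the function names and the toy input format are hypothetical and are not necessarily the statistical procedure used in this study. A hexamer that occurs equally often in the positive and negative sets receives a P-value near 1 and is therefore not reported, whereas a hexamer concentrated in the positive set receives a small P-value.

```python
from math import comb


def count_sequences_with(kmer, sequences):
    """Number of sequences that contain the k-mer at least once."""
    return sum(1 for s in sequences if kmer in s)


def fisher_one_sided(a, b, c, d):
    """P(X >= a) for the 2x2 table [[a, b], [c, d]] under the
    hypergeometric null (margins fixed)."""
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    total = comb(n, col1)
    tail = sum(comb(row1, x) * comb(n - row1, col1 - x)
               for x in range(a, upper + 1))
    return tail / total


def enrichment_p(kmer, positive, negative):
    """P-value for over-representation of kmer in the positive set
    relative to the negative set (hypothetical helper)."""
    a = count_sequences_with(kmer, positive)
    c = count_sequences_with(kmer, negative)
    return fisher_one_sided(a, len(positive) - a, c, len(negative) - c)
```

In this sketch, the positive set would hold sequences from one of the inclusion or exclusion sets and the negative set would hold sequences from the other, depending on whether enhancers or silencers are sought.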
To reduce the false positive rate without losing detection power, we used a validation process to determine the cutoff P-value, which was chosen to be 0.03. Specifically, we first used a cutoff P-value of 0.05. This gave 799 SREs, 200 of which had at least one match among the 992 hexamers selected from SpliceAid (22), which contains experimentally identified SREs. We plotted the distributions of the P-values of these 200 SREs and of the remaining 599 SREs, as shown in the accompanying figure. At P < 0.03, the probability density of experimentally validated SREs is generally higher than that of SREs without experimental validation, and this trend is reversed at P > 0.03. Therefore, we selected 0.03 as the cutoff P-value.
Probability density of P-values of the SREs with or without experimental validation. SREs are computationally identified at a significance level of 0.05.
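The comparison of the two P-value distributions can be sketched as follows. This is a hypothetical automated version of the visual inspection described above, not the procedure actually used: the bin edges, the function names and the rule of returning the first bin in which the validated density falls below the unvalidated density are all assumptions. Both input lists are assumed to be non-empty, and values outside the bin edges are left uncounted.

```python
def density(p_values, edges):
    """Histogram density: fraction of values per bin divided by bin width."""
    counts = [0] * (len(edges) - 1)
    for p in p_values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            # Bins are half-open [lo, hi); the last bin also includes hi.
            if edges[i] <= p < edges[i + 1] or (is_last and p == edges[-1]):
                counts[i] += 1
                break
    n = len(p_values)
    return [c / (n * (edges[i + 1] - edges[i])) for i, c in enumerate(counts)]


def crossover_cutoff(validated, unvalidated, edges):
    """Lower edge of the first bin in which the validated density drops
    below the unvalidated density; the last edge if that never happens."""
    dens_val = density(validated, edges)
    dens_unval = density(unvalidated, edges)
    for i, (v, u) in enumerate(zip(dens_val, dens_unval)):
        if v < u:
            return edges[i]
    return edges[-1]
```

With real data the two densities may cross more than once, so a rule of this kind would only be a starting point for choosing the cutoff by inspection.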
About 26% (118/456) of the 456 SREs we identified can be found in the database of experimentally validated SREs. This percentage is slightly higher than that for the SREs identified by Wang et al. and Castle et al. from human tissues. About 48% (221/456) of our SREs are tissue-specific, which shows that tissue-specific SREs play an important role in regulating alternative splicing, as observed earlier. Although only 10% (45/456) of the SREs are common to all three tissues in this study, this does not imply that common SREs are less important, because 45% (207/456) of the SREs were common to two tissues while their status in the third tissue could not be determined. With more data, these SREs could be classified as either common or tissue-specific. Only 18% (20/108) of our ESEs are among the RESCUE-ESEs identified from constitutively spliced exons, and only 14% (15/108) of our ESEs are annotated as common to the three tissues. This suggests that many more tissue-specific ESEs than constitutive ESEs are involved in regulating tissue-specific splicing.
Three SREs merit further discussion: CUCUCU (us' ISS−LM), UGCAUG (ds' ISE–M and ds' ISS−L−) and UCUAUC (ds' ISS−−M and ds' ISE). The first two have been repeatedly identified as SREs by both experimental and computational approaches (12), but our study reveals some new information. Specifically, our position analysis showed that CUCUCU appears 15–30 nt upstream of the ASEs skipped in liver and muscle, but not in brain, with a much higher frequency than at any other location. Since these locations lie in the polypyrimidine tract, CUCUCU most likely functions there as a tissue-specific silencer. Previous results showed that an SRE can act as an enhancer or a silencer depending on its location; for example, UGCAUG can be a ds' ISE or a us' ISS. Our analysis showed in addition that UGCAUG and UCUAUC can each function as an enhancer in one tissue but as a silencer in another tissue from the same intronic region downstream of the ASE, which calls for further investigation of the mechanisms by which these two SREs function.
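A position analysis of the kind mentioned above can be sketched as follows. The details are assumptions made for illustration: each upstream intronic sequence is taken to be written 5′ to 3′ and to end at the 3′ splice site of the ASE, the distance of an occurrence is measured from the first nucleotide of the motif to the exon boundary, and the function names are hypothetical.

```python
from collections import Counter


def upstream_positions(motif, upstream_seqs):
    """Count motif occurrences by distance (in nt) from the motif start
    to the exon boundary at the 3' end of each upstream sequence."""
    counts = Counter()
    for seq in upstream_seqs:
        start = seq.find(motif)
        while start != -1:
            counts[len(seq) - start] += 1
            # Advance by one so overlapping occurrences are also counted.
            start = seq.find(motif, start + 1)
    return counts


def fraction_in_window(counts, lo, hi):
    """Fraction of all occurrences whose distance lies in [lo, hi]."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    inside = sum(n for dist, n in counts.items() if lo <= dist <= hi)
    return inside / total
```

Comparing `fraction_in_window(counts, 15, 30)` between the ASEs skipped in a tissue and those included in it would indicate whether a motif such as CUCUCU is concentrated in that window.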