Microarrays for whole-transcript profiling
We designed microarrays monitoring 203,672 exons and 178,351 exon-exon junctions in 17,939 human genes. These microarrays report expression of both splicing event isoforms from 8,000 cassette exons, 3,950 alternative 5' splice sites, 3,672 alternative 3' splice sites, 3,770 multiple cassette exons, 3,123 mutually exclusive exons, and 1,890 inserted introns. The 'whole transcript' design used here is different from exon arrays
19, junction arrays
1, and cassette exon splicing arrays (e.g.
16) in that it includes a constellation of probes targeting every exon and every junction, similar to the approach of Griffith et al.
20. Although exon arrays provide an unbiased survey of transcript structure, they do not monitor connections between individual exons or many alternative 5' and 3' splice sites. Because they lack junction probes, they also generally monitor only one form of an alternative splicing event. The inability to monitor both mutually exclusive forms prevents accurate measurement of their expression ratio. Cassette exon splicing arrays, with probes designed to monitor the inclusion and exclusion of known cassette exons, are an economical option for profiling cassette exon splicing events but lack probes to profile other types of alternative splicing (e.g. alternative 5' and 3' splice sites), do not monitor alternative 5' and 3' exons, and do not permit discovery of novel alternative splice forms.
Compendium of human alternative splicing events
A set of 48 human tissues and cell lines with dissimilar expression patterns were hybridized to the arrays, and 11,700 of 18,000 genes monitored by the array showed > 3-fold change in gene expression level relative to the pool (p < 0.01) in at least one tissue (
Figure S1, Methods). Across the 48-tissue panel, 9,516 alternative splicing events were also significantly differentially expressed in at least one tissue (
Figure S2; Methods), similar to the number of differentially expressed genes. Microarray data are available as GEO dataset GSE11863, and the entire compendium of alternative splicing expression, gene expression, probe and splicing event nucleotide sequences, genome browser tracks, and individual splicing event figures are available at
http://rulai.cshl.edu/Rosetta_AS_supp/, with additional data at
http://rulai.cshl.edu/cgi-bin/dbCASE/dbcase.cgi?process=home.
As an example of the data available for each gene, results for A2BP1 (Fox-1) are shown in . A2BP1 regulates alternative splicing and is itself alternatively spliced, containing alternative 5’ exons, a pair of mutually exclusive exons, a cassette exon, and an alternative 3’ splice site
6,21. A2BP1 is upregulated in muscle and brain. The four alternative 5’ exons of transcript NM_018723 were detected in brain, while the 5’ exon found in transcript NM_145893 was detected in muscle. Within the coding region, brain cells preferentially include the mutually exclusive exon found in NM_018723, whereas skeletal muscle cells include the NM_145893 form. Across all tissues, A2BP1 gene expression is highest in heart, skeletal muscle, and the nervous system. The first (5’) mutually exclusive exon is expressed in nervous system tissues, while the second (3’) mutually exclusive exon is found in heart and skeletal muscle. Expression of the gene, the first mutually exclusive exon, and the second mutually exclusive exon can be combined to estimate the relative abundance of each form across the tissue panel (Methods). Of the two forms represented by the mutually exclusive exons, brain has a higher proportion of form NM_018723.
9,516 splice events are differentially expressed
Across the 48 tissue compendium, 9,516 alternative splicing events were significantly differentially expressed in at least one tissue (
Figure S2; Methods), similar to the total number of differentially expressed genes (11,700). Of the monitored cassette exons, 42% were differentially expressed in at least one tissue, compared with a much lower fraction (26%) of monitored alternative 3' splice sites and a higher fraction (55%) of monitored inserted introns. The samples with the most differentially expressed alternative splicing events relative to the pool include the cell lines HeLa and HCT116, peripheral leukocytes, fetal brain, testis, mammary gland, and skeletal muscle, while the samples with the fewest include adipose, adrenal gland, and fetal lung. These rankings parallel those for gene expression; i.e., the same sample types were among those with the most and fewest differentially expressed genes.
Cassette exon inclusion and exclusion rates varied among the samples. The ratio of the number of cassette exons differentially included to the number differentially excluded was highest for SW480, breast and lung tumor, retina, and brain samples excluding medulla oblongata. To test whether this result might be due to the composition of the reference pool (normal adult tissues), we re-ratioed to each individual sample and to the average of all 48 samples, generating 49
in silico reference pools (Methods), and computed inclusion rates, exclusion rates, and ratios. While the number of cassette exons included or excluded depended on the reference sample, the inclusion-to-exclusion rankings of samples were similar regardless of the pool, with retina and certain brain, tumor, and cell line samples consistently ranking highest. These results agree with previous findings that brain tissues express the greatest number of tissue-specific exons
19.
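The re-referencing step can be illustrated with a minimal sketch. It assumes each measurement is stored as a log ratio of sample to the original pool, so that changing the reference is a subtraction; it also assumes the "average of all samples" reference is the mean of the log ratios. The function names and the toy values are hypothetical and are not taken from the Methods.

```python
def rereference(log_ratios, reference):
    """Re-express log ratios (sample vs. original pool) against a new reference.

    Uses log(sample / ref) = log(sample / pool) - log(ref / pool).
    """
    return {name: value - reference for name, value in log_ratios.items()}


def in_silico_pools(log_ratios):
    """One reference per sample, plus one built from the mean of all samples."""
    pools = dict(log_ratios)  # each sample serves as a reference in turn
    pools["mean_of_all"] = sum(log_ratios.values()) / len(log_ratios)
    return pools


# Toy log10 ratios for one splicing event in three samples (made-up values).
event = {"cerebellum": 0.6, "retina": 0.9, "adipose": -0.3}
pools = in_silico_pools(event)
rereferenced = {ref: rereference(event, value) for ref, value in pools.items()}
```

With 48 samples the same construction gives 48 + 1 = 49 references, matching the count stated above.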
To provide an estimate of the accuracy of the quantitative predictions made by the splicing arrays and analysis tools, as part of a larger study
22, we tested a sample of 23 predictions by semi-quantitative RT-PCR. Of the splice events called differentially expressed by the microarray at p-value < 0.1, 74% showed changes of at least 15% in differential expression by PCR, and the splice event proportions measured by microarray and RT-PCR correlated with r = 0.88 (Methods,
Table S1).
Transcription and alternative splicing regulation act on different genes
Similar to results from Pan et al.
16, we did not detect significant overlap between the differentially expressed genes in a given tissue and the genes with differentially expressed splicing events (p = 0.4). As one specific example, CLK1 (NM_004071) and CLK2 (NM_003993) contain cassette exons whose exclusion generates a protein that dimerizes but inhibits kinase activity
23. Here, CLK1 and CLK2 showed uncorrelated gene expression (r=0.13) across the 48 samples but significantly correlated splicing expression (r=0.69) ().
Gene expression correlations are often used to define 'edges' of co-expression networks and infer functional associations. Here, gene-gene edges determined from correlated gene expression (r > 0.75) and those determined from correlated splice event expression show only 2% overlap (
Supplementary Note). Although this overlap is statistically significant, the small overlap demonstrates these are largely different regulatory networks. Finally, we observed that splicing-event expression and gene expression clustered samples similarly but with a few exceptions, such as lung and breast tumor samples. These cluster with parental tissues (lung and breast) using gene expression but with cell lines using splicing expression, demonstrating that gene expression in these tumor samples is more similar to their normal parental tissues while splicing expression is more similar to cell lines (
Supplementary Note).
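The edge comparison can be sketched as follows. This is an illustration under stated assumptions: Pearson correlation is used for both networks, each gene is represented by a single splicing-event profile, and overlap is measured as shared edges divided by all edges in either network (a Jaccard index). The source does not specify its overlap definition, and all names and toy profiles here are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def edges(profiles, threshold=0.75):
    """Gene pairs whose profiles correlate above the threshold."""
    return {
        (g1, g2)
        for g1, g2 in combinations(sorted(profiles), 2)
        if pearson(profiles[g1], profiles[g2]) > threshold
    }


def overlap_fraction(edges_a, edges_b):
    """Shared edges divided by all edges in either network (assumed definition)."""
    union = edges_a | edges_b
    return len(edges_a & edges_b) / len(union) if union else 0.0


# Toy profiles across four samples (made-up values).
gene_expr = {"A": [1, 2, 3, 4], "B": [2, 4, 6, 8.5], "C": [4, 3, 2, 1]}
splice_expr = {"A": [1, 3, 2, 4], "B": [4, 1, 3, 2], "C": [1.1, 3.2, 2.1, 3.9]}

gene_edges = edges(gene_expr)
splice_edges = edges(splice_expr)
overlap = overlap_fraction(gene_edges, splice_edges)
```

In this toy case genes A and B are linked by gene expression while A and C are linked by splicing, so the two networks share no edges.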
Expression of specific splicing events
As for A2BP1 and CLK1/2, tissue profiles for all splicing events are available at
http://rulai.cshl.edu/Rosetta_AS_supp/. Ten alternative splicing profiles are highlighted in the
Supplementary Note, including several associated with nervous system tissues (GSK3B, MAPT/Tau, APP, and CACNA1B), muscle tissues (MEF2C, CAMK2D, TPM1, and TPM2), splicing (NOVA1), and cancer (FGFR2).
The human gene CD44 encodes ten variable exons, one of which, CD44v6, is the target of bivatuzumab mertansine, an antibody for patients with advanced carcinoma
24 that was discontinued due to skin toxicity in Phase I trials
25. Although in our data CD44v6 is upregulated in tumors and cancer cell lines, it is expressed highest in skin, highlighting the potential value of this compendium as a public resource. Most microarray experiments use a probe or probe-set near the 3' end of each gene and consequently would not detect the isoform-specific variation of CD44.
De novo identification of 143 splicing regulatory elements
We next sought to discover regulatory elements in sequences in and adjacent to the tissue-regulated cassette exons. We extracted nucleotide sequence in eight regions ("neighborhoods") around regulated exons and searched for over- and under-represented nucleotide "words" of size 4–7 nt, using neighborhood-specific sequences adjacent to all monitored cassette exons as a background set (Methods). Examined neighborhoods were 200 nt for intronic regions and 39 nt for exonic regions, and the hypergeometric distribution was used to calculate word enrichment p-values. In total, 33.5 million enrichment p-values were calculated.
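A hypergeometric enrichment test of this kind can be sketched as below. The sketch assumes that a neighborhood is counted once if it contains the word at all, and that the regulated neighborhoods are a subset of the background; the source does not state whether sequences or individual occurrences were counted. Function names and the toy sequences are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) when drawing n items from N, of which K are successes."""
    total = comb(N, n)
    upper = min(K, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


def word_enrichment(word, regulated, background):
    """Enrichment p-value of `word` in regulated vs. all monitored neighborhoods."""
    N = len(background)                          # all monitored neighborhoods
    K = sum(word in seq for seq in background)   # background neighborhoods with the word
    n = len(regulated)                           # regulated neighborhoods
    k = sum(word in seq for seq in regulated)    # regulated neighborhoods with the word
    return hypergeom_upper_tail(k, N, K, n)


# Toy background of ten neighborhoods, three of which contain UGCAUG.
background = [
    "AAUGCAUGAA", "CCUGCAUGCC", "GGUGCAUGGG", "AAAAAAAAAA", "CCCCCCCCCC",
    "GGGGGGGGGG", "UUUUUUUUUU", "ACACACACAC", "AGAGAGAGAG", "AUAUAUAUAU",
]
regulated = [background[0], background[1], background[3]]
p_value = word_enrichment("UGCAUG", regulated, background)
```

Depletion would use the lower tail, P(X <= k), computed in the same way.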
Using a single tissue (skeletal muscle) to highlight the results, we observed that eight of 1,024 pentamers have a Bonferroni-corrected p-value < 0.01 for enrichment in the 200 nt intronic region upstream of cassette exons that are upregulated in skeletal muscle (). UCUCU is the most enriched, followed by other pyrimidine motifs. In the 200 nt intronic region following cassette exons, three of the 4,096 hexamers are significant. Strikingly, UGCAUG is most enriched, followed by GCAUGU and UGUGUG. No words were significantly enriched or depleted in the intronic region preceding the downstream 3' splice site.
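The word counts quoted here follow from the four-letter alphabet (4^5 = 1,024 pentamers, 4^6 = 4,096 hexamers). The sketch below pairs that count with a Bonferroni correction; it assumes the correction is applied per word length within one neighborhood and sample, which is an illustrative simplification and may differ from the number of tests used in the Methods. The raw p-value in the example is made up.

```python
def n_words(k, alphabet_size=4):
    """Number of distinct words of length k over the nucleotide alphabet."""
    return alphabet_size ** k


def bonferroni(p_value, n_tests):
    """Bonferroni-corrected p-value, capped at 1."""
    return min(1.0, p_value * n_tests)


pentamers = n_words(5)   # 1024
hexamers = n_words(6)    # 4096

# A hypothetical raw p-value of 1e-6 for one pentamer still passes a 0.01 cutoff.
corrected = bonferroni(1e-6, pentamers)
significant = corrected < 0.01
```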
The compendium can also be used to find enrichment of specific motifs across all 48 tissues and cell lines. For example, the motif UCUCU, which has been associated with PTB
3–5, is enriched in the intronic region preceding upregulated cassette exons in brain (p-value < 1e-26 in cerebellum), spinal cord, retina, heart, and skeletal muscle. In this data set, PTBP1 gene expression shows a dramatic anti-correlation with UCUCU enrichment, corroborating the inhibitory role of PTBP1. The UGCAUG motif, associated with Fox proteins A2BP1 and RBM9
6, is enriched in the intronic region following upregulated cassette exons in skeletal muscle (p-value < 1e-16) and heart, with limited enrichment in other tissues, including brain, adipose, and colon. A2BP1 expression correlates highly with UGCAUG enrichment, corroborating the splicing enhancer role of A2BP1. Finally, while we showed above that muscular and neuronal tissues express different A2BP1 isoforms, we observed UGCAUG enrichment in both tissues, although highest in muscle and heart.
The systematic analysis of all words, in all samples, in all neighborhoods identified 143 significant motifs (p < 1e-3, Bonferroni-corrected; Methods). Two prominent motifs exist upstream of cassette exons (). A UC-rich cluster, exemplified by UCUCU, is most enriched in brain and muscle tissues and the AG-rich cluster, including GAGG, AGAGG, and AGGG, is depleted in cassette exons upregulated in brain. Other clusters are enriched in brain (UGCU) and in several tissues (UGCAUG). Smaller clusters include motifs AGAA, CGCCU, and UGAA.
Motifs in the intronic neighborhood following upregulated cassette exons cluster into five groups (). UGUGUG is enriched in muscle and brain subsections, UGCAUG is enriched primarily in heart and skeletal muscle, and UUUU is enriched in brain tissues. AG-rich motifs are depleted downstream of cassette exons upregulated in several tissues, including brain, peripheral leukocytes, bone marrow, and striated muscle. UACUA is enriched in hypothalamus.
Enrichment occurs in only two other neighborhood and regulation combinations. Immediately upstream of downregulated cassette exons, UCUCU enrichment occurs in all samples except brain, spinal cord, muscle, heart, and leukocytes (online material). Coupled with PTBP1 gene expression, these data corroborate the role of PTB in silencing downstream exons. In the intron downstream of the exon preceding upregulated cassette exons (the 5' splice site), AU-rich motif enrichment occurs in brain and G-rich motif enrichment occurs in HeLa and HCT116, while G-rich motif depletion occurs in brain. G-rich motifs are highly enriched at the 5' splice site preceding cassette exons relative to constitutive exons. G-rich motifs may act to silence cassette exons in brain when positioned in the 5' splice site following a cassette exon and may be a generic 5' splice-site defining factor
8,9. This suggests a similar role for G-rich motifs near the 5' splice-site upstream of cassette exons.
This set of predicted alternative splicing motifs is in excellent agreement with existing studies, including tissue-specific roles for UGUGU, UCUUUC, UGCAUG, UGUGUC, and UUUUU
18,26–30. In other cases, motifs identified in the literature, such as ACUAAC
30,31 lie just below our threshold of statistical significance. Fagnani et al.
17 identified pyrimidine-rich intronic motifs adjacent to brain-enriched cassette exons, but with enrichment spread over several genomic regions and with other motifs, such as UGCAUG, scoring lower.
High resolution map of RNA alternative splicing regulation
We expected motif enrichment to occur non-uniformly across the neighborhoods, especially within the 200 nt intronic regions. To more precisely predict where individual motifs exert a regulatory impact, we examined each motif's frequency at each nucleotide position in each of the eight neighborhoods, using a Gaussian wavelet for smoothing (Methods).
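A positional profile of this kind can be sketched as follows. Note the substitution: the source describes a Gaussian wavelet, whereas this sketch uses a plain truncated Gaussian kernel as a stand-in, with an arbitrary width. It also assumes all neighborhoods are aligned and of equal length. Function names and toy sequences are hypothetical.

```python
from math import exp


def positional_frequency(sequences, word):
    """Fraction of aligned, equal-length sequences in which `word` starts at each position."""
    length = len(sequences[0])
    counts = [0] * length
    for seq in sequences:
        start = seq.find(word)
        while start != -1:
            counts[start] += 1
            start = seq.find(word, start + 1)  # allow overlapping occurrences
    return [c / len(sequences) for c in counts]


def gaussian_smooth(values, sigma=5.0):
    """Smooth a profile with a Gaussian kernel truncated at 3 sigma and renormalized."""
    radius = int(3 * sigma)
    smoothed = []
    for i in range(len(values)):
        lo, hi = max(0, i - radius), min(len(values), i + radius + 1)
        weights = [exp(-((j - i) ** 2) / (2 * sigma ** 2)) for j in range(lo, hi)]
        weighted = sum(w * values[j] for w, j in zip(weights, range(lo, hi)))
        smoothed.append(weighted / sum(weights))
    return smoothed


# Two toy 20-nt neighborhoods with UGCAUG starting at positions 2 and 3.
neighborhoods = ["AAUGCAUGAAAAAAAAAAAA", "AAAUGCAUGAAAAAAAAAAA"]
frequency = positional_frequency(neighborhoods, "UGCAUG")
profile = gaussian_smooth(frequency, sigma=2.0)
```

Renormalizing the kernel at the sequence edges keeps a flat profile flat, so apparent peaks near a splice site are not an artifact of truncation.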
UGCAUG enrichment occurs 10–80 nt downstream of cassette exons upregulated in muscle
The motif UGCAUG has been associated with Fox proteins (below,
6). We found UGCAUG enrichment highest in the intronic neighborhood following cassette exons upregulated in striated muscle. When examined at higher resolution, the regulatory influence of UGCAUG in muscle varies across and within the eight neighborhoods. A broad, highly significant enrichment of UGCAUG occurs from 10 to 80 nt downstream of cassette exons upregulated in heart and skeletal muscle. A second enrichment peak occurs from 65 to 15 nt preceding cassette exons downregulated in heart, while less enrichment occurs in skeletal muscle in this region (online material). Using RT-PCR, RNAi and cDNA over-expression, we validated the position-specific influence of UGCAUG on cassette exon inclusion, identified Fox alternative splicing targets, and found that the targets are enriched for genes involved in neuromuscular function
32.
UCUCU enrichment occurs from 110 to 35 nt preceding regulated cassette exons
UCUCU has been associated with PTB proteins (below,
3–5). The motif UCUCU frequently occurs directly upstream of 3' splice sites as part of the polypyrimidine tract. Indeed, when we examined constitutive and cassette exons, both showed high UCUCU occurrence at the 3' splice site relative to the average intronic rate. However, enrichment upstream of constitutive exons extends only 35 nt from the 3' splice site into the upstream intron, whereas tissue-varying cassette exons show extended enrichment. Exons upregulated in cerebellum show marked UCUCU enrichment from −95 to −15 nt. Fetal kidney shows UCUCU enrichment in a similar location, −75 to −20 nt, but intriguingly, this enrichment surrounds downregulated cassette exons. Almost every tissue shows UCUCU enrichment in a similar location, from −110 to −35 nt preceding cassette exons. To our knowledge, this is the first high-resolution map of UCUCU location and splice regulatory influence. Brain tissues, spinal cord, retina, and striated muscle show enrichment preceding upregulated cassette exons, while other samples show enrichment preceding downregulated cassette exons. Thus, we find that UCUCU enrichment occurs in the polypyrimidine tract immediately preceding the 3' splice site upstream of all exons, but enrichment from −110 to −35 nt occurs only upstream of tissue-regulated cassette exons.
Exonic splicing elements act to define exons (e.g.,
33–35). While we examined both introns and exons for motifs associated with tissue-varying cassette exons, only intronic motifs passed the p-value cutoff. While this might suggest intronic motifs play a greater role in tissue-specific cassette exon regulation, we searched only 39 nt of each exonic region compared to 200 nt for each intronic region, and the smaller window will lessen the significance of exonic motifs. Indeed, we see evidence for exonic UCUCU enrichment at the 3' edge of cassette exons expressed at higher levels in brain and heart but at less significant p-values (). We also did not build cross-species conservation into our statistical model for motif detection. To assess whether more complex motif-detection methods would significantly alter our results, we tested several other published methods, both pre-filtering sequences using cross-species conservation (e.g.,
36), and filtering motifs based on frequency in neighborhoods adjacent to alternate and constitutive exons (e.g.,
37). The results were largely similar in terms of the motifs identified, associated samples, and associated locations (
Supplementary Note). For example, the p-value for UGCAUG enrichment downstream of skeletal muscle enriched exons changes from 1e-16 to 1e-12 when using conservation, and UGCAUG is still the most significant word. The next most enriched word in both cases is the related word GCAUGU at 1e-09 and 1e-07, respectively, while exonic motifs remain below significance.
UGCU, UGUGU, and AG-rich motifs show different localization
Other previously established relationships between RNA binding proteins and motifs include muscleblind proteins binding UGCU
10, and CELF
12 and SUP-12 (RBM38)
12,13 proteins interacting with UGUGU. Here UGCU enrichment occurs upstream of cassette exons upregulated in brain, from −100 to −5 nt upstream of the 3' splice site (online material). Less significant enrichment is found 5 to 50 nt downstream of brain-upregulated cassette exons. Skeletal muscle and heart show enrichment 30 to 110 nt following upregulated cassette exons but not upstream. UGUGU shows consistent enrichment 10 to 100 nt following brain and spinal cord upregulated cassettes. hnRNP A/B and hnRNP F/H bind purine-rich motifs GGGG and AGGGG, respectively
8,9. In our data, AGGG and similar purine motifs are depleted preceding brain-upregulated cassette exons. When examined at higher resolution, depletion of AG-rich motifs occurs over a wide region. In cerebellum, the depletion ranges from 150 nt preceding the cassette exon to over 100 nt beyond the exon.
FOX and PTB act on non-overlapping sets of cassette exons
As both the UCUCU and UGCAUG words appear to regulate alternative splicing in some of the same samples, but only when positioned in precise regions around exons, we explored whether they occur around the same cassette exons. In cassette exons upregulated in skeletal muscle, we counted occurrences of UCUCU in the region −110 to −35 nt upstream of the cassette exon and UGCAUG in the region from 10 to 80 nt downstream of the cassette exon (
Figure S3). At least one of the motifs occurs adjacent to 33% of cassette exons upregulated in skeletal muscle. However, we were surprised that they co-occur adjacent to only 2% of the exons, not significantly below the 3% expected by chance (p = 0.2), suggesting these two motifs, both enriched in muscle, represent independent regulatory mechanisms. In cerebellum, as in skeletal muscle, UGCAUG and UCUCU are both associated with upregulated cassette exons. Of the cassette exons upregulated in cerebellum, 10% and 23% contain UGCAUG and UCUCU, respectively, in the identified regions, and 31% contain at least one. The motifs co-occur adjacent to only 2% of the exons, the percentage expected by chance, suggesting independent action as in skeletal muscle. In fetal kidney, on the other hand, UGCAUG is not enriched adjacent to up- or down-regulated cassette exons, likely reflecting the lower levels of Fox-1 expression, while UCUCU enrichment occurs upstream of down-regulated cassette exons, likely reflecting higher levels of PTB expression.
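The "expected by chance" figure for cerebellum can be checked with a short calculation. The sketch assumes that chance co-occurrence means statistical independence of the two motifs, so the expected joint rate is the product of the single-motif rates; the function names are hypothetical, and the input percentages are those quoted in the paragraph above.

```python
def expected_cooccurrence(p_a, p_b):
    """Fraction of exons expected to carry both motifs if they occur independently."""
    return p_a * p_b


def fraction_with_either(p_a, p_b, p_both):
    """Inclusion-exclusion: fraction of exons carrying at least one motif."""
    return p_a + p_b - p_both


# Cerebellum figures quoted in the text: 10% and 23% single-motif rates, 2% co-occurrence.
p_first, p_second, p_both = 0.10, 0.23, 0.02
expected = expected_cooccurrence(p_first, p_second)        # 0.023, about 2%
either = fraction_with_either(p_first, p_second, p_both)   # 0.31, i.e. 31%
```

Both values agree with the reported numbers: about 2% co-occurrence expected by chance, and 31% of exons carrying at least one motif.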
De novo predicted relationships between RNA-binding proteins and binding elements
The large number of samples and splicing events in the compendium can be leveraged to predict associations between
trans-acting splicing regulatory proteins and binding elements. For 135 of the motifs identified, we calculated a signed, normalized rank-order metric representing the similarity between the motif enrichment/depletion (p-value) tissue profiles and the gene expression profiles of RNA-binding proteins (
Figures S4–5, Methods). The results of this calculation represent
de novo predictions of potential protein-motif partners and capture many known relationships. PTBP1 has the highest absolute score, with a negative value, for UCUCU upstream of cassette exons and is also negatively associated with pyrimidine-rich motifs downstream of upregulated cassette exons, as previously established
4,38. Upstream of upregulated exons, UCUCU is enriched when PTBP1 expression is low but not when PTBP1 is high (online material). Upstream of downregulated exons, UCUCU is more often enriched when PTBP1 is expressed. Thus, cassette exons preceded by UCUCU are downregulated when PTBP1 is expressed and upregulated when PTBP1 is absent. Upstream of cassette exons, HNRPF has the highest positive score for the purine motif AGGG, and HNRPF and HNRPH1 are known to interact with AG-rich and G-motifs
8,9. Downstream of cassette exons, the CELF protein CUGBP2 ranks highest for exon inclusion using UGUGUG, and Fox proteins A2BP1 and RBM9 rank first and second, respectively, for UGCAUG. SFRS6, SFRS7, and HNRPF score positively for AGGG. The fact that expression of Fox, CELF, and HNRPF is positively associated with their motifs surrounding up-regulated cassette exons implies that the exons are preferentially included when the protein is expressed, and suggests a mechanism in which the protein acts to enhance inclusion of exons that are otherwise skipped.
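The idea of scoring protein-motif pairs by rank-order similarity can be illustrated with a Spearman correlation. This is a stand-in, not the metric from the Methods, whose exact form is not given here. The sketch assumes the motif profile is a signed score, with -log10(p) for enrichment and +log10(p) for depletion; all names and toy values are hypothetical.

```python
from math import log10, sqrt


def ranks(values):
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def signed_score(p_value, enriched):
    """-log10(p) for enrichment, +log10(p) for depletion."""
    magnitude = -log10(p_value)
    return magnitude if enriched else -magnitude


# Toy profiles over five tissues: the motif is enriched where the protein is low.
motif_profile = [
    signed_score(1e-20, True), signed_score(1e-8, True), signed_score(1e-2, True),
    signed_score(1e-3, False), signed_score(1e-6, False),
]
protein_expression = [-1.0, -0.5, 0.1, 0.4, 0.9]
score = spearman(motif_profile, protein_expression)
```

A strongly negative score, as in this toy case, corresponds to the repressor-like pattern described for PTBP1 and UCUCU; a positive score corresponds to the enhancer-like pattern described for the Fox proteins and UGCAUG.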
The RNA-binding protein NOVA1 targets YCATY and YCAY motifs, with a position-dependent influence
39,40. We examined our data for this degenerate motif, summing the values for all four words of YCATY (CCATC, CCATT, TCATC, TCATT) and calculating the hypergeometric probability against the count of the similar set of words in the background. Exons downregulated in brain are enriched in YCATY upstream whereas exons upregulated in brain are depleted of YCATY upstream and enriched downstream (
Figure S6). We observe NOVA1 levels almost 10-fold higher than the pool in brain and spinal cord, corroborating a regulatory mechanism in which NOVA1 expression causes exon skipping when bound to upstream YCATY and causes exon inclusion when bound to YCATY downstream of the exon. These results are fully consistent with previous reports on the position-specific role of NOVA1
40.
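The expansion of the degenerate motif into concrete words can be sketched as below, using the standard IUPAC meaning of Y (C or T). The counting helper sums occurrences of all expanded words, allowing overlaps, as a rough analogue of the summing described above; its name and the toy sequence are hypothetical.

```python
from itertools import product

# Subset of IUPAC nucleotide codes; Y is a pyrimidine (C or T), R a purine (A or G).
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "Y": "CT", "R": "AG", "N": "ACGT"}


def expand(motif):
    """All concrete words matching a degenerate IUPAC motif."""
    return ["".join(chars) for chars in product(*(IUPAC[c] for c in motif))]


def count_motif(motif, sequences):
    """Total occurrences (overlaps allowed) of any word matching the motif."""
    words = expand(motif)
    total = 0
    for seq in sequences:
        for word in words:
            total += sum(seq.startswith(word, i) for i in range(len(seq) - len(word) + 1))
    return total


ycaty_words = expand("YCATY")
hits = count_motif("YCATY", ["GGTCATCATTGG"])
```

The expansion yields the four words listed in the text (CCATC, CCATT, TCATC, TCATT), and the summed count can then be passed to the same hypergeometric test used for single words.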