The control of RNA alternative splicing is critical for generating biological diversity. Despite emerging genome-wide technologies to study RNA complexity, reliable and comprehensive RNA-regulatory networks have not been defined. Here we used Bayesian networks to probabilistically model diverse datasets and predict the target networks of specific regulators. We applied this strategy to identify ~700 alternative splicing events directly regulated by the neuron-specific factor Nova in the mouse brain, integrating RNA-binding data, splicing microarray data, Nova-binding motifs, and evolutionary signatures. The resulting integrative network revealed combinatorial regulation by Nova and the neuronal splicing factor Fox, interplay between phosphorylation and splicing, and potential links to neurologic disease. Thus we have developed a general approach to understanding mammalian RNA regulation at the systems level.
RNA-binding proteins (RBPs) regulate alternative splicing (AS) and processing of RNA to generate biological complexity (1). Inferring RNA target networks regulated by these splicing factors may provide general insights into the mechanisms of regulation and their role in disease (2–5). Several global approaches have recently been applied toward this aim (2), including bioinformatic predictions driven by analysis of RBP motifs (6–8), profiling of RNA isoforms based on splicing microarrays (9–11) or RNA-Seq (12–14), or biochemical footprints derived from high-throughput sequencing of RNA isolated by crosslinking immunoprecipitation (HITS-CLIP) (9, 15). These methods have been applied to identify and genetically validate ~90 alternative exons regulated by Nova1/2 (9, 10), a family of neuron-specific splicing factors. Nova regulates a biologically coherent set of transcripts encoding synaptic proteins (10), and an RNA-regulatory map predicts that Nova-regulated splicing is position-dependent, such that alternative exons are included when Nova binds to downstream introns and are excluded via binding within the exons or to upstream introns (9, 16).
Each of these methods is limited in its signal-to-noise ratio and scope. RBP motifs generally bear very low sequence specificity (e.g., YCAY for Nova, ~1 site per 64 nt); microarray or RNA-Seq data are noisy at the exon level beyond a small set of top candidates and are correlative in nature; and biochemical protein-RNA interactions do not necessarily imply functional regulation. Consequently, only a small set of targets has been confidently identified for most splicing factors (4, 17). An alternative strategy is to integrate multiple sources of information, so that individually weak pieces of evidence can be combined to generate confident predictions, as demonstrated in studies of protein-protein interactions (18) and transcription factor networks (19). Here we set out to develop such an integrative approach to probabilistically model a diverse set of genomic, experimental and evolutionary data, using Bayesian networks to define and understand the function of RNA networks.
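The "~1 site per 64 nt" figure for YCAY can be checked with a short calculation. The sketch below assumes uniform, independent base frequencies (a simplification that real transcripts do not satisfy) and writes the motif in the DNA alphabet, with T standing in for U; the function names are illustrative only.

```python
from itertools import product

# IUPAC code Y = pyrimidine (C or T/U); C and A match only themselves.
IUPAC = {"Y": "CT", "C": "C", "A": "A"}

def motif_probability(motif: str) -> float:
    """Probability that a random k-mer matches the motif, assuming uniform bases."""
    p = 1.0
    for symbol in motif:
        p *= len(IUPAC[symbol]) / 4
    return p

def count_matches(motif: str) -> int:
    """Brute-force cross-check: count k-mers over ACGT that match the motif."""
    return sum(
        all(base in IUPAC[sym] for base, sym in zip(kmer, motif))
        for kmer in product("ACGT", repeat=len(motif))
    )

p = motif_probability("YCAY")
print(f"match probability per position: {p}")
print(f"expected spacing between sites: {1 / p:.0f} nt")
print(f"matching tetramers: {count_matches('YCAY')} of {4 ** 4}")
```

Under this assumption, four of the 256 possible tetramers match YCAY, so a match is expected about once every 64 positions by chance, which is why the motif alone carries little specificity.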
We studied the Nova splicing-regulatory network as an exemplar and compiled four types of data important for inferring direct Nova-RNA interactions coupled with defined Nova-dependent AS events: (i) 279,631 CLIP tag clusters, ranked by peak height (PH), derived from 20 independent HITS-CLIP experiments (figs. S1 and S2, table S1, and datasets S1 and S2); (ii) 841,501 Nova-binding sites (YCAY clusters) bioinformatically predicted and scored from the clustering, accessibility and conservation of YCAY elements (fig. S3); (iii) four splicing-microarray datasets comparing wild-type (WT) and Nova knockout (KO) brains, which detected 1,331 exons showing significant Nova-dependent splicing, in addition to many exons with moderate but potentially functional changes (fig. S4 and table S2); (iv) evolutionary signatures of regulated splicing, including conservation of AS in human or rat, and preservation of reading frame (20). Each individual dataset suggested a large number of informative but noisy candidates, arguing for the importance of rigorous data integration.
To probabilistically weigh and combine these datasets, we designed a Bayesian network for each of seven types of AS events—cassette exons (an exon is included or skipped), tandem cassette exons, mutually exclusive exons (one of two exons is included), alternative 5´ and 3´ splice sites, and alternative polyA usage coupled with 5´ or 3´ splice site choices (table S5); each AS event represents an observation of the Bayesian network. Using cassette exons as an example, the network included 17 nodes (variables) connected by edges reflecting the causal relationships between variables (Fig. 1A and table S3). The strength of YCAY clusters determines Nova binding (a binary hidden variable) in upstream introns, exons, or downstream introns, which in turn determines the probability of observing binding footprints by HITS-CLIP. The combinatorial action of Nova binding in one or more regions dictates the splicing outcome (another hidden variable), as reflected in microarray measurements and evolutionary signatures. With this pre-determined network structure, the parameters of the model (conditional probability distributions) were learned from a subset of training cassette exons, including 50 previously validated targets (20).
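To make the flow of evidence through such a network concrete, the following is a deliberately reduced sketch, not the 17-node model itself. It assumes only two binding regions (upstream intron or exon, and downstream intron), discretized evidence, and invented conditional probabilities; every number and name in it is a placeholder rather than a learned parameter. The posterior over the hidden splicing outcome is obtained by enumerating the hidden binding states.

```python
from itertools import product

# All probabilities below are made-up placeholders, NOT the learned parameters.
P_BIND_GIVEN_YCAY = {"weak": 0.05, "strong": 0.6}   # P(Nova bound | YCAY cluster strength)
P_CLIP_GIVEN_BIND = {True: 0.7, False: 0.02}        # P(CLIP cluster observed | bound?)
# P(splicing outcome | bound upstream/exon, bound downstream)
P_OUTCOME = {
    (False, False): {"inclusion": 0.01, "exclusion": 0.01, "none": 0.98},
    (True, False):  {"inclusion": 0.05, "exclusion": 0.60, "none": 0.35},
    (False, True):  {"inclusion": 0.60, "exclusion": 0.05, "none": 0.35},
    (True, True):   {"inclusion": 0.30, "exclusion": 0.30, "none": 0.40},
}
# P(microarray call | outcome); the call compares exon inclusion in WT vs. Nova KO
P_ARRAY = {
    "inclusion": {"higher_in_WT": 0.70, "lower_in_WT": 0.05, "unchanged": 0.25},
    "exclusion": {"higher_in_WT": 0.05, "lower_in_WT": 0.70, "unchanged": 0.25},
    "none":      {"higher_in_WT": 0.05, "lower_in_WT": 0.05, "unchanged": 0.90},
}

def posterior_outcome(ycay_up, ycay_down, clip_up, clip_down, array_call):
    """Posterior over the hidden outcome, summing over hidden binding states."""
    joint = {"inclusion": 0.0, "exclusion": 0.0, "none": 0.0}
    for b_up, b_down in product([False, True], repeat=2):
        weight = 1.0
        regions = ((ycay_up, clip_up, b_up), (ycay_down, clip_down, b_down))
        for ycay, clip, bound in regions:
            p_bind = P_BIND_GIVEN_YCAY[ycay]
            weight *= p_bind if bound else 1 - p_bind      # motif -> binding
            p_clip = P_CLIP_GIVEN_BIND[bound]
            weight *= p_clip if clip else 1 - p_clip       # binding -> CLIP footprint
        for outcome, p_out in P_OUTCOME[(b_up, b_down)].items():
            joint[outcome] += weight * p_out * P_ARRAY[outcome][array_call]
    total = sum(joint.values())
    return {outcome: value / total for outcome, value in joint.items()}

# Strong downstream YCAY cluster with a CLIP footprint, and higher inclusion in WT.
post = posterior_outcome("weak", "strong", False, True, "higher_in_WT")
print(max(post, key=post.get), post)
```

With these placeholder values, downstream binding evidence combined with higher inclusion in WT brain favors Nova-dependent inclusion, in line with the position-dependent map described above. The actual model additionally includes evolutionary signatures and separate nodes for each region.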
Unlike “black box” predictions, the learned model parameters provide interpretable and novel insights into Nova splicing regulation (Fig. 1B–E and fig. S5). For example, the model confirmed and extended the previously defined RNA-regulatory map (9, 16), quantifying it and predicting the combinatorial action of Nova binding in multiple regions: Nova binding in exons or upstream introns alone results in exon exclusion with a probability of ~0.6, while the chance increases to >0.9 if Nova binds to both regions (Fig. 1D).
We prospectively applied the model to 13,357 annotated cassette exons (20). Each exon was assigned three probabilities that measure Nova-regulated exon inclusion, exclusion, and absence of direct regulation, respectively, from which a false discovery rate (FDR) was estimated. After ensuring that the model was not over-fit by 10-fold cross validation (fig. S6), we predicted 363 cassette exons as direct Nova targets, with a stringent FDR ≤ 0.01, and more broadly, 588 Nova-regulated AS events (table S5) when applied to all types of AS events (fig. S7).
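One common way to turn posterior probabilities into an FDR threshold is to rank events by their probability of not being directly regulated and to take the mean of that probability over the selected set as the expected FDR. The sketch below illustrates this generic procedure; it is an assumption about how such a cutoff could be computed, and the exact estimator used in this study is described in the supporting material (20). The function name and the example probabilities are hypothetical.

```python
def select_by_fdr(p_none, max_fdr=0.01):
    """Return indices of the largest top-ranked set with estimated FDR <= max_fdr.

    p_none[i] is the posterior probability that event i is NOT directly regulated.
    The expected FDR of a set is estimated as the mean of p_none over that set.
    """
    order = sorted(range(len(p_none)), key=lambda i: p_none[i])
    running, best = 0.0, 0
    for k, i in enumerate(order, start=1):
        running += p_none[i]
        if running / k <= max_fdr:
            best = k
    return order[:best]

# Hypothetical posterior probabilities of "no direct regulation" for six events.
p_none = [0.001, 0.5, 0.004, 0.02, 0.9, 0.0005]
print(select_by_fdr(p_none))
```

Note that an event whose own probability of being a false positive exceeds the threshold can still be selected, because the criterion applies to the average over the whole set.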
We also searched novel exons with high sequence conservation, and exons whose AS pattern was missed in our database (fig. S8 and table S4) (20). This conservatively identified 76 additional exons as Nova targets. Hence the final Nova target network included 698 AS events from 358 genes, among which 610 events (87%) represent novel predictions (table S5).
To evaluate the quality of the network, we performed unbiased experimental validation. Intersecting the Bayesian network-predicted exons with a collection of well-studied alternative exons (AEDB) (21) yielded a manageable set of 31 non-redundant exons, whose confidence scores are distributed very uniformly (median rank: 288 of 588; Fig. 1F and table S6). Among these, nine are previously validated Nova targets, and 19 of the remaining 22 exons were validated by comparing AS in WT and Nova KO brains (P<0.05, t-test, n=6) (Fig. 1F and fig. S9). In addition, we validated 8/9 novel exons tested (fig. S10), yielding an overall validation rate of ~90% (28/31 or 36/40). Combined with its high sensitivity in predicting 58/77 (75%) previously validated targets (39/50=78% for cassette exons), the accuracy of our network is remarkable and compares very favorably with previous studies, which obtained substantially lower validation rates or more limited sets of candidates (9, 10, 22, 23).
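The validation and sensitivity figures follow directly from the counts given in the text; the short calculation below simply makes the arithmetic explicit. The variable names are illustrative.

```python
def rate(hits: int, total: int) -> float:
    """Fraction of hits among the total."""
    return hits / total

previously_validated = 9        # AEDB exons already known as Nova targets
newly_validated_aedb = 19       # of the remaining 22 AEDB exons
aedb_exons = 31
novel_validated, novel_tested = 8, 9

aedb_confirmed = previously_validated + newly_validated_aedb   # 28
pooled_confirmed = aedb_confirmed + novel_validated            # 36
pooled_total = aedb_exons + novel_tested                       # 40

print(f"AEDB set validation:    {rate(aedb_confirmed, aedb_exons):.1%}")
print(f"Pooled validation:      {rate(pooled_confirmed, pooled_total):.1%}")
print(f"Sensitivity, all types: {rate(58, 77):.1%}")
print(f"Sensitivity, cassette:  {rate(39, 50):.1%}")
```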
The Bayesian network analysis successfully integrated information from multiple types of data, predicting a substantial portion of targets missed by analysis of individual datasets or by other machine learning algorithms (20). For example, analysis of the 363 top target cassette exons predicted from microarrays, CLIP clusters or YCAY clusters alone achieved 49–54% sensitivity and an estimated validation rate of 54–61%, compared to 75–78% sensitivity and ~90% validation of the Bayesian network (Fig. 1G and fig. S11). Integration of microarray data, CLIP clusters, and YCAY clusters by naïve Bayes or logistic regression made clear but relatively moderate improvement, with 61% sensitivity and an estimated validation rate of 65–67% (Fig. 1G, and fig. S11C and E). These observations underscore the effectiveness of our integrative strategy.
The comprehensive list of Nova targets makes it possible for the first time to correlate the positions of Nova-binding sites with sequence conservation profiles (Fig. 2 A and B, and dataset S3). While Nova-binding sites are conserved in general (2), unexpectedly, we identified additional conserved regions in regulated exons outside Nova-binding sites (Fig. 2A and B), suggesting the presence of additional regulatory elements. To search for specific RBPs that might dictate coordinated combinatorial regulation with Nova, we examined putative splicing-regulatory elements derived from brain-specific AS exons (12). The well characterized Fox-binding element (UGCAUG) (24) was enriched in both ends of downstream introns (1.7 fold, P=0.002 for 5´ end; 2.1 fold, P=4.7×10−6 for 3´ end; χ2 test) bordering cassette exons showing Nova-dependent inclusion, and upstream introns near 3´ splice sites (2.3 fold, P=1.8×10−5; χ2 test) of cassette exons showing Nova-dependent exclusion (Fig. 2C and fig. S12). Furthermore, 106 of the 698 Nova target AS events were candidate Fox targets in the brain, with highly conserved UGCAUG elements (7), indicating that roughly 15% of Nova targets may be under Nova and Fox combinatorial control (5.5-fold enrichment, P<10−46, Fisher’s exact test). As Fox-regulated splicing is defined by a similar position-dependent RNA-regulatory map as Nova (7, 25), these observations suggest that additive or synergistic actions of Nova and Fox may be favored over antagonistic actions.
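Enrichment statistics of this kind can be computed as a fold change together with a one-sided Fisher's exact test, which is equivalent to the upper tail of a hypergeometric distribution. The sketch below shows the calculation with the Python standard library. The example counts are hypothetical and chosen only for illustration; the background counts underlying the reported P values are not given in the text and are not reproduced here.

```python
from math import comb

def fisher_enrichment_p(k: int, n: int, K: int, N: int) -> float:
    """One-sided Fisher's exact P value, P(X >= k), for X ~ Hypergeometric(N, K, n).

    N: background events; K: background events carrying the feature;
    n: events in the target set; k: target events carrying the feature.
    """
    tail = sum(comb(K, x) * comb(N - K, n - x) for x in range(k, min(n, K) + 1))
    return tail / comb(N, n)

def fold_enrichment(k: int, n: int, K: int, N: int) -> float:
    """Feature frequency in the target set relative to the background."""
    return (k / n) / (K / N)

# Hypothetical counts: 8 of 20 targets carry the motif vs. 18 of 200 events overall.
k, n, K, N = 8, 20, 18, 200
print(f"fold enrichment: {fold_enrichment(k, n, K, N):.2f}")
print(f"one-sided P:     {fisher_enrichment_p(k, n, K, N):.2e}")
```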
To experimentally address the bioinformatic prediction of Nova and Fox combinatorial regulation, we examined alternative splicing of several candidate exons (20). One of these, Gabrg2 exon 9, is regulated by Nova through a strong YCAY cluster (score=20) ~80 nt upstream of the 3´ splice site of intron 9 (Fig. 3A), as determined by mutagenesis analysis (26). Interestingly, an independent mutation ~30 nt downstream of exon 9 disrupted the basal level of exon 9 inclusion independent of Nova expression (26). Further examination revealed that this mutation fortuitously disrupted a very conserved Fox-binding element (Fig. 3A). To test if Nova and Fox exhibit combinatorial regulation on this exon, we transfected increasing amounts of Nova1 and Fox2, alone or in combination, into human embryonic kidney 293T cells, together with a minigene consisting of sequences between exon 8 and exon 10 (Fig. 3B). Either protein alone induced a dosage-dependent inclusion of exon 9, confirming that this exon is regulated by Nova and Fox individually. Simultaneous expression of lower amounts of both proteins dramatically increased the inclusion level from <5% to 26%, indicating a synergistic effect of Nova and Fox in splicing regulation. This synergistic regulation is direct, because mutations of their binding sites reduced exon 9 inclusion to basal levels even in the presence of both proteins. These observations suggest a model in which the binding of Fox and Nova in cis is able to synergize, perhaps by inducing a looping of the intron and thus the tethering of exons 9 and 10.
Altogether, we validated seven exons showing splicing regulation by both proteins, through synergistic (Gabrg2 and Mtap7d2), additive (Numb, Syne2 and Pbrm1), or antagonistic (Arhgef12 and Alcam) actions (Fig. 3B and C, and fig. S13). In all seven cases, the splicing outcomes can be predicted from a combinatorial RNA-regulatory map derived by superposing the map for each individual protein (fig. S13), offering a means of understanding the spatial and temporal control of RNA complexity.
Nova regulates AS of transcripts encoding synaptic proteins that themselves interact with each other (10). The comprehensive network confirmed and extended this observation, using GO term and KEGG pathway analysis (tables S7 and S8). Nonetheless it has been unclear exactly how Nova-regulated AS might impact such interactions. Analysis of protein annotations revealed that about half of Nova target transcripts encoded phosphoproteins, a 1.7-fold enrichment compared to brain-expressing AS genes (P<10−13, Fisher’s exact test) (Fig. 4A) (20). Furthermore, Nova-regulated exons within these transcripts themselves encoded experimentally determined phosphorylation sites much more frequently when compared with constitutive or overall alternative exons (>2.4 fold, P<10−12; Fisher’s exact test; Fig. 4B), or more strictly with non-target brain-specific AS exons, (1.7 fold, P<0.0004; Fisher’s exact test; Fig. 4B) (12, 20). Similar observations were obtained with a more stringent subset of phosphorylation sites experimentally determined in the brain and thus most relevant for Nova function (20) (fig. S14). Moreover, Nova target genes included 25 kinases and 9 phosphatases, a 2.5-fold enrichment compared to all brain-expressing genes (P=10−5, Fisher’s exact test). Thus, Nova directly affects the in vivo phosphorylation patterns of brain proteins via AS regulation, a two-layered control to modulate downstream protein-protein interactions and physiological functions (Fig. 4C and table S6).
Finally, the comprehensiveness of the network suggests new relationships to physiology and disease. A subset of newly predicted Nova-regulated exons are known to be functionally significant, and in some cases are essential for viability (e.g., the switch of Snap25 exon 5a/5b (27); table S6). Of the 358 Nova target genes, 88 are currently implicated in genetic diseases (1.5-fold enrichment compared to brain-expressing genes, P<5×10−4, Fisher’s exact test; table S9) (20), including neurologic disorders such as mental retardation, epilepsy and autism. Fox1 (A2BP1) itself is an autism susceptibility gene (28). Moreover, 8.5% of genes predicted to be regulated by both Nova and Fox (on the same or different exons) are implicated in autism, compared to 3.3–3.4% for genes targeted by Nova or Fox alone (P<0.02, χ2 test), and 1.2% in all brain-expressing genes (P=10−7, Fisher’s exact test) (20). Thus coordinated RNA regulation may be susceptible to disruptions in complex multigenic neurologic diseases. While placing discrete exons and genes in the Nova (and Fox) target networks already points the way toward greater understanding of RNA regulation and disease mechanisms, the functions encoded by most AS exons remain to be characterized.
Recent advances in machine learning using sequence motifs and other RNA features are beginning to derive general rules relevant to tissue-specific splicing regulation (8). These efforts are complemented and extended by the network analysis presented here, which integrates multiple types of data to generate highly accurate and global predictions of specific RBP-target regulatory interactions. This strategy should improve splicing code fidelity and provide a guide to prioritize further functional studies. Taken together, the integrative network analysis has the potential to fill gaps between the delineation of alternative RNA processing, its underlying regulatory mechanisms, and its biological significance.
We thank A. R. Krainer for Fox 1/2 expression constructs, S. Dewell for Illumina sequencing, and all Darnell lab members for helpful discussions. This work was supported by grants from the NIH to RBD (NS34389) and the Rockefeller University Hospital CTSA (UL1 RR024143). RBD is an HHMI Investigator. The HITS-CLIP and microarray data have been deposited to the NCBI SRA (SRA019982) and GEO (GSE22115) databases, respectively.