|Home | About | Journals | Submit | Contact Us | Français|
Conceived and designed the experiments: RM GK. Performed the experiments: RM. Analyzed the data: RM. Wrote the paper: RM GK.
Eukaryotic genes are typically split into exons that need to be spliced together to form the mature mRNA. The splicing process depends on the dynamics and interactions among transcription by the RNA polymerase II complex (RNAPII) and the spliceosomal complex consisting of multiple small nuclear ribonucleo proteins (snRNPs). Here we propose a biophysically plausible initial theory of splicing that aims to explain the effects of the stochastic dynamics of snRNPs on the splicing patterns of eukaryotic genes. We consider two different ways to model the dynamics of snRNPs: pure three-dimensional diffusion and a combination of three- and one-dimensional diffusion along the emerging pre-mRNA. Our theoretical analysis shows that there exists an optimum position of the splice sites on the growing pre-mRNA at which the time required for snRNPs to find the 5′ donor site is minimized. The minimization of the overall search time is achieved mainly via the increase in non-specific interactions between the snRNPs and the growing pre-mRNA. The theory further predicts that there exists an optimum transcript length that maximizes the probabilities for exons to interact with the snRNPs. We evaluate these theoretical predictions by considering human and mouse exon microarray data as well as RNAseq data from multiple different tissues. We observe that there is a broad optimum position of splice sites on the growing pre-mRNA and an optimum transcript length, which are roughly consistent with the theoretical predictions. The theoretical and experimental analyses suggest that there is a strong interaction between the dynamics of RNAPII and the stochastic nature of snRNP search for 5′ donor splicing sites.
The DNA encoding most eukaryotic genes is interrupted by long sequences called introns. These introns need to be removed through the process of splicing to produce the mature messenger RNA. The process of splicing plays a critical role in determining the exact aminoacid content of the ensuing protein. Several molecules denominated small nuclear ribonucleo proteins (snRNPs) are involved in finding the appropriate 5′ donor splicing sites for splicing. Transcription and splicing occur simultaneously and the ultimate product depends on the relative speed of transcription and the stochastic dynamics underlying splicing. Here we propose a biophysically plausible theory that describes the ongoing interactions between transcription and splicing. We show that the theoretical predictions are consistent with experimental measurements of the abundance patterns of different exons and transcripts across tissues.
Transcription of eukaryotic genes by the RNA polymerase II complex (RNAPII) produces a primary mRNA transcript (pre-mRNA) that contains both exons and introns. Introns are removed by splicing , ,  via the assembly of a spliceosomal complex including small nuclear ribonucleo proteins (snRNPs) , , , . Recent studies show that the majority of genes in higher eukaryotes are alternatively spliced and, therefore, contribute significantly to the structural as well as functional complexity and diversity of organisms , , . The process of splicing can start as soon as the pre-mRNA begins to emerge from RNAPII. Cis-acting regulatory elements such as splicing enhancers and silencers generally determine the splicing pattern of a given multi-exonic gene especially when transcription is not kinetically coupled to the splicing , , , . However, when transcription is coupled to splicing, inclusion or exclusion of an exon in the final transcript will also be strongly influenced by the transcription elongation rate as well as the local concentrations of various factors involved in the spliceosomal assembly and their interactions , , , .
Two basic models have been proposed to explain the various differences in the alternative splicing patterns of a given gene. According to the kinetic model , inclusion or exclusion of an exon in the final transcript is determined by the transcriptional elongation rate associated with the corresponding pre-mRNA in addition to the cis-acting regulatory elements. Exons are classified as ‘strong’ or ‘weak’ depending on whether they possess cis-acting regulatory elements associated with them or not. The inclusion of ‘strong’ exons is favored at higher transcriptional elongation rates whereas ‘weak’ exons may be included in the final transcript only when the transcriptional elongation rate is comparatively slower. Since the concentration of snRNPs in the vicinity of the transcriptional machinery is fixed under steady state conditions, a strong exon that has emerged recently from the transcriptional assembly will have a better chance of interacting with the snRNPs as compared to a weak exon that emerged earlier. Therefore, a weak exon will have a better chance to interact with the snRNPs only when there is a decrease in the rate or a pause in the transcriptional elongation process. According to the recruitment model , inclusion or exclusion of an exon is also decided by the interaction of the C-terminal domain (CTD) of RNAPII with a set of gene and exon specific DNA binding proteins and the snRNPs ,  in addition to cis-acting regulatory elements. The CTD of the RNAPII interacts directly with the snRNPs and other factors, increasing the local concentrations of these factors in the vicinity of the emergence of a weak exon and thus enhancing the probability of weak exons to interact with the snRNPs.
There are four basic variables involved in the definition of an exon: (1) cis-acting regulatory elements , ,  (2) transcription elongation rate  (3) interactions between the CTD of RNAPII and the snRNPs, hnRNPs and SR proteins ,  (often referred to as ‘recruitment’) and (4) the stochastic dynamics involved in the recognition of the 5′ donor splice sites by U1 snRNPs while the pre-mRNA is evolving from the transcription assembly. Variables 1 and 3 are specific to each exon whereas variables 2 and 4 are generic and affect all the exons across various transcripts of an organism.
Most of the current splice pattern prediction algorithms consider mainly the cis-acting regulatory elements (variable 1) , , , the kinetic model focuses on variable 2  and the recruitment model considers mainly variable 3 , . None of the current algorithms or models considers the stochastic dynamics associated with the snRNP search process (variable 4). Here we propose a biophysically plausible theory from first principles to describe the coupled dynamics of transcription and splicing. This work presents initial steps towards capturing the basic relationship between transcriptional elongation and splicing; the simplified model that we propose does not include multiple critical components that affect the splicing outcome including cis-acting pre-mRNA sequence motifs, trans-acting interactions with different proteins and variable rates of RNAPolII transcription. We focus on the stochastic dynamics whereby snRNPs locate the 5′ donor sites and how this search influences the outcome of splicing. We evaluate the theoretical predictions by analyzing expression data at the exon level from exon microarrays and RNAseq experiments across different tissues in mice and humans.
Recent single cell studies have revealed , ,  that small nuclear ribonucleoproteins (snRNPs) and other splicing proteins are freely diffusing inside the entire volume of various nuclear and splicing factor compartments of within the eukaryotic cell nucleus. Splicing is kinetically coupled to transcription when the time required to generate a complete transcript is longer than the time required for the assembly and catalytic activity of the spliceosomal proteins. Under such coupled conditions, we must simultaneously consider at least two different types of dynamical processes: (i) transcription elongation by the RNA polymerase II transcription complex (RNAPII) and (ii) the search process whereby snRNPs locate the 5′ donor splicing sites (DSS) on the emerging pre-mRNA to initiate the spliceosomal assembly ( Figure 1 ). The freely diffusing U1 snRNP can locate the donor splicing sites via two different types of mechanisms: a pure three-dimensional diffusion-controlled collision route (3D) and a combination of three-dimensional and one-dimensional diffusion dynamics as in the case of typical site-specific DNA-protein interactions (3D+1D) , , , . Upon successful binding of the U1snRNP molecule to the 5′ donor site, a cascade of molecular processes involving multiple snRNPs ensues, culminating in the formation of the spliceosomal complex and intron removal , , . Except for the binding of U1 snRNPs at the 5′ donor site, all the other steps involve the hydrolysis of ATPs. This means that the binding of U1 is a purely thermally driven process and here we focus on the dynamics involved in this rate-limiting step. All the other binding events and reactions, including transcription elongation, involve ATP hydrolysis and we therefore assume that the effects of thermal induced fluctuations are minimal in these reaction steps. We ignore the thermal induced fluctuations over these reaction steps while describing the search dynamics of snRNPs along the pre-mRNA. The overall probabilities associated with the interaction of snRNPs with various DSSs depend on the type of search mechanism followed by the snRNPs.
We start by considering the model illustrated in Figure 1 where the U1 snRNP has bound the emerging pre-mRNA via non-specific interactions facilitated by 3D diffusion and it scans the concomitantly emerging pre-mRNA for the presence of DSSs via 1D diffusion. At a given time t, let y(t) denote the length of the emerging pre-mRNA and let x(t) denote the position of the non-specific bound U1 snRNP on the pre-mRNA chain. The DSS under consideration is located at position x=n (DSSn), which has not been transcribed at time t (or is currently not reachable by the snRNP due to steric hindrance). Such coupled dynamics of snRNPs and RNAPII, represented by the set of dynamic position variables x and y () on the same pre-mRNA, can be described by the following set of Langevin type stochastic differential equations :
The transcription elongation rate is denoted as kE (bases s−1). xd (bases2s−1) is the 1D diffusion coefficient associated with the searching dynamics of U1 snRNPs towards the DSSn and is the delta-correlated Gaussian white noise with and . The movement of RNAPII along y is energetically driven via the hydrolysis of ATPs. As a result, the fluctuations in y are negligible and we use a deterministic description for RNAPII in Eq. 1 .
Let denote the joint probability of finding the snRNPs at position x and RNAPII at position y at time t given initial conditions x0, y0. The Fokker-Planck equation associated with the temporal evolution of can be written as follows :
Here the initial condition is , ensuring that at time t0, the probability of finding x0=0, y0=0 is normalized to one. The boundary conditions are as follows:
Here x=0 as well as x=y (y<n) act as reflecting boundary conditions for the dynamics of snRNP. Whenever the snRNP tries to visit x≤0 or x≥y it is reflected back into x [0, y]. Here acts as absorbing boundary condition whenever .
Let indicate the probability that RNAPII and snRNP are between position 0 and n at time t (given starting points x0, y0). Let denote the mean first passage time (MFPT) associated with the binding of snRNP at DSSn starting from initial conditions (x0, y0). From the definition of MFPT, . Noting that before time n/kE, the DSSn has not emerged yet, we have:
and therefore obeys the following backward type Fokker-Planck equation :
with the following boundary conditions:
We assume that the residence time associated with dissociation of the non-specific bound snRNPs from the pre-mRNA is much higher than the time required by the snRNPs to locate the 5′ donor splicing sites. As a result, we have introduced a reflecting boundary condition at x=0 in the first boundary condition. The other boundary conditions can be directly derived from Eq. 2′ . The second boundary condition describes the conditions where RNAPII transcription elongation is the limiting step and the third boundary condition describes the conditions where snRNP diffusion is the limiting step. The particular solution to Eq. 3 for the boundary conditions in Eqns 3′ can be written as follows:
Considering x0=0 and y0=0 (both RNAPII and snRNP start at the origin), we have . The first term is the time required to generate a pre-mRNA of n bases and the second term is the time required by the snRNPs to completely scan this pre-mRNA length via 1D diffusion. The validity of this equation for the MFPT under various values of n and kE is illustrated in Figure 2A–B using random walk simulations.
In line with site-specific DNA-protein interactions –, we assume that snRNP molecules locate their respective DSS binding sites on the growing pre-mRNA via a combination of 1D and 3D diffusion-controlled collision routes. Under such conditions, from Eq. 4 we find the average overall search time () required by the snRNPs to locate DSSn (x0=0;y0=0):
Here (units of seconds) is the 3D diffusion-controlled collision time required for non-specific binding of U1 snRNP with the pre-mRNA of length n. Eq. 5 suggests that there exists an optimum position of DSSn on the emerging pre-mRNA such that the search time required by the snRNPs to locate this DSSn will be a minimum. This optimum value can be obtained by solving for n. The explicit real solution of the resulting cubic equation is:
where . Upon substituting nopt in Eq. 5 we find the minimum search time .
In line with the prediction of the kinetic model, when the snRNPs locate the DSSn via a purely 3D diffusion-controlled collision route, the overall search time is:
In this equation, c (units of bases) is the sequence length within which the snRNPs can be captured at the 5′ donor site. A precise and tight binding would correspond to c=1. Upon comparing this expression with Eq. 5 we find that there exists a critical position on the pre-mRNA (nc) such that τS,1D3D=τS,3D. Solving the cubic equation for n ( Figure 2C ):
While deriving Eq. 5 we have assumed that the non-specific bound snRNP does not dissociate from the pre-mRNA chain until it reaches DSSn. We relax this assumption by modeling the search dynamics of snRNPs as multiple cycles of dissociation-scan-association events. In this modified version of the model, the non-specific bound snRNP can dissociate after scanning an average pre-mRNA length of L bases and then it re-associates back at the same or different location of the pre-mRNA chain. In this way, snRNPs are required to undergo at least (n/L) such association/dissociation events to scan the entire length of n bases. Under such conditions, the expression for the overall search time () can be written as follows:
Here L2/6xd is the average time required by the non-specific bound snRNPs to scan an average of L bases of pre-mRNA before the dissociation event. The scan length L depends on the magnitude of the interaction between the snRNPs and the pre-mRNA. When L=n, Eq. 9 reduces to Eq. 5 . When , there exists an optimum value of L in Eq. 9 at which is a minimum: . The corresponding minimum achievable search time is:
One should note that the optimum 1D scanning length can be achieved by the diffusing U1 snRNPs only when the inequality condition holds since by definition . Further analysis shows that will reach a minimum only when . Upon comparing Eqns 5 , 7 and 9 we find that when n<nc, then both and will be lower than . In the range L (0, nopt) the cubic equation has two real solutions for n (n1~L and n2, marked in Figure 2C ) for n. When n (L, n2), we find that . The relationship among these different search times is shown in Figure 2C . These results suggests that among the three possible modes of searching (pure 3D, 1D3D with multiple dissociations and 1D3D without dissociation), the 1D3D search mode of search without any dissociation event will be the most efficient and preferable one in the range n (L, n2) where L is the possible 1D scanning length associated with diffusion of U1 snRNPs along the emerging pre-mRNAs. We find from Eqs. 9 – 10 that similar to the pure 3D diffusion mediated search time (), is also a monotonically increasing function of n. On the macroscopic level, the interactions of snRNPs with DSSn can be described by the following chemical reaction scheme I:
Here (bases−1s−1) is the bimolecular type forward on-rate constant associated with the site-specific interaction of snRNP with the DSSn and (s−1) is the respective dissociation or off-rate constant. The sequence of DSSn plays critical role in determining the value of the off-rate. The number of snRNPs will be higher than the number of DSSs of a particular pre-mRNA transcript. In this situation, the thermodynamic probability of finding DSSn () to be bound with snRNPs is:
Here N0 is the total number of the freely diffusing snRNPs inside the nucleus. It follows from Eqns 5 – 6 that the probability is maximized when n=nopt irrespective of the value of the intra nuclear concentrations of snRNPs or the amount of time for which the completely transcribed pre-mRNA chain stays inside the nuclear compartment for further post-transcriptional processing. On the other hand, when the snRNP search mode is purely via 3D routes then the probability () is a monotonically decreasing function of n ( Figure 2D ):
From Eqs 11 – 12 , we find (all DSSn bound by the snRNP given infinite concentration). Those splicing sites located closer to the optimum position () approach this limit faster. Using Eq 11 we define the overall splicing efficiency of a transcript of length n as follows:
The value of the splicing efficiency (between 0 and 100%) indicates how well exons present in a given pre-mRNA transcript of length n interact with the available pool of snRNPs, are subsequently spliced and hence get included in the final transcript. This means that the overall levels of the final transcript should be directly proportional to this splicing efficiency. There exists an optimum length of pre-mRNA transcript (μ) at which achieves a maximum. The optimum μ can be obtained by numerical solving for n. The overall level of the final transcript will be maximum at since the overall average probabilities associated with all those exons of the given pre-mRNA transcript of length μ to interact with the available snRNPs will be a maximum. We consider a transcript c of length n and its expression in tissue k. We define the overall signal as where is the signal from the exon located at position i in transcript c in tissue k. With this definition we find that the maximum gene signal value of n occurs at which means that when the equality holds. This follows from the fact that .
We compare the theoretical predictions outlined in the previous section with two different types of experimental measurements: (i) experiments based on exon microarray data and (ii) experiments based on high-throughput RNA sequencing data (RNAseq) (“Materials and Methods”). Upon substituting the parameters τt, kE and xd into Eq. 6 for the optimum position of the DSS on the pre-mRNA we find bases and the minimum achievable overall search time required by the snRNPs . This search time is significantly higher than physiologically relevant timescales (for example, the cell's generation time). One should note that this higher timescale corresponds to the interaction of a single snRNP molecule with a single splicing site. The search time will be proportionately scaled up/down depending on the number of freely available snRNPs and nascent splicing sites inside the nucleus as . There are ~2×104 genes in the human genome, and there are on average ~10 exons per gene. This means that there are d0~4×103 such splicing sites at any given active region of the chromosome (corresponding to ~1% of the total pre-mRNAs being processed). With these values we find . These results suggest that the appearance of the speckles where snRNPs are concentrated inside the nucleoplasm of higher eukaryotes is mainly to scale down the search time required by snRNPs to locate the splicing-sites on the pre-mRNA.
We conclude from the expression for the probability of finding the snRNP at position n (, Eq. 11 ) that the DSS located at position of the growing pre-mRNA will have more chances to interact with the available snRNPs. Here the minimization of the overall search time is achieved mainly via the enhancing effects of the increasing numbers of non-specific interactions of snRNPs with the growing pre-mRNA. We learn from Eq. 8 that the inequality condition will hold whenever . The current parameter settings yield bases. Various single-cell studies using fluorescence recovery after photo bleaching (FRAP) provide an empirical estimate for the dissociation rate of snRNPs from the pre-mRNA chain: , , . This is an overall off-rate that includes dissociation of snRNPs from both the non-specific and specific binding sites (the off-rate of snRNPs from the splicing sites will be lower than the off-rate from non-specific binding sites.) Using this value of , the limiting behavior of pn,1D3D and pn,3D as is demonstrated in Figure 2D . This figure suggests that the optimum position of DSS will spread into a wider range as the total concentration of snRNPs increases inside the nucleoplasm. Single molecule studies suggest an average 1D scanning length of L~100 bases for the DNA-binding proteins under in vivo conditions . With this value, upon solving the cubic equation for n we find that n1=100 and n2=2×106 bases. Since within this range , this result suggests that the dominating mode of searching of U1 snRNPs for the 5′ splicing sites is likely to be via the combination of 1D and 3D without dissociation for most of the pre-mRNAs.
We considered microarray data evaluating exon levels in different tissues and species (Materials and Methods.) Examples of mouse and human constitutively spliced multi-exonic genes across various tissues are shown in Figure 3A–B . These examples, identified using the ranking metric defined in Eq. 14 , suggest that there exists a broad optimum position of splicing sites on the pre-mRNA at which the probability associated with the inclusion of the associated exon is maximized. This position is approximately independent of the tissue analyzed. In these particular mouse and human genes (Dtnb dystrobrevin beta in mouse and VIT vitrin in human), this optimum exon number occurs at the pre-mRNA position of n~5×104 to 105 bases (arrow in Figure 3A–B ). Other examples are included in supplementary materials (Figure S1, S2). The position of the maximum splicing index value, independently of the tissue, occurs around nopt~7×104 bases as predicted by Eq. 6 , with an error margin of ~25%.
Overall analysis of the multi-exonic genes present in both human and mouse genomes revealed an average intron length of ~4×103 bases with a median of ~103 bases. Here the average length of exons is ~2×102 bases with a median of ~102 bases. Results of genome wide analysis of the median of exon positions on pre-mRNAs of human and mouse is shown in Figure 3C–D which reveals the following approximate scaling relationships between the positions (n) and the exon numbers (ε):
The standard error (SE) in such transformation is approximately 5 to 25% of the mean (n) for ε in the range 1 to 100 ( Figure 3C–D ). This suggests that the optimum positions nopt and min τ S,1D3D may be observed anywhere in the ±25% of the predicted values upon a genome wide averaging across exon numbers ε.
The computed first exon normalized average signal (FENAS, defined in Eq. 15 ) associated with various mouse tissues (kidney, brain, liver, muscle and heart) and human tissues (cerebellum, kidney, liver, heart, muscle and normal and cancerous colon) is shown in Figure 4A–B . This figure indicates a maximum at approximately . This value corresponds to the optimum position of the Affymetrix annotated exon on the pre-mRNA at bases, which is broadly consistent with our theoretical predictions. We also compared the theoretical predictions with experimental data obtained from RNAseq experiments (Materials and Methods). The data from the exon level and transcript level signals obtained from RNASeq data of mouse brain and human T293 cells are shown in Figure 4C–D . The results from the RNASeq data are comparable to those from the microarray data and also reflect an optimum exon position, approximately around .
Upon substituting molecules, s−1 and the empirical values of τt, kE and xd into Eq. 13 and numerically solving it for the optimum transcript length n=μ we find bases ( Figure 5 ). This value corresponds to approximately exons. From the theoretical analysis, we learn that the overall transcript signal of a given gene is maximized when the number of exons present in that gene is closer to this value. We find from Figure 5 that the splicing efficiency is >95% whenever the length of the pre-mRNA transcript falls inside the range of ~(102–107) bases. The distribution of transcript lengths both in humans and mouse is well within this broad range. Furthermore, we calculated the genome level averaged transcript signal across various mouse and human tissues using Eq. 16 . Figure 6 suggests that there is a broad maximum in the transcript signal approximately centered around both based on the microarray data ( Figure 6A–B ) as well as the RNAseq data ( Figure 6C–D ). Within the expected error range of ±25%, these distributions and the location of the maxima are consistent with the theoretical predictions.
To further evaluate whether the experimental data are consistent with the existence of optimal exon positions, we computed the distribution of FENAS values for two separate broad ranges: (1) (i.e. around the theoretical optimum) and (2) or (i.e. far from the theoretical optimum). The distributions of FENAS signals were significantly different for these two ranges (t-test, p<0.05, Figure 7 ).
While the RNA polymerase II complex (RNAPII) is producing the pre-mRNA, multiple splicing factors diffuse inside the nucleus and initiate the recognition steps required in the process of splicing. Therefore, the ultimate mature mRNA product depends on several variables that affect the kinetics of these chemical and diffusion processes. These variables include RNAPII elongation speed and the presence of pausing events during transcription, the steric availability of splicing signals along the emerging pre-mRNA, exon and intron lengths, the abundance of different splicing factors and the sequence and hence affinity of those sequences for the splicing factors. Here we develop a simple theoretical framework that aims to capture the key interactions between transcriptional elongation and splicing.
The biophysical model proposed here can explain the effects of the stochastic search dynamics of small nuclear ribonucleo proteins (snRNPs) on the splicing pattern of eukaryotic genes. We considered two different ways to model the dynamics of snRNPs in the process of locating the splicing sites on the concomitantly evolving pre-mRNA: a pure three-dimensional diffusion process and a combination of three- and one-dimensional diffusion along the pre-mRNA. Our theoretical analysis on the coupled dynamics of transcription elongation and splicing revealed that there exists an optimum position of the splice sites on the growing pre-mRNA at which the time for snRNP binding is minimized ( Figure 2 ). The minimization of the overall search-time is achieved mainly via increasing non-specific type interactions between the RNA binding domains of snRNPs and the pre-mRNA. The theory further revealed that there is an optimum transcript length that maximizes the sum of the probabilities for the exons in the transcript to interact with the snRNPs. This suggested that the overall transcript signal should be maximized at this transcript length.
We evaluated the theoretical predictions by analyzing exon microarray data from various mouse and human tissues ( Figures 3 – 6 ). The empirical data revealed that the optimum position of the splice sites on the growing pre-mRNA occurs at ~4.5×104 bases and the optimum length of the transcript occurs at ~7.5×104 bases (corresponding approximately to the ~11th and ~20th exon in the genome wide first exon normalized average signal space.) The empirical data are broadly consistent with the theoretical predictions and the model captures, to a first approximation, some of the variability in exon level signals and splicing patterns.
Several computational algorithms have been developed to attempt to predict splicing patterns from DNA sequence. Most of the current splicing pattern prediction algorithms are solely based on cis-acting regulatory elements , , . Typically each exon of a given pre-mRNA transcript is assigned a score depending on the presence or absence of exonic and intronic enhancer or silencer elements and their degree of conservation across different species :. Using these exon level scores, transcript level scores are computed. Our work points out that, before computing the exonic scores for the presence of cis-acting elements, the ‘backbone’ of the scoring scheme assumes that all the exons are probabilistically equivalent. This uniform distribution of exon probabilities may hold only when the snRNP search mode is via pure 3D diffusion ( Figure 2D ) or the nuclear concentration of snRNPs is infinite. In more general scenarios, instead of a uniform distribution, our theoretical model suggests that the backbone of the scoring scheme should be given by the probability functional as defined in Eq. 12 – 13 . In other words, the backbone of the scoring scheme is determined by the generic variables 2 (transcription elongation rate), 3 (interactions between RNAPII and snRNPs) and 4 (stochastic dynamics of snRNP search processes) as highlighted in the introduction. The model suggests that a modified scoring scheme would include the background model that accounts for the coupled kinetics of transcription and splicing in addition to the exonic scores for the presence of cis-acting regulatory elements.
The theoretical framework presented here provides initial steps to describe the coupled chemical and diffusion process that underlie transcription and splicing. While we focused here on generic variables that affect all transcripts and genes, a lot of the transcript-to-transcript and gene-to-gene variability depends on sequence specific factors, gene-specific transcription pausing events, regulation of transcriptional termination and the speed at which the mRNA is transported to the cytoplasm. The theory proposed here constitutes a starting point to build more sophisticated models that further incorporate important aspects of the biology that were not considered in this initial examination.
To compare our theoretical predictions with experimental observations, we considered two different types of publicly available data: (i) exon microarray data and (ii) RNAseq data.
We analyzed mouse and human exon microarray data collected using Affymetrix arrays , . We used exon level signal data collected in triplicate from five different mouse tissues (brain, kidney, muscle, liver and heart; mouse Mo-Ex 1.0) and five different human tissues (cerebellum, kidney, muscle, liver, heart; human Hu-Ex 1.0). We also considered the available sample microarray data from normal and cancerous human colon , .
We analyzed BOWTIE generated RNASeq datasets , . The data sets come from mouse brain (GSM672532, GSM672537, GSM672528, GSM672534 and GSM672547), and human 293T cells (GSM860026, GSM860020, GSM860017, GSM860001 and GSM9685994). The mouse annotations are based on the mm8 genome build and the human annotations are based on the hg18 genome build and the data were obtained from the GEO database , . We used the information on sequence type annotation, sequence, and genomic alignment from the GEO files.
Experimental artifacts are introduced in the exon microarray data by factors such as cross-hybridizing probes, signal heterogeneity due to variation in the base composition of probes and signal variation due to fluctuations in the spot size of probes during microarray design. The cross-hybridization problem was solved by removing those probes showing hybridization at more than one location. Since the variations in probe level signals due to base composition, spot size and RT reaction are approximately random in nature, we assume that these errors are ameliorated by averaging over the scale normalized and background subtracted probe level signals of a probe set id, exon cluster id or transcript cluster id..
Exon level signals are computed by averaging the probe-set id level signals contained in an exon-cluster id and transcript level signals are computed by averaging the exon level signals contained in a transcript cluster id. Only the Refseq annotated transcript cluster ids were considered for all the subsequent calculations. We used the standard Tukey biweight algorithm  to remove the outlier probe signals before computing the average. We considered multiple transcripts (indexed by c) and different tissues (indexed by k). Let sε,c,k denote the log2 of the expression level of the εth exon in transcript number c and tissue number k. The relative probability associated with the ε th exon to get included in the final transcript was defined as where mc is the total number of exons in transcript c. The probability is directly related to the splicing-index () of the associated exon which is a measure of the extent of alternative splicing in that transcript, defined as where gc,k is the overall level of transcript c in tissue k. In addition to the stochastic component, other splicing variables such as the presence of cis-acting regulatory elements including splicing enhancers and suppressors can significantly modify the probabilities defined here.
To evaluate the expression derived in Eqns (11 – 12) we need a splicing probability profile of a pre-mRNA transcript that contains multiple exons spliced in a ‘constitutive’ manner across various tissues. Here we use the term ‘constitutive splicing’ to indicate the splicing pattern of a given pre-mRNA that is conserved across various tissues in a given organism. We use the following variance-based scoring metric to rank and select such constitutive transcripts from the pool of multi-exonic pre-mRNAs of a given genome:
We ranked the transcripts based on and we considered the top 25 transcripts to evaluate the theoretical predictions (these 25 transcripts represent the ones with minimal variation in the splicing index across different tissues as defined by the index ). For a single-exon transcript, . Earlier studies show that the majority of multi-exonic pre-mRNAs are spliced alternatively , . This suggests that the number of constitutively spliced examples available to evaluate our model is limited.
We assume that the effects of cis-acting elements associated with a given exon number of various genes across the genome is approximately a symmetric random variable. That is, we assume that both the cis-acting enhancers as well as silencer elements are found on the genome with equal probabilities. Under this assumption, we expect that averaging over the first exon normalized signals (FENAS) of a given exon number across all the available multi exonic genes in the entire genome of an organism will essentially reduce up- and down-regulatory effects of the cis-acting elements apart from a local normalization of the exon signals within a gene. While carrying out this averaging process, the start and stop positions of each ε th exon of the pre-mRNA of different gene transcripts is also averaged out in such a way that in the overall averaged signal space the exons of average length are equally separated or flanked by the average length of introns of the genome. We define the FENAS metric as follows:
Here is the genome level FENAS (±%) of the ε th exon in tissue k. To compare Eq. (15) with Eqns (11 – 12) , we use the genome-wide scaling , that is, the position of DSSn is a function of the exon number ε (). We note that and . To evaluate Eq. (11 – 12) , the average signals associated with the final transcripts with various numbers of exons at the genome level were calculated as follows:
Here is the genome level average signal of those transcripts with m exons in the k th tissue; b(m) is the total number of transcripts with m exons.
Exon microarrays possess very few probe sets per exon cluster id. Therefore, we also analyzed the number of sequence reads from RNASeq data (see datasets above). For this purpose we considered the start and end position of each transcript and exon and summed over the number of reads from RNASeq data. These signal profiles were used to compute the first exon normalized average signals FENAS as described in Eqn 15 . To compute the transcript level signal we considered the start and stop position of each transcript and summed over the number of reads from RNASeq data within this range.
In order to compare the theoretical predictions with experimental measurements we estimate the kinetic and diffusion parameters required to quantitatively evaluate the theoretical equations from experimental studies. Single molecule data from the human U2OS osteosarcoma cell line shows an in vivo transcription elongation rate for RNAPII of bases s−1 . Single cell studies on BAC HeLa and E3 U2OS cell lines suggest that the overall diffusion coefficient for the U1-70K snRNP inside the nuclear splicing region is on the order of µm2/s (~8×106 bases−2s−1) , , . This value is close to the 3D diffusion coefficient associated with the dynamics of protein molecules inside the cytoplasm of prokaryotic systems . The 1D diffusion coefficient associated with the diffusion dynamics of snRNPs on the pre-mRNA chain is not clearly known. Single molecule studies in E. coli  showed a numerical value of bases2s (~0.092 µm2/s) for the 1D diffusion coefficient associated with the dynamics of transcription factors along the DNA. This value is approximately 10 times smaller than the experimentally observed overall diffusion coefficient of U1 snRNP inside the nucleus. The experimentally observed fast diffusion coefficient can be attributed to the more flexible nature of single stranded pre-mRNAs compared to the double stranded DNA chain. The nuclear diameter of a typical human cell is ~6 µm and the corresponding volume will be ~10−16 m3. The concentration of a single snRNP molecule or its single DSS binding site on the pre-mRNA in this volume will be ~20 pM. When the length of the pre-mRNA is n bases, there should be at least ~n non-specific binding sites for snRNPs. Single cell experimental studies suggested the timescale required by the snRNPs to non-specifically interact with the pre-mRNA is about ~0.1 s , , . This value suggests an overall off-rate . There are approximately N0~108 snRNPs inside the nuclear volume  which means that the number of non-specific collisions that can happen between a single snRNP molecule and the growing pre-mRNA chain will be in the order of .
Mouse. This supplementary figure provides further examples showing the splicing index as a function of the annotated exon number (the format is the same as the one in Figure 3A ; see Figure 3A caption for details). A. Affymetrix Transcript ID: 6747308 Gene: Lypla1, lysophospholipase 1, NM_008866 B. Affymetrix Transcript ID: 6865573 Gene: Cep120, centrosomal protein 120, NM_178686 C. Affymetrix Transcript ID: 6770693 Gene: Osbpl8, oxysterol binding protein-like 8, NM_175489 D. Affymetrix Transcript ID: 6770718 Gene: Nap1l1, nucleosome assembly protein 1-like 1 NM_015781 E. Affymetrix Transcript ID: 6839871 Gene: Hira, histone cell cycle regulation defective homolog A, NM_010435. F. Affymetrix Transcript ID: 6814200 Gene: Mus musculus mRNA for mKIAA0947 protein. ENSMUST00000043493//ENSEMBL//hypothetical protein LOC218333 isoform 1 gene: ENSMUSG00000034525 G. Affymetrix Transcript ID: 6915559 Gene: Fggy, FGGY carbohydrate kinase domain containing, NM_029347 H. Affymetrix Transcript ID: 6825511 Gene: NM_028032, Ppp2r2a, protein phosphatase 2 (formerly 2A) regulatory subunit B (PR 52) alpha isoform.
Human. This supplementary figure provides further examples showing the splicing index as a function of the annotated exon number (the format is the same as the one in Figure 3B ; see Figure 3B caption for details). A. Affymetrix Transcript ID: 2477073, NM_016441, CRIM1, cysteine rich transmembrane BMP regulator 1 (chordin-like). B. Affymetrix Transcript ID: 2481379, NM_172311, STON1-GTF2A1L, STON1-GTF2A1L read through transcript. C. Affymetrix Transcript ID: 2482505, NM_003128, SPTBN1, spectrin beta, non-erythrocytic 1. D. Affymetrix Transcript ID: 2639552, NM_003947//KALRN//kalirin, RhoGEF kinase. E. Affymetrix Transcript ID: 2639734, NM_007064//KALRN//kalirin, RhoGEF kinase. F. Affymetrix Transcript ID: 2829171, NM_003202//TCF7//transcription factor 7 (T-cell specific, HMG-box). G. Affymetrix Transcript ID: 3179975, NM_005392//PHF2//PHD finger protein 2. H. Affymetrix Transcript ID: 3183604, NM_021224//ZNF462//zinc finger protein 462.
This figure provides complementary data to Figure 4 . A–B. Standard error of the FENAS signal for mouse (A) and human (B). There is one line for each tissue but the curves overlap. C–D. Number of transcripts (count) with a given exon number for mouse (C) and human (D).
This figure provides complementary data to Figure 6 . A–B. Standard error of hm,k for mouse (A) and human (B). C–D. Number of transcripts (count) with a given number of exons in mouse (C) and human (D).
List of variables defined in the text.
This work was funded by NSF grant #0954570 and NIH grant #DP2OD006461-01 and #1R21NS070250-01A1. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.