|Home | About | Journals | Submit | Contact Us | Français|
The occupancy of nucleosomes along chromosome is a key factor for gene regulation. However, except promoter regions, genome-wide properties and functions of nucleosome organization remain unclear in mammalian genomes. Using the computational model of Increment of Diversity with Quadratic Discriminant (IDQD) trained from the microarray data, the nucleosome occupancy score (NOScore) was defined and applied to splice junction regions of constitutive, cassette exon, alternative 3′ and 5′ splicing events in the human genome. We found an interesting relation between NOScore and RNA splicing: exon regions have higher NOScores compared with their flanking intron sequences in both constitutive and alternative splicing events, indicating the stronger nucleosome occupation potential of exon regions. In addition, NOScore valleys present at ~25 bp upstream of the acceptor site in all splicing events. By defining folding diversity-to-energy ratio to describe RNA structural flexibility, we demonstrated that primary RNA transcripts from nucleosome occupancy regions are relatively rigid and those from nucleosome depleted regions are relatively flexible. The negative correlation between nucleosome occupation/depletion of DNA sequence and structural flexibility/rigidity of its primary transcript around splice junctions may provide clues to the deeper understanding of the unexpected role for nucleosome organization in the regulation of RNA splicing.
In eukaryotic cells, genomic DNA is highly compacted in the nucleus into several levels of chromatin structures that ultimately make up the chromosomes. At the lowest level of compaction, a ~147 bp DNA sequence is tightly wrapped around the histone-octamer core (composed of pairs of the four core histones H2A, H2B, H3 and H4) into the elementary structural unit of chromatin, known as nucleosome (1). The packaging of DNA around the histone-octamer not only facilitates the storage and organization of the long eukaryotic chromosomes, but also plays a crucial role in diverse DNA-related biological processes, such as transcriptional regulation, replication, repair and recombination (2–4). Several reports have demonstrated that transcription start sites and transcription factor binding sites are selectively devoid of nucleosomes in the Saccharomyces cerevisiae genome (5–7). These findings were also confirmed for promoter regions of human genes using genome-tiling microarrays (8) and by analysis of high-throughput sequencing data (9). However, it was suggested that nucleosomes adopt more random positions in the interior of genes (10). Thus, revealing the detailed nucleosome organization beyond promoter regions will provide novel insights into the full understanding of subsequent steps of gene expression.
The general link between nucleosome and gene exon–intron architecture has been suggested in previous studies (11,12). These authors observed the common sequence periodicity around splice junctions. On the basis of the observed dinucleotide periodicities, Kogan and colleagues proposed that nucleosomes are positioned at splice sites due to a protection mechanism (12). Recently, by analyzing the ChIP-Seq data (9), Andersson et al. (13) have reported that nucleosomes are well positioned in exons and carry characteristic histone modifications, and Schwartz et al. (14) have described that the chromatin organization marks the exon–intron structure. They represent an important step toward the understanding of genome-wide nucleosome organization and function. However, mechanisms responsible for the relation between nucleosome positioning and transcriptional splicing remain unclear. To solve this problem, more detailed analyses on nucleosome positioning in different splicing events—constitutive, cassette exon, alternative 3′ and 5′ splicing events, etc., respectively—are required. The aim of this study is to investigate the nucleosome organization along DNA sequences around the exon/intron boundary and explore its relation to RNA splicing.
The sequence dependent nucleosome occupancy along DNA has been demonstrated and lots of work has been done to elucidate nucleosome occupancy signals that determine the preference of a particular region to bind to histones and form a nucleosome (15–17). Based on these studies, a new computational model, called Increment of Diversity with Quadratic Discriminant (IDQD), trained using human microarray data is presented to identify DNA sequences that either favor or inhibit nucleosome occupation. Using this model, we have calculated the nucleosome occupancy score (NOScore) around the splice junction sequences and found the distinct NOScore trends between exon and intron regions in constitutive, cassette exon, alternative 3′ and 5′ splicing events in the human genome. These results are in accord with the experimental nucleosome positioning data (13,14). Moreover, we have defined folding diversity-to-energy ratio (FDE) to describe RNA structural flexibility and found the negative correlation between nucleosome occupation/depletion of DNA sequence and RNA structural flexibility/rigidity around splice junctions. These observations will provide important clues to the deeper understanding on the relationship between the nucleosome occupancy along DNA and the splicing of primary RNA.
The top (nucleosome occupancy) 1000 and bottom (nucleosome depleted) 1000 probes (length of 50-mer) were picked out by Gupta et al. (18) from the DNA microarray data (19) to train their support vector machine (SVM) model for nucleosome occupancy sequence prediction. To avoid prediction bias in this study, a sequence alignment for each of the probes against the human cDNA (http://atidb.cshl.org/maize/Homo_sapiens.NCBI36.50.cdna.all.fa) was carried out with a local installation of BLSAT using default parameters. Of the 2000 probes, 322 probes that have sequence identities with cDNA were removed. The new dataset contains 762 nucleosome occupancy and 916 nucleosome depleted probes, constituting the positive and negative training samples, respectively.
Human splice junction sequences are downloaded from the latest AltSplice Database (Human release 3) of EBI (20,21), which is available at http://www.ebi.ac.uk/asd/altsplice/index.html. Splice sites obeying the GT–AG rule and with exons ranging from 140 to 160 bp, 160 to 180 bp and 190 to 210 bp were all collected for constitutive, cassette exon, alternative 3′ and 5′ splicing events, respectively. Two data sets of splice junction sequences were formed for AG- and GT-junction containing sequences, and each set consists of 1000-bp-long sequences with the splice site in the middle (AG at positions 499, 500; GT at positions 501, 502, respectively). For each of the splicing events, the number of splice junction sequences sorted by exon-size is listed in Supplementary Table S1.
The Increment of Diversity with Quadratic Discriminant (IDQD) method was proposed and successfully applied in the prediction of exon–intron splice sites for several model genomes (22). The method has also been used in the prediction of transcription start sites (23). In this method, the sequence features are converted into the increment of diversity (ID), defined by the relation of sequence X with standard source S (24),
Given a sequence X with r sets of feature variables (ID1 to IDr), we obtain an r-dimensional feature vector R = (ID1, ID2, …, IDr) and need to integrate the vector into a non-linear discriminant function ξ through quadratic discriminant analysis (QD) (25). QD can be deduced from Bayes’s theorem and under the assumption of ID1 to IDr obeying the r-dimensional normal distribution (26). The parameter ξ gives the decision for sample X. As ξ > ξ0, the sample X is classified into the positive group, or into the negative group as ξ ≤ ξ0 (see Part I in Supplementary Data for more information).
In this study, the feature vector R is designed by the following consideration. The rotational and translational settings of DNA around the histone core are determined by the sequence-dependent structure of DNA (27,28). The absolute occurrences of the six k-mers (k = 1, 2, …, 6) are responsible for the discrimination between nucleosome occupancy and depleted sequences (18,29), and therefore they construct the diversity sources. Based on this, 12 increments of diversities ID1(ID2), ID3(ID4), …, ID11(ID12) between sequence X and the standard source in positive (negative) training set can be easily introduced as the feature parameters for the computational model (see Part I in Supplementary Data for more information).
Given the ξ value for a probe-sized (50 bp, the same length as given in ref. 18) DNA sequence, we can assess its nucleosome occupation/depletion potential by calculating the NOScore, S,
here ξMax and ξMin are the maximum and minimum of the ranked ξ of sequences in the training dataset. When ξ equals ξ0, the corresponding NOScore S is designated as the threshold value S0. If the NOScore of a probe-sized DNA sequence is larger than S0, it is predicted as the nucleosome occupancy sequence; otherwise, as the nucleosome depleted one.
The nucleosome occupation/depletion potential of an L bp-long DNA sequence can be measured by the average NOScore ,
Here Si is the NOScore of the ith (i = 1, 2, …, m) 50-bp window, and the summation is over the m windows with a width of 50-bp and step size of 10-bp along the DNA sequence.
The dynamic programming algorithm RNAstructure V4.5 (30,31), available for download from http://rna.chem.rochester.edu, was employed to predict RNA secondary structures and folding free energies with a temperature parameter setting of 37°C.
We folded the L bp-long (L = 150, 170 or 200 bp in this study) RNA sequence as a whole using this program and defined FDE ratio to describe its structural flexibility.
Here Ej is the folding free energy of the jth (j = 1, 2, …, n) structure. The parameter n is the number of the optimal secondary structures with the least folding free energies and n = 20 is taken in this study; SdEnergy is the standard deviation of diverse folding free energies, representing the structural variability of RNA folding.
To estimate its performance, we applied the IDQD model to the training dataset and compared its prediction quality with that of the SVM model (18) by measuring the area under the receiver operating characteristic (ROC) curve. By this metric, a random classifier achieves an area under ROC curve (auROC) of 0.5, while 1.0 corresponding to a perfect one (32). Our IDQD model obtained a mean auROC of 0.958 in the 10-fold cross-validation, superior to the SVM model with an auROC of 0.907 (Figure 1a) for the discrimination between nucleosome occupancy and depleted sequences in the same dataset.
As a further evaluation, the IDQD model trained based on the ChIP-chip data was tested on the independent high-resolution nucleosome occupancy data (see Part II in Supplementary Data for more information) which has been analyzed in recent studies (13,14). It gives an ROC curve with the auROC of 0.935 (Figure 1b), indicating the excellent performance of IDQD model in distinguishing between nucleosome occupancy and depleted sequences.
By using IDQD model, the NOScore was calculated in a 50-bp sliding window with a step size of 10-bp along the 1000 bp regions around the donor and acceptor splice sites. Figure 2 shows the NOScore against the position of sliding windows over the regions around the acceptor (AG) and donor (GT) sites of constitutive splicing event.
The NOScore peaks in the range of exons, with the value higher than the threshold S0 = 0.188 (see Part I in Supplementary Data for more information), indicating their strong affinities for nucleosome occupation, while the NOScore is lower than the threshold value S0 in the flanking intron regions both for donor and acceptor junctions, suggesting their weaker potential to wrap on the histone-octamer compared with exons. These results obtained from our computation model are of prediction nature, but they are consistent with experimental nucleosome positioning data as indicated in recent reports (13,14). Moreover, we found that the lowest NOScore is at ~25 bp upstream of the acceptor site, corresponding to the location of the branch site in the RNA transcript.
In parallel, we also analyzed the atlases of nucleosome occupancy in splice junction regions for the three major alternative splicing events (cassette exon, alternative 3′ and 5′) and found that the NOScores still peak in the exon regions and the contrasts of nucleosome occupation/depletion potentials between exons and their flanking intron regions remain obvious (Supplementary Figures S1–S3), basically same as those in constitutive splicing event. In addition, the region between two alternative acceptor sites in alternative 3′ splicing event and that between two alternative donor sites in alternative 5′ splicing event also exhibit nucleosome occupation potential, with the NOScore higher than the threshold value S0.
The above findings were obtained from the IDQD model trained on the ChIP-chip data (18). To verify these uncommon results, we trained the model based on the high-resolution ChIP-Seq data (9) and reanalyzed the NOScore in intron/exon junctions for constitutive, cassette exon, alternative 3′ and 5′ splicing events. The result of reanalysis confirmed the predictions that there exists a higher NOScore in exons for all the splicing events (see Part II in Supplementary Data, Supplementary Figures S4–S7).
The higher nucleosome occupancy potential in exon regions has been observed in a sample of individual human genes (PRRC1, UHRF1BP1, EXOSC9 and PKM2) by analyzing the H3K36 tri-methylation signal (13). Here, we applied the IDQD model trained on the ChIP-chip data to predict nucleosome occupancy profiles in exon and surrounding intron regions of these genes. In most cases, exons exhibit higher NOScores than their surrounding introns (Figure 3 and Supplementary Figure S8). Simultaneously, the predicted nucleosome profiles are in accord with experimental nucleosome positioning signals indicated by the number of tags (Figure 3 and Supplementary Figure S8), demonstrating the reliability of the above statistical prediction.
Human intronic regions tend to be AT-rich when compared to exons. Since AT-rich sequences have a looser association with nucleosome (29), it is necessary to check if the observed nucleosome depletion within introns is fully caused by AT content. We therefore generated 909 (equal to the number of sequences in constitutive splicing event analyzed above) control sequences by randomization of bases under the same nucleotide composition and length as natural intron and exon sequences in constitutive splicing event, and deduced the average nucleosome occupancy profile using the IDQD model (Figure 4 and Supplementary Figure S11). We found that the nucleosome profile of control sequences is markedly different from that of natural ones, and does not show lower NOScore in intron regions compared to exons. This result ruled out the possibility that the observed NOScore was contaminated by AT content.
Next, we studied the dinucleotide distribution around splice junctions. With the same window length and step size as the plots of NOScore, we depicted dinucleotide profiles (Supplementary Figures S12–S13) and noticed some peculiarities of frequency distribution in splice junction region. However, by using IDQD model, we found that the dinucleotide information alone only gives an auROC of 0.53 for nucleosome positioning sequences prediction (Figure 5), much lower than that given by combination information of all k-mers (k = 1, 2, …, 6). These results indicate that apart from AT content and dinucleotide composition, the oligomer (k > 2) information is also important for understanding the pattern of nucleosome organization observed in exon and intron regions. As an illustration, we listed the top tri-, tetra-, penta- and hexamers (ranked by one-side P-values from the Mann–Whitney’s U test) which provide significant contributions to nucleosome positioning around splice junctions in the human genome (Supplementary Table S2).
In this section, the validity of the definition of RNA FDE ratio was evaluated. Recently, Watts et al. (33) reported the structure of an entire HIV-1 RNA genome using SHAPE experiment, and suggested that SHAPE reactivities are indicative of local nucleotide flexibility and disorder (34,35). Regions with median reactivities below 0.25 indicate domains with substantial base-paired secondary RNA structure, whereas median SHAPE reactivities of 0.5 and greater indicate regions of largely unstructured nucleotides. So, high median reactivity means relatively flexible while low median reactivity means relatively rigid (33). Since FDE is defined in a region spanning ~150-bp, we observe the SHAPE reactivities in a 150-bp sliding window with a 10-bp offset. If the number of nucleotides with SHAPE reactivities <0.25 (>0.5) is larger than 60 and that with SHAPE reactivities >0.5 (<0.25) is smaller than 40, then the 150-bp long sequence is classified into the rigid (flexible) group of low (high) median SHAPE reactivity (MSR). There are 84 samples belong to low MSR group and 76 samples to the high MSR group in the HIV-1 RNA genome. We studied if the low/high MSR classifications are consistent with those given by FDE. By calculating FDE for each sample, we differentiated the low and high MSRs (rigid and flexible sequences) and obtained an ROC curve with the auROC of 0.88, Figure 6, indicating the reasonability of our definition of FDE. The connection between FDE and SHAPE reactivity is understandable since the latter can be converted to RNA free-energy change terms that constrain the thermodynamic folding algorithm.
The pre-mRNA splicing proceeds through a series of reactions completed in RNA sequence. Together with some splicing factors and other additional proteins, the snRNPs (including U1, U2, U5, U4, U6 and others) form the spliceosome. The snRNA components of the splicing apparatus interact both among themselves and with the substrate RNA. These interactions allow for changes in structure that may bring reacting groups into apposition and may even create catalytic centers (36). So, any deeper understanding of splicing events should require the knowledge on pre-mRNA structure.
Since there is no direct experimental data on RNA structural flexibility, we employed FDE to analyze the structural flexibility of RNA sequences around splice junction regions. Considering the length of the shortest exon in each of the four (constitutive, cassette exon, alternative 3′ and 5′) splicing events (given in Supplementary Table S1), we extracted the exon-sized (L = 150, 170, 200 bp) sequences in the exon region (directly downstream of the acceptor sites AG), upstream of the acceptor site (AG) and downstream of the donor site (GT), and designated them as Exon, Iup and Idown. The intron–exon junctions (AG and GT) were included in Iup and Idown, respectively. For alternative 3′ (or 5′) splicing event, the most upstream acceptor site (or downstream donor site) was considered. Following Equation (5), we calculated FDE for exons of different sizes and found that the structural flexibility of RNA transcribed from the up- and down-stream intron regions are significantly larger than those from the exon regions on average (Table 1). Since the nucleosome occupation potential is different among Iup, Idown and Exon sequences, the results show that the nucleosome occupation potential of DNA sequences is correlated with the structural flexibility of RNA transcribed from them. The higher the nucleosome occupation potential of the DNA is, the less the structural flexibility of the RNA is.
The anti-correlation between the nucleosome occupation potential of DNA and the structural flexibility of RNA can be proved by direct statistics. We investigated the relationship between nucleosome occupation potential and structural flexibility in splice junction regions. Since the FDE is defined for an L-bp (L = 150, 170 or 200) long RNA sequence, it is necessary to use the NOScore of DNA averaged over the same long sequence, namely use [Equation (3)] in the computation. The obtained relationship between FDE and is given in Figure 7 for constitutive splicing event. It shows a significant negative correlation between the average NOScore and FDE (Pearson correlation coefficient r = −0.56; P < 0.0001) with the regression coefficient of −0.03 in the linear regression equation. For three major alternative splicing events, the results are the same but the regression coefficient equals −0.02 (Supplementary Figures S15–S17).
The negative correlation existing between FDE and indicates that the nucleosome occupancy along DNA indeed influence RNA structural flexibility in splice junction regions. RNA sequences transcribed from nucleosome-occupied regions are relatively rigid, while those transcribed from nucleosome-depleted regions are relatively flexible. This gives a clue to understand why the difference of the NOScore occurs between exons and introns in splice junction regions and why the difference may regulate the process of pre-mRNA splicing.
Nucleosome positioning is a combinational process, including rotational and translational settings. The periodical presence of certain dinucleotides (2-mer) presumably provides a rotational setting of DNA on the histone surface because AA or TT dinucleotides tend to expand the major groove of DNA (10), while other sequence combinations (other k-mers) could create subtle bends in the DNA or alter the flexibility of DNA to contribute to the rotational and translational settings of nucleosomal DNA (6,37). The pentamers and other longer oligomers may also provide the binding sites for chromatin remodeling complexes or spacing complexes of the imitation switch, which play important roles in nucleosome organization (10,38–41).
We have developed a novel computational model for nucleosome positioning recognition and to predict known nucleosome occupancy and depleted sequences (see Part I in Supplementary Data) by use of k-mer information (k = 1, 2, …, 6). The prediction was made in a 10-fold cross-validation experiment and evaluated using ROC curve method. The predictive results (Figure 1a and b) demonstrate the excellent performance of IDQD model in distinguishing between nucleosome occupancy and depleted sequences. The success of the prediction is primarily due to the appropriate synthesis of dinucleotide information and oligomer information of binding sites extracted from both nucleosome occupancy and depleted sequences. The ID defined in our model is essentially a measure of entropy increase as a sequence merged to a standard set of nucleosome occupancy or depleted sequences. The advantage of the ID method is the utilization of content measure, signal measure and similarity measure synthetically in a simple and unified approach. The different types of sequence information extracted by IDs are then integrated into one non-linear discriminant function through quadratic discriminant (QD) analysis. The accuracy of IDQD method in solving classification problem is comparable with SVM (42). As the information about nucleosome positioning beyond promoters is still rare, our IDQD model provides a useful tool for the genome-wide analysis of nucleosome distribution.
Knowledge of how nucleosomes are distributed in particular genomic regions could result in valuable and novel insights into DNA replication, mRNA splicing and so on. There is emerging evidence suggesting that splicing in most cases is initiated co-transcriptionally and that introns are removed while the nascent transcript is still tethered to the DNA by the RNA polymerase II (43). Thus, it is not impossible that the nucleosome positioning at DNA level may affect the exon recognition at RNA level. We have utilized the IDQD model trained by the top- and bottom-scoring probes in the dataset of Gupta et al. (18) which were taken from DNA sequences upstream of genes to predict the nucleosome organization in exon/intron regions within a gene. We found that the nucleosome occupancy potential is significantly higher in exons than in adjacent introns. To avoid the possible bias due to the particular dataset choice, we checked and reconfirmed the result by use of the high-resolution nucleosome occupancy data from human CD 4+ T cells (Supplementary Figures S4–S7). Our prediction can be proved by direct observation of the nucleosome-positioning data in splice junction region and the evidence has been offered in literatures (13,14). The success of the prediction of nucleosome organization in exon/intron regions by use of data taken from promoter sequences means both DNA regions from gene or promoter obeys the same law of nucleosome positioning.
We have studied nucleosome positioning in more detail around splice sites in different kinds of splicing events. In addition to constitutive splicing event, we have studied cassette exons, alternative 3′ and 5′ splicing events. We found the common feature that the nucleosome occupation potential is significantly stronger in exons but weaker in their adjacent intron regions and a sharp valley of NOScore locates at ~25 bp upstream of the acceptor site. The higher NOScore in exon regions may be related to the selection operating on DNA to impact codon usage, amino acid composition and synonymous and non-synonymous rates of evolution in coding sequences that has been analyzed in the Saccharomyces cerevisiae genome (44). It may also be related to enriched histone modifications existed in exons (13,14). However, the sharp NOScore valley immediately proceeding exons may be related to the signal for the branch point. Recent study has demonstrated that most of the branch point in the human genome are located 21–34 nucleotides upstream of the acceptor site and the polypyrimidine stretch spanning 4–24 nucleotides downstream of the branch point are enriched of ‘U’ nucleotide (45). These A+T frequencies present important features responsible for nucleosome inhibition in DNA sequence (29), which may be helpful to the function of branch point. This explains the nucleosome score valley ~25 bp upstream of the acceptor site.
Due to the presence of repetitive elements, introns may display a higher level of chromatin compaction than exons (46). We determined whether repetitive elements introduce bias into the above conclusions. All the splice junction sequences were annotated using the RepeatMasker program (47) and then grouped into two categories according to whether they contain repetitive elements. We computed the NOScores of the annotated sequences using the nucleosome positioning recognition model and found no significant nucleosome occupancy discrimination between repetitive-element-containing and repetitive-element-less sequences (Supplementary Figure S14), ruling out the possibility that repetitive elements would confound our observations.
The lower NOScore presents not only in splicing intron regions but also in distal intronic sequences. The frequency of exons taking higher NOScore (>0.24) is twice more than corresponding distal introns, while the frequency of exons taking lower NOScore (<0.22) is less than distal introns by a factor 2 (Supplementary Figure S9). On the other hand, it was reported that the nucleosome occupation potential in promoter regions is weaker than that in coding regions for the yeast genome (29,48,49). The systematic comparison of the NOScores in different regions of human genes shows a particularly unambiguous behavior: the highest in exon regions, intermediate in intron regions and lowest in promoter regions on average (Supplementary Figure S10a). However, the distribution of the standard deviation of NOScores is: the largest in promoters, intermediate in introns and lowest in exons (Supplementary Figure S10b), indicating the higher variability of the nucleosome occupancy in promoter regions. The above results show that promoter regions are more inhibitory on average and the complex regulatory functions require some regions of them to be nucleosome free and others to be bound by nucleosomes (29). The nucleosome organization of introns, to some extent, near that of promoters possibly supports the hypothesis that many undiscovered cis- and trans-acting regulatory elements reside in introns (50).
Nucleosome occupation or depletion occurs in DNA sequences. How do they influence pre-mRNA splicing occurred at RNA level? Recently, several new insights were gained into the mechanism for nucleosome organization in exon/intron junction region. Chromatin is a highly regulated environment and its association with the pre-mRNA would provide many opportunities to guide and modulate splicing. For example, information present on the chromatin could facilitate the detection of introns (46). The consensus sequences located at the intron–exon boundaries are highly degenerate and additional epigenetic information may be required for the accurate selection of splice sites. It was demonstrated that H3K36me3 and H3K79me1 in internal exons are related to splicing (13,51), as the internal exons might be visible to the spliceosome by well-positioned nucleosomes carrying H3K36me3 and/or H3K79me1. Simultaneously, nucleosomes bound within exons could improve the selection of exons by slowing the rate of RNA polymerase II (46,51–53), as a reduction in transcription rate has been shown to increase the inclusion of alternatively spliced exons (54).
As an alternative possible mechanism, we propose that the structural flexibility divergence between exons and introns also may be an important factor influencing pre-mRNA splicing. The accurate exon/intron discrimination requires splicing signals, including the splice sites, a polypyrimidine tract, a branch site and other splicing-regulatory elements that have been optimized in the evolution of RNA landscape (55). The interaction of the spliceosome with substrate RNA is also a basic event for the splicing process. It is influenced by local RNA structure, since the sequestration of RNA into double strands could prevent its binding to regulatory proteins (56).
The binding and motion of splicing apparatus along RNA sequences requires enough flexibility of these sequences. However, the flexibility of pre-mRNA is difficult to define. The flexibility of DNA and RNA duplexes in aqueous solution and crystal form was studied in refs (57,58) through molecular dynamics simulation and database analysis. These authors have pointed out that ‘the flexibility is a dangerous word for describing macromolecules, unless it is clearly defined’. The circumstance for pre-mRNA is more complicate, since different from the helix structure of B-DNA and A-RNA duplexes, the secondary structure of single-stranded RNA has more structural components, including base-pairs, bulges, interior loops, end loops, multi-branch loops, etc., and their complex combinatoriality.
In view of this, we defined FDE ratio directly from RNA folding energy data. The FDE, defined in Equation (5), means the relative folding flexibility of an RNA sequence. It contains two factors, the average and the deviation (SdEnergy) of RNA folding free energies. The former roughly describes how many base pairs form stems and how many bases are in loops in the secondary structure, while the latter describes the variability of the folding for an L-bp long RNA sequence. Both two factors influence the effective interaction and binding of the spliceosome with substrate RNA on some sites of the sequence.
Since RNA folding free energy deduced from the RNA-structure program (30,31) is dependent on the sequence length, we set L equal to the characteristic length of the splicing process, namely L equals the exon size. Table 1 gives the ratio of the FDE for up- and down-stream intron sequences to that for exons of different sizes. We found that the FDE for up- and down-stream introns is ~25, 19% higher than that for exons, respectively.
The discrepancy of FDE between exon and adjacent intron splicing regions is helpful to the recognition of splice sites and the accurate removal of introns. First, the high variability of the RNA structure and the small stem/loop ratio contained in intron sequences may facilitate the recruitment of snRNPs and many additional proteins to form the spliceosome in these regions and therefore promote the removal of the intron sequences through RNA–RNA, RNA–protein, RNA–chromatin (51) and protein–protein interactions. Secondly, the sequences transcribed from exons are relatively rigid, and due to the larger stem/loop ratio, their local structures are relatively stable, difficult to be bound by the spliceosome. It means that they can generally tolerate stronger mechanical stress, except at their two ends linking to introns. So it is likely that the breaking of pre-mRNA sequence in the splicing process always happens at the exon–intron boundaries.
The comparatively higher structural flexibility of up- and down-stream introns and lower structural flexibility of exons has been statistically tested by regression analysis. We found that the stronger anti-correlation does exist between the NOScore of DNA and the structural flexibility of RNA in splice junction regions (Figure 7 for constitutive splicing event, correlation coefficient r = −0.56, P < 0.0001; Supplementary Figures S15–S17 for alternative splicing ones, correlation coefficient r = −0.34 to −0.46, P < 0.0001). Simultaneously, the same but weaker anti-correlation was also found in the distal intron regions (Supplementary Figure S18, correlation coefficient r = −0.26, P < 0.0001). This means that FDE is an appropriate definition introduced for splicing analysis and this quantity bears some relation with nucleosome organization along DNA sequence.
In conclusion, our results have added a new layer to the understanding of nucleosome positioning and pre-mRNA splicing.
Supplementary Data are available at NAR Online.
The manuscript was submitted to Nucleic Acids Research in January, 2009, in which we primarily investigated and predicted the relation between nucleosome positioning and RNA splicing. In revision of the manuscript, we found the same views were proposed independently by Andersson et al. in Genome Research (August 2009) and by Schwartz et al. in Nature Structural and Molecular Biology (August 2009). Moreover, they demonstrated the relation by directly comparing with experimental nucleosome positioning data. So, in this revision, the related discussions are added in reference to newly published literatures, but our original prediction and all computational results on the nucleosome positioning around splice sites in different splicing events remain unchanged.
National Natural Science Foundations of China (No. 90403010 to L.F.L., No.10447003 and No. 60963015 to L.R.Z.). Funding for open access charge: Distinguished Scientist Award 2008 of Inner Mongolia Autonomous Region, China.
Conflict of interest statement. None declared.
We wish to express our gratitude to executive editor and three anonymous reviewers whose constructive comments were very helpful in strengthening the presentation of this article. We would like to thank Prof. Noble William Stafford of the Department of Computer Science and Engineering at University of Washington for providing the processed microarray data. We thank Dustin Schones at National Institutes of Health for his helpful discussions. We thank Xiaoli Li at Beijing Language and Culture University for her help in language correction of the manuscript. We are also grateful to Yongqiang Xing and Dr Jun Lv for fruitful discussions.