The activation of cryptic 5′ splice sites (5′ SSs) is often related to human hereditary diseases. DNA-based mutation screening strategies are commonly used to recognize cryptic 5′ SSs, because features of the local DNA sequence can influence the choice of cryptic 5′ SSs. To improve the identification of cryptic 5′ SSs, we developed a structure-based method, named SPO (structure profiles and odds measure), which combines two parameters, a structural feature derived from the hydroxyl radical cleavage pattern and an odds measure, to assess the likelihood that a cryptic 5′ SS is activated in competition with its paired authentic 5′ SS. The SPO algorithm achieves higher prediction accuracy than seven current tools for identifying activated cryptic 5′ SSs: MaxEnt, MDD, the Markov model, the weight matrix model, the Shapiro and Senapathy matrix, Ri and ΔG. In addition, the ΔSPO scores predicted by the SPO algorithm correlated more strongly with the strength of cryptic 5′ SS activation than the scores from the other seven methods. In conclusion, the SPO algorithm provides improved identification of cryptic 5′ SSs, can be applied in designing mutagenesis experiments for various splicing events and may help to investigate the relationship between structural variants and human hereditary diseases.
Mutations at splice sites occur frequently and result in the activation of the so-called cryptic splice sites (1–3). Two typical cases in human genes, BRCA1 and BRCA2, contain several intronic genetic variants (4,5), and approximately 5% of these are associated with splice site mutation (4). These mutations have a potential effect on the activation of cryptic 5′ splice sites (5′ SSs) (4,5) that lead to cryptic splicing events. These cryptic splicing events were considered aberrant and often cause human hereditary diseases (2,6). Therefore, predicting the activation of cryptic 5′ SSs is an essential approach in investigating human hereditary diseases.
Various approaches are used in cryptic 5′ SS identification. Recently, an EST-based method named cryptic splice finder (CSF) (7) used the spliced alignment of ESTs to identify the cryptic splice site. Although the CSF program is useful for investigating splicing mutation in genetic disease, it relies considerably on the availability of sufficient EST data and accurate genomic annotations. Another approach (8,9) used information content (Ri) to detect activated cryptic 5′ SSs in human genes. Ri is the dot product of a particular sequence vector and weight matrix derived from the nucleotide frequencies at each splice site and is used to interpret mutated authentic splice sites and associated splicing regulatory sites (9). Although Ri provides useful information for analyzing the nucleotide substitutions that potentially impair splicing, the identification of activated cryptic 5′ SSs was reported to be less accurate. Sahashi et al. (10) recently used the improved Ri to estimate the splicing consequences of mutations at human 5′ SSs and discovered that Ri had low sensitivity in predicting splicing mutations. In addition to the sequence-based analyses mentioned, a thermodynamic inference scheme, based on binding free energy (ΔG) toward the stability of the RNA duplex between 5′ SS and U1 snRNA, was proposed for 5′ SS selection (11). The method considered the effects of molecular structure and revealed that the ΔG method may discriminate strong and intermediate activation of cryptic 5′SSs in competition assays. However, the identification for the intrinsic strength of cryptic 5′ SSs using ΔG is considerably inaccurate (6). Recently, Buratti et al. (12) collected 254 cryptic 5′ SSs that were activated by mutations in human disease genes and analyzed the mutation patterns and nucleotide structures in detail. 
They also evaluated the performance of several computational methods, including the Shapiro and Senapathy matrix (S&S) (13), the weight matrix model (WMM) (14), the first-order Markov model (MM) (15), the maximum entropy model (MaxEnt) (16) and the maximum dependence decomposition model (MDD) (17), in discriminating authentic and cryptic 5′ SSs. Buratti et al. (12) concluded that most of the authentic 5′ SSs had a prediction score that was statistically higher than that of the cryptic 5′ SSs. Although most methods can locate splice sites by searching for specific sequence patterns, they do not address the differences between activated and inactivated splice sites. In other words, these methods cannot identify the activation of cryptic splice sites when the mutations do not change the prediction scores.
DNA molecules form complex structures and function by interacting with proteins, nucleic acids and other small regulatory molecules. To detect such interactions, the hydroxyl radical cleavage patterns (18,19) were widely used for monitoring structural changes of DNA molecules with single residue spatial resolution. For example, the hydroxyl radical cleavage pattern was used for assessing the structure of DNA molecules and their related biological regulation (20,21), especially the interactions of DNA–protein complexes (22–24). Recently, the hydroxyl radical cleavage patterns of DNA were discovered to be associated with context-dependent mutation rates in mammals (25) and local sequence bias of human mutation (26). In addition, Parker et al. (27) used the ORChID (OH Radical Cleavage Intensity Database) (28) as genome-scale structural information to analyze the functional non-coding regions of the human genome. Their results indicated that single-nucleotide polymorphisms could induce larger structural changes in the non-coding DNA, and DNA structural changes may help to identify the phenotype-associated mutations (27). Importantly, a recent report indicated that the changes of the structure properties of the local DNA sequence can influence the choice of cryptic 5′ SSs when DNA variants occur in human disease genes (29). Therefore, it is crucial to realize the influence of single base pair substitutions in local DNA sequence context on the mRNA splicing phenotype. According to these studies, the DNA structure change may be a crucial factor for studying cryptic 5′ SS activation in human hereditary diseases; therefore, we used the hydroxyl radical cleavage pattern as the structure feature to improve the prediction for cryptic 5′ SSs in human disease genes.
DNA-based mutation screening strategies (12,30) have been used to investigate cryptic 5′ SSs in genetic diseases, and this feature has been applied in a prediction tool (30). In fact, some signals in the local DNA sequence that may influence the choice of 5′ SSs have been tested as a splicing feature for 5′ SS prediction (31). To our knowledge, the association between DNA structure and the choice of cryptic 5′ SSs is rarely discussed, and a structure-based method for screening activated cryptic 5′ SSs in human disease genes is not available. In this study, a structure-based method, named the structure profiles and odds measure (SPO) algorithm, was developed to quantitatively evaluate the activation of a cryptic 5′ SS in competition with its authentic 5′ SS. The SPO algorithm combines structural profiles with an odds measure to assess the activation likelihood for a cryptic 5′ SS. The results indicate that the SPO algorithm was more accurate than the other seven approaches, including S&S (13), WMM (14), MM (15), MaxEnt (16), MDD (17), Ri (10) and ΔG methods (32), in identifying an activated cryptic 5′ SS in competition with its paired authentic 5′ SS. In addition, the ΔSPO score from the SPO algorithm was a more effective score than the others in identifying the inherent strength of 5′ SSs in human disease genes.
Two sets of human mutation splicing sequence data were used for the development and evaluation of the SPO algorithm. The first data set, HMD1, was collected from published studies (6,8,12) and contained 490 experimentally validated pairs of authentic and cryptic 5′ SSs (Supplementary Table S1). Of the 490 data pairs, 275 were inactivated pairs and 215 were activated pairs. These 490 pairs of splice site sequences were used to train the SPO algorithm in determining a scoring threshold for the successful prediction of cryptic 5′ SS activation. The second data set, HMD2, contained 52 data pairs (Supplementary Table S2) from two competition assays, competition scheme I (CS-I) and competition scheme II (CS-II), each of which contained 26 authentic and cryptic 5′ SS data pairs (11). CS-I compared mutations of cryptic 5′ SSs with wild types of authentic 5′ SSs, whereas CS-II compared mutations of cryptic 5′ SSs with weakened types of authentic 5′ SSs. In both CS-I and CS-II, each group of 26 cryptic 5′ SSs was subdivided into 6 strong, 13 intermediate and 7 weak cryptic 5′ SSs according to their splicing strength. The HMD2 sequences were used solely to correlate the scoring method with the actual activation strength of cryptic 5′ SSs, independently of the 490 paired splicing sequences from HMD1. In total, 189 249 5′ SSs (10) from the entire human genome were extracted as source data for the SPO algorithm.
To assess the likelihood that a cryptic 5′ SS is activated, the SPO algorithm was developed by combining structural profiles with an odds measure. The structural profiles capture the change in local DNA structure at a 5′ SS before and after a mutation, and the odds measure computes the relative probability that a splicing event occurs. Figure 1 shows the SPO algorithm. The details of defining and combining these two quantities (‘SP’ for structural profiles and ‘O’ for odds) into the proposed ‘SPO’ algorithm are as follows:
The performance of the proposed SPO algorithm in identifying activated cryptic 5′ SSs was compared with that of seven other reported approaches, that is, S&S (13), WMM (14), MM (15), MaxEnt (16), MDD (17), Ri (10) and ΔG (32). The comparative evaluation used a 5-fold cross-validation of the 490 paired splicing sequences in the HMD1 data set. First, all 490 pairs of splicing sequences were divided equally into five partitions. Each partition served once as a testing set, and the remaining four partitions were used for training. In total, five testing sets were used, and each training set was four times the size of its corresponding testing set. The indices used to evaluate the performance were sensitivity, specificity, accuracy, precision and F-measure, defined as TP/(TP+FN), TN/(FP+TN), (TP+TN)/(TP+FN+TN+FP), TP/(TP+FP) and 2×(sensitivity×specificity)/(sensitivity+specificity), respectively. TP, TN, FP and FN represent the counts of true positive, true negative, false positive and false negative cases, respectively. The receiver operating characteristic (ROC) curves of the eight methods were constructed from the sensitivity and 1 − specificity obtained by varying the delta score used to determine the activation of a cryptic 5′ SS. The area under the ROC curve (AUC) was used as a measurement of their performance. In addition to these measures, Pearson's coefficient was used to evaluate the correlation between the predicted scores and the activation strength of cryptic 5′ SSs from the HMD2 data set.
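The partitioning and the five indices described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the random seed and the toy confusion-matrix counts are assumptions. The F-measure follows the formula exactly as stated in the text (sensitivity and specificity); the more widely used F1 score takes precision in place of specificity.

```python
import random


def k_fold_indices(n_items, k=5, seed=0):
    """Shuffle item indices and deal them into k folds of near-equal size."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # seed is an arbitrary assumption
    return [indices[i::k] for i in range(k)]


def confusion_metrics(tp, fn, tn, fp):
    """Compute the five indices from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (fp + tn)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    precision = tp / (tp + fp)
    # F-measure as written in the text: harmonic mean of sensitivity and specificity.
    f_measure = 2 * sensitivity * specificity / (sensitivity + specificity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "precision": precision,
        "f_measure": f_measure,
    }


# 490 HMD1 pairs split into five folds of 98; each fold is the testing set once.
folds = k_fold_indices(490)
for test_fold in folds:
    train = [i for fold in folds if fold is not test_fold for i in fold]
    print(len(test_fold), len(train))

# Hypothetical counts, for illustration only.
print(confusion_metrics(tp=8, fn=2, tn=6, fp=4))
```

With 490 pairs, each testing set holds 98 pairs and each training set 392, which matches the four-to-one ratio stated in the text.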
A 5-fold cross-validation of the 490 paired sequences from the HMD1 data set was conducted. This 5-fold cross-validation was also used to determine the ΔSPO threshold in the SPO algorithm. For each of the five sets of training sequences, the ΔSPO threshold that yielded the optimal F-measure on the corresponding testing sequences was chosen. The value that occurred most often among these five thresholds (to five decimal places) was designated as T, the ΔSPO threshold in the SPO algorithm. Based on this, a cryptic 5′ SS competing with its authentic 5′ SS was considered activated if its ΔSPO score was greater than T, and the amount by which the ΔSPO score exceeded T was used to rank the probability of such activation. If no single value occurred most often among these five thresholds, the 5-fold cross-validation was repeated until such a threshold was obtained.
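The threshold-selection rule above can be sketched as follows, under stated assumptions. The text does not say which candidate thresholds were scanned, so this sketch tries every observed score rounded to five decimal places. All function names and the toy scores are hypothetical, and the F-measure uses the sensitivity and specificity form given in the text.

```python
from collections import Counter


def f_measure(scores, labels, threshold):
    """F-measure (sensitivity/specificity form) when 'score > threshold' means activated."""
    tp = sum(1 for s, y in zip(scores, labels) if s > threshold and y)
    fn = sum(1 for s, y in zip(scores, labels) if s <= threshold and y)
    tn = sum(1 for s, y in zip(scores, labels) if s <= threshold and not y)
    fp = sum(1 for s, y in zip(scores, labels) if s > threshold and not y)
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return 2 * sens * spec / (sens + spec) if sens + spec else 0.0


def best_threshold(scores, labels, decimals=5):
    """Return the candidate threshold with the highest F-measure (lowest one on ties)."""
    candidates = sorted({round(s, decimals) for s in scores})
    return max(candidates, key=lambda t: f_measure(scores, labels, t))


def consensus_threshold(fold_thresholds):
    """Return the single most frequent per-fold threshold, or None if there is a tie."""
    counts = Counter(fold_thresholds).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # the text says to repeat the cross-validation in this case
    return counts[0][0]


# Toy scores: the two activated pairs (True) score higher than the inactivated ones.
print(best_threshold([0.5, 1.0, 1.5, 2.0], [False, False, True, True]))
# Hypothetical per-fold thresholds; three of five agree, so that value becomes T.
print(consensus_threshold([1.2214, 1.2214, 1.3, 1.1, 1.2214]))
```

A return value of `None` from `consensus_threshold` corresponds to the case in which the cross-validation is repeated.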
The HMD1 data set, which contained 490 pairs of human authentic and cryptic 5′ splice sequences, was used for evaluating the performance of the proposed SPO algorithm (Supplementary Table S3). A threshold of T=1.2214, previously obtained from analyzing the HMD1 data set with 5-fold cross-validation, was used to determine whether a splice site was activated. The detailed sensitivity, precision, specificity, false positive rate, accuracy and F-measure at different ΔSPO score thresholds are shown in Figure 2. Moreover, the other seven reported approaches, including S&S (13), WMM (14), MM (15), MaxEnt (16), MDD (17), Ri (10) and ΔG (32), were used for comparison. Note that these seven approaches can evaluate the likelihood of a 5′ SS by searching for specific sequence patterns, but they do not consider the competition between a cryptic 5′ SS and its paired authentic 5′ SS. Therefore, to assess the likelihood that a cryptic 5′ SS is activated in competition with its paired authentic 5′ SS, these seven approaches were modified by using the following scheme. After the mutation, ‘ΔRi’ was defined as the Ri value of a cryptic 5′ SS minus the Ri value of its paired authentic 5′ SS. The other methods were modified by using the same procedure, except the ΔG method. Subject to the definition of ΔG, the delta of ΔG was represented by the symbol ‘ΔΔG’ and defined as the ΔG value of the authentic 5′ SS minus the ΔG value of the cryptic 5′ SS. All seven deltas were derived from the same 490 paired splicing sequences. Finally, −0.009, −0.09, −0.27, 0.9362, 0.9836, −0.5408 and −0.0001 were obtained as the thresholds for ΔMaxEnt, ΔMDD, ΔMM, ΔWMM, ΔS&S, ΔRi and ΔΔG, respectively.
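The delta scheme can be illustrated with a short sketch. The function names and the input values are hypothetical. The thresholds for ΔRi and ΔΔG are the ones reported in the text. The rule "activated if the delta exceeds the threshold" is stated in the text only for ΔSPO and is assumed here to carry over to the other deltas. The reversed order in ΔΔG presumably reflects that a more negative ΔG corresponds to a more stable 5′ SS and U1 snRNA duplex.

```python
def delta_score(cryptic, authentic):
    """Generic delta (e.g. ΔRi, ΔMaxEnt): cryptic-site score minus authentic-site score."""
    return cryptic - authentic


def delta_delta_g(cryptic_dg, authentic_dg):
    """ΔΔG: authentic-site ΔG minus cryptic-site ΔG (order reversed relative to the others)."""
    return authentic_dg - cryptic_dg


def is_activated(delta, threshold):
    """Assumed decision rule: call the cryptic site activated if delta exceeds the threshold."""
    return delta > threshold


# Thresholds reported in the text for two of the seven modified methods.
THRESHOLD_DELTA_RI = -0.5408
THRESHOLD_DELTA_DELTA_G = -0.0001

# Hypothetical scores for one cryptic/authentic pair after mutation.
d_ri = delta_score(cryptic=7.5, authentic=6.0)
dd_g = delta_delta_g(cryptic_dg=-6.0, authentic_dg=-4.5)
print(d_ri, is_activated(d_ri, THRESHOLD_DELTA_RI))
print(dd_g, is_activated(dd_g, THRESHOLD_DELTA_DELTA_G))
```

In both conventions a larger delta favors the cryptic site, so a single comparison against the method-specific threshold can be used for all seven methods.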
Table 1 summarizes the performance of these eight scoring methods. According to the results from the 5-fold cross-validation, the SPO algorithm outperformed the others in all six categories in accurately identifying activated cryptic 5′ SSs competing with paired authentic 5′ SSs. A different modification strategy for the seven scoring methods (taking the ratio of the cryptic 5′ SS score to the authentic 5′ SS score) was also tested, and the result remained consistent (Supplementary Table S4). The quantitative comparison between the scoring methods also showed that the SPO algorithm had the best prediction performance (Figure 3). In addition, the proposed SPO algorithm correctly predicted 166/202 = 82.2% of point mutation cases, 8/9 = 88.9% of deletion cases, 2/3 = 66.7% of insertion cases and 1/1 = 100% of duplication cases. In the comparison with the other seven reported approaches (Table 2), the SPO algorithm yielded the highest accuracy for the identification of activated cryptic 5′ SSs in various mutant categories, especially in point mutation cases.
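The per-category percentages can be checked with a few lines of arithmetic. The counts are taken from the text, and the variable names are illustrative. The pooled rate computed at the end is derived here from those counts and is not a figure reported in the text.

```python
# (correctly predicted, total) activated cases per mutation category, from the text.
correct_and_total = {
    "point mutation": (166, 202),
    "deletion": (8, 9),
    "insertion": (2, 3),
    "duplication": (1, 1),
}


def percent(correct, total):
    """Percentage rounded to one decimal place."""
    return round(100 * correct / total, 1)


rates = {name: percent(c, t) for name, (c, t) in correct_and_total.items()}
total_correct = sum(c for c, _ in correct_and_total.values())
total_cases = sum(t for _, t in correct_and_total.values())

print(rates)
# The four categories add up to the 215 activated pairs in HMD1.
print(total_correct, total_cases, percent(total_correct, total_cases))
```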
To verify that the proposed SPO algorithm can identify cryptic 5′ SSs of various activation strengths, the HMD2 data set containing 52 data pairs from two competition assays (11) was used, including 12 strong, 26 intermediate and 14 weak 5′ SSs, according to their activation levels (11). In comparison with the other seven methods (Table 3), the SPO algorithm consistently achieved a high accuracy in all three groups and yielded the highest accuracy when the three groups of data were combined pairwise, as in Roca's study (11).
To correlate the strength of cryptic 5′ SS activation with the predicted ΔSPO scores, Pearson's coefficient (r value) was computed between these two variables by using the HMD2 test data (consisting of two competition assays, CS-I and CS-II, each of which included 26 authentic and cryptic 5′ SS data pairs). Table 4 summarizes the resulting r values for the ΔSPO, ΔMaxEnt, ΔMDD, ΔMM, ΔWMM, ΔS&S, ΔRi and ΔΔG scores; the SPO algorithm displayed a greater degree of correlation than the others. In particular, the SPO algorithm performed well for both the CS-I and CS-II assays, whereas the other seven methods all performed worse on the CS-I assay than on the CS-II assay. Wild types of authentic 5′ SSs were used in the CS-I assay, but weakened types of authentic 5′ SSs were used in the CS-II assay (11). In the in vitro experiments, the average activation of cryptic 5′ SSs was considerably stronger (P=6.13E−07) in the CS-II assay than in the CS-I assay. Therefore, cryptic 5′ SSs are activated more easily in the CS-II assay than in the CS-I assay. In summary, the SPO algorithm was able to correctly predict the activation of a cryptic 5′ SS and to infer the activation level from the increase of the ΔSPO score above its threshold. With this feature, it is reasonable to assess cryptic 5′ SS activation by ranking the ΔSPO scores when a number of splicing pairs are available for consideration. In other words, the SPO algorithm can be used to predict novel cryptic 5′ SSs, especially when sequencing data (such as RNA-seq data) are not available.
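For reference, Pearson's coefficient between predicted scores and measured activation strengths can be computed as in the sketch below. The function name and the example values are hypothetical. The formula is the standard sample correlation and is not code from the study.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical delta scores and measured activation strengths (percent cryptic-site use).
scores = [0.4, 1.3, 2.1, 2.9]
activation = [5.0, 30.0, 55.0, 90.0]
print(pearson_r(scores, activation))
```

An r value close to 1 would indicate that higher predicted scores go with stronger measured activation, which is the relationship summarized in Table 4.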
To analyze whether DNA structural profiles extracted from the hydroxyl radical cleavage pattern can improve the identification of activated cryptic 5′ SSs, the HMD1 and HMD2 data sets were used to estimate the effect of structural profiles. First, without the contribution of structural profiles, the identification of activated cryptic 5′ SSs from the HMD1 data set decreased by 7.9% in sensitivity, 4.4% in specificity, 5.9% in accuracy, 6.2% in F-measure, 6.2% in precision and 5.7% in AUC (relative to the result in Table 1). Second, without structural profiles, the SPO algorithm obtained a lower degree (82%) of correlation between the strength of cryptic 5′ SS activation and the ΔSPO score (relative to the result in Table 4), and its accuracy decreased to 0.865 for the analysis of the 52 data pairs from the HMD2 data set. Interestingly, the DNA structural profiles also improved the degree of correlation between the strength of cryptic 5′ SS activation and the score by 2, 2, 2, 4, 5, 6 and 1% for MaxEnt (16), MDD (17), MM (15), S&S (13), WMM (14), Ri (10) and ΔG (32), respectively, for the analysis of the HMD2 data set. The improvement for the seven methods was based on the use of the Sc value as a weight factor to multiply the original scores from these compared approaches. For example, an improved ΔMaxEnt score was defined as the ΔMaxEnt score multiplied by the Sc value. The scores for the other methods were improved by using the same strategy. These results indicate that DNA structural profiles derived from the hydroxyl radical cleavage pattern can improve the identification of activated cryptic 5′ SSs in human mutation cases.
Although the effect of DNA structural profiles was useful for identifying activated cryptic 5′ SSs, the detailed relationship between the DNA structural profiles and the cryptic 5′ SSs is unclear. One possible explanation could be that the changes of the DNA structural profiles at either the cryptic 5′ SS or the corresponding authentic 5′ SS may respond to the strength of cryptic 5′ SS activation. On the other hand, the changes of DNA structural profiles may be involved in non-intronic splicing mechanism when mutation occurs on the DNA level. Some non-intronic splicing information was assumed to play a vital role in shaping the split structure of eukaryote genes (7). Consequently, the DNA structural profiles may improve the identification of cryptic 5′ SSs in eukaryote genes.
This study proposes the SPO algorithm, which combines structural profiles with an odds measure to obtain the ΔSPO score for identifying activated cryptic 5′ SSs. Based on the results, the SPO algorithm identifies cryptic 5′ SSs more accurately than the other seven methods, and its ΔSPO score also provides information to estimate the inherent strength of 5′ SSs in human mutation data. In practical application, the SPO algorithm can be used as a tool for designing mutagenesis experiments for various splicing events and for studying the influence of activated cryptic 5′ SSs on amino acid changes in human hereditary diseases.
Supplementary Data are available at NAR Online: Supplementary Tables 1–4.
National Science Council of Taiwan (Grant No: NSC99-2627-M-001-005-MY3; 99-2621-B-001-005-MY2). Funding for open access charge: Biodiversity Research Center, Academia Sinica, Taiwan.
Conflict of interest statement. None declared.
We thank the anonymous reviewers for their comments. We also appreciate the experimental data provided by the studies of Roca, Rogan and Buratti.