Mutations at splice sites occur frequently and can activate so-called cryptic splice sites (1–3). Two typical examples are the human genes BRCA1 and BRCA2, which contain several intronic genetic variants (4); approximately 5% of these are associated with splice site mutations (4). These mutations can potentially activate cryptic 5′ splice sites (5′ SSs) (4), leading to cryptic splicing events. Such events are considered aberrant and often cause human hereditary diseases (2). Therefore, predicting the activation of cryptic 5′ SSs is essential for investigating human hereditary diseases.
Various approaches have been used to identify cryptic 5′ SSs. Recently, an EST-based method named cryptic splice finder (CSF) (7) used the spliced alignment of ESTs to identify cryptic splice sites. Although the CSF program is useful for investigating splicing mutations in genetic disease, it relies considerably on the availability of sufficient EST data and accurate genomic annotations. Another approach (8) used information content (Ri) to detect activated cryptic 5′ SSs in human genes. Ri is the dot product of a particular sequence vector and a weight matrix derived from the nucleotide frequencies at each splice site, and it is used to interpret mutated authentic splice sites and associated splicing regulatory sites (9). Although Ri provides useful information for analyzing nucleotide substitutions that potentially impair splicing, its identification of activated cryptic 5′ SSs was reported to be less accurate. Sahashi et al. recently used an improved Ri to estimate the splicing consequences of mutations at human 5′ SSs and found that Ri had low sensitivity in predicting splicing mutations. In addition to these sequence-based analyses, a thermodynamic inference scheme based on the binding free energy (ΔG), which reflects the stability of the RNA duplex between the 5′ SS and U1 snRNA, was proposed for 5′ SS selection (11). This method considered the effects of molecular structure and showed that ΔG may discriminate between strong and intermediate activation of cryptic 5′ SSs in competition assays. However, ΔG is considerably inaccurate in identifying the intrinsic strength of cryptic 5′ SSs (6). Recently, Buratti et al. collected 254 cryptic 5′ SSs that were activated by mutations in human disease genes and analyzed the mutation patterns and nucleotide structures in detail. They also evaluated the performance of several computational methods, including the Shapiro and Senapathy matrix (S&S) (13), the weight matrix model (WMM) (14), the first-order Markov model (MM) (15), the maximum entropy model (MaxEnt) (16) and the maximum dependence decomposition model (MDD) (17), in discriminating authentic and cryptic 5′ SSs. Buratti et al. (2007) concluded that most authentic 5′ SSs had prediction scores that were statistically higher than those of the cryptic 5′ SSs. Although most methods can locate splice sites by searching for specific sequence patterns, they do not address the differences between activated and inactivated splice sites. In other words, these methods cannot identify the activation of cryptic splice sites when the mutations do not change the prediction scores.
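To make the description of Ri as a dot product concrete, the following minimal Python sketch scores a 9-nt 5′ SS against a weight matrix built from aligned sites. It is an illustration under stated assumptions, not the implementation used in the cited studies: the weights take the simplified form 2 + log2 f(b, l) without the small-sample correction, a pseudocount is added to avoid log2(0), and the aligned sites, function names and pseudocount value are all hypothetical.

```python
import math

BASES = "ACGT"

def weight_matrix(aligned_sites, pseudocount=1.0):
    """Build weights w[l][b] = 2 + log2 f(b, l) from aligned sites.

    Simplifying assumptions: a pseudocount replaces the small-sample
    correction, and f(b, l) is the pseudocount-adjusted frequency of
    base b at position l.
    """
    length = len(aligned_sites[0])
    matrix = []
    for pos in range(length):
        counts = [pseudocount] * len(BASES)
        for site in aligned_sites:
            counts[BASES.index(site[pos])] += 1
        total = sum(counts)
        matrix.append([2.0 + math.log2(c / total) for c in counts])
    return matrix

def one_hot(seq):
    """Encode a sequence as one 0/1 vector over A, C, G, T per position."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq]

def individual_information(seq, matrix):
    """Ri in bits: dot product of the one-hot sequence and the weights."""
    return sum(
        x * w
        for row_x, row_w in zip(one_hot(seq), matrix)
        for x, w in zip(row_x, row_w)
    )

# Hypothetical toy alignment of 5' SSs (positions -3 to +6); not real data.
TOY_SITES = [
    "CAGGTAAGT", "AAGGTGAGT", "CAGGTAAGA",
    "GAGGTAAGG", "CAGGTGAGT", "AAGGTATGT",
]

matrix = weight_matrix(TOY_SITES)
ri_reference = individual_information("CAGGTAAGT", matrix)
ri_mutant = individual_information("CAGATAAGT", matrix)  # G>A at position +1
print(f"Ri reference = {ri_reference:.2f} bits, Ri mutant = {ri_mutant:.2f} bits")
```

In this toy setting the substitution at position +1 lowers Ri because the mutant base is rare at that position in the alignment. A score of this kind reflects sequence conservation only, which is consistent with the limitation noted above: a mutation that leaves the score unchanged cannot be flagged as activating a cryptic site.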
DNA molecules form complex structures and function by interacting with proteins, nucleic acids and other small regulatory molecules. To detect such interactions, hydroxyl radical cleavage patterns (18) have been widely used to monitor structural changes of DNA molecules at single-residue spatial resolution. For example, the hydroxyl radical cleavage pattern has been used to assess the structure of DNA molecules and their related biological regulation (20), especially the interactions in DNA–protein complexes (22–24). Recently, the hydroxyl radical cleavage patterns of DNA were found to be associated with context-dependent mutation rates in mammals (25) and the local sequence bias of human mutation (26). In addition, Parker et al. used ORChID (OH Radical Cleavage Intensity Database) (28) as genome-scale structural information to analyze the functional non-coding regions of the human genome. Their results indicated that single-nucleotide polymorphisms could induce larger structural changes in non-coding DNA, and that DNA structural changes may help to identify phenotype-associated mutations (27). Importantly, a recent report indicated that changes in the structural properties of the local DNA sequence can influence the choice of cryptic 5′ SSs when DNA variants occur in human disease genes (29). Therefore, it is crucial to understand how single base pair substitutions in the local DNA sequence context influence the mRNA splicing phenotype. These studies suggest that DNA structural change may be a crucial factor in cryptic 5′ SS activation in human hereditary diseases; therefore, we used the hydroxyl radical cleavage pattern as a structural feature to improve the prediction of cryptic 5′ SSs in human disease genes.
The preference of DNA-based mutation screening strategies (12) was used to investigate cryptic 5′ SSs in genetic diseases, and the feature was applied in a prediction tool (30). In fact, some signals in the local DNA sequence that may influence the choice of 5′ SSs have been tested as splicing features for 5′ SS prediction (31). To our knowledge, the association between DNA structure and the choice of cryptic 5′ SSs is rarely discussed, and a structure-based method for screening activated cryptic 5′ SSs in human disease genes is not available. In this study, an advanced structure-based method, named the structure profiles and odds measure (SPO) algorithm, was developed to quantitatively evaluate the activation of a cryptic 5′ SS in competition with its authentic 5′ SS. The SPO algorithm combines structural profiles with an odds measure to assess the activation likelihood of a cryptic 5′ SS. The results indicate that the SPO algorithm was more efficient than the other seven approaches, including S&S (13), WMM (14), MM (15), MaxEnt (16), MDD (17) and ΔG, in identifying an activated cryptic 5′ SS in competition with its paired authentic 5′ SS. In addition, the ΔSPO score from the SPO algorithm was more effective than the other scores in identifying the inherent strength of 5′ SSs in human disease genes.