During evolution from simple to higher eukaryotes, splicing signals evolved from well-defined motifs to degenerate sequences, with the addition of new auxiliary splicing sequences known as ESE and ESS. Although the major SR proteins have been cloned and their target sites determined, much work remains to be done to understand how splice signals are recognized and how splicing specificity is achieved. As this complex world is progressively revealed, bioinformatics resources could play a major role in helping researchers and diagnostic laboratories to evaluate the consequences of mutations on splicing, especially because most genetic tests use DNA and not RNA samples. By giving easy access to predictions of 5′ss, 3′ss and BP sequences as well as ESE and ESS, the HSF tool (http://www.umd.be/HSF/) fulfills this need and may assist clinicians, geneticists and researchers (70–75). By combining motifs identified with different experimental and computational approaches, it provides a common interface that can be used for sequence analysis. The inclusion of all exons and introns extracted from the Ensembl human genome database (20) allows easy access to any sequence of human genes and thus direct comparison of virtually every mutation or SNP concerning splicing elements. Since SNPs are present at a very high frequency in the genome (1/300 bp), it could be useful to evaluate their impact in association with a mutation. We therefore included in HSF data from dbSNP using Ensembl Biomart. The user can select the ‘Search for SNPs related to the analyzed sequence’ option, which automatically retrieves SNPs from the database. When SNPs are localized in exons, their effect on ESE and ESS motifs could help the user to better evaluate the consequence of a given mutation.
To evaluate the efficiency of the various algorithms included in HSF and its contribution to the prediction of the consequences of mutations associated with a splicing defect, we used a set of 69 intronic mutations that disrupt the 5′ss or the 3′ss and result in exon skipping and/or activation of a cryptic splice site (), and a group of 15 mutations that were previously reported to result in splicing defects by creating or activating cryptic splice sites (). HSF was able to correctly predict the disruption of the natural splice sites. Moreover, we could confirm that (i) mutations of the last nucleotide of an exon have a strong effect on the 5′ss (ΔCV = 12% ± 0.7), frequently resulting in exon skipping, partial exonic deletion or intronic retention due to activation of a cryptic splice site; (ii) mutations of the penultimate exonic nucleotide have limited consequences on the 5′ss (ΔCV = 5.4% ± 0.3), but they can activate a cryptic splice site, making predictions more difficult; (iii) exonic mutations distant from the 5′ and 3′ss can activate a cryptic splice site, leading to partial exonic deletion. Overall, these findings underline the efficiency of the HSF algorithm in predicting the effect of mutations on 5′ and 3′ss. When using the HSF algorithm, the threshold for 5′ and 3′ss is 65, with a pathogenic ΔCV of −10%, except for position +4 where it is −7%. However, in a few cases, when unusual splice sites are used, this algorithm could be less efficient.
BP sequences represent another essential splicing signal. When a mutation is localized just 5′ of the 3′ss, its potential effect on a BP sequence should be examined, especially when a nucleotide located less than 85 bp from the 3′ss is targeted. To evaluate the HSF algorithm dedicated to the identification of BP sequences, we used 14 BP sequences inactivated by intronic mutations (). HSF correctly predicted 13 out of 14 BPs, and these data allowed us to define the threshold for BP detection at 67 and the pathogenic ΔBP at −10%. Moreover, for intron 3 of XPC, HSF predicted a BP at position −24. However, according to Khan et al., two BP sequences are present in this intron, one at position −24 and another at −4. HSF could not predict the BP at position −4 because its algorithm excludes positions −12 to −1 from BP identification, owing to steric obstruction caused by the spliceosome.
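The BP rules given above can be summarized by the following sketch. It is an illustration under stated assumptions rather than the HSF code: the function names are hypothetical, positions are counted negatively from the 3′ss, ΔBP is assumed to be a relative change in percent, and the scores used in the usage note are invented. The threshold (67), the pathogenic ΔBP (−10%) and the excluded window (−12 to −1) are those stated in the text.

```python
BP_THRESHOLD = 67.0             # minimum score for BP detection (from the text)
PATHOGENIC_DELTA_BP = -10.0     # pathogenic delta-BP in percent (from the text)
EXCLUDED_WINDOW = range(-12, 0)  # positions -12..-1 are never scored


def is_bp_candidate(position: int, score: float) -> bool:
    """Return True if a position upstream of the 3'ss can be called as a BP.

    `position` is relative to the 3'ss and must be negative.
    """
    if position >= 0 or position in EXCLUDED_WINDOW:
        return False
    return score >= BP_THRESHOLD


def is_bp_disrupted(wild_type_score: float, mutant_score: float) -> bool:
    """Flag a BP as inactivated when the relative score change is <= -10%."""
    if wild_type_score <= 0:
        raise ValueError("wild-type score must be positive")
    change = (mutant_score - wild_type_score) / wild_type_score * 100.0
    return change <= PATHOGENIC_DELTA_BP
```

Under these rules, a call such as `is_bp_candidate(-4, 95.0)` returns `False` whatever the score, which mirrors why the second BP of XPC intron 3 is not reported, whereas `is_bp_candidate(-24, 80.0)` returns `True`.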
It has been demonstrated that two different splicing recognition mechanisms, correlated with intron length, can be used in a cell: exon definition for long introns and intron definition for short introns (77). Although the influence of intron length seems to be less important in humans than in other species, it should nevertheless be kept in mind, since U12- and U2-type introns have different BP consensus sequences. In the present version of HSF (v2.4), we focused only on U2-type introns, which are by far the most abundant type in mammalian cells.
Concerning cis-acting elements, many studies have been performed to define ESE and ESS matrices based on bioinformatics or experimental approaches (78). However, due to technical and/or conceptual bias, the various sequence sets share only partial homology. To solve this problem, HSF includes all available matrices in one place. In addition, we developed new matrices to predict ESE motifs for the 9G8 and Tra2-β SR proteins and ESS motifs for the hnRNPA1 ribonucleoprotein. ESE and ESS motifs frequently overlap, and therefore the identification of the specific motif/protein pair involved in a given splicing defect is difficult. This is even more complicated when considering the impact of SR and ribonucleoprotein concentration in different tissues or during development. We used a set of 20 exonic mutations known to influence splicing through ESE inactivation or ESS activation () to evaluate the efficiency of HSF in correctly predicting the motifs disrupted by these mutations. We showed that when the motif/protein pairs had previously been experimentally characterized (hnRNPA1 or SF2/ASF), HSF was able to correctly predict the effects of the mutation on ESE and ESS. For most mutations, however, only the general mechanism was identified (i.e. the mutant sequence inhibits splicing in various in vitro reporter systems) and therefore the motif/protein pair is unknown. In these cases, HSF predicted the disruption of ESE motifs and/or the creation of ESS motifs (). In addition, to evaluate the efficiency of HSF in discriminating true from false positive signals, we used a second group of positive and negative controls (Supplementary Table 1). We showed that both sets could be discriminated on the basis of their overall pattern (ESE, ESS, ESE+ESS; χ² = 0.0028). Three matrices also gave statistically significant results: ESE-Finder (χ² = 0.0067), 9G8 and Tra2-β from HSF (χ² = 0.0017) and PESE (χ²). Since these three matrices predict ESE motifs, these results could be associated with a bias towards the positive controls. Indeed, only a few experimental validations of auxiliary sequences are available, and they are frequently initiated by predictions of ESE motifs using ESE-Finder. PESE and the 9G8/Tra2-β HSF matrices gave stronger results than ESE-Finder itself and can therefore be considered efficient matrices for the identification of ESE motifs. However, predictions with other matrices, especially the hnRNPA1 matrix, should also be considered, as they could provide valuable information, as shown for the c.4250T>A mutation of DMD. We are still in the early days of ESE and ESS motif prediction, and further data are needed to select the best matrices and to define the rules for data interpretation, as most mutation sets used to validate prediction tools contain mainly mutations affecting splice sites (79). Major work is also needed to ultimately address tissue or developmental specificity.
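The comparison of wild-type and mutant sequences for lost ESE motifs and newly created ESS motifs, as discussed above, can be illustrated with a deliberately simplified sketch. It assumes that motifs are plain strings matched exactly, whereas HSF also relies on scoring matrices with thresholds; the function names are hypothetical and the motifs used in the usage note are placeholders chosen for illustration, not entries from any published set.

```python
def motif_hits(sequence: str, motifs: set[str]) -> set[tuple[int, str]]:
    """Return every (start index, motif) occurrence, overlaps included."""
    hits = set()
    for motif in motifs:
        start = sequence.find(motif)
        while start != -1:
            hits.add((start, motif))
            start = sequence.find(motif, start + 1)
    return hits


def compare_motifs(wild_type: str, mutant: str,
                   ese_motifs: set[str], ess_motifs: set[str]) -> dict:
    """List ESE hits lost and ESS hits gained in the mutant sequence.

    Only substitutions are handled: both sequences must have equal length
    so that hit positions remain comparable.
    """
    if len(wild_type) != len(mutant):
        raise ValueError("sequences must have the same length")
    ese_lost = motif_hits(wild_type, ese_motifs) - motif_hits(mutant, ese_motifs)
    ess_gained = motif_hits(mutant, ess_motifs) - motif_hits(wild_type, ess_motifs)
    return {"ese_lost": sorted(ese_lost), "ess_gained": sorted(ess_gained)}
```

For example, with the placeholder motif sets `{"GAAGAA"}` (ESE) and `{"TAGGGT"}` (ESS), comparing `"CCGAAGAACC"` with `"CCGAAGTACC"` reports one lost ESE hit at index 2 and no gained ESS hit. Because motifs frequently overlap, a single substitution can appear in both lists, which is one reason the motif/protein pair responsible for a defect is hard to assign.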
In conclusion, the HSF tool is dedicated to the prediction of splicing signals present in any human gene, using all available matrices to identify ESE and ESS and new matrices to evaluate 5′ and 3′ss and BPs. This tool is regularly updated to include new data from bioinformatics and experimental studies in order to improve predictions. Many users have already tested HSF and have stressed its value both for basic science (identification of splicing signals) and for applied research or diagnostics (prediction of the pathogenic consequences of a given mutation) (70–75). In addition, new genotype-based therapies, such as the exon-skipping approach in Duchenne Muscular Dystrophy, are currently being evaluated in clinical trials (international multi-center phase I/II clinical studies with PRO051 in patients with Duchenne Muscular Dystrophy; Prosensa company, http://prosensa.eu/). HSF might represent a useful tool to identify key splicing sequences in different exons (75) and therefore to design antisense oligonucleotides to induce exon skipping. This approach is being actively evaluated throughout the world, especially by the TREAT-NMD European network (http://www.treat-nmd.eu/home.php).
Besides these gene-specific approaches, global projects, which either aim at developing a holistic view on Genotype-To-Phenotype data (GEN2PHEN European projects; http://www.gen2phen.org/) or at improving health outcomes by facilitating the analysis of human genetic variation and its impact on human health, such as the Human Variome Project (81), might benefit from using HSF. Indeed, HSF could help to predict the theoretical impact on splicing of any sequence variation affecting a human gene.