Despite a growing number of splicing mutations found in hereditary diseases, utilization of aberrant splice sites and their effects on gene expression remain challenging to predict. We compiled sequences of 346 aberrant 5′ splice sites (5′ss) that were activated by mutations in 166 human disease genes. Mutations within the 5′ss consensus accounted for 254 cryptic 5′ss and mutations elsewhere activated 92 de novo 5′ss. Point mutations leading to cryptic 5′ss activation were most common in the first intron nucleotide, followed by the fifth nucleotide. Substitutions at position +5 were exclusively G>A transitions, which was largely attributable to high mutability rates of C/G>T/A. However, the frequency of point mutations at position +5 was significantly higher than that observed in the Human Gene Mutation Database, suggesting that alterations of this position are particularly prone to aberrant splicing, possibly due to a requirement for sequential interactions with U1 and U6 snRNAs. Cryptic 5′ss were best predicted by computational algorithms that accommodate nucleotide dependencies and not by weight-matrix models. Discrimination of intronic 5′ss from their authentic counterparts was less effective than for exonic sites, as the former were intrinsically stronger than the latter. Computational prediction of exonic de novo 5′ss was poor, suggesting that their activation critically depends on exonic splicing enhancers or silencers. The authentic counterparts of aberrant 5′ss were significantly weaker than the average human 5′ss. The development of an online database of aberrant 5′ss will be useful for studying basic mechanisms of splice-site selection, identifying splicing mutations and optimizing splice-site prediction algorithms.
Mutations that influence pre-mRNA splicing represent a substantial proportion of gene alterations leading to Mendelian disorders (1). cDNA-based mutation studies of disease genes that have a large number of introns showed that splicing mutations accounted for about half of mutated alleles (2,3). In contrast, estimates derived from DNA-based mutation screening designed to scan coding regions and flanking intronic sequences have generally been lower (1,4). As a significant fraction of mutated alleles in both recessive and dominant conditions has not been identified, and the availability of RNA samples from affected individuals and their families is often problematic, the overall contribution of intronic alterations acting at the level of pre-mRNA splicing could be substantial. In addition to single-gene disorders, DNA variants that influence splicing may modify the risk of developing complex diseases and their phenotypic manifestations, but the overall role of this variability in the pathogenesis of such conditions is only beginning to be explored (5–8).
The most common consequence of splicing mutations is skipping of one or more exons, followed by the activation of aberrant 5′ (donor) splice sites (5′ss), 3′ (acceptor) splice sites (3′ss) and full intron retention (1,9,10). Mutation-induced aberrant splice sites found in disease genes often involve disruption of the consensus sequence of the authentic sites, while activating a cryptic splice site nearby. However, aberrant splice sites can also be generated by mutations that create splice-site consensus sequences. As described earlier (11), we refer to these aberrant splice sites as cryptic and de novo, respectively, even though the distinction between cryptic and de novo sites may occasionally be vague, because disruption of the authentic site can also create a new splice site consensus.
Cryptic 3′ss are preferentially located in exons whereas de novo 3′ss usually reside in introns, which has been attributed to splicing signal sequences upstream of the 3′ss that are required for selection of acceptor sites, including the polypyrimidine tract (PPT) and the branch point sequence (BPS) (12). In contrast to cryptic 3′ss, cryptic 5′ss have a similar frequency distribution in exons and introns and their number decreases with increasing distance from the authentic 5′ss (11). The human 5′ss consensus sequence is MAG|GURAGU (M is A or C; R is purine), spanning from position −3 to position +6 relative to the exon–intron junction. This sequence is critical but often insufficient for accurate 5′ss recognition, and may require auxiliary sequences in both introns and exons. These sequences can repress or activate splicing and are referred to as splicing silencers or enhancers, respectively (13–17). The complementarity of the 5′ss consensus to the 5′ end of U1 small nuclear RNA (snRNA) exerts a dominant effect on 5′ss selection, but auxiliary sequences may exhibit a more prominent role in selection of competing 5′ss with lower base-pairing complementarity (18,19). In addition, the intrinsic structural properties of the RNA molecule may hinder 5′ss availability for basal splicing factors, thus controlling splicing efficiency (20–22). Moreover, 5′ss selection can also be influenced by the presence of sequence motifs specific for RNA-binding proteins (23) and by the rate at which the pre-mRNA is transcribed (21).
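As an illustration of the consensus notation used above, the following minimal sketch reports where a 9-nt candidate site departs from MAG|GURAGU. The function name `consensus_mismatches` and the example sequences are hypothetical and are not part of the study; the sketch only encodes the IUPAC codes M and R as defined in the text.

```python
# IUPAC codes needed for the 5'ss consensus MAG|GURAGU (M = A or C, R = A or G).
IUPAC = {"A": "A", "C": "C", "G": "G", "U": "U", "M": "AC", "R": "AG"}
CONSENSUS = "MAGGURAGU"                   # exon positions -3..-1, intron positions +1..+6
POSITIONS = [-3, -2, -1, 1, 2, 3, 4, 5, 6]


def consensus_mismatches(site):
    """Return the positions (relative to the exon-intron junction) at which
    a 9-nt site does not match the consensus. DNA input (T) is accepted."""
    rna = site.upper().replace("T", "U")
    if len(rna) != len(CONSENSUS):
        raise ValueError("expected a 9-nt sequence spanning positions -3 to +6")
    return [
        pos
        for pos, base, code in zip(POSITIONS, rna, CONSENSUS)
        if base not in IUPAC[code]
    ]


print(consensus_mismatches("CAGGTAAGT"))  # [] : matches the consensus everywhere
print(consensus_mismatches("CAGGTAAAT"))  # [5] : a G>A change at position +5
```

A match to the consensus says nothing about auxiliary elements or RNA structure, which, as noted above, also influence 5′ss selection.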
A variety of methods have been used to computationally predict the 5′ss strength and recognition, including nucleotide frequency matrices (24,25), machine-learning approaches and neural networks (NNs) (26,27) and methods employing putative base-pairing interactions of 5′ss with U1 snRNP (28–30) and interdependence between adjacent or more distant positions of the splicing consensus sequences (31). Exon-prediction algorithms that take into account protein-coding information may perform better than those that rely only on signals present in the splice sites (32). However, it is unknown which models best predict the localization of cryptic or de novo 5′ss that were activated in vivo.
In the present study, we compiled nucleotide sequences of cryptic and de novo 5′ss that have been reported previously in human disease genes since the first description of disease-causing aberrant splice sites (33–35). This resource is being made available to the public through an online retrieval and submission tool termed DBASS5 (database of aberrant 5′ splice sites). In addition, we provide a detailed characterization of the underlying mutation pattern, a comparison of the nucleotide composition of aberrant and corresponding authentic 5′ss, and we evaluate the performance of computational tools that predict their utilization.
Aberrant 5′ss were identified by searching home pages of peer-reviewed journals and PubMed (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi). They were included in the database and selected for further analysis if (i) they resulted from disease-causing or -predisposing mutations or variants in human genes; (ii) aberrant RNA products spliced to new 5′ss were verified by nucleotide sequencing; and (iii) their sequences or reliable identifiers were published in peer-reviewed journals between 1981 and January 2007. We also included 22 cases of aberrant 5′ss that were confirmed by minigene assays with wild-type and mutated reporter constructs transfected into mammalian cells, but from which patients’ RNA samples were not available. These criteria were similar to those used for a recently published analysis of aberrant 3′ss (36).
Aberrant 5′ss were manually validated by mapping the information in the literature to sequences in the Human Genome Project databases. Nucleotide sequences of authentic, mutated and aberrant 5′ and 3′ss are available online in the Database of Aberrant Splice Sites http://www.dbass.org.uk/, which consists of the recently described DBASS3 (36) and the newly developed DBASS5.
Validated sequences of aberrant and corresponding authentic 5′ss were used as input files for seven publicly available splice-site prediction algorithms. The Shapiro and Senapathy (S&S) matrix is based on nucleotide frequencies of 5′ss and assumes independence between individual positions of the 9-nt consensus (24,25). The S&S matrix scores were computed using an online tool available at http://ast.bioinfo.tau.ac.il/SpliceSiteFrame.htm. To take into account known dependencies between adjacent and non-adjacent positions of the 5′ss consensus, the compiled sequences were analysed using the first-order Markov model (MM) and the maximum entropy (ME) model (31). The former method considers dependencies between adjacent positions, whereas the latter model approximates short-sequence motif distributions with the ME distribution and may include dependencies between non-adjacent as well as adjacent positions. The maximum dependence decomposition model (MDD) is a decision-tree approach that accentuates the strongest dependencies in the early branches of the tree (37). The MM, ME and MDD scores, together with scores from the weight-matrix model (WMM), which extracts single-nucleotide probabilities for each position from a training set (38), were computed using online tools at http://genes.mit.edu/burgelab/maxent/Xmaxentscan_scoreseq_acc.html. The HBond score, which analyses individual hydrogen-bonding patterns to the U1 snRNA 5′ end irrespective of nucleotide frequencies and assumes that the threshold values for U1 snRNP binding are influenced by specific SR proteins (29), was computed using a web application available at http://www.uni-duesseldorf.de/rna/html/hbond_score.php. The NN algorithm is a machine-learning approach that recognizes sequence patterns once it is trained with DNA sequences encompassing authentic splice sites (27). We used the NN splice site predictor NNSPLICE (v. 0.9) at http://www.fruitfly.org/seq_tools/splice.html.
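The difference between a weight-matrix model, which treats positions as independent, and a first-order Markov model, which conditions each position on its left neighbour, can be sketched as follows. This is a toy illustration under stated assumptions and not the implementation behind the online tools cited above: the training sequences are invented, a uniform background of 0.25 and a pseudocount of 1 are arbitrary choices, and all function names are hypothetical.

```python
import math
from collections import Counter

BASES = "ACGT"


def train_wmm(sites, pseudo=1.0):
    """Per-position base probabilities, assuming independent positions."""
    model = []
    total = len(sites) + pseudo * len(BASES)
    for i in range(len(sites[0])):
        counts = Counter(s[i] for s in sites)
        model.append({b: (counts[b] + pseudo) / total for b in BASES})
    return model


def train_markov(sites, pseudo=1.0):
    """First column as in the WMM, then P(base at i | base at i-1)."""
    first = train_wmm([s[:1] for s in sites], pseudo)[0]
    transitions = []
    for i in range(1, len(sites[0])):
        pair_counts = Counter((s[i - 1], s[i]) for s in sites)
        prev_counts = Counter(s[i - 1] for s in sites)
        table = {}
        for prev in BASES:
            denom = prev_counts[prev] + pseudo * len(BASES)
            for cur in BASES:
                table[(prev, cur)] = (pair_counts[(prev, cur)] + pseudo) / denom
        transitions.append(table)
    return first, transitions


def score_wmm(model, site, background=0.25):
    """Log2-odds score under the independence assumption."""
    return sum(math.log2(col[b] / background) for col, b in zip(model, site))


def score_markov(model, site, background=0.25):
    """Log2-odds score using adjacent-position dependencies."""
    first, transitions = model
    score = math.log2(first[site[0]] / background)
    for i, table in enumerate(transitions, start=1):
        score += math.log2(table[(site[i - 1], site[i])] / background)
    return score


# Invented 9-nt training sites (positions -3 to +6), for illustration only.
toy_sites = ["CAGGTAAGT", "AAGGTGAGT", "CAGGTAAGA", "AAGGTGAGG", "CTGGTAAGT"]
wmm = train_wmm(toy_sites)
mm = train_markov(toy_sites)
for site in ("CAGGTAAGT", "CCCGTCCCC"):
    print(site, round(score_wmm(wmm, site), 2), round(score_markov(mm, site), 2))
```

The ME and MDD models extend this idea to non-adjacent dependencies and are not reproduced here.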
The free energy (ΔG) of predicted 5′ss/U1 base-pairing was computed using OligoArrayAux (39), which is available at http://www.bioinfo.rpi.edu/applications/hybrid/Oligo-ArrayAux.php. Finally, the number of H bonds (#H) between 5′ss and U1 was computed using a web tool at http://ast.bioinfo.tau.ac.il/SpliceSiteFrame.htm.
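To make the notion of 5′ss/U1 complementarity concrete, the sketch below counts how many of the nine consensus positions can pair with the 5′ end of U1 snRNA. It is a simplification and is not the algorithm of the web tools cited above: modified nucleotides in U1 are treated as plain U, bulged or shifted registers are ignored, and counting G·U wobble pairs is an optional assumption. The names `U1_ALIGNED` and `paired_positions` are hypothetical.

```python
# U1 snRNA nucleotides facing 5'ss positions -3..+6, written so that
# index i of the splice site is opposite index i of this string.
U1_ALIGNED = "GUCCAUUCA"
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}


def paired_positions(site, allow_wobble=False):
    """Count positions of a 9-nt 5'ss that can base-pair with U1 in the
    canonical register. DNA input (T) is accepted."""
    rna = site.upper().replace("T", "U")
    if len(rna) != len(U1_ALIGNED):
        raise ValueError("expected a 9-nt sequence spanning positions -3 to +6")
    allowed = WATSON_CRICK | WOBBLE if allow_wobble else WATSON_CRICK
    return sum((s, u) in allowed for s, u in zip(rna, U1_ALIGNED))


print(paired_positions("CAGGTAAGT"))                     # 9
print(paired_positions("CAGGTGAGT"))                     # 8
print(paired_positions("CAGGTGAGT", allow_wobble=True))  # 9
```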
To compare the strength of aberrant or authentic 5′ss with a large number of human 5′ss, we used the sequences of 8415 5′ss reported previously (31). The non-parametric Wilcoxon–Mann–Whitney rank test (Stat-200, v. 2.01, Biosoft Ltd., UK) was employed to test the significance of score differences between authentic and aberrant 5′ss in each category.
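For readers without access to the statistical package named above, a rank test of this kind can be sketched in a few lines. The version below is a minimal sketch that uses the normal approximation with a tie correction and no continuity correction, so its P-values are only approximate for small samples and need not match those of the software used in the study. The function name and the example scores are hypothetical.

```python
import math


def mann_whitney_u(x, y):
    """Two-sided Wilcoxon-Mann-Whitney test.
    Returns (U statistic for x, approximate P-value)."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n1, n2 = len(x), len(y)
    n = n1 + n2
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0          # mean of ranks i+1 .. j
        t = j - i                             # size of this group of ties
        tie_term += t ** 3 - t
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u, 1.0
    z = (u - mean_u) / math.sqrt(var_u)
    return u, math.erfc(abs(z) / math.sqrt(2))


# Invented scores for illustration only.
authentic = [8.9, 7.4, 9.6, 8.1, 10.2, 7.7]
cryptic = [5.2, 6.8, 4.1, 7.0, 3.9, 6.1]
u, p = mann_whitney_u(authentic, cryptic)
print(u, round(p, 4))
```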
DBASS5 (database of aberrant 5′ splice sites) is an online retrieval and submission tool for mutation-induced aberrant 5′ss available at http://www.dbass.org.uk/5/, complementing a recently described sister database of aberrant 3′ss, termed DBASS3 (36). The web application was created using the Microsoft ASP and ASP.Net server technology (http://www.asp.net), and Microsoft SQL Server database software (http://www.microsoft.com/sql/). In addition to aberrant 5′ss induced by disease-associated germline and somatic mutations, DBASS5 contains naturally occurring DNA variants that were shown to modify both the relative expression of RNA products spliced to alternative 5′ss and the disease predisposition. Polymorphisms that control exon skipping levels or full intron retention events have not been included in DBASS5.
An exhaustive search for previously published aberrant 5′ss identified 305 unique aberrant 5′ss in 166 genes (Table 1). They were generated by a total of 26 deletions/duplications, 3 insertions, 2 complex alterations and 315 point mutations (Table 2). These alterations were described in a total of 264 publications.
The number of reported cryptic 5′ss was almost three times higher than the number of de novo 5′ss (Table 1). Cryptic 5′ss were usually activated by single-nucleotide substitutions of guanosine (G) residues, which were ~3 times more common than mutations of the remaining nucleotides (177 versus 57, P < 10⁻¹⁵, Table 2). Conversely, substituting adenosines accounted for almost every other point mutation. Among single-nucleotide substitutions leading to de novo 5′ss, cytosine was the most frequently mutated nucleotide (32/81, 40%). In contrast, no de novo 5′ss have thus far been reported to be created by a point mutation introducing cytosine (Table 2).
The overall distribution of unique point mutations within the 9-nt consensus sequence was highly non-random both for cryptic and de novo 5′ss (Figure 1). For cryptic 5′ss, point mutations were most common at the highly conserved position +1 relative to the natural exon/intron junctions (39.4%). Interestingly, the second most frequently mutated position was the fifth intron nucleotide (21.6%), followed by positions +2 (14.7%) and –1 (14.3%). Point mutations at positions +3, +4, +6 and −2 each accounted for <3% of all the single-nucleotide substitutions. In contrast to cryptic 5′ss, the most frequent point mutations resulting in de novo 5′ss were at the highly conserved first (28%) and second (34%) intron nucleotides (Figure 1). Single-nucleotide substitutions at position +5 were found only in 5/81 (6%) unique de novo 5′ss as opposed to 50/234 unique cryptic 5′ss (χ² = 8.6, P = 0.003). The ratio of point mutations in cryptic over de novo 5′ss in the authentic and new 9-nt consensus, respectively, was highest for position +5 (10.0), followed by position −1 (5.5) and +1 (4.1), with an average ratio for all positions of 2.9.
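The comparison of position +5 mutations in de novo versus cryptic 5′ss is a 2×2 contingency test and can be checked with a short script. The sketch below assumes a Pearson χ² with Yates continuity correction and 1 degree of freedom; the text does not state which variant was used, but under this assumption the counts 5/81 and 50/234 give a statistic that rounds to the reported 8.6. The function name `chi2_2x2` is hypothetical.

```python
import math


def chi2_2x2(a, b, c, d, yates=True):
    """Chi-square test for the 2x2 table [[a, b], [c, d]].
    Returns (statistic, P-value for 1 degree of freedom)."""
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        diff = max(0.0, diff - n / 2.0)       # continuity correction
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    chi2 = n * diff ** 2 / denom
    # Survival function of the chi-square distribution with 1 df.
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p


# Rows: de novo and cryptic 5'ss; columns: mutated at +5, mutated elsewhere.
chi2, p = chi2_2x2(5, 81 - 5, 50, 234 - 50)
print(round(chi2, 1), p)
```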
The overall proportion of point mutations in patients with aberrant 5′ss that created the 5′GT consensus was ~55% (Table 2). Newly created 5′GT dinucleotides were utilized by the spliceosome in 100% of the observed cases. In contrast, although mutations generating 3′AG dinucleotides found in individuals with aberrant 3′ss are also present in about half of the cases, only ~95% are used in vivo, owing to the presence of ‘AG exclusion zones’ downstream of the BPS (36).
Tables 3 and 4 show the breakdown of point mutations by nucleotide and by highly conserved positions of the 5′ss consensus. Transitions (R-to-R or Y-to-Y, Y is pyrimidine), which account for 62.5% of point mutations in human disease genes (40), were found in 58.7% of cases (Table 3). Comparison of mutations in highly conserved positions of the 5′ss consensus with those expected based on previously published mononucleotide mutation rates corrected for a number of confounding effects (40) suggested that the biased distribution is unlikely to be fully explained by differential mutability (Table 4; P = 0.002 and 4.3 × 10⁻⁷ for position +1 and +2, respectively). However, comparison with the published dinucleotide rates that take into account nearest-neighbour effects no longer showed a significant P-value for position +1, consistent with a severe block of splicing following mutations to any nucleotide (41). Nevertheless, the distribution of point mutations at position +2 was still unlikely to be fully explained by differential mutabilities (Table 4, P = 0.035), raising the possibility that the observed under-representation of +2C/A among cryptic 5′ss may be attributed to higher residual levels of accurately spliced pre-mRNAs with 5′GC or 5′GA dinucleotides. This would be consistent with a previously observed +2T>+2C>+2A>+2G hierarchy in splicing efficiency (42,43) and with efficient recognition of the 0.56% of mammalian introns that have 5′GC-3′AG splice sites (44). Finally, the distribution of point mutations at positions +1 and +2 and that for all splicing mutations in the Human Gene Mutation Database (HGMD, http://www.hgmd.cf.ac.uk/ac/index.php) were not significantly different (P = 0.77 for position +1 and P = 0.15 for position +2, Table 4). This suggests that the mutation spectrum of the 5′GT dinucleotide is similar for aberrant 5′ss and exon skipping events, which represent the bulk of HGMD entries.
Interestingly, all point mutations at position +5 of authentic 5′ss that activated cryptic 5′ss were substitutions of G, and not any other nucleotide (Figure 2), raising the possibility that 5′ss with +5G are more susceptible to aberrant splice-site activation than 5′ss with +5H (non-G). However, assuming ~78% occupancy of this nucleotide in human 5′ss (30) and a G/C substitution rate of ~70% derived from the HGMD data (40), the expected number of +5H substitutions among authentic sites whose mutation induces cryptic 5′ss activation would only be ~4 in our dataset and not significantly different from zero (P = 0.1, Fisher's exact test). A prominent influence of differential mutability rates on the mutation spectrum was also supported by the observed predominance of +5G>A transitions over transversions (Table 4). In addition, the distribution of point mutations activating cryptic 5′ss was significantly different from that resulting in de novo sites (P < 0.0001, Figure 1), with the latter showing peaks in the most conserved positions +1 and +2 and exclusive +5A>G transitions relative to new 5′ss that were located either in exons (45) or in introns (46–48). However, unique point mutations in the 9-nt consensus logged in the HGMD (4) and in our sample (Figure 1) had significantly different distributions (χ² = 27.7, P = 0.0005), with position +5 clearly overrepresented in our dataset (~22% versus ~12%). This suggests that the mutation spectra underlying cryptic 5′ss activation and exon skipping events are distinct.
A search for literature reports of point mutations that produce either aberrant 5′ss activation or exon skipping in the same 5′ss consensus revealed several discordant cases. For example, the FBN1 substitution IVS46+5G>A resulted in cryptic 5′ss activation 33-nt downstream of the authentic exon–intron junction (49), whereas the IVS46+1G>A mutation caused exon 46 skipping (50). Similarly, the PTEN mutation IVS7+1G>A activated a cryptic 5′ss 75-nt downstream of the authentic exon–intron boundary (51), whereas mutation IVS7+2T>G in the same 5′ss led exclusively to exon 7 skipping (52). In the latter case, IVS7+2T>G creates a putative splicing silencer containing the AGGG motif, which may prevent activation of the downstream cryptic 5′ss, whereas IVS7+1G>A results in no consecutive Gs in the 5′ss consensus.
The presence of IVS+5H in authentic 5′ss, which is not predicted to base-pair with U1 or U6 snRNAs, was proposed to be compensated by having G at the last exon position (–1G) (53). The –1G can base-pair to U1 snRNA (30) and is almost completely conserved (97.5%) in IVS+5H 5′ss (53). The +5/−1 association was confirmed with a large sample of homologous human-mouse 5′ss (30). In our dataset, only 18/35 (51%) of unique authentic 5′ss that were repressed by mutations of IVS+5G in favour of a cryptic 5′ss had –1G. This percentage is significantly (χ² = 10.9, P = 0.001) lower than for a large set of authentic 5′ss (5142/6716, ~77%). In addition to position –1, adenosine –2 was less frequent in our sample (31%; 11/35) as compared with 57% in average 5′ss (3830/6716, P = 0.002), while the number of uracils at position +6 was higher (25/35; 71% versus 3415/6716; 51%; χ² = 5.1, P = 0.02). These results are consistent with previously described +5 dependencies (30,53) and suggest that authentic 5′ss that are susceptible to IVS+5 mutations are less likely to make sufficient contacts between positions –1/–2 and their interacting factors, but may exhibit stronger putative base-pairing interactions between U1/U6 snRNAs and intron position +6.
Figure 3 shows the relative representation of each nucleotide in the consensus sequence of aberrant 5′ss (upper panels) and the corresponding authentic sites (lower panels). The consensus sequence of cryptic 5′ss had lower proportions of conserved residues than for authentic 5′ss at each position, except for the invariant position +1 (Figure 3A). This difference was much reduced in de novo 5′ss, in which conserved nucleotides at positions +3 through +6 had even higher frequencies than those in their authentic counterparts (Figure 3B). Sequence alignments of cryptic and de novo sites generated in exons and introns are shown in Supplementary Figures 1–3 together with their authentic counterparts.
Apart from ΔG and #H between 5′ss and U1 snRNA, we used seven different algorithms that predict utilization of 5′ss in multiple sequences and are publicly available (Figure 4A,B). Cryptic 5′ss had significantly lower scores with each algorithm, lower #H and higher ΔG than their authentic, wild-type counterparts. Cryptic 5′ss were most effectively discriminated from the authentic sites by the ME model, followed by MDD and MM algorithms. P-values obtained for the HBond and NN scores were higher, even when we disregarded cryptic 5′ss with non-canonical 5′ss dinucleotides to obtain the scores and replaced them with group means. All these models clearly outperformed the matrix-based prediction scores—S&S and WMM. The #H and ΔG values gave the poorest, albeit still significant discrimination. The weakness of cryptic 5′ss was well illustrated by a shift of the #H peak frequency from seven in the authentic counterparts to six in the cryptic 5′ss (Figure 4C).
In contrast to cryptic 5′ss, de novo 5′ss were not distinguished from their authentic counterparts by any of the tested algorithms (Figure 4B). Although the number of de novo 5′ss was smaller than cryptic 5′ss (Table 1), random selections of the same number of cryptic 5′ss and their comparison with authentic sites gave consistently significant discrimination with several algorithms (data not shown), indicating that computational prediction of de novo 5′ss is poor. However, newly created 5′ss activating pseudoexons had higher ME scores than the remaining de novo 5′ss (8.66 ± 3.00 versus 6.07 ± 4.83, P = 0.0002) or the remaining intronic de novo 5′ss (Table 5, P = 0.002). The corresponding 3′ss of these pseudoexons were slightly stronger than intronic de novo 3′ss (ME scores 6.79 ± 3.39 versus 5.24 ± 4.50, P = 0.04) ascertained previously (36), but were not significantly different from exonic de novo 3′ss or their authentic counterparts. Thus, activation of cryptic exons through de novo 5′ss use requires their high strength and may be facilitated by intrinsically stronger decoy 3′ss across the newly formed exon.
We then tested each computational method separately for aberrant 5′ss in exons and introns (Table 5). Although cryptic 5′ss in exons were best discriminated by the ME scores, the lowest P-values for cryptic 5′ss in introns were achieved by the NN model. To test whether this could be explained by having to disregard 5′GC splice sites for the NN method in both datasets and replace them by group means, we recalculated the NN and ME scores after removing 5′GC splice sites, but we obtained a similar result (P = 1.2 × 10⁻¹² versus 1.0 × 10⁻¹⁰, respectively). Authentic counterparts of intronic de novo 5′ss were intrinsically weak and therefore less likely to challenge newly created competitors. However, this was not evident for exonic de novo sites, strongly suggesting that their activation is more reliant on splicing regulatory sequences in exons than on the intrinsic strength of the 5′ss consensus (Table 5). The overall performance of ME, MDD, MM, HBond and NN models for the whole set of aberrant 5′ss was very similar, with minimal differences in P-values. Finally, mutated authentic 5′ss were on average weaker than cryptic 5′ss, confirming an earlier observation (11). Again, the lowest P-values of the non-parametric test were observed for the ME model (Table 5 and data not shown). Thus, as shown for 3′ss (36), the ME algorithm discriminated best both wild-type and mutated authentic 5′ss from cryptic 5′ss (Table 5), providing a method of choice for computational prediction of aberrant splice sites.
Next, we carried out pair-wise comparisons of cryptic and de novo 5′ss with their authentic counterparts. For each computational algorithm, we determined the proportion of aberrant 5′ss that showed equal or higher scores than their respective wild-type authentic sites (Figure 4D). This proportion was on average significantly higher for de novo 5′ss than for cryptic 5′ss and roughly reflected the ability of each method to discriminate between aberrant 5′ss and their authentic counterparts. The percentage of exonic cryptic 5′ss with equal or higher scores than their authentic counterparts was lowest for the ME algorithm (10.5%). For intronic cryptic 5′ss, the same proportion was lowest for the NN method (14.4%, Figure 4D). Using the best-performing algorithms, ~12.3% of cryptic 5′ss were computationally stronger than their wild-type authentic counterparts, yet they were used in vivo only if the wild-type 5′ss consensus was inactivated or weakened by mutation. This underscores the importance of factors that repress utilization of decoy splice sites that are present in excess over natural sites in the genome.
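The pair-wise measure described above reduces to counting the pairs in which the aberrant site scores at least as high as its authentic counterpart. A minimal sketch, with a hypothetical function name and invented scores:

```python
def fraction_not_weaker(pairs):
    """pairs: iterable of (aberrant_score, authentic_score) tuples.
    Returns the fraction of pairs in which the aberrant 5'ss scores
    equal to or higher than its wild-type authentic counterpart."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no score pairs supplied")
    hits = sum(1 for aberrant, authentic in pairs if aberrant >= authentic)
    return hits / len(pairs)


# Invented (aberrant, authentic) score pairs, for illustration only.
toy_pairs = [(5.1, 8.4), (9.0, 7.2), (6.3, 6.3), (2.8, 10.1)]
print(f"{100 * fraction_not_weaker(toy_pairs):.1f}%")  # 50.0%
```

Applying the same function to the scores from each algorithm in turn would yield one proportion per method, of the kind compared in Figure 4D.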
Importantly, the authentic counterparts of cryptic 5′ss were significantly weaker than a large collection of 8415 natural 5′ss (31), with the ME scores of 7.75 ± 2.50 and 8.37 ± 2.08, respectively (P = 2 × 10⁻⁶; Wilcoxon–Mann–Whitney rank test). The distribution of #H in the authentic counterparts of cryptic 5′ss and natural 5′ss was also significantly different (P = 0.02, χ² = 13.5, 5 df for 4–9 #H), with a maximum difference at 8 #H (Figure 4C). The relative weakness of the authentic counterparts of cryptic 5′ss is consistent with the notion that mutations in less conserved positions of stronger 5′ss produce, on average, higher amounts of natural transcripts and less severe phenotypes than identical alterations in intrinsically weaker 5′ss.
The predicted strengths of authentic sites that were mutated at position +5 were significantly lower than the average authentic 5′ss (ME scores 7.55 ± 1.81 versus 8.37 ± 2.08, P = 0.0002) and also somewhat lower than the authentic counterparts of all unique cryptic 5′ss in our dataset (7.55 ± 1.81 versus 7.75 ± 2.52), despite all having +5G and a higher than average relative frequency of +6T (Figure 2). Guanine at position +5 was proposed not to be obligatory for 5′ss selection if the two preceding positions are purines (28); nevertheless 24/35 (69%) of unique authentic 5′ss with point mutations of +5G had only purines at positions +3 and +4 and only a single authentic counterpart had pyrimidines at both positions.
Figure 5 shows a comparison of the ME scores of cryptic and de novo 5′ss by mutated position in the authentic and new 5′ss consensus, respectively. Cryptic 5′ss had similar ME values irrespective of the location of the point mutation (P > 0.05, F-test). In contrast, de novo sites created by mutations in a subset of intronic positions of the new 5′ss consensus tended to be stronger, with statistically significant differences between de novo and cryptic 5′ss for the highly conserved positions +1 and +2. In addition, we compared the ME scores of aberrant 5′ss with their respective counterparts (Figure 5). The authentic counterparts of cryptic 5′ss induced by point mutations at positions +2 and +5 of natural sites were weaker than the authentic counterparts of cryptic 5′ss activated by substitutions at position +1 (7.24 ± 1.93 for position +2 and 7.22 ± 2.06 for position +5 versus 8.34 ± 2.21 for position +1, P < 0.01 for both comparisons). This is consistent with the notion that mutations at less conserved positions of authentic 5′ss are less likely to completely inactivate the 5′ss and result in recognizable phenotypes than mutations at position +1. The authentic counterparts of de novo sites induced by mutations at position +1 were significantly weaker than the authentic counterparts of cryptic 5′ss induced by mutations at the same position (6.33 ± 3.39 versus 8.34 ± 2.21). The number of mutations for the remaining positions of the 5′ss consensus was too small for meaningful comparisons. The average intrinsic strength of aberrant and authentic 5′ss in each category is schematically summarized as the mean ME score in Figure 6.
Taken together, cryptic 5′ss generated in vivo were best predicted by models that accommodate nucleotide dependencies in the 5′ss, particularly by the ME algorithm, which takes into account non-adjacent positions (Figure 4A). Discrimination of exonic cryptic 5′ss from their authentic counterparts was more efficient than that for intronic cryptic 5′ss, because the former category of aberrant 5′ss was weaker than the latter (P = 0.02), for which the NN model gave the best performance (Table 5). Computational discrimination of de novo 5′ss and their authentic counterparts was poor (Figure 4B) as de novo 5′ss were, on average, stronger than cryptic 5′ss (Table 5), particularly when generated by point mutations in highly conserved intronic positions of the new 5′ss consensus (Figure 5). The intrinsic strength of exonic de novo 5′ss could not be distinguished from their authentic sites at all, pointing to the importance of exonic regulatory sequences in their selection. Finally, the authentic counterparts of aberrant (both cryptic and de novo) 5′ss were weaker than a large collection of human 5′ss, highlighting the practical importance of ranking splice sites in human disease genes using efficient computational tools. We propose that their systematic categorization may facilitate identification of intronic mutations or polymorphisms that affect pre-mRNA splicing, improve the interpretation of unknown alterations and, ultimately, increase the cost-effectiveness of mutation screening.
The DBASS5 (http://www.dbass.org.uk/5) provides access to the database of aberrant 5′ss through the search option (Supplemental Figure 4A). DBASS5 can be searched by phenotype, gene, mutation, location of aberrant 5′ss and their distance from authentic 5′ss. If more than one database entry is found, the user can manually choose the details page (Supplemental Figure 4B), which shows nucleotide sequences flanking the authentic and aberrant 5′ss, the estimated strength of both authentic and aberrant 5′ss and literature references with a PubMed hyperlink. DBASS5 visitors can register to obtain regular updates by email and can submit published data through a submission tool. Potential applications of DBASS5 include optimization of splice-site prediction algorithms, leading to improved prediction of aberrant 5′ss, identification of genes and gene segments frequently involved in aberrant splice-site activation, detection of splicing mutations in a gene or phenotype of interest and selection of in vitro models for studying basic mechanisms of 5′ss utilization.
This report presents the first comprehensive and publicly available database of aberrant splice sites in human disease genes. Together with a recently described database of aberrant 3′ss (36), this combined resource now contains over 600 unique mutations that create or activate a total of 562 aberrant splice sites.
The overall number of reported aberrant 5′ss was higher than aberrant 3′ss, consistent with sequence limitations imposed by additional signal sequences upstream of 3′ss (BPS and PPT) that are important for recognition of splice acceptor sites. The relative ratio of non-repetitive aberrant 5′ss (n = 305) and 3′ss (n = 257) [(36) and I.V., unpublished data] was smaller than that reported for unique splicing mutations in the HGMD that were arbitrarily selected to reside in 5 exonic and 15 intronic nucleotides adjacent to natural splice sites (4), i.e. 1.2 versus 1.5, respectively. The lower ratio might reflect a reporting bias towards mutations closer to authentic splice sites for exon skipping events. Mutations located upstream of intronic splicing signals that are required for 3′ss selection could not be detected in many published mutation reports, because these regions were amplified only for a subset of introns or were not scanned at all. In addition, the lower ratio could be due to an under-representation of mutations leading to de novo splice sites in the HGMD as compared to our dataset. Also, the availability of suitable decoy splice sites near mutated sites is likely to determine if the outcome of a splicing mutation is exon skipping or aberrant splice-site activation (4).
The higher number of cryptic than de novo 5′ss (Table 1) can, probably to a large extent, be explained by a detection bias of DNA-based mutation screening, a method used to identify most aberrant 5′ss in this dataset, towards coding regions and flanking intronic sequences. As explained above, classification of aberrant 5′ss as cryptic and de novo 5′ss may occasionally be vague, but DBASS5 contains only two ambiguous examples (54,55). Both cases were induced by G+1-to-T+1 substitutions in 5′ss that had G at position −1, creating a new 5′GT 1-nt upstream of the authentic 5′ss. Both cases were classified as cryptic 5′ss in our analysis. The rarity of such cases confirms the validity of the previously proposed (11) categorization of aberrant 5′ss.
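As a minimal illustration of the two ambiguous cases, the sketch below applies a G>T substitution at intron position +1 of a 5′ss that has G at exon position −1, and shows that the GT dinucleotide shifts 1 nt upstream. The sequence and the helper name `gt_positions` are hypothetical and chosen only for illustration; they are not taken from DBASS5 or from the cited reports.

```python
def gt_positions(seq: str) -> list[int]:
    """Return the 0-based start index of every GT dinucleotide in seq."""
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]


# Hypothetical 5'ss: the exon ends with G (position -1), the intron starts with GT.
exon = "CAG"
intron = "GTAAGT"
wild_type = exon + intron
plus_one = len(exon)  # 0-based index of intron position +1

# G>T substitution at intron position +1.
mutant = wild_type[:plus_one] + "T" + wild_type[plus_one + 1:]

wt_sites = gt_positions(wild_type)
mut_sites = gt_positions(mutant)

# GT dinucleotides created or destroyed by the substitution.
created = sorted(set(mut_sites) - set(wt_sites))
destroyed = sorted(set(wt_sites) - set(mut_sites))

# Offset of each created GT relative to the authentic GT (negative = upstream).
offsets = [i - plus_one for i in created]

print(wild_type, wt_sites)
print(mutant, mut_sites)
print("created:", created, "destroyed:", destroyed, "offsets:", offsets)
```

In this toy sequence the GT at intron positions +5 and +6 is present before and after the substitution, so the presence of a GT dinucleotide by itself does not indicate that a site is used.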
The most frequent point mutations that activated aberrant 5′ss were purine transitions, accounting for 45.7% of cases (11.1% A>G and 34.6% G>A mutations; Table 3). This figure seems to be somewhat lower (P = 0.08) than the ~54% (113/211) observed for aberrant 3′ss (36), probably due to a higher prevalence of transitions in the 3′YAG than those in the 5′ss consensus. Cryptic 5′ss resulted from point mutations in each nucleotide of the 9-nt consensus except for position –3, consistent with this position being the least conserved. However, position –3 has previously been implicated in pathological exon skipping in well-documented cases (56–59), suggesting that –3 substitutions in weak 5′ss are also likely to result in aberrant 5′ss, although these cases must be rare and aberrant splicing and putative phenotypic manifestations could be subtle.
As for positions adjacent to the 9-nt 5′ss consensus, each of the reported single-base substitutions at intron positions +7 and +8 created new 5′GT dinucleotides in situ that were used in vivo (60,61). Despite position +7 exhibiting a predominance of purines after several rounds of functional 5′ss selection experiments (28), point mutations downstream of the 5′ss consensus resulting in activation of cryptic 5′ss have thus far not been reported. Among upstream substitutions in DBASS5, we found only a single case of an exonic de novo 5′ss generated by a C>T transition 11-nt upstream of an authentic 5′ss (62), consistent with a disruption of exonic splicing regulatory sequences.
Are point mutations at any position of the 5′ss consensus sequence particularly prone to aberrant 5′ss activation? As observed for mutations in the HGMD (4), position +1 led the frequency table in our dataset, with 49.8% and 39.4% of mutations observed in the two studies, respectively (Figure 1). However, the overall distribution of mutations within the 5′ss consensus was significantly different between the two. In particular, the proportion of mutations at position +5 was almost twice as high among cryptic 5′ss as in the HGMD [Figure 1 and ref. (4)]. For unique point mutations leading to cryptic 5′ss activation, position +5 ranked second and position +2 third, whereas this order was reversed for unique mutations in the HGMD (50 and 34 mutations versus 347 and 456, respectively; P = 0.004).
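The comparison of positions +5 and +2 can be laid out as a 2 × 2 contingency table. The sketch below runs a Pearson chi-square test without continuity correction on the counts quoted above. The choice of test is an assumption, because the text does not state which test produced P = 0.004, and the function name `chi_square_2x2` is hypothetical. On these counts the uncorrected test gives a P value of about 0.004.

```python
import math


def chi_square_2x2(a: int, b: int, c: int, d: int) -> tuple[float, float]:
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns the statistic and its P value for 1 degree of freedom.
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Counts quoted in the text (unique point mutations).
# Rows: cryptic 5'ss dataset, HGMD. Columns: position +5, position +2.
cryptic_plus5, cryptic_plus2 = 50, 34
hgmd_plus5, hgmd_plus2 = 347, 456

chi2, p_value = chi_square_2x2(cryptic_plus5, cryptic_plus2, hgmd_plus5, hgmd_plus2)
print(f"chi2 = {chi2:.2f}, P = {p_value:.4f}")
```

A Fisher exact test or a continuity-corrected chi-square would give a somewhat different P value on the same table.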
G at position +5 is nearly invariant in Saccharomyces cerevisiae (63). In contrast, +5G is present in only ~88% of S. pombe introns (64) and ~78% of human introns (30), indicating that relief from the absolute requirement for G was an ancient evolutionary event. Comparison of exonized and non-exonized intronic Alu repeats revealed a higher number of +5Gs in exonized sequences (65). Mutations at position +5 have resulted in frequent activation of cryptic 5′ss both in yeasts (66–69) and humans (Figure 1, all references available at: http://www.dbass.org.uk). Our study is the first to provide statistical evidence that this position is important for distinct aberrant splicing outcomes. DBASS5 gives many examples of natural 5′ss in which different point mutations resulted in the same cryptic 5′ss. Similarly, there are numerous cases in the literature of identical exon skipping events caused by different point mutations in the same 5′ss. The identification of several exceptions in humans using the DBASS5 and HGMD data (49–52) is consistent with an earlier observation in S. cerevisiae, namely that cryptic 5′ss activation by +5G>A mutation was not replicated for another 5′ss point mutation in the same intron (67). These rare examples may provide important insights into the requirements for activation of aberrant 5′ss, as opposed to exon skipping events.
In addition to the local sequence context, the frequent occurrence of +5G>A substitutions underlying aberrant 5′ss activation (Figure 1 and Table 3) can be explained by a more severe splicing outcome of these transitions. More dramatic splicing defects for +5G>A transitions than +5G transversions were found in S. cerevisiae (69). In contrast, each IVS1+5G>H mutation in the human proinsulin gene promoted activation of a competing decoy 5′ss 26 nt downstream of the authentic 5′ss to the same extent, irrespective of the substituting nucleotide [(7); J.K. and I.V., unpublished data], consistent with a position effect.
What interaction(s) at position +5 is crucial for aberrant 5′ss activation? Authentic 5′ss in which mutation at position +5 generated cryptic 5′ss had a high proportion of +5G+6T (Figure 2). Interestingly, the +5G+6T dinucleotide signifies the most frequent location for alternative 5′ss across several species, and this preference was suggested to result from U1 binding rather than U6 binding (70). However, compensatory mutations in U1 snRNA that restore base-pairing with the mutated intron frequently fail to suppress aberrant splicing, suggesting that position +5 is engaged in additional interactions (28,71–74). Interaction of U6 snRNA with the 5′ss (75) at intron position +5 (76) was partially suppressed by U6 mutations predicted to increase base-pairing (77,78). Although the 5′ss has very limited complementarity to the U6 ACAGAG motif, this interaction seems to be important for accurate 5′ss selection also in mammals (78–80), albeit not in all systems (81). In addition, cryptic splice sites have been induced by co-expression of splicing reporters with mutated snRNAs, including U1 (82), U5 (83,84) and U6 (77–79,85), both in yeast (77,82,83,85) and mammals (78,79,84).
As U1 snRNP binds sequences that are not used as 5′ss and is present in excess over U6, sequential occupancy by both snRNPs may be absolutely essential for accurate 5′ss utilization (86,87). This would be consistent with rate-limiting U6 snRNA interactions with the pre-mRNA observed for U1-independent splicing (78) and loose requirements for U1 binding in numerous introns (28,71,72,74). Mutations of IVS+5G that resulted in cryptic 5′ss occurred in relatively weak authentic 5′ss and they are likely to further reduce U1 binding so that it may no longer be sufficient for accurate 5′ss recognition. U6/U4.U5 snRNP would then bind to nearby pseudo-sites that are, on average, intrinsically stronger than the mutated authentic 5′ss [Table 5 and ref. (11)]. The strength of predicted base-pairing interactions at positions +5 and +6 in the authentic counterparts (Figure 2) may hamper the transfer of these 5′ss from U1 to U6. In fact, strong 5′ss were reported to be inhibitory in S. cerevisiae, potentially delaying the release of U1 and productive interactions with U6 (28), although extended complementarity between U1 snRNA and a human immunodeficiency virus 1 donor site did not inhibit splicing (81). Mutations that destabilized a yeast 5′ss/U6 duplex improved the second step of splicing and hyperstabilization of the 5′ss/U6 interaction had the opposite effect, suggesting that changing the stability of these interactions alters the equilibrium between the first and second step conformations (88). Suppression of 5′ss mutations by U6 in a hybrid reporter was more efficient when U1 could pair nearby than when pairing was restored further away (79). In addition, position +5 may directly interact with U6 residues that base pair to the BPS recognition region of U2 (89) as well as with other splicing factors, such as PRP8 (90,91). 
Finally, sequential recognition of position +5 is likely to require contacts with exon-bound factors that may substitute for U1 interactions (92,93), and that may be essential for spliceosome assembly at authentic 5′ss and contribute to the observed high number of cryptic sites as compared to de novo 5′ss (Figure 1 and Table 1).
In summary, we have shown that cryptic 5′ss in human disease genes are best predicted by computational methods that accommodate nucleotide dependencies and not by methods employing only nucleotide frequency matrices. Discrimination of intronic cryptic 5′ss from their authentic counterparts was less effective than for exonic cryptic 5′ss, as the former were intrinsically stronger than the latter. Computational prediction of exonic de novo 5′ss was poor, suggesting that their activation in vivo critically depends on exonic splicing enhancers or silencers, rather than on the strength of the 5′ss consensus, and that improved algorithms for their prediction will need to accommodate auxiliary splicing sequences. The authentic counterparts of both de novo and cryptic 5′ss were weaker than the average human 5′ss, highlighting the practical importance of ranking splice sites in disease genes to improve detection of splicing mutations. The mutation spectra of cryptic and de novo 5′ss were distinct and differed also from that underlying exon skipping events, implicating point mutations at position +5 in frequent activation of cryptic 5′ss. Finally, the development of an online database of aberrant 5′ss will facilitate detection of introns and exons frequently involved in aberrant splicing, identification of auxiliary sequences that control selection of aberrant splice sites, fine-tuning of splice-site prediction algorithms, identification of splicing mutations, as well as studies of the basic mechanisms of splice-site selection.
Supplementary Data are available at NAR Online.
This work was supported by the Juvenile Diabetes Research Foundation International (1-2006-263), Telethon Onlus Foundation (GGP02453 and GGP06147), FIRB (RBNE01W9PM), and by the EC grant EURASNET-LSHG-CT-2005-518238. A.R.K. acknowledges support from NIH grant GM42699. We thank S. Mills, R. Sood, P. Gibbs and T. Bryant for technical help. Funding to pay the Open Access publication charges for this article was shared equally by the above funding sources.
Conflict of interest statement. None declared.