Alternative splicing is a prevalent mechanism of post-transcriptional gene regulation in multicellular eukaryotes. It allows a single gene to increase its functional and regulatory diversity through the synthesis of multiple mRNA isoforms encoding structurally and functionally distinct protein products (1). High-throughput RNA sequencing reveals that over 90% of multi-exon genes in mammalian genomes undergo alternative splicing (2,3). The strikingly high frequency of alternative splicing underscores its contribution to the organismal complexity of higher eukaryotes.
The fidelity of splicing is tightly regulated by interactions between cis elements in exons and flanking introns and trans splicing regulators that recognize these elements (4,5). Disruption of normal splicing regulation, even a shift in the ratio of mRNA isoforms of the same gene, can sometimes have major functional consequences and cause human diseases (6–8). The most conserved features of exon recognition are the splice site signals, known as the 5′ splice site (donor site) and the 3′ splice site (acceptor site). The splice sites define the boundaries between exons and introns, at which the spliceosome must assemble. Importantly, recognition of the 5′ splice site is the first and a critical step of spliceosome assembly (9). The vast majority (>99%) of 5′ splice sites in eukaryotic genomes are characterized by a highly conserved ‘GT’ dinucleotide in the intronic region immediately adjacent to the exon–intron boundary (10–12). Several additional conserved but degenerate nucleotide positions in the exonic and intronic regions surrounding the GT dinucleotide are also part of the consensus 5′ splice site signal (12,13). Numerous disease-causing mutations within the consensus 5′ splice site disrupt splicing, leading to defective mRNA and protein products (14–16). However, there are also a large number of polymorphisms in the 5′ splice site that have no effect on splicing (16). Given the prevalence of aberrant alternative splicing in human diseases, it is critical to obtain an improved understanding of the signals that determine the splicing impact of 5′ splice site mutations. Such knowledge could aid in distinguishing pathogenic mutations from neutral variants in large-scale medical sequencing projects.
In recent years, there has been growing evidence for widespread natural variation of alternative splicing in humans (17–24). Single nucleotide polymorphisms (SNPs) are the major contributor to splicing variation in human populations (25). For example, an intronic SNP (rs3812718) in SCN1A, which encodes a neuronal sodium-channel alpha subunit, modulates the alternative splicing of its exon 5 and affects the dose-response to antiepileptic drugs (26). Another example is the low-density lipoprotein receptor (LDLR), in which a SNP (rs688) promotes skipping of its exon 12 in the liver of women (27). This exon-skipping isoform is predicted to produce a truncated protein product lacking the transmembrane segment. Importantly, this SNP is strongly associated with an increased level of total and LDL-cholesterol in females, especially in pre-menopausal women (27). Using high-density exon arrays or high-throughput RNA sequencing, several groups have performed genome-scale surveys of splicing differences among human individuals (17–19,21–23). For example, using the Affymetrix exon 1.0 array, Kwan et al. (18) examined alternative splicing patterns in lymphoblastoid cell lines (LCLs) of 57 unrelated HapMap CEU individuals. They identified 177 genes whose transcript isoform compositions (owing to alternative splicing, alternative promoter usage and alternative polyadenylation) correlated strongly with surrounding SNPs. Using a similar approach, Heinzen et al. (21) identified 80 high-confidence associations between SNPs and alternative splicing in cortical brain samples and peripheral blood mononuclear cell samples.
In this study, we explored whether natural variations of alternative splicing among human individuals could reveal important signals of 5′ splice site recognition. In a panel of seven LCLs of Asian, European and African ancestry, for which extensive genotyping data were collected by the International HapMap project (28) and a recent genome-wide exome sequencing study (29), we identified 1174 SNPs within the consensus 5′ splice site (three exonic nucleotides and six intronic nucleotides surrounding the exon–intron boundary) (13). We selected 129 SNPs predicted to significantly alter 5′ splice site activity according to the consensus splice site model in MAXENT (13), and examined their impact on exon splicing using a fluorescently labeled RT–PCR assay. SNPs that disrupted the GT dinucleotide immediately downstream of the exon always altered splicing, consistent with the essential role of the GT dinucleotide in 5′ splice site recognition. Surprisingly, outside of the nearly invariant GT dinucleotide, only ~14% of tested SNPs affected splicing, while the vast majority (~86%) of tested exons were unaffected by their 5′ splice site SNPs. Bioinformatic analysis identified signals that could modify the splicing impact of 5′ splice site polymorphisms, most notably a strong 3′ splice site upstream of the exon and the presence of particular intronic sequence motifs downstream of the 5′ splice site. The activity of these predicted sequence features was experimentally confirmed by minigene splicing reporter experiments. In an exon of TRIM62, the upstream 3′ splice site and poly-G runs in the downstream intron functioned redundantly to protect the exon from its 5′ splice site polymorphism. Collectively, our study provides genomic and experimental evidence for widespread context-dependent robustness to 5′ splice site polymorphisms in human transcriptomes.
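
To make the consensus 5′ splice site window described above concrete, the following minimal Python sketch builds the nine-nucleotide window (three exonic plus six intronic nucleotides) and checks whether a substitution disrupts the GT dinucleotide. It is an illustration only and is not the MAXENT scoring model or the pipeline used in this study. The function names, the example sequences, and the coordinate convention (−3 to −1 for exonic positions, +1 to +6 for intronic positions, with no position 0) are assumptions introduced here.

```python
def five_prime_ss_window(exon_seq: str, intron_seq: str) -> str:
    """Return the 9-nt window: last 3 exonic nt followed by first 6 intronic nt."""
    if len(exon_seq) < 3 or len(intron_seq) < 6:
        raise ValueError("need at least 3 exonic and 6 intronic nucleotides")
    return exon_seq[-3:].upper() + intron_seq[:6].upper()


def has_canonical_gt(window: str) -> bool:
    """True if intronic positions +1 and +2 of the 9-nt window are 'GT'."""
    return window[3:5] == "GT"


def apply_snp(window: str, position: int, alt: str) -> str:
    """Substitute one base in the window.

    position is in splice-site coordinates (assumed convention):
    -3..-1 are exonic, +1..+6 are intronic, and 0 does not exist.
    """
    if position == 0 or not -3 <= position <= 6:
        raise ValueError("position must be in -3..-1 or +1..+6")
    if len(alt) != 1:
        raise ValueError("alt must be a single nucleotide")
    # Map splice-site coordinates to a 0-based string index.
    idx = position + 3 if position < 0 else position + 2
    return window[:idx] + alt.upper() + window[idx + 1:]


# Hypothetical sequences, chosen only for illustration.
ref = five_prime_ss_window("ACGTCAG", "GTAAGTATCC")
gt_variant = apply_snp(ref, +1, "A")      # hits the GT dinucleotide
flank_variant = apply_snp(ref, +5, "C")   # hits a degenerate flanking position

print(ref, has_canonical_gt(ref))
print(gt_variant, has_canonical_gt(gt_variant))
print(flank_variant, has_canonical_gt(flank_variant))
```

In this sketch, a substitution at +1 or +2 removes the GT dinucleotide, whereas substitutions at the other seven positions leave it intact. Whether such a flanking substitution changes splicing is exactly the context-dependent question that the code cannot answer on its own.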