We have performed deep sequencing of cDNA libraries generated from mRNA from 75 lymphoblastoid cell lines derived from Nigerian individuals as part of the International HapMap Project (69 from Pickrell et al.
[17] and 6 additional ones). In total, we generated 1.4 billion sequencing reads of either 35 or 46 base pairs. 1.2 billion of these sequencing reads mapped to the genome. We used the remainder to identify splice junctions (without reference to known exons) by splitting each sequencing read into two and mapping each end to the genome independently. In total, 48 million additional reads mapped to the genome using this read-splitting procedure, and we identified 392,612 putative splice junctions.
Most previous investigations of splice junctions using RNA-seq data have considered putative splice junctions only between previously annotated or predicted exons
[2],
[4],
[5],
[18],
[19]. Since our method makes no such restriction, one reasonable concern is that we might identify spurious junctions due to mapping or sequencing errors. However, there is strong evidence that the identified junctions are real outcomes of splicing reactions. First, despite the fact that our mapping approach used no information about the sequence specificity of the splicing reaction, the majority of the putative junctions contain GT-AG dinucleotides (the canonical splice sites) directly intronic of the edges of the predicted splice sites (). If we assume that all putative junctions without intronic matches to the canonical splice sites (or the alternative dinucleotide pair GC-AG) are false positives, we estimate a False Discovery Rate (FDR) of 1.5% for the 306,606 junctions that do contain such matches (
Methods). Second, the positions of putative alternative splice sites near protein-coding splice sites follow a periodic pattern, such that splice sites which maintain the coding frame of the exon are observed more often than those which disrupt frame (). In total, 42% of these alternative splice sites are in frame (

; binomial test against the null hypothesis of 33%). This observation recapitulates patterns seen in studies of EST databases
[20]–
[22]. In the remainder of the paper we analyze only the set of 306,606 junctions which contain intronic matches to GT-AG or GC-AG.
Many identified junctions are novel
We compared the junctions we identified to gene models from the UCSC, Ensembl, Vega, and RefSeq databases, and to spliced ESTs from Genbank
[23]–
[27]. Of the 306,606 splice junctions, 154,927 (50.5%) are not annotated as parts of known gene models, and 136,313 (44.5%) are not present in Genbank. For splice junctions not present in gene models, we estimate an FDR of about 2% (
Methods). The extensive unannotated splicing we observe is largely due to junctions that are rarely seen in our data (): while 50% of all observed junctions are not present in gene models, these account for only 1.7% of all junction-spanning sequencing reads (). For example, 21 of the 32 splice junctions observed in the gene
HERPUD1 are unannotated, but only around 0.5% of the reads from this gene are derived from these 21 unannotated junctions (). We see no sign that our identification of isoforms is near saturation (
Figure S1); thus deeper sequencing of transcriptomes will likely continue to identify additional low-abundance isoforms.
| Table 1Characteristics of observed junctions. |
Characteristics of identified junctions and splice sites
Next, we quantified overall levels of alternative splicing. To do this, we considered a set of splice sites covered by at least 50 reads in our data (there are 77,754 such 5′ splice sites and 77,733 such 3′ splice sites;

of these are annotated in gene model databases). We then counted both the number of different places in the genome to which each site is spliced, as well as the proportions of reads covering each junction. We estimate that the “major” splice form accounts for 98.4% of reactions involving each splice site; in our data the average splice site is involved in 1.8 different splicing reactions. 5′ splice sites in untranslated regions (UTRs) are involved in a mean of 2.6 splicing reactions, versus 1.8 for 5′ splice sites in protein-coding regions; the corresponding numbers are 3.2 and 1.7 for 3′ splice sites.
We then evaluated how the splice junctions correspond to known gene models. We split the junctions into five classes: (i) those where the junction is annotated; (ii) those between splice sites that are both annotated (but not annotated as being spliced together); (iii) those where the 5′ splice site (but not the 3′ splice site) is annotated; (iv) those where the 3′ splice site (but not the 5′ splice site) is annotated; and (v) those where neither splice site is annotated. 80% of the unannotated junctions involve at least one annotated splice site, and though many newly-identified splice sites fall near annotated sites, the majority do not (). This indicates that a large fraction of the low-abundance isoforms are not modifications of known exons, but instead contain entirely new exons.
Extensive unannotated splicing is present in different human populations and tissues
We next asked whether we could confirm these observations in other cell lines and primary tissues. We first performed the same analysis on RNA-Seq data generated on a different set of human LCLs
[28]. We identified 219,322 splice junctions at an FDR of 1.5%, 82,658 of which are unannotated (
Figure S1). We then analyzed RNA-Seq data from primary human liver samples (George Perry, unpublished data); we identified 156,905 splice junctions at an FDR of 0.8%, 29,655 of which are unannotated (
Figure S1). Finally, we analyzed RNA-Seq data from a set of several primary tissues
[2]; we identified 136,499 splice junctions at an FDR of 1.8%, 21,743 of which are unannotated (
Figure S1). The numbers of unannotated splice junctions in these studies are roughly consistent with the observed numbers found in the Yoruba data, given the lower sequencing depths of those other studies and hence better sampling of common junctions relative to rare junctions (
Figure S1). This confirms that the observation of extensive unannotated splicing is broadly generalizable; for the rest of the paper we focus on the original set of Yoruban LCL data since this is the largest RNA-Seq dataset currently available.
We also considered how much splicing shows evidence of being restricted to particular individuals, rather than shared across the entire population (due to, for example, sequence polymorphisms which influence splicing
[17],
[28]–
[30]). Though it is difficult to estimate this precisely, several analyses suggest that only a couple percent of splice junctions, at maximum, show evidence of being restricted to certain individuals (
Text S1).
Most unannotated splice junctions show no evolutionary conservation
We hypothesized that unannotated, rarely-used splice sites are the result of evolutionarily-neutral (or perhaps slightly deleterious) splicing errors. To test this, we compared the sequence conservation across placental mammals (using the
phyloP score
[31]) between the unannotated and annotated splice sites (we assume that current gene databases are highly enriched for truly functional exons). If the unannotated splice sites are functionally relevant, their sequence conservation should be comparable to that of annotated splice sites. For this analysis, we used the set of splice junctions where one end is an annotated splice site and the other is more than 50 bases away from an annotated splice site.
Annotated splice sites show a striking pattern of sequence conservation–the splice sites themselves are highly conserved, as are the regions exonic of the splice sites (). In contrast, the conservation scores around unannotated splice sites show little, if any, signal of evolutionary constraint (). This same pattern holds in an independent sample of LCLs and in primary human liver samples (
Figures S2 and
S3). To exclude the possibility that our analysis overlooked splice sites that are conserved in only a subset of placental mammals, we repeated this analysis using
phyloP scores calculated using only primates, and saw the same pattern (
Figure S4). Further, humans have reduced polymorphism in the annotated splice sites, but no such reduction in the unannotated splice sites (
Figure S5). The most parsimonious explanation of these observations is that the majority of rarely-used, unannotated splice sites are simply due to mis-spliced transcripts.
Estimation of splicing error rates
To estimate the fraction of mature mRNAs resulting from mis-splicing, we identified a set of splice junctions where both the 5′ and 3′ splice sites are highly conserved (those with a
phyloP score

). We then asked how often the conserved splice sites are spliced to unconserved splice sites (to provide a conservative lower bound on the amount of mis-splicing, we used a relaxed threshold of a
phyloP score

0.5 to call a splice site as unconserved for this analysis). Approximately 0.7% of reads involving either end of a conserved junction are to unconserved splice sites. Given that the median gene in the human genome has four exons (and thus three splicing reactions), this suggests that approximately 2% of transcripts from the average gene are mis-spliced. Because mis-spliced transcripts are preferentially removed by NMD mechanisms, this is likely a conservative estimate.
We next tried to identify factors that are predictive of an intron's level of splicing error. The two factors we considered were the intron's length and the expression level of the gene in which the intron falls. Longer introns show higher levels of mis-splicing (), while highly expressed genes show somewhat lower levels (
Figure S6). These associations may be confounded, however, by the fact that highly expressed genes tend to have shorter introns
[32]. Indeed, the association between splicing error rate and gene expression level disappears after correction for intron length (
Figure S5), indicating that this association is largely driven by the lower splicing error rate of small introns. The different sequence composition of introns in highly expressed genes
[33] may also influence their lower rate of splicing error (
Text S1;
Figure S7).
“Noise” splice sites are marked by genomic features that define exons
Finally, we considered the mechanism which results in the generation of mis-spliced transcripts. To do so, we looked for hexamers enriched in the vicinity of unconserved, rarely-used splice sites, as compared to nearby “decoy” splice sites (matching GT or AG) which we never observed to be used in our data. 574 hexamers show significant enrichment or depletion exonic of the 5′ SS, and 728 exonic of the 3′ splice site (;

;

test). Although the relative enrichments of hexamers near the 5′ and 3′ splice sites are similar, a set of hexamers matching the binding site for the U1 snRNP (which recognizes the 5′ splice site) is strongly depleted in the vicinity of the 5′ splice sites, but shows only limited or no depletion in the vicinity of 3′ splice sites (). This is consistent with the observation that competitive binding of the core splice factors plays an important role in splice site choice
[34]. There is a smaller, but still substantial, number of hexamers enriched or depleted intronic of the splice sites (295 and 282 for 5′ and 3′ splice sites, respectively).
We compared our list of hexamers to the list of exonic splicing enhancers (ESEs) identified by Fairbrother et al.
[35], some of which were validated experimentally, and found that 79% of the ESEs are enriched exonic of both 3′ and 5′ splice sites in our data (

,

test). This is evidence that noisy splicing is a result of the binding of the same splice factors used to identify exons more generally. Further support for this possibility comes from the observation that the hexamers identified near noise splice sites demarcate the boundaries of constitutive exons in our data ().