|Home | About | Journals | Submit | Contact Us | Français|
Alternative splicing is a prevalent post-transcriptional process, which is not only important to normal cellular function but is also involved in human diseases. The newly developed second generation sequencing technique provides high-throughput data (RNA-seq data) to study alternative splicing events in different types of cells. Here, we present a computational method, SpliceMap, to detect splice junctions from RNA-seq data. This method does not depend on any existing annotation of gene structures and is capable of finding novel splice junctions with high sensitivity and specificity. It can handle long reads (50–100 nt) and can exploit paired-read information to improve mapping accuracy. Several parameters are included in the output to indicate the reliability of the predicted junction and help filter out false predictions. We applied SpliceMap to analyze 23 million paired 50-nt reads from human brain tissue. The results show at this depth of sequencing, RNA-seq can support reliable detection of splice junctions except for those that are present at very low level. Compared to current methods, SpliceMap can achieve 12% higher sensitivity without sacrificing specificity.
RNA splicing is an important post-transcriptional step where one or more segments of the pre-mRNA are spliced out and the remaining segments (exons) are concatenated to form the mature mRNA product. By alternative splicing, it is possible to produce different transcripts (isoforms) from the same genetic locus. This process occurs in over 90% of multi-exon human genes (1,2) and greatly increases the diversity of possible transcripts in the transcriptome. Aberrant RNA splicing has been found to be associated with many human diseases (3,4). For this reason, techniques to identify and quantify splicing events are important to biology and medicine.
The most popular way to study the structure and abundance of spliced transcripts is through sequencing of expressed sequence tags (ESTs) (5). Traditionally, such studies were expensive and inefficient due to the low throughput of the Sanger method which was the main sequencing technology used in EST projects. However, with the recent advent of second generation sequencing technology (SGS), it is now feasible to conduct deep and comprehensive sequencing of transcriptomes in a high throughput and cost effective manner (6–8), making it possible to detect rare alternative splicing events. In such RNA-seq projects, tens or hundreds of millions of short sequences (30–100 nt) are read randomly from the population of transcripts under study. The first step of the analysis is thus the mapping of each short read to a reference genome to determine the genetic loci that may give rise to this read. For reads that are sampled completely within exonic regions, this mapping task can be handled by any existing short-read mapping programs, such as ELAND (Cox, unpublished software) and SeqMap (9).
However, the reads that are of most interest to us for novel isoform discovery are the ones that span across exon-exon junctions. These ‘junction reads’ cannot be mapped directly to the genome. One approach is to map the reads onto the known transcript sequences from the currently annotated exon library. Since the exon library is incomplete, this method cannot find the junctions that involve novel splicing events (10). In another approach, used in the recently developed TopHat (11) program, reads that are mappable on the reference genome are grouped into distinct clusters such that the reads within each cluster are linked together through overlapping regions. Each cluster then defines a putative exonic region. Subsequently, exon-exon junctions can be searched based on these putative exon definitions.
Clustering is a natural approach to find novel junctions in the first RNA-seq experiments (1,6,10,12–17) because the data generated at the early stage of the development of SGS are mostly very short reads (25–36 nt) that are not suitable for direct de novo detection of exon–exon junctions. However, the technology is improving rapidly, and currently the usable length of reads from some SGS instruments like the Illumina Genome Analyzer are typically in the range of 50–100 nt. The increased read length opens up the possibility to directly map the exon-exon junction without any reference to putative or annotated exons. Here we report a novel algorithm, based on the idea of using the mapping of half-reads as a way to identify the approximate location of a junction. Moreover, this method can be adapted to incorporate the extra information contained in paired-end sequencing data, to achieve a much higher level of specificity than attainable by single end sequencing. The method is implemented in a freely available Python program named SpliceMap (http://biogibbs.stanford.edu/~kinfai/SpliceMap/).
SpliceMap utilizes merely the reference genomic sequence to find the junction independently of existing exon annotation. It is possible to explore all exon splicing events, including known and novel ones, if the sequencing is of sufficient depth. The core notion is to pin down first the junction boundary on one of the two exons that are involved in the splicing event, before the mapping of the full junction. A read that spans a junction must have a match in the reference genome that is not shorter than its half length. Such a match then provides a seeding that can be used to identify a small genomic region for the search of the corresponding junction. There are four main steps in SpliceMap: half-read mapping, seeding selection, junction search and paired-end filtering (Figure 1). The last step is not applied when the data is not paired-end. For reads longer than 50 nt, we extract from them several overlapped 50-nt reads and then apply the standard method. For example, we split a 100-nt read to three segments (1–50, 26–75 and 51–100). An extra filter is added in the post-processing step for the long-read data to check the results with the full length information. In this way, we can find multiply junctions from a single long read.
Taking advantage of the reasonably long reads (50 nt) offered by the newest models of second-generation sequencers, the half length (25 nt) can be reliably aligned to the reference genomic sequence with high probability. In this step, SpliceMap maps both halves of the read to the reference genome by any currently available short read mapping tools, such as SeqMap (9) and ELAND. The maximum mismatch allowed for the half read mapping can be chosen accordingly, based on the quality of data and read length. After mapping, the following steps are carried out chromosome by chromosome.
We use the mapped hits of a half-read to narrow the search regions of the junction. These hits are extended base by base in the following step. Thus, we call the half-read mapped hit ‘seeding’. The mapped hits from the above steps are examined for seeding selection. Although the uniquely mapped hits are more reliable as seeding for junction search, one should not simply exclude all multiply mapped reads (i.e. reads mapped to more than one location) because doing so will greatly diminish the chance of detecting junctions with homologous sequences elsewhere, such as those in paralogous genes or pseudogenes. Instead of rejecting all multiply mapped hits, SpliceMap excludes only those hits that are within 400 000 nt of another hit from the same half-read. Because if two regions are identical within a distance of 400 000 nt, false splice predictions tend to form between these two regions which match the reads perfectly.
For each seeding identified, the alignment on the reference genome is then extended base by base to find the splicing point (Figure 1). SpliceMap subsequently tries to find the partner splicing point that provides perfect match of the corresponding residual sequence of the original read, within a user-specified distance (set to be 400 000 nt in our examples). When the full reads are 50 nt in size, candidates of splicing point must meet two criteria: first, the alignment extension cannot be longer than 40 nt and the residual length has to be at least 10 nt; and second, the splicing point must be next to the canonical dinucleotides splicing signal GT and AG for donor and acceptor sites, because they appears in 98% known splice sites (18). The mapping of the residual sequence is achieved by searching 10-nt seeding in a pre-computed chromosome-wide hash table and then extending to complete the full alignment. In order to reduce false positive junctions, the results are discarded if the search yields multiple matches of the residual sequence satisfying the above criteria.
When paired-end reads are available, the pairing information is used in this step to improve the specificity of junction detection. First, in the previous steps, three types of hits are identified as ‘good hits’, namely exonic hits, extension hits and junction hits. An exonic hit occurs if the two halves of a full read are mapped to locations that differ by exactly half of the read length. On the other hand, if a half-read hit can be extended maximally to an alignment length that is suitably long but yet shorter than the full read length, then it is regarded as an extension hit. Finally, junction hits are identified as above. To qualify as reliable hits, the hits generated from the two reads from a paired-end reads must satisfy the following conditions (i) both hits are ‘good hits’; (ii) their distance is not longer than 400 000 nt; (iii) the mapping direction and the positional order on the reference genomic sequence of two hits are in the consistent order with the experimental design (Figure 2).
SpliceMap provided several parameters to facilitate the assessment of the reliability and the abundance of the detected junctions. For each junction, SpliceMap computes its nR (number of reads supporting this junction), nNR (number of non-redundant supporting reads), nUM (number of uniquely mapped supporting reads), nUP and nDOWN (number of mapped reads in upstream or downstream adjacent regions respectively) (Figure 3). We examined these parameters on a RNA-seq data set of 23 412 226 paired 50-nt reads provided to us by Illumina. The various requirements of these parameters lead to different optional parametric filterings.
We found that for splice junctions near the 3′-end on a gene annotated by RefSeq as having only a single isoform, the number of supporting reads (nR) is significantly correlated with the overall expression (R2 = 0.7173) of the gene as measured by RPKM (number of reads per kb of transcript per million mappable reads) (10). This suggests that when alternative splicing may occur at a junction, we can use nR as an index for its usage. Table 1 shows the distribution of nR among predicted junctions before any optional parametric filtering. While overall only 29% of predicted junctions has nR = 1, this percentage increases to 74% among novel junctions. This is consistent with the notion that junctions not included in current annotations are likely to be relatively rare.
The nR of a junction is affected by the randomness in the representation or amplification of the RNA fragments in the library to be sequenced. If an RNA fragment is readily captured by the library preparation protocol (random hexamer priming in this case) and then highly amplified then it is likely to generate multiple reads of identical sequence. We call these ‘redundant’ reads. For junctions in low abundance isoforms, if two or more supporting reads are identical in sequence then they are likely to be redundant. In such cases, even if nR > 1 we still may not have a reliable prediction. We can greatly increase the specificity of our prediction by requiring that the junction be supported by more than one ‘non-redundant’ reads (nNR > 1). We can see that the EST validation rate is remarkably high once we require nNR > 1 (114 990/121 718 = 94%). When we restrict to novel junctions, the EST validation of predictions with nNR > 1 is still high (4494/7757 = 58%) (Table 2). Since the EST data is likely to be incomplete in its coverage of novel junctions in rare isoforms, the specificity of predictions with nNR > 1 should be considerably higher than 58%. We will examine this issue further in our discussion of our experimental validation results.
Although requiring nNR > 1 was effective in ensuring very high specificity of junction prediction, it reduced the number of predicted junctions by about 31%. Among novel predictions, the potential loss of sensitivity was even greater, i.e. from 36 091 to 7757. This motivated us to design optional filters to discriminate reliable versus unreliable predictions among junctions with nNR = 1. Instead of filtering junction prediction by nR or nNR, we require that the junction must be supported by at least one uniquely mapped reads (nUM > 0). This requirement excludes 3.8% junctions but increases the total specificity from 82.96 to 84.48% (Table 2).
The other optional filter is the requirements of a sufficient number of reads mapped adjacent to the splice site, on each side of the putative junction. To do this, recall that nUP is the numbers of reads (exon or junction reads) falling within a small region (say with a neighborhood size of K nt) upstream of the splice site of interest in the transcript, and nDOWN is the corresponding downstream adjacent reads. Thus our filter is to require each of nUP or nDOWN to be at least 1. Table 3 shows the number of detected junctions and the EST validation rates for neighborhood size K = 40, 80 and 160. We can see that this filtering excludes some junctions but improves the specificity as K decreases. By adjusting K, we can achieve a suitable balance of sensitivity and specificity in our junction predictions. When K = 40, it excludes 7.6% junctions but increases the total specificity from 82.96 to 86.91% (Table 3). The gains and losses of various optional parametric filterings are compared in Table 2.
We tested SpliceMap and TopHat on a RNA-seq data set of 23 412 226 paired 50-nt reads provided to us by Illumina. mRNA was purified from total RNA from human brain tissue with oligo-dT magnetic beads. cDNA was synthesized with random primer priming. ds–cDNA was sequenced using Illumina Genome Analyzers as recommended by the manufacturer. The data is publicly available in database GEO with accession number GSE19166. In this test, SpliceMap requires the reads to cover at least 10 nt on each side of the junction. One mismatch was allowed in seeding mapping based on 25-nt half-reads, but the mapping of the residual sequence is required to be perfect match. The allowance for mismatch on the seeding mapping allows detection of junctions with SNPs in the flanking exon sequences and also accommodates sequencing errors.
Here we report the results found by paired-end SpliceMap followed by the nUP/nDOWN filtering (K = 40) and the nUM filtering. In total, SpliceMap found 151 317 exon junctions, including 23 020 novel junctions, which were not reported in RefSeq (19), Ensembl (20) and KnownGene (21). In order to assess SpliceMap’s specificity, the detected junctions are aligned to human ESTs in GenBank (22). Because ESTs are 200–600 nt single reads, they are typically long enough to identify unique transcript fragments with high reliability. Thus if a match can be found in an EST sequence for a junction that was detected by our de novo splice discovery algorithm, then this can be regarded as independent experimental validation of the existence of the junction. As a comparison we also present the corresponding results by TopHat 1.0.12 (11) (the latest version as of 28 October 2009) (Table 4 and Figure 4).
We found that 87.9% (133 010) of the junctions detected by SpliceMap are supported by EST evidence. If we restrict to the 23 020 novel junctions predicted by SpliceMap, the percentage becomes lower (41.2%) (9493 junctions) but still represents a reasonable degree of EST support. We note that EST sequences are not comprehensive representation of the transcriptome and the percentage of junctions supported by EST is only a conservative estimate of specificity. We can see that SpliceMap has the similar EST validation rate but detected more junctions (151 317 versus 133 722) compared with TopHat.
Table 5 gives the sensitivity of SpliceMap for detecting junctions in the 12 755 genes with a single isoform in RefSeq. The junctions are binned according to the expression level (in RKPM) of the corresponding gene, and the rate of detection is computed for each bin. It is seen that SpliceMap achieves very high sensitivity (>95%) for genes with high expression (RKPM > 20). The sensitivity is still high (90%) for genes with medium expression (5 < RKPM < 20) and remains substantial (40–67%) for genes with relatively low expression (1 < RKPM < 5). However, for genes with very low expression (RKPM < 1) the sensitivity drops to below 7%. Our result suggests that much deeper sequencing than 23 million paired-reads will be necessary for the analysis of such rare transcripts. Finally, Table 5 also shows that at all levels of gene expression, SpliceMap can detect more RefSeq annotated junctions than a representative of current method (TopHat).
We also examine the degree of junction recovery for the genes with single RefSeq isoform. For each gene, we measure its degree of junction discovery by computing the percentage of junctions that have been detected. The results are presented in Table 6. We can see that more genes detected by SpliceMap are of higher degree (80–100%) of completeness in junction discovery. In particular, this completeness may be important in downstream analyses such as isoform reconstruction and abundance estimation.
In addition to the evidence from EST, our novel junctions are validated experimentally by PCR experiments. Twenty predictions were randomly selected, including 18 novel exon skipping events and two novel exons detections, all without any currently known EST or human mRNA evidence. Eighteen novel skipping events come from five bins with different numbers of non-redundant reads: 1, 2, 3–5, 6–10 and >10. Three to four predictions were randomly selected from each bins. PCR validation showed that 17 (85%) were validated. Of the three false predictions, two occur in the bin of single non-redundant read. Thus, among novel predictions with nNR > 1, the PCR validation rate is 13/14 = 92.86%. This is significantly higher (using Z test, the P-value is 0.004) than the EST validation rate of 58% which we believe is an underestimate of specificity because of incomplete EST coverage for rare isoforms.
To see how SpliceMap compares with a simple approach that does use the reference annotation; we also ran an annotation-dependent tool ERANGE (10) on the same data set. ERANGE identified 160 899 junctions, among which 127 043 junctions were also detected by SpliceMap. All results found by ERANGE are not novel but include some junctions that are not detected by SpliceMap, because ERANGE has relaxed requirements of the length of the minor flanking sequence. ERANGE requires at least 4 nt on each flanking sequence, while at least 10 nt is required by SpliceMap. The stricter requirement guarantees the reasonable reliability of the predictions of novel junction by SpliceMap. Novel junction discovery is the major function of SpliceMap, which therefore cannot be replaceable by annotation-independent ERANGE. Among 151 317 junctions found by SpliceMap, 24 274 are not reported by ERANGE, 23 020 of which are novel.
We also compared SpliceMap with BLAT, a tool commonly used for aligning EST sequences. In order to make a fair comparison, we optimized the parametric setting in BLAT and also filter the results by requiring the presence of canonical splicing signal. We found that when BLAT was run with the most non-stringent parameters and with no filtering, its sensitivity is ~55% that of SpliceMap. However, in this non-stringent setting the junctions detected by BLAT are mostly false-positives (specificity is only 3%). With more stringent parameters and with the addition of post-alignment filtering steps, we can improve the specificity very substantially but at the expense of further loss of sensitivity. At its best setting, BLAT achieved a similar but still slightly lower level of specificity with a much lower sensitivity (70% lower) as compared to SpliceMap. The details of the comparison between the two methods on the Illumina Brain data set are presented in Table 7.
BLAT is a well-optimized alignment tool but it was not designed specifically for exon junction detection from short reads. Both SpliceMap and BLAT are based on using fast mapping of segments (seedings) of the reads to narrow the search regions for more detail alignment of the reads. As a splice detection tool specific for short RNA-seq data, SpliceMap uses 25-nt seeding while BLAT usually uses about 8- to 12-nt seeding. SpliceMap generates the seeding by using short-read alignment tools such as ELAND and SeqMap, while BLAT makes use of a hash table. The short seeding allows BLAT to find the splices that span small exons, but leads to many false mappings. As discussed in the above, these false predictions can be detected and removed in post-processing steps based on canonical splicing signal and paired-end information. However the post-processing filters cannot rescue the loss of sensitivity in the BLAT that is evident even when BLAT is run with the most tolerant parameter setting. BLAT has its own ad hoc post-processing steps that are not specially designed for exon junction detection. When the reads are short, these steps may remove many correct splices. Although we expect that BLAT’s performance may improve as the read length increases, for current RNA-seq data, it cannot replace a custom-designed tool for junction detection such as SpliceMap.
The better performance of SpliceMap is achieved at the cost of longer running time—it took SpliceMap 66 CPU hours to process all the 23 million reads. As a comparison, it took TopHat 12 CPU hours to process the same data set. However, it is worth pointing out that this may be due to the fact that SpliceMap was written in Python, while TopHat was written in C++. Moreover, in our experiments we used a 16-core server so that the overall running time was greatly reduced. The Python part of SpiceMap has been rewritten in C++ and speeds up about 4 times.
The use of paired-end sequencing with long reads (50-nt or longer) promises to greatly enhance our ability to characterize transcript isoforms, but this promise can be fulfilled only if suitable computational methods are available to analyze the data. In this paper we reported that 50-nt reads can support an approach of direct de novo detection of splice junctions without the need to first cluster reads to identify putative exons, and that this approach can achieve significantly higher sensitivity in junction detection than current leading methods of RNA-seq analysis. Our results were demonstrated by a systematic analysis of a new data set consisting of 23 million paired 50-nt reads from human brain tissue. Using EST sequences as validation, we found that our method SpliceMap can predict a junction with very high validation rate (95%) if the prediction is supported by at least two non-redundant reads (nNR > 1). However, even predictions supported by only single non-redundant read (nNR = 1) can have high specificity if we implement suitable additional filters to remove unreliable predictions. Including predictions with nNR = 1 that passed these filters greatly increased the sensitivity of detection and still maintained good validation rate (87.9%). Furthermore, EST validation rate is likely to underestimate the specificity of novel predictions, and this was confirmed by our experimental tests of a subset of novel predictions by PCR-based method.
The question of the sensitivity of junction detection from RNA-seq is an important issue that has not be thoroughly examined in the current literature. One way to frame this question is to ask, for a given sequencing depth, what is the C50 which is defined as the lower limit of copy number per cell in order for a given junction to have a 50% chance of being detected. Based on the assumption that one RKPM corresponds to about 0.3 copies per cell for the protocol used in this study (10), we can estimate from Table 5 that for a RNA-seq with a depth of 23 million paired 50-nt reads, the C50 is about one copy per cell.
We found that paired-read information can help to reduce false discoveries. For the Illumina protocol used in to produce our data, the distance between two paired-end reads is about 200 nt in the mRNA and there may sometimes exist an intron between the two reads. Since the overwhelming majority of introns are smaller than 400 000 nt in size, SpliceMap will filter out any paired hits that are separated by more than 400 000 nt from all subsequent computations. Figure 5 shows an example of false predictions eliminated by the use of paired-end filtering. Compared to using only single reads (i.e. treating the same data set as 46 million unpaired 50-nt reads), the use of paired-end filtering increases the EST validation rate from 86.19 to 87.9% for all predictions and from 38.53 to 41.24% for novel predictions, without compromising the sensitivity of detection of annotated junctions. Although, single-end SpliceMap still makes better performance than TopHat, we recommend researchers using paired-end sequencing if possible. Since the EST validation rate underestimates the true specificity of novel predictions, the reduction of false predictions should be more substantial than indicated by the above numbers.
National Institutes of Health (1R01HG004634 to W.H.W.); Junior faculty grant from the Edward Mallinckrodt Jr. Foundation (to Y.X.) Funding for open access charge: 1R01HG004634.
Conflict of interest statement. None declared.
KFA, HJ and WHW conceived the study, designed the algorithm and carried out data analyses. KFA implemented the algorithm and drafted the manuscript. HJ and WHW revised the manuscript. L.L and XY. performed the experimental validation.
Shujun Luo and Gary Schroth of Illumina Inc. provided the RNA sequencing data. Gary Schroth also provided helpful comments on the manuscript.