In higher eukaryotes, a single transcribed locus can generate several mature mRNA isoforms through alternative splicing (AS). AS is frequently a regulated mechanism that coordinates the removal of the internal non-coding portions of transcripts (introns) with the differential joining of the coding and 5′ untranslated portions (exons). As a result, proteins with similar, different or antagonistic activities can be generated from a single genomic locus (Brett et al., 2002; Maniatis and Tasic, 2002). In addition, AS can lead to downregulation of gene expression by diverting some of the mRNA isoforms to the nonsense-mediated mRNA decay pathway (Lewis et al., 2003).
More than 90% of human genes express primary transcripts that undergo AS (Pan et al., 2008; Wang et al., 2008). Owing to the regulatory power of this process, an increasing number of studies are being directed at understanding AS regulation at the single-exon level (Castle et al., 2008; Johnson et al., 2003; Lewis et al., 2003; Wang et al., 2008). In general, researchers in the splicing regulation field have utilized comparative approaches to reveal tissue-specific (Relogio et al., 2005; Ule et al., 2005) or disease-related (Baumer et al., 2009) AS events. However, such methodologies have not been used to generate maps of AS activity within one cellular condition. The completion of such maps would add a higher level of resolution to transcriptome analysis, allowing precise quantification of exon inclusion levels within a population of related isoforms.
Until recently, systematic analysis of AS was done using expressed sequence tags (ESTs) (Gupta et al., 2004; Sorek et al., 2004; Xie et al., 2002) or specialized microarrays (Castle et al., 2008; Clark et al., 2002; Johnson et al., 2003; Pan et al., 2008). These techniques facilitated the discovery of a large number of alternative transcripts and the extraction of distinctive features of alternatively spliced exons. Nevertheless, they suffer from several limitations. ESTs are subject to cloning biases (especially towards the 3′-end of transcripts), low coverage and insufficient robustness to allow reliable quantification. Likewise, the specificity of splicing microarrays is negatively affected by cross-hybridization with related mRNA molecules.
The development of deep-sequencing technologies provided an alternative to ESTs and microarrays for transcriptome quantification. Two recent studies utilized single-end RNA-seq to analyze a series of human tissues. In Pan et al. (2008), the inclusion level of alternative exons was quantified as a percentage, based on the number of reads that match the two splice junctions formed by exon inclusion relative to the number that match the splice junction formed by exon skipping. Wang et al. (2008) also utilized splice-junction reads for quantification of minor isoforms with different frequencies, as a function of the read coverage or RPKM (reads per kilobase of exon per million mapped reads). Although both studies demonstrated improved coverage relative to microarrays and ESTs, they utilized only isoform-specific reads, leaving out the majority of reads, which map to common exons of different isoforms.
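The two quantities mentioned above can be illustrated with a minimal Python sketch. It is not the implementation used in either cited study. The function names, argument names and example counts are hypothetical. The inclusion percentage assumes one common convention: the two inclusion-junction counts are averaged and divided by the sum of that average and the skipping-junction count. The exact normalization in Pan et al. (2008) may differ.

```python
def rpkm(read_count, exon_length_bp, total_mapped_reads):
    """Reads per kilobase of exon per million mapped reads."""
    if exon_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("exon length and total mapped reads must be positive")
    kilobases = exon_length_bp / 1e3
    millions_mapped = total_mapped_reads / 1e6
    return read_count / kilobases / millions_mapped


def junction_inclusion_percent(upstream_incl, downstream_incl, skipping):
    """Percent inclusion of a cassette exon from junction-spanning reads only.

    Assumption (not taken from the source): the two inclusion junctions are
    averaged so that inclusion and skipping are each represented by a single
    junction-equivalent count.
    """
    inclusion = (upstream_incl + downstream_incl) / 2.0
    total = inclusion + skipping
    if total == 0:
        return None  # no junction reads: the inclusion level is undefined
    return 100.0 * inclusion / total


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(rpkm(read_count=500, exon_length_bp=2000, total_mapped_reads=10_000_000))
    print(junction_inclusion_percent(upstream_incl=30, downstream_incl=50, skipping=10))
```

Only junction-spanning reads enter the second function, so reads that map to exon bodies shared between isoforms are ignored. This is the limitation of junction-only quantification noted above.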
An improved version of the deep-sequencing technique utilizes paired-end tags (Fullwood et al., 2009), which allows a significant gain of coverage and a reduction in read ambiguity through the generation of linked tag pairs that span longer stretches of sequenced template. This technology is especially suitable for AS profiling, because many exon-mapped tags are expected to span splice junctions, and these can be exploited to improve AS quantification.
Two recent methods exploit paired-end sequencing information for transcript quantification: Cufflinks (Trapnell et al., 2009), which is based on a previous RNA-seq model for single-end reads (Jiang and Wong, 2009), and Scripture (Guttman et al., 2010). Both can reconstruct transcript structures using directed graphs, and assign FPKM (fragments per kilobase of exon per million mapped reads) or RPKM values to every transcript, without relying on a reference annotation. Cufflinks uses a mathematical model to identify alternatively spliced transcripts at each gene locus. Scripture employs a statistical segmentation model to distinguish expressed loci, and filters out experimental noise. Both methods were originally designed to identify full-length transcripts and quantify their expression levels, but in our tests they appear not to be optimal for inferring local exon-inclusion ratios, presumably due to limited transcript coverage and sequencing noise.
Here we introduce SpliceTrap, a method to quantify local exon inclusion levels in paired-end RNA-seq data. SpliceTrap generates alternative splicing profiles for different splicing patterns, such as exon skipping, alternative 5′ splice sites, and intron retention. It utilizes a comprehensive human exon database called TXdb (see Section 2) to estimate the expression level of every exon as an independent Bayesian inference problem. Unlike microarray-based methods, SpliceTrap relies on RNA-seq, and therefore it can determine the inclusion level of every exon within a single cellular condition, without requiring a background set of reads.
We tested SpliceTrap both by simulation and by real data analysis. Compared to Cufflinks and Scripture, it demonstrated improved accuracy, robustness and reliability in quantifying a large fraction of AS activity. In particular, SpliceTrap is suitable for studying changes at the single-exon level, and it can facilitate the discovery of nearby cis-regulatory elements in diverse applications. SpliceTrap can be used online through the CSH Galaxy (Goecks et al., 2010) server at http://cancan.cshl.edu/splicetrap and is also available for download and installation at http://rulai.cshl.edu/splicetrap/