The messenger RNA (mRNA) transcriptome consists of all mRNA molecules transcribed from the genome within a functioning cell. Different genes give rise to different transcripts with varying abundance. In addition, through the mechanism of alternative splicing, different subsets of exons in a gene may be concatenated (in transcription order) to form different transcript isoforms (1–4
). The diversity and abundance of isoforms transcribed from a gene are known to vary in response to cellular differentiation and maturation, as well as environmental factors and disease. The totality of transcripts present, and their individual abundance, characterizes the mRNA transcriptome and is a most basic phenotype. Thus, the difference between transcriptomes sampled from healthy and diseased cells may provide insight into the functional consequences of disease, as well as identifying biomarkers to classify different disease types (5
). Similarly, the difference between transcriptomes sampled at different stages in cell development may provide insight into the functional effects of cell differentiation and cell life cycles (2
Classically, the differential analysis of transcriptomes has been studied using techniques such as microarray technologies (7
) that identify differences in the total expression of known gene transcripts and exon arrays (8
) that detect differences in the expression of known gene exons. More recently, high-throughput sequencing methods, such as RNA-seq (10
), have been able to accurately record short sequences of nucleotides sampled from millions of mRNA molecules in the transcriptome, and thereby are capable of observing samples from known and unknown transcripts, providing a more complete picture of the transcriptome. In addition, the large number of molecules sampled provides the potential to accurately estimate relative abundance of transcript isoforms.
Three basic strategies have emerged to identify ‘differential transcription’, the difference in the relative abundance of the individual transcripts across samples. The first strategy, e.g. Cufflinks (6
), performs transcript inference and abundance estimation followed by differential test of relative abundance. Such an approach is ideal, but its performance relies on accurate transcript quantification, which is itself a challenging problem. The RNA-seq reads generated by most sequencing platforms are <100 nt single or paired ends. In genes with a significant number of very similar alternative transcripts, they are too short to be assigned to individual transcripts unambiguously, making the transcript quantification problem underdetermined. demonstrates a gene with four isoforms as a result of two alternative splicing events. Transcripts could start and end at any exon, or even within exons. Assuming no transcript annotation is known, there can be more than one set of valid transcripts, as shown in b. Even with four known transcripts as given, there could be multiple solutions of valid quantification (c). In this case, the problem of transcript quantification is ‘unidentifiable’ (11
) and may result in inaccurate abundance estimation. Consequentially, the uncertainty in transcript quantification may lead to false discoveries of genes with differential transcription.
Figure 1. Challenges in using short reads to identify transcripts and their abundance. (a) An example of alternatively spliced gene with five exons. Each black rectangle denotes an exon, and the arrows denote the splice junctions connecting the exons. Two alternative (more ...)
The second strategy indirectly detects differential transcription by aggregating changes of multiple features on the transcriptome (12
). For example, a non-parametric statistical test called maximum mean discrepancy was designed in (12
) for the comparison of read coverage on all exons. Flow difference metric (FDM) was designed to capture the average flow difference of all divergence nodes between two splice graphs (13
). These approaches do not rely on any transcript information. However, they provide no simple localization of differences: maximum mean discrepancy and FDM can only detect a diffuse ‘signal’ of differential transcription without identifying the specific isoforms or regions that give rise to the difference.
The last strategy examines differential expression on annotated simple alternative transcription events in existing splicing databases. Examples include ALEXA-seq (14
), MISO (15
), SpliceTrap (16
) and MATS (17
). These methods have been shown to be accurate in identifying differences in utilization of a skipped exon by isoforms in two samples. However, they do not extend easily to more complex alternative splicing patterns with more than two alternative splice forms. These methods cannot be easily generalized to accommodate novel alternative splicing events that can be discovered by RNA-seq data, consequentially misinterpreting the data and the splicing events.
In this article, we present an ab initio method named DiffSplice for the detection and visualization of differential alternative transcription. DiffSplice circumvents the need for full-length transcript inference and quantification and localizes its search at alternative splicing modules (ASMs) (d). These modules represent the genomic regions, where alternative transcripts diverge, localizing the nature of the difference and decreasing the complexity of the differential analysis by comparing corresponding ASMs between samples (e). The ASMs are detected automatically from a transcriptome-wide expression-weighted splice graph (ESG), which is built directly from read alignments and captures all the sample-relevant splicing events including novel ones. Expression estimation of associated isoforms and tests for differential transcription start from the simplest ASMs, which yield estimation that is more robust to sequencing bias, and work outward. A non-parametric statistical test is introduced to assign the significance level of the differential transcription in the ASMs with a controlled false discovery rate (FDR). By design, differential analysis on ASM can be performed using short reads.
Our results on synthetic data sets demonstrate the precision of DiffSplice in the discovery and the expression estimation of ASMs and hence the sensitivity in the quantitation of transcriptional differences between samples. Simulation experiments on human transcriptome support the robustness of our method at different sampling depths and under various sampling biases. We applied DiffSplice on a time course lung differentiation data set, where 498 genes were tested to have significant change of transcription, as well as 2077 with significant change of overall gene expression, supporting the hypothesis that differential transcription is the key in the mucociliary cell differentiation and function. We also discovered 910 novel alternative splicing events that were not present in existing RefSeq and UCSC transcript annotations. The consideration of replicates in test statistics allowed DiffSplice to account for sample variations, reducing the risk of unreliable discoveries. Beyond the scope of differential transcription in alternatively spliced exons, the application of the proposed method on a breast cancer data set led to the discovery of cell line-specific structural variations such as deletions, demonstrating the feasibility in identifying irregular transcription variants that may reveal crucial regulatory mechanism in a cancer transcriptome.