Recently, high-throughput sequencing of mRNA (RNA-Seq) has revealed tissue-specific alternative splicing4
, novel genes and transcripts5
, and genomic structural variations6
. Deeply sampled RNA-Seq permits measurement of differential gene expression with greater sensitivity than expression7
microarrays. However, the analysis of RNA-Seq data presents major challenges in transcript assembly and abundance estimation arising from the ambiguous assignment of reads to isoforms8-10
. In earlier RNA-Seq experiments conducted by some of us, we estimated the relative expression for each gene as the fraction of reads mapping to its exons after normalizing for gene length11
. We did not attempt to allocate reads to specific alternate isoforms although we found ample evidence that multiple splice and promoter isoforms are often co-expressed in a given tissue 2
. This raised biological questions about how the different forms are distributed across cell types and physiological states. In addition, our prior methods relied on annotated gene models that, even in mouse, are incomplete. Longer reads (here 75bp versus 25bp in our prior work) and pairs of reads from both ends of each RNA fragment can reduce uncertainty in assigning reads to alternative splice variants12
. To produce useful transcript-level abundance estimates from paired-end RNA-Seq data, we developed a new algorithm that can identify complete novel transcripts and probabilistically assign reads to isoforms.
For our initial demonstration of Cufflinks, we performed a time course of paired-end 75bp RNA-Seq on a well-studied model of skeletal muscle development, the C2C12 mouse myoblast cell line13
(Methods). Regulated RNA expression of key transcription factors drives myogenesis and the execution of the differentiation process involves changes in expression of hundreds of genes14,15
. Prior studies have not measured global transcript isoform expression, though there are well-documented expression changes at the whole gene level for a set of marker genes in this system. We aimed to establish the prevalence of differential promoter use and differential splicing, because such data could reveal much about the model's regulatory behavior. A gene with isoforms that code for the same protein may be subject to complex regulation in order to maintain a certain level of output in the face of changes in expression of its transcription factors. Alternatively, genes with isoforms that code for different proteins could be functionally specialized for different cell types or states. By analyzing changes in relative abundances of transcripts produced by the alternative splicing of a single primary transcript, we hoped to infer the impact of post-transcriptional processing (e.g. splicing) on RNA output separately from rates of primary transcription. Such analysis could identify key genes in the system and suggest experiments to establish how they are regulated.
We first mapped sequenced fragments to the mouse genome using an improved version of TopHat16
, which can align reads across splice junctions without relying on gene annotation (Supplementary Methods Section 2
). Out of 215 million fragments, 171 million (79%) mapped to the genome, and 46 million spanned at least one putative splice junction (Supplementary Table 1
). Of the splice junctions spanned by fragment alignments, 70% were present in transcripts annotated by UCSC, Ensembl, or Vega.
To recover the minimal set of transcripts supported by our fragment alignments, we designed a comparative transcriptome assembly algorithm. EST assemblers such as PASA introduced the idea of collapsing alignments to transcripts based on splicing compatibility17
, and Dilworth's Theorem18
has been used to assemble a parsimonious set of haplotypes from virus population sequencing reads19
. Cufflinks extends these ideas, reducing the transcript assembly problem to finding a maximum matching in a weighted4
bipartite graph that represents compatibilities17
among fragments ( and Supplementary Methods Section 4
). Non-coding RNAs20
have been reported to regulate cell differentiation and development, and coding genes are known to produce noncoding isoforms as a means of regulating protein levels through nonsense-mediated decay22
. For these biologically motivated reasons, the assembler does not require that assembled transcripts contain an open reading frame. Since Cufflinks does not make use of existing gene annotations during assembly, we validated the transcripts by first comparing individual time point assemblies to existing annotations. We recovered a total of 13,692 known isoforms and 12,712 new isoforms of known genes. We estimate that 77% of the reads originated from previously known transcripts (Supplementary Table 2
). Of the new isoforms, 7,395 (58%) contain novel splice junctions, with the remainder being novel combinations of known splicing outcomes. 11,712 (92%) have an open reading frame (ORF), 8,752 of which end at an annotated stop codon. Although we sequenced deeply by current standards, 73% of the moderately abundant (15-30 FPKM) transcripts detected at the 60 hour time point with three lanes of GAII transcriptome sequencing were fully recovered with just a single lane. Because distinguishing a full-length transcript from a partially assembled fragment is difficult, we conservatively excluded novel isoforms that were unique to a single time point from further analyses. Out of the new isoforms, 3,724 were present in multiple time points, and 581 were present at all time points. 6,518 (51%) of the new isoforms and 2,316 (62%) of the multiple time point novel isoforms were tiled by high-identity EST alignments or matched RefSeq isoforms from other organisms, and endpoint RT-PCR experiments confirmed new isoforms in genes of interest (Supplementary Table 3
). We concluded that a majority of the unannotated transcripts we found are in the myogenic transcriptome, and that the mouse annotation remains incomplete.
Figure 1 Overview of Cufflinks. The algorithm takes as input cDNA fragment sequences that have been (a) aligned to the genome by software capable of producing spliced alignments, such as TopHat. With paired-end RNA-Seq, Cufflinks treats each pair of fragment reads (more ...)
For the purposes of estimating transcript abundances, we first selected a set of 11,079 genes containing 17,416 high-confidence isoforms (Supplementary File 1
). Of these, 13,692 (79%) were known and the remaining 3,724 (21%) were novel isoforms of known genes present in multiple time points. We then developed a statistical model of RNA-Seq parameterized by the abundances of these transcripts (, Supplementary Methods Section. 3
). Cufflinks' model allows for the probabilistic deconvolution of RNA-Seq fragment densities to account for cases where genome alignments of fragments do not uniquely correspond to source transcripts. The model incorporates minimal assumptions23
about the sequencing experiment, and extends the unpaired read model of Jiang and Wong8
to the paired-end case. Abundances were reported in expected F
ilobase of transcript per M
illion fragments mapped (FPKM). A fragment corresponds to a single cDNA molecule, which can be represented by a pair of reads from each end. Confidence intervals for estimates were obtained using a Bayesian inference method based on importance sampling from the posterior distribution. Abundances of spiked control sequences (R2
=0.99) and benchmarks with simulated data (R2
=0.96) revealed that Cufflinks' abundance estimates are highly accurate. The inclusion of novel isoforms of known genes during abundance estimation had a dramatic impact on the estimates of known isoforms in many genes (R2
only 0.90), highlighting the importance of coupling transcript discovery together with abundance estimation.
We identified 7,770 genes and 10,480 isoforms undergoing significant abundance changes between some successive pair of time points (FDR < 5%). Many genes display substantial transcript-level dynamics that are not reflected in their overall expression patterns (Supplementary File 2
). For example, Myc, a proto-oncogene which is known to be transcriptionally and post-transcriptionally regulated during myogenesis24
, is down-regulated overall during the time course, and while isoforms A and B follow this pattern, isoform C has a more complex expression pattern (). We noted that many genes displayed switching between major and minor transcripts, some containing isoforms with muscle-specific functions, such as tropomyosin I and II, which display a dramatic switch in isoform dominance upon differentiation (Supplementary Appendix B
). However, many genes featured dynamics involving several isoforms with behavior too complex to be deemed “switching”. In light of these observations, we classified the patterns of expression dynamics for transcripts, assigning them one of four “trajectories” based on their expression curves being flat, increasing, decreasing or mixed (Methods). Based on trajectory classification, a total of 1,634 genes were found to have multiple isoforms with different trajectories in the time course, and we hypothesized that differential promoter preference and differential splicing were responsible for the divergent patterns.
Figure 2 Distinction of transcriptional and post-transcriptional regulatory effects on overall transcript output. (a) When abundances of isoforms A, B, and C of Myc are grouped by TSS, changes in the relative abundances of the TSS groups indicate transcriptional (more ...)
To explore the impact of regulation on mRNA output and to check whether it could explain the variability of trajectories, we grouped transcripts by their start site (TSS) instead of just by gene. Changes in the relative abundances of mRNAs spliced from the same pre-mRNA transcript are by definition post-transcriptional, so this grouping effectively discriminated changes in mRNA output associated with differential transcription from changes associated with differential post-transcriptional processing. Of the 3,486 genes in our high confidence set with isoforms that shared a common TSS, 41% had TSS groups containing different isoform trajectories. Summing the expressions of isoforms sharing a TSS produces the trajectory for their primary transcript, and we identified 401 (48%) genes with multiple distinct primary transcript trajectories. However, trajectory classification was not precise enough to prioritize further investigation into individual genes and could not form the basis for statistical significance testing. We therefore formalized and quantified divergent expression patterns of isoforms within and between TSS groups with an information-theoretic metric derived from the Jensen-Shannon divergence. With this metric, relative transcript abundances move in time along a logarithmic spiral in a real Hilbert space25
, and the distance moved measures the extent of change in relative expression. Quantification of expression change in this way revealed significant (FDR < 5%) differential transcriptional regulation and splicing in 882 of 3,486 (25%) and 273 of 843 (32%) candidate genes respectively, with 70 genes displaying both types of differential regulation (Supplementary Table 4
). Myc () undergoes a shift in transcriptional regulation of transcript abundances to post-transcriptional control of abundances () between 60 and 90 hours, as myocytes are beginning to fuse into myotubes.
Focusing on the genes with significant promoter and isoform changes, we noted that in many cases changes in relative abundance reflected switch-like events in which there was an inversion of the dominant primary transcript. For example, in FHL3, a transcriptional regulator recently reported to inhibit myogenesis26
, Cufflinks assembled the known isoform and another with a novel start site. We validated the 5′ exon of this isoform along with other novel start sites and splicing events by form-specific RT-PCR (, Supplementary Methods Section 4
). Limiting analysis to known isoforms would have produced an incorrect abundance estimate for the known isoform of FHL3. Moreover, the novel isoform is dominant prior to differentiation, so this potentially important differentiation-associated promoter switch would have been missed (). In total, we tested and validated 153 of 185 putative novel transcription start sites by comparison against TAF1 and RNA polymerase II ChIP-Seq peaks. We also observed switches in the major isoform of alternatively spliced genes. In total, 10% of multi-promoter genes featured a switch in major primary transcript and 7% of alternatively spliced primary transcripts switched major isoforms. We concluded that not only is the impact of promoter-switching on mRNA output significant, many genes are also exhibiting evidence of post-transcriptionally induced expression changes, supporting a role for dynamic splicing regulation in myogenesis. A key question is whether genes that display divergent expression patterns of isoforms are differentially regulated in a particular system because they have isoforms that are functionally specialized for that system. Of the genes undergoing transcriptional or post-transcriptional isoform switches, 26%, respectively 24%, code for multiple distinct proteins according to annotation. Genes with novel isoforms were excluded from the coding sequence analysis, so this fraction likely underestimates the impact of differential regulation on coding potential. We thus speculate that differential RNA level isoform regulation, whether transcriptional, post-transcriptional, or mixed in underlying mechanism, suggests functional specialization of the isoforms in many genes.
Figure 3 Excluding isoforms discovered by Cufflinks from the transcript abundance estimation impacts the abundance estimates of known isoforms, in some cases by orders of magnitude. Four-and-a-half-LIM domains 3 (Fhl3) inhibits myogenesis by binding MyoD and attenuating (more ...)
Although Cufflinks was designed to investigate transcriptional and splicing regulation in this experiment, it is applicable to a broad range of RNA-Seq studies (). The open-source software runs on commonly available and inexpensive hardware, making it accessible to any researcher using RNA-Seq data. We are currently exploring the use of the Cufflinks assembler to annotate genomes of newly sequenced organisms, and to quantify the impact of various mechanisms of gene regulation on expression. When coupled with assays of upstream regulatory activity, such as chromatin state mapping or promoter occupancy, Cufflinks should help unveil the range of mechanisms governing RNA manufacture and processing.
Figure 4 Robustness of assembly and abundance estimation as a function of expression level and depth of sequencing. Subsets of the full 60-hour read set were mapped and assembled with TopHat and Cufflinks and the resulting assemblies were compared for structural (more ...)