Trinity: a method for de novo transcriptome assembly
Trinity is a modular method and software package, combining three components (): Inchworm, Chrysalis and Butterfly. In contrast to de novo assembly of a genome, where few large connected sequence graphs can represent connectivities among reads across entire chromosomes, in assembling transcriptome data we expect to encounter numerous individual disconnected graphs, each representing the transcriptional complexity at non-overlapping loci. Accordingly, Trinity partitions the sequence data into these many individual graphs, and then processes each graph independently to extract full-length isoforms and tease apart transcripts derived from paralogous genes.
In the first step, Inchworm assembles reads into the unique sequences of transcripts. Inchworm () uses a greedy k-mer based approach for fast and efficient transcript assembly, recovering only a single (best) representative for a set of alternative variants that share k-mers (due to alternative splicing, gene duplication, or allelic variation). Next, Chrysalis () clusters related contigs that correspond to portions of alternatively-spliced transcripts or otherwise unique portions of paralogous genes. Chrysalis then constructs a de Bruijn graph for each cluster of related contigs, each reflecting the complexity of overlaps between variants. Finally, Butterfly () analyzes the paths taken by reads and read pairings in the context of the corresponding de Bruijn graph and reports all plausible transcript sequences, resolving alternatively spliced isoforms and transcripts derived from paralogous genes. Below, we describe each of Trinity’s modules.
Inchworm assembles contigs greedily and efficiently
Inchworm efficiently reconstructs linear transcript contigs in six steps (Methods,
). Inchworm (1) constructs a k-mer dictionary from all sequence reads (in practice, k=25); (2) removes likely error-containing k-mers from the k-mer dictionary; (3) selects the most frequent k-mer in the dictionary to seed a contig assembly, excluding both low-complexity and singleton k-mers; (4) extends the seed in each direction by finding the highest occurring k-mer with a k-1 overlap with the current contig terminus and concatenating its terminal base to the growing contig sequence (once a k-mer has been used for extension, it is removed from the dictionary); (5) extends the sequence in either direction until it cannot be extended further, then reporting the linear contig; (6) repeats steps 3–5 starting with the next most highly abundant k-mer, until the entire k-mer dictionary has been exhausted.
Inchworm does not capture the full complexity of the transcriptome; for example, only one alternatively spliced variant can be reported at full length per locus, with partial sequences reported for unique regions of any alternatively spliced transcripts. However, its contigs do maintain the information required by subsequent Trinity components to reconstruct and search the entire graph containing all possible sequences. Indeed, these contigs provide a complete representation of the sequence-overlap-based de Bruijn graph, with each k-mer being unique in the set, and the k-1 sub-sequences implicitly defining the edges in the graph. This approach is much more efficient than computing a full graph from all reads at once, and it quickly provides a meaningful intermediate output of the strong-supported contigs.
Chrysalis builds de Bruijn transcript graphs
Chrysalis clusters minimally-overlapping Inchworm contigs into sets of connected components, and constructs complete de Bruijn graphs for each component (, Methods). Each component defines a collection of inchworm contigs that are likely to be derived from alternative splice forms or closely-related paralogs. Chrysalis works in three phases (Methods). (1) It recursively groups inchworm contigs into connected components. Contigs are grouped if there is a perfect overlap of k-1 bases between them and if there is a minimal number of reads that span the junction across both contigs with a (k-1)/2 bases match on each side of the (k-1)-mer junction. (2) It builds a de Bruijn graph for each component using a word size of k-1 to represent nodes, and k to define the edges connecting the nodes. It weights each edge of the de Bruijn graph with the number of (k-1)-mers in the original read set that support it. (3) It assigns each read to the component with which it shares the largest number of k-mers, and determines the regions within each read that contribute k-mers to the component.
Butterfly resolves alternatively spliced and paralogous transcripts
Butterfly reconstructs plausible full-length, linear transcripts by reconciling the individual de Bruijn graphs generated by Chrysalis with the original reads and paired-ends. It reconstructs distinct transcripts for splice isoforms and paralogous genes, and resolves ambiguities stemming from errors or from sequences of length greater than k bases that are shared between transcripts.
Butterfly consists of two parts (, Methods). During graph simplification, Butterfly iterates between (1) merging consecutive nodes in linear paths in the de Bruijn graph to form nodes that represent longer sequences; and (2) pruning edges that represent minor deviations (supported by comparatively few reads), which likely correspond to sequencing errors. Diploid polymorphisms are expected to be more frequent than sequencing errors and will likely be maintained. In plausible path scoring, Butterfly identifies those paths that are supported by actual reads and read pairs, using a dynamic programming procedure that traverses potential paths in the graph while maintaining the reads (and pairs) that support them (Methods). Since reads and sequence fragments (paired reads) are typically much longer than k, they can resolve ambiguities and reduce the combinatorial number of paths to a much smaller number of actual transcripts, enumerated as linear sequences.
RNA-Seq of Schizosaccharomyces pombe
We first generated RNA-Seq data from the fission yeast Schizosaccharomyces pombe
. The S. pombe
has relatively substantial splicing for a eukaryotic microorganism, with short introns (mean intron length = 80.6 bp) and dense transcripts (mean intergenic region = 938 bp based on coding genes only). To maximize transcript coverage, we pooled ~154 million pairs of strand-specific13, 14
76 base Illumina read sequences from four biological conditions: mid-log growth, growth as all glucose has been consumed, late stationary phase, and heat shock (Methods,
Sensitivity limit for full-length reconstruction
We next estimated the upper limit for which annotated transcripts can possibly be perfectly reconstructed given a particular dataset of sequences. Any k
-mer-based assembly approach is limited to those sequences that are represented by the exact k
-mer composition of the RNA-Seq read set. To determine this empirical upper sensitivity limit (“Oracle Set”), we built a k
-mer dictionary from all the reads and identified all known reference protein-coding sequences (CDS) that are “reconstructable to full length (FL) given the read set”, as those sequences that can be populated by adjacent and overlapping k-mers across their entire length. Since this Oracle set also contains transcript sequences that are covered by k-mers
, but not entire reads
, some transcripts will appear reconstructable but are not. Conversely, the Oracle set reflects only annotated known
genes and known
isoforms, which are likely an underestimate, especially in mammals16
. Nevertheless, the Oracle set provides a useful sensitivity benchmark.
In the S. pombe
data set, nearly all (91%, 4600/5064) reference CDS sequences exist in the Oracle Set (25-mer dictionary, 154M paired-reads), since almost all encoded transcripts (98%) are expressed in the measured conditions (>= 0.5 Fragments Per transcript Kilobase per Million fragments mapped (FPKM)), consistent with previous studies in yeasts5, 17, 18
. When reducing the coverage by random sub-sampling, the size of the Oracle Set is saturated at 50M paired reads (4494/5064, Supplementary Fig. 1
online), which we chose as our subsequent benchmarking set.
Trinity recovered most S. pombe transcripts
From the 50M pairs of reads, Trinity fully reconstructed 86% of annotated transcripts (4338/5064, Supplementary Table 1
online) at full length, including 94% of the stringently-defined oracle transcripts (4218/4494). Of the 276 oracle transcripts not fully reconstructed, 90 (33%) are reconstructed over at least 90% of their length, and 177 (64%) are reconstructed over at least 50% of their length.
Overall, Trinity generated 27,841 linear contigs longer than 100 bases, grouped into 23,232 components (Supplementary Note
online). Only 2,454 of the 27,841 Trinity contigs did not align to the genome using GMAP19
. Of those, 30% match a Uniref9020
protein (BLASTX E<=1e-10), almost invariably (90%) a Schizosaccharomyces
protein, and likely reflect assemblies with error-rich reads.
Trinity reconstructs full-length transcripts across a broad range of expression levels and sequencing depths (). For example, it accurately captured the full-length transcript of 71% of genes from the second quintile (5–10%), and 81–95% full-length coverage of annotated transcripts in the remaining quintiles. Considering both full-length and partial reconstructions, Trinity reconstructed a large fraction of the bases in each transcript ().
Trinity correctly reconstructs the majority of full-length transcripts in fission yeast and mouse
In many cases, Trinity accurately resolved the sequences of closely-related paralogous transcripts. Out of 77 gene families containing 185 paralogs21
, Trinity recovered at full length all members of 33 families (68 genes), at least one member from an additional 33 families (46 genes found, 45 genes missing), and missed all 26 genes in the remaining 11 families, often involving lowly expressed genes. Some of the most highly expressed transcripts in S. pombe
are derived from paralogous genes with very similar sequences (e.g.
, those encoding ribosomal proteins21
), yet were resolved by Trinity.
Extended UTRs and long anti-sense in S. pombe
Compared to the existing annotation, Trinity extended the 5′UTR of 312 transcripts (median extension = 80bp; average = 176bp), and the 3′UTR of 543 transcripts (median = 72bp; average = 172bp) (Supplementary Fig. 2a,b
online). It also found 3,726 previously unannotated 5′ UTRs (median length 183bp, average length = 288bp), and 3,416 3′ UTRs (median length = 272bp, average length = 397bp).
Trinity identified 2319 transcripts at 1235 intergenic loci as novel transcribed sequences (, Methods
) and 612 long antisense transcripts that covered more than 75% of the length of the corresponding sense transcript (), and were not likely to be derived from extended transcription of a neighboring gene. 113 of the intergenic transcripts and 612 long antisense transcripts were multiexonic. Although both were more lowly expressed on average than annotated protein-coding genes (Supplementary Fig. 3
online), 49 long antisense transcripts (at 35 loci) were at least 5-fold more highly expressed than the corresponding sense coding transcript (e.g.
, an antisense transcript to the meiotic gene Mug27/Slk1 (SPCC417.06c) was >100-fold more highly expressed, Supplementary Fig. 4
online). This supports a role for antisense transcriptional regulation in meiosis for S. pombe15, 22–24
, and is consistent with previous findings in S. cerevisiae25
Trinity improves the yeast annotation
Trinity recovered most expressed annotated mouse transcripts
Compared to yeasts, mammalian transcriptomes exhibit substantially more complex patterns of alternative splicing26
. To test Trinity’s ability to identify different isoforms, we sequenced ~52.6 million 76b read pairs from C567BL/6 mouse primary immune dendritic cells (DCs). Unlike in S. pombe
, only 54% of known mouse genes (10,724) were identified as expressed (>= 0.5 FPKM), and of those, the Oracle set determined 8,358 to be full-length reconstructable (727 loci have two or more isoforms variable in the CDS, totaling 9,258 transcripts).
Trinity reported 48,497 contigs longer than 350bp, capturing 8,185 transcripts to full-length (Supplementary Table 2, Supplementary Note
online), corresponding to 7,749 loci (including 7947 (86%) transcripts at 7573 (91%) loci in the mouse Oracle set). The percentage of transcripts recovered to full-length and the fraction of length captured were high across a broad range of expression levels ().
Trinity resolved splice isoforms and gene paralogs in a manner consistent with the mouse Oracle set. Trinity found 872 full length, alternatively spliced, isoforms from 385 loci (53% of the loci with alternatively spliced variants in the Oracle set), and matched the full-length transcripts for 463 (61.6%) of 752 paralogous transcripts in the Oracle set (> 70% identity between paralogs, Methods, ).
Trinity resolves closely paralogous genes
Trinity extended the annotated 5′UTR for 5,265 transcripts (5,036 loci, median length = 43, average = 91, Supplementary Fig. 2c
online), and included one or more additional 5′UTR exons in 305 cases (Supplementary Fig. 5
online). It extended the 3′UTR in 2918 transcripts (2819 loci, median length = 20, average = 248, Supplementary Fig. 2d
online), adding 3′UTR exons in 62 cases (Supplementary Fig. 2b
online). UTR length differences were often due to alternative splicing events restricted to the UTR.
High sequence fidelity of reconstructed transcripts
We measured the assembled transcript base error rate by aligning the full-length transcripts to the corresponding reference genome (using BLAT), and capturing mismatches, insertions, and deletions from the highest scoring alignment (Supplementary Table 3
online). In fission yeast, rates of mismatches, insertions, and deletions are each less than 1 in 10,000. In mouse, rates were approximately twice as high, reflecting the lesser transcript fold-coverage. Since the raw read error rate is ~1%, Trinity thus resolved ~99% of sequencing errors.
Comparing Trinity’s performance to other methods
We compared Trinity’s performance to that of other assemblers by several measures. First, we examined the number of reference transcripts reconstructed to full-length by each method (‘sensitivity’). In S. pombe
, Trinity out-performed the de novo
sequence assemblers, ABySS1
, as well as the ‘mapping first’ programs Scripture3
(). Trinity performed well across a range of 10M to the full 150M input sequence reads, while the alternative methods tended to peak at ~50M pairs or smaller inputs (Supplementary Fig. 6a
online). In mouse (RefSeq annotation set, ), Trinity (8185 transcripts, 7749 genes) outperformed the other de novo
assembly methods ABySS (5561, 5500), Trans-ABySS (7025, 6598) and SOAPdenovo (761, 760), with the mapping-first programs Cufflinks (9010, 8536) and Scripture (9086, 8293) exhibiting better sensitivity. Furthermore, Trinity and Cufflinks appear best-tuned in their sensitivity across the broadest range of expression levels (Supplementary Fig. 7
online). Unlike Trinity, several of the de novo
methods did not perform well in fully reconstructing transcripts within the highest
expression quintiles (Supplementary Fig. 7
Comparison of Trinity to other mapping-first and assembly-first methods
Second, we assessed the accuracy of splice pattern detection by examining individual introns and the combinations of introns (‘splicing patterns’) defined by mapping all the reconstructed transcripts (annotated or not) back to the reference genome (, Methods). We compared the number of annotated reference introns (or splicing patterns) captured by each method, and the number of previously unannotated introns (or extended splicing patterns). Unannotated introns or splice patterns captured by more than one method are less likely to be false positives. In S. pombe, Trinity identified the largest number of reference introns (4,543) () and 1,582 unannotated introns, most of which are in putative, unannotated UTRs. 1,174 of these (74%) are also identified by at least one other method and thus are more likely genuine. Trinity also identifies the largest number of annotated splicing patterns in S. pombe (). The large numbers of falsely-fused S. pombe transcripts reported by the alternative methods contributes to their lack of sensitivity.
In mouse, most methods had comparably high sensitivity for detecting individual annotated introns (), but varied in detecting complete splicing patterns (). Scripture identifies the most annotated splicing patterns (7,274), closely followed by Trinity (7,127). However, Scripture reports over 110,000 unique splicing patterns, ~10-fold more than Trinity and all other methods (each less than 10,000 unique patterns), suggesting many false positives in Scripture, and excellent precision in Trinity. Overall, relatively few of the non-annotated splicing patterns predicted by each method are supported by at least one other method (18–25%, except Scripture (2%) and ABySS (66%)).
Finally, we examined the number of distinct contigs that mapped to each reference genomic locus, as well as the coverage (tiers) of reconstructed transcripts per locus. This accounts for multiple reported transcripts that represent the same region of a locus, for example due to alternative splicing, captured allelic variation, or enumerating transcripts with otherwise undetected sequencing errors. In S. pombe, Trinity reports 7,057 transcripts that map to 4,874 genes with an average coverage of 1.37 tiers per gene, similar to the alternative methods except Scripture (4.37 tiers per gene) and trans-ABySS (5.08 tiers per gene). Similarly, in mouse, Trinity performs comparably well to all other methods (31,706 contigs map to 11,334 genes, 2.05 tiers per gene on average), except trans-ABySS (111,000 contigs, 10,685 genes, 5.93 tiers). The large numbers of trans-ABySS transcripts covering similar regions of loci is not reflected in the number of distinct splicing patterns, indicating that multiple similar transcript sequences are being generated at individual loci, rather than many different splice isoforms. ABySS alone, although lacking the higher sensitivity of trans-ABySS, reports a smaller number of contigs (~1 transcript tiers per locus).
De novo assembly of the whitefly transcriptome
In the absence of a sequenced genome, de novo
assembly of RNA-Seq is the only viable option to study the transcriptomes of most organisms to date. For example, although the highly diverse class Insecta
contains several key model organisms, it is not densely covered by high quality draft genome sequences. In addition, insect transcriptomes exhibit complex alternative splicing patterns28
. The whitefly B. tabaci
is one such example, for which the genome was not sequenced, and where RNA-Seq samples are genetically polymorphic, since they are derived from a mixture of individuals from an outbred population28
We applied Trinity to a published RNA-Seq dataset from whitefly, consisting of ~21.9 million pairs of 76 base Illumina reads, sequenced using conventional non-strand-specific methods29
. Trinity produced 196,000 transcripts, 14,522 longer than 1,000 base pairs, capturing allelic variants (e.g.
, ). Of those, 4,323 had top blastx matches (E <= 1−10
) to 2,880 unique Uniref9020
protein sequences, along at least 80% of the corresponding homologous protein sequence. This number of approximately full-length Trinity-assembled transcripts is substantially higher than achieved by other de novo
Trinity reconstructs polymorphic transcripts in whitefly
To assess the extent to which alternative splice forms are captured by the Trinity assembly, we self-aligned all contigs within individual graph components, requiring at least one alternative internal exon of minimum length 21 bp and a multiple of 3. By this definition, 325 components contain at least two different isoforms. One such example (), is a highly conserved ortholog to an ELAV-like protein in the ant Harpegnathos saltator, which is present as two different isoforms involving inclusion of two different alternatively-spliced exons.