Study design and RNA-Seq data collections
Currently five publicly available assemblers have been reported to be used for de novo assembling short-read RNA-Seq data into transcripts. They are SOAPdenovo, ABySS, trans-ABySS, Oases and Trinity. Trans-ABySS was developed by ABySS team that adopted MK strategy to ABySS. Following the same approach, we applied MK strategy to SOAPdenovo and Oases (referred as SOAPdenovo-MK and Oases-MK, respectively). Trinity, on the other hand, fixed its k-mer value at 25 that was not changeable. It used a specially designed algorithm to recover possible transcripts/isoforms to ensure high plausibility. But at the meantime, to assemble the same dataset Trinity required runtime at least 20 folds more than the other programs used under SK condition. So we found it impractical to apply MK strategy to Trinity at current stage. Thus, our design included 7 program conditions: 4 with SK (SOAPdenovo, ABySS, Oases and Trinity) and 3 with MK (SOAPdenovo-MK, trans-ABySS and Oases-MK). All the tests were run on the same single-node machine with 512G memory and 4 AMD Opteron 6168 (12-core) processors.
In order to examine how genome with different complexity affects assembly outcomes, we selected public RNA-Seq data from two model organisms as benchmark: fruit fly (
D. melanogaster) and fission yeast (
S. pombe). Fruit fly has a genome size of 117 Mb, having 22680 protein coding genes and average intron length ~ 2.3kb (based on RefSeq gene sets). Fission yeast has a smaller genome of ~ 12.5 Mb [
20], with 5174 protein coding genes, and average intron length ~ 81bp. Besides both organisms have excellent genome reference available, their distinct genome properties helped elucidate how simple (fission yeast) or more complex (fruit fly) genomes influenced transcriptome assembly. Tea plant,
C. sinensis, has a large genome (~ 4G) yet to be resolved. We hoped to significantly improve on its existing transcriptome assembly, so to demonstrate the usefulness of optimizing strategy and guidelines for
de novo transcriptome assembly.
Comparison of transcript assembly under different program conditions
In order to compare the performance of each assembler, we put in test two sets of benchmark data that displayed different data properties. In addition, we varied the amount of initial inputs from the two sets of data to evaluate the effect of coverage depths on the assembly outcomes (details in
Materials and Methods). The outcomes are summarized in Additional file
1 and
2.
When measured in the number of assembled transcripts, total bases of transcripts, mean length, N50, percentage of low quality transcripts, number of long-transcripts (≥1kb), and number of reads that could be mapped back to transcripts (RMBT), we observed significant improvement on the outcomes when MK strategy was applied to each program. For all paired tests: SOAPdenovo
vs. SOAPdenovo-MK, ABySS
vs. Trans-ABySS, and Oases
vs. Oases-MK, there were at least 50% increases in the number of assembled transcripts, total bases of transcripts, and number of long-transcripts comparing MK to SK (Additional file
1 and
2).
With increasing coverage depth, each assembler generally produced a larger number of transcripts and more total bases, but the mean transcript length and N50, after an initial increase, peaked at a certain threshold and started to decrease. The percentage of RMBT had a pattern reversely correlated to increasing coverage depth for all program conditions except for Trinity.
Overall, Oases-MK assembled the most transcripts and long-transcripts, whereas trans-ABySS/ABySS produced the longest mean transcript length and the largest N50. While Trinity preformed the best in the percentage of RMBT, SOAPdenovo was the worst in the category. The percentage of RMBT is an important benchmark for evaluating the performance of each method. An optimal program should use as many reads as possible to reconstruct high-quality transcripts. Trinity reached almost 90% with the D. melanogaster data, which may be attributed to its greedy k-mer-based approach at the Inchworm step. Oases-MK came in second for this measure. Given the number of low quality transcripts, performance of SOAPdenovo was not satisfactory.
Resources usage by different assemblers
The demand for resources to carry out de novo assembly is an important factor to consider when choosing a software tool. While it was proved to be critical in assembly of large genome, resources usage for assembling transcripts bears some equal importance for practical reason. We monitored and recorded the runtime and memory usage for four SK assemblers running on testing data sets on the same computer. We found the runtime and memory usage were two essential factors that limit the use of a program. The measured data of runtime and memory occupancy for each assembler tested with SK method are illustrated in Figure .
The four SK assemblers displayed distinct memory usage patterns through their processing steps. Among them, Oases consumed the largest maximum memory (at Velvetg step), whereas memory usage by ABySS was the smallest (Figure ). It was assumed that larger data set would consume more memory. This was generally true with all four assemblers as the memory usage displayed a good correlation with the size of testing data (Figure ), though Oases was the most sensitive, and ABySS the least sensitive in response to increasing data size. The k-mer values also had great impact on both memory usage and runtime. Memory usage displayed reverse correlation with k-mer values for Oases but remained constant for SOAPdenovo and ABySS (Figure , Trinity remains unknown as its k-mer value was not changeable). While Trinity required the longest runtime and SOAPdenovo the least for the same testing dataset, the time costs for all four tools, as expected, were approximately proportionate to the size of testing data set (Figure ). Runtimes for ABySS, Oases, and SOAPdenovo were reversely correlated with the k-mer values (Figure ), but the impact was not as dramatic as that of k-mer values on memory usage.
These results indicated that assembly using Oases with small k-mer value requires large memory and may exceed the memory space of a typical computing sever nowadays, and processing of a large data set by Trinity can exceed reasonable execution time and hence becomes impractical. Thus these factors warrant careful consideration when one chooses a tool for analysis as well as setting parameters associated with the tool.
Validating assembled transcripts by mapping to reference genome
To validate assembled transcripts, we mapped each transcript to its reference genome as described in Materials and Methods: Map reconstructed transcripts to reference. Transcripts assembled from D. melanogaster data sets using different methods showed a high percentage in alignment to its reference genome. Less than 0.5% of assembled transcripts failed to align (Figure , shown using Dme-13g data set), and similar results were found using smaller sampling data from D. melanogaster data sets (data not shown). Pairwise alignment using BLAT was performed for transcripts from SOAPdenovo-MK, trans-ABySS, Oases-MK and Trinity. Shared (defined as at least 95% sequence identical between two transcripts from different methods) and unique (if the transcript is not shared, then it was unique) transcripts were then aligned to genome separately. While the shared transcripts were generally validated by mapping to genome at a high percentage, the unique ones were mapped to reference genome at various levels with Trinity being the best and SOAPdenovo the worst (Figure ).
For
S. pombe data set, Trinity, Oases and Oases-MK showed worse performance than for
D. melanogaster data set, with more than 10% transcripts failing to be aligned to reference (Figure ). Unique transcripts accounted for more than 60% of all unmapped-transcripts (Figure ) except for trans-ABySS (33.83%). Except for trans-ABySS (19/45), the rests had over 50% of unique unmapped-transcripts with BLASTX hits (E≤10
-10) to Uniprot database [
21] (Figure ), representing some
bona fide gene transcripts. We further tested whether low quality sequence in
S. pombe data set contributed to the high percentage of unmapped-transcripts. After trimming low quality nucleotides (<Q20) from 3’-end before re-assembly, Trinity had a 6~7% increase in matched transcripts (data not shown), confirming that sequence errors in
S. pombe data set were at least part of the reason for the higher level of unmapped-transcripts.
Evaluating gene coverage and integrity of assembled transcripts
The gene coverage and transcript integrity are important performance benchmarks for transcriptome assembly. We evaluated gene coverage and transcript integrity with D. melanogaster and S. pombe data sets by matching reconstructed transcripts to CDS and examining the numbers of covered full-length genes. The full-length transcripts reconstructed by different program conditions displayed some similar patterns: the numbers of full-length transcript initially went up with increasing sequence reads; in cases of SOAPdenovo-MK, ABySS, trans-ABySS, Oases-MK and Trinity their numbers leveled off at certain data levels, whereas for SOAPdenovo and Oases their numbers started to drop (Figure ). The turning points appeared to be related to the complexity of the genome. The turning point was around 3G for fruit fly, and between 1-3G for fission yeast.
For D. melanogaster, there is totally 55.46Mb of unique transcripts from RefSeq or 53.80Mb from Ensemble gene sets. Assuming 80% of the genes expressed, the 3Gb-sequence reads, where the turning point was observed, amounts to ~75× average coverage on total expressed genes. For S. pombe, the turning point equals to approximately 100× average coverage. These numbers are important reference in design of future de novo transcriptome study, in which some estimate and careful testing are recommended to find the optimized parameters for a given organism. Full-length, partial-length, and fused CDS were illustrated for transcripts reconstructed from D. melanogaster (Figure ) and S. pombe (Figure ) data sets. At the curve-turning point or the full-data point, MK methods appeared to build more full-length CDS comparing to SK with same assemblers, whereas partial-length CDS remained almost unchanged. On the other hand, there was an increase in the numbers of fused CDS being associated with the MK methods.
It’s worth noting that the number of fused genes was low for S. pombe transcripts reconstructed by Trinity, which took use of strand-specific information for assembly (Figure ). This was not observed with D. melanogaster transcripts, where no strand-specific information was available. In addition, Trinity had a “--jaccard_clip” option that was recommended for gene dense genome with lots of transcripts overlapping on the same strand. For S. pombe transcripts, the option significantly reduced the number of fused genes (Figure , personal communication with Brian J. Haas).
In comparison of different program conditions, Oases-MK appeared to cover the most in number of genes as well as the most in number of full-length genes. While comparable in total number of assembled transcripts, SOAPdenovo-MK and trans-ABySS were lagging in the number of reconstructed full-length genes (Figure ). For SK methods, Oases’s performance was satisfactory at small data set, but lagged behind with increased inputs. Again, SOAPdenovo was the worst performer for this measurement, especially with large inputs data at high coverage depth.
Evaluating sensitivity of assemblers to genes expressed at different levels
The sensitivity of program condition to gene expression level was examined by counting the full-length transcripts of various expression levels. As shown in Figure , using varying k-mer values Oases captured transcripts in a different range of expression quintiles. The small k-mer value, i.e. k=19, worked better for transcripts at low quintiles, whereas a large k-mer value, i.e. k=49 only worked in a high quintile range. On the other hand, the MK methods took advantage of these properties from different k-mer values, and can cover transcripts in a broad expression range (Figure ).
Comparing the different program conditions, our data showed that all had a poor performance at 10%~30% lowest quintiles (Figure ). Surprisingly, Trinity reconstructed a steady number of CDS at above 30% quintiles. The others, SOAPdenovo, Oases, and ABySS when using SK strategy did not perform well for either the lowly or the highly expressed genes. However, when employing MK strategy, the performance of SOAPdenovo, Oases, and ABySS was greatly improved, especially on the high quintile levels (Figure ). We observed that highly expressed genes were often assembled into incomplete transcripts. As shown in Figure , NM_079795 represents one of the highly expressed genes in D. melanogaster. While Trinity correctly reconstructed the entire transcript of NM_079795, various short forms were generated by other program conditions.
De novo assembly of C. sinensis transcriptome by different assemblers
The tea plant,
Camellia sinensis, is one of the most important economic cultivar that is used to produce a good variety of tea products. It has an estimated genome size of about 4.0Gb [
22]. With its large genome size and no genome draft being available, the transcriptome analysis provided a good option to study the gene composition, genetic polymorphism, and metabolic basis of this important economic plant. However, there were some great challenges researchers faced. They included unknown number of genes in
C. sinensis, potentially very large genetic diversity of the studied population, and unclear evolution history,
etc.
We performed
de novo assembly analysis to the published RNA-Seq data set from
C. sinensis [
3], which consisted of 15.46 million pairs of 75bp Illumina sequence reads. To calibrate the system and make our results comparable to the original published work (used SOAPdenovo), we first tested different
k-mer values with SOAPdenovo, and found
k =25 produced similar results with N50 and mean transcript length comparable to the recently published results (Additional file
3: columns “Published data” and “SOAPdenovo”). Then we performed
de novo assembly using different program conditions on the
C. sinensis RNA-Seq data (basic statistics are shown in Additional file
3). Overall, the MK methods (SOAPdenovo-MK, trans-ABySS and Oases-MK) produced much larger numbers of transcripts (≥100bp) with more total bases than the original published assembly data and SOAPdenovo results we obtained. SOAPdenovo-MK, trans-ABySS and Oases-MK also produced superior results in mean length, N50 and numbers of long-transcripts (≥500bp and ≥1kb) than the original published results. Within SK methods, Trinity generated significantly better results than the original published assembly data and SOAPdenovo results in almost all categories except mean length and N50. The better assemblies by MK methods and Trinity were translated into larger numbers of coding proteins. We observed significant increases in BLASTX hits to Uniprot database [
21] and in the numbers of unique Uniprot proteins identified (Additional file
3). These additional genes would certainly help reveal the complete metabolic pathways in
C. sinensis and identify the missing genes in natural molecule synthesis important to tea flavor and quality. One good example is Cinnamate 4-hydroxylase (C4H, EC1.14.13.11), which is an important enzyme that converts cinnamate to p-coumarate in flavonoid biosynthesis pathway. In the original paper [
3], it was indicated that there was no cinnamate 4-hydroxylase in
C. sinensis. However, in our assembly results from either Oases-MK or Trinity, while performing BLASTX against the KEGG database [
23], we were able to identify multiple C4H gene transcripts (Additional file
4 and
5) that filled into the gap in flavonoid biosynthesis pathway.