Motivation: Several new de novo assembly tools have been developed recently to assemble short sequencing reads generated by next-generation sequencing platforms. However, the performance of these tools under various conditions has not been fully investigated, and sufficient information is not currently available for informed decisions to be made regarding the tool that would be most likely to produce the best performance under a specific set of conditions.
Results: We studied and compared the performance of commonly used de novo assembly tools specifically designed for next-generation sequencing data, including SSAKE, VCAKE, Euler-sr, Edena, Velvet, ABySS and SOAPdenovo. Tools were compared using several performance criteria, including N50 length, sequence coverage and assembly accuracy. Various properties of read data, including single-end/paired-end, sequence GC content, depth of coverage and base calling error rates, were investigated for their effects on the performance of different assembly tools. We also compared the computation time and memory usage of these seven tools. Based on the results of our comparison, the relative performance of the individual tools is summarized, and tentative guidelines for optimal selection of assembly tools under different conditions are provided.
Supplementary information: Supplementary data are available at Bioinformatics online.
Recently developed next-generation sequencing platforms, such as the Roche 454 GS-FLX System, Illumina Genome Analyzer and HiSeq 2000 system, and ABI SOLiD™ System, have revolutionized the field of biology and medical research (Schuster, 2008). Compared to traditional Sanger sequencing technology (Bentley, 2006; Sanger et al., 1977), these new sequencing platforms generate data much faster and produce much higher sequencing output, while decreasing costs by more than a thousandfold (Shendure and Ji, 2008). The ability to rapidly generate enormous numbers of sequence reads at markedly reduced prices has greatly extended the scope of economically feasible sequencing projects. The prospect of sequencing the entire human genome for a large number of samples has become a reality.
These new sequencing technologies also pose tremendous challenges to traditional de novo assembly tools designed for Sanger sequencing, as they are incapable of handling the millions to billions of short reads (35–400 bp each) generated by next-generation sequencing platforms (Dohm et al., 2007). Therefore, several novel de novo assembly tools have been developed, such as SSAKE (Warren et al., 2007), VCAKE (Jeck et al., 2007), SHARCGS (Dohm et al., 2007), Euler-sr (Chaisson and Pevzner, 2008), Edena (Hernandez et al., 2008), Velvet (Zerbino and Birney, 2008), Celera WGA Assembler (Miller et al., 2008), ABySS (Simpson et al., 2009) and SOAPdenovo (Li et al., 2009).
With the recent introduction of multiple de novo assembly tools, it has become necessary to systematically analyze their relative performance under various conditions so that researchers can select a tool that would produce optimal results according to the read properties and their specific requirements. Zhang et al. (2011) recently compared the performance of several of these tools for assembling sequences of different species. Although they evaluated multiple criteria such as runtime, RAM usage, N50 and assembly accuracy, their results were based on simulated reads using only a single depth of coverage (100×) and a single base call error rate (1.0%). Further investigation is necessary to determine whether, and how, these assembly tools are differentially affected by varying depths of coverage, sequencing errors, read lengths and extent of GC content of the sequence reads. Furthermore, the assembly performance of the newer SOAPdenovo (v1.05) has improved dramatically for long-read assembly. Consequently, sufficient information is not currently available for informed decisions to be made regarding the tool that would be most likely to produce the best results, based on variations in the practical conditions identified above.
Accordingly, in this study, we systematically studied and compared the performance of seven commonly used de novo assembly tools for next-generation sequencing technologies, using a number of metrics including N50 length (a standard measure of assembly connectivity, to be more specifically defined later), sequence coverage, assembly accuracy, computation time and computer memory requirement and usage. To imitate different practical conditions, we selected a number of experimentally derived benchmark sequences with different lengths and extent of GC content, and simulated single-end and paired-end reads with varying depths of coverage, base calling error rates and individual read lengths. Based on the results of our analyses, we have developed guidelines for optimal selection of different assembly tools under different practical conditions. Identifying and recognizing the various limitations of specific tools under different practical conditions may also provide useful guidance and direction for improving current tools and/or designing new high-performance tools.
Seven tools, SSAKE (v3.7), VCAKE (vcakec_2.0), Euler-sr (v1.1.2), Edena (v2.1.1), Velvet (v1.0.18), ABySS (v1.2.6) and SOAPdenovo (v1.05 for 64-bit Linux), were selected for study and comparative analysis. These tools are all publicly available, and most are currently in frequent use for assembling short reads generated by next-generation sequencing platforms, such as the Illumina Genome Analyzer (read length = 35–150 bp) and ABI SOLiD (read length = 35–75 bp). All seven tools are capable of assembling single-end reads, but only SSAKE, Euler-sr, Velvet, ABySS and SOAPdenovo support paired-end read assembly.
Eight experimentally determined sequences (Table 1) were obtained from the NCBI database (http://www.ncbi.nlm.nih.gov/) and used as benchmark sequences to test the performance of the seven assembly tools. These sequences range from ~99 kb to ~100 Mb in length, each with a different GC content.
Simulated single-end and paired-end reads were generated from benchmark sequences with several variable parameters, including depth of coverage, base calling error rate (BCER) and individual read length. Depth of coverage is the average number of reads by which any position of an assembly is independently determined (Taudien et al., 2006). BCER is the estimated probability of error for each base call (Ewing and Green, 1998).
The single-end read simulation method was the same as that used previously (Dohm et al., 2007): each read was generated as a DNA fragment of the preset read length, starting from any position in the benchmark sequence with equal probability. Each base of the read was then randomly and independently changed into another base with probability equal to the BCER. In paired-end read simulation, a fragment of a given size was randomly drawn from the benchmark sequence, and two reads of the preset read length were generated simultaneously from the two ends of this fragment and treated as one pair. Fragment sizes were drawn from the empirical distribution of the experimental read dataset of the E.coli library (GenBank accession no. SRX000429) (Supplementary Fig. S1). Base calling errors were simulated in the same way as for single-end reads.
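The single-end simulation procedure described above can be sketched as follows. This is an illustrative reimplementation, not the authors' code; the function name and interface are our assumptions.

```python
import random

def simulate_single_end_reads(reference, n_reads, read_length, bcer, seed=None):
    """Simulate single-end reads: start positions are drawn uniformly from
    the reference, and each base is independently substituted with another
    base with probability `bcer` (the base calling error rate)."""
    rng = random.Random(seed)
    bases = "ACGT"
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(reference) - read_length + 1)
        read = list(reference[start:start + read_length])
        for i, b in enumerate(read):
            if rng.random() < bcer:
                # Replace the base with one of the three other bases
                read[i] = rng.choice([x for x in bases if x != b])
        reads.append("".join(read))
    return reads
```

Paired-end simulation would differ only in first drawing a fragment size from the empirical distribution and emitting one read from each end of the fragment.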
To study and compare the seven selected de novo assembly tools, sequencing reads were simulated as described above. The total number of reads was determined by the standard coverage relationship: number of reads = (depth of coverage × benchmark sequence length) / read length.
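The standard relationship between depth of coverage, sequence length and read count can be expressed as a small helper; the function name and the even-count rounding for pairs are illustrative assumptions.

```python
def total_reads(coverage_depth, sequence_length, read_length, paired_end=False):
    """Number of reads needed so that, on average,
    coverage = n_reads * read_length / sequence_length."""
    n = coverage_depth * sequence_length / read_length
    if paired_end:
        # Reads are generated two at a time, so force an even count
        return int(round(n / 2)) * 2
    return int(round(n))
```

For example, 30× coverage of a 1 Mb sequence with 50 bp reads requires 600 000 reads.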
Runtime parameters for the seven assembly tools were generally set to the default or recommended values of each method, with a few exceptions: for VCAKE, the runtime parameter c was set to 0.7 to make it consistent with SSAKE. [Each base call in VCAKE depends on a voting result; when the votes are totaled and the base proportion exceeds a threshold, c, that base is added to the output contig (Jeck et al., 2007).] Parameter k for Velvet, ABySS and SOAPdenovo, and parameter m for Edena, must vary with read length to obtain good N50 lengths. Since no clear default settings for these parameters were presented in the manuals of the corresponding tools, we established values for k and m that produced relatively optimal N50 lengths, based on our own preliminary empirical testing of conditions for each tool. Specific values of the parameters k and m are provided in Supplementary Table S1.
Most of the assembly was carried out on a cluster with eight computer nodes, with each node consisting of dual Quad-Core (2.40 GHz) processors and 12 GB RAM. Comparison tests of required computational demand were performed on a server with dual Quad-Core (2.40 GHz) Processors and 32 GB RAM.
The seven selected de novo assembly tools were applied to assemble the simulated sequencing reads into contigs. In paired-end assembly, tools that support paired-end reads performed an additional step of scaffold construction to produce the final output contigs. Contigs longer than 100 bp were used to evaluate the performance of each tool. Each simulation and assembly was conducted five times, and the reported assembly results are the averages over the five runs.
The performance of each tool was measured by a number of metrics, including N50 length, sequence coverage, assembly error rate, computation time and computer memory usage. N50 length is the longest length such that at least 50% of all base pairs are contained in contigs of this length or larger (Lander et al., 2001). N50 length provides a standard measure of assembly connectivity, reflecting the nature of the bulk of the assembly rather than the cutoff that defines the smallest reportable assembly unit (Jaffe et al., 2003). A higher N50 length indicates better performance of the assembly tool. Sequence coverage refers to the percentage of the benchmark sequence covered by output contigs. To calculate assembly error rates, we aligned the output contigs to the benchmark sequence and counted the number of mismatched bases in the alignment results. The assembly error rate was defined as the percentage of mismatched bases among the total aligned contig bases. Sequence coverage and assembly error rates were analyzed with blastz (Schwartz et al., 2003).
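The N50 definition above can be computed directly from the list of contig lengths; a minimal illustration (not the authors' implementation):

```python
def n50(contig_lengths):
    """N50: the largest length L such that contigs of length >= L
    together contain at least half of all assembled bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0
```

For contigs of lengths 100, 200, 300 and 400 bp (1000 bp total), the N50 is 300, since the 400 and 300 bp contigs together contain at least half of all bases.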
To determine whether, and how, the assembly performance of the seven tools was differentially affected by the depth of coverage and extent of GC content in the source sequences, these tools were used to assemble simulated sequence reads (BCER=0.6%) generated from different benchmark sequences (GC content=~36–50%) at different depths of coverage. Assembly performance of the seven tools is illustrated in Figure 1 and Tables 2–5. Figure 1 and Tables 4 and 5 present test results for part of a benchmark sequence as an example, but similar results were obtained for the other benchmark sequences tested (Supplementary Tables S2–9).
With increasing depths of coverage, the performance of these seven tools showed some interesting patterns (Fig. 1) in assembly connectivity measured by N50 length. Although there was an initial increase in N50 lengths with increasing depth of coverage, N50 lengths reached a plateau when the depth of coverage reached a certain threshold. For simplicity, DCAP will be used here to refer to the depth of coverage at which the N50 length plateau was reached.
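One way to operationalize DCAP from a series of (depth, N50) measurements is sketched below. The paper identifies the plateau by inspection; the tolerance threshold here is an assumption for illustration only.

```python
def estimate_dcap(depths, n50s, tolerance=0.05):
    """Return the lowest depth of coverage at which N50 is within
    `tolerance` of the plateau (maximum observed) N50."""
    plateau = max(n50s)
    for depth, n in sorted(zip(depths, n50s)):
        if n >= (1 - tolerance) * plateau:
            return depth
    return None
```

With measurements that rise steeply and then flatten, the function returns the depth at which the curve first comes within 5% of its maximum.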
In single-end assembly, DCAPs for SSAKE and Edena (~50×) were greater than those for VCAKE, Velvet, ABySS and SOAPdenovo (30–40×); the DCAP for Euler-sr varied with read length (~50× when read length was 35 bp and ~20× when read length was 75 bp). In paired-end assembly, DCAPs for most tools were lower than those observed in single-end assembly. The DCAP for SSAKE (~40×) was still greater than those for Velvet, ABySS and SOAPdenovo (20–30×); the DCAP for Euler-sr varied with read length (~40× when read length was 35 bp × 2 and ~20× when read length was 75 bp × 2).
To compare N50 values among the various tools, we chose N50 values at a depth of coverage of 70×, because this exceeded the DCAP for all tools (Tables 2 and 3). General observations for N50 values of these tools under these various conditions are described below. Comparison results varied with read length and GC content. Sequences with a GC content of 36.90 and 38.16% are referred to as ‘Low GC content’, whereas those with a GC content of 44.38 and 49.81% are referred to as ‘High GC content’. Similarly, ‘short read’ and ‘long read’ refer to 35 and 75 bp read lengths, respectively.
In single-end reads assembly, with:
In paired-end reads assembly:
Using benchmark sequences D.mel and T.bru as examples, we compared assembly performance of the seven tools with regard to sequence coverage and assembly error rate (Tables 4 and 5). Generally, long reads resulted in both higher sequence coverage and higher assembly error rates.
In single-end reads assembly:
In paired-end reads assembly:
To determine whether, and how, assembly performance of the seven tools was differentially affected by changes in BCER, these tools were applied to assemble sequencing reads simulated from three benchmark sequences (D.mel, H.inf and T.bru) with variable BCER (0.0–1.0%, with a 0.2% incremental change at every step).
N50 lengths for all seven tools decreased as BCER increased, although the patterns of decline differed among tools.
When selecting a tool for de novo sequence assembly, computational demand by the tool should also be considered. This is particularly important when analyzing large genome sequence data (e.g. human genomes) for large samples. The utility of a tool can be seriously limited if it takes up excessive memory space, consumes too much CPU time and exceeds reasonable execution time. Consequently, we compared the runtime (RT) and resident memory usage (RM) required for the seven tools to assemble large datasets. The test results are presented in Table 6.
In this test, we also analyzed N50 lengths, sequence coverage and assembly error rate. The results were consistent with several conclusions in previous sections (Supplementary Table S16).
This study compared seven publicly available and commonly used de novo assembly tools: SSAKE, VCAKE, Euler-sr, Edena, Velvet, ABySS and SOAPdenovo. These tools are specifically designed to assemble large numbers of short reads generated by next-generation sequencing platforms.
In analyzing these tools, stronger performance is indicated by higher N50 values, higher sequence coverage, lower assembly error rates and lower computational resource consumption (to enable assembly of larger genomes). The performance of the different assembly tools depended, to some extent, on the test conditions. Based on the results of our investigation, we propose the following guidelines for tool selection. Generally, SSAKE, Edena and Euler-sr needed higher depths of coverage (~50×) than Velvet, ABySS and SOAPdenovo (~30×) to reach their highest N50 lengths; SOAPdenovo was the fastest of all tools, and ABySS almost always consumed the least memory. We have also developed tentative guidelines for selecting optimal de novo assembly tools under varying conditions (Table 7). Specific comments regarding the performance of individual tools under different conditions are summarized below.
SSAKE provided good sequence coverage, and also generated good N50 lengths when assembling sequences with low GC content. On the other hand, SSAKE tended to generate more assembly errors, and needed a greater depth of coverage to reach its DCAP, than most of the other tools tested. The runtime and memory usage of SSAKE were also the highest of the tools tested. Our results indicated that assembly of large sequences (e.g. Homo sapiens) was not feasible with SSAKE.
VCAKE produced the shortest N50 lengths in most situations, and its sequence coverage was comparable to that of SSAKE. VCAKE also generated many assembly errors, under certain test conditions even more than SSAKE. The computational resources required to run VCAKE were slightly less than those required for SSAKE.
In assembling single-end short reads, Euler-sr produced the longest N50 values, but it also generated high assembly error rates, comparable to those of SSAKE. In addition, the sequence coverage of Euler-sr was the lowest under most test conditions. Euler-sr consumed intermediate computational resources.
Under most conditions tested, Velvet and ABySS showed similar assembly performance: they generated similar N50 lengths, their DCAPs were relatively low and they required acceptable computational resources. Consequently, it is feasible to use these tools for assembling large sequences, such as those obtained for Homo sapiens. ABySS produced fewer assembly errors, and consumed slightly less memory but more runtime, than Velvet. When assembling paired-end reads, ABySS produced the highest assembly coverage of all tools tested. When assembling larger genomes, however, Velvet sometimes required exceptionally long runtimes and high memory.
Edena needed a high depth of coverage, comparable to SSAKE's, to reach its DCAP. It produced N50 values similar to, or greater than, Velvet's in most single-end assemblies, and generated assembly error rates comparable to Velvet's. The computational demands of Edena were intermediate between those of SSAKE and ABySS.
SOAPdenovo was the fastest assembler. Its DCAP was as low as that of ABySS, and it produced among the highest N50 values in paired-end read assembly and relatively high N50 values in single-end assembly. SOAPdenovo generated higher assembly error rates and lower sequence coverage than ABySS. It also consumed more memory than ABySS when assembling paired-end reads. The appropriate SOAPdenovo executable (SOAPdenovo31mer, SOAPdenovo63mer and SOAPdenovo127mer, supporting kmer ≤31, ≤63 and ≤127, respectively) must be selected based on read length. SOAPdenovo63mer and SOAPdenovo127mer consumed roughly two and four times, respectively, as much RAM as SOAPdenovo31mer.
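The choice among the three SOAPdenovo builds described above can be expressed as a small helper; this sketch is illustrative, with the executable names taken from the description in the text.

```python
def soapdenovo_executable(k):
    """Select the SOAPdenovo build that supports k-mer size `k`.
    Each wider build roughly doubles memory consumption."""
    if k <= 31:
        return "SOAPdenovo31mer"
    if k <= 63:
        return "SOAPdenovo63mer"
    if k <= 127:
        return "SOAPdenovo127mer"
    raise ValueError("k-mer sizes above 127 are not supported")
```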
In light of our results, investigators may choose the most appropriate assembly tool(s) based on their specific experimental setting and available computational resources. Our results may also serve as a reference when designing sequencing projects, for selecting targeted depths of coverage, controlling levels of sequencing error rates, etc. Given the rapid increase in use of next-generation sequencing technologies, our results should be of value both to empiricists during experimental design and to bioinformaticians who seek guidance in selecting appropriate assembly tool(s) for data analyses or who seek to improve the assembly tools.
Funding: Shanghai Leading Academic Discipline Project (S30501 in part); startup fund from Shanghai University of Science and Technology. The investigators of this work were partially supported by grants from NIH (P50AR055081, R01AG026564, R01AR050496, RC2DE020756, R01AR057049 and R03TW008221); Franklin D. Dickson/Missouri Endowment from University of Missouri–Kansas City and the Edward G. Schlieder Endowment from Tulane University.
Conflict of Interest: none declared.