At present, three main strategies are applied in short-read assembly. Greedy extension is a string-based method, whereas the De Bruijn graph and overlap-layout-consensus (OLC) approaches are both graph-based. Each assembly tool is suited to datasets from a specific sequencing platform.
For any short-read assembly procedure, low computational time and memory cost are desirable. The computational time of the assembly process is determined by both the complexity of the dataset and the assembly strategy. The running times and maximum memory occupancies of the different assemblers applied to the different datasets are illustrated in and . For string-based assemblers, the time and memory costs are approximately proportional to dataset size, although they are also affected by the complexity of the dataset. Among them, SSAKE runs in considerably less time than its peer assemblers, but its RAM usage increases dramatically as the dataset grows. QSRA is built upon the original VCAKE algorithm and does reduce the computational time, at the cost of higher RAM occupancy. SHARCGS runs at a speed comparable to that of QSRA; however, it is highly memory-intensive and was unable to handle the E. coli short-read dataset with the computing resources used in this study. Edena is a typical graph-based assembly tool with two running modes, strict and nonstrict. In strict mode, fewer but more accurate contigs are generated, whereas nonstrict mode does the opposite. Compared with the string-based tools, Edena is superior in terms of time and RAM utilization. Velvet and SOAPdenovo typify another graph-based method. Like Edena, they carry out assembly tasks with fairly little computational time and memory usage.
Computational running time and maximum memory occupancy of 36-mer short reads assembly procedures.
Computational running time and maximum memory occupancy of 75-mer short reads assembly procedures.
In particular, SOAPdenovo runs extremely fast thanks to its use of thread parallelization, but it may not perform well on small datasets because of the initial task allocation. Finally, Taipan was proposed as a hybrid of the string-based and graph-based approaches, and its dominant feature is an exceedingly short runtime. Nevertheless, the minimum RAM required to run the assembler is high, although the memory requirement grows only slowly as the dataset size increases. Our results also show that, with the same assembler, paired-end (PE) read assembly demands more running time and RAM than single-end (SE) read assembly (unpublished data). In contrast to 36-mer short-read assembly, only the OLC, De Bruijn and hybrid assemblers can be applied to 75-mer short reads. Our study indicates no significant difference in computational time or RAM occupancy between the assemblies of these two types of short fragments at the same sequencing coverage.
Assembly accuracy and integrity are another consideration in evaluating short-read assemblers. Contigs with high fidelity and high genome coverage are clearly desirable. The assemblers differ in performance. Their percentages of correctly mapped contigs and their genome coverage for the different datasets are shown in and . The latest version of SSAKE is robust to sequencing errors, in contrast to its first version, which was introduced to handle error-free short reads. Other string-based assemblers, such as VCAKE and SHARCGS, performed on a par with the latest version of SSAKE, whereas QSRA generated less precise contigs with lower coverage than those tools. Notably, Edena, an assembler based on the overlap-layout-consensus (OLC) algorithm, performed very well across the various datasets. However, the contigs produced by the two De Bruijn graph-based assemblers, especially SOAPdenovo, were of lower accuracy, though with genome coverage comparable to that of the string-based software. Nevertheless, when handling very large datasets, such as the short reads from the C. elegans genome, SOAPdenovo performed similarly to Edena. This result can be explained as follows: in the De Bruijn graph-based method, a certain proportion of base errors is incorporated into the contigs when the graph is constructed from the k-mers generated from the input short reads, and this process then generates less precise contigs. Finally, the hybrid assembler Taipan was able to generate sequences with accuracy and genome coverage as high as those of the string-based assemblers for small datasets, but it performed poorly in the assembly of the large-genome dataset. After inspecting this assembly procedure, we suppose that the use of only a fraction of the short reads led to the low coverage of the resulting contigs. We also verified that PE reads are superior to SE reads in resolving repetitive elements, which is consistent with a previous study. In addition, our results show that more accurate contigs with higher genome coverage can be produced from longer-read datasets, although this may appear paradoxical for a large genome such as C. elegans, for which none of the assemblers selected in this study is suitable for assembling the 75-mer reads.
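The explanation above, that base errors enter the graph through the k-mers, can be illustrated with a toy sketch. It is not the implementation used by Velvet or SOAPdenovo; the reads, the value of k, and the function names `kmers` and `de_bruijn_edges` are all hypothetical and chosen only for illustration. The sketch shows that a single substituted base in the interior of one read introduces k erroneous k-mers, and hence k spurious edges, into the graph.

```python
from collections import defaultdict


def kmers(read, k):
    """Return every length-k substring of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def de_bruijn_edges(reads, k):
    """Count edges ((k-1)-mer prefix -> (k-1)-mer suffix) over all k-mers."""
    edges = defaultdict(int)
    for read in reads:
        for kmer in kmers(read, k):
            edges[(kmer[:-1], kmer[1:])] += 1
    return dict(edges)


# Hypothetical toy reads: bad_read is good_read with one substitution
# (T -> A at 0-based position 4).
good_read = "ACGTTGCATG"
bad_read = "ACGTAGCATG"
k = 4

clean = de_bruijn_edges([good_read, good_read], k)
noisy = de_bruijn_edges([good_read, bad_read], k)

# Edges that exist only because of the sequencing error.
spurious = set(noisy) - set(clean)
print(len(spurious), "spurious edges for k =", k)
```

Real assemblers take further steps to prune such error-induced paths; the sketch only shows where they come from.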
Accuracy and integrity for 36-mer datasets assembly.
Accuracy and integrity for 75-mer datasets assembly.
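As a rough illustration of the genome coverage measure discussed above, the following sketch computes the fraction of reference positions covered by at least one mapped contig. The exact definition used in this study is not stated here, so this is an assumption: contigs are taken to be already mapped to the reference and represented as half-open `(start, end)` intervals. The function name `genome_coverage` and the example coordinates are hypothetical.

```python
def genome_coverage(intervals, genome_length):
    """Fraction of reference positions covered by at least one interval.

    Each interval is a half-open (start, end) pair in reference coordinates.
    Overlapping intervals are counted only once.
    """
    covered = 0
    current_end = 0  # rightmost reference position counted so far
    for start, end in sorted(intervals):
        start = max(start, current_end)   # skip bases already counted
        end = min(end, genome_length)     # clip to the reference length
        if end > start:
            covered += end - start
            current_end = end
    return covered / genome_length


# Hypothetical mapped contigs on a 100 bp reference.
mapped = [(0, 40), (30, 60), (80, 100)]
print(genome_coverage(mapped, 100))
```

Sorting the intervals first means that overlapping contigs contribute each reference base only once.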
For further analysis of the assembled contigs, the contig size distribution was calculated and is shown in and . Many biological studies require DNA sequences of sufficient length. Under ideal conditions, each assembly procedure would generate a single contig that perfectly matches the whole genome sequence. In practice, the contigs generated by the different assembly procedures are separated by gaps because of the presence of repetitive fragments. From , , , , , and , it is clear that the different assembly strategies perform differently on diverse datasets. For very small datasets, the string-based assemblers produced fewer but longer contigs than the De Bruijn graph-based tools. However, the situation reversed as the dataset size increased. Edena, the OLC assembler, could assemble short reads into relatively long contigs for the various datasets. Taipan, as a hybrid assembly tool, performed better than Edena on small datasets. When handling short fragments from large genomes such as C. elegans, although fairly long largest contigs were formed, the N50 and N80 sizes were not available because too few contigs were assembled. In general, we can claim that PE reads or longer reads generate better assembly results. In addition, among the De Bruijn assemblers, Velvet produced better assembly results than SOAPdenovo on the 75-mer short-read datasets, because of the wider range of k values that can be chosen in Velvet.
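For readers unfamiliar with the N50 and N80 statistics mentioned above, the following is a minimal sketch of one common way to compute them. It assumes the statistic is defined relative to the total length of the assembled contigs rather than the genome size, and the exact definition used in this study may differ. The function name `nx` and the contig lengths are hypothetical.

```python
def nx(contig_lengths, percent):
    """Return the Nx contig size for an integer percentage (e.g. 50 or 80).

    Nx is the length L of the contig at which, going from the longest
    contig downwards, the running total first reaches `percent` percent
    of all assembled bases. Returns None for an empty contig list.
    """
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point rounding issues.
        if running * 100 >= percent * total:
            return length
    return None


# Hypothetical contig lengths in base pairs.
lengths = [100, 80, 60, 40, 20]
print("N50 =", nx(lengths, 50))
print("N80 =", nx(lengths, 80))
```

With very few contigs these statistics carry little information, since a single long contig can determine both values.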
Statistics for assembled contigs of 36-mer short reads.
Statistics for assembled contigs of 75-mer short reads.