For the synthetic reads, we used datasets from four organisms: C.elegans (Caenorhabditis elegans), E.coli (Escherichia coli strain K-12), L.major (Leishmania major strain Friedlin), and S.cerevisiae (Saccharomyces cerevisiae S288c). The reference genome of each organism was downloaded from NCBI Genome Sequence (Table 1 and Table S2). MetaSim
was used to generate synthetic reads for each reference genome. MetaSim provides options to choose a read length, an average sequence coverage value, and an empirical error model. The sequence coverage is the average number of times a nucleotide in the original sequence (the genome of an organism in our study) appears in the reads. We set the read length to either 36 or 75 base pairs (bp), the sequence coverage to 10, 20, 40, 80, or 160, and the empirical error model to either error free (Exact) or an error model for the short reads of the Illumina technology (Illumina). We used the error model included in MetaSim for the error probabilities of 36 bp reads and the one from Plantagora
for the error probabilities of 75 bp reads. For example, the dataset ‘E.coli-Illumina-75 bp-80x’ consists of 75 bp reads generated from the E.coli reference genome at a sequence coverage of 80 using the Illumina error probability model. All simulation parameters of MetaSim are listed in Table S3. ccTSA relies on separate scaffolding tools to orient and align the contigs into super-contigs or scaffolds. To compare the performance and quality of the assemblers fairly, we configured each assembler to treat the synthetic sequences as single-end reads and excluded scaffolding and gap closure from the comparison, even though MetaSim generated paired-end data.
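To make the coverage definition concrete, the following minimal sketch estimates how many reads are needed for a given read length and sequence coverage, using the relation coverage ≈ (number of reads × read length) / genome length. The function name and the genome length are illustrative assumptions, not MetaSim parameters or values from this study.

```python
import math


def reads_for_coverage(genome_length: int, read_length: int, coverage: float) -> int:
    """Return the number of reads whose total bases reach coverage x genome_length."""
    if genome_length <= 0 or read_length <= 0 or coverage <= 0:
        raise ValueError("all arguments must be positive")
    return math.ceil(coverage * genome_length / read_length)


# Assumption: a round, illustrative genome length of a few million bases.
# It is NOT the exact length of any reference genome used in the study.
ILLUSTRATIVE_GENOME_LENGTH = 4_600_000

for read_length in (36, 75):
    for coverage in (10, 20, 40, 80, 160):
        n_reads = reads_for_coverage(ILLUSTRATIVE_GENOME_LENGTH, read_length, coverage)
        print(f"{read_length} bp reads at {coverage}x: about {n_reads} reads")
```

At a fixed coverage, the shorter 36 bp reads require roughly twice as many reads as the 75 bp reads, which is why read count alone is not a good measure of dataset size.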
Reference genome datasets downloaded from NCBI Genome Sequence.
For the real reads, we used the paired-end whole-genome shotgun data of the following organisms: S.aureus (Staphylococcus aureus) and R.sphaeroides (Rhodobacter sphaeroides). We downloaded the datasets from the GAGE web site at http://gage.cbcb.umd.edu. The data originated from NCBI Genome Sequence and were preprocessed using the Quake and ALLPATHS-LG error correctors. For these reads, we set all the assemblers to perform scaffolding and gap closure so that we could compare the quality of the final assembly results. Because ccTSA did not exploit paired-end reads, we used SSPACE to scaffold its contigs. We ran ccTSA and SSPACE on both sets of preprocessed reads and reported the better assembly results. For the other assemblers compared in this paper, we used their own internal scaffolding features. We reported the NG50 values, the numbers, and the error-corrected sizes of contigs and scaffolds using the analysis tools available from the GAGE web site.
The parallel versions of Velvet 1.2.01, SOAPdenovo 1.05, and ABySS 1.2.7 were used for assembly. We compared the generated contigs (contiguous DNA sequences reconstructed from the assemblers) with the reference genomes using megablast in NCBI BLAST+ 2.2.25. The parameters and configuration files used for BLAST+, Velvet, ABySS, SOAPdenovo, and ccTSA are listed in Table S4. We measured the assembler performance on a system with 4 octo-core Intel Xeon 4820 processors (total 32 computing cores) and 512 GB of main memory that ran RHEL 6, gcc 4.4.4, and Open MPI 1.4.3. We used 16 hardware threads for executing the assemblers by default, and scaled the assemblers to utilize up to 32 cores. Unless mentioned otherwise, ccTSA pruned the k-mers with coverage value 1 from the k-mer coverage table before building a de Bruijn graph. We used SSPACE 1.1 for scaffolding contigs generated from ccTSA.
We compared the execution time, the maximum memory usage, and the quality of the generated contigs of ccTSA with those of the other assemblers. For the experiments using the synthetic reads, we used the following quality metrics: the largest contig length (Max), N20, N50, NG50, N80, and the fraction of the genome covered by the assembled contigs, called the covered genome ratio (CGR). The assembled contigs were aligned to the reference genome with NCBI BLAST+ 2.2.25 using the megablast algorithm. Among the generated contigs, we discarded the sequences that were either less than 98% identical to the reference or too short (shorter than 100 bases for 36 bp reads and 200 bases for 75 bp reads). We counted the bases in the genome that were mapped to the remaining contigs to compute the covered genome ratio. The NG50 value is the largest contig length L such that the contigs of length L or longer together cover at least half of the reference genome length.
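The NG50 and CGR definitions above can be sketched in a few lines. This is only an illustration under stated assumptions, not the evaluation scripts used in the study: the function names are hypothetical, and each alignment is assumed to be a tuple of reference start, reference end (1-based, inclusive), percent identity, and contig length, which would have to be extracted from the megablast output beforehand.

```python
from typing import Iterable, Tuple


def length_at_half(contig_lengths: Iterable[int], target_total: int) -> int:
    """Walk contigs from longest to shortest and return the length of the contig
    at which the running sum first reaches half of target_total (0 if never)."""
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= target_total:
            return length
    return 0


def n50(contig_lengths: Iterable[int]) -> int:
    """N50: the target is the total assembly size."""
    lengths = list(contig_lengths)
    return length_at_half(lengths, sum(lengths))


def ng50(contig_lengths: Iterable[int], genome_length: int) -> int:
    """NG50: the target is the reference genome length."""
    return length_at_half(contig_lengths, genome_length)


def covered_genome_ratio(
    alignments: Iterable[Tuple[int, int, float, int]],
    genome_length: int,
    min_identity: float = 98.0,
    min_contig_length: int = 200,
) -> float:
    """Fraction of reference bases covered by contigs that pass both filters."""
    kept = sorted(
        (min(start, end), max(start, end))
        for start, end, identity, contig_length in alignments
        if identity >= min_identity and contig_length >= min_contig_length
    )
    covered = 0
    cur_start = cur_end = None
    for start, end in kept:
        if cur_end is None or start > cur_end + 1:
            # Close the previous merged interval and open a new one.
            if cur_end is not None:
                covered += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent interval: extend the current one.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start + 1
    return covered / genome_length


lengths = [50, 40, 30, 20, 10]
print(n50(lengths), ng50(lengths, genome_length=200))
```

Merging overlapping alignment intervals before counting ensures that a reference base covered by several contigs is counted only once.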
We first examined the NG50 values ccTSA produced for datasets from the 4 organisms when we varied the read length, the error model, and the sequence coverage of the synthetic reads. For E.coli 36 bp synthetic reads without base-call errors (E.coli-Exact-36 bp), we measured the NG50 on various k-mer lengths. At a given sequence coverage, the NG50 values first increased and then decreased as the k-mer length increased. As the sequence coverage increased, the NG50 values increased but saturated starting from 80x. Also, the k-mer length giving the best NG50 value increased as the sequence coverage increased. When we introduced errors to the reads using the Illumina error model, the trends of the NG50 over the k-mer length and the sequence coverage were similar, but the NG50 values were smaller than the ones without errors. When we increased the read length from 36 bp to 75 bp, the trends were unchanged, but the NG50 increased because fewer regions of a genome were aliased, that is, fewer reads mapped to multiple regions. On the other organisms, the trends of the NG50 were unchanged. However, the NG50 at a given sequence coverage decreased as the length of the genome increased.
The NG50 of ccTSA on datasets from 4 organisms with different sequence coverage and k-mer values.
We also compared the NG50 values from ccTSA and the other assemblers on E.coli 75 bp reads using the Illumina error model. The other assemblers showed similar trends in NG50 when the k-mer lengths and sequence coverage values were varied. The NG50 values of Velvet were higher than those of the other assemblers at low sequence coverage values, but became similar when the coverage exceeded 40x. The NG50 values on the other organisms showed similar trends and are not included in this paper. Because the improvement in NG50 was marginal beyond a sequence coverage of 80, we used 80x reads hereafter.
The NG50 of 4 assemblers on datasets from E.coli with different sequence coverage and k-mer values.
We compared the NG50 values of the four assemblers on the E.coli datasets. All the assemblers generated similar NG50 values at a given k-mer length. No single assembler produced the highest NG50 values over the entire range of k-mer values, but the NG50 values of Velvet and ccTSA were higher than the others at many points. For the 75 bp reads with the Illumina error model, the k-mer values that provided the highest NG50 were similar: 53 for Velvet, SOAPdenovo, and ccTSA, and 55 for ABySS. The results on the other organisms showed the same trends. Among them, we present the NG50 values on L.major 80x reads with the Illumina error model.
The NG50 of 4 assemblers on E.coli and L.major 80x with various k-mer values.
NG50 is not the only quality metric of the assembly results. We also report other metrics of ccTSA on 75 bp reads with the Illumina error model: N20, N50, N80, the largest contig length, and the covered genome ratio (CGR). At a given k-mer length, the aggregate contig length was the largest, followed by the longest contig length, N80, N50, NG50, and N20 in most cases, as expected. The CGR of the generated contigs was higher than 95% at most k-mer lengths, which shows the usefulness of ccTSA as a sequence assembler. The CGR values ccTSA produced were also similar to those from the other assemblers, as shown in Figure S1.
The quality values of ccTSA on 75 bp, Illumina, 80x datasets from 4 organisms with various k-mer values.
The above results showed that the assembly quality of ccTSA, measured by the NG50 and the CGR, was on par with or surpassed that of the other sequence assemblers. We then compared the performance of the assemblers, where ccTSA provided large advantages over the others in sequencing speed. We measured the execution time of ccTSA, Velvet, SOAPdenovo, and ABySS while increasing the number of utilized hardware threads from 1 to 32. On each dataset, we used the k-mer length that gave the highest NG50 value, which also depended on the assembler. Utilizing multiple threads improved the sequencing speed of all the assemblers, and the speed scaled better on larger datasets, but the sequencing speed of ccTSA was substantially better than that of the other assemblers. When 16 hardware threads were used, ccTSA was 23.1, 5.6, and 13.3 times faster than Velvet, SOAPdenovo, and ABySS, respectively, on E.coli-Exact-36 bp reads; 13.0, 4.6, and 17.9 times faster on E.coli-Illumina-75 bp reads; and 9.7, 5.3, and 16.6 times faster on L.major-Illumina-75 bp reads. The sequencing speed of ccTSA also scaled better than that of the others. When the number of threads was increased from 1 to 16, the sequencing speed of ccTSA improved 9.0 times, while that of Velvet, SOAPdenovo, and ABySS improved 2.8, 5.3, and 3.3 times, respectively, on L.major-Illumina-75 bp reads. We also summarized the contig length, quality, sequencing speed, and memory usage of the assemblers. Even though ccTSA was substantially faster than the others, it used more main memory than all of them except SOAPdenovo on many datasets. Because a genome can have billions of base pairs, it is important to lower the memory usage.
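One way to read the scaling numbers above is as parallel efficiency, the speedup divided by the number of threads. The short sketch below applies this to the speedups reported for L.major-Illumina-75 bp reads at 16 threads; the function and variable names are illustrative, and parallel efficiency is a derived quantity that was not reported in the study.

```python
def speedup(time_one_thread: float, time_n_threads: float) -> float:
    """Speedup is the single-thread time divided by the multi-thread time."""
    return time_one_thread / time_n_threads


def parallel_efficiency(speedup_value: float, threads: int) -> float:
    """Fraction of ideal linear scaling that was achieved."""
    return speedup_value / threads


# Speedups from 1 to 16 threads, as stated in the text above.
reported_speedups = {"ccTSA": 9.0, "Velvet": 2.8, "SOAPdenovo": 5.3, "ABySS": 3.3}

for assembler, value in reported_speedups.items():
    efficiency = parallel_efficiency(value, threads=16)
    print(f"{assembler}: speedup {value:.1f}x, efficiency {efficiency:.1%}")
```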
The execution time of 4 assemblers on E.coli and L.major 80x with thread numbers varied.
The contig lengths, quality, sequencing speed, and memory usage of the sequence assemblers.
We implemented a feature in ccTSA that trades the quality of the generated contigs for lower memory usage during execution. This feature is based on the observation that, in the histogram of the coverage values in a k-mer coverage table, a large portion of k-mers have low coverage values, mostly because of base-call errors. If we prune these low-coverage k-mers periodically while building the table, instead of pruning them after all reads are processed, we can considerably lower the memory usage at the cost of slightly worse assembly quality, because there is a small possibility that the pruned k-mers are not from errors. If we increase the pruning frequency, low-coverage k-mers are pruned more often, so ccTSA uses less memory, but the quality is lowered as well. Conversely, lowering the pruning frequency leads to more memory usage but better contig quality. On C.elegans-Illumina-75 bp reads, pruning the k-mers with coverage value 1 after processing every 50 M reads lowered the memory usage and execution time by 47.3% and 9.5%, respectively, at the cost of a 5.6% degradation in NG50 compared to the default option, which pruned the k-mers with coverage value 1 after finishing the coverage table construction. Changing the pruning frequency to every 20 M reads further lowered the memory usage and execution time by 43.4% and 6.7%, respectively, at the cost of an additional 5.1% degradation in NG50.
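The trade-off described above can be illustrated with a toy k-mer coverage table. This is a minimal single-threaded sketch and not the ccTSA implementation: the function names and the `prune_every` parameter are hypothetical, and reverse-complement handling is omitted.

```python
from collections import Counter
from typing import Iterable, Optional


def prune(table: Counter, min_coverage: int) -> None:
    """Delete every k-mer whose coverage is below min_coverage."""
    for kmer in [km for km, count in table.items() if count < min_coverage]:
        del table[kmer]


def build_coverage_table(
    reads: Iterable[str],
    k: int,
    prune_every: Optional[int] = None,
    min_coverage: int = 2,
) -> Counter:
    """Count k-mers over all reads.

    With prune_every=None, low-coverage k-mers are pruned once at the end.
    With prune_every=N, they are also pruned after every N reads, which keeps
    the table smaller but can drop genuine k-mers seen only once so far.
    """
    table: Counter = Counter()
    processed = 0
    for read in reads:
        for i in range(len(read) - k + 1):
            table[read[i:i + k]] += 1
        processed += 1
        if prune_every and processed % prune_every == 0:
            prune(table, min_coverage)
    prune(table, min_coverage)
    return table


reads = ["ACGT", "TTTT", "ACGT"]
print(dict(build_coverage_table(reads, k=4)))                 # prune at the end only
print(dict(build_coverage_table(reads, k=4, prune_every=1)))  # prune after every read
```

In this toy input, the k-mer ACGT occurs twice and survives the default pruning, but it is lost when pruning runs after every read because each occurrence is discarded before the next one arrives. This is the quality cost that a lower pruning frequency reduces.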
The relationship between the NG50, execution time, and memory usage of ccTSA.
We compared the assembly quality of ccTSA and the other assemblers on S.aureus and R.sphaeroides. ABySS, SOAPdenovo, and Velvet could exploit paired-end reads and generate scaffolds. We used SSPACE, a separate scaffolding tool, to take the output contigs from ccTSA and generate scaffolds. We configured ccTSA not to prune k-mers. We used the following quality metrics, which were used in the GAGE evaluation study: the number, NG50, and corrected NG50 of the contigs and scaffolds from the assemblers, as well as the number of errors. For contigs, the errors were the misjoins and the indel errors of 5 base pairs or more; for scaffolds, the errors were the misjoins. We broke contigs and scaffolds at each error and computed the corrected NG50 values from the broken pieces. For ABySS, SOAPdenovo, and Velvet, we listed the values reported in the GAGE evaluation paper. When we set the k-mer length to 31, which was the value used in the GAGE paper, the quality values of ccTSA were better than those of ABySS and comparable to those of SOAPdenovo and Velvet. By changing the k-mer length, we could find configurations that had better quality values. For example, when we set the k-mer length to 45 base pairs, the NG50 value of the S.aureus scaffolds was 1.56 million base pairs, which was much longer than those of the other assemblers.
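The idea behind the corrected NG50 can be sketched as follows: each contig is split at its error positions, and the NG50 is recomputed over the resulting pieces. This is a simplified illustration, not the GAGE analysis scripts; the function names are hypothetical, and each error is assumed to be given as a single 0-based break position inside the contig.

```python
from typing import Iterable, List, Tuple


def break_at_errors(contig_length: int, error_positions: Iterable[int]) -> List[int]:
    """Split a contig at each error position and return the piece lengths.

    A position p means the contig is cut between base p - 1 and base p.
    Positions outside the open interval (0, contig_length) are ignored.
    """
    cuts = sorted({p for p in error_positions if 0 < p < contig_length})
    pieces = []
    previous = 0
    for cut in cuts:
        pieces.append(cut - previous)
        previous = cut
    pieces.append(contig_length - previous)
    return pieces


def ng50(lengths: Iterable[int], genome_length: int) -> int:
    """Length at which the longest-first running sum reaches half the genome."""
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= genome_length:
            return length
    return 0


def corrected_ng50(contigs: Iterable[Tuple[int, List[int]]], genome_length: int) -> int:
    """NG50 computed over the pieces left after breaking contigs at errors."""
    pieces: List[int] = []
    for length, errors in contigs:
        pieces.extend(break_at_errors(length, errors))
    return ng50(pieces, genome_length)


# Toy assembly: (contig length, error positions); the first contig has one misjoin.
toy_contigs = [(100, [60]), (50, []), (30, [])]
print(ng50([length for length, _ in toy_contigs], 200))  # uncorrected
print(corrected_ng50(toy_contigs, 200))                  # corrected
```

In the toy example a single misjoin in the longest contig halves the NG50, which shows why a long but erroneous assembly does not score well on the corrected metric.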
The quality values of the sequence assemblers on paired-end data sets.