|Home | About | Journals | Submit | Contact Us | Français|
Motivation: Next-generation sequencing technologies generate very large numbers of short reads. Even with very deep genome coverage, short read lengths cause problems in de novo assemblies. The use of paired-end libraries with a fragment size shorter than twice the read length provides an opportunity to generate much longer reads by overlapping and merging read pairs before assembling a genome.
Results: We present FLASH, a fast computational tool to extend the length of short reads by overlapping paired-end reads from fragment libraries that are sufficiently short. We tested the correctness of the tool on one million simulated read pairs, and we then applied it as a pre-processor for genome assemblies of Illumina reads from the bacterium Staphylococcus aureus and human chromosome 14. FLASH correctly extended and merged reads >99% of the time on simulated reads with an error rate of <1%. With adequately set parameters, FLASH correctly merged reads over 90% of the time even when the reads contained up to 5% errors. When FLASH was used to extend reads prior to assembly, the resulting assemblies had substantially greater N50 lengths for both contigs and scaffolds.
Availability and Implementation: The FLASH system is implemented in C and is freely available as open-source code at http://www.cbcb.umd.edu/software/flash.
Thanks to the rapidly dropping cost of DNA sequencing technologies, de novo whole-genome sequencing (WGS) projects are generating very deep coverage of new genomes. However, even with the high coverage that is produced by these technologies, and despite dramatic improvements in genome assembly algorithms (Gnerre et al., 2011; Li et al., 2010), the short read lengths produced by next-generation technologies present a significant barrier to reconstructing a genome from WGS data. Any increase in read length will have a significant, positive impact on the quality of genome assemblies.
Several current protocols generate sequences from both ends of a library of DNA fragments. If the fragments are shorter than twice the read length, the resulting paired-end reads will overlap. For example, a project might use 175 bp libraries with 100 bp reads. This type of library presents an opportunity to extend the reads' length by overlapping and merging paired-end reads before using these reads in the assembly.
In this article, we present FLASH (Fast Length Adjustment of SHort reads), a software tool to find the correct overlap between paired-end reads and extend the reads by stitching them together. We tested the correctness of the tool on one million simulated 100 bp paired-end reads and found it to be highly reliable. We then used real data to measure the effect of this merging procedure on the quality of an assembly. We also compared the speed and accuracy of FLASH to that of SHERA (Rodrigue et al., 2010), a related tool that also extends read length by finding the overlap between paired-end reads.
FLASH requires as input a fastq library of paired-end reads in which at least some of the reads overlap the read generated from the opposite end of the same DNA fragment. It processes each read pair separately and searches for the correct overlap between the paired-end reads. When the correct overlap is found, the two reads are merged, producing an extended read that matches the length of the original DNA fragment from which the paired-end reads were generated.
To find the correct overlap, FLASH considers every possible legal overlap between paired-end reads, where a legal overlap is defined as any ungapped alignment between two reads such that at least min-olap bases overlap one another. We chose to allow only ungapped alignments because with the Illumina sequencing platforms, which are the primary focus of our experiments, insertions and deletions are very rare. The overall flow of the algorithm is as follows:
FLASH scores every possible legal overlap between paired-end reads. Because a legal overlap is defined here as an ungapped alignment between two reads that overlap by at least min-olap bases, the maximum number of overlaps that are considered for each pair of reads is Omax=R−m, where R is the read length and m is the minimum overlap. The number of overlaps considered might be smaller than Omax if the reads contain a large number of N's.
By default, the min-olap parameter is set to 10 bp. In our experiment, lower values of min-olap resulted in many incorrectly extended reads, because shorter overlaps often occur by chance in large WGS datasets, especially when mismatches are allowed. Higher values of min-olap will reduce bad merges further, but will make the system miss too many true overlaps. The trade-off between sensitivity and precision for different min-olap values is shown in Figure 6, described in Section 3.
To find the best overlap for a pair of reads, we calculate the score for each overlap as the ratio between the number of mismatches in the overlapping region and the length of that region. If an ‘N’ occurs at any position in any read, that position is ignored and not counted towards either mismatches or total overlap length. We select the overlap with the lowest ratio as the best overlap for a given read pair.
Assuming the DNA fragment size selection step was done with relatively tight controls, we expect the length of the overlap between paired-end reads within a library to be normally distributed, with a mean determined by the fragment size F and the read length R as 2R−F. For our two sample datasets from Staphylococcus aureus and human chromosome 14, we were able to compute the true length empirically by mapping read pairs against the genomes. Figure 1 shows the distribution of fragment lengths. The bacterial fragments had a mean length of 170 bp while the human fragments were slightly shorter, ~165 bp. In the case of human data, the distribution is much tighter than in the case of bacterial data.
Considering that the typical Illumina error distribution has more errors towards the 3′-end of the read, we expect a higher ratio of true sequencing errors to occur in shorter overlaps, because both paired-end reads will only overlap in the regions with the highest error rate. This will lead any simple scoring scheme to prefer longer overlaps, which might not always be correct.
To determine whether a long overlap is better than a shorter one, we define max-olap to be the maximum length of the overlap expected in 90% of read pairs for a given read length and fragment size. For all overlaps longer than max-olap, we calculate their score as the ratio between the number of mismatches and max-olap rather than the actual overlap length. The parameter max-olap is set to 70 by default, which we found to work well for 100-bp Illumina reads from 180 bp fragments. Note that for the S.aureus data, 70 was too short because the fragment length was shorter than expected, and because the wide distribution of lengths resulted in many pairs overlapping by >70 bp. The algorithm's strong performance despite this deviation from the input fragment length indicates that the heuristic scoring function is robust.
The result of this scheme is that we sometimes prefer overlaps shorter than the maximum overlap detected, even if the mismatch to overlap ratio is greater for the shorter overlap. However, if an overlap longer than max-olap is substantially better than any shorter overlap, we select the longer overlap.
Because fragment lengths vary considerably, some pairs of reads will fail to overlap or will overlap by too few bases to be detected; e.g. for fragments >190 bp, our method will not detect overlap between two 100 bp reads. To distinguish between low-quality overlapping reads and non-overlapping reads, we created a mismatch ratio threshold, mr, which defines the maximum proportion of mismatches that we allow in any overlapping region. This value, set to 0.25 by default, can be increased to produce more aggressive read merging, or decreased to produce more conservative merging. The impact of different mismatch ratios on the accuracy of read merging is shown in Figure 5. As the figure shows, with a 1% error rate in reads, over 99% of read pairs are correctly processed when 0.2≤mr≤0.3. We selected the median of this range (0.25) as the default value.
If the best overlap found between two paired-end reads has a mismatch ratio less than or equal to mr, the paired-end reads are merged. For each position in the overlapping region where the reads disagree, the algorithm chooses the base with the higher quality value. FLASH outputs a fastq file containing the extended reads, where each base in the overlapping region is assigned the quality value of the base with the higher quality where the reads agree, and a score of 2 where they disagree. FLASH also outputs two fastq files containing paired-end reads for which no good overlap was found. These unmerged reads can be used in the assembly as a normal paired-end fragment library.
We tested the correctness of FLASH using both simulated and real data, and compared it for speed and correctness to SHERA (Rodrigue et al., 2010), the only other existing tool that combines paired-end reads from short fragment library. We also measured how much FLASH affected the genome assemblies of S.aureus and human chromosome 14.
We simulated 1 000 000 pairs of 100 bp long reads. The paired-end reads were generated from fragments with a mean length of 180 bp, normally distributed with an SD of 20 bp. Error-free reads were generated using wgsim from the SAMtools package (Li et al., 2009). We inserted errors at a rate of 1, 2, 3 and 5% with an increasing probability of errors towards the end of reads. [The probability of error is defined by y=a(1.05)x, where x is the read position and a and y are based on the error rate and read length.] In the simulated data, 16.4% of fragments had length ≥200, 5.5% had paired-end reads overlapping by <5 bp, and an additional 8% had paired-end reads overlapping by <10 bp. Thus, overall 30% of pairs overlapped by too little to be detected with default settings. A very small fraction of fragments (< 1%) were <100 bp. The distribution of fragment lengths is shown in Figure 2, while the distribution of errors in the 1% error rate sample is shown in Figure 3. The simulated reads used in this data are available at the FLASH website.
When evaluating FLASH, we considered the trade-off between correct and incorrect merges of paired reads (Figs 5 and and6).6). We assume that two 100 bp reads should have been merged if they were generated from a fragment that was 100–190 bp in length, because we required a minimum overlap of 10 bp. Results for FLASH and SHERA for simulated reads with error rates ranging from 0% to 5% are shown in Table 1.
In the table, correct merges refers to the number of read pairs correctly overlapped and merged together (Fig. 4a). A merge is considered correct if a fragment of the correct length is produced, even if the wrong base is selected as the consensus base at one of the overlapping positions. (The only time there is a choice is when the two reads disagree at, a given site, in which case FLASH chooses the base with the higher quality value, breaking ties randomly.)
Correct non-merges are paired-end reads that were not merged together and that were generated from fragments that could not be extended (Fig. 4d); i.e. they were <100 bp or >190 bp. A total of 299 563 pairs in each simulated sample met this criterion.
Incorrect non-merges are pairs that were not merged even though they were derived from fragments between 100 bp and 190 bp in length (Fig. 4b). And finally, incorrect merges are read pairs that were merged together and where either (i) the reads were too far apart and should not have been merged (Fig. 4e) or (ii) the resulting extended read is the wrong length (Fig. 4c).
FLASH processed error-free reads correctly 99.7% of the time, with a very small number of false merges (2644) and just 10 cases where the system failed to merge a pair that it should have. In contrast, SHERA merged only 91.6% of the pairs correctly, with dramatically higher numbers of reads that were either merged incorrectly or failed to merge. For FLASH, the incorrect merging of reads results from two sources. First, if there are two distinct perfect overlaps between a pair of reads, FLASH selects the longer overlap, which on rare occasions results in an incorrect overlap. Second, because FLASH allows 25% of bases in the overlapping region to be mismatches, on rare occasions an apparent overlap comes from fragments >190 bp that do not actually overlap.
With a 1% error rate, which is approximately the error rate of current Illumina sequencers, FLASH correctly handles the same overall percentage of all pairs, with slightly fewer incorrect merges but more incorrect non-merges. SHERA also performed similarly to its results on error-free data, and again had far more incorrectly processed pairs than FLASH.
For reads with 2% error, FLASH still processed >98% of the pairs correctly, but the number of pairs that it failed to merge increased significantly. Increasing the maximum mismatch rate from 0.25 (the default) to 0.3 decreased these failed merges from 14 058 to 4410 (see the column labeled FLASH-0.3 under 2% error.) Thus, if we know that a dataset has a higher error rate, then we can adjust FLASH accordingly to improve its performance.
At even higher error rates of 3 and 5%, FLASH still handled a large majority of pairs correctly, but the numbers dropped dramatically at 5%. Increasing the mismatch ratio to 0.35 for both these trials increased the correct performance of FLASH, and for the 5% error data the results were quite good, improving from 78% to almost 94% of read pairs. Overall, FLASH was superior to SHERA on nearly all these simulated datasets.
Next we evaluate the impact of two key parameters on the accuracy of the read merging procedure. The tests below were performed on simulated reads with a 1% error rate, keeping all parameters constant except for the one being evaluated.
Figure 5 shows how the mismatch ratio parameter affects the number of correctly and incorrectly merged reads. As expected, increasing the allowed mismatch ratio increases the number of correctly merged read pairs, but at the expense of also increasing the number of incorrectly merged pairs. It is also clear from the graph that as the ratio drops <0.2, the number of correctly merged pairs drops very steeply, making the optimal value for these data, with 1% error, to be a mismatch ratio between 0.2 and 0.3.
Figure 6 shows the number of correctly and incorrectly merged read pairs for different values of the minimum overlap length parameter. As we decrease the minimum required overlap length, the number of correctly extended reads steadily increases, as expected. Interestingly, as the overlap drops below 15 and then to 10 bp the number of correctly merged pairs levels off, while the errors increase rapidly.
FLASH currently runs only in single threaded mode, but it is very fast especially for small to moderate datasets. We compared FLASH and SHERA on one million pairs of 100 bp long reads. We measured performance on two computers: a six-core 2.4 GHz AMD Opteron server with 256 GB of RAM, and a dual-core Intel Xeon 3.0 GHz desktop computer with 2 GB of RAM. Run times are summarized in Table 2.
As Table 2 shows, FLASH is far faster than SHERA. The most likely source of the speedup is that FLASH is implemented in C, while SHERA uses Perl, an interpreted language. The overall run time for FLASH is linearly proportional to read length times the number of reads.
FLASH essentially creates longer reads from all the read pairs that it correctly merges. Longer read lengths should have a significant positive impact on whole-genome assembly, and therefore we wanted to test FLASH on real data by using it as a pre-processor for two assemblers: CABOG (Miller et al., 2008), which is a recent version of the Celera Assembler, and SOAPdenovo (Li et al., 2010). For the two genomes used in our tests, a bacterium and the human genome, the true assembly is known, which allowed us to evaluate the correctness of merged read pairs as well as the correctness of the assemblies.
For our first test, we used short-read data from S.aureus, which was sequenced to deep coverage with Illumina sequencing technology by the Broad Institute (MacCallum et al., 2009). The data comprise two paired-end libraries, each of which we randomly sampled to 45X coverage. The first fragment library (SRA accession SRX007714) contains 647 052 pairs with a fragment size of 180 bp and read length of 101 bp. The second library (SRX007711) contains 1 747 035 pairs with a fragment size of 3.5 kb and read length of 37 bp. The reads are available from the NCBI Sequence Read Archive or from http://gage.cbcb.umd.edu/data.
We performed error correction on these reads using Quake (Kelley et al., 2010) with a k-mer size of 18. Quake discards reads that have too many errors, which results in some reads losing their ‘mate’. We used these reads as an unpaired library, along with the two paired-end error corrected libraries to assemble S.aureus using SOAPdenovo. Note that we could not run CABOG on this dataset, because that assembler requires a minimum read length of 64 bp. The results from the initial SOAPdenovo assembly are shown in Table 3 in the Original column.
Next, we used FLASH to merge read pairs from the 180 bp fragment library, which resulted in 369 496 merges. With Illumina paired-end sequencing, the second read in each pair is typically of much lower quality, which probably explains why FLASH could not merge a higher percentage of the reads. Figure 1 shows that the fragment lengths were ~170 bp. We retained the read pairs that were not merged and used them in the assembly as a paired-end library with insert size of 180 bp. As in the first assembly, we ran Quake to correct reads. We then ran SHERA to merge reads from the 180 bp fragment library, and as before, we used Quake to correct reads before assembly.
As Table 3 shows, using FLASH to merge reads produced much larger contigs and scaffolds than both the original assembly and the assembly that used SHERA to merge reads. The improvements over the original assembly were particularly dramatic: the contig N50 size increased > 5-fold, and scaffold N50 size increased 4-fold.
To check the correctness of the read merging algorithm, we used Bowtie (Langmead et al., 2009) to map the merged and non-merged read pairs to the finished S.aureus genome. As shown in Table 4, FLASH correctly processed 90.8 of the reads versus 80.4% for SHERA. The larger number of incorrectly merged pairs is probably the explanation for the poorer genome assembly results (Table 3).
We also compared the accuracy of the assemblies generated by SOAPdenovo with and without merging read pairs. To evaluate accuracy, we used the nucmer program from the MUMmer package (Kurtz et al., 2004) to map all contigs at least 200 bp long to the reference genome. We considered a contig correct if it mapped with at least 98% identity over 98% of its length. For both assemblies, six contigs failed to map correctly. Thus, there was no difference in correctness while the FLASH-assisted assembly contained far larger contigs.
For the second assembly comparison, we used three paired-end libraries from the human sequence NA12878 (SRA024407) sequenced at the Broad Institute (Gnerre et al., 2011). The reads were selected by using Bowtie to map the reads from NA12878 onto the entire human genome, and then selecting only those pairs that mapped to chromosome 14. The mapping yielded 18 252 400 pairs in the 180 bp fragment library, 11 334 704 pairs in the 3000 bp library and 1 202 532 pairs in the 35 kb library. All reads are 101 bp long. The chromosome 14 reads are available from http://gage.cbcb.umd.edu/data/index.html.
As with the assembly of S.aureus, we used Quake with a k-mer size of 18 to correct all reads, and used the corrected reads from three paired-end libraries in the assembly by SOAPdenovo and CABOG.
We then merged reads in the 180 bp library using FLASH, which yielded 17 047 292 extended reads. Note that this was a far higher percentage of the library than were merged in the S.aureus data; this is likely due to the fact that the distribution of fragment lengths in the human data was much tighter (Fig. 1), making more fragment lengths fall within the detectable overlap range of FLASH.
The assembly results are shown in Tables 5 and and6.6. As with S.aureus, the assembly that used merged reads from FLASH outperformed the assembly that used the original reads in every category regardless of the assembler used to assemble the genome. The CABOG assembly had far larger contigs and scaffolds than the SOAPdenovo assembly, and the improvements from FLASH were also much more striking: the contig N50 size more than doubled, and the scaffold N50 size tripled.
We then mapped the merged and non-merged reads to the reference genome to measure how many had been correctly processed. FLASH's results were very similar to those for S.aureus, handling 91% of pairs correctly as before. Failure to merge reads again accounted for a large majority of its mistakes: 1.1 million pairs were left separate although they overlapped sufficiently for merging. Over 16 million pairs were correctly merged while 572 622 pairs were merged incorrectly.
Finally, we compared the accuracy of the assemblies that used FLASH as a preprocessor to those that did not. Note that these are de novo human assemblies, constructed without use of the reference human genome. Comparative assemblies, which typically have far larger contigs than de novo assemblies, might not show as great a difference if FLASH was used.
As above, we considered a contig to be correct if 98% of its length aligned to the genome with at least 98% identity. Table 7 shows the results for assemblies by both SOAPdenovo and CABOG.
As shown in Table 7, using FLASH as a preprocessor for genome assembly not only improved the assembly by producing fewer, larger contigs, but it also reduced the number of erroneous contigs for both SOAPdenovo and CABOG. This also shows that FLASH is not limited to just one assembler, and suggests that it would be a useful adjunct to standard assembly pipelines.
Merging reads from opposite ends of a short fragment can run into problems when the fragment contains short tandem repeats. If the overlapping regions contains nothing else, the two reads might be merged too aggressively, collapsing the repeat region into fewer copies than are actually present in the fragment. While exact tandem repeats might cause problem for FLASH, non-exact tandem repeats should be resolved easily.
The Figure 7 illustrates the problem. Figure 7a (i) shows the original fragment, with unique sequences A and B flanking a region R that contains multiple copies of a short tandem repeat. The left part of the figure illustrates the situation when the tandem repeats are identical, while the right-hand side shows what happens when the repeats contain some variation. As shown in Figure 7a (iii), the best overlap might be too short when the repeats are exact, leading FLASH to collapse the repeat region slightly. This problem only arises when the repeats are exact. If R is shorter than the read length, then some reads will contain R entirely, allowing a genome assembler to resolve it correctly.
The short read lengths generated by next-generation sequencing technologies are the biggest challenge in assembling genomes into large, near-complete sequences. To tackle this problem, we developed FLASH, a highly accurate, very fast tool to merge pairs generated by sequencing both ends of short fragment libraries. FLASH can be used for any paired-end sequence data in which the paired reads overlap, including RNA-seq data. As shown in our experiments, merging paired reads with FLASH can have a significant positive impact on the quality of genome assemblies.
Funding: National Institutes of Health under grants (R01-LM006845 and R01-GM083873) in part.
Conflict of Interest: none declared.