FLASH requires as input a fastq library of paired-end reads in which at least some of the reads overlap the read generated from the opposite end of the same DNA fragment. It processes each read pair separately and searches for the correct overlap between the paired-end reads. When the correct overlap is found, the two reads are merged, producing an extended read that matches the length of the original DNA fragment from which the paired-end reads were generated.
To find the correct overlap, FLASH considers every possible legal overlap between paired-end reads, where a legal overlap is defined as any ungapped alignment between two reads such that at least
min-olap bases overlap one another. We chose to allow only ungapped alignments because with the Illumina sequencing platforms, which are the primary focus of our experiments, insertions and deletions are very rare. The overall flow of the algorithm is as follows:
- Align the pair of reads so that they overlap completely; e.g. by the full length of the shorter read.
- Repeat while the overlap is longer than min-olap:
- Calculate the overlap length. If an ‘N’ occurs in the overlapping region, it is not counted towards the overlap length.
- Calculate the score for the overlap as the ratio between the number of mismatches and the overlap length, ignoring N's.
- If the score of the overlap is smaller than the score of the best overlap, save it as the new best overlap.
- If the score is equal to the best previous score:
- Calculate the average quality value of all mismatches in the overlap.
- If the average quality value is smaller than the average quality value of mismatches in the best overlap, save the current overlap as the best overlap.
- Slide the reads apart by one base, reducing the overlap by one.
- Compare the score of the best overlap to the mismatch threshold. If the score is bigger than the mismatch threshold, report that no good overlap was found; otherwise, return the best overlap.
FLASH scores every possible legal overlap between paired-end reads. Because a legal overlap is defined here as an ungapped alignment between two reads that overlap by at least min-olap bases, the maximum number of overlaps that are considered for each pair of reads is Omax=R−m, where R is the read length and m is the minimum overlap. The number of overlaps considered might be smaller than Omax if the reads contain a large number of N's.
By default, the
min-olap parameter is set to 10 bp. In our experiment, lower values of
min-olap resulted in many incorrectly extended reads, because shorter overlaps often occur by chance in large WGS datasets, especially when mismatches are allowed. Higher values of
min-olap will reduce bad merges further, but will make the system miss too many true overlaps. The trade-off between sensitivity and precision for different
min-olap values is shown in , described in
Section 3.
To find the best overlap for a pair of reads, we calculate the score for each overlap as the ratio between the number of mismatches in the overlapping region and the length of that region. If an ‘N’ occurs at any position in any read, that position is ignored and not counted towards either mismatches or total overlap length. We select the overlap with the lowest ratio as the best overlap for a given read pair.
Assuming the DNA fragment size selection step was done with relatively tight controls, we expect the length of the overlap between paired-end reads within a library to be normally distributed, with a mean determined by the fragment size F and the read length R as 2R−F. For our two sample datasets from Staphylococcus aureus and human chromosome 14, we were able to compute the true length empirically by mapping read pairs against the genomes. shows the distribution of fragment lengths. The bacterial fragments had a mean length of 170 bp while the human fragments were slightly shorter, ~165 bp. In the case of human data, the distribution is much tighter than in the case of bacterial data.
Considering that the typical Illumina error distribution has more errors towards the 3′-end of the read, we expect a higher ratio of true sequencing errors to occur in shorter overlaps, because both paired-end reads will only overlap in the regions with the highest error rate. This will lead any simple scoring scheme to prefer longer overlaps, which might not always be correct.
To determine whether a long overlap is better than a shorter one, we define max-olap to be the maximum length of the overlap expected in 90% of read pairs for a given read length and fragment size. For all overlaps longer than max-olap, we calculate their score as the ratio between the number of mismatches and max-olap rather than the actual overlap length. The parameter max-olap is set to 70 by default, which we found to work well for 100-bp Illumina reads from 180 bp fragments. Note that for the S.aureus data, 70 was too short because the fragment length was shorter than expected, and because the wide distribution of lengths resulted in many pairs overlapping by >70 bp. The algorithm's strong performance despite this deviation from the input fragment length indicates that the heuristic scoring function is robust.
The result of this scheme is that we sometimes prefer overlaps shorter than the maximum overlap detected, even if the mismatch to overlap ratio is greater for the shorter overlap. However, if an overlap longer than max-olap is substantially better than any shorter overlap, we select the longer overlap.
Because fragment lengths vary considerably, some pairs of reads will fail to overlap or will overlap by too few bases to be detected; e.g. for fragments >190 bp, our method will not detect overlap between two 100 bp reads. To distinguish between low-quality overlapping reads and non-overlapping reads, we created a mismatch ratio threshold, mr, which defines the maximum proportion of mismatches that we allow in any overlapping region. This value, set to 0.25 by default, can be increased to produce more aggressive read merging, or decreased to produce more conservative merging. The impact of different mismatch ratios on the accuracy of read merging is shown in . As the figure shows, with a 1% error rate in reads, over 99% of read pairs are correctly processed when 0.2≤mr≤0.3. We selected the median of this range (0.25) as the default value.
If the best overlap found between two paired-end reads has a mismatch ratio less than or equal to mr, the paired-end reads are merged. For each position in the overlapping region where the reads disagree, the algorithm chooses the base with the higher quality value. FLASH outputs a fastq file containing the extended reads, where each base in the overlapping region is assigned the quality value of the base with the higher quality where the reads agree, and a score of 2 where they disagree. FLASH also outputs two fastq files containing paired-end reads for which no good overlap was found. These unmerged reads can be used in the assembly as a normal paired-end fragment library.