We have created a new sequence alignment tool that is specifically designed to meet the challenges of practical whole human genome resequencing using short read data (25–100 bases). BFAST has been implemented completely for the handling of large datasets from the ABI SOLiD color space reads as well as direct sequence reads possible with other platforms. The estimated scale of the problem–aligning a billion reads per day on a moderate computer cluster–obviously demands extremely fast alignment algorithms, but more importantly demands high accuracy of the alignments (i.e. obtaining the true alignment of a read, among the many various candidates). This is fundamental because the purpose of resequencing is to locate the variants, including insertions and deletions, relative to a reference genome, and to distinguish real genomic variants from inevitable sequencing errors. Thus, it is critical to find the true alignments of reads that do not match perfectly to the reference, so that sequencing errors can be filtered out via multiple coverage, and so that real variants will not go undetected due to systematic under-mapping of reads with the variant. In practice, then, a suitable method must be accurate to meet this challenge and still perform its operations under realistic computer hardware configurations and in reasonable time, and this was the underlying design goal of BFAST.
At a practical level, the BFAST performance tests presented here demonstrate various modes of accuracy of alignment for both simulated data and real-world datasets from both the Illumina and ABI SOLiD sequencing platforms. We demonstrate that BFAST is able to substantially increase the number of reads correctly placed in the genome when containing short insertions and deletions and various numbers of errors while maintaining a lower false alignment rate than any other alignment tools. We are able to map sensitively reads that have up to 10% of the bases in the reads being errors, as well as, reads that contain insertions and deletions up to 10 base pairs with modest error rates. This is critical for identifying indels within the genome of interest, and for optimizing independent alignments of paired end reads to identify structural variants. We demonstrate that BFAST is relatively robust to a wide range of errors. Under key conditions, BFAST is improves the likelihood of correct genomic placement of individual reads over a wide variety of scenarios.
While BFAST is highly sensitive with the use of 10 indexes, as shown here, if the sensitivity were not satisfactory for a given purpose, to the addition of additional genomic indexes would improve the sensitivity with a modest increase in the computational time. This relative error or variant tolerance of BFAST is highly relevant to two scenarios. First, efforts are being extended to sequence new genomes of species for which a closely related species genomic sequence currently exists. The alignment of the generated short read data must be aligned accurately to use the prior genome scaffold in the presence of unknown base differences. Second, the implementation of new and perhaps more error prone sequencing methods that may generate even larger amounts of data than current methods but with a higher error rate. BFAST will be able to more effectively align these poorer quality reads and render new technologies more useful.
The comparison of BFAST to other methods here–MAQ, BWA, Bowtie, SHRiMP, SOAP and BLAT–is not meant to be an exhaustive comparison to all other available programs, nor is it meant to imply that these particular methods cannot provide satisfactory results for large scale resequencing alignment problems. Rather, it is simply to provide a practical context for judging the speed and accuracy of BFAST demonstrate that any method used to align short reads should be assessed for its sensitivity and accuracy in the presence of multiple variants. The examples shown here illustrate that there can be a substantial decreases in accuracy of mapping for certain types of variants, such as small indels, or higher numbers of mismatches, which may be biologically important. Given the complexity of the possible settings, these methods comparison are always incomplete. For instance, we do not evaluate scenarios that would allow the evaluation of certain heuristic choices of the other methods, deferring to the design principle of completeness as motivation for this omission. We note that for many biologically relevant mutation classes that the accuracy of placement by BFAST with 10 genomic indexes is 80–90% thus there is little to be gained by iterative or heuristic processes.
Because BFAST uses multiple gapped indexes, BFAST is able to sensitively identify potential alignment positions. Only reads with under 384 CALs are subsequently evaluated by the powerful gapped local alignment algorithm, which distinguishes reads with identical mapping to multiple locations versus reads with a best matched location. By enforcing an upper limit on the number of CALs, BFAST could potentially ignore reads that map to multiple locations but could be placed within the human genome. Thus, in the interests of higher sensitivity alignments, a key area of potential improvement would be to improve the CAL search step with specialized indexes, such that the correct location is within a more limited set of CALs relative to the current implementation of BFAST.
Any method that underperforms in sensitivity compared to BFAST would need to reduce the false mapping rate to show improvement over BFAST beyond performance. This new method would need to improve the CAL search step, as the gapped local alignment is optimal, such that for some read the new method includes the correct location in its set of CALs when BFAST does not. This correct location would then be identified by the powerful gapped local alignment. This requires the new method to have greater sensitivity, when the opposite was assumed. As BFAST is tunable for any level of sensitivity, we are able to sensitively map reads as well as controlling for false positives.
The choice of alignment tool for each biologist should be tailored to the goals of the sequencing project. While under the specified parameters, the other methods can be somewhat faster for some datasets and slower for others; the fraction of data mapped was typically less than that achieved using BFAST. We note that BFAST does not have default speed or accuracy parameters per se but rather allows the design of index sets with a broad range of speed/accuracy trade offs. The software distribution includes an extensive family of pre-optimized index set masks from which subsets can be selected given any desired accuracy, as specified in the Supplemental Materials S1
(Table S3–S6). Our experience with aligning short read data (25–100 base-pair reads) has revealed that a major problem is that level of accuracy actually being achieved is not known in advance, i.e. what variants will be missed or underrepresented by the alignment process. We stress the need to assess accuracy for short read alignment as part of any such resequencing effort. If the alignment method is not able to sensitively align, then some categories of mutations will be completely missed.
We offer the BFAST program as a new tool in the available for processing massively parallel sequence data. The BFAST program is freely available at http://bfast.sourceforge.net
, and is fully intended for use in production-level whole human genome resequencing efforts, such as are now underway in our own lab. For this reason, we have taken care with the implementation to make it portable, documented, independent of the sequencing platform, and targeted to run efficiently on a cluster. In practice, BFAST can align 1 billion Illumina 55 base-pair reads in 24 hour period on a cluster containing 156 computer computational cores, with suitable amounts of RAM, which makes these alignments practical from a computer hardware point of view. In particular, 20-node cluster with each node consisting of a dual quad core CPU with 24Gb RAM would suffice. We note that the benchmarks reported here ignored multi-threaded computation, and parallel computation, both of which BFAST supports, which in practice proportionately reduces the number of independent nodes required, when multi-CPU nodes and multi-core CPUs are available, as is widely the case. Further, BFAST is implemented in such a fashion that large-scale alignments can be successfully performed in computation clusters where only a fraction (~20%) of the nodes have the higher RAM needed for efficient indexed lookup (16Gb-24Gb), while the majority of nodes need only have 4Gb of RAM for the local alignments phase of the computations.
From the balanced timing profiles (see Supplemental Materials S1
: Figure S12), it is not easy to make the overall process substantially faster because the indexed lookup, local alignment, and result curation/file handling all take comparable time. Thus, even if the indexed lookup were made infinitely fast, the overall running time would be reduced modestly. Therefore, dramatic overall speed improvements seem unlikely without specialized hardware accelerators or tiered algorithmic approaches. However, improvements in indexed lookup speed can be translated directly into accuracy improvement, by searching relative to a large number of indexes without increasing
the overall running time. This will indeed become viable as bigger dynamic memory capacity becomes possible. There is, however, a limit to the strategy of bigger indexes, in that by being resistant to all possible variants by use of a much bigger index, will require longer key sizes in order to limit spurious CALs. For shorter reads, this will begin to reduce the accuracy of any one index, at which point this strategy loses its effectiveness. The optimal parameters will need to be adjusted over time as sequencing technologies produce longer and cleaner reads.
While BFAST was designed to support the resequencing of human genomes with short reads, it is in fact a completely general sequence alignment tool that should perform well in aligning any number of reads, of any length, to any target genome. Because it supports a flexible and fast indexed lookup methodology, and permits this as a user configurable feature, it should be possible to configure BFAST to work well for many large-scale resequencing problems. Even in the limit of long reads, say 100–1000 bases, where high accuracy is easily achieved with a single index, the speed-optimized indexing employed by BFAST will still provide near-optimal lookup performance.