This section reports first on the overall design of the new software and then discusses several enhancements to BLAST.
Two criteria were most important in the design of the new BLAST code: 1.) the code structure should be modular enough to allow easy modification; and 2.) the same BLAST code should be embedded in at least two different host toolkits. This would allow both the new NCBI C++ toolkit and the older NCBI C toolkit to use the same BLAST source code.
At a high level, the BLAST process can be broken down into three modules (Figure ). The "setup" module sets up the search. The "scanning" module scans each subject sequence for word matches and extends them. The "trace-back" module produces a full gapped alignment with insertions and deletions.
Figure 1 Schematic of a BLAST search. The first phase is "setup". The query is read, low-complexity or other filtering might be applied to the query, and a "lookup" table is built. The next phase is "scanning". Each subject sequence is scanned for words ("hits") (more ...)
The setup phase reads the query sequence, applies low-complexity or other filtering to it, and builds a "lookup" table (i.e., perfect hashing). The lookup table contains only words from the query for nucleotide-nucleotide searches such as BLASTN or MEGABLAST. DISCONTIGUOUS MEGABLAST allows non-consecutive matches in the initial seed. Protein-protein searches such as BLASTP allow "neighboring" words. The neighboring words are similar to a word in the query, as judged by the scoring matrix and a threshold value.
The scanning phase scans the database and performs extensions. Each subject sequence is scanned for words ("hits") matching those in the lookup table. These hits are used to initiate a gap-free alignment. Gap-free alignments that exceed a threshold score then initiate a gapped alignment, and those gapped alignments that exceed another threshold score are saved as "preliminary" matches for further processing. The scanning phase employs a few optimizations. The gapped alignment returns only the score and extent of the alignment. The number and position of insertions, deletions and matching letters are not stored (no "trace-back), reducing the CPU time and memory demands. Searches against nucleotide subject sequences consider only unambiguous bases (A, C, G, T), with ambiguous bases (e.g., N) replaced at random during preparation of the BLAST database or subject sequence. A four letter alphabet allows packing of four bases into one byte, and the subject sequences are scanned four letters at a time. Finally, less sensitive heuristic parameters are employed for the gapped alignment, and the full extent of a gapped alignment may, in rare cases, not be found.
The final phase of the BLAST search is the trace-back. Insertions and deletions are calculated for the alignments found in the scanning phase. Ambiguous bases are restored for nucleotide subject sequences, and more sensitive heuristic parameters are used for the gapped alignment. Composition-based statistics [6
] may also be applied for BLASTP (protein-protein) and TBLASTN (protein compared against translated nucleotide subject sequences).
Ideally, one should be able to independently replace the functionality described in each of the small rectangles of Figure (e.g., "build lookup table") with another implementation. Some coordination is required: for example, the lookup table is used when finding word matches, so both "build lookup table" and "find word matches" need to be changed together. Finding word matches is the most computationally intensive part of the BLAST search, so the implementation should be as fast as possible. To address this, the author of the lookup table implementation must provide the scanning routine for finding word hits. Other modules can be changed independently.
The selection of ISO C99 allows use of the new BLAST code in both C and C++ environments. The host toolkit provides a software layer to allow BLAST to communicate with the rest of each toolkit. This design requires a clean separation between the algorithmic part of BLAST and the module that retrieves subject sequences from the database. To allow this, the retrieval of subject sequences for processing by the core of the BLAST code is performed through an Abstract Data Type (ADT), which specifies a set of data values and permitted operations. The actual retrieval occurs through an implementation of the ADT in the host toolkit. The implementation can be changed depending upon the need and requires no changes to the BLAST algorithm code itself.
The subject sequence information required by BLAST is quite simple. It consists of the total number of sequences to be searched, the length of any given sequence, as well as methods to retrieve the actual sequence. The total database length is needed for calculation of expect values. A database name and the length of the longest subject sequence are also required to implement some functions in an efficient manner. In order to satisfy the above requirements, an ADT, called the BlastSeqSrc [16
], was implemented.
Low-complexity regions and interspersed repeats typically match many sequences. These matches are normally not of biological interest, may lead to spurious results, and confound the statistics used by BLAST. BLAST offers two query masking modes to avoid such matches. One is known as "hard-masking" and replaces the masked portion of the query by X's or N's for all phases of the search. On the other hand, "soft-masking" makes the masked portion of the query unavailable for finding the initial word hits, but the masked portion is available for the gap-free and gapped extensions once an initial word hit has been found.
The BLAST databases can also be masked. Masking information is stored as a series of intervals, so that masking can be switched on or off. Information from multiple masking algorithms can be stored in the same BLAST database and accessed separately. Currently, database masking consists of skipping masked portions of the database during the scanning phase, but it is still possible to extend through masked portions of the database; as such, database masking is analogous to soft-masking a query.
Minimizing memory and cache footprint
Modifications that reduce the CPU time and memory footprint of BLAST searches with long query or subject sequences are examined. First, an optimization for the scanning phase of the BLAST search is presented. Then, an improvement for the trace-back phase is described.
BLAST searches with very large queries are routine, but some of the data structures scale with the query length. The following analysis examines the scanning phase (Figure ) of the BLAST search.
Two large structures are frequently accessed during the scanning phase. The first is the "lookup table", which maps words in a subject sequence to positions in the query. The second is the "diag-array", which tracks how far BLAST has already extended word hits on any given diagonal; its size scales with the query length. The scanning phase is a large fraction of the time of most BLAST searches, so these structures must be accessed quickly. Contemporary CPUs typically communicate with main memory through several levels of cache, called a "memory hierarchy". For example, the L1 cache is the smallest and has the lowest latency; the L2 cache is larger but slower. On a machine with an Intel Xeon CPU, the L1 cache might be around 16 kB and the L2 cache can range in size from 0.5-4 MB. If the CPU does not find data or an instruction in the cache, it must fetch it from main memory; a "cache miss". Performance could be improved by making the lookup table and diag-array small enough to fit into L2 cache, still leaving room for instructions and other data.
In order to be specific, the discussion in the next two paragraphs is limited to a BLASTX search, which translates a nucleotide query in six frames (three frames on each strand) and compares it to a protein database.
The lookup table contains a long array (the "backbone"), with each cell mapping to a unique word. The lookup table translates each residue type to a number between 1 and 24, so a three-letter word maps to an integer between 1 and 243. For a three-letter word, an array of 32768 (323) cells allows a quick calculation of the offset into the backbone while scanning the database for word matches. Each cell of the backbone consists of four integers. The first integer specifies how many times that word appears in the query; the other three can have one of two functions. For three or fewer occurrences, the three integers simply specify the positions of the word in the query. If there are more than three occurrences, however, the integers are an index into another array containing the positions of the word in the query. The total memory occupied by the backbone is 16 bytes × 32768, or about 524 kB. Finally, there is a bit vector occupying 4096 bytes (32768/8). The corresponding bit is set in the bit vector for backbone cells containing entries. For a short query, where the backbone may be sparsely populated, this allows a quick check whether a cell contains any information.
A BLASTX query of N nucleotides becomes twice as long when it is represented as six protein sequences. The diag-array consumes one four-byte integer per letter in the query. An estimate of the total memory occupied by the lookup table backbone and the diag-array, in bytes, for a nucleotide query of length N is:
For a query of N = 50 k, this is close to a million bytes, already the total size of L2 cache in many computers used for BLAST searching. Modifications to these structures might permit larger queries, but for contigs and chromosomes the structures would still overflow the L2 cache. To overcome this, the query is split into smaller overlapping pieces for the scanning phase of the search. BLAST then merges the results and aligns the entire query during the trace-back phase, obtaining the same results as a search that was not split. Splitting the query has an additional advantage; since the sub-query used during the scanning phase is of bounded length, it is possible to use a smaller data type in the lookup table (specifically, a two byte rather than a four byte integer). This reduces the first term in the above equation from 528,384 to 266,240 bytes.
The final phase of the BLAST search, the trace-back, processes the preliminary matches, producing an alignment with insertions and deletions. Additionally, heuristic parameters may be assigned a more sensitive value, ambiguities in a nucleotide database sequence are resolved, and the composition of the subject sequences may be taken into account when calculating expect values. Some subject sequences must be retrieved again for this calculation, but since the preliminary phase finds the rough extent of any alignment, the entire sequence is often not needed. This is most important for short queries searched against a database of much longer sequences. Only part of the subject sequences, when appropriate, is now retrieved, and performance results are presented under "Partial subject sequence retrieval" below.