Searching databases of DNA and protein sequences is one of the fundamental tasks in bioinformatics. The Smith-Waterman algorithm guarantees the maximal sensitivity for local sequence alignments, but it is slow. It should be further considered that biological databases are growing at a very fast exponential rate, which is greater than the rate of improvement of microprocessors. This trend results in longer time and/or more expensive hardware to manage the problem. Special-purpose hardware implementations, as for instance super-computers or field-programmable gate arrays (FPGAs) are certainly interesting options, but they tend to be very expensive and not suitable for many users.
For the above reasons, many widespread solutions running on common microprocessors now use some heuristic approaches to reduce the computational cost of sequence alignment. Thus a reduced execution time is reached at the expense of sensitivity. FASTA (Pearson and Lipman, 1988) [1
] and BLAST (Altschul et al., 1997) [2
] are up to 40 times faster than the best known straight forward CPU implementation of Smith-Waterman.
A number of efforts have also been made to obtain faster implementations of the Smith-Waterman algorithm on commodity hardware. Farrar [3
] exploits Intel SSE2, which is the multimedia extension of the CPU. Its implementation is up to 13 times faster than SSEARCH [14
] (a quasi-standard implementation of Smith-Waterman).
To our knowledge, the only previous attempt to implement Smith-Waterman on a GPU was done by W. Liu et al. (2006) [4
]. Their solution relies on OpenGL that has some intrinsic limits as it is based on the graphics pipeline. Thus, a conversion of the problem to the graphical domain is needed, as well as a reverse procedure to convert back the results. Although that approach is up to 5 times faster than SSEARCH, it is considerably slower than BLAST.
In this paper we present the first solution based on commodity hardware that efficiently computes the exact Smith-Waterman alignment. It runs from 2 to 30 times faster than any previous implementation on general-purpose hardware.
The Smith-Waterman algorithm
The Smith-Waterman algorithm is designed to find the optimal local alignment between two sequences. It was proposed by Smith and Waterman (1981) [5
] and enhanced by Gotoh (1982) [6
]. The alignment of two sequences is based on the computation of an alignment matrix. The number of its columns and rows is given by the number of the residues in the query and database sequences respectively. The computation is based on a substitution matrix and on a gap-penalty function.
• A: a1a2a3….an is the first sequence,
• B: b1b2b3….bm is the second sequence,
• W(ai, bj) is the substitution matrix,
• Ginit and Gext are the penalties for starting and continuing a gap,
the alignment scores ending with a gap along A and B are
Finally the alignment scores of the sub-sequences Ai, Bj are:
where 1≤i≤n and 1≤j≤m. The values for E, F and H are 0 when i<1 and j<1. The maximum value of the alignment matrix gives the degree of similarity between A and B.
An important point to be considered is that any cell of the alignment matrix can be computed only after the values of the left and above cells are known, as shown in Figure . Different cells can be simultaneously processed only if they are on the same anti-diagonal.
Smith-Waterman data dependencies. Each cell of the alignment matrix depends on the cells on the left and above it. Independent data can be found only on the same anti-diagonal.
CUDA programming model
The two major GPU vendors, NVidia and AMD, recently announced their new developing platforms, respectively CUDA [7
] and CTM [8
]. Unlike previous GPU programming models, these are proprietary approaches designed to allow a direct access to their specific graphics hardware. Therefore, there is no compatibility between the two platforms. CUDA is an extension of the C programming language; CTM is a virtual machine running proprietary assembler code. However, both platforms overcome some important restrictions on previous GPGPU approaches, in particular those set by the traditional graphics pipeline and the relative programming interfaces like OpenGL and Direct3D.
We selected NVidia GeForce 8800 and its CUDA platform to design our Smith-Waterman implementation because it is the first available GPU on the market capable of an internal integer representation of data.
In CUDA, the GPU is viewed as a compute device suitable for parallel data applications. It has its own device random access memory and may run a very high number of threads in parallel (Figure ). Threads are grouped in blocks and many blocks may run in a grid of blocks. Such structured sets of threads may be launched on a kernel of code, processesing the data stored in the device memory. Threads of the same block share data through fast shared on-chip memory and can be synchronized through synchronization points. An important aspect of CUDA programming is the management of the memory spaces that have different characteristics and performances:
Figure 2 CUDA architecture. New CUDA compatible GPUs are implemented as a set of multiprocessors. Each multiprocessor has several ALUs (Arithmetic Logic Unit) that, at any given clock cycle, execute the same instructions but on different data. Each ALU can access (more ...)
• Read-write per-thread registers (fast, very limited size)
• Read-write per-thread local memory (slow, not cached, limited size)
• Read-write per-block shared memory (fast, very limited size)
• Read-write per-grid global memory (slow, not cached, large size)
• Read-only per-grid constant memory (slow but cached, limited size)
• Read-only per-grid texture memory (slow but cached, large size)
The proper choice of the memory to be used in each kernel depends on many factors such as the speed, the amount needed, and the operations to be performed on the stored data. An important restriction is the limited size of shared memory, which is the only available read-write cache. Unlike the CPU programming model, here the programmer needs to explicitly copy data from the global memory to the cache (shared memory) and backwards. But this new architecture allows the access to memory in a really general way, so both scatter and gather operations are available. Gather is the ability to read any memory cell during the run of the kernel code. Scatter is the ability to randomly access any memory cell for writing. The unavailability of scatter was one of the major limitations of OpenGL when applied to GPGPU applications. The main point in approaching CUDA is that the overall performance of the applications dramatically depends on the management of the memory, which is much more complex than in the CPUs.