Abstract
Background
Pairwise sequence alignment methods are widely used in biological research. The increasing number of sequences is perceived as one of the upcoming challenges for sequence alignment methods in the nearest future. To overcome this challenge several GPU (Graphics Processing Unit) computing approaches have been proposed lately. These solutions show a great potential of a GPU platform but in most cases address the problem of sequence database scanning and computing only the alignment score whereas the alignment itself is omitted. Thus, the need arose to implement the global and semiglobal NeedlemanWunsch, and SmithWaterman algorithms with a backtracking procedure which is needed to construct the alignment.
Results
In this paper we present the solution that performs the alignment of every given sequence pair, which is a required step for progressive multiple sequence alignment methods, as well as for DNA recognition at the DNA assembly stage. Performed tests show that the implementation, with performance up to 6.3 GCUPS on a single GPU for affine gap penalties, is very efficient in comparison to other CPU and GPUbased solutions. Moreover, multiple GPUs support with load balancing makes the application very scalable.
Conclusions
The article shows that the backtracking procedure of the sequence alignment algorithms may be designed to fit in with the GPU architecture. Therefore, our algorithm, apart from scores, is able to compute pairwise alignments. This opens a wide range of new possibilities, allowing other methods from the area of molecular biology to take advantage of the new computational architecture. Performed tests show that the efficiency of the implementation is excellent. Moreover, the speed of our GPUbased algorithms can be almost linearly increased when using more than one graphics card.
Background
The most important and the most frequently used algorithms in computational biology are probably the NeedlemanWunsch [1] and the SmithWaterman [2] algorithms for global and local pairwise alignments of DNA (and protein) sequences, respectively. These algorithms are based on dynamic programming. As a result, one gets an optimal alignment, but the approach requires a lot of time and memory. The problem becomes more serious when pairwise alignments have to be computed for a set of thousands of sequences (a common case at the assembly stage of DNA recognition [35]). A natural extension of the pairwise alignment is a multiple sequence alignment (MSA) problem, which is much more complex. Theoretically, the MSA problem can be also solved by dynamic programming, but it was proved that for a SumofPairs score this problem is NPhard [6]. Thus, heuristic approaches are frequently used (see review [7]). The most common ones, based on the so called progressive algorithm, require the alignment of every input sequence pair. Sometimes, such pairwise alignments are performed with highly specialized methods like in case of [8,9], but often it is the NeedlemanWunsch or SmithWaterman algorithm [10,11] resulting in timeconsuming methods. Hence, the increasing number of sequences is perceived as one of the upcoming challenges for the MSA problem in the nearest future [12].
Recently, modern graphics processing units (GPUs) have been widely exploited for solving many bioinformatic problems. An example may be the problem of scanning databases for sequences similar to a given query sequence. A few efficient implementations addressing this problem have been developed (see [1316]). However, it should be stressed that scanning a database is considerably different from aligning every possible pair of sequences from a given input set. Both problems, seemingly the same, vary in many aspects, especially in case of lowlevel GPU optimizations. Moreover, it is worth noting that all the methods mentioned above compute only the alignment score, not the alignment itself. Yet, many reallife applications require also the alignment to be computed. One of the known approaches, by KhajehSaeed A. et al. [17], partially solves this problem but the application has been designed for a very specific benchmark. Additionally, the method adopted for backtracking procedure is not clear and not very efficient either. Hence, the software is not applicable in practice, e.g. to the MSA or the DNA assembly problem. However, the idea presented by Liu Y. et al. [18] seems to be a quite successful approach to the former of these two problems. The proposed solution uses the MyersMiller algorithm [19] to compute the alignment. The main advantage of this algorithm is the possibility of aligning very long sequences as the backtracking procedure works in linear space. The main drawback, on the other hand, is the necessity of conducting additional computations  the backtracking routine has quadratic computational complexity here. But yet, many practical applications require dealing with a large number of short sequences, e.g. [20]. In these problems a special emphasis should be put on efficient processing without any redundant or repeated computations and not necessarily on saving memory.
The main goal of this work derives from the discussion above. It is a construction of GPUbased dynamic programing algorithms for pairwise alignment. One difference between our approach and the previous ones is that we have optimized the algorithm for aligning every sequence with each other from a given input set. The second difference is that our method, unlike others, performs the backtracking procedure in linear time. Although special data structures are used here, no redundant computations are needed. In contrast to the MyersMiller algorithm, it was designed for the GPU architecture. Moreover, the three basic pairwise alignment algorithms, i.e. local, global and semiglobal, differ only in details, so all of them have been implemented. As a result we got a valuable tool for multisequence pairwise alignments which is fast and can be run on a common personal computer equipped with NVIDIA GPU (G80, GT200 or Fermi). Extensive computational tests show its advantage over CPUbased solutions like the Emboss package or the highly optimized Farrar's implementation. Moreover, our task manager is able to use more than one GPU. Performed tests show that the multiGPU support influences the execution time considerably.
GPGPU and the CUDA programming model
There are a few substantial differences between CPU and GPU architectures that make a GPU a more powerful tool for some applications. The same differences cause some difficulties in programming of graphics cards. Firstly, GPUs have many more cores, which are the main computational units, e.g. NVIDIA GeForce 280 has 240 cores. Secondly, there is much less cache memory available on the GPU. Moreover, the cache memory on the graphics card is not managed automatically, but by a programmer.
Such an architecture gives opportunities to utilize the hardware more efficiently. On the other hand, writing parallel algorithms on GPU is more timeconsuming, because it requires indepth knowledge and understanding of the hardware. As a result the algorithm can be much faster than its CPU version. Although there are a few GPGPU (generalpurpose computing on graphics processing units) technologies like ATI Stream [21] or OpenCL [22] on the market, one of them  CUDA [23], is a bit more established than others. Our implementation of alignment algorithms was done using this technology. The CUDA environment is an extension of C/C++ programming languages which enables programmers to access the resources of the GPU.
To understand the essentials of CUDA, one has to be aware of different types of available memory. The main differences between these memory types have been shown in Table 1. The proper usage of memory is the key to good performance. However, not only the type of memory used is important, but also their correct usage. Different kinds of memory have different access patterns. It means that for instance, the order of reading/writing data can be also crucial [24]. Because RAM (also called the main or global memory) is much slower than the memory on the chip, most of the CUDA programs follow this simple rule: fetched data from the global memory is processed locally as much as possible, using registers, shared memory and caches, then the results are written back to the global memory. In this way one can limit expensive data transfers from or to the global memory.
Table 1. Differences between memory types in CUDA
Another significant property of CUDAenabled graphics cards is that the GPU consists of many multiprocessors and each multiprocessor has a number of cores working as one or more SIMT (single instruction multiple thread) units. One such unit is able to execute one and only one instruction at the same time, but in many threads and on various parts of data. As a result, during the process of designing an algorithm, one must take this into consideration.
Being conscious of the architecture described briefly above, one can design and implement alignment algorithms efficiently.
Algorithms for pairwise sequence alignment
There are three basic algorithms for performing pairwise sequence alignment: NeedlemanWunsch [1] for computing global alignment, its modification for semiglobal alignment and SmithWaterman [2] for computing local alignment. All these algorithms are based on the idea of dynamic programming and to some extent work analogically. Taking into consideration Gotoh's enhancement [25], the algorithms are described briefly below.
Let us define:
• A  a set of characters of nucleic acids or proteins,
• s_{i } the ith sequence,
• s_{i}(k) ∈ A  kth character of the ith sequence,
• SM  substitution matrix,
• SM(c_{i }∈ A, c_{j }∈ A)  substitution score for c_{i }and c_{j }pair,
• G_{open } penalty for opening a gap,
• G_{ext } penalty for extending a gap,
• H  a matrix with partial alignment scores,
• E  a matrix with partial alignment scores indicating vertical gap continuation,
• F  a matrix with partial alignment scores indicating horizontal gap continuation,
NeedlemanWunsch algorithm
To compute the alignment of two sequences, the algorithm (called later NW algorithm) has to fill the matrix H according to the similarity function. The similarity function determines a score of substitution between two residues. This relation is given in a substitution matrix, like one from BLOSUM [26] or PAM [27] families. The matrix H also takes gap penalties into account, described by G_{open }and G_{ext}. The size of the matrix H is (n + 1) × (m + 1), where n is the number of residues in the first sequence s_{1 }and m  in the second sequence s_{2}.
The matrix H is filled using the following formulae:
where i = 1...n and j = 1...m.
The first row and the first column are filled according to the following formulae:
Moreover, the E and F matrices are initialized by putting ∞ value into the first row and column. The result for this part of the algorithm is the value of similarity, so called score. Let us denote the coordinates of the cell with the similarity score by (i*, j*). In case of the NW algorithm, this value can be found in the H(n, m) cell of the matrix H.
The goal of the second stage  backtracking, is to retrieve the final alignment of two sequences. The idea of backtracking is that the algorithm performs backward moves starting from the (i*, j*) cell in the matrix H until it reaches the (0, 0) cell. Every time when the algorithm moves to the upper cell, a gap character is inserted into the sequence s_{1 }in the final alignment. If the algorithm moves left, a gap is added analogically to the sequence s_{2}, and finally the diagonal move means that the corresponding residues are aligned. The backtracking procedure is deeply analyzed in Section "The idea of backtracking procedure and GPU limitations".
The semiglobal pairwise alignment
A semiglobal version of dynamic programming for pairwise alignment differs from the previous one in three points. The first one is the way how the matrix H is initialized. For semiglobal alignment the formulae (4) and (5) should be replaced respectively by:
The second difference concerns the coordinates of the cell where the similarity score can be found in the matrix H. For semiglobal alignment this cell is the one with the highest value from the last row or column of the matrix H.
The last difference involves the stop criterion for the backtracking procedure. In this case backtracking is finished when the cell (k, 0) or (0, l) is reached, where k = 0, ..., n and l = 0, ..., m.
SmithWaterman algorithm for the local pairwise alignment
The SmithWaterman algorithm (called later SW algorithm) also differs from the NeedlemanWunsch algorithm in three points. The first one is again the way of initializing the matrix H. The initializing values should be the same as in the semiglobal version of the algorithm.
The second difference concerns the formulae describing the process of filling the matrix H. The formula (1) should be replaced by the following one:
where i = 1...n and j = 1...m.
The next difference covers the coordinates of the cell with the final score for the local alignment. In this case the (i*, j*) cell is the one with the highest value within the entire matrix H.
The last difference concerns the stop criterion of the backtracking procedure. The SmithWaterman algorithm is finished when a cell with zero value is reached.
Implementation
The idea of backtracking procedure and GPU limitations
To obtain the alignment efficiently four boolean matrices have been defined in our approach, each of size (n + 1) × (m + 1). The purpose of these matrices is to indicate the proper direction of backward moves for the algorithm being at a certain position during the process of backtracking. Although their memory usage is quadratic, the advantage is that they enable to perform the backtracking procedure in a linear time, in contrast to the Mayers and Miller's idea.
The backtracking matrices are defined as follows:
• C^{up } indicates whether the algorithm should continue moving up,
• C^{left } indicates whether the algorithm should continue moving left,
• B^{up } indicates whether the algorithm should move up, if it does not continue its way up or left,
• B^{left } indicates whether the algorithm should move left, if it does not continue its way up or left.
Two special cases should be stressed:
• if C^{up }= false, C^{left }= false, B^{up }= true and B^{left }= true then the algorithm should move to the diagonal cell in the up left direction,
• if C^{up}, C^{left}, B^{up }and B^{left }have logical value false then the backtracking procedure is finished.
In the case of global and semiglobal alignment algorithms, the matrices are filled according to the following formulae:
The additional condition of H_{i,j }≠ E_{i,j }in the formula (12), as compared to the formula (11), prevents the algorithm from an ambiguous situation, when both directions, up and left, are equally good. In this case, to avoid nondeterministic behavior, the algorithm should prefer only one, predefined direction.
For the local alignment algorithm, the C^{up }and C^{left }matrices are filled according to formulae (9) and (10), respectively. However, the B^{up }and B^{left }matrices are filled using the following formulae:
An important issue, one should take into consideration, is that during the process of filling the matrix H any cell value can be computed only if the values of the left, above and diagonal cells are known. It means that only these cells that are on the same antidiagonal can be processed simultaneously. As a result, there is not much to parallelize (in the context of massively parallel GPU architecture) in a single run of NW or SW algorithm. However, progressive multiple sequence alignment algorithms require aligning of many sequences (every sequence with each other). Our idea was to design an algorithm for efficient execution of many pairwise alignment instances running concurrently. To utilize the GPU resources properly one has to load it with a sufficient amount of work. To fulfill this requirement at least 80 × 80 NW/SW instances should be computed concurrently (this number will be explained in Section "Implementation of the algorithms"). The problem to overcome was that the amount of available RAM on graphics cards was, for this purpose, relatively small (e.g. the GeForce GTX 280 usually has 1 GB of RAM). In fact, the H, E and F matrices do not have to be kept entirely in RAM (see Section "Implementation of the algorithms"). However, the backtracking tables (C^{up}, C^{left}, B^{up}, B^{left}) must be kept in the global memory. Hence, they would take a lot of memory space if they were held in normal C/C++ boolean arrays, e.g. for sequences with lengths of 500 residues one would need 80 × 80 × 500^{2 }× 4 bytes i.e. 6103.5 MB only for backtracking arrays. Thus, a special emphasis has been put to make this figure smaller, so that the algorithm can be run on any CUDAcapable device.
Implementation of the algorithms
All the algorithms in our implementation, namely the NW algorithm, its semiglobal version and the SW algorithm, have a few input parameters such as a substitution matrix, G_{open }and G_{ext }values, a file in fasta format with sequences to be aligned, etc. When the algorithm is launched, it performs an alignment of every given sequence with each other. The result of the algorithm consists of the score and the alignment for each pair of sequences.
Let S be the set of input sequences. The total number N of sequences' pairs to be aligned is given by the following formula:
Because the problem of aligning many sequences simultaneously is a memoryconsuming task, it has been split into subproblems. The whole matrix of tasks, where a single task is a pairwise alignment, was divided into smaller matrices, called windows (see Figure 1). The size of each window, denoted by window size, is a tradeoff between the amount of global memory required and the number of tasks running concurrently. Obviously, the global memory is limited and the number of tasks running concurrently is directly connected with the efficiency. Performed tests showed that a good value for the window size parameter is 80  this number can vary depending on the hardware. The vast majority of memory allocated by the algorithm is used for backtracking matrices. Therefore, one of the most significant problems was to store them properly. It is crucial to pack backtracking matrices into as small memory space as possible. To achieve this, any single cell of previously defined boolean matrices (C^{up}, C^{left}, B^{up}, B^{left}) is represented by one bit. Hence, in one 32bit memory word, a total of 32 values can be stored. These enhancements, i.e. windowing and backtracking matrices with their binary representations, enable the algorithm to run on any CUDAcapable device.
Figure 1. Division of the problem. Division of the problem into subproblems called windows and blocks. The input sequences are sorted from the longest to the shortest one.
One window can be considered as a grid consisting of constantsized blocks, as shown in Figure 1. Any single window is executed entirely on one GPU. Many windows, though, can be executed on many different GPUs. The block size is set to 16, because of our lowlevel optimizations of the algorithm. It means that in one block there are 16 · 16 = 256 threads. Each thread is responsible for aligning one pair of sequences. Although the window size value must be divisible by the block size, the algorithm ensures that all input sequences will be aligned, despite of their number. Execution of a block is over if its last thread finishes. Hence, to fully utilize the hardware resources, the lengths of all of the sequences within a block should be similar. The same applies to any window and its blocks. Therefore, the input sequences are sorted from the longest to the shortest one in the preprocessing step. This enhancement improves the algorithm performance significantly.
All alignment algorithms are divided into two main procedures, called kernels. The first kernel computes the alignment score and fills the backtracking matrices, the second one performs the backtracking stage. In the first kernel every thread fills its H, E and F matrices as well as its backtracking arrays C^{up}, C^{left}, B^{up}, B^{left }horizontally. In each iteration a total of eight cells are computed, as shown in Figure 2. The global memory is accessed at the beginning of the iteration (when one element of the H and E matrices is read) and at the end (when the results, i.e. one element of the H and E matrices, and eight elements of backtracking arrays, are written back). A pair of H and E elements are stored together as one 32bit word.
Figure 2. Processing of the dynamic programming matrix. Processing of the dynamic programming matrix. The cells are processed horizontally in a group of eight. The cells that have already been processed are marked as grey, cells that are currently processed are black and cells to be processed are white.
Also the elements of backtracking arrays are stored in a 32bit word  eight elements in each of four matrices give totally 32 bits. Moreover, one can notice that the elements of the matrix F do not have to be transferred from/to the global memory, because they can be stored in the fast, shared memory. Although utilization of the shared memory greatly speeds up the algorithm, not all the solutions, e.g. Manavski et al. [13], leverage its potential. Additionally, in our implementation the elements of the substitution matrix are stored in the constant memory and the sequences are stored as a texture. As a result, to process eight elements of the dynamic programming matrix one 32bit word is read from the slow, global memory and two 32bit words are written back. Apart from this all the operations are performed using registers, shared memory, cached constant memory and textures. The pseudocode of the first kernel has been shown in Figure 3.
Figure 3. Pseudocode of the first kernel. Pseudocode of the inner loops in the first kernel. The H matrix is filled in a way specific to the NW algorithm.
The idea of processing the dynamic programming matrix in vectors of eight elements in the first kernel is similar to the one proposed by Liu Y. et al. in CUDASW++ [15]. However, CUDASW++ kernel performs a database scan and, as such, takes advantage of storing the query sequence in the constant memory what results in significant performance boost. This idea was further exploited in CUDASW++2.0 [16] by using so called query profile. These improvements are not applicable for our solution in which there is no single query sequence that could be effectively shared across all the threads.
The second stage of the algorithm  backtracking, is executed by the second kernel. Also in this case, one thread is responsible for processing of only one alignment. The kernel starts from the (i*, j*) cell, computed in the first stage, and performs the up, left or diagonal moves, depending on the backtracking matrices, until the stop condition is fulfilled. When the algorithm moves up, the elements of C^{up}, C^{left}, B^{up }and B^{left }matrices do not have to be read from the global memory, because in most cases, they are already in registers  one 32bit word contains the information about eight elements of backtracking arrays. However, when the algorithm moves left or diagonal, one word is read from the global memory. This kernel, launched in grids of blocks, produces the final alignments of every sequence with each other. Its pseudocode has been shown in Figure 4. The second stage of the algorithm is very quick and usually comprises less than 1 percent of total runtime.
Figure 4. Pseudocode of the second kernel. Pseudocode of the backtracking kernel.
The advantage of using the backtracking arrays is that the backtracking stage can be performed very quickly in a linear time leading to very good solutions for short and mediumlength sequences. However, its main drawback is quadratic memory complexity, discussed in Section "The idea of backtracking procedure and GPU limitations". Thus, the question is: what is the length of the longest sequences that can be processed by our program? Table 2 shows the maximum lengths of sequences, that can be aligned by the algorithm, depending on the amount of RAM available on the graphics card and the value of window size parameter. E.g. to utilize the resources of the GeForce GTX 280 with 1 GB of RAM properly, it is sufficient to set window size parameter to 80. It means that the input sequences, regardless of their number, can be as long as about 547 residues each. Processing of longer sequences is also possible, but the window size parameter should be decreased, e.g. the proper window size value for sequences with the length of 900 residues is 48. Although this change will have an influence on the overall performance of the algorithm, its speed may be still satisfactory. Taking this into consideration, we can conclude that the algorithm can process sequences with reasonable length. On the other hand, while aligning short sequences, one can try to increase the value of window size. This may improve the algorithm's performance.
Table 2. The maximum sequence length depending on the GPU RAM and window size parameter
Bearing in mind, that nowadays many computer systems are equipped with more than one graphics card, we have designed and implemented a multiGPU support. To ensure that all graphics cards used are equally loaded with work, regarding to their individual speeds, we have also implemented a task manager. Its role is to balance work among available GPUs. First, it sorts the tasks (here: windows) in descending order of their estimated complexity. Then, the tasks are assigned consecutively to any GPU that becomes idle. This type of scheduling, i.e. largest processing time first (LPT), although not optimal, ensures that the upper bound of execution time is equal to , where t_{opt }is the optimal execution time and m  number of processing units [28,29]. In practice, applying LPT rule results in very good run times.
Results
The main goal of this section is to compare the performance of the algorithm to other stateoftheart approaches. However, before proceeding to the actual tests, the measure of cell updates per second (CUPS) should be well understood. The measure represents the average time needed to compute one cell in the matrix H, including the time of all side operations like computation of the values in the E and F matrices or performing the backtracking procedure. In practice, the number of computed cells in the matrix H is divided by the overall runtime of the algorithm. In our case it is:
where n is the number of input sequences, length_{i }is the length of the ith sequence, t represents the time in seconds and the result is given in giga (10^{9}) CUPS.
It should be stressed, however, that this measure underestimates the performance of the algorithms with a backtracking routine, because while the number of cells in the matrix H does not change, the time needed by backtracking is added.
The first implementation of the SW algorithm taking advantage of CUDAcapable GPUs has been developed by Manavski S. et al. [13]. The SWCUDA algorithm running on two NVIDIA GeForce 8800GTX graphics cards achieves its peak performance of about 3.5 GCUPS. Another approach, developed by Ligowski L. et al. [14], with optimized shared memory usage was able to achieve up to 7.5 GCUPS using one and up to 14.5 GCUPS using both GPUs of the GeForce 9800 GX2. The CUDASW++ implementation by Liu Y. et al. [15] achieves a performance close to 10 GCUPS on a GeForce GTX 280 graphics card and up to 16 GCUPS on a dualGPU GeForce GTX 295. This approach has been further explored by its authors resulting in optimized SIMT and partitioned vectorized algorithm CUDASW++ 2.0 [16] with an astonishing performance of up to 17 GCUPS on a GeForce GTX 280 and 30 GCUPS on a dualGPU GeForce GTX 295. In the meantime also a couple of solutions addressing the Cell/BE [3032] or FPGA [33,34] processors have been developed, all showing a great potential of new computing architectures. However, all implementations mentioned above solve a different problem  they perform a database scan, which is an easier problem to optimize for a couple of reasons, e.g. the query sequence may be kept all the time in fast onthechip memory. Moreover, all mentioned approaches concentrate on computing only the alignment score, not the alignment itself.
In search for an application to which our algorithm could be compared, we have come across the KhajehSaeed A. et al. solution [17]. This application, in one of its configurations, is able to perform the SW algorithm together with the backtracking stage and moreover apply this for any possible pair of sequences from a given input set. Thus, it is quite similar to our algorithm, but its performance is rather poor. Tests shown in the article indicate a performance of around 0.08 GCUPS for a GeForce GTX295 and about 0.17 GCUPS for four such graphics cards. Another application that performs the same computations is MSACUDA [18]. To be more precise, the first step of implemented ClustalW algorithm requires every input sequence to be aligned with each other. Unfortunately, the authors have not reported how fast this step is. Only the overall MSA times have been presented. Moreover, although the algorithm of Myers and Miller has been applied as a backtracking routine, sequences up to only around 860 residues have been tested in the article. Because the application is not available, we could not include it in our comparison. Since the number of stateoftheart applications performing a backtracking procedure is very limited, we have decided to compare our algorithm to scoreonly implementations. In order to make them more similar to our approach, the score in a reference application should be calculated for any pair of sequences from a given input set. This assumption, however, ruins the performance of all GPUbased database scan solutions. In this case, they would have to be launched n  1 times, where n is the number of input sequences, each time with decreased size of the database. Obviously, a good parallelism with a small database cannot be obtained for these algorithms. Hence, it would be unfair to include such tests in the paper. Instead, we decided to make use of a wellestablished algorithm of Farrar M. [35] which is a CPUbased, highly optimized database scan method. Yet, because a single run of the SmithWaterman algorithm for two sequences is parallelized here, it can be easily modified into a version that computes a score for each pair of sequences without any loss of its performance. A detailed description of such a modification is provided is the next subsection.
Comparison to the Farrar's implementation
In this test our algorithm is compared to the Farrar's implementation of the SmithWaterman algorithm [35]. Farrar's approach utilizes the set of SSE2 instructions available in modern CPUs which makes the algorithm very efficient. The strength of the method does not rely on the great number of input sequences which can be processed simultaneously, but on SIMD operations performed within a single run of the SW algorithm. Therefore, the algorithm could be easily converted from scanning a database to computing scores for each pair of sequences from a given input set. All changes needed were made in the source code, so that the application does not have to be launched many times. Special tests were conducted to assure that the performance of the algorithm (in GCUPS) has not been affected.
For testing purposes we used the Ensembl Databases  Release 55 [36], which contains genomic information regarding selected vertebrate species. All tests were performed on randomly selected subsets of sequences from the Homo Sapiens translation predictions.
In order to see if the performance of the algorithms depends on the length of the sequences, the input data was divided into six groups with different average lengths: 51, 154, 257, 459, 608 and 1103 amino acids. Moreover, for each group three sets with different number of sequences are considered to see if their number has any significant impact on the performance. The substitution matrix (BLOSUM50) as well as gap penalties (G_{open }= 10, G_{ext }= 2) were fixed the same for all tests.
The tests were run on the following hardware:
• CPU: 2 × Intel Xeon E5405, 2.0 GHz,
• GPU: NVIDIA Tesla S1070 with 16 GB of RAM,
• RAM: 16 GB,
• OS for our algorithm: 64bit Linux,
• OS for Farrar's method: 64bit Microsoft Windows 7.
Each algorithm was launched with each input data ten times. Table 3 presents the average execution times measured in seconds whereas Table 4 shows the performance in GCUPS. The standard deviation values σ have been omitted, because they do not give any significant information (in each case the value σ made up less than 1% of measurement). Additionally, our algorithm has been tested in one and four GPU configurations, respectively.
Table 3. Time comparison between our solution and the Farrars' implementation
Table 4. Performance comparison between our solution and the Farrars' implementation
The performance of the Farrar's algorithm grows significantly with increasing sequence length, reaching around 3.15 GCUPS for the sets with the longest sequences. In contrast, our algorithm with the performance of about 2.8 GCUPS seems to be insensitive to the sequence length. Its speed slightly decreases only for the groups with long sequences (608 and 1103). The reason behind this is that longer sequences require more global memory and thus the value of the window size parameter needs to be reduced. This corresponds directly to the number of tasks running in parallel. Hence the slowdown.
Farrar's solution, that has been used in this test, uses only one CPU core, but obviously we can expect a speedup close to linear if all CPU cores were used. However, our approach is also scalable  the execution times drop by a factor of nearly four when all four GPUs are used and the algorithm reaches up to 11.5 GCUPS. The number of input sequences does not affect the performance of Farrar's approach and it was high enough to have no influence on the performance of our algorithm. We can conclude that for sequences of average length (459) both implementations run comparably fast, but the GPUbased algorithm tends to be much faster when short sequences are processed. Moreover, it is worth noting that our algorithm additionally performs the backtracking step and computes the actual alignments of the sequences.
The speedup is much higher if the algorithm is compared to the one presented by KhajehSaeed A. et al. in [17]. Up to our knowledge this is the only GPUbased solution addressing the same problem where the performance is reported. Our approach is about 35 and 68 times faster for one and four graphics cards, respectively.
Time comparison to the Emboss implementation
Apart from comparison to the stateoftheart solutions, we decided to compare the algorithm to the implementation available in a popular package with bioinformatics tools  the Emboss [37]. The input sequences were also chosen from the Ensembl Databases  Release 55 (see Section "Comparison to the Farrar's implementation"). We tested ten sets of sequences, each containing 2800 entries with lengths between 100 and 420 amino acids. The execution times of our NW and SW algorithms' implementations were compared with needle and water programs, which are available in the Emboss package. The needle program performs semiglobal alignments using the NW algorithm, the water program computes local alignments using the SW algorithm. These programs have been designed for aligning one sequence with a set of sequences. Instead of changing the source code, a special shell script was prepared that allows to align every sequence with each other. As a side effect, the programs from the Emboss package had to be launched 2800 times. In order to make the comparison reasonably fair their execution times have been reduced appropriately. We prepared a set of 2800 sequences with lengths equal to 1. Then, we measured the execution times of the programs from the Emboss package for this special set. The times were subtracted from the execution times of the test cases containing real sequences. It is worth noting that needle and water programs work using one CPU thread. Obviously, the NW and SW algorithms are deterministic  they always find the optimal solution. Therefore, there is no need to compare the quality of the results.
The tests were run on the following hardware:
• CPU: Intel Core 2 Quad Q8200, 2.33 GHz,
• GPU: NVIDIA GeForce GTX 280 with 1 GB of RAM,
• RAM: 8 GB,
• OS: 64bit Linux.
Each algorithm was run ten times (once for each input set). Figure 5 shows the average execution times measured in seconds. The standard deviation values σ have been omitted, because they do not give any significant information (in each case the value σ comprised less than 1% of the measure).
Figure 5. Average time of aligning a set of 2800 sequences. Average time of aligning a set of 2800 sequences. The algorithms marked as needle and water come from the Emboss package and ran on the CPU. The algorithms marked as semiglobal NW and SW were launched on a single GPU  GeForce GTX 280. The scale of the time axis is logarithmic.
The average times of computation for the needle and water programs were 6157 and 10957 seconds respectively (about 102 and 182 minutes), whereas the times for our implementation were as follows: 89.7 seconds for the NW and 100.8 seconds for the SW algorithm. Thus, the GPU implementation of the semiglobal version of NW was about 68 times faster than the CPUbased needle. In case of the SW algorithm the difference was even higher: the GPU version was about 108 times faster. To show this relationship properly, the scale of the time axis in Figure 5 is logarithmic.
MultiGPU test
The multiGPU test was performed to see how the time of the computations depends on the number of graphics cards used. The sets of input sequences were the same as in the case of the test from Section "Comparison to the Farrar's implementation".
The tests were run on the following hardware:
• CPU: 2 × Intel Xeon E5405, 2.0 GHz,
• GPU: NVIDIA Tesla S1070 with 16 GB of RAM,
• RAM: 16 GB,
• OS: 64bit Linux.
The SmithWaterman algorithm was launched using one, two, three and four GPUs in turn. The algorithm was run ten times for each input set and the mean values of the computational times are shown in Table 5. To make the comparison easier, we added the columns with speedups. The execution times of the algorithm were nearly two times shorter for two graphics cards, nearly three and four times shorter when using three and four graphics cards, respectively. This shows that using more than one GPU one can gain almost linear speedup.
Table 5. Computational times of the SW algorithm depending on the number of GPUs used
To see if the same applies to the NW algorithm and its semiglobal version, we prepared a simple test. The input sequences were taken from the Ensembl Databases  Release 55 and the set of sequences contained 4000 entries with lengths between 100 and 420 amino acids.
All three algorithms, namely global and semiglobal versions of the NW, and the SW algorithm, were launched using again one, two, three and four GPUs, respectively. Each algorithm was run ten times and the mean values of the computational times as well as the speedups are shown in Table 6. The execution times of the algorithms strongly depend on the number of GPUs and the obtained speedup is almost linear. Note that in our implementation multiGPU support with load balancing works for each alignment algorithm.
Table 6. Performance of the algorithms depending on the number of GPUs used
Number of sequences needed to load the GPU
According to Gustafson's law [38], to gain a good speedup on a parallel system, the problem instances have to be sufficiently large. To investigate how large the problem instances must be, the following test was designed. Each dynamic programming algorithm was launched for different problem sizes. The smallest instance consisted of 16 randomly selected sequences from the Ensembl Databases  Release 55 (see Section "Comparison to the Farrar's implementation"). Each subsequent instance contained additional 16 sequences, the largest instance had 1024 sequences. The lengths of the sequences varied between 100 and 420 amino acids. The adopted measure was the average time needed to compute a single alignment of two sequences. To be more precise, the time needed to perform all alignments for a given problem instance was measured and then divided by the total number of computed alignments. The goal was to determine the minimal number of input sequences that could guarantee good performance.
The tests were run on the following hardware:
• CPU: Intel Core 2 Quad Q8200, 2.33 GHz,
• GPU: NVIDIA GeForce GTX 280 with 1 GB of RAM,
• RAM: 8 GB,
• OS: 64bit Linux.
Figure 6 shows that for sixteen input sequences the average times of performing a single pairwise alignment for the global and semiglobal versions of NW, and the SW algorithm were relatively long  about 1.72, 1.84 and 2.15 milliseconds, respectively. These times were significantly shorter for larger sizes of the problem.
Figure 6. Average time of computation of one alignment. Average time of computations of one alignment depending on the total number of input sequences. Tests were run on a single graphics card  GeForce GTX 280.
The reasonable times of 0.1, 0.1 and 0.12 ms, respectively, have been achieved for an instance with 80 sequences. The chart is limited to the maximum instance size of 256 sequences, because the curve almost reaches its asymptote. For an instance of this size the times were 0.04, 0.04 and 0.05 ms, respectively. All three algorithms needed only around 0.03 ms if 512 sequences were processed. Although further incrementing of the problem size resulted in some decreases in time, the benefits were not considerable. It means that even for relatively small instances the algorithms were able to gain a good speedup.
Test on the Fermi architecture
The algorithm was designed during the time when only the G80 and the GT200 GPU architectures were available on the market. However, recently a new architecture, called Fermi [39], has come along. Hence, we set out to check if the application can benefit from a doubled number of CUDA cores that are on the new chips.
The sets of input sequences were the same as in the case of the test from Section "Comparison to the Farrar's implementation". Only the longest sequences were excluded, because of more limited memory on the graphics cards.
The tests were run on the following hardware:
• CPU: Intel Core i7 950, 3.06 GHz,
• GPU: 2× NVIDIA GeForce GTX 480 with 1.5 GB of RAM,
• RAM: 8 GB,
• OS: 64bit Linux.
Each method, i.e. the SW algorithm as well as the NW algorithm and its semiglobal version, was launched ten times for each input data. Table 7 presents the average performance measured in GCUPS. The standard deviation values σ were insignificant (less than 1% of measurement) and hence omitted. Additionally, the table includes the performance of the previous generation architecture  GT200, represented here by one GPU from the Tesla S1070 (see Section "Comparison to the Farrar's implementation").
Table 7. Performance of the algorithms depending on the hardware architecture
The test shows that with the Fermi architecture the performance of the algorithms increases by a factor of nearly two, especially for short sequences. In case of longer sequences this dominance is slightly reduced, because only 1.5 GB of RAM was available on our Fermi graphics card whereas one Tesla has 4 GB. Obviously, Fermi GPU with 3 or 6 GB of memory may solve this performance issue. However, one should remember that the solution aims to process mainly short and mediumlength sequences.
The backtracking routine overhead
Although the algorithm has been designed especially to deal well with backtracking routine, we also carried out a special test to investigate its performance when the backtracking is not performed, i.e. for scoreonly version. In other words we set out to check the overhead needed by the backtracking procedure. The kernel responsible for the actual backtracking is very quick and as stated before comprises less than 1 percent of total runtime of the algorithm. It, however, requires the backtracking arrays to be filled by the kernel 1. Since these arrays are not used for any other purpose, we excluded them from computations.
Tests were conducted on workstation already described in Section "Test on the Fermi architecture". This time, though, only one GPU was used. The sets of input sequences were also the same. The results are presented in Table 8.
Table 8. The overhead of the backtracking procedure
The performance of the scoreonly algorithm increases considerably reaching up to 9.39 GCUPS. The results are not as good as e.g. in CUDASW++2.0, but one should be aware of the fact that our algorithm was optimized to work well together with the backtracking routine. Moreover, our method has a few assumptions (described earlier) that distinguish it from other implementations but at the same time make it more elaborate.
The BT share values were computed in the following way: the computational times of the algorithm with the backtracking routine were divided by the runtimes of corresponding scoreonly versions of the algorithm. As we can see the overhead of the backtracking is mainly caused by the necessity of filling the C^{up}, C^{left}, B^{up }and B^{left }matrices. However, the additional 45%  65% is still low in comparison to the method proposed by Myers et al. [19]. Their approach requires twice as much computations and hence the overhead is around 100%. Moreover, we expect that the overhead on GPU would be even higher because of multiple reads of entire input sequences from the global memory. We may conclude that for short and mediumlength sequences our method appears to be more suitable.
Conclusions
Although a few GPUbased implementations of the SW algorithm are already known [1316] most of them address the problem of database scanning and computing only the alignment scores. Our algorithm is able to compute scores and pairwise alignments. Apart from the SW algorithm, we have also implemented global and semiglobal versions of the NW algorithm. Performed tests show that the efficiency of the implementation is excellent. The algorithm is able to align sequences in roughly the same time as the Farrar's solution needs to compute only scores. Yet, its real dominance reveals while short sequences are processed with no performance loss. Moreover, the speed of our GPUbased algorithms can be almost linearly increased when using more than one graphics card. We have also checked what is a minimum reasonable number of input sequences. Performed tests show, that even for about 80 sequences our algorithms are able to gain a good speedup. What is worth noting, all the tests were performed using real sequences.
The NW and SW algorithms with a backtracking routine may have a lot of applications. They can be used as a robust method for multipairwise sequence comparisons performed during the first step in all of the multiple sequence alignment methods based on the progressive approach. Another area of interest can be the usage of a GPUbased semiglobal alignment procedure as a part of the algorithm for the DNA assembly problem, being one of the most challenging problem in current biological studies. It has already been shown that in this case a parallel solution can be successfully applied [20]. Using GPUbased approaches we expect that its execution time would be even shorter, because the large number of short sequences is perfectly in line with the benefits of our algorithm.
Availability and requirements
• Project name: gpupairAlign
• Project home page: http://gpualign.cs.put.poznan.pl webcite
• Operating system: Linux
• Programming language: C/C++
• Other requirements: CUDA 2.0 or higher, CUDA compliant GPU, make, g++
• License: GNU GPLv3
• Any restrictions to use by nonacademics: none
Abbreviations
NW: the NeedlemanWunsch algorithm; SW: the SmithWaterman algorithm; CPU: central processing unit; GPU: graphics processing unit; GPGPU: generalpurpose computing on graphics processing units; CUDA: Compute Unified Device Architecture; RAM: random access memory; OS: operating system.
Authors' contributions
WF, MK and PW conceived of the study and participated in its design. WF proposed the idea of backtracking arrays. WF and MK contributed equally to algorithm design and implementation. MK carried out computational tests. JB and EP participated in the coordination of the project. All authors were involved in writing the manuscript and all of them read and approved its final version.
Acknowledgements
This research has been partially supported by the Polish Ministry of Science and Higher Education under Grant No. N N519 314635.
References

Needlemana S, Wunsch C: A general method applicable to the search for similarities in the amino acid sequence of two proteins.
J Mol Biol 1970, 48(3):4433.
48: 44353
PubMed Abstract  Publisher Full Text 
Smith T, Waterman M: Identification of Common Molecular Subsequences.
Journal of Molecular Biology 1981, 147:19597. PubMed Abstract  Publisher Full Text

Computational Molecular Biology: An Algorithmic Approach. The MIT Press. 2000.

Blazewicz J, Bryja M, Figlerowicz M, Gawron P, Kasprzak M, Kirton E, Platt D, Przybytek J, Swiercz A, Szajkowski L: Whole genome assembly from 454 sequencing output via modified DNA graph concept.
Computational Biology and Chemistry 2009, 33:224230. PubMed Abstract  Publisher Full Text

Wang L, Jiang T: On the complexity of multiple sequence alignment.
J Comput Biol 1994, 1(4):337348. PubMed Abstract  Publisher Full Text

Pei J: Multiple protein sequence alignment.
Current Opinion in Structural Biology 2008, 18(3):382386. PubMed Abstract  Publisher Full Text

Do CB, Mahabhashyam MS, Brudno M, Batzoglou S: ProbCons: Probabilistic consistencybased multiple sequence alignment.
Genome Res 2005, 15(2):330340. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Lassmann T, Sonnhammer EL: Kalign  an accurate and fast multiple sequence alignment algorithm. [http://www.biomedcentral.com/14712105/6/298] webcite
BMC Bioinformatics 2005, 6:298. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Thompson J, Higgins D, Gibson T: CLUSTAL W: improving the sensitivity of progressivemultiple sequence alignment through sequence weighting, positionspecific gap penalties and weight matrix choice.
Nucleic Acids Research 1994, 22:46734680. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Notredame C, Higgins D, Heringa J: TCoffee: A novel method for multiple sequence alignments.
Journal of Molecular Biology 2000, 302:205217. PubMed Abstract  Publisher Full Text

Kemena C, Notredame C: Upcoming challenges for multiple sequence alignment methods in the highthroughput era.
Bioinformatics 2009, 25(19):24552465. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Manavski S, Valle G: CUDA compatible GPU cards as efficient hardware accelerators for SmithWaterman sequence alignment.

Ligowski L, Rudnicki W: An efficient implementation of Smith Waterman algorithm on GPU using CUDA, for massively parallel scanning of sequence databases.

Liu Y, Maskell DL, Schmidt B: CUDASW++: optimizing SmithWaterman sequence database searches for CUDAenabled graphics processing units.

Liu Y, Maskell DL, Schmidt B: CUDASW++2.0: enhanced SmithWaterman protein database search on CUDAenabled GPUs based on SIMT and virtualized SIMD abstractions.

KhajehSaeed A, Poole S, Perot J: Acceleration of the SmithWaterman algorithm using single and multiple graphics processors.
Journal of Computational Physics 2010, 229:42474258. Publisher Full Text

Liu Y, Maskell DL, Schmidt B: MSACUDA: Multiple Sequence Alignment on Graphics Processing Units with CUDA.
20th IEEE International Conference on Applicationspecific Systems, Architectures and Processors 2009.

Myers EW, Miller W: Optimal alignments in linear space.
Comput Appl Biosci 1988, 4:1117. PubMed Abstract

Blazewicz J, Kasprzak M, Swiercz A, Figlerowicz M, Gawron P, Platt D, Szajkowski L: Parallel Implementation of the Novel Approach to Genome Assembly.

ATI Stream [http://www.amd.com/stream] webcite

OpenCL [http://www.khronos.org/opencl] webcite

NVIDIA CUDA [http://www.nvidia.com/object/cuda_home.html] webcite

NVIDIA CUDA C Programming Best Practices Guide [http://www.nvidia.com/object/cuda_develop.html] webcite

Gotoh O: An Improved Algorithm for Matching Biological Sequences.

Henikoff S, Henikoff J: Amino Acid Substitution Matrices from Protein Blocks.
PNAS 1992, 89:1091510919. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Dayhoff M, Schwartz R, Orcutt B: A model of Evolutionary Change in Proteins.
Atlas of protein sequence and structure, Nat Biomed Res Found 1978, 5(supp 3):34558.

Graham RL: Bounds on multiprocessing timing anomalies.
SIAM J Appl Math 1969, 17(2):416429. Publisher Full Text

Blazewicz J, Ecker K, Pesch E, Schmidt G, Weglarz J:
Handbook on Scheduling: From Theory to Applications. Springer. 2007.

Szalkowski A, Ledergerber C, Krahenbuhl P, C D: SWPS3  fast multithreaded vectorized SmithWaterman for IBM Cell/B.E. and x86/SSE2.

Farrar M: Optimizing SmithWaterman for the Cell Broadband Engine. [http://farrar.michael.googlepages.com/SWCellBE.pdf] webcite

Wirawan A, Kwoh C, Hieu N, Schmidt B: CBESW: Sequence Alignment on Playstation 3.
BMC Bioinformatics 2008, 9:377. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Oliver T, Schmidt B, DL M: Reconfigurable architectures for biosequence database scanning on FPGAs.

Li T, Shum W, Truong K: 160fold acceleration of the SmithWaterman algorithm using a field programmable gate array (FPGA).
BMC Bioinformatics 2007, 8:185. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Farrar M: Striped SmithWaterman speeds database searches six times over other SIMD implementations.
Bioinformatics 2007, 23(2):156161. PubMed Abstract  Publisher Full Text

Ensembl Databases  Release 55 [ftp://ftp.ensembl.org/pub/release55] webcite

Rice P, Longden I, Bleasby A: EMBOSS: The European Molecular Biology Open Software Suite.
Trends in Genetics 2000, 16:27677. PubMed Abstract  Publisher Full Text

Gustafson J: Reevaluating Amdahl's Law.
Communications of the ACM 1988, 31:532533. Publisher Full Text

NVIDIA Fermi Architecture [http://www.nvidia.com/object/Fermi_architecture.html] webcite