Abstract
Background
Pairwise statistical significance has been recognized to be able to accurately identify related sequences, which is a very important cornerstone procedure in numerous bioinformatics applications. However, it is both computationally and data intensive, which poses a big challenge in terms of performance and scalability.
Results
We present a GPU implementation to accelerate pairwise statistical significance estimation of local sequence alignment using standard substitution matrices. By carefully studying the algorithm's data access characteristics, we developed a tilebased scheme that can produce a contiguous data access in the GPU global memory and sustain a large number of threads to achieve a high GPU occupancy. We further extend the parallelization technique to estimate pairwise statistical significance using positionspecific substitution matrices, which has earlier demonstrated significantly better sequence comparison accuracy than using standard substitution matrices. The implementation is also extended to take advantage of dualGPUs. We observe endtoend speedups of nearly 250 (370) × using singleGPU Tesla C2050 GPU (dualTesla C2050) over the CPU implementation using Intel^{© }Core™i7 CPU 920 processor.
Conclusions
Harvesting the high performance of modern GPUs is a promising approach to accelerate pairwise statistical significance estimation for local sequence alignment.
Background
Introduction
The past decades have witnessed dramatically increasing trends in the quantity and variety of publicly available genomic and proteomic sequence data. Dealing with the massive data and making sense of them are big challenges in bioinformatics [1,2]. One of the most widely used procedures for extracting information from proteomic and genomic data is pairwise sequence alignment (PSA). Given two sequences, PSA finds the extent of similarity between them. Many bioinformatics applications have been developed based on pairwise sequence alignment, such as BLAST [3], PSIBLAST [46], and FASTA [7]. PSA produces a score for an alignment as a measure of the similarity between two sequences. Generally, the higher the score, the more related the sequences. However, the alignment score depends on various factors such as alignment methods, scoring schemes, sequence lengths, and sequence compositions [8,9]. Judging the relationship between two sequences solely based on the scores can often lead to wrong conclusion. Therefore, it is more appropriate to measure the quality of PSA using the statistical significance of the score rather than the score itself [10,11]. Statistical significance of sequence alignment scores is very important to know whether an observed sequence similarity could imply a functional or evolutionary link, or is a chance event [8,12]. Accurate estimation of statistical significance of gapped sequence alignment has attracted a lot of research in recent years [1326].
Pairwise statistical significance
Consider a pair of sequences s_{1 }and s_{2 }of lengths m and n, respectively, the scoring scheme, SC (substitution matrix, gap opening penalty, gap extension penalty), and the number of permutations N of s_{2}, pairwise statistical significance (PSS) of the two sequences is calculated by the following function [21], which is described below:
Through permuting s_{2 }N times randomly, the function generates N scores by aligning s1 against each of the N permuted sequences and then fits these scores to an extreme value distribution (EVD) [8,27,28] using censored maximum likelihood [29]. The returned value is the pairwise statistical significance of s_{1 }and s_{2}. The EVD describes an approximate distribution of optimal scores of a gapless alignment [4]. Based on this distribution, the probability (i.e., Pvalue) of observing an sequence pair with a score S greater than x, can be given by:
where λ and K are calculational constants and E(x), also known as Evalue, is the expected number of distinct local alignments with score values of at least x.
Note that the above distribution is for a gapless alignment. For the cases of gapped alignment, although no asymptotic score distribution has yet been established analytically, computational experiments strongly indicate these scores still roughly follow Gumbel law [8,13,30].
In addition to not needing a database to estimate the statistical significance of an alignment, pairwise statistical significance is shown to be more accurate than database statistical significance reported by popular database search programs like BLAST, PSIBLAST, and SSEARCH [21]. However, it involves thousands of such permutations and alignments, which are enormously time consuming and can be impractical for estimating pairwise statistical significance of a large number of sequence pairs. Hence, use of highperformance computing (HPC) techniques (such as multicores CPU, manycore graphics processing units (GPUs), FPGAs, etc.) is highly conducive to accelerate the computation of PSSE. Moreover, large data sets demand more computing power. In many demanding bioinformatics applications, such as sequence alignment [31,32], and protein sequence database search [33,34], manycore GPU has demonstrated its extreme computing power. This strongly motivates the use of GPUs to accelerate the PSSE. Acceleration of PSSE using MPI [35] and FPGA [36] has been explored earlier, and in this work, we design a GPU implementation for the same. Compared to [35,36], we consider the estimation of multipair PSS along with singlepair PSS using GPUs.
Contributions
We present a GPU implementation to accelerate pairwise statistical significance estimation of local sequence alignment using standard substitution matrices. Our parallel implementation makes use of CUDA (Compute Unified Device Architecture) parallel programming paradigm. Our design uses an efficient data reorganization method to produce coalesced global memory access, and a tiledbased memory scheme to increase the GPU occupancy, a key measure for GPU efficiency [37]. Through careful analysis of the data dependency and access patterns of PSSE, we reorganize the data sequences into aligned arrays that coalesce the global memory access pattern. Such data access contiguity keeps the GPU cores occupied with computation and allows the thread scheduler to overlap the global memory access with the computation. In addition to the ability to calculate the optimal tile size for data to be shipped to the GPU, our design can also issue a large enough number of threads to maximize the occupancy.
We further extend the parallelization technique to estimate pairwise statistical significance using positionspecific substitution matrices, which has earlier demonstrated significantly better sequence comparison accuracy than using standard substitution matrices [11]. The implementation is also extended to take advantage of dualGPUs to accelerate those computations. As a result, maximum performance could be obtained by harvesting the power of manycore GPUs.
The performance evaluation was carried out on NVIDIA Telsa C2050 GPU. For multipair PSSE implementation, we observe nearly 250 (370)× speedups using a singleGPU Tesla C2050 GPU (dualTesla C2050) over the CPU implementation using an Intel^{© }Core ™i7 CPU 920 processor. The proposed optimizations and efficient framework for PSSE, which have direct applicability to speedup homology detection, are also applicable to many other sequence comparison based applications, such as DNA sequence mapping, phylogenetic tree construction and database search especially in the era of nextgeneration sequencing.
Methods
In this section we present GPU implementations both for singlepair PSSE and multipair PSSE. Along with the methodological details, we also discuss several performance optimization techniques used in this paper, such as, GPU memory access optimization, occupancy maximization, and substitution matrix customization. These techniques have significantly speeded up PSSE.
Careful analysis of the data pipelines of PSSE shows that the computation of PSSE can be decomposed into three computation kernels: Permutation, Alignment and Fitting. Permutation and Alignment comprise the overwhelming majority (a more than 99.8%) of the overall execution time [35]. Therefore, efforts should be spent to optimize these two kernels to achieve high performance. Also, we observe that permutation presents high degrees of data independency that are naturally suitable for singleinstruction, multiplethread (SIMT) architectures [38] and therefore, can be mapped very well to task parallelism models of GPU. Moreover, even though the alignment task suffers from data dependency, we show that with clever optimizations, it can be heavily accelerated using GPUs.
Design
GPU memory access optimization
It is especially important to optimize global memory access as its bandwidth is low and its latency is hundreds of clock cycles [38]. Moreover, global memory coalescing is the most critical optimization for GPU programming [39]. Since the kernels of PSSE usually work over large numbers of sequences that reside in the global memory, the performance is highly dependent on hiding memory latency. When a GPU kernel is accessing global memory, all threads in groups of 32 (i.e. warp) access a bank of memory at one time. A batch of memory accesses is considered coalesced when the data requested by a warp of threads are located in contiguous memory addresses. For example, if the data requested by threads within a warp are located in 32 consecutive memory addresses (such that the i^{th }address is accessed by the i^{th }thread), the memory can be read in a single access. Hence, this memory access operation runs 32 times faster. If the memory access is not coalesced, it is divided into multiple reads and hence serialized [37].
After permutation, if the sequence s_{2 }and its N permuted copies were stored contiguously one after another in the global memory, the intuitive memory layout would be as shown in Figure 1 (a) Note that, we need one byte (uchar) to store each amino acid residue. Moreover, GPU can read fourbyte (packed as a CUDA builtin vector data type uchar4) of data from the global memory to registers in one instruction. To achieve high parallelism of global memory access, uchar4 is used to store the permuted sequences. Dummy amino acid symbols are padded in the end to make the length of sequences a multiple of 4.
Figure 1. Two GPU memory layout strategies. (a) The intuitive layout that one sequence is appended after another. (b) The reorganized layout such that the uchar4 at the same indices in different sequences stay at neighboring positions.
Considering intertask parallelism, where each thread works on the alignment of one of the permuted copies of s_{2 }to s_{1}, in this layout the gap between the memory accesses by the neighboring threads is at least the length of the sequence. For example, in the intuitive layout, if thread T_{0 }accesses the first residue (i.e., 'R'), and thread T_{1 }accesses the first residue (i.e., 'E'), the gap between the access data is n. This results in noncoalesced memory reads (i.e., serialized reads), which significantly deteriorates the performance.
We therefore reorganize the layout of sequence data in memory to obtain coalesced reads. Now, to achieve coalesced access, we reorganize layout of sequences in memory as aligned structure of arrays, as shown in Figure 1 (b) In the optimized layout, the characters (in granularity of 4 bytes) that lie at the same index in different permuted sequences stay at neighboring positions. Then if the first uchar4 of the first permuted sequence (i.e. 'REGN') is requested by thread T_{0}, the first uchar4 of the second permuted sequence (i.e. 'ARNE') is requested by T_{1}, and so on. This results in reading a consecutive memory (each thread reads 4 bytes) by a warp of threads in a single access. Thus the global memory access is coalesced, and therefore high performance is achieved.
As the sequences remain unchanged during the alignment, they can be thought of as readonly data, which can be bound to texture memory. For read patterns, texture memory fetches is a better alternative to global memory reads because of texture memory cache, which can further improve the performance.
Occupancy maximization
Hiding global memory latency is very important to achieve high performance on the GPU. This can be done by creating enough threads to keep the CUDA cores always occupied while many other threads are waiting for global memory accesses [39]. GPU occupancy, as defined below, is a metric to determine how effectively the hardware is kept busy:
where T_{max }is maximum number of resident threads that can be launched on a streaming multiprocessor (SM) (which is a constant for a specific GPU), T_{num }is the number of active threads per block and B is the number of active blocks per Streaming Multiprocessor (SM).
B also depends upon the GPU physical limitations (e.g. the amount of registers, shared memory and threads supported in each model). It can be given in the following way:
where B_{hw }is the hardware limit (only 8 blocks are allowed per SM), and B_{reg}, B_{shr}, are the potential blocks determined by the available registers, shared memory, respectively and B_{user }is blocks set by the user. Note that B ≤ B_{hw}, therefore we obtain:
Based on the above analysis, higher occupancy can be pursued according to the following rules:
• To avoid wasting computation on underpopulated warps, the number of threads per block should be chosen as a multiple of the warp size (currently, 32).
• Total number of active threads per SM should be close to maximum number of resident threads per multiprocessor (T_{max}). In short, we should use all the available threads if possible.
• To achieve 100% occupancy, (B_{hw }× T_{num})/T_{max }≥ 1. Hence, setting T_{num }such that T_{num}≥ T_{max}/B_{hw }is preferable.
Substitution matrix customization
The SmithWaterman (SW) algorithm [40] looks up the substitution score matrix very frequently while computing the alignment scores. Lowering the number of lookups obviously reduces the overall execution time. As suggested by [41], performance improvement could further be obtained by customizing the matrix. Usually, the substitution score matrix is indexed by querysequence symbol and the subjectsequence symbol as in BLOSUM62. Another way is to index the substitution score matrix by querysequence position and the subjectsequence symbol. We refer to this customized substitution matrix as local query profile. For example, let us consider a query sequence Q of length L over a set of residue alphabet ω. For each residue, we store a substitution score for every query sequence position. For example, in Figure 2 (a) the substitution scores for matching residue 'A' with each symbol of the query sequence Q are stored in the first row, and the substitution scores for matching residue 'C' are stored in the next row, and so on. Here the substitution score of the residue of subject sequence against the same symbol of query sequence at different position is always same.
Figure 2. The customized substitution matrix. (a) Local query profile: the same symbol (e.g., 'A') at different position in query has same scores. (b) Position specific query profile: the same symbol (e.g., 'A') at different position in query has different scores.
By contrast, positionspecific score matrix (PSSM) [4] offers a variation of this approach. We call this a position specific query profile in which scores are further refined. In this case the same residue (e.g. 'A') appearing at different position of query sequence has different scores, as shown in Figure 2 (b). For a given query sequence, PSSM can be preconstructed by PSIBLAST [3,4].
Using the customized approach, in both cases, the dimension ω × ω of the substitution matrix is replaced by a queryspecific matrix of dimension ω×L. This increases the memory requirement compared to the original layout, but reduces the lookups of the substitution matrix significantly as explained below.
In traditional matrix layout, if the substitution scores of a subject sequence residue (i.e., 'A') against query sequence residues (i.e., 'A','K','L','G') are required subsequently, GPU has to look up the matrix four times one by one, because the score S[A][A], S[A][K], S[A][L], S[A][G] are usually stored far from each other. In contrast, if we use CUDA builtin vector data type int4 (which packs 4 integers together) to store the customized substitution matrix in the texture memory, the same four scores are stored at neighboring memory locations (as shown in Figure 2 (b). By reading an int4, the GPU can get these four scores S[A][i] in one instruction, where i is the position of residues in query sequence. This reduces the number of lookups by a factor of 4, therefore, a higher performance can be obtained.
In essence, both coalesced memory layout and customized substitution matrix optimization heighten performance of memory access by improving data locality among threads in GPU programming.
Singlepair PSSE implementation
As mentioned previously, to simulate the required random sequences, a lot of random numbers are needed, since each s_{2 }copy has to be permuted thousands of times. Although the most widely used pseudorandom number generators such as linear congruential generators (LCGs) can meet our requirements, the current version of CUDA does not support calls to the host random function. Hence, we develop an efficient random number generator similar to lrand48() on GPUs.
The singlepair PSSE processes only one pair of query and subject sequences. The idea of computing singlepair PSSE is as follows. Given the query sequence s_{1 }and the subject sequence s_{2}, to compute PSSE, thousands (say N) of randomly permutations of s_{2 }are needed. To obtain these N random sequences of s_{2}, first, a set of N random numbers is generated in CPU as described in previous section. These numbers are then transferred to GPU and considered as seeds by the threads of GPU. Each thread then generates new random numbers using its own seed and swap the symbols of s_{2 }accordingly to obtain a permuted sequence. Thus the N random permutations of s_{2 }are obtained in parallel. The algorithm then uses SW algorithm to compute alignment score of s_{1 }and the N permuted copies of s_{2 }in parallel on GPU. The scores are then transferred to CPU for fitting. The details of computing singlepair PSSE is outlined by Algorithm 1. Note that Step 4 in Algorithm 1 uses the optimized memory layout, explained previously in detail, for s_{2 }and its N permuted copies.
SmithWaterman is a dynamic programming algorithm to identify the optimal local alignment between a pair of sequences. In general, there are two different methods for parallelizing the alignment task [33]. The first method is regarded as intertask parallelism. In this case, each thread performs alignment of one pair of sequences. Hence, in a thread block, multiple alignment tasks are performed in parallel [42]. The second one is intratask parallelism. Here, alignment of each pair of sequences is assigned to a block of threads, splitting the whole task into a number of subtasks. Each thread in the thread block then performs its own subtasks, cooperating to exploit the parallel characteristics of cells in the antidiagonals of the local alignment matrix [43]. We use a similar alignment kernel as proposed in [34] with some modifications for further improvement. It is worthwhile to mention that Step 7 describes fitting, which is implemented on the CPU. This is because it involves recursion that is not supported very well on GPU.
Multipair PSSE implementation
The multipair PSSE processes multiple queries and subject sequences. We can obtain some hints about improving the performance by analyzing the singlepair PSSE implementation (which we do in a later section). Owing to a dramatic increase of data set for multiplepairs implementation, there are some significant differences relative to performance of GPU hardware between the two implementations. Through analyzing our experimental results of the singlepair PSSE, we observe that intertask parallelism performs better than intratask parallelism (results shown later). Hence, we consider intertask parallelism in computing multipair PSSE. Based on the guidelines for optimizing memory and occupancy described earlier, we compare three implementation strategies.
Algorithm 1: Pseudocode of singlepair PSSE
Input: (s_{1}, s_{2})  Sequencepair; M  Substitution matrix; G  Gap opening penalty; GE  Gap extension penalty; N  Number of permutes;
Output: pss Pairwise statistical significance
1. Initialization
(a) Generate a number N of random numbers in CPU;
(b) Copy LCG seeds to GPU global memory;
(c) Copy s_{1 }and s_{2 }to GPU global memory;
2. Generate random numbers (RNs) using LCG in GPU
3. Permute sequence s_{2 }N times using the RNs
4. Reorganize sequences s_{2 }and its N times permuted copies as shown in Figure 2 (b) if using intertask parallel SmithWaterman();
5. Align s_{2 }and its N permuted copies against s_{1 }using SmithWaterman() (intertask or intratask);
6. Transfer N alignment scores from GPU into CPU;
7. Fitting in CPU
(a) (K, λ) ← EV DCensordMLFit(Scores);
(b) pss ← 1  exp(Kmne ^{λx});
return pss
Intuitive strategy
Given Q query sequences and S subject sequences, the intuitive strategy is to simply perform the same 'singlepair' procedure Q × S times. In other words, in each iteration, we send a single pair of query and subject sequences to the GPU. The GPU processes that pair and returns the result to the CPU. The same procedure is repeated for all query and subject sequence pairs. However, this strategy suffers from low occupancy. We analyze the cause along with its performance results in the next section.
Data reuse strategy
In the first strategy, the subject sequences from the database are permuted for every querysubject sequence pair. Hence, the same subject sequence is permuted every time it is sent to the GPU. A better strategy is to create permutations of each subject sequence only once and reuse them to align with all the queries. "One permutation, all queries" is the idea of the second strategy. Because of the reuse of permuted sequences, higher performance is expected than the first strategy. However, the occupancy of GPU, to be shown in the next section, is still not elevated.
Adaptively tilebased strategy
The low occupancy of the above two strategies is due to the underutilized computing power of GPU. In addition, these two strategies do not work well when the size of subject sequence database becomes too big to be fitted into GPU global memory. For instance, if the size of the original subject sequence database is 5 MB, it becomes 5000 MB when each of the sequences is permuted, say, 1000 times. This prohibits transfer of all the subject sequences to GPU at the same time. We therefore need an optimal number of subject sequences to be shipped to GPU keeping in mind that the subject sequences and their permuted copies fit in global memory. Moreover, the number of new generated sequences should be enough to keep all CUDA cores busy, i.e., keep a high occupancy of GPU, which is very important to harness the GPU power.
Herein we develop a memory tiling technique that is selftuning based on the hardware configuration and can achieve a closetooptimal performance. The idea behind the technique is as follows. In outofcore fashion, the data in the main memory is divided into smaller chunks called tiles and transferred to the GPU global memory. In our case, the tiles are the number of subject sequences to be transferred to the GPU at a time. The tile size T can be calculated using the following equation:
where SM_{num }is the total number of SMs in GPU, T_{max }is the maximum number of resident threads per SM, and N is the number of permutations.
In Tesla C2050 used in our experiments, there are 14 SMs and the maximum number of resident threads per SM is 1024. Let N = 1000, then, T = ⌊14 × 1024/1000⌋ = 14, which means that there are 14 distinct subject sequences to be transferred to the GPU's global memory at a time. Based on the second strategy (i.e., data reuse), 14 subject sequences and their permuted copies are aligned against one query sequence at a time, until all the query sequences are processed. As a result, 14 × 1000 alignment scores in total are obtained in each round, which are subsequently transferred to CPU for fitting. The CPU takes the 1000 alignment scores for each subjectquery sequence pair and uses them to compute the corresponding pss. The tilebased strategy has been described in Figure 3.
Figure 3. Adaptive tilebased strategy. The optimal tile size T can be calculated according to the hardware configuration of GPU. After calculated, the T subject sequences together are transferred to GPU global memory. Permutations and alignments are done in parallel in GPU. Then T × 1000 alignment scores are moved back to CPU for T fittings.
Results and discussion
All our experiments have been are carried out using Intel^{© }Core™i7 CPU 920 processors running @2.67 GHz. The system has 4 cores, 4 GB of memory, dual Tesla C2050 GPU (each with 448 CUDA cores) and is running a 64bit Linuxbased operating system. Our program has been compiled using gcc 4.4.1 and CUDA 4.0. The sequence data used in this work comprises of a nonredundant subset of the CATH 2.3 database [44]. This dataset consists of 2771 domain sequences as our subject library and includes 86 CATH queries as our query set. We derive PSSMs for the 86 test queries against the nonredundant protein database using PSIBLAST (provided along with the BLAST+ package [3]) over a maximum of five iterations and with other default parameters. BLOSUM62 and PSSMs have been used as the scoring matrices with affine gap penalty of 10 + 2k for a gap of length k. We permute the 2771 subject sequences N = 1000 time as in several previous studies [21,35,36,45].
Singlepair PSSE result analysis
In singlepair PSSE implementation, we use both the intratask and intertask parallelism methods. We choose four pairs of query and subject sequences of length 200, 400, 800, and 1600 from CATH 2.3 database. We compute PSSE for all these pairs using 64, 128, 256, and 512 threads per block. The experimental results have been plotted in Figure 4. All speedups are computed over the corresponding the CPU implementation. We observe that the intertask parallel implementation performs significantly better than the intratask parallel implementation. Employing further optimizations and newer version of CUDA, both methods show a higher speedup compared to our previous results [31].
Figure 4. Intratask and intertask parallelism for singlepair PSSE. We choose four pairs of query and subject sequences of length 200, 400, 800, and 1600 from CATH database.
For intratask parallel implementation, the best speedups for sequences of length 200, 400, and 800 are 24.57×, 35.09×, and 43.63×, respectively. All these are obtained using 128 threads per block. But the best speedup for sequences of length 1600 is 46.34× and used 256 threads per block. These results tell us that it is hard to find a general rule to parameterize the settings to achieve peak performance. A possible reason is that many factors, such as the number of threads per block, the available registers, and shared memory, may contradict with each other.
The intratask parallel implementation creates enough thread blocks (one block for each alignment task) to keep the occupancy high. However, the SW alignment matrix must be serially computed from the first to the last antidiagonal. Only the cells of the alignment matrix belonging to the same antidiagonal can be computed in parallel. In this case, most threads in a block have no work to do. As a result, the performance of the intratask parallel implementation is worse than that of the intertask parallel implementation even though it has higher occupancy.
In contrast, intertask parallelism would have a total of 1000 threads (one for each alignment task). If each block contains 64 threads, then the total number of blocks is B = ⌈1000/64⌉ = 16. The assignment of 16 blocks to 14 available SMs will result in 12 SMs with one active block and two SMs with two blocks. Therefore, the occupancy for the two SMs with two blocks is (2 × 64)/1024 = 12.5%. For the 12 SMs with only one block its occupancy is (1 × 64)/1024 = 6.25%. Because of the low occupancy, there are not sufficient threads to keep CUDA cores busy when the global memory is accessed. As a result, the latencyhiding capabilities of this method are limited. As the number of threads per block T_{num }increases, the total active blocks (B) decreases and some SMs even have no blocks assigned. For example, if T_{num }= 512, then B = ⌈1000/512⌉ = 2. Hence only two SMs are working with one block each.
However, these SMs have a higher occupancy (1 × 512)/1024 = 50%, which compensates for the decrease in the number of working SMs. Consequently, even though there is a reduction in speedups, it is not directly proportional to the reduction in the number of active SMs. When the number of threads per block is 64, for the sequence length of 200, 400, 800, and 1600, the best speedups are 52.14×, 56.36×, 73.39×, and 73.16×, respectively.
In brief, getting performance out of a GPU is about keeping the CUDA cores busy. Both intertask parallelism and intratask parallelism, under the situations of small data set being processed, seem to fail in this regard, but not necessarily for the same reason.
Multipair PSSE result analysis
In the intuitive and data reuse strategies, most of the SMs suffer from the same low occupancy as the singlepair PSSE implementation using the intertask parallelism method. As expected, we observe poor performance for both strategies, as shown in Figure 5. The data reuse strategy produces higher performance than the intuitive strategy, because the number of permutations is reduced by a factor of Q, the number of query sequences. To alleviate the low occupancy problem, our proposed tilebased strategy uses a carefully tuned tile size that effectively increase the occupancy. Recall that, as a result of tiling, we send 14 subject sequences and one query sequence to the GPU in each round. After the permutation step, we have 14 × 1000 alignments to be performed, assuming each subject sequence is permuted 1000 times. Consequently, each SM has 1000 alignments to perform. Also, note that, the maximum number of blocks that can be launched simultaneously on an SM is 8. Therefore, when the number of threads per block is 64, the number of blocks per SM is min(⌈1024/64⌉, 8) = 8, resulting in an occupancy of (8 × 64)/1024 = 50%, based on Equation (2) and (3). This is a significant improvement over the intuitive and datareuse strategies. The maximum theoretical occupancy of an SM for the three strategies is given in Table 1.
Figure 5. Performance of three strategies for multipair PSSE. All experiments are run using 2771 subject sequences and 86 query sequences from the CATH 2.3 database.
Table 1. The maximum occupancy for three strategies
Due to a high occupancy, the tilebased strategy using single (dual) GPU(s) achieves speedups of 240.31 (338.63)×, 243.51 (361.63)×, 240.71 (363.99)×, and 243.84 (369.95)× using 64, 128, 256, and 512 threads per block, respectively, as shown in Figure 5.
The billion cell updates per second (GCUPS) value is another commonly used performance measure in bioinformatics [33]. The tilebased strategy using single (dual) GPU(s) achieves performance results in the range of 16.55 (23.34) to 16.79 (25.50) GCUPS, as shown in Figure 5. Although a direct comparison across different GPU implementations and hardware in not fair, just for the sake of completeness, we cited below the performance in GCUPS reported by some existing implementations of the SmithWaterman alignment task. It is worthwhile to mention here that [46] has a peak performance of 4.65 to 8.99 GCUPS for various query lengths on an NVIDIA 9800 and [42] has a peak performance of 3.5 GCUPS on a workstation running two GeForce 8800 GTX. Our implementation show higher GCUPS.
In summary, low occupancy is known to interfere with the ability to hide latency on memorybound kernels, causing performance degradation. However, increasing occupancy does not necessarily increase performance. In general, once a 50% occupancy is achieved, further optimization to gain additional occupancy has little effect on performance [37]. Our experiments verify this claim.
Since the GPU implementation presented in this paper uses the same algorithm for PSSE (specifically the SmithWaterman algorithm for getting alignment scores, the same fitting routine to get statistical parameters K and λ, and the same algorithm parameters) as in [11], the retrieval accuracy of the proposed implementation is expected to be identical to [11]. According to [11], pairwise statistical significance with standard substitution matrices performs at least comparable or significantly better than database statistical significance (using BLAST, PSIBLAST, and SSEARCH). Moreover, pairwise statistical significance with PSSMs performs significantly better than using standard substitution matrices, and also better than PSIBLAST using preconstructed positionspecific substitution matrices. More details can be found in [11]. The implementation of the proposed method and related programs in CUDA is available for free academic use at http://cucis.ece.northwestern.edu/projects/PSSE/ webcite.
Conclusions
In this paper, we present a high performance accelerator to estimate the pairwise statistical significance of local sequence alignment, which supports standard substitution matrix like BLOSUM62 as well as PSSMs.
Our accelerator harvests the computation power of manycore GPUs by using CUDA, which results in high endtoend speedups for PSSE. We also demonstrate a comparative performance analysis of singlepair and multipair implementations. The proposed optimizations and efficient framework are applicable to a wide variety of nextgeneration sequencing comparison based applications, such as, DNA sequence mapping and database search. As the size of biological sequence databases are increasing rapidly, even more powerful high performance computing accelerator platforms are expected to be more and more common and imperative for sequence analysis, for which our work can serve as a meaningful stepping stone.
Abbreviations
BLAST: Basic Local Alignment Search Tool; BLOSUM: BLOcks of Amino Acid SUbstitution Matrix; CUDA: Compute Unified Device Architecture; EVD: Extreme Value Distribution; GCUPS: Billion Cell Updates Per Second; GPU: Graphics Processing Unit; PSSM: Position Specific Scoring Matrix; PSSE: Pairwise Statistical Significance Estimation; PSI BLAST: Position Specific Iterative BLAST; PSA: Pairwise Sequence Alignment; PSS: Pairwise Statistical Significance; SIMT: Single: Instruction, Multiple: Thread; SSM: Standard Substitution Matrix SM: Streaming Multiprocessor; SW: SmithWaterman.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
YZ conceptualized the study, carried out the design and implementation of the algorithm, analyzed the results and drafted the manuscript; SM, AA, MMAP, WL, ZQ, and AC participated analysis of the results and contributed to revise the manuscript. All authors read and approved the final manuscript.
Acknowledgements
This work is supported in part by NSF award numbers CCF0621443, OCI0724599, CCF0833131, CNS0830927, IIS0905205, OCI0956311, CCF0938000, CCF1043085, CCF1029166, and OCI1144061, and in part by DOE grants DEFC0207ER25808, DEFG0208ER25848, DESC0001283, DESC0005309, and DESC0005340.This work is also supported in part by National Nature Science Foundation of China (NO. 60973118, NO. 61133016), Sichuan provincial science and technology agency (NO. M110106012010HH2027), Ministry of Education of China (NO. 708078), and China Scholarship Council.
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 5, 2012: Selected articles from the First IEEE International Conference on Computational Advances in Bio and medical Sciences (ICCABS 2011): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S5.
References

Roos DS: COMPUTATIONAL BIOLOGY: Bioinformaticstrying to swim in a sea of data.
Science 2001, 291:12601261. PubMed Abstract  Publisher Full Text

Yooseph S, Sutton G, Rusch D, Halpern A, Williamson S, Remington K, Eisen J, Heidelberg K, Manning G, Li W, et al.: The sorcerer II global ocean sampling expedition: expanding the universe of protein families.
PLoS Biology 2007, 5(3):e16. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos JS, Bealer K, Madden TL: BLAST+: architecture and applications.
BMC Bioinformatics 2009, 10:421. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Altschul S, Madden T, Schäffer A, Zhang J, Zhang Z, Miller W, Lipman D: Gapped BLAST and PSIBLAST: a new generation of protein database search programs.
Nucleic Acids Research 1997, 25(17):33893402. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Schäffer AA, Aravind L, Madden TL, Shavirin S, Spouge JL, Wolf YI, Koonin EV, Altschul SF: Improving the accuracy of PSIBLAST protein database searches with compositionbased statistics and other refinements.
Nucleic Acids Research 2001, 29(14):29943005. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Yu Y, Altschul S: The construction of amino acid substitution matrices for the comparison of proteins with nonstandard compositions.
Bioinformatics 2005, 21(7):902911. PubMed Abstract  Publisher Full Text

Pearson WR: Flexible sequence similarity searching with the FASTA3 program package.
Methods in molecular biology 2000, 132:185219. PubMed Abstract

Mott R: Accurate formula for pvalues of gapped local sequence and profile alignments.
Journal of Molecular Biology 2000, 300:649659. PubMed Abstract  Publisher Full Text

Pearson W, Lipman D: Improved tools for biological sequence comparison.
Proceedings of the National Academy of Sciences 1988, 85(8):2444. Publisher Full Text

Pagni M, Jongeneel C: Making sense of score statistics for sequence alignments.
Briefings in Bioinformatics 2001, 2:5167. PubMed Abstract  Publisher Full Text

Agrawal A, Huang X: Pairwise statistical significance of local sequence alignment using sequencespecific and positionspecific substitution matrices.
IEEE/ACM Transactions on Computational Biology and Bioinformatics 2011, 8:194205. PubMed Abstract  Publisher Full Text

Mitrophanov A, Borodovsky M: Statistical significance in biological sequence analysis.
Briefings in Bioinformatics 2006, 7:224. PubMed Abstract  Publisher Full Text

Pearson W: Empirical statistical estimates for sequence similarity searches.
Journal of Molecular Biology 1998, 276:7184. PubMed Abstract  Publisher Full Text

Altschul SF, Bundschuh R, Olsen R, Hwa T: The estimation of statistical parameters for local alignment score distributions.
Nucleic Acids Research 2001, 29(2):351361. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Agrawal A, Brendel V, Huang X: Pairwise statistical significance and empirical determination of effective gap opening penalties for protein local sequence alignment.
International Journal of Computational Biology and Drug Design 2008, 1(4):347367. PubMed Abstract  Publisher Full Text

Poleksic A, Danzer JF, Hambly K, Debe DA: Convergent island statistics: a fast method for determining local alignment score significance.
Bioinformatics 2005, 21(12):28272831. PubMed Abstract  Publisher Full Text

Agrawal A, Huang X: PSIBLAST PairwiseStatSig: reordering PSIBLAST hits using pairwise statistical significance.
Bioinformatics 2009, 25(8):10821083. PubMed Abstract  Publisher Full Text

Agrawal A, Huang X: Pairwise statistical significance of local sequence alignment using multiple parameter sets and empirical justification of parameter set change penalty.
BMC bioinformatics 2009, 10(Suppl 3):S1. PubMed Abstract  BioMed Central Full Text

Sierk ML, Smoot ME, Bass EJ, Pearson WR: Improving pairwise sequence alignment accuracy using nearoptimal protein sequence alignments.
BMC Bioinformatics 2010, 11:146. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Agrawal A, Choudhary A, Huang X: Sequencespecific sequence comparison using pairwise statistical significance. In Software Tools and Algorithms for Biological Systems, Volume 696 of Advances in Experimental Medicine and Biology. Springer New York; 2011:297306.

Agrawal A, Brendel V, Huang X: Pairwise statistical significance versus database statistical significance for local alignment of protein sequences.

Agrawal A, Choudhary A, Huang X: Derived distribution points heuristic for fast pairwise statistical significance estimation.
Proceedings of the First ACM International Conference on Bioinformatics and Computational Biology, ACM 2010, 312321.

Agrawal A, Misra S, Honbo D, Choudhary AN: Parallel pairwise statistical significance estimation of local sequence alignment using Message Passing Interface library.
Concurrency and Computation: Practice and Experience 2011, 23(17):22692279. Publisher Full Text

Yu Y, Gertz E, Agarwala R, Schäffer A, Altschul S: Retrieval accuracy, statistical significance and compositional similarity in protein sequence database searches.
Nucleic Acids Research 2006, 34(20):5966. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Zuyderduyn S: Statistical analysis and significance testing of serial analysis of gene expression data using a Poisson mixture model.
BMC bioinformatics 2007, 8:282. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Aleksandar P: Island method for estimating the statistical significance of profileprofile alignment scores.

Karlin S, Altschul S: Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes.
Proceedings of the National Academy of Sciences 1990, 87(6):2264. Publisher Full Text

O R, B R, H T: Rapid assessment of extremal statistics for gapped local alignment.
Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology 1999, 211222.

Eddy SR: Maximum likelihood fitting of extreme value distributions.
1997.
[citeseer.ist.psu.edu/370503. html]. [Unpublished work]

Waterman M, Vingron M: Rapid and accurate estimates of statistical significance for sequence data base searches.
Proceedings of the National Academy of Sciences 1994, 91(11):46254628. Publisher Full Text

Zhang Y, Misra S, Honbo D, Agrawal A, keng Liao W, Choudhary A: Efficient pairwise statistical significance estimation for local sequence alignment using GPU.
IEEE 1st International Conference on Computational Advances in Bio and Medical Sciences (ICCABS) 2011, 226231.

Samuel A, Malcolm M, Aaron G, Kevin G, Mahesh V, David R, Cesar A, James W, Owen W, Florian F: CloVR: A virtual machine for automated and portable sequence analysis from the desktop using cloud computing.
BMC Bioinformatics 2011, 12:115. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Liu Y, Maskell DL, Schmidt B: CUDASW++: optimizing SmithWaterman sequence database searches for CUDAenabled graphics processing units.
BMC Research Notes 2009, 2:73. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Liu Y, Schmidt B, Maskell DL: CUDASW++2.0: enhanced SmithWaterman protein database search on CUDAenabled GPUs based on SIMT and virtualized SIMD abstractions.
BMC Research Notes 2010, 3:93. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Agrawal A, Misra S, Honbo D, Choudhary AN: MPIPairwiseStatSig: parallel pairwise statistical significance estimation of local sequence alignment.
Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing 2010, 470476.

Honbo D, Agrawal A, Choudhary AN: Efficient pairwise statistical significance estimation using FPGAs.

Ryoo S, Rodrigues C, Baghsorkhi S, Stone S, Kirk D, Hwu W: Optimization principles and application performance evaluation of a multithreaded GPU using CUDA.
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ACM 2008, 7382.

Smith T, Waterman M: Identification of common molecular subsequences.
Journal of Molecular Biology 1981, 147:195197. PubMed Abstract  Publisher Full Text

Rognes T, Seeberg E: Sixfold speedup of SmithWaterman sequence database searches using parallel processing on common microprocessors.
Bioinformatics 2000, 16(8):699706. PubMed Abstract  Publisher Full Text

Manavski S, Valle G: CUDA compatible GPU cards as efficient hardware accelerators for SmithWaterman sequence alignment.
BMC Bioinformatics 2008, 9(Suppl 2):S10. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Liu W, Schmidt B, Voss G, MüllerWittig W: Streaming algorithms for biological sequence alignment on GPUs.
IEEE Transactions on Parallel and Distributed Systems 2007, 18(9):12701281.

Sierk ML, Pearson WR: Sensitivity and selectivity in protein structure comparison.
Protein Science 2004, 13(3):773785. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Zhang Y, Patwary M, Misra S, Agrawal A, Liao W, Choudhary AN: Enhancing parallelism of pairwise statistical significance estimation for local sequence alignment.

Ligowski L, Rudnicki W: An efficient implementation of Smith Waterman algorithm on GPU using CUDA, for massively parallel scanning of sequence databases.
IEEE International Symposium on Parallel & Distributed Processing 2009, 18. PubMed Abstract  Publisher Full Text