School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China

Department of Electrical Engineering and Computer Science, Northwestern University, Evanston, USA

Abstract

Background

Pairwise statistical significance has been recognized to be able to accurately identify related sequences, which is a very important cornerstone procedure in numerous bioinformatics applications. However, it is both computationally and data intensive, which poses a big challenge in terms of performance and scalability.

Results

We present a GPU implementation to accelerate pairwise statistical significance estimation of local sequence alignment using standard substitution matrices. By carefully studying the algorithm's data access characteristics, we developed a tile-based scheme that can produce a contiguous data access in the GPU global memory and sustain a large number of threads to achieve a high GPU occupancy. We further extend the parallelization technique to estimate pairwise statistical significance using position-specific substitution matrices, which has earlier demonstrated significantly better sequence comparison accuracy than using standard substitution matrices. The implementation is also extended to take advantage of dual-GPUs. We observe end-to-end speedups of nearly 250 (370) × using single-GPU Tesla C2050 GPU (dual-Tesla C2050) over the CPU implementation using Intel^{© }Core™i7 CPU 920 processor.

Conclusions

Harvesting the high performance of modern GPUs is a promising approach to accelerate pairwise statistical significance estimation for local sequence alignment.

Background

Introduction

The past decades have witnessed dramatically increasing trends in the quantity and variety of publicly available genomic and proteomic sequence data. Dealing with the massive data and making sense of them are big challenges in bioinformatics

Pairwise statistical significance

Consider a pair of sequences _{1 }and _{2 }of lengths _{2}, pairwise statistical significance (PSS) of the two sequences is calculated by the following function

Through permuting _{2 }_{1 }and _{2}. The EVD describes an approximate distribution of optimal scores of a gapless alignment

where

Note that the above distribution is for a gapless alignment. For the cases of gapped alignment, although no asymptotic score distribution has yet been established analytically, computational experiments strongly indicate these scores still roughly follow Gumbel law

In addition to not needing a database to estimate the statistical significance of an alignment, pairwise statistical significance is shown to be more accurate than database statistical significance reported by popular database search programs like BLAST, PSI-BLAST, and SSEARCH

Contributions

We present a GPU implementation to accelerate pairwise statistical significance estimation of local sequence alignment using standard substitution matrices. Our parallel implementation makes use of CUDA (Compute Unified Device Architecture) parallel programming paradigm. Our design uses an efficient data reorganization method to produce coalesced global memory access, and a tiled-based memory scheme to increase the GPU occupancy, a key measure for GPU efficiency

We further extend the parallelization technique to estimate pairwise statistical significance using position-specific substitution matrices, which has earlier demonstrated significantly better sequence comparison accuracy than using standard substitution matrices

The performance evaluation was carried out on NVIDIA Telsa C2050 GPU. For multi-pair PSSE implementation, we observe nearly 250 (370)× speedups using a single-GPU Tesla C2050 GPU (dual-Tesla C2050) over the CPU implementation using an Intel^{© }Core ™i7 CPU 920 processor. The proposed optimizations and efficient framework for PSSE, which have direct applicability to speedup homology detection, are also applicable to many other sequence comparison based applications, such as DNA sequence mapping, phylogenetic tree construction and database search especially in the era of next-generation sequencing.

Methods

In this section we present GPU implementations both for single-pair PSSE and multi-pair PSSE. Along with the methodological details, we also discuss several performance optimization techniques used in this paper, such as, GPU memory access optimization, occupancy maximization, and substitution matrix customization. These techniques have significantly speeded up PSSE.

Careful analysis of the data pipelines of PSSE shows that the computation of PSSE can be decomposed into three computation kernels: Permutation, Alignment and Fitting. Permutation and Alignment comprise the overwhelming majority (a more than 99.8%) of the overall execution time

Design

GPU memory access optimization

It is especially important to optimize global memory access as its bandwidth is low and its latency is hundreds of clock cycles ^{th }^{th }

After permutation, if the sequence _{2 }and its

Two GPU memory layout strategies

**Two GPU memory layout strategies**. (a) The intuitive layout that one sequence is appended after another. (b) The reorganized layout such that the uchar4 at the same indices in different sequences stay at neighboring positions.

Considering inter-task parallelism, where each thread works on the alignment of one of the permuted copies of _{2 }to _{1}, in this layout the gap between the memory accesses by the neighboring threads is at least the length of the sequence. For example, in the intuitive layout, if thread _{0 }accesses the first residue (i.e., 'R'), and thread _{1 }accesses the first residue (i.e., 'E'), the gap between the access data is

We therefore reorganize the layout of sequence data in memory to obtain coalesced reads. Now, to achieve coalesced access, we reorganize layout of sequences in memory as aligned structure of arrays, as shown in Figure _{0}, the first _{1}, and so on. This results in reading a consecutive memory (each thread reads 4 bytes) by a warp of threads in a single access. Thus the global memory access is coalesced, and therefore high performance is achieved.

As the sequences remain unchanged during the alignment, they can be thought of as read-only data, which can be bound to texture memory. For read patterns, texture memory fetches is a better alternative to global memory reads because of texture memory cache, which can further improve the performance.

Occupancy maximization

Hiding global memory latency is very important to achieve high performance on the GPU. This can be done by creating enough threads to keep the CUDA cores always occupied while many other threads are waiting for global memory accesses

where _{max }_{num }

where _{hw }_{reg}_{shr}_{user }_{hw}

Based on the above analysis, higher occupancy can be pursued according to the following rules:

• To avoid wasting computation on under-populated warps, the number of threads per block should be chosen as a multiple of the warp size (currently, 32).

• Total number of active threads per SM should be close to maximum number of resident threads per multiprocessor (_{max}

• To achieve 100% occupancy, (_{hw }_{num}_{max }_{num }_{num}_{max}_{hw }

Substitution matrix customization

The Smith-Waterman (SW) algorithm

The customized substitution matrix

**The customized substitution matrix**. (a) Local query profile: the same symbol (e.g., 'A') at different position in query has same scores. (b) Position specific query profile: the same symbol (e.g., 'A') at different position in query has different scores.

By contrast, position-specific score matrix (PSSM)

Using the customized approach, in both cases, the dimension |

In traditional matrix layout, if the substitution scores of a subject sequence residue (i.e., 'A') against query sequence residues (i.e., 'A','K','L','G') are required subsequently, GPU has to look up the matrix four times one by one, because the score

In essence, both coalesced memory layout and customized substitution matrix optimization heighten performance of memory access by improving data locality among threads in GPU programming.

Single-pair PSSE implementation

As mentioned previously, to simulate the required random sequences, a lot of random numbers are needed, since each _{2 }copy has to be permuted thousands of times. Although the most widely used pseudorandom number generators such as linear congruential generators (LCGs) can meet our requirements, the current version of CUDA does not support calls to the host random function. Hence, we develop an efficient random number generator similar to

The single-pair PSSE processes only one pair of query and subject sequences. The idea of computing single-pair PSSE is as follows. Given the query sequence _{1 }and the subject sequence _{2}, to compute PSSE, thousands (say _{2 }are needed. To obtain these _{2}, first, a set of _{2 }accordingly to obtain a permuted sequence. Thus the _{2 }are obtained in parallel. The algorithm then uses SW algorithm to compute alignment score of _{1 }and the _{2 }in parallel on GPU. The scores are then transferred to CPU for fitting. The details of computing single-pair PSSE is outlined by Algorithm 1. Note that Step 4 in Algorithm 1 uses the optimized memory layout, explained previously in detail, for _{2 }and its

Smith-Waterman is a dynamic programming algorithm to identify the optimal local alignment between a pair of sequences. In general, there are two different methods for parallelizing the alignment task

Multi-pair PSSE implementation

The multi-pair PSSE processes multiple queries and subject sequences. We can obtain some hints about improving the performance by analyzing the single-pair PSSE implementation (which we do in a later section). Owing to a dramatic increase of data set for multiple-pairs implementation, there are some significant differences relative to performance of GPU hardware between the two implementations. Through analyzing our experimental results of the single-pair PSSE, we observe that inter-task parallelism performs better than intra-task parallelism (results shown later). Hence, we consider inter-task parallelism in computing multi-pair PSSE. Based on the guidelines for optimizing memory and occupancy described earlier, we compare three implementation strategies.

**Algorithm 1: **Pseudo-code of single-pair PSSE

**Input**: (_{1}, _{2}) - Sequence-pair;

**Output**:

1. Initialization

(a) Generate a number

(b) Copy LCG seeds to GPU global memory;

(c) Copy _{1 }and _{2 }to GPU global memory;

2. Generate random numbers (RNs) using LCG in GPU

3. Permute sequence _{2 }

4. Reorganize sequences _{2 }and its

5. Align _{2 }and its _{1 }using Smith-Waterman() (

6. Transfer

7. Fitting in CPU

(a) (

(b) ^{λx}

**return **

Intuitive strategy

Given

Data reuse strategy

In the first strategy, the subject sequences from the database are permuted for every query-subject sequence pair. Hence, the same subject sequence is permuted every time it is sent to the GPU. A better strategy is to create permutations of each subject sequence only once and reuse them to align with all the queries. "One permutation, all queries" is the idea of the second strategy. Because of the reuse of permuted sequences, higher performance is expected than the first strategy. However, the occupancy of GPU, to be shown in the next section, is still not elevated.

Adaptively tile-based strategy

The low occupancy of the above two strategies is due to the underutilized computing power of GPU. In addition, these two strategies do not work well when the size of subject sequence database becomes too big to be fitted into GPU global memory. For instance, if the size of the original subject sequence database is 5 MB, it becomes 5000 MB when each of the sequences is permuted, say, 1000 times. This prohibits transfer of all the subject sequences to GPU at the same time. We therefore need an optimal number of subject sequences to be shipped to GPU keeping in mind that the subject sequences and their permuted copies fit in global memory. Moreover, the number of new generated sequences should be enough to keep all CUDA cores busy, i.e., keep a high occupancy of GPU, which is very important to harness the GPU power.

Herein we develop a memory tiling technique that is self-tuning based on the hardware configuration and can achieve a close-to-optimal performance. The idea behind the technique is as follows. In out-of-core fashion, the data in the main memory is divided into smaller chunks called

where _{num }_{max }

In Tesla C2050 used in our experiments, there are 14 SMs and the maximum number of resident threads per SM is 1024. Let

Adaptive tile-based strategy

**Adaptive tile-based strategy**. The optimal tile size

Results and discussion

All our experiments have been are carried out using Intel^{© }Core™i7 CPU 920 processors running @2.67 GHz. The system has 4 cores, 4 GB of memory, dual Tesla C2050 GPU (each with 448 CUDA cores) and is running a 64-bit Linux-based operating system. Our program has been compiled using gcc 4.4.1 and CUDA 4.0. The sequence data used in this work comprises of a non-redundant subset of the CATH 2.3 database

Single-pair PSSE result analysis

In single-pair PSSE implementation, we use both the intra-task and inter-task parallelism methods. We choose four pairs of query and subject sequences of length 200, 400, 800, and 1600 from CATH 2.3 database. We compute PSSE for all these pairs using 64, 128, 256, and 512 threads per block. The experimental results have been plotted in Figure

Intra-task and inter-task parallelism for single-pair PSSE

**Intra-task and inter-task parallelism for single-pair PSSE**. We choose four pairs of query and subject sequences of length 200, 400, 800, and 1600 from CATH database.

For intra-task parallel implementation, the best speedups for sequences of length 200, 400, and 800 are 24.57×, 35.09×, and 43.63×, respectively. All these are obtained using 128 threads per block. But the best speedup for sequences of length 1600 is 46.34× and used 256 threads per block. These results tell us that it is hard to find a general rule to parameterize the settings to achieve peak performance. A possible reason is that many factors, such as the number of threads per block, the available registers, and shared memory, may contradict with each other.

The intra-task parallel implementation creates enough thread blocks (one block for each alignment task) to keep the occupancy high. However, the SW alignment matrix must be serially computed from the first to the last anti-diagonal. Only the cells of the alignment matrix belonging to the same anti-diagonal can be computed in parallel. In this case, most threads in a block have no work to do. As a result, the performance of the intra-task parallel implementation is worse than that of the inter-task parallel implementation even though it has higher occupancy.

In contrast, inter-task parallelism would have a total of 1000 threads (one for each alignment task). If each block contains 64 threads, then the total number of blocks is _{num }_{num }

However, these SMs have a higher occupancy (1 × 512)/1024 = 50%, which compensates for the decrease in the number of working SMs. Consequently, even though there is a reduction in speedups, it is not directly proportional to the reduction in the number of active SMs. When the number of threads per block is 64, for the sequence length of 200, 400, 800, and 1600, the best speedups are 52.14×, 56.36×, 73.39×, and 73.16×, respectively.

In brief, getting performance out of a GPU is about keeping the CUDA cores busy. Both inter-task parallelism and intra-task parallelism, under the situations of small data set being processed, seem to fail in this regard, but not necessarily for the same reason.

Multi-pair PSSE result analysis

In the intuitive and data reuse strategies, most of the SMs suffer from the same low occupancy as the single-pair PSSE implementation using the inter-task parallelism method. As expected, we observe poor performance for both strategies, as shown in Figure

Performance of three strategies for multi-pair PSSE

**Performance of three strategies for multi-pair PSSE**. All experiments are run using 2771 subject sequences and 86 query sequences from the CATH 2.3 database.

The maximum occupancy for three strategies

**Threads/blocks**

**64**

**128**

**256**

**512**

Intuitive occupancy

12.5%

6.25%

6.25%

6.25%

Data-reuse occupancy

12.5%

6.25%

6.25%

6.25%

Tile-based occupancy

50%

100%

100%

100%

Due to a high occupancy, the tile-based strategy using single (dual) GPU(s) achieves speedups of 240.31 (338.63)×, 243.51 (361.63)×, 240.71 (363.99)×, and 243.84 (369.95)× using 64, 128, 256, and 512 threads per block, respectively, as shown in Figure

The billion cell updates per second (GCUPS) value is another commonly used performance measure in bioinformatics

In summary, low occupancy is known to interfere with the ability to hide latency on memory-bound kernels, causing performance degradation. However, increasing occupancy does not necessarily increase performance. In general, once a 50% occupancy is achieved, further optimization to gain additional occupancy has little effect on performance

Since the GPU implementation presented in this paper uses the same algorithm for PSSE (specifically the Smith-Waterman algorithm for getting alignment scores, the same fitting routine to get statistical parameters

Conclusions

In this paper, we present a high performance accelerator to estimate the pairwise statistical significance of local sequence alignment, which supports standard substitution matrix like BLOSUM62 as well as PSSMs.

Our accelerator harvests the computation power of many-core GPUs by using CUDA, which results in high end-to-end speedups for PSSE. We also demonstrate a comparative performance analysis of single-pair and multi-pair implementations. The proposed optimizations and efficient framework are applicable to a wide variety of next-generation sequencing comparison based applications, such as, DNA sequence mapping and database search. As the size of biological sequence databases are increasing rapidly, even more powerful high performance computing accelerator platforms are expected to be more and more common and imperative for sequence analysis, for which our work can serve as a meaningful stepping stone.

Abbreviations

BLAST: Basic Local Alignment Search Tool; BLOSUM: BLOcks of Amino Acid SUbstitution Matrix; CUDA: Compute Unified Device Architecture; EVD: Extreme Value Distribution; GCUPS: Billion Cell Updates Per Second; GPU: Graphics Processing Unit; PSSM: Position Specific Scoring Matrix; PSSE: Pairwise Statistical Significance Estimation; PSI BLAST: Position Specific Iterative BLAST; PSA: Pairwise Sequence Alignment; PSS: Pairwise Statistical Significance; SIMT: Single: Instruction, Multiple: Thread; SSM: Standard Substitution Matrix SM: Streaming Multiprocessor; SW: Smith-Waterman.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

YZ conceptualized the study, carried out the design and implementation of the algorithm, analyzed the results and drafted the manuscript; SM, AA, MMAP, WL, ZQ, and AC participated analysis of the results and contributed to revise the manuscript. All authors read and approved the final manuscript.

Acknowledgements

This work is supported in part by NSF award numbers CCF-0621443, OCI-0724599, CCF-0833131, CNS-0830927, IIS-0905205, OCI-0956311, CCF-0938000, CCF-1043085, CCF-1029166, and OCI-1144061, and in part by DOE grants DE-FC02-07ER25808, DE-FG02-08ER25848, DE-SC0001283, DE-SC0005309, and DE-SC0005340.This work is also supported in part by National Nature Science Foundation of China (NO. 60973118, NO. 61133016), Sichuan provincial science and technology agency (NO. M110106012010HH2027), Ministry of Education of China (NO. 708078), and China Scholarship Council.

This article has been published as part of