<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
	<ui>1471-2105-9-S2-S10</ui>
	<ji>1471-2105</ji>
	<fm>
		<dochead>Research</dochead>
		<bibl>
			<title>
				<p>CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment</p>
			</title>
			<aug>
				<au id="A1" ca="yes">
					<snm>Manavski</snm>
					<mi>A</mi>
					<fnm>Svetlin</fnm>
					<insr iid="I1"/>
					<insr iid="I2"/>
					<email>svetlin.manavski@cribi.unipd.it</email>
				</au>
				<au id="A2">
					<snm>Valle</snm>
					<fnm>Giorgio</fnm>
					<insr iid="I1"/>
					<email>giorgio.valle@unipd.it</email>
				</au>
			</aug>
			<insg>
				<ins id="I1">
					<p>CRIBI, University of Padova, Padova, Italy</p>
				</ins>
				<ins id="I2">
					<p>Elaide, Srl, Padova, Italy</p>
				</ins>
			</insg>
			<source>BMC Bioinformatics</source>
			<supplement>
				<title>
					<p>Italian Society of Bioinformatics (BITS): Annual Meeting 2007</p>
				</title>
				<editor>Graziano Pesole</editor>
				<note>Research</note>
			</supplement>
			<conference>
				<title>
					<p>Italian Society of Bioinformatics (BITS): Annual Meeting 2007</p>
				</title>
				<location>Naples, Italy</location>
				<date-range>26-28 April 2007</date-range>
				<url>http://conferences.ceinge.unina.it/conferenceDisplay.py?confId=2</url>
			</conference>
			<issn>1471-2105</issn>
			<pubdate>2008</pubdate>
			<volume>9</volume>
			<issue>Suppl 2</issue>
			<fpage>S10</fpage>
			<url>http://www.biomedcentral.com/1471-2105/9/S2/S10</url>
			<xrefbib>
				<pubidlist><pubid idtype="pmpid">18387198</pubid><pubid idtype="doi">10.1186/1471-2105-9-S2-S10</pubid>
				</pubidlist></xrefbib>
		</bibl>
		<history>
			<pub>
				<date>
					<day>26</day>
					<month>03</month>
					<year>2008</year>
				</date>
			</pub>
		</history>
		<cpyrt>
			<year>2008</year>
			<collab>Manavski and Valle; licensee BioMed Central Ltd.</collab>
			<note>This is an open access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
		</cpyrt>
		<abs>
			<sec>
				<st>
					<p>Abstract</p>
				</st>
				<sec>
					<st>
						<p>Background</p>
					</st>
					<p>Searching for similarities in protein and DNA databases has become a routine procedure in molecular biology. The Smith-Waterman algorithm has been available for more than 25 years. It is based on a dynamic programming approach that explores all the possible alignments between two sequences and, as a result, returns the optimal local alignment. Unfortunately, the computational cost is very high, requiring a number of operations proportional to the product of the lengths of the two sequences. Furthermore, the exponential growth of protein and DNA databases makes the Smith-Waterman algorithm impractical for searching for similarities in large sets of sequences. For these reasons, heuristic approaches such as those implemented in FASTA and BLAST tend to be preferred, allowing faster execution times at the cost of reduced sensitivity. The main motivation of our work is to exploit the huge computational power of commonly available graphics cards to develop high-performance solutions for sequence alignment.</p>
				</sec>
				<sec>
					<st>
						<p>Results</p>
					</st>
					<p>In this paper we present what we believe is the fastest solution of the exact Smith-Waterman algorithm running on commodity hardware. It is implemented in the recently released CUDA programming environment by NVidia. CUDA allows direct access to the hardware primitives of the latest-generation G80 Graphics Processing Units (GPUs). Speeds of more than 3.5 GCUPS (Giga Cell Updates Per Second) are achieved on a workstation running two GeForce 8800 GTX cards. Extensive tests were run to compare our implementation to SSEARCH and BLAST, running on a 3 GHz Intel Pentium IV processor. Our solution was also compared to a recently published GPU implementation and to a Single Instruction Multiple Data (SIMD) solution. These tests show that our implementation performs from 2 to 30 times faster than any previous attempt available on commodity hardware.</p>
				</sec>
				<sec>
					<st>
						<p>Conclusions</p>
					</st>
					<p>The results show that graphics cards are now sufficiently advanced to be used as efficient hardware accelerators for sequence alignment. Their performance is better than that of any alternative available on commodity hardware platforms. The solution presented in this paper allows large-scale alignments to be performed at low cost, using the exact Smith-Waterman algorithm instead of the widely adopted heuristic approaches.</p>
				</sec>
			</sec>
		</abs>
	</fm>
	<bdy>
		<sec>
			<st>
				<p>Background</p>
			</st>
			<sec>
				<st>
					<p>Related works</p>
				</st>
				<p>Searching databases of DNA and protein sequences is one of the fundamental tasks in bioinformatics. The Smith-Waterman algorithm guarantees the maximal sensitivity for local sequence alignments, but it is slow. Moreover, biological databases are growing at an exponential rate, which is greater than the rate of improvement of microprocessors. This trend results in longer search times and/or more expensive hardware to manage the problem. Special-purpose hardware implementations, such as supercomputers or field-programmable gate arrays (FPGAs), are certainly interesting options, but they tend to be very expensive and not suitable for many users.</p>
				<p>For the above reasons, many widespread solutions running on common microprocessors now use some heuristic approaches to reduce the computational cost of sequence alignment. Thus a reduced execution time is reached at the expense of sensitivity. FASTA (Pearson and Lipman, 1988) <abbrgrp>
						<abbr bid="B1">1</abbr>
					</abbrgrp> and BLAST (Altschul et al., 1997) <abbrgrp>
						<abbr bid="B2">2</abbr>
					</abbrgrp> are up to 40 times faster than the best-known straightforward CPU implementation of Smith-Waterman.</p>
				<p>A number of efforts have also been made to obtain faster implementations of the Smith-Waterman algorithm on commodity hardware. Farrar <abbrgrp>
						<abbr bid="B3">3</abbr>
					</abbrgrp> exploits Intel SSE2, the SIMD multimedia extension of the CPU. His implementation is up to 13 times faster than SSEARCH <abbrgrp>
						<abbr bid="B14">14</abbr>
					</abbrgrp> (a quasi-standard implementation of Smith-Waterman).</p>
				<p>To our knowledge, the only previous attempt to implement Smith-Waterman on a GPU was done by W. Liu et al. (2006) <abbrgrp>
						<abbr bid="B4">4</abbr>
					</abbrgrp>. Their solution relies on OpenGL, which has some intrinsic limits because it is based on the graphics pipeline. Thus, the problem must be converted to the graphical domain, and a reverse procedure is needed to convert the results back. Although that approach is up to 5 times faster than SSEARCH, it is considerably slower than BLAST.</p>
				<p>In this paper we present the first solution based on commodity hardware that efficiently computes the exact Smith-Waterman alignment. It runs from 2 to 30 times faster than any previous implementation on general-purpose hardware.</p>
			</sec>
			<sec>
				<st>
					<p>The Smith-Waterman algorithm</p>
				</st>
				<p>The Smith-Waterman algorithm is designed to find the optimal local alignment between two sequences. It was proposed by Smith and Waterman (1981) <abbrgrp>
						<abbr bid="B5">5</abbr>
					</abbrgrp> and enhanced by Gotoh (1982) <abbrgrp>
						<abbr bid="B6">6</abbr>
					</abbrgrp>. The alignment of two sequences is based on the computation of an alignment matrix. The number of its columns and rows is given by the number of the residues in the query and database sequences respectively. The computation is based on a substitution matrix and on a gap-penalty function.</p>
				<p>Definitions:</p>
				<p>&#8226; <it>A: a<sub>1</sub>a<sub>2</sub>a<sub>3</sub>&#8230;.a<sub>n</sub>
					</it> is the first sequence,</p>
				<p>&#8226; <it>B: b<sub>1</sub>b<sub>2</sub>b<sub>3</sub>&#8230;.b<sub>m</sub>
					</it> is the second sequence,</p>
				<p>&#8226; <it>W(a<sub>i</sub>, b<sub>j</sub>)</it> is the substitution matrix,</p>
				<p>&#8226; <it>G<sub>init</sub>
					</it> and <it>G<sub>ext</sub>
					</it> are the penalties for starting and continuing a gap,</p>
				<p>the alignment scores ending with a gap along A and B are</p>
				<p>
					<display-formula>
						<m:math name="1471-2105-9-S2-S10-i1" xmlns:m="http://www.w3.org/1998/Math/MathML">
							<m:semantics>
								<m:mtable columnalign="left">
									<m:mtr>
										<m:mtd>
											<m:msub>
												<m:mi>E</m:mi>
												<m:mrow>
													<m:mi>i</m:mi>
													<m:mo>,</m:mo>
													<m:mi>j</m:mi>
												</m:mrow>
											</m:msub>
											<m:mo>=</m:mo>
											<m:mi>m</m:mi>
											<m:mi>a</m:mi>
											<m:mi>x</m:mi>
											<m:mrow>
												<m:mo>{</m:mo>
												<m:mtable columnalign="left">
													<m:mtr>
														<m:mtd>
															<m:msub>
																<m:mi>E</m:mi>
																<m:mrow>
																	<m:mi>i</m:mi>
																	<m:mo>,</m:mo>
																	<m:mi>j</m:mi>
																	<m:mo>&#8722;</m:mo>
																	<m:mn>1</m:mn>
																</m:mrow>
															</m:msub>
															<m:mo>&#8722;</m:mo>
															<m:msub>
																<m:mi>G</m:mi>
																<m:mrow>
																	<m:mi>e</m:mi>
																	<m:mi>x</m:mi>
																	<m:mi>t</m:mi>
																</m:mrow>
															</m:msub>
														</m:mtd>
													</m:mtr>
													<m:mtr>
														<m:mtd>
															<m:msub>
																<m:mi>H</m:mi>
																<m:mrow>
																	<m:mi>i</m:mi>
																	<m:mo>,</m:mo>
																	<m:mi>j</m:mi>
																	<m:mo>&#8722;</m:mo>
																	<m:mn>1</m:mn>
																</m:mrow>
															</m:msub>
															<m:mo>&#8722;</m:mo>
															<m:msub>
																<m:mi>G</m:mi>
																<m:mrow>
																	<m:mi>i</m:mi>
																	<m:mi>n</m:mi>
																	<m:mi>i</m:mi>
																	<m:mi>t</m:mi>
																</m:mrow>
															</m:msub>
														</m:mtd>
													</m:mtr>
												</m:mtable>
												<m:mo>}</m:mo>
											</m:mrow>
										</m:mtd>
									</m:mtr>
									<m:mtr>
										<m:mtd>
											<m:msub>
												<m:mi>F</m:mi>
												<m:mrow>
													<m:mi>i</m:mi>
													<m:mo>,</m:mo>
													<m:mi>j</m:mi>
												</m:mrow>
											</m:msub>
											<m:mo>=</m:mo>
											<m:mi>m</m:mi>
											<m:mi>a</m:mi>
											<m:mi>x</m:mi>
											<m:mrow>
												<m:mo>{</m:mo>
												<m:mtable columnalign="left">
													<m:mtr>
														<m:mtd>
															<m:msub>
																<m:mi>F</m:mi>
																<m:mrow>
																	<m:mi>i</m:mi>
																	<m:mo>&#8722;</m:mo>
																	<m:mn>1</m:mn>
																	<m:mo>,</m:mo>
																	<m:mi>j</m:mi>
																</m:mrow>
															</m:msub>
															<m:mo>&#8722;</m:mo>
															<m:msub>
																<m:mi>G</m:mi>
																<m:mrow>
																	<m:mi>e</m:mi>
																	<m:mi>x</m:mi>
																	<m:mi>t</m:mi>
																</m:mrow>
															</m:msub>
														</m:mtd>
													</m:mtr>
													<m:mtr>
														<m:mtd>
															<m:msub>
																<m:mi>H</m:mi>
																<m:mrow>
																	<m:mi>i</m:mi>
																	<m:mo>&#8722;</m:mo>
																	<m:mn>1</m:mn>
																	<m:mo>,</m:mo>
																	<m:mi>j</m:mi>
																</m:mrow>
															</m:msub>
															<m:mo>&#8722;</m:mo>
															<m:msub>
																<m:mi>G</m:mi>
																<m:mrow>
																	<m:mi>i</m:mi>
																	<m:mi>n</m:mi>
																	<m:mi>i</m:mi>
																	<m:mi>t</m:mi>
																</m:mrow>
															</m:msub>
														</m:mtd>
													</m:mtr>
												</m:mtable>
												<m:mo>}</m:mo>
											</m:mrow>
										</m:mtd>
									</m:mtr>
								</m:mtable>
								<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaagaart1ev2aqatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXafv3ySLgzGmvETj2BSbqeeuuDJXwAKbsr4rNCHbGeaGqipu0Je9sqqrpepC0xbbL8F4rqqrFfpeea0xe9Lq=Jc9vqaqpepm0xbba9pwe9Q8fs0=yqaqpepae9pg0FirpepeKkFr0xfr=xfr=xb9adbaqaaeGaciGaaiaabeqaaeaabaWaaaGceaqabeaaieWacqWFfbqrdaWgaaWcbaGaemyAaKMaeiilaWIaemOAaOgabeaakiabg2da9Gqabiab+1gaTjab+fgaHjab+Hha4naacmaaeaqabeaacqWFfbqrdaWgaaWcbaGaemyAaKMaeiilaWIaemOAaOMaeyOeI0IaeGymaedabeaakiabgkHiTiab=DeahnaaBaaaleaacqWGLbqzcqWG4baEcqWG0baDaeqaaaGcbaGae8hsaG0aaSbaaSqaaiabdMgaPjabcYcaSiabdQgaQjabgkHiTiabigdaXaqabaGccqGHsislcqWFhbWrdaWgaaWcbaGaemyAaKMaemOBa4MaemyAaKMaemiDaqhabeaaaaGccaGL7bGaayzFaaaabaGae8Nray0aaSbaaSqaaiabdMgaPjabcYcaSiabdQgaQbqabaGccqGH9aqpcqGFTbqBcqGFHbqycqGF4baEdaGadaabaeqabaGae8Nray0aaSbaaSqaaiabdMgaPjabgkHiTiabigdaXiabcYcaSiabdQgaQbqabaGccqGHsislcqWFhbWrdaWgaaWcbaGaemyzauMaemiEaGNaemiDaqhabeaaaOqaaiab=HeainaaBaaaleaacqWGPbqAcqGHsislcqaIXaqmcqGGSaalcqWGQbGAaeqaaOGaeyOeI0Iae83raC0aaSbaaSqaaiabdMgaPjabd6gaUjabdMgaPjabdsha0bqabaaaaOGaay5Eaiaaw2haaaaaaa@7FE4@</m:annotation>
							</m:semantics>
						</m:math>
					</display-formula>
				</p>
				<p>Finally the alignment scores of the sub-sequences <it>A<sub>i</sub>
					</it>, <it>B<sub>j</sub>
					</it> are:</p>
				<p>
					<display-formula>
						<m:math name="1471-2105-9-S2-S10-i2" xmlns:m="http://www.w3.org/1998/Math/MathML">
							<m:semantics>
								<m:mrow>
									<m:msub>
										<m:mi>H</m:mi>
										<m:mrow>
											<m:mi>i</m:mi>
											<m:mo>,</m:mo>
											<m:mi>j</m:mi>
										</m:mrow>
									</m:msub>
									<m:mo>=</m:mo>
									<m:mtext>&#8202;</m:mtext>
									<m:mtext>&#8202;</m:mtext>
									<m:mi>m</m:mi>
									<m:mi>a</m:mi>
									<m:mi>x</m:mi>
									<m:mtext>&#8202;</m:mtext>
									<m:mtext>&#8202;</m:mtext>
									<m:mrow>
										<m:mo>{</m:mo>
										<m:mtable columnalign="left">
											<m:mtr>
												<m:mtd>
													<m:mn>0</m:mn>
												</m:mtd>
											</m:mtr>
											<m:mtr>
												<m:mtd>
													<m:msub>
														<m:mi>E</m:mi>
														<m:mrow>
															<m:mi>i</m:mi>
															<m:mo>,</m:mo>
															<m:mi>j</m:mi>
														</m:mrow>
													</m:msub>
												</m:mtd>
											</m:mtr>
											<m:mtr>
												<m:mtd>
													<m:msub>
														<m:mi>F</m:mi>
														<m:mrow>
															<m:mi>i</m:mi>
															<m:mo>,</m:mo>
															<m:mi>j</m:mi>
														</m:mrow>
													</m:msub>
												</m:mtd>
											</m:mtr>
											<m:mtr>
												<m:mtd>
													<m:msub>
														<m:mi>H</m:mi>
														<m:mrow>
															<m:mi>i</m:mi>
															<m:mo>&#8722;</m:mo>
															<m:mn>1</m:mn>
															<m:mo>,</m:mo>
															<m:mi>j</m:mi>
															<m:mo>&#8722;</m:mo>
															<m:mn>1</m:mn>
														</m:mrow>
													</m:msub>
													<m:mo>+</m:mo>
													<m:mi>W</m:mi>
													<m:mrow>
														<m:mo>(</m:mo>
														<m:mrow>
															<m:msub>
																<m:mi>a</m:mi>
																<m:mi>i</m:mi>
															</m:msub>
															<m:mo>,</m:mo>
															<m:msub>
																<m:mi>b</m:mi>
																<m:mi>j</m:mi>
															</m:msub>
														</m:mrow>
														<m:mo>)</m:mo>
													</m:mrow>
												</m:mtd>
											</m:mtr>
										</m:mtable>
										<m:mo>}</m:mo>
									</m:mrow>
								</m:mrow>
								<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaagaart1ev2aqatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXafv3ySLgzGmvETj2BSbqeeuuDJXwAKbsr4rNCHbGeaGqipu0Je9sqqrpepC0xbbL8F4rqqrFfpeea0xe9Lq=Jc9vqaqpepm0xbba9pwe9Q8fs0=yqaqpepae9pg0FirpepeKkFr0xfr=xfr=xb9adbaqaaeGaciGaaiaabeqaaeaabaWaaaGcbaacbmGae8hsaG0aaSbaaSqaaiabdMgaPjabcYcaSiabdQgaQbqabaGccqGH9aqpcaaMi8UaaGjcVJqabiab+1gaTjab+fgaHjab+Hha4jaayIW7caaMi8+aaiWaaqaabeqaaiab+bdaWaqaaiab=veafnaaBaaaleaacqWGPbqAcqGGSaalcqWGQbGAaeqaaaGcbaGae8Nray0aaSbaaSqaaiabdMgaPjabcYcaSiabdQgaQbqabaaakeaacqWFibasdaWgaaWcbaGaemyAaKMaeyOeI0IaeGymaeJaeiilaWIaemOAaOMaeyOeI0IaeGymaedabeaakiabgkHiTiab=DfaxnaabmaabaGae8xyae2aaSbaaSqaaiabdMgaPbqabaGccqGGSaalcqWFIbGydaWgaaWcbaGaemOAaOgabeaaaOGaayjkaiaawMcaaaaacaGL7bGaayzFaaaaaa@600E@</m:annotation>
							</m:semantics>
						</m:math>
					</display-formula>
				</p>
				<p>where 1&#8804;<it>i</it>&#8804;<it>n</it> and 1&#8804;<it>j</it>&#8804;<it>m</it>. The values of <it>E</it>, <it>F</it> and <it>H</it> are 0 when <it>i</it>&lt;1 or <it>j</it>&lt;1. The maximum value of the alignment matrix gives the degree of similarity between <it>A</it> and <it>B</it>.</p>
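				<p>As a minimal sketch of these recurrences (not the CUDA implementation described in this paper), the following Python function fills <it>E</it>, <it>F</it> and <it>H</it> row by row and returns the maximum of <it>H</it>. The substitution score is added to the diagonal term, as in the standard formulation of the algorithm. The names sw_score and toy_w, the toy match and mismatch scores, and the penalty values in the usage note are assumptions made for illustration only.</p>
				<p>
```python
def toy_w(x, y):
    """Toy substitution score (an assumption): +2 match, -1 mismatch."""
    return 2 if x == y else -1


def sw_score(a, b, w, g_init, g_ext):
    """Best local alignment score of sequences a and b (score only)."""
    n, m = len(a), len(b)
    # Row 0 and column 0 hold the boundary value 0 for H, E and F.
    H = [[0] * (m + 1) for _ in range(n + 1)]
    E = [[0] * (m + 1) for _ in range(n + 1)]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Scores of alignments ending with a gap (E: left, F: above).
            E[i][j] = max(E[i][j - 1] - g_ext, H[i][j - 1] - g_init)
            F[i][j] = max(F[i - 1][j] - g_ext, H[i - 1][j] - g_init)
            # Local alignment score; 0 restarts the alignment.
            H[i][j] = max(0, E[i][j], F[i][j],
                          H[i - 1][j - 1] + w(a[i - 1], b[j - 1]))
            best = max(best, H[i][j])
    return best
```
				</p>
				<p>With these toy parameters, sw_score("ACGT", "ACGT", toy_w, 3, 1) evaluates to 8 (four matches), and two sequences with no residue in common score 0.</p>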
				<p>An important point to be considered is that any cell of the alignment matrix can be computed only after the values of the left, above and upper-left (diagonal) cells are known, as shown in Figure <figr fid="F1">1</figr>. Different cells can be processed simultaneously only if they are on the same anti-diagonal.</p>
				<fig id="F1">
					<title>
						<p>Figure 1</p>
					</title>
					<caption>
						<p>Smith-Waterman data dependencies</p>
					</caption>
					<text>
						<p>
							<b>Smith-Waterman data dependencies.</b> Each cell of the alignment matrix depends on the cells on the left and above it. Independent data can be found only on the same anti-diagonal.</p>
					</text>
					<graphic file="1471-2105-9-S2-S10-1"/>
				</fig>
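				<p>The anti-diagonal dependency pattern can be sketched as follows. This Python example enumerates the cells with equal <it>i</it>+<it>j</it> and evaluates the recurrences one anti-diagonal at a time; it only illustrates the data dependencies and is not claimed to be the parallelization strategy used in our CUDA code. The names antidiagonals, sw_score_wavefront and toy_w, and the toy scoring values, are assumptions made for illustration.</p>
				<p>
```python
def toy_w(x, y):
    """Toy substitution score (an assumption): +2 match, -1 mismatch."""
    return 2 if x == y else -1


def antidiagonals(n, m):
    """Yield the cells (i, j) with i + j == d, for d = 2 .. n + m."""
    for d in range(2, n + m + 1):
        i_lo = max(1, d - m)
        i_hi = min(n, d - 1)
        yield [(i, d - i) for i in range(i_lo, i_hi + 1)]


def sw_score_wavefront(a, b, w, g_init, g_ext):
    """Best local score, visiting cells one anti-diagonal at a time."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    E = [[0] * (m + 1) for _ in range(n + 1)]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for cells in antidiagonals(n, m):
        # Cells in this list read only values from the two previous
        # anti-diagonals, so they are independent of one another and
        # could in principle be computed in parallel. Here they are
        # simply processed sequentially.
        for i, j in cells:
            E[i][j] = max(E[i][j - 1] - g_ext, H[i][j - 1] - g_init)
            F[i][j] = max(F[i - 1][j] - g_ext, H[i - 1][j] - g_init)
            H[i][j] = max(0, E[i][j], F[i][j],
                          H[i - 1][j - 1] + w(a[i - 1], b[j - 1]))
            best = max(best, H[i][j])
    return best
```
				</p>
				<p>For a 2 by 3 matrix, antidiagonals(2, 3) yields four groups of 1, 2, 2 and 1 cells, which shows why the available parallelism is small near the corners of the matrix and largest in the middle.</p>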
			</sec>
			<sec>
				<st>
					<p>CUDA programming model</p>
				</st>
				<p>The two major GPU vendors, NVidia and AMD, recently announced their new development platforms, respectively CUDA <abbrgrp>
						<abbr bid="B7">7</abbr>
					</abbrgrp> and CTM <abbrgrp>
						<abbr bid="B8">8</abbr>
					</abbrgrp>. Unlike previous GPU programming models, these are proprietary approaches designed to allow direct access to each vendor's specific graphics hardware. Therefore, there is no compatibility between the two platforms. CUDA is an extension of the C programming language; CTM is a virtual machine running proprietary assembler code. However, both platforms overcome some important restrictions of previous GPGPU approaches, in particular those set by the traditional graphics pipeline and the related programming interfaces such as OpenGL and Direct3D.</p>
				<p>We selected NVidia GeForce 8800 and its CUDA platform to design our Smith-Waterman implementation because it is the first available GPU on the market capable of an internal integer representation of data.</p>
				<p>In CUDA, the GPU is viewed as a compute device suitable for data-parallel applications. It has its own device random access memory and may run a very high number of threads in parallel (Figure <figr fid="F2">2</figr>). Threads are grouped in <it>blocks</it> and many <it>blocks</it> may run in a <it>grid</it> of <it>blocks</it>. Such structured sets of threads may be launched on a <it>kernel</it> of code, processing the data stored in the device memory. Threads of the same <it>block</it> share data through fast on-chip shared memory and can be synchronized through synchronization points. An important aspect of CUDA programming is the management of the memory spaces, which have different characteristics and performance:</p>
				<fig id="F2">
					<title>
						<p>Figure 2</p>
					</title>
					<caption>
						<p>CUDA architecture</p>
					</caption>
					<text>
						<p>
							<b>CUDA architecture.</b> New CUDA compatible GPUs are implemented as a set of multiprocessors. Each multiprocessor has several ALUs (Arithmetic Logic Unit) that, at any given clock cycle, execute the same instructions but on different data. Each ALU can access (read and write) the multiprocessor <it>shared memory</it> and the device RAM.</p>
					</text>
					<graphic file="1471-2105-9-S2-S10-2"/>
				</fig>
				<p>&#8226; Read-write per-thread <it>registers</it> (fast, very limited size)</p>
				<p>&#8226; Read-write per-thread <it>local memory</it> (slow, not cached, limited size)</p>
				<p>&#8226; Read-write per-<it>block shared memory</it> (fast, very limited size)</p>
				<p>&#8226; Read-write per-<it>grid global memory</it> (slow, not cached, large size)</p>
				<p>&#8226; Read-only per-<it>grid constant memory</it> (slow but cached, limited size)</p>
				<p>&#8226; Read-only per-<it>grid texture memory</it> (slow but cached, large size)</p>
				<p>The proper choice of the memory to be used in each <it>kernel</it> depends on many factors, such as the speed, the amount needed, and the operations to be performed on the stored data. An important restriction is the limited size of <it>shared memory</it>, which is the only available read-write cache. Unlike in the CPU programming model, here the programmer needs to explicitly copy data from the <it>global memory</it> to the cache (<it>shared memory</it>) and back. However, this new architecture allows memory to be accessed in a fully general way, so both <it>scatter</it> and <it>gather</it> operations are available. <it>Gather</it> is the ability to read any memory cell during the run of the <it>kernel</it> code. <it>Scatter</it> is the ability to randomly access any memory cell for writing. The unavailability of <it>scatter</it> was one of the major limitations of OpenGL when applied to GPGPU applications. The main point in approaching CUDA is that the overall performance of an application depends dramatically on the management of the memory, which is much more complex than on CPUs.</p>
			</sec>
		</sec>
		<sec>
			<st>
				<p>Results and discussion</p>
			</st>
			<p>Exhaustive tests have been performed to compare the implementation of Smith-Waterman in CUDA with:</p>
			<p>&#8226; the results of W. Liu as reported in his paper <abbrgrp>
					<abbr bid="B4">4</abbr>
				</abbrgrp>. His solution was implemented in OpenGL and was tested on an NVidia GeForce 7900 GPU</p>
			<p>&#8226; BLAST <abbrgrp>
					<abbr bid="B2">2</abbr>
				</abbrgrp> and SSEARCH <abbrgrp>
					<abbr bid="B14">14</abbr>
				</abbrgrp>, running on a 3 GHz Intel Pentium IV processor</p>
			<p>&#8226; the results of the SIMD implementation by Farrar as reported in his paper <abbrgrp>
					<abbr bid="B3">3</abbr>
				</abbrgrp>. His tests were run on a 2.4 GHz Xeon Core 2 Duo processor.</p>
			<p>We tested our solution on a workstation with a 2.4 GHz Intel Q6600 processor and two NVidia GeForce 8800 GTX graphics cards. We measured the performance by running the application on both single-GPU and dual-GPU configurations. By doubling the computing resources we observed that the overall performance of the application nearly doubles. This shows that the solution can benefit from a nearly linear speed improvement when more graphics boards are added to the system. It must be mentioned that the NVidia SLI option, available for multi-GPU systems, is designed for OpenGL. Therefore, SLI must be disabled for CUDA, which requires direct programming of every installed GPU.</p>
			<sec>
				<st>
					<p>Smith-Waterman in CUDA vs. Liu's implementation</p>
				</st>
				<p>For this test, five protein sequences of different lengths (from 63 to 511 residues) were run against the SwissProt database (Dec. 2006 &#8211; 250,296 proteins and 91,694,534 amino acids). The substitution matrix BLOSUM50 was used, with a gap-open penalty of 10 and a gap-extension penalty of 2. The resulting MCUPS for each of the 5 query sequences are shown in Table <tblr tid="T1">1</tblr>.</p>
				<tbl id="T1" hint_layout="single">
					<title>
						<p>Table 1</p>
					</title>
					<caption>
						<p>Smith-Waterman in CUDA running on single and double GPU vs. Liu's solution implemented in OpenGL</p>
					</caption>
					<tblbdy cols="5">
						<r>
							<c cspan="2" ca="center">
								<p>
									<b>
										<it>Sequence</it>
									</b>
								</p>
							</c>
							<c>
								<p>
									<b>
										<it>SW-Cuda*</it>
									</b>
								</p>
							</c>
							<c>
								<p>
									<b>
										<it>SW-Cuda**</it>
									</b>
								</p>
							</c>
							<c>
								<p>
									<b>
										<it>Weiguo Liu</it>
									</b>
								</p>
							</c>
						</r>
						<r>
							<c>
								<p>
									<b>
										<it>Name</it>
									</b>
								</p>
							</c>
							<c>
								<p>
									<b>
										<it>Length</it>
									</b>
								</p>
							</c>
							<c>
								<p>
									<b>
										<it>MCUPS</it>
									</b>
								</p>
							</c>
							<c>
								<p>
									<b>
										<it>MCUPS</it>
									</b>
								</p>
							</c>
							<c>
								<p>
									<b>
										<it>MCUPS</it>
									</b>
								</p>
							</c>
						</r>
						<r>
							<c cspan="5">
								<hr/>
							</c>
						</r>
						<r>
							<c>
								<p>
									<it>O29181</it>
								</p>
							</c>
							<c>
								<p>63</p>
							</c>
							<c>
								<p>1849</p>
							</c>
							<c>
								<p>3561</p>
							</c>
							<c>
								<p>197</p>
							</c>
						</r>
						<r>
							<c>
								<p>
									<it>P03630</it>
								</p>
							</c>
							<c>
								<p>127</p>
							</c>
							<c>
								<p>1889</p>
							</c>
							<c>
								<p>3612</p>
							</c>
							<c>
								<p>317</p>
							</c>
						</r>
						<r>
							<c>
								<p>
									<it>P53765</it>
								</p>
							</c>
							<c>
								<p>255</p>
							</c>
							<c>
								<p>1811</p>
							</c>
							<c>
								<p>3428</p>
							</c>
							<c>
								<p>428</p>
							</c>
						</r>
						<r>
							<c>
								<p>
									<it>Q8ZGB4</it>
								</p>
							</c>
							<c>
								<p>361</p>
							</c>
							<c>
								<p>1810</p>
							</c>
							<c>
								<p>3446</p>
							</c>
							<c>
								<p>486</p>
							</c>
						</r>
						<r>
							<c>
								<p>
									<it>P58229</it>
								</p>
							</c>
							<c>
								<p>511</p>
							</c>
							<c>
								<p>1795</p>
							</c>
							<c>
								<p>3353</p>
							</c>
							<c>
								<p>533</p>
							</c>
						</r>
					</tblbdy>
					<tblfn>
						<p>Substitution matrix used: BLOSUM50. Gap-open penalty: 10. Gap-extension penalty: 2.</p>
						<p>Database used: SwissProt (Dec. 2006 &#8211; 250,296 proteins and 91,694,534 amino acids).</p>
						<p>* Smith-Waterman in CUDA running on an NVidia GeForce 8800 GTX</p>
						<p>** Smith-Waterman in CUDA running on two NVidia GeForce 8800 GTX</p>
					</tblfn>
				</tbl>
				<p>On the same sequences, Liu obtained an average of 392.2 MCUPS and a peak of 533 MCUPS. Our solution on a single GPU completed the five searches in a total of 63.5 s, with an average of 1830 MCUPS and a peak of 1889 MCUPS. Our implementation on two GPUs achieved a total search time of 33.63 s, with an average of 3480 MCUPS and a peak of 3612 MCUPS. These results indicate that our implementation of Smith-Waterman is up to 18 times faster than that of Liu.</p>
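				<p>The averages and the maximum speed-up quoted here can be checked directly from the MCUPS columns of Table 1. The short Python sketch below does this arithmetic; the variable and function names are assumptions chosen for illustration, and the numbers are copied from the table.</p>
				<p>
```python
# MCUPS values copied from Table 1 (query lengths 63, 127, 255, 361, 511).
sw_cuda_single = [1849, 1889, 1811, 1810, 1795]
sw_cuda_double = [3561, 3612, 3428, 3446, 3353]
liu_opengl = [197, 317, 428, 486, 533]


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def max_speedup(ours, theirs):
    """Largest per-query ratio ours / theirs."""
    return max(o / t for o, t in zip(ours, theirs))


print(f"Liu mean MCUPS:        {mean(liu_opengl):.1f}")      # 392.2
print(f"Single GPU mean MCUPS: {mean(sw_cuda_single):.1f}")  # 1830.8
print(f"Double GPU mean MCUPS: {mean(sw_cuda_double):.1f}")  # 3480.0
print(f"Max speed-up vs Liu:   "
      f"{max_speedup(sw_cuda_double, liu_opengl):.2f}")      # 18.08
print(f"Double / single mean:  "
      f"{mean(sw_cuda_double) / mean(sw_cuda_single):.2f}")  # 1.90
```
				</p>
				<p>The largest ratio, about 18, occurs for the shortest query (63 residues) on two GPUs, and the ratio of the two-GPU to the single-GPU average is about 1.9.</p>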
			</sec>
			<sec>
				<st>
					<p>Smith-Waterman in CUDA vs. BLAST and SSEARCH</p>
				</st>
				<p>For this test we used the same sequences, database, substitution matrix and gap penalties described in the previous section. SSEARCH completed the searches in a total of 960 s, with an average of 119.2 MCUPS and a peak of 123 MCUPS. BLAST completed the searches in a total of 53.3 s, with an average of 2018 MCUPS and a peak of 2691 MCUPS.</p>
				<p>Our CUDA implementation was up to 30 times faster than SSEARCH and up to 2.4 times faster than BLAST, as shown in Figure <figr fid="F3">3</figr> and Table <tblr tid="T2">2</tblr>.</p>
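				<p>The relation between search time and cell updates per second can be sketched as follows, assuming that the number of cell updates for one search is the query length multiplied by the number of residues in the database. The function name cups and the constant name DB_RESIDUES are assumptions for illustration; the times and reported MCUPS are the single-GPU values from Table 2. For these rows, dividing by 2^20 rather than 10^6 reproduces the tabulated figures to within one unit; this is an observation about the arithmetic, not a definition stated in the text.</p>
				<p>
```python
DB_RESIDUES = 91_694_534  # SwissProt (Dec. 2006) size quoted in the text


def cups(query_len, seconds, db_residues=DB_RESIDUES):
    """Cell updates per second, assuming query_len * db_residues cells."""
    return query_len * db_residues / seconds


# (query length, single-GPU time in s, reported MCUPS) from Table 2.
rows = [(63, 2.98, 1849), (127, 5.88, 1889), (255, 12.31, 1811)]

for length, seconds, reported in rows:
    c = cups(length, seconds)
    print(f"len={length}: {c / 1e6:.0f} (per 10^6), "
          f"{c / 2**20:.0f} (per 2^20), reported {reported}")
```
				</p>
				<p>Under the same assumption, the speed-up of one implementation over another for a given query is simply the ratio of their search times, since the number of cells is identical.</p>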
				<fig id="F3">
					<title>
						<p>Figure 3</p>
					</title>
					<caption>
						<p>Smith-Waterman in CUDA running on single and double GPU vs. BLAST and SSEARCH</p>
					</caption>
					<text>
						<p>
							<b>Smith-Waterman in CUDA running on single and double GPU vs. BLAST and SSEARCH.</b> Substitution matrix used: BLOSUM50. Gap-open penalty: 10. Gap-extension penalty: 2.</p>
						<p>Database used: SwissProt (Dec. 2006 &#8211; 250,296 proteins and 91,694,534 amino acids).</p>
						<p>* Smith-Waterman in CUDA running on an NVidia GeForce 8800 GTX</p>
						<p>** Smith-Waterman in CUDA running on two NVidia GeForce 8800 GTX</p>
					</text>
					<graphic file="1471-2105-9-S2-S10-3"/>
				</fig>
				<tbl id="T2" hint_layout="single">
					<title>
						<p>Table 2</p>
					</title>
					<caption>
						<p>Smith-Waterman in CUDA running on single and double GPU vs. BLAST and SSEARCH</p>
					</caption>
					<tblbdy cols="10">
						<r>
							<c cspan="2">
								<p>
									<b>
										<it>Sequence</it>
									</b>
								</p>
							</c>
							<c cspan="2">
								<p>
									<b>
										<it>SW-Cuda*</it>
									</b>
								</p>
							</c>
							<c cspan="2">
								<p>
									<b>
										<it>SW-Cuda**</it>
									</b>
								</p>
							</c>
							<c cspan="2">
								<p>
									<b>
										<it>Ssearch(Fasta)</it>
									</b>
								</p>
							</c>
							<c cspan="2">
								<p>
									<b>
										<it>Blast</it>
									</b>
								</p>
							</c>
						</r>
						<r>
							<c>
								<p>
									<b>
										<it>Name</it>
									</b>
								</p>
							</c>
							<c>
								<p>
									<b>
										<it>Length</it>
									</b>
								</p>
							</c>
							<c>
								<p>
									<b>
										<it>Time (s)</it>
									</b>
								</p>
							</c>
							<c>
								<p>
									<b>
										<it>MCUPS</it>
									</b>
								</p>
							</c>
							<c>
								<p>
									<b>
										<it>Time (s)</it>
									</b>
								</p>
							</c>
							<c>
								<p>
									<b>
										<it>MCUPS</it>
									</b>
								</p>
							</c>
							<c>
								<p>
									<b>
										<it>Time (s)</it>
									</b>
								</p>
							</c>
							<c>
								<p>
									<b>
										<it>MCUPS</it>
									</b>
								</p>
							</c>
							<c>
								<p>
									<b>
										<it>Time (s)</it>
									</b>
								</p>
							</c>
							<c>
								<p>
									<b>
										<it>MCUPS</it>
									</b>
								</p>
							</c>
						</r>
						<r>
							<c cspan="10">
								<hr/>
							</c>
						</r>
						<r>
							<c>
								<p>
									<it>O29181</it>
								</p>
							</c>
							<c>
								<p>63</p>
							</c>
							<c>
								<p>2.98</p>
							</c>
							<c>
								<p>1849</p>
							</c>
							<c>
								<p>1.547</p>
							</c>
							<c>
								<p>3561</p>
							</c>
							<c>
								<p>46</p>
							</c>
							<c>
								<p>119</p>
							</c>
							<c>
								<p>3.7</p>
							</c>
							<c>
								<p>1488</p>
							</c>
						</r>
						<r>
							<c>
								<p>
									<it>P03630</it>
								</p>
							</c>
							<c>
								<p>127</p>
							</c>
							<c>
								<p>5.88</p>
							</c>
							<c>
								<p>1889</p>
							</c>
							<c>
								<p>3.075</p>
							</c>
							<c>
								<p>3612</p>
							</c>
							<c>
								<p>93</p>
							</c>
							<c>
								<p>119</p>
							</c>
							<c>
								<p>5.7</p>
							</c>
							<c>
								<p>1948</p>
							</c>
						</r>
						<r>
							<c>
								<p>
									<it>P53765</it>
								</p>
							</c>
							<c>
								<p>255</p>
							</c>
							<c>
								<p>12.31</p>
							</c>
							<c>
								<p>1811</p>
							</c>
							<c>
								<p>6.505</p>
							</c>
							<c>
								<p>3428</p>
							</c>
							<c>
								<p>184</p>
							</c>
							<c>
								<p>121</p>
							</c>
							<c>
								<p>11</p>
							</c>
							<c>
								<p>2027</p>
							</c>
						</r>
						<r>
							<c>
								<p>
									<it>Q8ZGB4</it>
								</p>
							</c>
							<c>
								<p>361</p>
							</c>
							<c>
								<p>17.44</p>
							</c>
							<c>
								<p>1810</p>
							</c>
							<c>
								<p>9.162</p>
							</c>
							<c>
								<p>3446</p>
							</c>
							<c>
								<p>275</p>
							</c>
							<c>
								<p>114</p>
							</c>
							<c>
								<p>16.3</p>
							</c>
							<c>
								<p>1936</p>
							</c>
						</r>
						<r>
							<c>
								<p>
									<it>P58229</it>
								</p>
							</c>
							<c>
								<p>511</p>
							</c>
							<c>
								<p>24.89</p>
							</c>
							<c>
								<p>1795</p>
							</c>
							<c>
								<p>13.326</p>
							</c>
							<c>
								<p>3353</p>
							</c>
							<c>
								<p>362</p>
							</c>
							<c>
								<p>123</p>
							</c>
							<c>
								<p>16.6</p>
							</c>
							<c>
								<p>2691</p>
							</c>
						</r>
					</tblbdy>
					<tblfn>
						<p>Substitution matrix used: BLOSUM50. Gap-open penalty: 10. Gap-extension penalty: 2.</p>
						<p>Database used: SwissProt (Dec. 2006 &#8211; 250,296 proteins and 91,694,534 amino acids).</p>
						<p>* Smith-Waterman in CUDA running on an NVidia GeForce 8800 GTX</p>
						<p>** Smith-Waterman in CUDA running on two NVidia GeForce 8800 GTX</p>
					</tblfn>
				</tbl>
			</sec>
			<sec>
				<st>
					<p>Smith-Waterman in CUDA vs. Farrar's implementation</p>
				</st>
				<p>This last test was done running 11 sequences of different length (from 143 to 567 residues) against the SwissProt database (Rel. 49.1 &#8211; 208,005 proteins and 75,841,138 amino acids). The substitution matrix is the BLOSUM50 with a gap-open penalty of 10 and a gap-extension penalty of 2.</p>
				<p>Farrar's approach is based on the following consideration: for most cells in the alignment matrix, F remains at zero and does not contribute to the value of H. Only when H is greater than <it>G<sub>init</sub>
					</it> + <it>G<sub>ext</sub>
					</it> will F start to influence the value of H. F is therefore ignored in a first pass and, if required, a second step corrects the errors this introduces. Farrar's solution completed the search in 161 sec with an average of 1630 MCUPS and a peak of 2045 MCUPS. Our solution running on a single GPU turned in a slightly better time of 154.95 sec with an average of 1783.3 MCUPS and a peak of 1845 MCUPS. On two GPU devices the search was completed in 79.65 sec with an average of 3474.5 MCUPS and a peak of 3575 MCUPS. The search times and resulting MCUPS are shown in Figure <figr fid="F4">4</figr> and Table <tblr tid="T3">3</tblr>.</p>
				<fig id="F4">
					<title>
						<p>Figure 4</p>
					</title>
					<caption>
						<p>Smith-Waterman in CUDA running on single and double GPU vs. Farrar's solution</p>
					</caption>
					<text>
						<p>
							<b>Smith-Waterman in CUDA running on single and double GPU vs. Farrar's solution</b>. Substitution matrix used: BLOSUM50. Gap-open penalty: 10. Gap-extension penalty: 2.</p>
						<p>Database used: SwissProt (Rel. 49.1 &#8211; 208,005 proteins and 75,841,138 amino acids).</p>
						<p>* Smith-Waterman in CUDA running on an NVidia GeForce 8800 GTX</p>
						<p>** Smith-Waterman in CUDA running on two NVidia GeForce 8800 GTX</p>
					</text>
					<graphic file="1471-2105-9-S2-S10-4"/>
				</fig>
				<tbl id="T3" hint_layout="single">
					<title>
						<p>Table 3</p>
					</title>
					<caption>
						<p>Smith-Waterman in CUDA running on single and double GPU vs. Farrar's solution</p>
					</caption>
					<tblbdy cols="8">
						<r>
							<c cspan="2" ca="center">
								<p>
									<b>
										<it>Sequence</it>
									</b>
								</p>
							</c>
							<c cspan="2" ca="center">
								<p>
									<b>
										<it>SW-Cuda*</it>
									</b>
								</p>
							</c>
							<c cspan="2" ca="center">
								<p>
									<b>
										<it>SW-Cuda**</it>
									</b>
								</p>
							</c>
							<c cspan="2" ca="center">
								<p>
									<b>
										<it>Farrar</it>
									</b>
								</p>
							</c>
						</r>
						<r>
							<c>
								<p>
									<b>
										<it>Name</it>
									</b>
								</p>
							</c>
							<c>
								<p>
									<b>
										<it>Length</it>
									</b>
								</p>
							</c>
							<c>
								<p>
									<b>
										<it>Time (s)</it>
									</b>
								</p>
							</c>
							<c>
								<p>
									<b>
										<it>MCUPS</it>
									</b>
								</p>
							</c>
							<c>
								<p>
									<b>
										<it>Time (s)</it>
									</b>
								</p>
							</c>
							<c>
								<p>
									<b>
										<it>MCUPS</it>
									</b>
								</p>
							</c>
							<c>
								<p>
									<b>
										<it>Time (s)</it>
									</b>
								</p>
							</c>
							<c>
								<p>
									<b>
										<it>MCUPS</it>
									</b>
								</p>
							</c>
						</r>
						<r>
							<c cspan="8">
								<hr/>
							</c>
						</r>
						<r>
							<c>
								<p>
									<it>P02232</it>
								</p>
							</c>
							<c>
								<p>143</p>
							</c>
							<c>
								<p>5.59</p>
							</c>
							<c>
								<p>1845</p>
							</c>
							<c>
								<p>2.95</p>
							</c>
							<c>
								<p>3497</p>
							</c>
							<c>
								<p>9</p>
							</c>
							<c>
								<p>1149</p>
							</c>
						</r>
						<r>
							<c>
								<p>
									<it>P01111</it>
								</p>
							</c>
							<c>
								<p>189</p>
							</c>
							<c>
								<p>7.59</p>
							</c>
							<c>
								<p>1796</p>
							</c>
							<c>
								<p>3.84</p>
							</c>
							<c>
								<p>3551</p>
							</c>
							<c>
								<p>10</p>
							</c>
							<c>
								<p>1367</p>
							</c>
						</r>
						<r>
							<c>
								<p>
									<it>P05013</it>
								</p>
							</c>
							<c>
								<p>189</p>
							</c>
							<c>
								<p>7.59</p>
							</c>
							<c>
								<p>1796</p>
							</c>
							<c>
								<p>3.84</p>
							</c>
							<c>
								<p>3551</p>
							</c>
							<c>
								<p>10</p>
							</c>
							<c>
								<p>1367</p>
							</c>
						</r>
						<r>
							<c>
								<p>
									<it>P14942</it>
								</p>
							</c>
							<c>
								<p>222</p>
							</c>
							<c>
								<p>8.84</p>
							</c>
							<c>
								<p>1812</p>
							</c>
							<c>
								<p>4.48</p>
							</c>
							<c>
								<p>3575</p>
							</c>
							<c>
								<p>12</p>
							</c>
							<c>
								<p>1338</p>
							</c>
						</r>
						<r>
							<c>
								<p>
									<it>P00762</it>
								</p>
							</c>
							<c>
								<p>246</p>
							</c>
							<c>
								<p>9.85</p>
							</c>
							<c>
								<p>1802</p>
							</c>
							<c>
								<p>5.01</p>
							</c>
							<c>
								<p>3542</p>
							</c>
							<c>
								<p>13</p>
							</c>
							<c>
								<p>1369</p>
							</c>
						</r>
						<r>
							<c>
								<p>
									<it>P10318</it>
								</p>
							</c>
							<c>
								<p>362</p>
							</c>
							<c>
								<p>14.71</p>
							</c>
							<c>
								<p>1775</p>
							</c>
							<c>
								<p>7.57</p>
							</c>
							<c>
								<p>3450</p>
							</c>
							<c>
								<p>15</p>
							</c>
							<c>
								<p>1746</p>
							</c>
						</r>
						<r>
							<c>
								<p>
									<it>P07327</it>
								</p>
							</c>
							<c>
								<p>374</p>
							</c>
							<c>
								<p>15.28</p>
							</c>
							<c>
								<p>1766</p>
							</c>
							<c>
								<p>7.86</p>
							</c>
							<c>
								<p>3433</p>
							</c>
							<c>
								<p>16</p>
							</c>
							<c>
								<p>1691</p>
							</c>
						</r>
						<r>
							<c>
								<p>
									<it>P01008</it>
								</p>
							</c>
							<c>
								<p>464</p>
							</c>
							<c>
								<p>18.96</p>
							</c>
							<c>
								<p>1765</p>
							</c>
							<c>
								<p>9.83</p>
							</c>
							<c>
								<p>3405</p>
							</c>
							<c>
								<p>18</p>
							</c>
							<c>
								<p>1864</p>
							</c>
						</r>
						<r>
							<c>
								<p>
									<it>P10635</it>
								</p>
							</c>
							<c>
								<p>497</p>
							</c>
							<c>
								<p>20.39</p>
							</c>
							<c>
								<p>1758</p>
							</c>
							<c>
								<p>10.43</p>
							</c>
							<c>
								<p>3438</p>
							</c>
							<c>
								<p>19</p>
							</c>
							<c>
								<p>1892</p>
							</c>
						</r>
						<r>
							<c>
								<p>
									<it>P25705</it>
								</p>
							</c>
							<c>
								<p>553</p>
							</c>
							<c>
								<p>22.83</p>
							</c>
							<c>
								<p>1747</p>
							</c>
							<c>
								<p>11.88</p>
							</c>
							<c>
								<p>3358</p>
							</c>
							<c>
								<p>19</p>
							</c>
							<c>
								<p>2105</p>
							</c>
						</r>
						<r>
							<c>
								<p>
									<it>P03435</it>
								</p>
							</c>
							<c>
								<p>567</p>
							</c>
							<c>
								<p>23.32</p>
							</c>
							<c>
								<p>1754</p>
							</c>
							<c>
								<p>11.96</p>
							</c>
							<c>
								<p>3420</p>
							</c>
							<c>
								<p>20</p>
							</c>
							<c>
								<p>2045</p>
							</c>
						</r>
					</tblbdy>
					<tblfn>
						<p>Substitution matrix used: BLOSUM50. Gap-open penalty: 10. Gap-extension penalty: 2.</p>
						<p>Database used: SwissProt (Rel. 49.1 &#8211; 208,005 proteins and 75,841,138 amino acids).</p>
						<p>* Smith-Waterman in CUDA running on an NVidia GeForce 8800 GTX</p>
						<p>** Smith-Waterman in CUDA running on two NVidia GeForce 8800 GTX</p>
					</tblfn>
				</tbl>
				<p>Farrar's solution performs better on the longer sequences, but on average it is slower than our solution even when the latter runs on a single GPU. On two GPUs, Smith-Waterman in CUDA is up to 3 times faster than Farrar's implementation.</p>
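				<p>The totals, averages and speedups discussed in this section can be re-derived from the per-query values in Table 3. The following Python sketch is only an illustration of that arithmetic: the names <it>TABLE3</it>, <it>summarize</it> and <it>best_speedup</it> are hypothetical and are not part of the published software.</p>

```python
# Per-query results copied from Table 3, in this column order:
# (length, time 1 GPU, MCUPS 1 GPU, time 2 GPUs, MCUPS 2 GPUs,
#  time Farrar, MCUPS Farrar). Times are in seconds.
TABLE3 = {
    "P02232": (143, 5.59, 1845, 2.95, 3497, 9, 1149),
    "P01111": (189, 7.59, 1796, 3.84, 3551, 10, 1367),
    "P05013": (189, 7.59, 1796, 3.84, 3551, 10, 1367),
    "P14942": (222, 8.84, 1812, 4.48, 3575, 12, 1338),
    "P00762": (246, 9.85, 1802, 5.01, 3542, 13, 1369),
    "P10318": (362, 14.71, 1775, 7.57, 3450, 15, 1746),
    "P07327": (374, 15.28, 1766, 7.86, 3433, 16, 1691),
    "P01008": (464, 18.96, 1765, 9.83, 3405, 18, 1864),
    "P10635": (497, 20.39, 1758, 10.43, 3438, 19, 1892),
    "P25705": (553, 22.83, 1747, 11.88, 3358, 19, 2105),
    "P03435": (567, 23.32, 1754, 11.96, 3420, 20, 2045),
}

FARRAR_TIME_COL = 5


def summarize(time_col, mcups_col):
    """Return (total time, mean MCUPS, peak MCUPS) for one implementation."""
    rows = list(TABLE3.values())
    times = [row[time_col] for row in rows]
    mcups = [row[mcups_col] for row in rows]
    return sum(times), sum(mcups) / len(mcups), max(mcups)


def best_speedup(time_col):
    """Largest per-query ratio of Farrar's time to the given CUDA time."""
    return max(row[FARRAR_TIME_COL] / row[time_col] for row in TABLE3.values())


if __name__ == "__main__":
    for label, t, m in (("1 GPU", 1, 2), ("2 GPUs", 3, 4), ("Farrar", 5, 6)):
        total, mean, peak = summarize(t, m)
        print(f"{label}: {total:.2f} s, mean {mean:.1f} MCUPS, peak {peak}")
    print(f"best speedup on 1 GPU: {best_speedup(1):.2f}x")
    print(f"best speedup on 2 GPUs: {best_speedup(3):.2f}x")
```

				<p>The largest two-GPU speedup comes from the shortest query, P02232 (9 s against 2.95 s, a ratio of about 3.05), which is where the "up to 3 times" figure originates.</p>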
			</sec>
		</sec>
		<sec>
			<st>
				<p>Conclusions</p>
			</st>
			<p>Until now, the huge computational power of GPUs has been hampered by the limited programming model of OpenGL, which is unsuitable for efficient general-purpose computing.</p>
			<p>The results of this work show that the new CUDA compatible graphics cards are now advanced enough to be considered efficient hardware accelerators for the Smith-Waterman algorithm. High speed can be obtained with the greatest sensitivity. This work also opens interesting perspectives, as similar acceleration strategies could be applied to a number of widely used algorithms in bioinformatics. Equal investments in hardware may thus lead to much better performance. Our team plans future work on accelerating BLAST.</p>
			<p>The source files of our Smith-Waterman implementation are available at <url>http://bioinformatics.cribi.unipd.it/cuda/</url>.</p>
		</sec>
		<sec>
			<st>
				<p>Methods</p>
			</st>
			<sec>
				<st>
					<p>Query-profile</p>
				</st>
				<p>When calculating <it>H<sub>ij</sub>
					</it>, the value from the substitution matrix <it>W(q<sub>i</sub>, d<sub>j</sub>)</it> is added to <it>H<sub>i&#8722;1, j&#8722;1</sub>
					</it>. As suggested by Rognes and Seeberg <abbrgrp>
						<abbr bid="B9">9</abbr>
					</abbrgrp>, to avoid the lookup of <it>W(q<sub>i</sub>, d<sub>j</sub>)</it> in the inner loop of the algorithm, we pre-compute a query profile parallel to the query sequence for each possible residue.</p>
				<p>The query profile, shown in Figure <figr fid="F5">5</figr>, can be considered as a query-specific substitution matrix computed only once for the entire database. The score for matching symbol A (for alanine) in the database sequence with each of the symbols in the query sequence is stored sequentially in the first matrix row. The scores for matching symbol B are stored in the next row, and so on.</p>
				<fig id="F5">
					<title>
						<p>Figure 5</p>
					</title>
					<caption>
						<p>Query-profile</p>
					</caption>
					<text>
						<p>
							<b>Query-profile.</b> Example of query profile for the protein O29181. For each amino acid, a profile row is filled with the scores obtained by matching that amino acid with the query residues, based on the given substitution matrix.</p>
					</text>
					<graphic file="1471-2105-9-S2-S10-5"/>
				</fig>
				<p>In this way we replace random accesses to the substitution matrix with sequential ones to the query profile. This solution exploits the cache of the GPU <it>texture memory</it> space where the query profile is stored.</p>
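				<p>The construction of a query profile can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the four-letter alphabet and the <it>toy_score</it> function (+5 for a match, -3 otherwise) are hypothetical stand-ins for the amino-acid alphabet and the BLOSUM50 matrix, and the function names are not taken from the published code.</p>

```python
# Hypothetical toy alphabet and scoring function. The paper uses the
# amino-acid alphabet with BLOSUM50, which is not reproduced here.
ALPHABET = "ACDE"


def toy_score(a, b):
    """Assumed stand-in for W(q_i, d_j): +5 for a match, -3 otherwise."""
    return 5 if a == b else -3


def build_query_profile(query, alphabet=ALPHABET, score=toy_score):
    """Build one row per alphabet symbol.

    profile[s][i] holds the score of database symbol s against query
    position i, so a scan along the query reads one row sequentially.
    """
    return {s: [score(s, q) for q in query] for s in alphabet}


def profile_lookup(profile, db_symbol, i):
    """Score of a database symbol against query position i."""
    return profile[db_symbol][i]


if __name__ == "__main__":
    profile = build_query_profile("ACCA")
    for symbol, row in profile.items():
        print(symbol, row)
```

				<p>The profile is built once per query and reused for every database sequence. Inside the alignment loop, the row for the current database symbol is read position by position instead of indexing the substitution matrix with two residues.</p>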
			</sec>
			<sec>
				<st>
					<p>Smith-Waterman in CUDA</p>
				</st>
				<p>A large number of parallel threads must be launched simultaneously to fully exploit the huge computational power of the GPU. The strategy adopted in our CUDA implementation is to make each GPU thread compute the whole alignment of the query sequence with one database sequence. As explained in the section about the CUDA programming model, the threads are grouped in a <it>grid</it> of <it>blocks</it> when running on the graphics card. To make the most efficient use of the GPU resources, the computing times of all the threads in the same <it>grid</it> must be as close to each other as possible. For this reason we found it important to pre-order the database sequences by length. When running, adjacent threads then align the query sequence with database sequences of the most similar possible sizes.</p>
				<p>The following thread configuration gave the best performance:</p>
				<p>&#8226; number of threads per <it>block</it>: 64</p>
				<p>&#8226; number of <it>blocks</it>: 450</p>
				<p>&#8226; total number of threads per <it>grid</it>: 28800</p>
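				<p>The length ordering and the thread configuration listed above can be illustrated with a short Python sketch. It assumes, as a simplification not stated in the text, that the sorted database is processed in consecutive chunks of one <it>grid</it> each (one sequence per thread). The function names are hypothetical.</p>

```python
THREADS_PER_BLOCK = 64
BLOCKS_PER_GRID = 450
THREADS_PER_GRID = THREADS_PER_BLOCK * BLOCKS_PER_GRID  # 28800


def sort_by_length(database):
    """Order the database sequences by increasing length."""
    return sorted(database, key=len)


def split_into_grids(sorted_db, grid_size=THREADS_PER_GRID):
    """Cut the sorted database into chunks of one sequence per thread.

    Assumption: each chunk corresponds to one kernel launch.
    """
    return [sorted_db[start:start + grid_size]
            for start in range(0, len(sorted_db), grid_size)]


def length_spread(chunk):
    """Difference between the longest and shortest sequence in a chunk."""
    return len(chunk[-1]) - len(chunk[0])


if __name__ == "__main__":
    toy_db = ["MKV", "MKVLAAGIVG", "MK", "MKVLA", "MKVLAAG", "M"]
    chunks = split_into_grids(sort_by_length(toy_db), grid_size=2)
    for chunk in chunks:
        print([len(s) for s in chunk], "spread:", length_spread(chunk))
```

				<p>Because the chunks are taken from a sorted list, the sequences handled together differ little in length, so the threads of one <it>grid</it> finish at nearly the same time.</p>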
				<p>The ordered database is stored in the <it>global</it> GPU memory, while the query-profile is saved into a <it>texture</it>. For each alignment the matrix is computed column by column, with each column running parallel to the query sequence. To compute a column we need all the H and E values from the previous one. We store them in the <it>local memory</it> of the thread. More precisely, we use two buffers: one for the previous values and one for the newly computed ones. At the end of each column we swap them. <it>Local memory</it> is not cached, so it is very important to choose the right access pattern to this space. The GPU is able to read and write up to 128 bits of the <it>local memory</it> with a single instruction. Each thread therefore reads four H and four E values (each 16 bits long) at once from the loading buffer, plus the respective four values from the profile. It computes the four results for the new column and then stores them in the storing buffer. To take full advantage of the memory bandwidth of the graphics card we pack the profile in the <it>texture</it>, saving four successive values (each always smaller than 255) into the four bytes of a single unsigned integer. Thus, each thread can gather all the data needed to compute four cells of the alignment matrix with only two read instructions (one from the local buffer and one from the <it>texture</it>).</p>
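				<p>The packing of four profile values into one unsigned integer can be sketched as follows. This Python illustration makes several assumptions that the text does not specify: little-endian byte order, zero padding of the last word, and profile values already mapped to the range 0 to 255. The function names are hypothetical.</p>

```python
def pack4(values):
    """Pack four values in the range 0..255 into one 32-bit unsigned integer.

    Assumption: the first value goes into the lowest byte (little-endian).
    bytes() raises ValueError for any value outside 0..255.
    """
    if len(values) != 4:
        raise ValueError("expected exactly four values")
    return int.from_bytes(bytes(values), "little")


def unpack4(word):
    """Recover the four byte values from a packed 32-bit word."""
    return list(word.to_bytes(4, "little"))


def pack_profile_row(row, pad=0):
    """Pack one profile row, four positions per word, padding the tail."""
    padded = list(row) + [pad] * (-len(row) % 4)
    return [pack4(padded[i:i + 4]) for i in range(0, len(padded), 4)]


if __name__ == "__main__":
    word = pack4([1, 2, 3, 4])
    print(hex(word), unpack4(word))
    print(pack_profile_row([5, 0, 7, 9, 2]))
```

				<p>For example, <it>pack4([1, 2, 3, 4])</it> yields 67305985 (hexadecimal 04030201), and a single read of that word gives the thread the profile scores for four consecutive query positions.</p>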
				<p>Figure <figr fid="F6">6</figr> shows the pseudo code of the <it>kernel</it> executed by each thread, while Figure <figr fid="F7">7</figr> shows the interactions between the <it>local memory</it> buffers and the query-profile to compute the alignment matrix.</p>
				<fig id="F6">
					<title>
						<p>Figure 6</p>
					</title>
					<caption>
						<p>
							<it>Kernel</it> pseudo code</p>
					</caption>
					<text>
						<p>
							<b>
								<it>Kernel</it> pseudo code</b>. Each thread executes this code on a different database sequence. The pseudo code for the Smith-Waterman implementation consists of an outer loop over the database sequence characters and, nested inside it, an inner loop that performs the basic dynamic programming calculations.</p>
					</text>
					<graphic file="1471-2105-9-S2-S10-6"/>
				</fig>
				<fig id="F7">
					<title>
						<p>Figure 7</p>
					</title>
					<caption>
						<p>Smith-Waterman in CUDA functioning</p>
					</caption>
					<text>
						<p>
							<b>Smith-Waterman in CUDA functioning.</b> Each thread gathers four E and H values from the load buffer (first read operation) and four values from the profile (second read operation: the four values are packaged in a single unsigned integer of the query-profile). The Smith-Waterman algorithm is then applied to these data and the results are saved in the storing buffer (a single write operation). The alignment also requires two supplementary values: an f_north and an h_north. In this case, there is no need to save an entire column, but only two temporary numbers updated at each cell computation.</p>
					</text>
					<graphic file="1471-2105-9-S2-S10-7"/>
				</fig>
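				<p>The per-thread computation described above can be emulated sequentially. The following Python sketch is not the published kernel: it processes one cell at a time rather than four, uses a hypothetical <it>toy_score</it> function instead of the packed query-profile, and assumes the recurrences E = max(E_west - G_ext, H_west - G_init) and F = max(F_north - G_ext, H_north - G_init), which is one common affine-gap convention. It keeps the structure of two column buffers plus the scalar values h_north and f_north.</p>

```python
def toy_score(a, b):
    """Assumed stand-in for the substitution matrix: +5 match, -3 mismatch."""
    return 5 if a == b else -3


def sw_score_columnwise(query, db_seq, score=toy_score, g_init=10, g_ext=2):
    """Best local alignment score, computed column by column.

    Each column corresponds to one database character and runs along
    the query. Only the previous column of H and E values is kept.
    """
    n = len(query)
    load_h, load_e = [0] * n, [0] * n      # previous column (read only)
    store_h, store_e = [0] * n, [0] * n    # current column (written)
    best = 0
    for d in db_seq:                       # outer loop: database characters
        h_north = 0                        # H of the cell above, same column
        f_north = 0                        # F of the cell above, same column
        h_diag = 0                         # H of the upper-left neighbour
        for i in range(n):                 # inner loop: query positions
            e = max(load_e[i] - g_ext, load_h[i] - g_init, 0)
            f = max(f_north - g_ext, h_north - g_init, 0)
            h = max(0, e, f, h_diag + score(query[i], d))
            h_diag = load_h[i]             # becomes the diagonal for row i + 1
            h_north, f_north = h, f
            store_h[i], store_e[i] = h, e
            best = max(best, h)
        # Swap the buffers: the column just written is the next one to read.
        load_h, store_h = store_h, load_h
        load_e, store_e = store_e, load_e
    return best


if __name__ == "__main__":
    print(sw_score_columnwise("ACGT", "ACGT"))
    print(sw_score_columnwise("AAAACCCC", "AAAAGCCCC", g_init=4, g_ext=1))
```

				<p>With the toy scoring, <it>sw_score_columnwise("ACGT", "ACGT")</it> returns 20, that is, four matches at +5 each. Clamping E and F at zero does not change the result, because H is itself bounded below by zero.</p>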
				<p>Before running Smith-Waterman, the implementation automatically detects the computational resources available. Dynamic load balancing is achieved according to the number of devices and their computational power. The database is split into as many segments as there are GPUs. Each GPU then computes the alignment of the query with one database segment. The size of the segment depends on the power of that GPU. The speed of each device is measured after every alignment. For each successive query, the database is repartitioned on the basis of a weighted average of the performance measured during previous runs. Pre-fixed weights are used for the first run.</p>
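				<p>One possible form of this load balancing is sketched below in Python. The details are assumptions made for illustration only: segment sizes are counted in sequences, the weighted average is a simple blend of the previous estimate and the latest measurement with a hypothetical <it>weight</it> parameter, and the rounding remainder goes to the last device. The published implementation may differ on each of these points.</p>

```python
def update_speed(previous, measured, weight=0.5):
    """Blend the earlier speed estimate with the latest measurement.

    Assumption: a fixed weight between 0 and 1 for the new measurement.
    """
    return weight * measured + (1.0 - weight) * previous


def partition_sizes(total_sequences, speeds):
    """Split the database into one segment per GPU, proportional to speed."""
    total_speed = sum(speeds)
    sizes = [int(total_sequences * s / total_speed) for s in speeds]
    # Give the sequences lost to rounding to the last device.
    sizes[-1] += total_sequences - sum(sizes)
    return sizes


if __name__ == "__main__":
    speeds = [1.0, 1.0]                     # pre-fixed weights for the first run
    print(partition_sizes(250296, speeds))
    # Hypothetical measurements after the first query (in MCUPS).
    measured = [1800.0, 1200.0]
    speeds = [update_speed(s, m) for s, m in zip(speeds, measured)]
    print(partition_sizes(250296, speeds))
```

				<p>With two identical devices the split is even, and a faster device receives a proportionally larger segment on the next query.</p>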
			</sec>
		</sec>
		<sec>
			<st>
				<p>List of abbreviations used</p>
			</st>
			<p>CTM &#8211; Close To Metal</p>
			<p>CUDA &#8211; Compute Unified Device Architecture</p>
			<p>SIMD &#8211; Single Instruction, Multiple Data</p>
			<p>GPU &#8211; Graphics Processing Unit</p>
			<p>GPGPU &#8211; General Purpose computing on Graphics Processing Unit</p>
			<p>CPU &#8211; Central Processing Unit</p>
			<p>CUPS &#8211; Cell Updates Per Second</p>
			<p>SSE &#8211; Streaming SIMD Extensions</p>
		</sec>
		<sec>
			<st>
				<p>Competing interests</p>
			</st>
			<p>The authors declare that they have no competing interests.</p>
		</sec>
		<sec>
			<st>
				<p>Authors' contributions</p>
			</st>
			<p>SAM coordinated the study, designed the parallelization strategies and the architecture of the solution, and contributed to writing the manuscript. GV provided the idea for the study, contributed to discussions, analysed the results and contributed to revising the manuscript. Both authors read and approved the final manuscript.</p>
		</sec>
	</bdy>
	<bm>
		<ack>
			<sec>
				<st>
					<p>Acknowledgements</p>
				</st>
				<p>The authors wish to thank Antonio Mariano for helping in the implementation of the algorithm. This work was supported by PRIN2005 to GV.</p>
				<p>This article has been published as part of <it>BMC Bioinformatics</it> Volume 9 Supplement 2, 2008: Italian Society of Bioinformatics (BITS): Annual Meeting 2007. The full contents of the supplement are available online at <url>http://www.biomedcentral.com/1471-2105/9?issue=S2</url>
				</p>
			</sec>
		</ack>
		<refgrp>
			<bibl id="B1">
				<title>
					<p>Improved tools for biological sequence comparison</p>
				</title>
				<aug>
					<au>
						<snm>Pearson</snm>
						<fnm>WR</fnm>
					</au>
					<au>
						<snm>Lipman</snm>
						<fnm>DJ</fnm>
					</au>
				</aug>
				<source>Proc Natl Acad Sci U S A</source>
				<pubdate>1988</pubdate>
				<volume>85</volume>
				<fpage>2444</fpage>
				<lpage>2448</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1073/pnas.85.8.2444</pubid>
						<pubid idtype="pmpid" link="fulltext">3162770</pubid>
						<pubid idtype="pmcid">280013</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B2">
				<title>
					<p>Gapped BLAST and PSI-BLAST: a new generation of protein database search programs</p>
				</title>
				<aug>
					<au>
						<snm>Altschul</snm>
						<fnm>SF</fnm>
					</au>
				</aug>
				<source>Nucleic Acids Res</source>
				<pubdate>1997</pubdate>
				<volume>25</volume>
				<fpage>3389</fpage>
				<lpage>3402</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">146917</pubid>
						<pubid idtype="pmpid" link="fulltext">9254694</pubid>
						<pubid idtype="doi">10.1093/nar/25.17.3389</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B3">
				<title>
					<p>Striped Smith-Waterman speeds database searches six times over other SIMD implementations</p>
				</title>
				<aug>
					<au>
						<snm>Farrar</snm>
						<fnm>M</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2007</pubdate>
				<volume>23</volume>
				<issue>2</issue>
				<fpage>156</fpage>
				<lpage>161</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1093/bioinformatics/btl582</pubid>
						<pubid idtype="pmpid" link="fulltext">17110365</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B4">
				<title>
					<p>Bio-Sequence Database Scanning On GPU</p>
				</title>
				<aug>
					<au>
						<snm>Liu</snm>
						<fnm>W</fnm>
					</au>
					<au>
						<snm>Schmidt</snm>
						<fnm>B</fnm>
					</au>
					<au>
						<snm>Voss</snm>
						<fnm>G</fnm>
					</au>
					<au>
						<snm>Schroeder</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Muller-Wittig</snm>
						<fnm>W</fnm>
					</au>
				</aug>
				<source>Proceedings of the 20th IEEE International Parallel &amp; Distributed Processing Symposium (IPDPS 2006), HICOMB Workshop</source>
				<publisher>Rhodes Island, Greece</publisher>
				<pubdate>2006</pubdate>
			</bibl>
			<bibl id="B5">
				<title>
					<p>Identification of common molecular subsequences</p>
				</title>
				<aug>
					<au>
						<snm>Smith</snm>
						<fnm>TF</fnm>
					</au>
					<au>
						<snm>Waterman</snm>
						<fnm>MS</fnm>
					</au>
				</aug>
				<source>J Mol Biol</source>
				<pubdate>1981</pubdate>
				<volume>147</volume>
				<fpage>195</fpage>
				<lpage>197</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1016/0022-2836(81)90087-5</pubid>
						<pubid idtype="pmpid" link="fulltext">7265238</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B6">
				<title>
					<p>An improved algorithm for matching biological sequences</p>
				</title>
				<aug>
					<au>
						<snm>Gotoh</snm>
						<fnm>O</fnm>
					</au>
				</aug>
				<source>J Mol Biol</source>
				<pubdate>1982</pubdate>
				<volume>162</volume>
				<fpage>705</fpage>
				<lpage>708</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1016/0022-2836(82)90398-9</pubid>
						<pubid idtype="pmpid" link="fulltext">7166760</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B7">
				<title>
					<p>NVidia CUDA</p>
				</title>
				<note>[<url>http://developer.NVidia.com/object/cuda.html</url>]</note>
			</bibl>
			<bibl id="B8">
				<title>
					<p>AMD CTM</p>
				</title>
				<note>[<url>http://ati.amd.com/products/streamprocessor/index.html</url>]</note>
			</bibl>
			<bibl id="B9">
				<title>
					<p>Six-fold speed-up of Smith-Waterman sequence database searches using parallel processing on common microprocessors</p>
				</title>
				<aug>
					<au>
						<snm>Rognes</snm>
						<fnm>T</fnm>
					</au>
					<au>
						<snm>Seeberg</snm>
						<fnm>E</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2000</pubdate>
				<volume>16</volume>
				<fpage>699</fpage>
				<lpage>706</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1093/bioinformatics/16.8.699</pubid>
						<pubid idtype="pmpid" link="fulltext">11099256</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B10">
				<title>
					<p>Recent developments in linear-space alignment methods: a survey</p>
				</title>
				<aug>
					<au>
						<snm>Chao</snm>
						<fnm>KM</fnm>
					</au>
					<au>
						<snm>Hardison</snm>
						<fnm>RC</fnm>
					</au>
					<au>
						<snm>Miller</snm>
						<fnm>W</fnm>
					</au>
				</aug>
				<source>J Comput Biol</source>
				<pubdate>1994</pubdate>
				<volume>1</volume>
				<issue>4</issue>
				<fpage>271</fpage>
				<lpage>291</lpage>
			</bibl>
			<bibl id="B11">
				<title>
					<p>Profile analysis: detection of distantly related proteins</p>
				</title>
				<aug>
					<au>
						<snm>Gribskov</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>McLachlan</snm>
						<fnm>AD</fnm>
					</au>
					<au>
						<snm>Eisenberg</snm>
						<fnm>D</fnm>
					</au>
				</aug>
				<source>Proc Natl Acad Sci U S A</source>
				<pubdate>1987</pubdate>
				<volume>84</volume>
				<fpage>4355</fpage>
				<lpage>4358</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1073/pnas.84.13.4355</pubid>
						<pubid idtype="pmpid" link="fulltext">3474607</pubid>
						<pubid idtype="pmcid">305087</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B12">
				<title>
					<p>Amino acid substitution matrices from protein blocks</p>
				</title>
				<aug>
					<au>
						<snm>Henikoff</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Henikoff</snm>
						<fnm>JG</fnm>
					</au>
				</aug>
				<source>Proc Natl Acad Sci U S A</source>
				<pubdate>1992</pubdate>
				<volume>89</volume>
				<fpage>10915</fpage>
				<lpage>10919</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1073/pnas.89.22.10915</pubid>
						<pubid idtype="pmpid" link="fulltext">1438297</pubid>
						<pubid idtype="pmcid">50453</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B13">
				<title>
					<p>A general method applicable to the search for similarities in the amino acid sequence of two proteins</p>
				</title>
				<aug>
					<au>
						<snm>Needleman</snm>
						<fnm>SB</fnm>
					</au>
					<au>
						<snm>Wunsch</snm>
						<fnm>CD</fnm>
					</au>
				</aug>
				<source>J Mol Biol</source>
				<pubdate>1970</pubdate>
				<volume>48</volume>
				<fpage>443</fpage>
				<lpage>453</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1016/0022-2836(70)90057-4</pubid>
						<pubid idtype="pmpid" link="fulltext">5420325</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B14">
				<title>
					<p>Searching protein sequence libraries: comparison of the sensitivity and selectivity of the Smith-Waterman and FASTA algorithms</p>
				</title>
				<aug>
					<au>
						<snm>Pearson</snm>
						<fnm>W</fnm>
					</au>
				</aug>
				<source>Genomics</source>
				<pubdate>1991</pubdate>
				<volume>11</volume>
				<fpage>635</fpage>
				<lpage>650</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1016/0888-7543(91)90071-L</pubid>
						<pubid idtype="pmpid" link="fulltext">1774068</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B15">
				<title>
					<p>Rapid and sensitive sequence comparison with FASTP and FASTA</p>
				</title>
				<aug>
					<au>
						<snm>Pearson</snm>
						<fnm>W</fnm>
					</au>
				</aug>
				<source>Methods Enzymol</source>
				<pubdate>1990</pubdate>
				<volume>183</volume>
				<fpage>63</fpage>
				<lpage>98</lpage>
				<xrefbib>
					<pubid idtype="pmpid">2156132</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B16">
				<title>
					<p>The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000</p>
				</title>
				<aug>
					<au>
						<snm>Bairoch</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Apweiler</snm>
						<fnm>R</fnm>
					</au>
				</aug>
				<source>Nucleic Acids Res</source>
				<pubdate>2000</pubdate>
				<volume>28</volume>
				<fpage>45</fpage>
				<lpage>48</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">102476</pubid>
						<pubid idtype="pmpid" link="fulltext">10592178</pubid>
						<pubid idtype="doi">10.1093/nar/28.1.45</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
		</refgrp>
	</bm>
</art>
