Department of Information Technology, Clayton State University, Morrow, GA 30260, USA

Department of Computer Science, Georgia State University, Atlanta, GA 30303, USA

Department of Computer Science, Sun Yat-sen University, P.R.C

Abstract

Background

One of the most fundamental and challenging tasks in bio-informatics is to identify related sequences and their hidden biological significance. The most popular and proven method for this task is aligning multiple sequences together. However, multiple sequence alignment is a computationally intensive task. In addition, advances in DNA/RNA and protein sequencing techniques have created a vast number of sequences to be analyzed, exceeding the capability of traditional computing models. Therefore, an effective parallel multiple sequence alignment model capable of resolving these issues is in great demand.

Results

We design ^{4}) to ^{3}) processing units for scoring schemes that use three distinct values for match/mismatch/gap-extension. The general solution to multiple sequence alignment algorithm takes ^{4}) processing units and completes in

Conclusions

To our knowledge, this is the first time the progressive multiple sequence alignment algorithm is completely parallelized with ^{3}) processing units. This is a big improvement over the current best constant-time algorithm that uses ^{4}) processing units.

Background

The advancement of DNA/RNA and protein sequencing and sequence identification has created numerous databases of sequences. One of the most fundamental and challenging tasks in bio-informatics is to identify related sequences and their hidden biological significance. Aligning multiple sequences together provides researchers with one of the best solutions to this task. In general, multiple sequence alignment can be defined as:

**Definition 1**

_{1}_{2},..., _{m}), _{1}, _{2}, ..., _{m}) _{i }∪ '-'

(i) Perform all pair-wise alignments of the input sequences.

(ii) Compute a dendrogram indicating the order in which the sequences are to be aligned.

(iii) Pair-wise align two sequences (or two pre-aligned groups of sequences) following the dendrogram starting from the leaves to the root of the dendrogram.

Figure

**A progressive multiple sequence alignment**. An example of progressive multiple sequence alignment. (a) represents three input sequences (S1, S2, S3); (b) shows the pair-wise dynamic programming alignment of two sequences; (c) shows the order in which the sequences are to be aligned, where the leaves on the right-hand side are the input sequences, the internal nodes represent the theoretical ancestors from which the sequences are derived, and the characters on the tree branches represent the missing/mutated residues; and (d) shows the pair-wise dynamic programming alignment of two pre-aligned groups of sequences.

Step (i) can be solved optimally by a dynamic programming (DP) algorithm. There are two versions of DP: the Smith-Waterman's ^{2}) time to complete, including the back-tracking steps. Thus, with ^{2 }^{2}) or ^{4}) if

To generate a dendrogram from the distances between the sequences (or the scores generated from step (i)), either UPGMA ^{3}) run-time complexity.

In the worst case, step (iii) performs (^{4}) via dynamic programming (^{2})) and sum-of-pair scoring function ^{2})). This scoring function is required to evaluate all possible residue matchings of the sequences. As a result, the run-time complexity of step (iii) is ^{4}) ≈ ^{5}), which is the overall run-time complexity of the progressive multiple sequence alignment algorithm.

Optimal pair-wise sequence alignment by dynamic programming

Given two sequences

The recursive formula to compute the DP matrix for the Longest Common Subsequence (LCS) as seen in

Similarly, the Needleman-Wunsch's algorithm

where s(_{i}, _{j}) is the pair-wise symbol matching score of the two symbols _{i }and _{j }from sequences

Smith and Waterman

The alignment can be obtained from the DP matrix by starting from cell c_{n,n} (or the cell containing the max value in the matrix, as in the Smith-Waterman algorithm) and tracking back to the top of the matrix, i.e. cell c_{0,0}, by following the neighboring cells with the largest value.
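The matrix fill and back-tracking described above can be illustrated with a minimal sequential sketch of global (Needleman-Wunsch style) alignment. This is not the r-mesh design of this study: the function name and the default scores (1 for a match, 0 for a mismatch, -1 for a gap) are assumptions chosen for illustration, and the trace-back checks which neighboring cell reproduces the current value, which is one common way to follow the optimal path.

```python
def needleman_wunsch(x, y, match=1, mismatch=0, gap=-1):
    """Return (score, aligned_x, aligned_y) for a global alignment of x and y."""
    n, m = len(x), len(y)
    # c[i][j] holds the best score for aligning x[:i] with y[:j].
    c = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        c[i][0] = i * gap
    for j in range(1, m + 1):
        c[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            c[i][j] = max(c[i - 1][j - 1] + s,   # align x[i-1] with y[j-1]
                          c[i - 1][j] + gap,     # gap in y
                          c[i][j - 1] + gap)     # gap in x

    # Trace back from c[n][m] to c[0][0].
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if x[i - 1] == y[j - 1] else mismatch
            if c[i][j] == c[i - 1][j - 1] + s:
                ax.append(x[i - 1])
                ay.append(y[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and c[i][j] == c[i - 1][j] + gap:
            ax.append(x[i - 1])
            ay.append('-')
            i -= 1
        else:
            ax.append('-')
            ay.append(y[j - 1])
            j -= 1
    return c[n][m], ''.join(reversed(ax)), ''.join(reversed(ay))


score, ax, ay = needleman_wunsch("ACGT", "AGT")
print(score, ax, ay)
```

When several predecessors tie, this sketch prefers the diagonal move; any tie-breaking rule yields an alignment with the same score.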

Existing parallel implementations

Progressive multiple sequence alignment algorithms are widely parallelized, mostly because they perform

The two most notable parallel versions of dynamic programming algorithm are proposed by Huang

Independently, Huang et al. _{i-1, j-1 }and _{i-1, j }are available before the calculation of cell _{i,j}. The value of _{i, j-1 }can be obtained by performing prefix-sum across all cells in row ^{th}. Thus, with

In addition, the construction of a dendrogram can be parallelized as in ^{3}) time.

Furthermore, there are attempts to parallelize the progressive alignment step [step (iii)] as in

Overall, the major speedups achieved from these implementations come from two parallel tasks: performing ^{2}^{2}) to ^{4}) to ^{3}^{4}), [or ^{4}) if ^{3}^{2})). To address these issues, we design our parallel progressive multiple sequence alignment on a reconfigurable mesh (r-mesh) computing model similar to the ones used in

Reconfigurable-mesh computing models - (r-mesh)

A Reconfigurable mesh (r-mesh) computing, first proposed by Miller et al

**Port configurations on reconfigurable computing model**. Allowable configurations on 4 port processing units; (a) shows the ports directions; (b) shows the 15 possible port connections, where the last five port configurations in curly braces are not allowed in Linear r-mesh (Lr-mesh) models.

There are many reconfigurable computing models such as Linear r-mesh (Lr-mesh), Processor Array with Reconfigurable Bus System (PARBS), Pipelined r-mesh (Pr-mesh), Field-programmable Gate Array (FPGA), etc. These models differ in many ways, from construction to the run-time complexity of their operations. For example, the Pr-mesh model does not function properly with configurations containing cycles, while many other models do. However, there are many algorithms to simulate the operations of one reconfigurable model onto another in constant time as seen in

In the scope of this study, we will use a simple electrical r-mesh system, where each processing unit, or processing element (PU or PE), contains four ports and can perform basic routing and arithmetic operations. Most reconfigurable computing models exploit the representation of the data to parallelize their operations, and there are various proposed formats

**Definition 2**

_{0}, _{1}, ..., _{n}_{-1})_{i }= 1 _{i }= 0

For example, the number 3 is represented as 11110000 in the 8-bit 1UN representation.
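To make the example concrete, the following sketch encodes and decodes 1UN numbers. It assumes, based only on the example above (3 becomes 11110000), that a value v is written as v + 1 leading ones followed by zeros; the function names are hypothetical.

```python
def to_1un(value, width):
    """Encode a non-negative integer as a 1UN bit list of the given width."""
    if not 0 <= value < width:
        raise ValueError("value must satisfy 0 <= value < width")
    return [1] * (value + 1) + [0] * (width - value - 1)


def from_1un(bits):
    """Decode a 1UN bit list: the value is the number of ones minus one."""
    return sum(bits) - 1


print(''.join(str(b) for b in to_1un(3, 8)))
```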

In addition to the 1UN unary format, we will be utilizing the following theorem for some of the operations:

**Theorem 1**:

^{c}]

In terms of multiple sequence alignment, the number of bits used in the 1UN notation is correlated to the maximum length of the input sequences. In the next section, we will describe the designs of the r-mesh components used in dynamic programming algorithms.

Parallel pair-wise dynamic programming algorithms

This section begins with the description of several configurations of r-mesh needed to compute various operations in pair-wise dynamic programming algorithm. Following the r-mesh constructions is a new constant-time parallel dynamic programming algorithm for Needleman-Wunsch's, Smith-Waterman, and the Longest Common Subsequence (LCS) algorithms.

R-mesh max switches

One of the operations in the dynamic programming algorithm requires the capability to select the largest value from a set of input numbers. Following is the design of an r-mesh switch that can select the maximum value from an input triplet in the same broadcasting step. For 1-bit data, the switch can be built as in Figure

**1-bit max switches**. Two 1-bit max switches. (a)- fusing {NSEW} to find the max of two inputs from North and West ports; (b)- construction of a 1-bit 4-input max switch.

**An n-bit 3-input max switch**. An n-bit 3-input max switch, where the rectangle represents a 1-bit 4-input max switch from Figure 3.
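The logic behind such a switch can be sketched in software. In a unary code like 1UN, the larger number has ones wherever the smaller one does, so the maximum of several inputs equals their bit-wise OR. The sketch below assumes that a fused bus carries a signal whenever any of its inputs does, and it reuses the 1UN layout inferred from the example 3 = 11110000; it models the logic only, not the port configurations in the figures, and the function names are hypothetical.

```python
def to_1un(value, width):
    """Encode value as value + 1 leading ones (assumed 1UN layout)."""
    return [1] * (value + 1) + [0] * (width - value - 1)


def max_1un(*numbers):
    """Bit-wise OR of equally wide 1UN numbers, which is their maximum."""
    return [int(any(column)) for column in zip(*numbers)]


triplet = [to_1un(2, 8), to_1un(5, 8), to_1un(3, 8)]
print(max_1un(*triplet))
```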

R-mesh adder/subtractor

Similarly, to get a constant time dynamic programming algorithm we have to be able to perform a series of additions and subtractions in one broadcasting step. Exploiting the properties of 1UN representation, we are presenting an adder/subtractor that can perform an addition or a subtraction of two n-bit numbers in 1UN representation in one broadcasting time. The adder/subtractor is a

**An n-bit adder/subtractor**. An n-bit adder/subtractor that can perform addition or subtraction between two 1UN numbers during a broadcasting time. For additions the inputs are on the North and West borders and the output is on the South border. For subtractions, the inputs are on the West and South borders and the output is on the North border. The number on the West bound is 1-bit left-shifted. The dotted lines represent the omitted processing units that are the same as the ones in the last rows. This figure shows the addition of 3 and 3. Note: the leading 1 bit of input number on the West-bound (left) has been shifted off. The right border is fed with zero (or no signal) during the subtract operation.

This adder/subtractor can only handle numbers in 1UN representation, i.e. positive values. Thus, any operation that yields a negative result will be represented as a pattern of all zeros. When this adder/subtractor is used in a DP algorithm, one of the two inputs is already known. For example, to calculate the value at cell _{i, j}, three binary arithmetic operations must be performed: _{i-1, j-1 }+ _{i}, _{j}), _{i-1, j }+ _{i, j-1 }+ _{i}_{j}
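The arithmetic behavior of the adder/subtractor can be mimicked with a small software model. It assumes the 1UN layout inferred from the example 3 = 11110000 (value v is v + 1 leading ones), so adding two numbers amounts to concatenating their runs of ones after dropping one leading 1, and a negative difference collapses to all zeros as stated above. The function names and the overflow handling are assumptions of this sketch, not part of the r-mesh design.

```python
def to_1un(value, width):
    """Encode value as value + 1 leading ones (assumed 1UN layout)."""
    return [1] * (value + 1) + [0] * (width - value - 1)


def add_1un(a, b):
    """Add two 1UN numbers of equal width."""
    width = len(a)
    ones = sum(a) + sum(b) - 1          # one leading 1 is shifted off
    if ones > width:
        raise OverflowError("result does not fit in the given width")
    return [1] * ones + [0] * (width - ones)


def sub_1un(a, b):
    """Subtract b from a; a negative result becomes all zeros."""
    width = len(a)
    ones = max(sum(a) - sum(b) + 1, 0)
    return [1] * ones + [0] * (width - ones)


print(add_1un(to_1un(3, 8), to_1un(3, 8)))   # the 3 + 3 case from the figure
print(sub_1un(to_1un(2, 8), to_1un(5, 8)))   # negative result
```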

For biological sequence alignments, symbol matching scores are commonly obtained from substitution matrices such as PAM

Constant-time dynamic programming on r-mesh

The dynamic programming techniques used in the Longest Common Subsequence (LCS), Smith-Waterman's and Needleman-Wunsch's algorithms are very similar. Thus, a DP r-mesh designed to solve one problem can be modified to solve another with minimal configuration changes. We present the solution for the latter two cases first, and then show a simple modification of that solution to solve the first case.

Smith-Waterman's and Needleman-Wunsch's algorithms

Although the number representation can be converted from one format to another in constant time ^{2}). To eliminate this format conversion all the possible symbol matching scores, or scoring matrix, (4 × 4 for RNA/DNA sequences and 20 × 20 for protein sequences) are pre-scaled up to positive values. Thus, an alignment of any pair of residue symbols will yield a positive score; and gap matching (or insert/delete) is the only operation that can reduce the alignment score in preceding cells. Nevertheless, if the value in cell _{i-1, j }(or _{i,j-1}) is smaller than the magnitude of the gap cost (|_{i,j }since the addition of the positive value in cell _{i-1, j-1 }and the positive symbol matching score _{i}, _{i}) is always greater than or equal to zero.

In general, we do not have to perform this scale-up operation for DNA, since DNA/RNA scoring schemes generally use only two values: a positive integer value for a match and the same cost for both mismatch and gap.

Unlike DNA, scoring protein residue alignment is often based on scoring scoring/substitution/mutation matrices such as that in _{ij }is the frequency or the percentage of residue _{i }and _{j }are background probabilities which residues ^{β}, where

A simple mechanism to obtain a scaled-up version of a scoring matrix is: (a) taking the antilog of the scoring matrix and

When these scaled-up scoring matrices are used, the Smith-Waterman's algorithm must be modified.

Instead of setting sub-alignment scores to zeros when they become negative, these scores are set to

Using scaled-up scoring matrices will eliminate the need for signed number representation in our following algorithm designs. However, if there is a need to obtain the alignment score based on the original scoring matrices, the score can be calculated as follows: (i) load the original score matrix and gap cost to each cell on an r-mesh as similar to the one described in Section; (ii) configure cells on the diagonal path to use their corresponding matching score from the matrix and other cells representing gap insertions or deletions to use gap cost; (iii) calculate the prefix-sum of all the cells on the path representing the alignment using Theorem 1.

With the adder/subtractor units and the switches ready, the dynamic programming r-mesh (DP r-mesh) can be constructed with each cell c_{i,j} in the DP matrix containing 3 adder/subtractor units and a 3-input max switch, allowing it to propagate the max value of cells c_{i-1,j-1}, c_{i-1,j} and c_{i,j-1} to cell c_{i,j} in the same broadcasting step. Figure

**A dynamic programming r-mesh**. Each cell c_{i,j} is a combination of a 3-input max switch and three adder/subtractor units. The "+" and "-" represent the actual functions of the adder/subtractor units in the configuration.

A 1 × n adder/subtractor unit can perform increments and decrements in the range of [-1,0,1]. As a result, a DP r-mesh can be built with 1-bit input components to handle all pair-wise alignments using constant scoring schemes that can be converted to [-1,0,1] range. For instance, the scoring scheme for the longest common subsequence rewards 1 for a match and zero for mismatch and gap extension.

To align two sequences, _{i, j }loads or computes its symbol matching score for the symbol pair at row _{0,0 }to its neighboring cells _{0,1}, _{1,0}, and _{1,1 }to activate the DP algorithm on the r-mesh. The values coming from cells _{i-1, j }and _{i, j-1 }are subtracted with the gap costs. The value coming from _{i-1, j-1 }is added with the initial symbol matching score in _{i, j}. These values will flow through the DP r-mesh in one broadcasting step, and cell _{n, n }will receive the correct value of the alignment.

In terms of time complexity, this dynamic programming r-mesh takes a constant time to initialize the DP r-mesh and one broadcasting time to compute the alignment. Thus, its run-time complexity is ^{3}) processing units.

To handle all other scoring schemes, ^{4}) processing units.

Longest common subsequence (LCS)

The complication of signed numbers does not exist in the longest common subsequence problem. The arithmetic operation in LCS is a simple addition of 1 if there is a match. The same dynamic programming r-mesh as seen in Figure

To find the longest common subsequence between two sequences _{i-1, j-1 }is fed into the North-West processing unit, and the other values are fed into the North-East unit. Then, _{i, j }loads in its symbols and fuses the South-East processing unit (in bold) as NS,E,W if the symbols at row _{i-1, j-1 }or the max value of cells _{i-1, j }and _{i, j-1 }to pass through. These are the only changes for the DP r-mesh to solve the LCS problems.
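For reference, the selection rule described above (pass the diagonal value plus 1 on a symbol match, otherwise pass the larger of the upper and left values) corresponds to the classic sequential LCS recurrence. The sketch below is a plain software version with a hypothetical function name; it does not model the port fusing on the r-mesh.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of x and y."""
    n, m = len(x), len(y)
    c = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                # Match: take the diagonal value and add 1.
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                # No match: pass through the larger of the two neighbors.
                c[i][j] = max(c[i - 1][j], c[i][j - 1])
    return c[n][m]


print(lcs_length("AGGTAB", "GXTXAYB"))
```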

**A 4-way max switch**. A configuration of a 4-way max switch to solve the longest common subsequence (lcs). The South-East processing unit (in bold) configures {NS,E,W} if the symbols at row

This modified constant-time DP r-mesh used ^{3}) processing units. However, this is an order of reduction comparing the current best constant parallel DP algorithm that uses an r-mesh of size ^{2}) × ^{2})

Affine gap cost

Affine gap cost (or penalty) is a technique where the opening gap has different cost from an extending gap

To handle affine gap cost, we need to extend the representation of the number by 1 bit (the rightmost bit). This bit indicates whether a value coming from c_{i-1,j} or c_{i,j-1} to c_{i,j} has already been charged for an opening gap. If the incoming value has been gap-penalized, its rightmost bit is 1, and it will not be charged with an opening gap again; otherwise, an opening gap will be applied. The original "-" units must be modified to accommodate affine gaps. Figure

**A configuration for selecting a min value**. A configuration to select one of the two inputs in 1UN notation using the right most bit as a selector
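The flag-bit idea for affine gaps can be illustrated with a small software model in which each value travels together with a one-bit marker. The names `apply_gap`, `apply_match`, `gap_open`, and `gap_extend` are hypothetical, the costs are treated as positive magnitudes, and the clamp at zero mirrors the unsigned 1UN arithmetic; this is a sketch under those assumptions, not the modified "-" unit itself.

```python
def apply_gap(value, gapped, gap_open, gap_extend):
    """Charge a gap move on a (value, flag) pair.

    If the flag is already set, the value has paid for an opening gap,
    so only the extension cost is charged; otherwise the opening cost is.
    The returned flag is always set.
    """
    cost = gap_extend if gapped else gap_open
    return max(value - cost, 0), True


def apply_match(value, score):
    """A diagonal move adds the matching score and clears the flag."""
    return value + score, False


state = (10, False)
state = apply_gap(*state, gap_open=3, gap_extend=1)   # opening gap
state = apply_gap(*state, gap_open=3, gap_extend=1)   # extending gap
print(state)
```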

The modification of the dynamic programming r-mesh to handle affine gap cost requires additional 2 adder/subtractor units, 2 on/off switches, and one 2-input max switch. Asymptotically, the amount of processing units used is still bounded by ^{4}) and the run-time complexity remains

R-mesh on/off switches

To handle affine gap cost in dynamic programming, we need a switch that can select, i.e. turns on or off, the output ports of a data flow. The on/off r-mesh switch can be configured as in Figure

**An n × n + 1 n-bit on/off switch**. By default, all processing units on the last column (column

This r-mesh configuration uses (^{2}), processing units to turn off the flow of an

Dynamic programming back-tracking on r-mesh

The pair-wise alignment is obtained by following the path leading to the overall optimal alignment score, or the end of the alignment. In the case of the Needleman-Wunsch's algorithm, cell _{n, n }holds this value; and in the case of the Smith-Waterman's algorithm, cell _{i, j }with the maximum alignment score is the end point. The cell with the largest value can be located in

1. Initially, the DP matrix with calculated values are stored in the first slice of the r-mesh cube, i.e. in cells _{i, j,0}, 0 <

2. _{i, j,0 }sends its value to _{i, j, i}, 0 ≤

3. _{i, j, i }sends its value to _{0, j, k}, i.e. to move the solution values to the first row of each 2D r-mesh slice.

4. Each 2D r-mesh slice finds its max value _{0, r, k }where

5. _{0, r, k }sends _{k,0,0}, i.e. each 2D r-mesh slice sends its max value column number

6. The first 2D r-mesh slice, _{i, j,0}, finds the max value of _{i0,0 }(i.e. value r received from the previous step). The row and column indices of the max value found in this step is the location of the max value in the original DP r-mesh.

These above steps rely on the capability to find the max value from

1. initially, the values are stored in the first row of the r-mesh.

2. _{0, j }broadcasts its value, namely _{j}, to _{i, j}, (column-wise broadcasting).

3. _{i, i }broadcasts its value, namely _{i}, to _{i, j }(row-wise broadcasting).

4. _{i, j }sets a flag bit _{i }>_{j}; otherwise sets

5. _{0, j }is holding the max value if
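The steps above follow the familiar all-pairs comparison scheme for finding a maximum in constant parallel time. The sketch below simulates it sequentially. It assumes that the flag at position (i, j) is set when the value broadcast along row i exceeds the value broadcast down column j, and that a column with no flag set holds the max; these conditions are only partially legible in the steps above, so treat them as assumptions. The function name is hypothetical.

```python
def rmesh_max(values):
    """Return (index, value) of the maximum using all-pairs flags."""
    n = len(values)
    # flag[i][j] is 1 when values[i] > values[j]; on an r-mesh every
    # cell would compute its own flag simultaneously.
    flag = [[1 if values[i] > values[j] else 0 for j in range(n)]
            for i in range(n)]
    for j in range(n):
        # Column j holds the max when no other value beats values[j].
        if not any(flag[i][j] for i in range(n)):
            return j, values[j]
    return None  # only reached for an empty input


print(rmesh_max([3, 9, 4, 9, 1]))
```

With ties, this sketch reports the first column that holds the maximum.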

The location of the max value in the DP r-mesh can be obtained in

To trace back the path leading to the optimal alignment, we start with the end point cell c_{e,d} found above and follow these steps:

1. _{i, j}, (_{i, j+1}, _{i+ 1, j}, _{i+1, j+1}. Thus, each cell can receive up to three values coming from its North, West, and Northwest borders.

2. c_{i,j} finds the max of the inputs and fuses its port to the neighbor cell that sent the max value in the previous step. If more than one port could be fused (this happens when there are multiple optimal alignments), c_{i,j} randomly selects one.

3. c_{e,d} sends a signal to its fused port. The optimal pair-wise alignment is the ordered list of cells through which this signal travels.

Each operation in the back-tracking process takes ^{3 }processing units and takes

Progressive multiple sequence alignment on r-mesh

In this section, we start by describing a parallel algorithm to generate a dendrogram, or guiding tree, representing the order in which the input sequences should be aligned. Then we will show a reworked version of sum-of-pair scoring method that can be performed in constant time on a 2D r-mesh. Finally, we will describe our parallel progressive multiple sequence alignment algorithm on r-mesh along with its complexity analysis.

Hierarchical clustering on r-mesh

The parallel neighbor-joining (NJ)

The following are the actual steps to build the dendrogram:

1. Initially, all the pair-wise distances are given in form of a matrix

2. Calculate the average distance from node

3. The pair of nodes with the shortest distance (_{ij}, where _{ij }= _{ij }- _{i }- _{j}.

4. A new node _{j,u }= _{ij}-_{iu}.

5. The distance matrix _{vu }= _{iv }+ _{jv }- _{ij}.

These steps are repeated for
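As a sequential reference for the steps above, the sketch below follows the textbook neighbor-joining procedure: average distances, selection of the pair minimizing d_ij - r_i - r_j, branch lengths to the new node, and the distance update (d_iv + d_jv - d_ij) / 2. The symbol names, the generated labels `U0`, `U1`, ..., and the use of the minimum criterion are assumptions of this sketch; it is not the parallel r-mesh version.

```python
def neighbor_joining(labels, d):
    """Return the list of joins (i, j, new_node, d_iu, d_ju).

    labels: node names; d: symmetric distance matrix (list of lists).
    """
    nodes = list(labels)
    dist = {(a, b): float(d[i][j])
            for i, a in enumerate(nodes) for j, b in enumerate(nodes)}
    joins = []
    while len(nodes) > 2:
        k = len(nodes)
        # Average distance from each node to all the others.
        r = {a: sum(dist[a, b] for b in nodes if b != a) / (k - 2)
             for a in nodes}
        # Pair with the smallest adjusted distance d_ij - r_i - r_j.
        pairs = [(a, b) for x, a in enumerate(nodes) for b in nodes[x + 1:]]
        i, j = min(pairs, key=lambda p: dist[p] - r[p[0]] - r[p[1]])
        u = "U%d" % len(joins)          # hypothetical label for the new node
        d_iu = (dist[i, j] + r[i] - r[j]) / 2
        d_ju = dist[i, j] - d_iu
        # Distances from the new node to every remaining node.
        for v in nodes:
            if v not in (i, j):
                dist[u, v] = dist[v, u] = (dist[i, v] + dist[j, v] - dist[i, j]) / 2
        dist[u, u] = 0.0
        nodes = [v for v in nodes if v not in (i, j)] + [u]
        joins.append((i, j, u, d_iu, d_ju))
    return joins


D = [[0, 5, 9, 9, 8],
     [5, 0, 10, 10, 9],
     [9, 10, 0, 8, 7],
     [9, 10, 8, 0, 3],
     [8, 9, 7, 3, 0]]
for join in neighbor_joining("abcde", D):
    print(join)
```

The order of the joins is the order in which the sequences (or groups) would be aligned in the progressive step.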

Step 1 and 4 are constant time operations on an

Before proceeding to step 2, we should reexamine some facts. First, the maximum alignment score from all the pair-wise DP sequence alignments are bounded by ^{2}, where ^{2 }occurs only if we align a sequence of these symbols to itself. ^{2}. Thus, the sum of ^{4}. These facts allow us to calculate the sums in step 2 in ^{3 }processing units, to complete in

In step 3, each processing unit computes value _{ij }locally. The max value can be found in constant time using the same technique described in the back-tracking section.

Similarly, step 5 is performed locally by the processing units in the r-mesh in

Constant run-time sum-of-pair scoring method

The third step [step (iii)] of the progressive MSA algorithm follows the dendrogram built in the earlier step to perform pair-wise dynamic programming alignment on two pre-aligned groups of sequences. The dynamic programming alignment algorithm in this step is exactly the same as the one in step (i); however, quantifying a match between two columns of residues is no longer a simple constant look-up, unless the hierarchical expected probability (HEP) matching scoring scheme is used

where _{i }and _{j }are residue symbols from columns _{i}, _{j}) is the matching score between these two symbols _{i }and _{j}. For example, to calculate the sum-of-pair of the following two columns

and

we will have to score 15 residue pairs:(A,C), (A,T), (A,G), (A,T), (A,T), (C,T), (C,G), (C,T), (C,T), (T,G), (T,T), (T,T), (G,T), (G,T), (T,T). Since the matching between residue

where _{i }and _{j }are the total count of symbols/types

Thus, the sum-of-pair score of the two columns given above will be:

This scoring function can be implemented on an array of _{k}, where _{k }sums the 1's it receives. The sum-of-pair score can be computed between the pairs of processing units containing a sum larger than 0 calculated from previous steps. All of these steps are carried out in constant time. There are ^{2 }possible pair-wise column arrangements of two pre-aligned groups of sequences of max length ^{2 }processing units.
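The regrouping of the sum-of-pair score by symbol counts can be checked with a short sketch. It compares the direct sum over all residue pairs with a count-based sum in which each pair of distinct symbol types contributes n_i × n_j × s(i, j) and each symbol type contributes n_i(n_i - 1)/2 × s(i, i). The residues A, C, T, G, T, T are read off the 15 pairs listed above; the symmetric scoring function (1 for identical symbols, 0 otherwise) and the function names are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations


def sp_naive(residues, s):
    """Sum the score over every unordered pair of residues."""
    return sum(s(a, b) for a, b in combinations(residues, 2))


def sp_counted(residues, s):
    """Same score, grouped by symbol type using occurrence counts."""
    counts = Counter(residues)
    total = 0
    for a, n_a in counts.items():
        total += n_a * (n_a - 1) // 2 * s(a, a)          # pairs of the same type
    for a, b in combinations(counts, 2):
        total += counts[a] * counts[b] * s(a, b)         # pairs of distinct types
    return total


def identity_score(a, b):
    """Hypothetical symmetric score: 1 for identical symbols, 0 otherwise."""
    return 1 if a == b else 0


residues = "ACTGTT"
print(len(list(combinations(residues, 2))))
print(sp_naive(residues, identity_score), sp_counted(residues, identity_score))
```

The count-based form touches each symbol type once per pair of types, so its cost depends on the alphabet size rather than on the number of sequences.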

Parallel progressive MSA algorithm and its complexity analysis

The progressive multiple sequence alignment algorithm is a heuristic alignment technique that builds up a final multiple sequence alignment by combining pair-wise alignments, starting with the most similar pair and progressing to the most distant pair. The distance between the sequences can be calculated by dynamic programming algorithms such as the Smith-Waterman or Needleman-Wunsch algorithms (step i). The order in which the sequences should be aligned is represented as a guiding tree and can be calculated via hierarchical clustering algorithms similar to the one described in the hierarchical clustering section (step ii). After the guiding tree is completed, the input sequences can be pair-wise aligned following the order specified in the tree (step iii). In the previous sections, we have described and designed several r-meshes to handle individual operations in the progressive multiple alignment algorithm. Finally, a progressive multiple sequence alignment r-mesh configuration can be constructed. First, the input sequences are pair-wise aligned using the dynamic programming r-mesh described previously. These ^{3}) processing units. Finally, the progressive step, [step (iii)], takes O(m) time using a DP r-mesh. Therefore, the overall run-time complexity of this parallel progressive multiple sequence alignment is ^{4}) processing units to handle all scoring schemes with affine gap cost. And step (i) needs ^{4}) ≈ ^{5}) processing units used.

For alignment problems that use constant scoring schemes without affine gap cost, this parallel progressive multiple sequence alignment algorithm only needs ^{3}) ≈ ^{4}) processing units to complete in

Table

Summary of progressive multiple sequence alignment components

**Component**

**input size**

**processors**

**run-time**

2-input max switch

1 -

1

1 broadcast

4-input max switch

1 -

4

1 broadcast

2-input max switch

1 broadcast

4-input max switch

4

1 broadcast

on/off switch

1 broadcast

adder/subtractor

1 broadcast

DP(const. scoring)

2 sequences, max length =

^{3})

1 broadcast

DP (general scoring)

2 sequences, max length =

^{3}),

1 broadcast

DP back-tracking

Neighbor-Joining

^{3})

Sum-of-pair

2 pre-aligned groups of m sequences

^{2}

MSA(const. scoring)

^{3})

MSA

^{4})

This Table summarizes all the parallel components developed in this study along with their time and CPU complexity.

Conclusions

In this study, we have designed various r-mesh components that can run in one broadcasting step, which enables us to effectively parallelize the progressive multiple sequence alignment paradigm. to align ^{4}) to ^{4}) processing units. For a scoring scheme that rewards 1 for a match, 0 for a mismatch, and -1 for a gap insertion/deletion, our algorithm uses only ^{3}) processing units. Moreover, to our knowledge, we are the first to propose an ^{3}) processing units.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

KN designed parallel models used in this study. YP and GN participated in designing and criticizing the parallel models and their analysis. All authors read and approved the final manuscript.

Acknowledgements

This study is supported by the Molecular Basis of Disease (MBD) at Georgia State University.

This research was also supported in part by CCF-0514750, CCF-0646102, and the National Institutes of Health (NIH) under Grants R01 GM34766-17S1, and P20 GM065762-01A1.

The research of Nong was supported in part by the National Natural Science Foundation of China under Grant 60873056 and the Fundamental Research Funds for the Central Universities of China under Grant 11lgzd04.