Abstract
Background
Multiple sequence alignment is fundamental. Exponential growth in computation time appears to be inevitable when an optimal alignment is required for many sequences. Exact costs of optimum alignments are therefore rarely computed. Consequently much effort has been invested in algorithms for alignment that are heuristic, or explore a restricted class of solutions. These give an upper bound on the alignment cost, but it is equally important to determine the quality of the solution obtained. In the absence of an optimal alignment with which to compare, lower bounds may be calculated to assess the quality of the alignment. As more effort is invested in improving upper bounds (alignment algorithms), it is therefore important to improve lower bounds as well. Although numerous cost metrics can be used to determine the quality of an alignment, many are based on sumofpairs (SP) measures and their generalizations.
Results
Two standard and two new methods are considered for using exact 2way and 3way alignments to compute lower bounds on total SP alignment cost; one new method fares well with respect to accuracy, while the other reduces the computation time. The first employs exhaustive computation of exact 3way alignments, while the second employs an efficient heuristic to compute a much smaller number of exact 3way alignments. Calculating all 3way alignments exactly and computing their average improves lower bounds on sum of SP cost in vway alignments. However judicious selection of a subset of all 3way alignments can yield a further improvement with minimal additional effort. On the other hand, a simple heuristic to select a random subset of 3way alignments (a random packing) yields accuracy comparable to averaging all 3way alignments with substantially less computational effort.
Conclusion
Calculation of lower bounds on SP cost (and thus the quality of an alignment) can be improved by employing a mixture of 3way and 2way alignments.
Background
Let Σ be a finite alphabet, and consider v words (sequences) w_{1},..., w_{v }over Σ. For 1 ≤ i ≤ v, let ℓ_{i }denote the length of word w_{i}, and let σ_{ij }denote the jth character of w_{i}. A string x_{i }over Σ ∪ {} is an extension of w_{i }of length M if there exist indices 1 ≤ d_{1 }<d_{2 }< ⋯
An alignment of two words w_{i }and w_{j }consists of an extension x_{i }∈ Ext(w_{i}, M) and an extension x_{j }∈ Ext(w_{j}, M). The cost of such an alignment, denoted cost(x_{i}, x_{j}), is determined position by position. In the simplest case, a position contributes a cost of 1 if the extensions do not have the same character in that position, 0 otherwise. Numerous different cost measures are of interest, but for our purposes counting matches of a symbol of Σ against – (indels) and matches of two different symbols of Σ (substitutions) suffices.
Pairwise alignment of sequences w_{i }and w_{j }chooses M, x_{i }∈ Ext(w_{i}, M) and x_{j }∈ Ext(w_{j}, M), to obtain an alignment with SP cost cost(x_{i}, x_{j}). The alignment with lowest cost has SP cost
The alignment of v sequences w_{1},..., w_{v }consists of extensions {x_{i }∈ Ext(w_{i}, M) : 1 ≤ i ≤ v}. The sumofpairs or SP cost is ∑_{1≤i<j≤v }cost(x_{i}, x_{j}), and the multiple alignment with lowest cost has cost
One natural objective is to find a multiple alignment that minimizes the SP cost. Exact methods have, for the most part, employed dynamic programming. In this environment, the execution time grows polynomially with the length of the sequences, but exponentially in the number of sequences. While improvements in time and space utilization are possible (for example, [17]), dynamic programming remains infeasible for large numbers of sequences. Indeed, finding optimal multiple sequence alignments under the SP cost metric is NPhard [810]. Consequently, substantial effort has been invested in finding effective heuristic algorithms (see [4,1115] for surveys). These heuristic methods produce upper bounds on the cost of a multiple sequence alignment with minimum SP cost.
When heuristic methods are employed, how can one assess the quality of the alignment obtained? Feng and Doolittle [16] argue that the main biological motivation for multiple sequence alignment is to reconstruct phylogeny; see [17] for a recent overview. With this in mind, quality relates to the amount of biological information revealed by the alignment. Let us ask a more modest question. How well does the SP cost produced by a specific method compare to the minimum SP cost over all alignments? In other words, what is the accuracy of the upper bound? A precise answer poses a significant challenge; comparison with an optimal solution would require that one be found, negating the need for the comparison. It is natural, therefore, to determine a lower bound on the SP cost, and compare the heuristic method with the lower bound as a measure of accuracy. Lower bounds do not provide alignments; instead, when they are in good agreement with an upper bound, they certify that the upper bound is indeed close to optimum. Lower bounds are also employed in branchandbound strategies for exact alignment [18,19], but our focus is on tools for comparison with upper bounds. A comparison of an upper and a lower bound simultaneously measures the accuracy of the upper bound, and the lower bound. No matter how good an upper bound is, comparison with a poor lower bound fails to certify that the upper bound is close to optimum. Therefore improvements in lower bounds are crucial, not to reveal biological information through alignment directly, but rather to better calibrate an heuristic method that is being developed for biological inferences. Determining effective lower bounds on the SP cost of a multiple sequence alignment has been much less studied than heuristic methods, primarily because a lower bound does not provide a feasible alignment. Nevertheless, an effective lower bound provides the only real means to assess proximity of an upper bound to the optimum. In this paper, we examine a conceptually simple technique for calculating lower bounds on SP costs for alignments. It is based on the observation that vway alignments for small values of v can be efficiently calculated by dynamic programming. The specific methods that we propose combines exact 2 and 3way alignments selected to treat each 2way alignment exactly once. Randomization of the selection of 3way alignments to examine enables us to improve lower bounds from 2way alignments without excessive computational cost.
Results and Discussion
The minimum SP cost, mincost({w_{1},..., w_{v}}), is
A simple lower bound is obtained by treating the alignment as
By way of example, consider the eighteen sequences given next; these are generated from a sequence of 50 bases by random insertions, deletions, and substitutions. Our interest is in determining a lower bound on the sum of SP costs in an 18way alignment of these sequences (see Table 1).
Using dynamic programming on the 153 =
Fractional Alignment
This decomposition of the
gives a fractional SP alignment cost, with the contribution of each pair scaled by its weight in the class of the decomposition. Indeed
is a lower bound on the minimum SP vway alignment cost for any decomposition.
That this is a lower bound on mincost({w_{1},..., w_{v}}) is easy to demonstrate. When the decomposition has a single class, the lower bound given is equal to the SP alignment cost. So consider two classes in the decomposition c' and c". Merge these two classes to form a single class c by setting μ_{c, {i, j} }= μ_{c', {i, j} }+ μ_{c", {i, j} }for every pair {i, j}, removing classes c' and c". We need only verify that, when M is chosen to yield the best fractional alignment for class c (i.e. the extensions x_{1},..., x_{v}),
where y_{1},..., y_{v }form a best fractional alignment (of length M) for c' and z_{1},..., z_{v }a best for c". In applying this general framework, one must calculate fractional alignment costs,
and this appears to necessitate solving many vway alignment problems. Each can be solved by dynamic programming, but unless the
decomposition is suitably chosen the resulting subproblems may be no simpler to solve
than the original. If, in a given class of the decomposition, most weights are zero,
the fractional alignment problem is substantially simplified. In particular, if all
pairs containing a particular sequence have weights equal to zero in this class, the
sequence does not participate in the fractional alignment. Then by choosing weights
within each class so that "few" sequences appear in pairs with nonzero weight, the
alignment task is reduced to one on the few sequences only, and can be solved efficiently
using dynamic programming. Indeed when α =
Improvement by 3way alignment
There is every reason to expect that the best alignment of two sequences may conflict with the best 2way alignments of each with a third. For example, aligning the first three sequences of the eighteen in the example optimally, one obtains the alignment
ATGGGTTGCGTCAGGAGTAAAGAAGCCAAGGGCCCGGCACTGAAGTACCA
ATGGGCTGTATTAAAAGTAAGGAAGACAAAGGACCAGCAATCAAGTACAG
ATGGGCTGCATTAAAAGTAAACAAAAGTCCAGCCATAAAATACAC
Between the first and second sequences, there are 15 substitutions and no indels. Between the second and third, there are 8 substitutions and 5 indels. Between the first and third there are 14 substitutions and 5 indels. The total threeway alignment score is 15 + 13 + 19 = 47. When exact twoway alignment of each pair among the three sequences is done, the alignment of the first two sequences agrees with that in the 3way alignment. The optimal alignment of the second and third sequences does not; it has cost 12 (shown with 7 substitutions and 5 indels):
ATGGGCTGTATTAAAAGTAAGGAAGACAAAGGACCAGCAATCAAGTACAG
ATGGGCTGCATTAAAAGTAAACAAAAGTCCAGCCATAAAATACAC
Moreover, the optimal 2way alignment for the first and third sequences has cost 18 (shown as 13 substitutions and 5 indels):
ATGGGTTGCGTCAGGAGTAAAGAAGCCAAGGGCCCGGCACTGAAGTACCA
ATGGGCTGCATTAAAAGTAAACAAAAGTCCAGCCATAAAATACAC
The sum of the three optimal 2way alignment costs is therefore 45, while the optimal 3way alignment has cost 47. Hence the use of exact 3way alignment provides a better bound; thus rather than computing the lower bound by adding the SP scores for the three pairs of sequences, one can employ an exact 3way alignment for all three sequences. As shown in the example, the difference in SP scores of simultaneous threeway alignment of the three sequences and sum of pairwise scores of the three pairs in this triple is 2. In this example, the best improvement that can be achieved by replacing any three 2way alignments by the corresponding 3way alignment is 3. To establish this, all 816 3way alignments were calculated using dynamic programming, and compared to the sum of 2way alignments within each.
Average 3way alignment
In vway alignment, a reasonable goal therefore is to choose many triples of sequences
for which the 3way alignment improves the lower bound, so as to obtain many small
improvements that aggregate to form a large improvement in the bound. At the extreme,
one could calculate best 3way alignments for all
There are two problems with this approach. First and foremost, while some triples yield improved cost bounds over the sum of the three pairs of the triple, others may not. Simply averaging the contributions of each negates to a degree the improvements attributable to some triples. Secondly, in addition to a substantial increase in cost for each alignment, the calculation involves completing O(v^{3}) alignments rather than O(v^{2}). It is therefore desirable to reduce the number of classes in the decomposition, both to reduce computation time and to potentially improve the bound.
Weighted selection of 3way alignments
We now employ only some of the 3way alignments. Choose a collection
The determination of fractional 3way alignments can be avoided by choosing a collection
Let V be a set of elements and
Packings with λ = 1, i.e. (v, 3, 1)packings, trivially have the property that every pair appears either in zero or in λ = 1 triples of the packing, and hence an SP lower bound can be calculated directly. Since every triple in the packing contains three pairs and every pair is in at most one triple, no (v, 3, 1)packing can contain more than v(v  1)/6 triples. Using such a packing for vway alignment, we therefore employ only O(v^{2}) 3way alignments, rather than the O(v^{3}) to employ all.
The natural question to address is: Which (v, 3, 1)packing should we employ to obtain the bound? To a certain extent, this decision is based on the amount of computational effort that can be expended, and the accuracy of the bound desired. Of course, the best outcome would be to achieve the best accuracy at least computational cost. In certain cases, one might be able to anticipate from exact 2way alignments when a specific 3way alignment is likely to lead to an improvement. For example, given three sequences w_{1}, w_{2}, w_{3}, if the 2way alignments have very different patterns of indels and substitutions, one might expect that the 3way alignment yields higher cost than the three 2way alignments. This appears to be difficult to quantify.
In the absence of a good guide to the selection of 3way alignments that yield the best improvements, two alternatives can be pursued. If accuracy is of paramount concern, we again calculate all 3way alignments. Instead of simply averaging their contributions, we assign to each triple a weight consisting of the difference between the SP cost for the 3way alignment and the sum of the three 2way alignments involved. The best accuracy from the method is obtained by selecting a (v, 3, 1)packing of maximum total weight. This can prove more computationally intensive than simply undertaking all 3way alignments, since choosing a maximum weight packing is NPhard [22]. However, there is no need to obtain a packing of maximum weight; one of "large" weight suffices. As a compromise, we develop a simple heuristic (hillclimbing) to choose a packing of large but not necessarily maximum weight in the Methods section.
In the running example, since an (18,3,1)packing has at most 48 triples, one might hope to obtain a relatively large improvement. The best improvement that we found is 47. Then the hillclimbing algorithm to maximize weight was executed. It yields a collection of 33 triples: four triples with an improvement of 3 each, eleven with an improvement of 2, and thirteen with an improvement of 1. While calculating the average of all 816 alignments increases the lower bound by only 23, this method is comparable in execution time and provides more than twice the improvement in the lower bound.
This demonstrates that, if one is willing to invest the effort to calculate all
Heuristic selection of 3way alignments
If computation time is the more critical, we choose a maximum (v, 3, 1)packing at random, and evaluate the bound. We therefore employ the hillclimbing algorithm (developed in the Methods section) to calculate a packing of maximum size, not using any weight measuring expected improvement at this stage. Then we calculate the optimal 3way alignment for each triple of the packing, to determine the improvement in the lower bound. Our first attempt gave an SP lower bound of 1968 (an improvement of 22), our second a lower bound of 1966 (an improvement of 20), and our third a lower bound of 1974 (an improvement of 28). There is significant variation in repetitions of the method, as a consequence of the random selections made. Naturally, one cannot conclude much from three trials, so we report the distribution of improvements from 10,000 trials of the method (see Table 2).
It may happen that one is so unlucky as to obtain an improvement of only 10, or lucky enough to obtain an improvement of 33. But one expects, within a very small number of trials, to obtain an improvement of 23 or better. Admittedly, when comparing to an improvement of 47, none of the improvements are striking. To a degree, however, this misses the point. The computation time per trial is much lower than the cost of computing all 3way alignments. Hence the real benefit is that an improvement can be obtained that is competitive with the average of all 3way alignments, at a fraction of the cost.
The astute reader can observe that in our 10,000 trials, the average is slightly less than 23, the average of all 816 improvements. This can be attributed to the fact that in a maximum packing on 18 elements, there are nine pairs not appearing in any triple; in a similar experiment with a number of sequences v ≡ 1, 3 (mod 6) (where the maximum packing has each pair in a triple), the averages would be in closer agreement. Moreover, using a metaheuristic search strategy such as simulated annealing could prove beneficial, again at the expense of a substantial increase in computation time.
Conclusion
The calculation of vway alignments that are optimal with respect to sumofpairs cost is challenging, and substantial effort has been invested in the calculation of upper bounds on SP costs. Exact alignment appears to be computationally out of reach when many long sequences are to be aligned, and lower bounds from sums of exact 2way alignments are often too weak to be of assistance. In this paper, we establish a generic lower bound using decompositions into fractional alignments. With this we propose two improvements using exact 3way alignments. The first still calculates all 3way alignments, but uses a hillclimbing technique to choose a packing among these of maximal weight. This improves on the accuracy of the bound at a similar computational cost to the naive method. The second instead reduces the computation time, by choosing at random a packing, and only then computing the 3way alignments for the triples of the packing. In our test cases, the improvements are competitive with calculating the overall average, but are achieved at reduced effort in computation.
It remains of interest to determine whether, from exact 2way alignment data, one can determine which triples are most likely to provide an improvement, prior to calculating the 3way alignments; see [23] for a related investigation. In our experiment, we did not achieve success better than random by classifying the triples based on features of the 2way alignments. Nevertheless, such a strategy may augment the effectiveness of the hillclimbing strategy if such a prediction is feasible.
Methods
Dynamic programming methods for exact 2way and 3way alignment that we use are standard [3]. Here we develop the new methods for selecting 3way alignments used in the computation of lower bounds.
Hillclimbing 1: Packing Pairs into Triples
The main requirement is to select a random (v, 3, 1)maximum packing. Stinson [24] developed a simple but effective local optimization strategy for the case when the
packing
We describe the method more formally next. We must determine in advance the number
of triples in a maximum packing, to provide a stopping criterion. It is well known
that in a maximum (v, 3, 1)packing, the number of pairs not appearing in triples is 0 when v ≡ 1, 3 (mod 6); 4 when v ≡ 5 (mod 6);
if v is odd then
if v ≡ 1, 3 (mod 6) then
ε := 0
else ε : 8;
triplesRemaining : = (v(v  1)  ε)/6;
else triples Remaining : = floor(v(v  2)/6);
D : ∅;
while triplesRemaining > 0 do
begin
Pick a random element x appearing < (v  2)/2 times in D;
Pick random pairs {x, y}, {x, z} not appearing in triples in D;
if pair {y, z} appears in a triple T of D
then begin
D : = D {T};
triplesRemaining : = triplesRemaining + 1
end;
D : D ∪ {{x, y, z}};
triplesRemaining : = triplesRemaining  1
end
Intuitively, the method repeatedly chooses two pairs with a common element, {x, y} and {x, z}, neither appearing in a chosen triple. If {x, z} appears in no triple as well, we can simply add the triple {x, y, z}. Otherwise we delete the unique triple containing {y, z} before making the addition. The method is a hillclimbing technique because it never reduces the number of triples chosen.
The while loop in this method employs as its only stopping criterion that it finds a maximum packing. Stinson [24] incorporates a count of the number of times a "lateral" move is made since the addition of the last triple. When this exceeds a predetermined threshold, the computation is abandoned and restarted. We incorporate this in our implementation as well.
Repeatedly employing this method yields different (v, 3, 1)packings. As each is generated, the relevant 3way alignments can be calculated to determine the lower bound obtained from this packing. There is a natural tradeoff between computation time and the quality of the bound. Only after a packing is chosen does one need to compute the 3way alignments involved, giving a factor of O(v) savings in computation time; however, what has been sacrificed, potentially, is the amount of improvement since the method does not consider (or even know) the improvement until the packing is fully chosen. Of course, one can apply the method repeatedly increasing the computational expense, but with the expectation of improving the accuracy.
Hillclimbing 2: Maximizing weight
Stinson's method can also be adapted to choose a packing whose improvement is "large". To do this, we require that all 3way alignments have been calculated. Then we modify the hillclimbing method in two ways. First, we only adjoin a new triple if it makes a nonzero improvement (for this purpose, we can look up the precomputed 3way alignment cost). Secondly, we only replace one triple by another if it makes an improvement at least as large as the one being removed. One might also consider replacing two or three triples by one, if this yields an improvement. We did not obtain any overall benefit by allowing this. This variant of the Stinson method essentially enables us to find packings of large weight rather than with many triples; it does, however, require that all weights (improvements) be precomputed.
Authors' contributions
Both authors contributed in all phases of the research.
Acknowledgements
Raji Gurunathan and Sandhya Durvasula, graduate students in EFG, conducted initial experiments with random techniques for selecting 3way alignments that led to the current work. Thanks to Sonja Prohaska for helpful discussions. Research of the authors was supported by a State of Arizona Proposition 301 grant to the Initiative for Genome Informatics.
References

Davidson A: A Fast Pruning Algorithm for Optimal Sequence Alignment.
Proceedings of The 2nd IEEE International Symposium on Bioinformatics and Bioengineering (BIBE'2001) 2001, 4956.

Gotoh O: Significant Improvement in Accuracy of Multiple Protein Sequence Alignments by Iterative Refinement as Assessed by Reference to Structural Alignments.

Gusfield D: Efficient methods for multiple sequence alignment with guaranteed error bounds.

Gusfield D: Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press; 1997.

Miller W: Building multiple alignments from pairwise alignments.

Myers EW, Miller W: Optimal alignments in linear space.
Comput Appl Biosci 1988, 4:1117. PubMed Abstract

Bonizzoni P, Vedova GD: The complexity of multiple sequence alignment with SPscore that is a metric.

Just W: Computational complexity of multiple sequence alignment using SP score.
J Comput Biol 2001, 8:615623. PubMed Abstract  Publisher Full Text

Wang L, Jiang T: On the complexity of multiple sequence alignment.

Durbin R, Eddy S, Krogh A, Mitchison G: Biological Sequence Analysis. Cambridge: Cambridge University Press; 1998.

Kumar S, Filipski AJ: Molecular Phylogeny Reconstruction. Oxford University Press; 2000.

Stevens JR, Schofield CJ: Phylogenetics and sequence analysis – some problems for the unwary.
Trends in Parasitology 2003, 19:582588. PubMed Abstract  Publisher Full Text

Thompson JD, Plewniak F, Poch O: A comprehensive comparison of multiple sequence alignment programs.
Nucleic Acids Res 1999, 27:26822690. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Wallace IM, Blackshields G, Higgins DG: Multiple sequence alignments.
Curr Opin Struct Biol 2005, 15:261266. PubMed Abstract  Publisher Full Text

Feng DF, Doolittle RF: Progressive sequence alignment as a prerequisite to correct phylogenetic trees.
J Mol Evol 1987, 25:351360. PubMed Abstract

Kumar S, Filipski AJ: Multiple Sequence Alignment: In Pursuit of Homologous DNA Positions.

Gupta SK, Kececioglu JD, Schaffer AA: Improving the practical space and time efficiency of the shortestpaths approach to sumofpairs multiple sequence alignment.
J Comput Biol 1995, 2:45972. PubMed Abstract

Spouge JL: Speeding up dynamic programming algorithms for finding optimal latticepaths.

Colbourn CJ, Dinitz J, (Eds): CRC Handbook of Combinatorial Designs. Boca Raton FL: CRC Press; 1996.

Colbourn CJ, Rosa A: Triple Systems. Oxford: Oxford University Press; 1999.

Holyer I: The NPcompleteness of some edgepartition problems.

Rosenberg MS: Multiple sequence alignment accuracy and evolutionary distance estimation.
BMC Bioinformatics 2005, 6:278. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Stinson DR: Hillclimbing algorithms for the construction of combinatorial designs.

Gibbons PB: Computational Methods in Design Theory. In CRC Handbook of Combinatorial Designs. Boca Raton FL: CRC Press; 1996:718740.

Colbourn CJ, Mathon R: Leave graphs of small maximal partial triple systems.