Division of Invertebrate Zoology, American Museum of Natural History, New York, NY - 10024, USA

Abstract

Background

The inference of homologies among DNA sequences, that is, positions in multiple genomes that share a common evolutionary origin, is a crucial, yet difficult task facing biologists. Its computational counterpart is known as the multiple sequence alignment problem. There are various criteria and methods available to perform multiple sequence alignments, and among these, the minimization of the overall cost of the alignment on a phylogenetic tree is known in combinatorial optimization as the Tree Alignment Problem. This problem typically occurs as a subproblem of the Generalized Tree Alignment Problem, which looks for the tree with the lowest alignment cost among all possible trees. This is equivalent to the Maximum Parsimony problem when the input sequences are not aligned, that is, when phylogeny and alignments are simultaneously inferred.

Results

For large data sets, a popular heuristic is Direct Optimization (DO). DO provides a good tradeoff between speed, scalability, and competitive scores, and is implemented in the computer program POY. All other competitive algorithms have greater time complexities than DO. Here, we introduce and present experiments for a new algorithm, Affine-DO, that accommodates the indel (alignment gap) models commonly used in phylogenetic analysis of molecular sequence data. Affine-DO has the same time complexity as DO, but is correctly suited for the affine gap edit distance. We demonstrate its performance with more than 330,000 experimental tests. These experiments show that the solutions of Affine-DO are close to the lower bound inferred from a linear programming solution. Moreover, iterating over a solution produced using Affine-DO shows little improvement.

Conclusions

Our results show that Affine-DO is likely producing near-optimal solutions, with approximations within 10% for sequences with small divergence, and within 30% for random sequences, for which Affine-DO produced the worst solutions. The Affine-DO algorithm has the necessary scalability and optimality to be a significant improvement in the real-world phylogenetic analysis of sequence data.

Background

The inference of homologies among DNA sequences, that is, positions in multiple genomes that share a common evolutionary origin, is a crucial, yet difficult task facing biologists. Its computational counterpart is known as the multiple sequence alignment problem. There are various criteria and methods available to perform multiple sequence alignments (e.g.

An important element in sequence alignment and phylogenetic inference is the selection of the edit function, and in particular, the cost

For large data sets, a popular heuristic is Direct Optimization (DO)

The properties of DO and the GTAP (DO+GTAP) for phylogenetic analysis were experimentally evaluated in

In

Although

A comparison of the tree scores of various methods was recently performed in

In this paper, we introduce and present experiments for a new algorithm, Affine-DO. Affine-DO has the same time complexity as DO, but is correctly suited for the affine gap edit distance. We show its performance experimentally, as implemented in POY version 4, with more than 330,000 experimental tests. These experiments show that the solutions of Affine-DO are close to the lower bound inferred from a Linear Programming (LP) solution. Moreover, iterating over a solution produced using Affine-DO has very little impact on the overall solution, a positive sign of the algorithm’s performance.

Although we build Affine-DO on top of the successful aspects of DO, DO has never been formally described, nor have its basic properties been demonstrated. To describe Affine-DO, we first formally define DO and demonstrate some of its properties.

Related Work

The TAP is known to be NP-Hard

Hein

The most important theoretical results for the TAP are several 2-approximation algorithms and a Polynomial Time Approximation Scheme (PTAS)

Direct Optimization (DO)

DO can be implemented with a time complexity of ^{2}|^{2}|

Schwikowski and Vignon

Results and discussion

Direct Optimization

Direct Optimization (DO) has only been described informally in the literature

In TreeAlign and PRODALI, the set of optimal alignments between sequences are represented in an ^{4})), and in its implementation. PRODALI is more expensive in practice, as it not only stores the set of optimal, but also suboptimal alignments.

In DO, only one of the possible alignments is stored, together with all the sequences that can be produced from it. We call such a set of sequences a reduced alignment graph (RAG). Because RAGs are simple, DO can use a more compact representation for them than an alignment graph, which permits greater scalability than that of TreeAlign or PRODALI. DO represents RAGs as sequences in an extended alphabet, so that a complete RAG can be stored in an array.

It is then possible to align RAGs, find the closest sequences contained in them, and compute their RAG with time complexity $O(n^{2})$. The following section formalizes these ideas.
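As a rough illustration of the array representation, the sketch below stores a RAG as a Python list of sets and enumerates the sequences it contains by picking one element from each set and dropping the indels. The names `INDEL` and `sequences_in_rag`, and the use of `"-"` for an indel, are assumptions made for this example only; they are not taken from POY. The enumeration is exponential in the number of positions and is meant only to make the definition concrete.

```python
from itertools import product

INDEL = "-"  # assumed symbol for an indel (not the paper's notation)


def sequences_in_rag(rag):
    """Return every sequence obtainable from a RAG.

    A RAG is modelled here as a list of sets, one set per alignment
    column. A sequence is produced by choosing one element from each
    set and then removing all indels from the result.
    """
    contained = set()
    for choice in product(*rag):
        contained.add("".join(c for c in choice if c != INDEL))
    return contained


# A three-column RAG: the second column may be skipped (indel),
# and the third column may be either G or T.
example_rag = [{"A"}, {"C", INDEL}, {"G", "T"}]
print(sorted(sequences_in_rag(example_rag)))
```

In practice DO never enumerates these sequences; it only needs the per-column sets, which is what keeps the representation compact.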

Sets of Sequences, Edit Distance, and Medians

The first goal is to find a compact representation of sets of sequences produced in a pairwise alignment. For example, the alignment

**Alignment graphs.** Graphs representing the alignment **a.** A plain alignment graph. **b.** An alignment graph that contains more potential sequences.

The same information can be efficiently stored by using an extended alphabet

We call such representation a Reduced Alignment Graph (RAG). Notice that all the intermediate sequences can be produced by selecting an element from each set in the RAG, and removing all the indels from the resulting sequence. If a sequence can be generated by following this procedure, then we say that the sequence is

Observation 1

Let

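In the original problem definition, we are given a distance $d$. We extend it to sets of elements $A$ and $B$ as $d_{P}(A,B)=\min_{a\in A,b\in B}d(a,b)$. The following observation is by definition:

Observation 2

For all sets $A$ and $B$, there exists an $a\in A$ and a $b\in B$ such that $d(a,b)=d_{P}(A,B)$.

Define the RAG edit distance by setting $d=d_{P}$ in Equation 1.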
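The set distance can be sketched in a few lines, under the assumption that the distance between two sets is the smallest distance between any pair of their elements, which is how we read the definition and Observation 2. The element distance used here (0 for a match, 1 for a substitution, 2 for an indel) and all function names are hypothetical choices for the example.

```python
INDEL = "-"  # assumed symbol for an indel


def element_distance(a, b):
    """Hypothetical element distance: 0 match, 1 substitution, 2 indel."""
    if a == b:
        return 0
    if INDEL in (a, b):
        return 2
    return 1


def set_distance(a_set, b_set, d=element_distance):
    """Smallest distance between any element of a_set and any of b_set."""
    return min(d(a, b) for a in a_set for b in b_set)


def closest_elements(a_set, b_set, d=element_distance):
    """One pair (a, b) that attains set_distance(a_set, b_set)."""
    pairs = ((a, b) for a in a_set for b in b_set)
    return min(pairs, key=lambda pair: d(*pair))


print(set_distance({"A", "C"}, {"C", "G"}))
print(closest_elements({"A", "C"}, {"C", "G"}))
```

Because the minimum is taken over a finite set of pairs, a pair attaining it always exists, which is the content of the observation.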

The sequence edit distance can be computed using dynamic programming

with base cases

We will show that we can efficiently find the closest sequences in a pair of RAGs, as well as their edit distance. Thanks to these properties, a RAG can be used instead of an alignment graph to bound the cost of a tree with lower time complexity.

Lemma 1

For all RAGs

Proof

We define a procedure to produce

**case 1** Select an element _{i} that holds Observation 2 and prepend it to _{k} that is closest to _{P}(_{i},_{j}).

**case 2** Select an element _{i} closest to _{P}(_{i},{

**case 3** Symmetric to case 2.

□

Observe that the overall time complexity remains $O(n^{2})$ as in the original Needleman-Wunsch algorithm
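The construction used in the proof can be sketched as a Needleman-Wunsch style dynamic program over two RAGs, followed by a backtrack that picks the closest elements in every aligned column. This is a minimal sketch under stated assumptions, not the POY implementation: RAGs are lists of sets, the set distance is the minimum over element pairs, and the element costs (0 match, 1 substitution, 2 indel) are hypothetical.

```python
INDEL = "-"  # assumed symbol for an indel
GAP = frozenset({INDEL})


def d(a, b):
    """Hypothetical element distance: 0 match, 1 substitution, 2 indel."""
    if a == b:
        return 0
    return 2 if INDEL in (a, b) else 1


def closest(a_set, b_set):
    """Return (cost, a, b) for a closest pair of elements of two sets."""
    return min((d(a, b), a, b) for a in a_set for b in b_set)


def rag_edit_distance(A, B):
    """Edit distance between two RAGs plus a closest pair of sequences."""
    n, m = len(A), len(B)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = cost[i - 1][0] + closest(A[i - 1], GAP)[0]
    for j in range(1, m + 1):
        cost[0][j] = cost[0][j - 1] + closest(GAP, B[j - 1])[0]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + closest(A[i - 1], B[j - 1])[0],
                cost[i - 1][j] + closest(A[i - 1], GAP)[0],
                cost[i][j - 1] + closest(GAP, B[j - 1])[0],
            )
    # Backtrack, collecting the chosen elements in reverse order.
    x, y, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            c, a, b = closest(A[i - 1], B[j - 1])
            if cost[i][j] == cost[i - 1][j - 1] + c:
                x.append(a); y.append(b); i -= 1; j -= 1
                continue
        if i > 0:
            c, a, _ = closest(A[i - 1], GAP)
            if cost[i][j] == cost[i - 1][j] + c:
                x.append(a); i -= 1
                continue
        y.append(closest(GAP, B[j - 1])[2]); j -= 1
    strip = lambda s: "".join(ch for ch in reversed(s) if ch != INDEL)
    return cost[n][m], strip(x), strip(y)


A = [{"A"}, {"C", INDEL}, {"G", "T"}]
B = [{"A"}, {"T"}]
print(rag_edit_distance(A, B))
```

The table has (n+1)(m+1) cells and each cell inspects a bounded number of set pairs, so for sets of bounded size the running time stays quadratic in the sequence length.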

The DO Algorithm

DO (Algorithm

**Data**: A binary tree

**Data**: An assignment

**Data**:

**Result**:

**begin**

**foreach**
**do**

**foreach**
**do**

**if**
**then**

_{
i
},_{
i
}={_{
i
}}〉

**else**

**Data**:

_{
P
}(

_{
P
}(

**end**

**end**

**return**

**end**

**end**

We have not defined yet _{
P
}(_{
P
}(

Without loss of generality, assume from now on that for all

Lemma 2

Let _{
P
}(_{
P
}(

Proof

Follows directly from the median definition and Lemma 1. □

Lemma 2 is important for the correctness of the DO algorithm. It shows that for every sequence contained in _{P}(_{P} directly to calculate the overall cost of the tree. Without it, _{P} cannot be used for this purpose directly.

Definition 1

Compatible assignments Two assignments ^{∗}and ^{
′
}:^{∗}are ^{
′
}(

The following Theorem shows that the tree cost computed by DO is feasible:

Theorem 1

There exists an assignment of sequences ^{
′
}compatible with

Proof

Let ^{
′
}the final assignment of sequences to the vertices of ^{
′
}(^{∗}is included in ^{
′
}(^{
′
}(

DO is weaker than the alignment graph algorithms

The Affine Gap Cost Case

In practice, biologists use DO because of its scalability and competitive costs. However, the DO algorithm was defined for the non-affine distance functions (_{P} cannot be directly used to correctly bound the cost of a tree.

**Example of suboptimal median.** Let _{P}. It follows that DO, if used directly for the affine gap cost case, can compute an incorrect cost for a given tree.

To overcome this problem, we extend Gotoh’s algorithm
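For reference, the sketch below shows the standard three-matrix affine gap recurrence for two plain sequences, which is the starting point being extended. It is not the RAG version developed in this paper. The cost model (a gap of length k costs `gap_open + k * gap_ext`), the default parameter values, and the function name are assumptions for the example.

```python
import math


def affine_edit_distance(x, y, sub=1, gap_open=3, gap_ext=1):
    """Affine gap edit distance between two plain sequences.

    Assumed cost model: a mismatch costs `sub`, and a block of k
    consecutive indels costs gap_open + k * gap_ext.
    """
    n, m = len(x), len(y)
    inf = math.inf
    # M: x[i-1] aligned to y[j-1]; X: x[i-1] aligned to an indel;
    # Y: y[j-1] aligned to an indel.
    M = [[inf] * (m + 1) for _ in range(n + 1)]
    X = [[inf] * (m + 1) for _ in range(n + 1)]
    Y = [[inf] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = gap_open + i * gap_ext
    for j in range(1, m + 1):
        Y[0][j] = gap_open + j * gap_ext
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = 0 if x[i - 1] == y[j - 1] else sub
            M[i][j] = s + min(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            # Extending an open gap is free of gap_open; starting one pays it.
            X[i][j] = gap_ext + min(
                X[i - 1][j], M[i - 1][j] + gap_open, Y[i - 1][j] + gap_open
            )
            Y[i][j] = gap_ext + min(
                Y[i][j - 1], M[i][j - 1] + gap_open, X[i][j - 1] + gap_open
            )
    return min(M[n][m], X[n][m], Y[n][m])


# One block of two indels (3 + 2) is cheaper than two separate blocks (4 + 4).
print(affine_edit_distance("AT", "ACGT"))
```

The difficulty with sets of elements is that whether a position opens or extends a gap depends on which element is chosen in the neighbouring positions, which is why a direct substitution of the set distance into this recurrence is not sufficient.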

Heuristic Pairwise RAG Alignment

Let _{
P
}, using 4 auxiliary matrices (

The matrices _{
i
} and _{
j
} align elements other than an _{
i
}and _{
j
}. _{
j
} with an indel. Finally, _{
i
} with an indel.

To compute these values, we define a number of accessory functions. The cost of a pure substitution _{
P
}(

There are three remaining accessory functions required to compute the matrices _{
i
} with a gap:

The second function ^{
′
}(

The third, and final accessory function, computes what would be the extra cost of

Finally, the recursive functions for the cost matrices are defined as:

with base cases _{
i
}),1≤_{
j
}), and

The following theorem shows that if we align a pair of sequences in

Theorem 2

There exists a sequence

Proof

We are going to create a pair of sequences _{
k
}and _{
k
}, where _{
i
}and _{
j
}as follows:

1. _{1…i
}and _{1…j
}when a non-indel element of _{
i
}and _{
j
}is aligned. If the backtrack uses _{
i
}and _{
j
}the closest elements in _{
i
}∖_{
j
}∖_{
i
}and _{
j
}, and add a cost that is always greater than or equal to _{
i
},_{
j
})=_{
i
},_{
j
}).

2. _{
k
}=

If _{
k
}=_{
k
}and _{
k
}causes no additional cost in the particular alignment being built between

3. _{
k
}=

4. _{
i
}or _{
j
}does not contain an indel. Otherwise, if this option is selected, then simply assign _{
k
} and _{
k
}with no extra cost for the alignment of

□

The Main Algorithm: Affine-DO

We will now use _{
P
}in Algorithm

1. If we selected two indels in _{
k
}and _{
k
}, don’t change

2. If _{
k
}=_{
k
}≠

3. If _{
k
}≠_{
k
}=

4. If _{
k
}≠_{
k
}≠_{
i
},for some _{
j
},_{
k
},_{
k
})} + {_{
j
},for some _{
i
},_{
k
},_{
k
})} to

5. Once the complete ^{
′
}is created, remove all the elements _{
i
}={

Definition 2

Affine-DO Affine-DO is Algorithm _{
P
}with _{
P
}with

It is now possible to use the Affine-DO algorithm to bound heuristically the cost of an instance of the TAP.

Theorem 3

Given a rooted tree ^{
′
}:^{∗}such that ^{
′
}(^{
′
}.

Proof

If there are no indels involved in the tree alignment, then the arguments of Theorem 1 would suffice. Hence, we now concentrate on the cases that involve indels.

To prove those remaining cases, we will use induction on the vertices of the tree. To do so, we will count the

**Credits and debits in the simple cases.**

For the base case, consider the leaves of the tree. By definition, for all

Consider now the interior vertex

Consider now the more difficult case when the blocks do not have exact limits. Assume without loss of generality that

**Credits and debits in the complex cases.** In the upper part, overlapping blocks of type B in

The total credit granted by Equation 2 is _{1},_{2},…,_{
m
}can occur (Figure

By the inductive hypothesis, the subtree rooted by

Theorem 4

If ^{2}|^{2}|

Proof

If the alphabet is small, then _{P} can be pre-computed in a lookup table for constant-time comparison of the sets. For large alphabets, the maximum size of the sets contained in _{P} can be made constant. Otherwise, a binary tree representation of the sets would be necessary, adding a |^{2}) where

Experimental Evaluation

In this section, we describe the methods used to generate the instance problems, assess the solutions generated by each algorithm, and compare the algorithms. This allows us to assess the performance of each algorithm, to examine Affine-DO in greater detail, and to evaluate Affine-DO against exact solutions for trees with only 3 leaves.

Data Sets

To generate the instance problems, we simulated a number of sequences using DAWG 1.1.1

| **Parameter** | **Values Evaluated** |
|---|---|
| Substitution Rate | 1 |
| Average Branch Length | 0 |
| Max. Gap | 1,2,5,10,15 |
| Root Sequence Length | 70,100,150,200,300,400,500,1000 |

All combinations of parameters were employed to generate the test data sets. The branch length variation equals the average branch length.

Solution Assessment

The sequences assigned by the simulation can be far from the optimal solution. To evaluate Affine-DO, we used two algorithms: the standard Fixed States algorithm, which is known to be a 2-approximation, and the cost calculated by the solution of an LP instance of the problem. A good heuristic solution should always be located between these two bounds. As a comparison measure for each solution, the ratio between the solution cost and the LP bound was computed. The closer the ratio to 1

This form of evaluation has the main advantage, but also the disadvantage, of being overly pessimistic. Most likely, the LP solution is unachievable, and therefore the approximation ratio inferred for the solution produced by Affine-DO will most likely be an overestimate. To assess how pessimistic the LP bound is, we produced 2100 random sequences divided into triplets of lengths between 70 and 1000. For each triplet, the Affine-DO solution, the LP bound, and the exact solution were computed. These three solutions were compared to provide an experimental overview of the potential performance of our algorithm. We selected random sequences because preliminary experiments showed evidence that these produce the most difficult instances for Affine-DO.
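The comparison measure is simple enough to state as code. The sketch below computes the ratio between a solution cost and the LP lower bound, and summarizes a batch of ratios by minimum, median, and maximum. The function names and the numbers in the usage lines are hypothetical and are not results from our experiments.

```python
from statistics import median


def approximation_ratio(solution_cost, lp_bound):
    """Ratio between a solution cost and the LP lower bound.

    Since the LP bound never exceeds the optimal cost, this ratio is an
    upper estimate of how far the solution is from the optimum.
    """
    if lp_bound <= 0:
        raise ValueError("the LP bound must be positive")
    return solution_cost / lp_bound


def summarize(ratios):
    """Minimum, median, and maximum of a collection of ratios."""
    return min(ratios), median(ratios), max(ratios)


# Hypothetical costs for three instances, each with an LP bound of 100.
ratios = [approximation_ratio(c, 100) for c in (100, 110, 130)]
print(summarize(ratios))
```

When an exact solution is available, the same function can be called with the exact cost in place of the LP bound, which gives the true approximation ratio and shows how loose the LP-based estimate is.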

Algorithms compared

We implemented a number of algorithms to approximate the tree alignment problem. Our implementation can be divided into two groups: initial assignment and iterative improvement.

Initial Assignment

includes the Fixed States (a stronger version of the Lifted Assignment ^{
′
} compatible with

Iterative Improvement

modifies an existing ^{′} by readjusting each interior vertex using its three neighbors. This procedure is repeated iteratively, until a user-provided maximum number of iterations is reached, or no further tree cost improvements can be achieved. The adjustment itself can be done using an ^{3}) ^{2}) memory consumption

**Approximate DO.** An iteration of the approximated iterative improvement. To improve _{1}, _{2}, and _{3}in the three possible rooted trees with leaves _{1}yields better cost than the original

We compared MSAM

In total, more than 330,000 solutions were evaluated. We only present those results that show significant differences and represent the overall patterns detected. The Exact Iterative algorithm was only evaluated for the short sequences (70 to 100 bases), due to the tremendous execution time it requires. Fixed States followed by iterative improvement is not included because its execution time is prohibitive for this number of tests (POY version 4 supports this type of analysis). Nevertheless, preliminary analyses showed that this combination of algorithms produces results between those of Fixed States and Affine-DO, and is not competitive with Affine-DO.

Algorithm Comparison

The most important patterns observed between the evaluated algorithms are presented in Figure

**Algorithm comparison.** General patterns observed in the approximation ratio of the different algorithms. Simulation is the simulated data, ADO is Affine-DO, Approx. and Exact IADO are the approximated and the exact iterative Affine-DO algorithms respectively, initial and final MSAM are the initial and final estimations of the MSAM algorithm. **a.** substitutions = 1, **b.** substitutions = 4, **c.** substitutions = 4,

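| **Subst.** | **Gap Op.** | **Branch Len.** | **Algorithm** | **Min.** | **Median** | **Max** |
|---|---|---|---|---|---|---|
| 1 | 0 | 0 | Simulated | 1 | 1 | 1 |
| | | | Fixed States | 1 | 1 | 1 |
| | | | ADO | 1 | 1 | 1 |
| | | | ADO + Iter. | 1 | 1 | 1 |
| 1 | 0 | 0 | Simulated | 1 | 2 | 2 |
| | | | Fixed States | 1 | 1 | 1 |
| | | | ADO | 1 | 1 | 1 |
| | | | ADO + Iter. | 1 | 1 | 1 |
| 4 | 3 | 0 | Simulated | 1 | 1 | 1 |
| | | | Fixed States | 1 | 1 | 1 |
| | | | ADO | 1 | 1 | 1 |
| | | | ADO + Iter. | 1 | 1 | 1 |
| 4 | 3 | 0 | Simulated | 2 | 2 | 2 |
| | | | Fixed States | 1 | 1 | 1 |
| | | | ADO | 1 | 1 | 1 |
| | | | ADO + Iter. | 1 | 1 | 1 |

Each individual indel has cost 1.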

Although the combination of Affine-DO and Iterative improvement produces better solutions, its execution time is dramatically higher. In the current implementation, running on a 3.0 GHz, 64-bit Intel Xeon 5160 CPU with 32 GB of RAM, Affine-DO evaluates each tree in less than 1 second in the worst case, while Affine-DO + Iterative improvement may take more than 1 hour per tree. For this reason, Affine-DO is well suited for heuristics that require a very large number of tree evaluations such as the GTAP, where millions of trees are evaluated during a heuristic search.

Approximation of Affine-DO

Figure

**Affine-DO vs. theoretical LP bound.** Guaranteed approximation ratio of Affine-DO compared with the theoretical LP bound, for different cost and sequence generation parameters. **a.** substitutions = 1, **b.** substitutions = 2, **c.** substitutions = 4,

Typically, the larger the sequence divergence, the larger the approximation ratio of Affine-DO. The same pattern is observed for larger

**Affine-DO vs. theoretical LP bound with random sequences.** Guaranteed approximation of Affine-DO for random sequences. On the left, substitutions = 1,

The worst case is observed with an average approximation slightly over 1

Comparison with an exact solution

To assess Affine-DO and the tightness of the LP bound, we computed the exact solution for 700 unrooted trees consisting of 3 leaves with random sequences assigned to their leaves, under all the parameter sets tested. Figure

**Affine-DO vs. exact solution.** Tightness of the Affine-DO solution according to the LP bound compared to the exact approximation. Observe that even for a very small data set, the LP bound is not realistic, and Affine-DO is close to the optimal solution. **a.** substitutions = 1, **b.** substitutions = 2, **c.** substitutions = 4,

Note that the LP-inferred bound is overly pessimistic even for these small test data sets: the inferred approximation ratio is expected to be around 1.15, while in reality Affine-DO finds solutions that are expected to be within 1.05 of the optimal solution, a 10% difference for trees consisting of only 3 sequences.

Conclusions

We have presented a novel algorithm, Affine-DO, for the TAP under affine gap costs. Our experimental evaluation, the largest performed for this kind of problem, shows that Affine-DO performs better than Fixed States. However, we observed that the LP bound is too pessimistic: it corresponds to infeasible solutions and overstates the approximation ratio by about 10%, even for the smallest non-trivial tree consisting of 3 leaves. Based on these observations, we believe that Affine-DO is producing near-optimal solutions, with approximations within 10% for sequences with small divergence, and within 30% for random sequences, for which Affine-DO produced the worst solutions.

Affine-DO is well suited for the GTAP under affine sequence edit distances, and yields significantly better results when augmented with iterative methods. The main open question is whether or not there exists a guaranteed bound for DO or Affine-DO. If the answer is positive, a further question is whether or not these ideas can be used to improve the PTAS. Additionally, many of these ideas can be applied to true simultaneous tree and alignment estimation under other optimality criteria such as ML and MAP. Their use under these different optimality criteria remains to be explored.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

WW defined the Fixed States and DO algorithms. AV developed Affine-DO and performed all analyses under supervision of WW. AV and WW wrote and revised the manuscript. Both authors read and approved the final manuscript for publication.

Acknowledgements

This material is based upon work supported by, or in part by, the U.S. Army Research Laboratory and the U.S. Army Research Office under grant number W911NF-05-1-0271.