Division of Invertebrate Zoology, American Museum of Natural History, New York, NY - 10024, USA

Abstract

Background

A phylogeny postulates shared ancestry relationships among organisms in the form of a binary tree. Phylogenies attempt to answer an important question posed in biology: what are the ancestor-descendant relationships between organisms? At the core of every biological problem lies a phylogenetic component. The patterns that can be observed in nature are the product of complex interactions, constrained by the template that our ancestors provide. The problem of simultaneous tree and alignment estimation under Maximum Parsimony is known in combinatorial optimization as the Generalized Tree Alignment Problem (GTAP). The GTAP is the Steiner Tree Problem for the sequence edit distance, whereas the Steiner Tree Problem is typically presented under the Manhattan or the Hamming distances. Like many biologically interesting problems, the GTAP is NP-Hard.

Results

The accuracy of the GTAP has been evaluated experimentally. Results show that phylogenies selected using the GTAP from unaligned sequences are competitive with the best methods and algorithms available. Here, we implement and experimentally explore existing and new local search heuristics for the GTAP using simulated and real data.

Conclusions

When applied to real data, the methods presented here improve on the execution time of the best existing local search heuristics by more than three orders of magnitude.

Background

A phylogeny postulates shared ancestry relationships among organisms in the form of a binary tree. Phylogenies attempt to answer an important question posed in biology: what are the ancestor-descendant relationships between organisms? At the core of every biological problem lies a phylogenetic component. The patterns that can be observed in nature are the product of complex interactions, constrained by the template that our ancestors provide. For example, the presence and structure of the human skull are mainly determined by its structure in our ancestors. The relationship between the features observed in different organisms can only be understood if the phylogenetic relationships can be hypothesized.

An important method of phylogenetic inference is Maximum Parsimony (MP). Under MP, the preferred hypothesis is the one that minimizes the number of evolutionary transformations required to explain the observed features.
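As a concrete illustration of the MP criterion, the minimal sketch below counts the transformations that a single unordered character requires on a fixed rooted binary tree, using Fitch's algorithm. This is the simple pre-aligned case rather than the GTAP itself; the nested-tuple tree encoding, the function name `fitch_length`, and the toy data are assumptions made for the example.

```python
def fitch_length(tree, states):
    """Return (candidate_states, cost) for one character on a rooted binary tree.

    `tree` is either a leaf name (str) or a pair (left, right) of subtrees.
    `states` maps each leaf name to its observed state.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch_length(tree[0], states)
    right_set, right_cost = fitch_length(tree[1], states)
    common = left_set & right_set
    if common:
        # The children agree on at least one state: no extra transformation.
        return common, left_cost + right_cost
    # The children disagree: one transformation is needed at this vertex.
    return left_set | right_set, left_cost + right_cost + 1


# Hypothetical toy data: two candidate trees for the same four terminals.
states = {"a": "A", "b": "A", "c": "G", "d": "G"}
tree_1 = (("a", "b"), ("c", "d"))
tree_2 = (("a", "c"), ("b", "d"))
cost_1 = fitch_length(tree_1, states)[1]
cost_2 = fitch_length(tree_2, states)[1]
```

Under MP, the tree with the smaller cost would be the preferred hypothesis for this character.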

The problem of

Due to its computational hardness, biologists interested in the GTAP rely on heuristic procedures to find good solutions. The simplest, and arguably the most important heuristic for the GTAP is a

In this paper, we discuss, implement, and experimentally explore existing and new local search heuristics for the GTAP using simulated data. With real data, our methods improve on the best existing local search heuristics by more than three orders of magnitude. We begin by formally explaining the existing and new heuristics for the GTAP. Following the results of

The algorithms

A subproblem of the GTAP is the Tree Alignment Problem (TAP) (see ^{2}|

Existing heuristics

A local search consists of two steps: initial tree construction and refinement (defined below). Given an initial tree

**Tree Breaking.** Given a tree

**Breaking and joining a tree.** Breaking a tree into two connected components, and joining them again with a different edge. The resulting tree is part of T’s TBR neighborhood.
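The break-and-join operation can be sketched on a plain tree stored as an edge list. This simplified version only removes one edge, finds the two connected components, and reconnects them with a different edge; it deliberately ignores the subdivision and degree-2 vertex suppression that binary phylogenetic trees require, and the function names are hypothetical.

```python
from collections import defaultdict


def components_after_break(edges, broken):
    """Remove `broken` from a tree and return (remaining_edges, side_u, side_v)."""
    remaining = [e for e in edges if set(e) != set(broken)]
    adjacency = defaultdict(set)
    for u, v in remaining:
        adjacency[u].add(v)
        adjacency[v].add(u)

    def reachable(start):
        # Depth-first traversal collecting every vertex connected to `start`.
        seen, stack = {start}, [start]
        while stack:
            current = stack.pop()
            for neighbor in adjacency[current]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        return seen

    return remaining, reachable(broken[0]), reachable(broken[1])


def rejoin(remaining, side_u, side_v, new_edge):
    """Reconnect the two components with `new_edge`, which must cross them."""
    u, v = new_edge
    crosses = (u in side_u and v in side_v) or (u in side_v and v in side_u)
    if not crosses:
        raise ValueError("the new edge must connect the two components")
    return remaining + [new_edge]


# Toy example: a path a-b-c-d, broken at (b, c) and rejoined with (a, d).
edges = [("a", "b"), ("b", "c"), ("c", "d")]
remaining, left, right = components_after_break(edges, ("b", "c"))
new_tree = rejoin(remaining, left, right, ("a", "d"))
```

Enumerating every broken edge and every admissible new edge would generate a neighborhood of the original tree in this simplified setting.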

**Tree Joining.** Let

The TBR neighborhood of

The most popular strategy for the initial tree construction is the ^{2}). The Wagner algorithm is used in most software packages for phylogenetic analysis under MP (e.g. ^{3})

Depending on the distance function, different procedures are used to compute the score of the trees in the TBR neighborhood efficiently ^{3}) ^{4}) ^{3}) by increasing the hidden factor from ^{2}) to ^{3}) (remember that typically

Exploring a neighborhood requires two additional criteria: the stopping rule, and the selection of the next candidate solution. Depending on their properties, a number of local search strategies can be described. A classic heuristic that specifies the stopping and selection criteria is simulated annealing (SA)
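The roles of the selection and stopping rules can be seen in a generic local search skeleton. The sketch below assumes a greedy first-improvement selection rule and stops when no neighbor improves the cost (or after a hypothetical `max_steps` cap); the integer toy problem stands in for trees, neighborhoods, and tree lengths.

```python
def local_search(start, neighbors, cost, max_steps=1000):
    """Greedy local search: move to the first improving neighbor until none exists."""
    current, current_cost = start, cost(start)
    for _ in range(max_steps):
        for candidate in neighbors(current):
            candidate_cost = cost(candidate)
            if candidate_cost < current_cost:
                # Selection rule: accept the first strictly better candidate.
                current, current_cost = candidate, candidate_cost
                break
        else:
            # Stopping rule: no neighbor improves, so this is a local optimum.
            return current, current_cost
    return current, current_cost


# Toy problem: minimize (x - 3)^2 over the integers with neighbors x - 1 and x + 1.
best, best_cost = local_search(
    10,
    neighbors=lambda x: (x - 1, x + 1),
    cost=lambda x: (x - 3) ** 2,
)
```

In the phylogenetic setting, `neighbors` would enumerate the TBR neighborhood and `cost` would return the tree length; other strategies differ only in how they replace the two commented rules.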

Sectorial search

Other strategies (e.g. Parsimony Ratchet, Tree Fusing, the Genetic Algorithm, DCM) do not strictly belong to the set of local search heuristics. Because local search is a component of all these strategies, all of them would be more efficient with a good local search in place.

New heuristics for the GTAP

In this section, we describe four ideas to improve the local search strategies in the GTAP: efficient tree length calculation during the search, better tree cost bounding, a smarter local search strategy, and initial tree building algorithms.

Efficient tree updates

To apply the selection and stopping rules during TBR, it is necessary to calculate the tree length after every break, and join. Affine-DO requires a

To update a tree efficiently, we do not maintain a unique rooted representation, but rather take its unrooted representation and keep all the potential roots assigned to every edge of the tree (Figure

**All possible roots of an unrooted tree.** All possible roots of the unrooted tree correspond to the subdivision vertices of its edges (empty circles).
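The correspondence between edges and potential roots can be made explicit with a small sketch: placing a root on the subdivision vertex of each edge of an unrooted tree yields one rooted tree per edge. The edge-list encoding, the nested-tuple output, and the function names are assumptions made for illustration.

```python
def build_adjacency(edges):
    """Adjacency lists of an unrooted tree given as a list of (u, v) pairs."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)
        adjacency.setdefault(v, []).append(u)
    return adjacency


def subtree(adjacency, node, parent):
    """Rooted subtree hanging from `node` when looking away from `parent`."""
    children = [n for n in adjacency[node] if n != parent]
    if not children:
        return node  # a leaf is represented by its name
    return tuple(subtree(adjacency, child, node) for child in children)


def root_on_edge(adjacency, u, v):
    """Rooted tree obtained by subdividing edge (u, v) and rooting there."""
    return (subtree(adjacency, u, v), subtree(adjacency, v, u))


# Toy unrooted binary tree with four leaves (a, b, c, d) and interior vertices x, y.
edges = [("a", "x"), ("b", "x"), ("x", "y"), ("c", "y"), ("d", "y")]
adjacency = build_adjacency(edges)
rootings = {edge: root_on_edge(adjacency, *edge) for edge in edges}
```

For this four-leaf tree there are five edges and therefore five candidate roots.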

Three directional assignment

For an unrooted binary tree, we assign to each edge (

**Three directional assignment.** Three possible assignments to interior vertices of an unrooted tree. Left: computing the subdivision vertex of (

Observation 1

A tree with a three directional assignment computes the length of every tree that can be produced by breaking any one edge with time complexity

Observation 2

Given two separate trees ^{2}).

The simplest implementation of the three directions is to eagerly compute all the assignments in preparation for the first tree break and join. However, such an algorithm would entail overhead for greedy heuristics such as simulated annealing, where the first acceptable tree should be chosen to continue with the local search.

We solve this problem by using lazy evaluation and memoization.
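The idea of lazy evaluation with memoization can be sketched as follows. Each directed edge (node, parent) caches the result for the subtree hanging from `node` away from `parent`, and nothing is computed until a rooting is requested. As a stand-in for the Affine-DO assignments, the sketch uses Fitch sets for a single character; this substitution, together with all names, is an assumption made for illustration and is not POY's implementation.

```python
from functools import lru_cache


def make_lazy_lengths(edges, leaf_states):
    """Return (length_rooted_on, down) for an unrooted binary tree."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)
        adjacency.setdefault(v, []).append(u)

    def combine(a, b):
        (set_a, cost_a), (set_b, cost_b) = a, b
        common = set_a & set_b
        if common:
            return common, cost_a + cost_b
        return set_a | set_b, cost_a + cost_b + 1

    @lru_cache(maxsize=None)
    def down(node, parent):
        # Computed on first request only; later requests hit the cache.
        children = [n for n in adjacency[node] if n != parent]
        if not children:
            return frozenset([leaf_states[node]]), 0
        left, right = (down(child, node) for child in children)
        return combine(left, right)

    def length_rooted_on(u, v):
        """Tree length when the root subdivides edge (u, v)."""
        return combine(down(u, v), down(v, u))[1]

    return length_rooted_on, down


edges = [("a", "x"), ("b", "x"), ("x", "y"), ("c", "y"), ("d", "y")]
leaf_states = {"a": "A", "b": "A", "c": "G", "d": "G"}
length_rooted_on, down = make_lazy_lengths(edges, leaf_states)

cached_before = down.cache_info().currsize          # nothing evaluated yet
first_length = length_rooted_on("x", "y")           # evaluates only what it needs
cached_after_first = down.cache_info().currsize
all_lengths = [length_rooted_on(u, v) for u, v in edges]
cached_after_all = down.cache_info().currsize       # one entry per directed edge
```

A search that accepts the first improving tree pays only for the directions it actually visits, while an exhaustive pass over all edges never computes a direction twice.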

Multiple heuristic TAP solutions

The Affine-DO algorithm may calculate different tree length bounds depending on the root location (i.e. one per subdivision vertex). Nevertheless, the best of all the assignments is preferable for each tree. Computing all of the Affine-DO tree lengths, however, would add a

Algorithm 1: Improving the bound of a tree on each edge break

Algorithm 2: Improving the bound of a tree on each join

For a fixed

Smarter local searches

Affine-DO ^{2}), the same of a regular pairwise sequence alignment

RAGs can be used to guide a local search. If the union of a pair of RAGs

**Unions to bound the cost of a tree.** Use of unions to bound the cost during a local search. Shaded areas enclose disjoint sets of vertices in the tree. Suppose that we merge all the RAGs of each vertex set using Algorithm 3 to produce the unions X, Y, and Z. Then we can heuristically bound

Algorithm 3: Algorithm to compute _{
i
},

Theorem 1

Let

Proof

At each step, either _{
i
}, _{
j
}, _{
k
}, {_{
e
} is prepended before _{
f
} if and only if _{
i
} is not prepended, then the

The analysis of

Theorem 2

Algorithm 3 computes the union of

Proof

The algorithm stops when

The union of RAGs can be executed in ^{2}), therefore, this method entails a small additive factor to the time complexity of Affine-DO. In our implementation, we have fixed the size of the vertex sets to 12 vertices on all data sets experimentally.

Using unions during a local search

Let

Algorithm 4: Heuristic Union-pruning TBR. The threshold parameter (1.17) was tuned experimentally.

Building the initial trees

The Wagner algorithm is a basic procedure to compute an initial tree (Algorithm 5). We modify this procedure in two ways.

Algorithm 5: The Wagner algorithm for initial tree building

Union–pruning.

Unions can be used to efficiently prune candidate trees during the Wagner algorithm by maintaining the union set of the tree

Addition sequence

The initial sequence

1. Compute a Minimum Spanning Tree (MST) of

2. Traverse

3. To produce the

We call this procedure MST-Wagner.
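Only the first step of MST-Wagner is sketched here, because the remaining steps are not fully specified in the text above. The sketch assumes that the MST is computed over the complete graph of terminals weighted by pairwise unit-cost edit distance, and uses Prim's algorithm; the weighting, the function names, and the toy sequences are all assumptions.

```python
def edit_distance(a, b):
    """Unit-cost edit distance between two strings (dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + (char_a != char_b)  # match or substitution
            ))
        previous = current
    return previous[-1]


def minimum_spanning_tree(sequences):
    """Prim's algorithm on the complete graph of pairwise edit distances."""
    names = sorted(sequences)
    in_tree = {names[0]}
    tree_edges = []
    while len(in_tree) < len(names):
        # Cheapest edge leaving the current tree; ties broken by name.
        weight, u, v = min(
            (edit_distance(sequences[u], sequences[v]), u, v)
            for u in in_tree
            for v in names
            if v not in in_tree
        )
        in_tree.add(v)
        tree_edges.append((u, v, weight))
    return tree_edges


# Hypothetical toy terminals.
sequences = {"a": "ACGT", "b": "ACGA", "c": "TTTT"}
mst = minimum_spanning_tree(sequences)
mst_weight = sum(weight for _, _, weight in mst)
```

The resulting spanning tree could then be traversed to derive an addition sequence for the Wagner algorithm.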

Methods

We experimentally evaluated a number of algorithms for local searches under the GTAP. An experimental evaluation of this kind has three fundamental components: a selection of heuristics, an implementation, and a selection of data sets. Overall performance is compared using the length of the trees found by each method.

Algorithms compared

We compared the following heuristic local searches, in all meaningful combinations.

For the edit distance parameters we tested the following combinations of substitution, indel, and gap opening parameters [total gap cost = gap opening + (length × indel)]: (1, 1, 0), (1, 2, 0), (2, 1, 1), (3, 1, 2). In our experience, these parameters encompass enough variation in the GTAP, while maintaining a limited number of combinations with the algorithms. In total, 34 combinations of build algorithms and distance functions were tested. For the refinement step, a total of 208 combinations of algorithms and edit distance functions were tested.
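The gap cost formula quoted above can be evaluated directly for the four parameter combinations. The sketch below assumes that an empty gap costs nothing; the function name and the example gap length of 3 are chosen only for illustration.

```python
def total_gap_cost(length, indel, gap_opening):
    """Total gap cost = gap opening + (length x indel); an empty gap is free."""
    if length == 0:
        return 0  # assumption: no gap means no opening penalty
    return gap_opening + length * indel


# (substitution, indel, gap opening) combinations listed in the text.
parameter_sets = [(1, 1, 0), (1, 2, 0), (2, 1, 1), (3, 1, 2)]
costs_for_gap_of_3 = {
    params: total_gap_cost(3, indel=params[1], gap_opening=params[2])
    for params in parameter_sets
}
```

With a non-zero gap opening, one long gap is cheaper than several short gaps of the same total length, which is the behavior that distinguishes the affine combinations from the first two.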

Implementation

We implemented the algorithms under comparison in the Objective CAML and C programming languages. All the algorithms are available in the author’s computer program POY version 4

Data sets

To generate the instance problems, we simulated sequences using DAWG 1.1.1

All combinations of parameters were employed to generate the test data sets. The branch length variation equals the average branch length.

| **Parameter** | **Values Evaluated** |
|---|---|
| Substitution Rate | 1.5 |
| Average Branch Length | 0.1, 0.2, 0.3 |
| Max. Gap | 1, 2, 5, 10, 15 |
| Root Sequence Length | 500 |

Results and discussion

This section begins with the difference in performance between the Exhaustive (E) and the Non-exhaustive (NE) algorithms, which can be applied in conjunction with any other search strategy. It continues with a comparison of the build algorithms and of the refinement algorithms. Finally, we combine the results into a simple local search heuristic, which we compare with the previous best heuristic on a real data set.

Exhaustive and non-exhaustive algorithms

In the build step (Figure

**Tree building algorithm comparison: NE vs E.** Comparison of the Non-Exhaustive (NE) and Exhaustive (E) TAP approximation algorithms in tree building (Figure a) and TBR (Figure b). The patterns shown were observed in most of the combinations of simulation, algorithm, and edit distance parameters. **a.** Tree building using the Wagner algorithm. In every case, E outperformed NE, but the difference was not significant. However, as the branch lengths increased, the performance of the NE algorithm showed high variability (right), making E highly competitive for all distance functions with average branch length 0.3. **b.** Refinement using Union-pruning with NE and E. In this case, for almost every combination of algorithm, simulation, and distance function, E produced significantly shorter trees.

For the TBR step, E significantly outperforms NE, with better minimum and expected scores (Figure

Initial tree building

The initial tree building algorithms fall into two main groups: algorithms with RAS, and algorithms using MST. In all cases, MST produced significantly shorter trees (Figure

**Tree building algorithm comparison.** Comparison of initial tree build algorithms.

Neighbor joining produced the trees of highest score (i.e. the worst, between 10% and 20% higher) among all the algorithms for all parameters. We do not present it in the graphs, as it would make the more subtle differences between the other algorithms difficult to observe. Overall, the most important improvement comes from the MST addition sequence, followed by the use of the Union-pruning strategy. Nevertheless, we will see in the next section that the use of the MST algorithm remains limited.

Refinement

To evaluate the TBR refinement experimentally, we must produce an initial tree. Although MST showed better results than RAS, we found that in almost every instance TBR failed to improve the MST trees. In the end, RAS + TBR always found better trees than MST + TBR. For this reason, we used the second best method to construct the initial trees: RAS using Union-pruning.

The refinement comparison can be divided into two groups: (1) a comparison of basic TBR using Union-pruning and branch length sorting, and (2) a comparison of different algorithms using the best combination among those in (1).

Union-pruning and branch length sorting

The behavior of TBR with Union-pruning and branch length sorting is presented in Figure

**Tree search algorithm comparison.** Comparative performance of Union-pruning, and branch length sorting, with randomized algorithms in TBR.

The results match our expectation: the Union-pruning algorithm can positively guide the search with better taxon sampling. We have observed this behavior in real data sets, where new terminals sometimes

Local search strategy

Beyond the use of Union-pruning, and Exhaustive TAP estimation, the differences among the algorithms compared are not significant (Table

The differences observed are not significant. All the simulations shown have branch length 0.3, but similar patterns were observed for branch lengths 0.1 and 0.2. The minima across each row are in bold. Subst., Indel, and GO are the edit distance parameters (substitution, indel, and gap opening).

| **Gap Len.** | **Subst.** | **Indel** | **GO** | **TBR Min.** | **TBR Avg.** | **Sectorial Min.** | **Sectorial Avg.** | **BFS Min.** | **BFS Avg.** | **Annealing Min.** | **Annealing Avg.** |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 0 | 7190 | 7222.75 | **7186** | 7221.188 | 7190 | **7220.969** | 7198 | 7230.802 |
| 1 | 1 | 2 | 0 | 8410 | 8437.76 | 8405 | **8429.812** | **8406** | 8436.865 | 8416 | 8457.24 |
| 1 | 2 | 1 | 1 | **14022** | 14111.76 | 14032 | 14107.58 | **14022** | **14096.88** | 14031 | 14144.56 |
| 1 | 3 | 1 | 2 | 20089 | 20236.07 | 20118 | 20303.64 | **20062** | **20221.83** | 20172 | 20373.85 |
| 2 | 1 | 1 | 0 | 6680 | 6702.115 | **6674** | **6697.76** | 6676 | 6699.719 | 6687 | 6713.854 |
| 2 | 1 | 2 | 0 | 7969 | 7992.562 | 7963 | **7989.333** | 7969 | 7990.583 | **7967** | 8005.479 |
| 2 | 2 | 1 | 1 | 12994 | 13040.67 | **12978** | 13034.80 | 12981 | **13030.21** | 13001 | 13074.80 |
| 2 | 3 | 1 | 2 | 18603 | 18690.26 | **18588** | 18716.78 | 18589 | **18678.82** | 18629 | 18785.17 |
| 4 | 1 | 1 | 0 | **7164** | 7190.719 | **7164** | **7186.323** | 7166 | 7188.062 | 7176 | 7208.594 |
| 4 | 1 | 2 | 0 | **8684** | 8719.552 | **8684** | **8714.406** | 8682 | 8716.677 | 8698 | 8751.26 |
| 4 | 2 | 1 | 1 | **13586** | 13652.25 | 13590 | 13658.08 | 13592 | **13646.89** | 13601 | 13694.72 |
| 4 | 3 | 1 | 2 | 19148 | 19291.41 | 19149 | 19344.61 | **19113** | **19283.66** | 19209 | 19448.12 |
| 5 | 1 | 1 | 0 | 7049 | 7077.542 | **7043** | 7074.229 | 7049 | **7073.729** | 7057 | 7092 |
| 5 | 1 | 2 | 0 | 8692 | 8716.01 | **8683** | 8715.5 | 8688 | **8711.104** | 8690 | 8730.646 |
| 5 | 2 | 1 | 1 | **13329** | 13389.48 | 13334 | 13394.16 | 13336 | **13387.41** | 13363 | 13429.17 |
| 5 | 3 | 1 | 2 | 18876 | 18983.53 | **18861** | 19027.35 | 18870 | **18974.93** | 18930 | 19091.70 |
| 10 | 1 | 1 | 0 | 7149 | 7181.74 | **7141** | **7174.938** | 7145 | 7176.719 | 7163 | 7200.5 |
| 10 | 1 | 2 | 0 | 8965 | 9002.677 | **8944** | 8993.438 | 8948 | **8992.656** | 8979 | 9020.635 |
| 10 | 2 | 1 | 1 | 13200 | 13271.72 | 13199 | 13277.82 | **13195** | **13266.54** | 13235 | 13320.24 |
| 10 | 3 | 1 | 2 | **18395** | 18557.96 | 18423 | 18630.5 | 18402 | **18549.86** | 18470 | 18648.79 |
| 15 | 1 | 1 | 0 | 7162 | 7194.01 | 7160 | 7194.531 | **7159** | **7190.542** | 7182 | 7216.719 |
| 15 | 1 | 2 | 0 | 9151 | 9196.552 | **9142** | 9192.125 | 9147 | **9191.344** | 9151 | 9228.146 |
| 15 | 2 | 1 | 1 | 13168 | 13230.11 | 13164 | 13231.83 | **13155** | **13217.84** | 13186 | 13271.46 |
| 15 | 3 | 1 | 2 | 18194 | 18350.44 | 18234 | 18415.64 | **18166** | **18335** | 18290 | 18484.11 |

Overall performance

Based on the previous experiments, we prefer a heuristic local search strategy that consists of the following steps: build initial trees using RAS guided by Union-pruning, followed by a refinement step consisting of TBR using the three directional heuristics, Exhaustive TAP, Union-pruning, and cutting edges in order of descending length. We compared this algorithm (implemented in POY version 4) with that of POY version 3, which uses a one directional algorithm with randomized TBR steps

For this comparison, a random subset of 100 published anurans

To compare the performance of POY version 3 and version 4, we executed 1000 independent repetitions consisting of 1 build, followed by refinement, and reported the resulting tree score. This procedure can be executed in POY 3 with the command:

**Comparison of new algorithms vs old algorithms.** Density histogram of the frequency of occurrence of different tree scores in POY version 3 and version 4 for the example data set.

Discussion

We described and implemented new heuristics for the GTAP, and have shown that they find better solutions than previous approaches. We found that a number of conditions affect the fit of the heuristic to the problem: data sets with long branch lengths are better analyzed with Sectorial Search than with Union-pruning, while Union-pruning yields excellent results for medium and short branch lengths. Exhaustive-TBR yields the best results overall and should always be preferred. Although the MST algorithm yields better initial results than RAS, it is not preferable in the long run, and a small number of local searches should never be used to produce reliable results. The quality of the numerous metaheuristics available in the literature remains to be explored; it is now possible to explore them using a more efficient local search strategy.

Conclusions

We described new strategies that can be composed to produce a powerful local search strategy for the Tree Alignment Problem. The results showed that our methods improve on the best existing local search heuristics by more than three orders of magnitude.

In general, the Exhaustive–TBR refinement strategy should always be used, while Union-pruning should only be preferred if dense taxon sampling or short branch lengths are expected. Moreover, although the MST build strategy yields better results than the traditional Wagner build, the former should not be preferred in real analyses since it tends to produce less competitive trees after the refinement step.

It is difficult to predict the performance of other high level heuristics applied to the GTAP. Strategies such as Sectorial Search and Tree Fusing should be effective. However, Divide and Conquer techniques such as DCM-3 may have a more limited application, unless used in the spirit of Sectorial Search. Given that phylogenetic analysis under MP is a simplified setting compared to other optimality criteria, it is our opinion that metaheuristics such as Simulated Annealing have limited applicability in the joint estimation of trees and alignments for all optimality criteria, and novel strategies are needed to successfully scale to larger problem sizes. Nevertheless (unless

Affine-DO, Union–pruning, and Exhaustive–TBR are some of the algorithms that we have implemented in the computer program POY version 4

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

The authors contributed equally to this work. Both authors read and approved the final manuscript.

Acknowledgements

This material is based upon work supported by, or in part by, the U.S. Army Research Laboratory and the U.S. Army Research Office under grant number W911NF-05-1-0271.