Department of Computer Science, Iowa State University, Ames, IA, USA

National Evolutionary Synthesis Center, Durham, NC, USA; University of Florida, Gainesville, FL, USA

Abstract

Background

To infer a species phylogeny from unlinked genes, phylogenetic inference methods must confront the biological processes that create incongruence between gene trees and the species phylogeny. Intra-specific gene variation in ancestral species can result in deep coalescence, also known as incomplete lineage sorting, which creates incongruence between gene trees and the species tree. One approach to account for deep coalescence in phylogenetic analyses is the deep coalescence problem, which takes a collection of gene trees and seeks the species tree that implies the fewest deep coalescence events. Although this approach is promising for phylogenetics, the consensus properties of this problem are mostly unknown and analyses of large data sets may be computationally prohibitive.

Results

We prove that the deep coalescence consensus tree problem satisfies the highly desirable Pareto property for clusters (clades). That is, in all instances, each cluster that is present in all of the input gene trees, called a consensus cluster, will also be found in every optimal solution. Moreover, we introduce a new divide and conquer method for the deep coalescence problem based on the Pareto property. This method refines the strict consensus of the input gene trees, thereby, in practice, often greatly reducing the complexity of the tree search and guaranteeing that the estimated species tree will satisfy the Pareto property.

Conclusions

Analyses of both simulated and empirical data sets demonstrate that the divide and conquer method can greatly improve upon the speed of heuristics that do not consider the Pareto consensus property, while also guaranteeing that the proposed solution fulfills the Pareto property. The divide and conquer method extends the utility of the deep coalescence problem to data sets with thousands of taxa.

Introduction

The rapidly growing abundance of genomic sequence data has revealed extensive incongruence among gene trees (e.g.,

Related work

The deep coalescence problem is an example of a supertree problem, in which input trees with taxonomic overlap are combined to build a species tree that includes all of the taxa found in the input trees (see

Our contributions

We prove that the deep coalescence consensus tree problem satisfies the Pareto property for clusters. This result provides useful guidance for the species tree search. Instead of evaluating all possible species trees, to find the optimal solution we need only examine trees that satisfy the Pareto property on clusters. These trees will all be refinements of the strict consensus of the gene trees. Furthermore, the Pareto property allows us to show that the problem can be divided into smaller independent subproblems based on the strict consensus tree. We apply this property and describe a new divide and conquer method, and our experiments demonstrate that this method can greatly improve the speed of deep coalescence tree heuristics, potentially enabling efficient and effective estimates from inputs with several thousand taxa. Future work will exploit the independence of the subproblems and solve them on parallel machines, which should make it possible to solve even larger instances more accurately.

Methods

Basic definitions, notations, and preliminaries

In this section we introduce basic definitions and notation and then state the preliminaries required for this work. For brevity some proofs are omitted from the text but are available in Additional file

**Omitted proofs in the main manuscript**.


A

A

Let _{T }_{T} y _{T} y _{T} y _{T} y _{T }_{2}(

Let

If {_{T} y _{T}_{T}

If

Examples of the following definitions are shown in Figure _{v}_{T} v


**Examples of tree definitions**. (a) A rooted tree

Deep coalescence

We define the


**Example of deep coalescence cost definition**. Example showing the deep coalescence cost from

Throughout this section we assume

**Definition 1 **(Path length)**. Suppose x ≤**

**Definition 2 **(LCA mapping)**. Let v ∈ V(T), the LCA mapping of v in S, denoted M**

**Definition 3 **(Deep coalescence)**. The deep coalescence cost from T to S, denoted DC(T, S), is**

Using the extended path lengths, the deep coalescence cost can be equivalently expressed as
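
As an illustrative aside, the following Python sketch computes a deep coalescence cost for small trees. It assumes the standard edge-based formulation, in which each edge {u, v} of the gene tree T contributes the number of edges on the path in S between the LCA images of u and v, minus one, and it assumes T and S are rooted binary trees over the same leaf set. The nested-tuple tree encoding and all function names are hypothetical choices made for this example.

```python
def leaf_set(tree):
    """Leaf labels below a tree given as nested tuples with string leaves."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def cluster_depths(species_tree):
    """Map every cluster of the species tree to the depth of its node (root = 0)."""
    depths = {}

    def walk(node, depth):
        depths[leaf_set(node)] = depth
        if not isinstance(node, str):
            for child in node:
                walk(child, depth + 1)

    walk(species_tree, 0)
    return depths


def lca_depth(depths, leaves):
    """Depth of the LCA image: the deepest species cluster containing `leaves`."""
    return max(d for cluster, d in depths.items() if leaves <= cluster)


def deep_coalescence_cost(gene_tree, species_tree):
    """Sum over gene-tree edges of (path length between LCA images) - 1.

    Assumed formulation; individual terms may be -1, but the total counts
    the extra lineages when both trees are binary on the same leaves.
    """
    depths = cluster_depths(species_tree)

    def walk(node):
        if isinstance(node, str):
            return 0
        parent = lca_depth(depths, leaf_set(node))
        total = 0
        for child in node:
            total += lca_depth(depths, leaf_set(child)) - parent - 1
            total += walk(child)
        return total

    return walk(gene_tree)


gene = (("a", "b"), "c")
species = (("a", "c"), "b")
print(deep_coalescence_cost(gene, species))  # 1
print(deep_coalescence_cost(gene, gene))     # 0
```

The LCA lookup scans all clusters and is therefore quadratic; it is meant to make the definition concrete, not to be efficient.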

Consensus tree

**Definition 4 **(

_{1},...,_{n}

**Definition 5 **(Deep coalescence consensus tree problem).

Cluster and Pareto

**Definition 6 **(Cluster)**. Let T be a tree, the clusters induced by T, denoted, **

**Definition 7 **(Pareto on clusters)**. Let P be a consensus tree problem based on some cost function. We say that P is Pareto on clusters if: for all instances I = (T**
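
To make the cluster and Pareto notions concrete, here is a minimal Python sketch. It collects the clusters induced by each tree, intersects them to obtain the consensus clusters (the clusters present in all input gene trees), and tests whether a candidate species tree contains all of them. The nested-tuple encoding and the function names are assumptions made for this illustration only.

```python
def clusters(tree):
    """All clusters (leaf sets of subtrees) induced by a nested-tuple tree."""
    found = set()

    def walk(node):
        if isinstance(node, str):
            cluster = frozenset([node])
        else:
            cluster = frozenset().union(*(walk(child) for child in node))
        found.add(cluster)
        return cluster

    walk(tree)
    return found


def consensus_clusters(gene_trees):
    """Clusters that appear in every input gene tree."""
    return set.intersection(*(clusters(t) for t in gene_trees))


def displays_consensus_clusters(candidate, gene_trees):
    """True if the candidate contains every consensus cluster."""
    return consensus_clusters(gene_trees) <= clusters(candidate)


gene_trees = [
    ((("a", "b"), "c"), "d"),
    ((("a", "b"), "d"), "c"),
]
# {a, b} is shared by both gene trees, so a Pareto solution must contain it.
print(displays_consensus_clusters((("a", "b"), ("c", "d")), gene_trees))    # True
print(displays_consensus_clusters(((("a", "c"), "b"), "d"), gene_trees))    # False
```

A problem that is Pareto on clusters guarantees that every optimal solution passes this check, so candidates that fail it can be discarded without computing their cost.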

Theorem overview

We wish to show that the deep coalescence consensus tree problem is Pareto on clusters. We describe a high level structure of the proof in this section and provide necessary supporting lemmata in the next section. The proof proceeds by contradiction, assuming that the deep coalescence consensus tree problem is _{1},...,_{n}

Supporting lemmata

Shallowest regrouping operation

In this section we formally define the new tree edit operation that forms the key part of the theorem. We begin with some useful definitions related to the depth of nodes. An example of this operation is shown in Figure


**Example of the shallowest regrouping operation**. Example of the shallowest regrouping operation of _{1} and _{2}. _{1} and _{2} are the resulting trees of this operation. That is, _{1} = Γ(_{1}) and _{2} = Γ(_{2}).

**Definition 8 **(Node depth)**. The depth of a node v∈V(T), denoted dep**

**Definition 9 **(Shallowest nodes)**. Let T be a tree and X ⊆ V(T), the shallowest function, denoted shallowest**

Now we have the necessary mechanics to define the new tree edit operation. In what follows, we assume

**Definition 10 **(Regroup). _{2}(

**Definition 11 **(Shallowest regroup).

As Figure

Counting the number of degree-two nodes

The regrouping operation includes the step of suppressing nodes with degree two. Since this step affects path lengths and ultimately deep coalescence costs, we are required to count carefully the number of degree-two nodes under various conditions. Here we assume that

**Observation 1**. _{2}(_{2}(

**Observation 2**. _{2}(

The next Lemma says that if the root of

**Lemma 1**. _{2}(_{2}(

_{1} < ... <_{n}_{1}, ... , _{n }


**Setup and variable assignments for the proof of Lemma 1**. Tree showing the variable assignments in the proof of Lemma 1. Dotted lines represent omitted parts of the tree, and triangles represent subtrees.

• _{n }

• _{1} ∩

• _{1} ∩ _{2},..., _{n }_{2}, ...,_{n }_{1} is the shallowest degree-two node in

In order to obtain _{1} whose leaves are in _{1}). Thus there must be at least one degree-two node in _{1} (or _{1} if _{1} is pruned). Similarly, for 1 <_{i }_{i}

Properties of the regrouping operation

We examine some properties of the regrouping operation in this section. In general, these properties show that the path lengths defined by LCA's do not increase under several different assumptions. This preservation of path lengths would later assist in the calculation of deep coalescence costs. Throughout this section, we assume _{2}(

**Lemma 2**. _{S}_{S' }

**Lemma 3**. _{S}_{R}

**Lemma 4**. _{S}_{R}

**Lemma 5**. _{S}_{R}

_{S }_{S" }_{R}_{S}_{S'}_{S" }_{S}_{S}_{S} b_{S} x_{S" }_{S"}_{S"} v_{S" }x_{S" }b_{S}_{S" }_{S" } b _{S}_{S" }

Next, by (R2) _{S"}_{R}

Finally, combining the above results we have _{S}_{R}

Main theorem

**Theorem 1**.

_{1},...,_{n}_{i}, S_{i}, R

Let _{i }

Since (1) sums over all edges in


**Running example for the proof of Theorem 1**. A running example for the proof of Theorem 1 where _{1}, _{2}, and _{3}. The rest of the edges form the partition _{4}. By counting the costs for each partition we have Σ_{1} = 6 − 4 = 2, Σ_{2} = 1 − 3 = −2, Σ_{3} = 2 − 1 = 1, and Σ_{4} = 3 − 3 = 0. Overall we have

We identify some specific nodes in order to partition the edges of _{2}(_{w' }

Let _{T }_{T }_{1}, _{2}, _{3}, _{4}} as follows.

1. _{1} ≜ {{

2. _{2 }≜ {{

3. _{3} ≜ {{

4. _{4} ≜ _{1 }∪ _{2} ∪ _{3})

We consider (1) for each part of the partition separately. For clarity, we define the aggregated cost difference Σ_{i }_{i }

Hence (1) becomes

Let _{S}_{S }_{i }

**Claim 1**. Σ_{1} ≥

_{1} ≥ 0. Since _{S }_{x' }_{x'}_{1} ≥ |_{2}(_{2}(_{U }_{U }_{1} ≥ |_{2}(_{U }_{S }

**Claim 2**. Σ_{2} = −

The fourth equality holds because _{R}_{S}

**Claim 3**. Σ_{3} ≥ 1

_{3} where _{T} b_{1} or _{2}. We consider two cases for

1. If _{S}_{R}

2. If _{S}_{R}

In any case, we have _{S}_{R}_{3}. This implies that Σ_{3} ≥ 0. Further, since _{2}(_{3} such that _{S}_{R}_{3} ≥ 1.

**Claim 4**. Σ_{4} ≥ 0

_{4} where _{T} b_{S }_{R}_{4}, hence Σ_{4} ≥ 0.

Finally, we have Σ_{1} +Σ_{2} +Σ_{3} +Σ_{4} ≥

Algorithm for improving a candidate solution

Algorithm 1 takes a consensus tree problem instance and a candidate solution as inputs. If the candidate solution does not display the consensus clusters, it is transformed into one that includes all of the consensus clusters and has a smaller (that is, better) deep coalescence cost.

**Algorithm 1 **Deep coalescence consensus clusters builder

1: **procedure **DCConsensusClustersBuilder (

Input: A consensus tree problem instance _{1},...,_{n}

Output:

2:

3:

4: **for all **cluster **do**

5: **if ****then**

6:

7:

8: **end if**

9: **end for**

10: **return **

11: **end procedure**

The correctness of Algorithm 1 follows from the proof of Theorem 1. We now analyze its time complexity. Let ^{2}) time.

General method for improving a search algorithm

In this section we extend the result of Theorem 1 and show that the deep coalescence consensus tree problem exhibits optimal substructures based on the


**Running example for the definitions and proof of Theorem 2**. A running example for the definitions and proof of Theorem 2. Arrows are marked by numbers 1 to 6, demonstrating the steps of the proof. Each step is explained below: (1) Given an instance _{1},...,_{n}

**Definition 12 **(Strict consensus tree _{1},...,_{n}

**Definition 13 **(Cut on trees).

_{1},...,_{n}

**Theorem 2**. _{1},...,_{n}

1. Remove all edges of

2. Identify

3. For each leaf

Let the resulting tree be

Let _{i }

For convenience, let

Similar to the proof of Theorem 1, we partition the edges of _{under}, E_{out}, E_{in}

1.

2.

3.

Recall that the modification of _{under }_{out}_{in }

Theorem 2 implies that every internal node of the strict consensus tree defines an independent subproblem, and solutions of these subproblems can be combined to give a solution to the original deep coalescence consensus tree problem. This leads to the following general divide and conquer method that improves an existing search algorithm.

**Method 1 **Deep coalescence consensus tree method

1: **procedure **DCConsensusTreeMethod(

Input: A DC consensus tree problem instance _{1},...,_{n}

Output: A candidate solution

2:

3: **for all **internal node **do**

4: _{h }

5: _{h }_{h}

6: Refine the children of _{h}

7: **end for**

8: **return **

9: **end procedure**
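
The divide and conquer idea behind Method 1 can be sketched in Python under several stated assumptions. Trees are nested tuples with string leaves. The subproblem at an internal node of the strict consensus tree is built by collapsing each child cluster into a single placeholder leaf, which is our reading of the cut operation. An exhaustive enumeration of binary trees stands in for the external solver, so this is practical only when each consensus node has few children. The cost uses the assumed standard edge-based formulation, and all names are hypothetical.

```python
def leaf_set(tree):
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(c) for c in tree))

def clusters(tree):
    if isinstance(tree, str):
        return {leaf_set(tree)}
    return {leaf_set(tree)}.union(*(clusters(c) for c in tree))

def dc_cost(gene_tree, species_tree):
    """Edge-based deep coalescence cost (assumed standard formulation)."""
    depth = {}
    def record(node, d):
        depth[leaf_set(node)] = d
        if not isinstance(node, str):
            for child in node:
                record(child, d + 1)
    record(species_tree, 0)
    def lca_depth(node):
        leaves = leaf_set(node)
        return max(d for c, d in depth.items() if leaves <= c)
    def walk(node):
        if isinstance(node, str):
            return 0
        return sum(lca_depth(ch) - lca_depth(node) - 1 + walk(ch) for ch in node)
    return walk(gene_tree)

def insert_leaf(tree, leaf):
    """Attach `leaf` above the root or on any edge of a binary tree."""
    yield (tree, leaf)
    if not isinstance(tree, str):
        left, right = tree
        for t in insert_leaf(left, leaf):
            yield (t, right)
        for t in insert_leaf(right, leaf):
            yield (left, t)

def all_binary_trees(labels):
    """Every rooted binary tree on `labels` (stand-in for an external solver)."""
    if len(labels) == 1:
        yield labels[0]
        return
    for tree in all_binary_trees(labels[1:]):
        yield from insert_leaf(tree, labels[0])

def restrict(tree, label_of):
    """Collapse each child cluster of the current node into one placeholder leaf."""
    if isinstance(tree, str):
        return label_of.get(tree)  # None for taxa outside the subproblem
    parts = [p for p in (restrict(c, label_of) for c in tree) if p is not None]
    if not parts:
        return None
    if all(p == parts[0] for p in parts):
        return parts[0]
    return tuple(parts)

def expand(tree, subtrees):
    """Replace placeholder leaves by the already resolved child subtrees."""
    if isinstance(tree, str):
        return subtrees[tree]
    return tuple(expand(c, subtrees) for c in tree)

def solve(cluster, consensus, gene_trees):
    """Resolve the strict consensus node for `cluster` and everything below it."""
    if len(cluster) == 1:
        return next(iter(cluster))
    inside = [c for c in consensus if c < cluster]
    children = sorted((c for c in inside if not any(c < d for d in inside)), key=sorted)
    subtrees = {f"#{i}": solve(c, consensus, gene_trees) for i, c in enumerate(children)}
    label_of = {taxon: f"#{i}" for i, c in enumerate(children) for taxon in c}
    reduced = [restrict(t, label_of) for t in gene_trees]
    best = min(all_binary_trees(list(subtrees)),
               key=lambda s: sum(dc_cost(t, s) for t in reduced))
    return expand(best, subtrees)

def dc_consensus_tree(gene_trees):
    """Refine the strict consensus of the gene trees, one internal node at a time."""
    consensus = set.intersection(*(clusters(t) for t in gene_trees))
    return solve(leaf_set(gene_trees[0]), consensus, gene_trees)
```

Because each call to `solve` touches only the children of one consensus node, the calls for different nodes are independent and could be dispatched in parallel. Replacing `all_binary_trees` with a heuristic search would trade exactness for scale.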

Results

We used simulation experiments to (i) test if the solutions obtained from efficient heuristics presented in

Experiment results 1

First to examine if subtree pruning and regrafting (SPR) heuristic solutions from

Experiment results 2

We next evaluated the efficacy and scalability of Method 1 and compared it to the standalone SPR heuristic. We generated sets of gene trees, each with different consensus tree structures (depths and branch factors) as follows. The _{d,b}_{d,b}_{d,b}_{d,b}


**Deep coalescence score and runtime results for Experiment 2**. Legend: blue represents Method 1 (divide and conquer) and orange represents standalone SPR heuristic.

Experiment results 3

Finally, we examined the performance of Method 1 and compared it to the standalone SPR heuristic using more biologically plausible coalescence simulations. We followed the general structure of the coalescence simulation protocol described by Maddison and Knowles

For each data set, we performed a phylogenetic analysis using Method 1 and also using only the SPR heuristic from Bansal et al.

Discussion

We prove that the deep coalescence problem, in addition to offering a biologically informed optimality criterion for resolving incongruence among gene trees, is guaranteed to retain the phylogenetic clusters on which all gene trees agree. Since the deep coalescence problem is NP-hard

Still, our simulation experiments suggest that, in many cases, the SPR local search heuristic described by Bansal et al.

Further, Theorem 2 shows that the deep coalescence consensus tree problem exhibits independent optimal substructures. This implies that, once we compute the strict consensus tree of the problem instance, the rest of Method 1 can be directly parallelized, regardless of which external deep coalescence solver is used. If the external solver guarantees exact solutions, our method also gives exact solutions, but can potentially solve instances with many more taxa than the external solver alone.

Although the Pareto property for the deep coalescence consensus tree problem is desirable, and the divide and conquer method is promising for large-scale analyses, there are limitations to their use. First, the Pareto property and Method 1 are limited to the consensus case, that is, instances in which all of the input gene trees contain sequences from all of the species. Second, the Pareto property is only useful when the input trees share some clusters. If there are no consensus clusters among the input trees, then Method 1 conveys no run-time benefit. While this may seem like an extreme case, it is possible with high levels of incomplete lineage sorting or, perhaps more likely, with substantial error in the gene tree estimates. Also, as more gene trees are added, we would expect more instances of conflict among them, potentially converging towards the elimination of consensus clusters. Than and Rosenberg

Conclusions

We prove that the deep coalescence consensus tree problem satisfies the Pareto property for clusters and describe an efficient algorithm that, given a candidate solution that does not display the consensus clusters, transforms the solution so that it includes all the consensus clusters and has a lower deep coalescence cost. We extend the result and prove that the problem exhibits optimal substructures based on the strict consensus tree of the input gene trees. Based on this property, we suggest a new, parallelizable tree search method, in which we refine the strict consensus of the input gene trees. In contrast to previously proposed heuristics, this method guarantees that the proposed solution will contain the Pareto clusters. Also, as our experiments demonstrate, this method can greatly improve the speed of deep coalescence tree heuristics, potentially enabling efficient and effective estimates from inputs with thousands of taxa.

List of abbreviations used

LCA: least common ancestor; SPR: subtree pruning and regrafting; NNI: nearest neighbor interchange; TBR: tree bisection and reconnection

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

HTL and OE were responsible for theory development and algorithm design. HTL implemented the programs. HTL and JGB designed and conducted simulation experiments, and JGB led the analysis of the results. All authors contributed to the writing of this manuscript, and have read and approved the final manuscript.

Acknowledgements

The authors would like to thank the anonymous reviewers, who provided valuable comments as well as a simpler proof of Lemma 1. This work was conducted with support from the Gene Tree Reconciliation Working Group at NIMBioS through NSF award #EF-0832858, with additional support from the University of Tennessee. HTL and OE were supported in part by NSF awards #0830012 and #10117189.

This article has been published as part of