Department of Computer Science, Iowa State University, Ames, IA 50011, USA

Department of Biology, University of Florida, Gainesville, FL 32611, USA

Abstract

Background

Gene tree - species tree reconciliation problems infer the patterns and processes of gene evolution within a species tree. Gene tree parsimony approaches seek the evolutionary scenario that implies the fewest gene duplications, duplications and losses, or deep coalescence (incomplete lineage sorting) events needed to reconcile a gene tree and a species tree. While a gene tree parsimony approach can be informative about genome evolution and phylogenetics, error in gene trees can profoundly bias the results.

Results

We introduce efficient algorithms that rapidly search local Subtree Prune and Regraft (SPR) or Tree Bisection and Reconnection (TBR) neighborhoods of a given gene tree to identify a topology that implies the fewest duplications, duplications and losses, or deep coalescence events. These algorithms improve on the current solutions by a factor of n for searching SPR neighborhoods and n^{2} for searching TBR neighborhoods, where n is the number of taxa in the given gene tree.

Conclusions

The new gene tree rearrangement algorithms provide a fast method to address gene tree error. They do not make assumptions about the underlying processes of genome evolution, and they are amenable to analyses of large-scale genomic data sets. These algorithms are also easily incorporated into gene tree parsimony phylogenetic analyses, potentially producing more credible estimates of reconciliation cost.

Introduction

The availability of large-scale genomic data from a wide variety of taxa has revealed much incongruence between gene trees and the phylogeny of the species in which the genes evolve. This incongruence may be caused by evolutionary processes such as gene duplication and loss, deep coalescence, or lateral gene transfer. The variation in gene tree topologies can be used to infer the processes of genome evolution. Gene tree - species tree (GT-ST) reconciliation methods seek to map the history of gene trees into the context of species evolution and thus potentially link processes of gene evolution to phenotypic changes and diversification. Yet these methods can be confounded by error in the gene trees, which also may cause incongruence between the gene and species topologies. We introduce efficient algorithms to correct gene tree topologies based on the gene duplication, duplication and loss, or deep coalescence cost models. The algorithms work by identifying the small rearrangements in the gene trees that reduce the reconciliation cost. They are extremely fast and thus amenable to analyses of enormous genomic data sets.

Perhaps the most commonly used and computationally feasible approach to GT-ST reconciliation is gene tree parsimony, which seeks to infer the fewest evolutionary events (e.g., duplication, loss, coalescence, or lateral gene transfer) needed to reconcile a gene tree and species tree topology

Several approaches have been proposed to address gene tree error in GT-ST reconciliation. First, questionable nodes in a gene tree or nodes with low support may be collapsed prior to gene tree reconciliation, and the resulting non-binary gene trees may be reconciled with species trees

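Previously, the best-known solutions required O(n^{3}) time for the SPR local search problem and O(n^{4}) time for the TBR local search problem, where n is the number of taxa in the given gene tree. Our algorithms improve on these solutions by a factor of n for the SPR local search problem and n^{2} for the TBR local search problem. This makes the local search under the TBR edit operation as efficient as under the SPR edit operation, and it provides a high-speed gene tree error-correction protocol that is computationally feasible for large-scale genomic data sets.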

We also evaluated the performance of our algorithms using an implementation of the SPR based local search algorithm. Note that the SPR neighborhood is properly contained in the TBR neighborhood for any given tree. Thus, the performance of the SPR based program provides a conservative estimate of the performance of the TBR based program. We test our programs on a collection of 106 yeast gene trees, some of which contain hundreds of leaves, and we demonstrate how they can be easily incorporated into large-scale gene tree parsimony phylogenetic analyses.

Basic notation and preliminaries

Throughout this paper, the term tree refers to a rooted full binary (phylogenetic) tree.

Let

Given a vertex _{T}_{T}_{T}_{T}_{T}

We define ≤_{T }_{T }y _{T }y _{T }y _{T}_{T}_{T}

Given _{|U}, is the rooted tree that is obtained from _{u}_{|U}, for _{T }u_{1 }and _{2 }are called _{1 }and _{2 }which maps a vertex _{1 }of _{1 }to vertex _{2 }of _{2 }if the subtree rooted at _{1 }in _{1 }has the same leaf set as the subtree rooted at _{2 }in _{2}. If an isomorphism exists between _{1 }and _{2}, we write _{1 }≃ _{2}.

Given function

The reconciliation cost models

A

**Definition 1 (Mapping**). _{G,S }_{G,S }_{G,S}_{G,S}_{g}_{G,S}

**Definition 2 (Comparability**). _{G,S}(g) is well defined

Throughout this paper we use the following terminology: _{G,S}

Now we define different reconciliation costs from _{G,S }_{G,S}

**Definition 3 (Duplication cost**).

•

•

**Definition 4 (Duplication-loss cost**).

•

• The duplication-loss cost

**Definition 5 (Deep coalescence cost**).

•

•
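To make the cost models concrete, the following minimal sketch computes the LCA mapping and the duplication cost for small trees. It is an illustration under stated assumptions, not the authors' implementation: trees are nested 2-tuples with string leaves, gene tree leaves are labeled directly with species names, and the helper names `leaves`, `lca`, and `duplication_cost` are hypothetical. It uses the standard rule that a gene tree vertex is a duplication when its LCA mapping equals the mapping of one of its children, and it recomputes leaf sets naively rather than using a linear-time LCA structure.

```python
def leaves(tree):
    """Return the set of leaf labels below `tree` (nested 2-tuples, str leaves)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return leaves(tree[0]) | leaves(tree[1])


def lca(species, labels):
    """Return the smallest subtree of `species` whose leaf set contains `labels`."""
    node = species
    while not isinstance(node, str):
        left, right = node
        if labels <= leaves(left):
            node = left
        elif labels <= leaves(right):
            node = right
        else:
            break  # labels are split across both children: node is the LCA
    return node


def duplication_cost(gene, species):
    """Count internal gene tree vertices mapped to the same species vertex as a child."""
    if isinstance(gene, str):
        return 0
    # A species tree vertex is identified by its leaf set (assumed unique).
    mapping = leaves(lca(species, leaves(gene)))
    is_dup = any(leaves(lca(species, leaves(child))) == mapping for child in gene)
    return int(is_dup) + sum(duplication_cost(child, species) for child in gene)


species_tree = (("a", "b"), "c")
print(duplication_cost((("a", "b"), "c"), species_tree))  # identical topology: 0
print(duplication_cost((("a", "c"), "b"), species_tree))  # conflicting clade: 1
```

The second gene tree groups a with c, so its root and the child (a, c) both map to the root of the species tree, which is counted as one duplication.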

The error-correction problems

Here we give definitions for the tree rearrangement operations TBR and SPR.

**Definition 6 **(Tree Bisection and Reconnection (TBR)).

Let _{T }

1.

2.

3.

4.

5.

_{T }(v, x, y) is obtained from T by a

1. _{G}_{y∈Y }TBR_{G}

2. _{G}_{x∈X }TBR_{G}

3. _{G }:= _{(u, v)∈E(G) }_{G}

**Figure: A TBR operation.**

**Definition 7 **(Subtree Prune and Regrafting (SPR)). _{T }_{T }_{T }_{v }and regrafts it above y. (See

**Figure: The NNI adjacency graph.**

_{G}_{y∈Y }SPR_{G}

_{G }:= _{(u, v)∈E(G) }_{G}
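The SPR operation and its neighborhood can be sketched as follows. This is a hedged illustration rather than the paper's code: trees are nested 2-tuples with distinct string leaves, a vertex is identified by the leaf set below it, and the names `prune`, `regraft`, `canon`, and `spr_neighborhood` are hypothetical. The sketch excludes the input tree itself from the neighborhood, which is an assumption about the definition.

```python
def leaves(tree):
    """Return the set of leaf labels below `tree`."""
    if isinstance(tree, str):
        return frozenset([tree])
    return leaves(tree[0]) | leaves(tree[1])


def nodes(tree):
    """Yield the leaf set of every vertex of `tree`."""
    yield leaves(tree)
    if not isinstance(tree, str):
        yield from nodes(tree[0])
        yield from nodes(tree[1])


def canon(tree):
    """Order children deterministically so isomorphic trees compare equal."""
    if isinstance(tree, str):
        return tree
    a, b = canon(tree[0]), canon(tree[1])
    return (a, b) if repr(a) <= repr(b) else (b, a)


def prune(tree, v):
    """Cut off the subtree with leaf set `v`; return (remaining tree, pruned subtree)."""
    if isinstance(tree, str):
        return tree, None  # only reached if `v` is not a proper vertex of the tree
    left, right = tree
    if leaves(left) == v:
        return right, left  # the parent vertex is suppressed
    if leaves(right) == v:
        return left, right
    if v <= leaves(left):
        rest, pruned = prune(left, v)
        return (rest, right), pruned
    rest, pruned = prune(right, v)
    return (left, rest), pruned


def regraft(rest, pruned, y):
    """Attach `pruned` on the edge above the vertex of `rest` with leaf set `y`."""
    if leaves(rest) == y:
        return (rest, pruned)
    left, right = rest
    if y <= leaves(left):
        return (regraft(left, pruned, y), right)
    return (left, regraft(right, pruned, y))


def spr_neighborhood(tree):
    """Return all distinct trees one SPR operation away from `tree`."""
    result = set()
    for v in nodes(tree):
        if v == leaves(tree):
            continue  # the root subtree cannot be pruned
        rest, pruned = prune(tree, v)
        for y in nodes(rest):
            result.add(canon(regraft(rest, pruned, y)))
    result.discard(canon(tree))
    return result


for neighbor in sorted(spr_neighborhood((("a", "b"), "c")), key=repr):
    print(neighbor)
```

For the three-leaf tree ((a, b), c), the neighborhood consists of the two other rooted topologies on the same leaves.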

We now state the SPR based error-correction problems for duplication (D), duplication-loss (DL), and deep coalescence (DC). Let Γ ∈ {D, DL, DC}.

Problem 1 (SPR based error-correction for Γ (SEC-Γ))

_{G }such that

The TBR based error-correction for Γ (TEC-Γ) problems are defined analogously to the SPR based error-correction for Γ (SEC-Γ) problems.

Solving the SEC-Γ problems

In this section we study the SPR based error-correction problems for duplication (D), duplication-loss (DL), and deep coalescence (DC) in more detail. Our efficient solutions for these problems are based on efficiently solving restricted versions of them. For each Γ ∈ {

Problem 2 (Restricted SPR based error-correction for (R-SEC-Γ))

_{G}

**Observation 1**. Let Γ ∈ {

Naïvely, the R-SEC-Γ problem can be solved in Θ(^{2}) time by computing the cost _{G}_{G}
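The naive bound can be unpacked as a short calculation. This is a sketch under two assumptions that are not spelled out here: n denotes the number of leaves of G, and the reconciliation cost of a single candidate tree is computed in Θ(n) time. With v fixed, the pruned tree offers O(n) regraft positions, and repeating the restricted search for every choice of v gives the cubic bound for the unrestricted SPR local search.

```latex
\begin{align*}
|\mathrm{SPR}_G(v)| &= O(n)
  && \text{(one regraft position per vertex of the pruned tree)} \\
\text{naive R-SEC-}\Gamma &:\; O(n) \cdot \Theta(n) = O(n^{2})
  && \text{(score every candidate from scratch)} \\
\text{naive SEC-}\Gamma &:\; O(n) \cdot O(n^{2}) = O(n^{3})
  && \text{(repeat for each prunable vertex } v\text{)}
\end{align*}
```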

Ordering the trees in SPR_{G}

Consider a graph on trees in _{G}

**Definition 8 **(Nearest Neighbor Interchange (NNI)). _{T }_{T }_{T }_{v }and regrafts it above the parent of v's parent. (See

**Definition 9 **(NNI distance). _{NNI}_{1, }_{2}), _{1 }_{2 }_{1 }_{2}.

**Definition 10 **(NNI-adjacency graph). _{G}_{1, }_{2}} ∈ _{NNI}_{1, }_{2}) = 1.

**Lemma 1**.

_{G}_{G}_{1}) and _{G}_{2}). We use induction on _{G}_{1}, _{2}). Let _{G}_{1}, _{2}) = 1 and assume without loss of generality that _{2 }= _{G}_{1}). Thus, _{G'' }_{1})). So the hypothesis holds for _{G}_{1}, _{2}) = 1. Assume now that the hypothesis is true for _{G}_{1}, _{2}) ≤ _{G}_{1}, _{2}) = _{1 }and _{2}; let _{G}_{1}) = 1, and ^{n }_{G}_{G}_{1}), then ^{n }_{G'}^{n }_{G'}_{G}_{2}) = _{G}_{1}, _{2}) =

**Theorem 1**.

_{G}

**Case 1: ****is a root**. Let _{1 }∈ _{G}^{1 }:= SPR_{G}_{1}), thus ^{1}, _{G}

**Case 2: **y **is a leaf**. Let _{1 }= _{G}^{1 }:= SPR_{G}_{1}), thus ^{1 }= NNI_{G'}^{1},

**Case 3: ****is an internal vertex**. Let _{1 }= _{G}_{2 }∈ _{G}^{1 }:= SPR_{G}_{1}), thus ^{1 }= NNI_{G'}^{2 }:= SPR_{G}_{2}), thus _{G}^{2}

This completes the proof.
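The relation between neighboring regraft positions and NNI operations used in the proof can be checked on small examples. The sketch below is an illustration under assumptions, not the authors' code: trees are nested 2-tuples with distinct string leaves, vertices are identified by their leaf sets, and all function names are hypothetical. For a fixed pruned vertex v, it verifies that regrafting above a vertex and regrafting above one of its children yield trees that are one rooted NNI apart.

```python
def leaves(tree):
    """Return the set of leaf labels below `tree`."""
    if isinstance(tree, str):
        return frozenset([tree])
    return leaves(tree[0]) | leaves(tree[1])


def canon(tree):
    """Order children deterministically so isomorphic trees compare equal."""
    if isinstance(tree, str):
        return tree
    a, b = canon(tree[0]), canon(tree[1])
    return (a, b) if repr(a) <= repr(b) else (b, a)


def prune(tree, v):
    """Cut off the subtree with leaf set `v`; return (remaining tree, pruned subtree)."""
    left, right = tree
    if leaves(left) == v:
        return right, left
    if leaves(right) == v:
        return left, right
    if v <= leaves(left):
        rest, pruned = prune(left, v)
        return (rest, right), pruned
    rest, pruned = prune(right, v)
    return (left, rest), pruned


def regraft(rest, pruned, y):
    """Attach `pruned` on the edge above the vertex of `rest` with leaf set `y`."""
    if leaves(rest) == y:
        return (rest, pruned)
    left, right = rest
    if y <= leaves(left):
        return (regraft(left, pruned, y), right)
    return (left, regraft(right, pruned, y))


def nni_neighbors(tree):
    """Return canonical forms of all trees one rooted NNI away from `tree`."""
    out = set()

    def rec(t, rebuild):
        # `rebuild` turns a replacement for subtree `t` into the full tree.
        if isinstance(t, str):
            return
        for i in (0, 1):
            child, sibling = t[i], t[1 - i]
            if not isinstance(child, str):
                c1, c2 = child
                out.add(canon(rebuild((c1, (c2, sibling)))))
                out.add(canon(rebuild((c2, (c1, sibling)))))
            rec(child, lambda new, s=sibling, r=rebuild: r((new, s)))

    rec(tree, lambda new: new)
    return out


def edges(tree):
    """Yield (parent leaf set, child leaf set) for every edge of `tree`."""
    if not isinstance(tree, str):
        for child in tree:
            yield leaves(tree), leaves(child)
            yield from edges(child)


def adjacent_regrafts_are_nni(tree, v):
    """Check that regrafting at both ends of each edge gives NNI-adjacent trees."""
    rest, pruned = prune(tree, v)
    return all(
        canon(regraft(rest, pruned, child)) in nni_neighbors(regraft(rest, pruned, parent))
        for parent, child in edges(rest)
    )


gene_tree = ((("a", "b"), "c"), ("d", "e"))
print(adjacent_regrafts_are_nni(gene_tree, frozenset(["a"])))
```

This adjacency is what allows a traversal of the pruned tree to move from one candidate to the next with a single local change, instead of rebuilding each candidate from scratch.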

The score difference of consecutive trees in

To solve the R-SEC-Γ problems we traverse tree

Let _{G'}

**Lemma 2**. ℳ_{G'',S}_{G',S}

_{G''}_{G''}

**Lemma 3**. ℳ_{G'',S}_{G',S}

_{G',S}_{G'',S}_{G',S}_{G' }_{G'',S}_{G''}_{G',S}_{G'',S}_{x}

**Lemma 4**. ℳ_{G''},_{S}_{G',S}_{G', S}

_{G'',S}_{G',S}_{G'',S}_{G',S}_{G'',S}_{G'',S}_{G'',S}_{G',S}_{G',S}

**Lemma 5**.

_{G'',S}_{G''},_{S}_{G',S}_{G'}

Let _{e }

**Theorem 2**.

□

**Definition 11**. _{G' }be a path from ^{' }^{'},

**Theorem 3**.

The algorithm

We describe a general algorithm, called Algo-R-SEC-Γ, to solve the R-SEC-Γ problem for each Γ ∈ {_{v }

Algorithm 1 - Algo-R-SEC-Γ

**Input: **A gene tree G, a species tree S, and v

**Output: **A tree G_{G}

_{v }and regrafting at Ro(G)

**For **each

**If **not backtracking, **then**

_{G'}_{G}

_{G'}_{G'}

_{G''.S }_{G',S}_{S}_{G',S}

_{G'',S}_{G',S}_{G',S}

** Else**,

_{G'}

**Return **BestTree

Algorithm 2 - Algo-Comp-Score

**Input: **A gene tree G, a species tree S, and LCA mapping _{G,S}

**For **each g

**If ****then**

**Return **score

Algorithm 3 - Algo-G-Score

_{G}_{,S}

**If ****then**

** If**ℳ(

**ElseIf ****then**

** If**ℳ(

**Else **//

**Lemma 6**.

From Definition 10, _{G}_{v }_{G}_{G}

For

**Case 1: **Γ **is D**. Algo-G-Score returns 1, if the vertex g ∈

**Case 2: **Γ **is DL**. Algo-G-Score computes losses by applying the formula of Definition 4. Further, it adds 1 if there is a duplication.

**Case 3: **Γ **is DC**. Algo-G-Score returns the number of lineages from g to each of its children

In Algo-R-SEC-Γ, step 13 computes _{ X }(

In Algo-R-SEC-Γ, step 4 sets _{ X }(

**Lemma 7**. ^{2})

When Γ is DC, steps 4 and 5 are further executed in Algo-Comp-Score for constant time. Thus in Algo-R-SEC-Γ, step 3 runs for ^{2}) time. □

Solving the TEC-Γ problems

In this section we study the TBR based error-correction problems for duplication (D), duplication-loss (DL), and deep coalescence (DC). More precisely, we extend our solution for the SEC-Γ problems to solve the TEC-Γ problems. A TBR operation can be viewed as an SPR operation, except that the pruned subtree can be rerooted before it is regrafted. Our speed-up for the TEC-Γ problems is achieved by observing that the Γ scores of any rerooted pruned subtree and of the remaining pruned tree are independent of each other. We define the R-TEC-Γ problems for the TEC-Γ problems in the same way as we defined the R-SEC-Γ problems for the SEC-Γ problems. We will show that the R-TEC-Γ problems can be solved by solving two smaller problems separately and combining their solutions.
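The rerooting step that distinguishes TBR from SPR can be illustrated by enumerating every rooting of a pruned subtree. This is a minimal sketch under assumptions: trees are nested 2-tuples with distinct string leaves, and the names `rerootings` and `canon` are hypothetical. A rooted binary tree with k ≥ 2 leaves has 2k − 3 edges once its root is suppressed, so the generator yields 2k − 3 rooted trees, including the original rooting.

```python
def canon(tree):
    """Order children deterministically so isomorphic trees compare equal."""
    if isinstance(tree, str):
        return tree
    a, b = canon(tree[0]), canon(tree[1])
    return (a, b) if repr(a) <= repr(b) else (b, a)


def rerootings(tree):
    """Yield every rooted tree obtained by placing the root on an edge of `tree`."""
    if isinstance(tree, str):
        yield tree
        return

    def walk(down, up):
        # Root on the edge between subtree `down` and the rest of the tree `up`.
        yield (down, up)
        if not isinstance(down, str):
            a, b = down
            yield from walk(a, (b, up))
            yield from walk(b, (a, up))

    left, right = tree
    yield tree  # the two root edges form a single edge of the unrooted tree
    for side, other in ((left, right), (right, left)):
        if not isinstance(side, str):
            a, b = side
            yield from walk(a, (b, other))
            yield from walk(b, (a, other))


pruned_subtree = (("a", "b"), ("c", "d"))
for rooted in rerootings(pruned_subtree):
    print(canon(rooted))
```

Each rerooted subtree can then be regrafted exactly as in an SPR operation, which is why the rerooted subtree and the remaining tree can be scored separately.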

**Definition 12**.

**Lemma 8**. _{G}_{v}_{v}

Proof. (⇒) Let ^{1 }:= _{G}_{1, }_{1 }∈ _{v}_{1 }≠ ^{2 }:= _{G}_{1}), for _{1 }∈ _{v}_{1 }≠ _{G}_{v}_{v}_{v }_{v, }

Lemma 8 implies that a tree in TBR_{G}_{G}

**Theorem 4**. ^{2})

Experimental results

We tested the performance of the gene tree rearrangement algorithms on a set of 106 gene alignments containing sequences from 8 yeast taxa from Rokas et al.

**Table: Error correction based on deep coalescence model**

| Reconciliation Cost | Original | Post-Correction |
|---|---|---|
| 0 | 45 | 77 |
| 1 | 32 | 15 |
| 2 | 6 | 8 |
| 3 | 9 | 5 |
| 4 | 8 | 0 |
| >4 | 6 | 1 |

The number of yeast gene trees with different reconciliation costs based on the deep coalescence model both before (Original) and after (Post-Correction) the SPR error correction.

**Table: Error correction based on duplication and loss model**

| Reconciliation Cost | Original | Post-Correction |
|---|---|---|
| 0 | 45 | 77 |
| 1-5 | 32 | 15 |
| 6-10 | 15 | 13 |
| 11-15 | 8 | 0 |
| 16-20 | 5 | 1 |
| >20 | 1 | 0 |

The number of yeast gene trees with different reconciliation costs based on the duplication and loss model both before (Original) and after (Post-Correction) the SPR error correction.

We also implemented a protocol that uses the gene tree rearrangement algorithm to correct for gene tree error in gene tree parsimony phylogenetic analyses. We first took a collection of input gene trees and performed an SPR species tree search using Duptree

Conclusion

GT-ST reconciliation provides a powerful approach to study the patterns and processes of gene and genome evolution. Yet it can be thwarted by the error that is an inherent part of gene tree inference. Any reliable method for GT-ST reconciliation must account for gene tree error; however, any useful method also must be computationally tractable for large-scale genomic data. We introduce fast and effective algorithms to correct error in the gene trees. These algorithms, based on SPR and TBR rearrangements, greatly extend the range of gene tree errors that can be corrected relative to existing algorithms

Our analysis of 106 yeast gene trees demonstrates that even a single SPR correction on the gene trees can substantially reduce the reconciliation cost. Our results in the yeast analysis are very similar to the 2-3 fold improvement in implied duplications and losses reported from the parametric gene tree estimation and reconciliation method of Rasmussen and Kellis

We also demonstrated that this error correction protocol could easily be incorporated into a gene tree parsimony phylogenetic analysis. Previous studies have emphasized that gene tree parsimony is sensitive to the topology of the input trees. For example, the species tree may differ depending on whether the gene trees are inferred using parsimony or maximum likelihood

While the results of the experiments are promising, they also suggest several directions for future research. First, further investigation is needed to characterize the effects of error on gene tree topologies. For example, it seems likely that gene tree errors may extend beyond a single SPR or TBR neighborhood. Yet, if we allow unlimited rearrangements, the gene trees will simply converge on the species tree topology. One simple improvement may be to weight the possible gene tree rearrangements based on support for different clades in the gene tree. Thus, well-supported clades may rarely or never be subject to rearrangement, while poorly supported clades may be subject to extensive rearrangements. Finally, these approaches implicitly assume that all differences between gene trees and species trees are due to either coalescence, duplications, or duplications and losses. Future work will seek to combine these objectives and also address lateral transfer.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

RC was responsible for algorithm design and program implementation, and wrote major parts of the paper. JGB performed the experimental evaluation and the analysis of the results, and contributed to the writing of the paper. OE supervised the project and contributed to the algorithmic design and the writing of the paper.

All authors read and approved the final manuscript.

Acknowledgements

The authors thank André Wehe for his generous support with the implementation. This work was conducted in part with support from the Gene Tree Reconciliation Working Group at NIMBioS through NSF award EF-0832858, with additional support from the University of Tennessee. R.C. and O.E. were supported in part by NSF awards #0830012 and #10117189.

This article has been published as part of