Institute of Informatics, University of Warsaw, Warsaw, 02-097, Poland

Department of Computer Science, Iowa State University, Ames, 50011, USA

Abstract

Background

Evolutionary methods are increasingly challenged by the wealth of fast-growing genomic sequence resources. Evolutionary events, like gene duplication, loss, and deep coalescence, account more than ever for incongruence between gene trees and the actual species tree. Gene tree reconciliation addresses this fundamental problem by invoking the minimum number of gene duplications and losses that reconcile a rooted gene tree with a rooted species tree. However, the reconciliation process is highly sensitive to topological error and incorrect rooting of the gene tree, and most gene trees in practice are neither error-free nor correctly rooted. Thus, despite the promise of gene tree reconciliation, its applicability in practice is severely limited.

Results

We introduce the problem of reconciling unrooted and erroneous gene trees by simultaneously rooting and error-correcting them, and describe an efficient algorithm for this problem. Moreover, we introduce an error-corrected version of the gene duplication problem, a standard application of gene tree reconciliation. Because the original version of this problem is NP-hard, we introduce an effective heuristic for our error-corrected version. Our experimental results suggest that our error-correcting approaches for unrooted input trees can significantly improve the accuracy of gene tree reconciliation and of species tree inference under the gene duplication problem. Furthermore, our algorithm for error-correcting reconciliation is efficient enough to handle truly large-scale phylogenetic studies.

Conclusions

The presented error-correction approach is a crucial step towards making gene tree reconciliation more robust, and thus towards improving the accuracy of applications that fundamentally rely on gene tree reconciliation, like the inference of gene-duplication supertrees.

Background

The wealth of newly sequenced genomes has provided us with an unprecedented resource of information for phylogenetic studies that will have extensive implications for a host of issues in biology, ecology, and medicine, and promise even more. Yet, before such phylogenies can be reliably inferred, challenging problems that came along with the newly sequenced genomes have to be overcome. Evolutionary biologists have long realized that gene-duplication and subsequent loss, a fundamental evolutionary process

**Rooted reconciliation**. An lca-mapping

Related work

Gene tree reconciliation is a well-studied method for resolving topological inconsistencies between a gene tree and a trusted species tree

A major problem in the application of gene tree reconciliation is its high sensitivity to error-prone gene trees. Even seemingly insignificant errors can largely mislead the reconciliation process and, typically undetected, infer incorrect phylogenies (e.g.,

In summary, even a small topological error or a slightly misplaced root can lead to the incorrect identification of enormous numbers of gene duplications and losses, and thereby seriously mislead the reconciliation process. Therefore, gene tree reconciliation requires gene trees that are both free of error and correctly rooted

Our contribution

We address the problem of reconciling erroneous and unrooted gene trees by error-correcting and rooting them at the same time. Solving this problem efficiently is a crucial step towards making gene tree reconciliation more robust, and thus towards improving the accuracy of applications that rely on gene tree reconciliation, like the construction of gene-duplication supertrees. We introduce the problem and design an efficient algorithm that facilitates a much more precise gene tree reconciliation, even for large-scale data sets. Our algorithm detects and corrects errors in unrooted gene trees, sparing biologists the difficulty and uncertainty of handling erroneous gene trees and rooting them correctly. The presented experimental results suggest that our novel reconciliation algorithms can identify and correct topological error in unrooted input gene trees, and at the same time root them optimally.

Our algorithm is designed to search for the correct and rooted tree of a given unrooted tree in local search neighborhoods of the given tree. The size of these neighborhoods is described by a positive integer k

Further, we address the problem of constructing rooted supertrees by reconciling unrooted and erroneous gene trees with assigned weak edges, a key problem in illuminating the role and effect of gene duplication and loss in shaping the evolution of organisms. We introduce the problem and develop an effective local search heuristic that makes the construction of more accurate supertrees possible and allows a much better postulation of gene duplication histories. Our experimental results demonstrate that our approach is effective in identifying gene duplication histories given erroneous gene trees and producing more accurate supertrees under gene tree reconciliation.

Duplication-loss model

We introduce the fundamentals of the classical duplication-loss model. Our definitions are mostly adopted from

Let ℐ be the set of species consisting of

Let

We call distinct nodes **Sb**(**(i) Sb**(**(ii) Sb**(**Sb**(**(iii) Sb**(**Sb**(

By **Sb**(

In this general setting let us assume that we are given a

Now we present examples of cost functions that are used in the duplication model. We assume that if _{1 }and _{2 }are its children. The ^{D}_{i}^{D}^{L}**Sb **(_{1}), _{2})), and ^{L}^{D}_{1}), _{2})) and ^{L }_{1}), _{2})) (in both cases 0 if


Observe that a node ^{D}^{L}^{D }^{L }
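As an illustration of the duplication and loss costs discussed above, the following Python sketch computes an lca-mapping and counts duplications and losses for a rooted gene tree. It is a minimal sketch under assumptions that are not taken from this article: trees are nested tuples with species names at the leaves, the helper names (`clusters`, `species_depths`, `lca`, `reconcile`) are hypothetical, and losses are counted with the standard convention for binary trees, which may differ in detail from the definitions used here.

```python
def clusters(tree):
    """Set of leaf labels below a nested-tuple tree (leaves are strings)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return clusters(tree[0]) | clusters(tree[1])


def species_depths(species_tree):
    """Map every species-tree cluster to its depth (the root has depth 0)."""
    depth = {}

    def walk(node, d):
        depth[clusters(node)] = d
        if not isinstance(node, str):
            walk(node[0], d + 1)
            walk(node[1], d + 1)

    walk(species_tree, 0)
    return depth


def lca(depth, x, y):
    """Deepest species cluster that contains both clusters x and y."""
    union = x | y
    return max((c for c in depth if union <= c), key=lambda c: depth[c])


def reconcile(gene_tree, species_tree):
    """Return (duplications, losses) for a rooted binary gene tree."""
    depth = species_depths(species_tree)
    dups = losses = 0

    def visit(g):
        nonlocal dups, losses
        if isinstance(g, str):
            return frozenset([g])          # a leaf maps to its species
        m1, m2 = visit(g[0]), visit(g[1])
        m = lca(depth, m1, m2)             # lca-mapping of the internal node
        d1, d2 = depth[m1] - depth[m], depth[m2] - depth[m]
        if m == m1 or m == m2:             # duplication node
            dups += 1
            losses += d1 + d2
        else:                              # speciation node
            losses += (d1 - 1) + (d2 - 1)
        return m

    visit(gene_tree)
    return dups, losses
```

For the species tree `(("a", "b"), ("c", "d"))`, a gene tree with the same topology needs no events, whereas `(("a", "c"), ("b", "d"))` forces one duplication at the root and four losses under this convention.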

Introduction to unrooted reconciliation

Here we highlight some results from _{* }is a new node defining the root. To distinguish between rootings of **(A1) ****(A2) **_{e}_{*})=⊤; that is, the root of every rooting is mapped into the root of _{e}_{*}) with no change of the cost).

First, we transform

Edges in _{1}, _{2}, _{1 }and _{2}, respectively. Then the edge _{3 }≠ _{1 }and _{3 }≠ _{2 }is labeled by _{1 }+ _{2}. Such labeling will be used to explore mappings of rootings of

Every internal node _{i}_{i}, v_{i}_{i}

**Unrooted reconciliation**. a) A star in

There are several types of possible star topologies based on the labeling (for proofs and details see

In summary, stars are basic 'puzzle-like' units that can be assembled into unrooted gene trees. However, not all star compositions represent a gene tree. For instance, there is no gene tree with 3 stars of type S2. It follows from

Now we overview the main result of **(M1) **if **(M2) **if

Now we summarize the time complexity of this procedure. It follows from
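To make the unrooted reconciliation problem concrete, a brute-force baseline can place the root on every edge of the unrooted gene tree, reconcile each rooting, and keep the cheapest ones. The Python sketch below does exactly that; it is not the efficient procedure summarized in this section. It assumes nested-tuple trees with species names at the leaves, an unweighted cost equal to duplications plus losses (standard convention for binary trees), and hypothetical helper names (`leaves`, `depths`, `dl_cost`, `rootings`, `optimal_rootings`). The unrooted gene tree is passed in under an arbitrary rooting.

```python
def leaves(tree):
    """Set of leaf labels of a nested-tuple tree (leaves are strings)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return leaves(tree[0]) | leaves(tree[1])


def depths(species_tree):
    """Depth of every species-tree cluster; the root has depth 0."""
    table, stack = {}, [(species_tree, 0)]
    while stack:
        node, d = stack.pop()
        table[leaves(node)] = d
        if not isinstance(node, str):
            stack.extend((child, d + 1) for child in node)
    return table


def dl_cost(gene_tree, depth):
    """Return (lca-mapping of the root, duplications + losses)."""
    if isinstance(gene_tree, str):
        return frozenset([gene_tree]), 0
    m1, c1 = dl_cost(gene_tree[0], depth)
    m2, c2 = dl_cost(gene_tree[1], depth)
    m = max((c for c in depth if m1 | m2 <= c), key=depth.get)
    span = (depth[m1] - depth[m]) + (depth[m2] - depth[m])
    # duplication: one event plus all skipped edges; speciation: skipped edges only
    here = 1 + span if m in (m1, m2) else span - 2
    return m, c1 + c2 + here


def rootings(tree):
    """Yield one rooted tree per edge of the underlying unrooted tree."""
    def push(down, up):
        if isinstance(down, str):
            return
        a, b = down
        for child, sibling in ((a, b), (b, a)):
            rest = (sibling, up)           # everything on the other side of the edge
            yield (child, rest)
            yield from push(child, rest)

    left, right = tree
    yield tree                             # root on the edge joining left and right
    yield from push(left, right)
    yield from push(right, left)


def optimal_rootings(gene_tree, species_tree):
    """Minimum cost over all rootings and the rootings that attain it."""
    depth = depths(species_tree)
    scored = [(dl_cost(r, depth)[1], r) for r in rootings(gene_tree)]
    best = min(cost for cost, _ in scored)
    return best, [r for cost, r in scored if cost == best]
```

A tree with n leaves has 2n − 3 edges, so this baseline performs 2n − 3 full reconciliations per gene tree, which is the kind of repeated work an efficient algorithm avoids.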

Methods

First we describe our algorithm for computing the optimal cost and the set of optimal edges after one nearest neighbor interchange (NNI) operation performed on an unrooted gene tree, and then extend it to a general case with

**NNI**. A single NNI on _{i }and _{i }_{i }_{i }

Algorithm

Now we show that a single NNI operation can be completed in constant time if all structures required for computing the optimal rootings are already constructed. First, let us assume that the following is given: (a) two positive reals

**Definition 1**. _{i}-s are (rooted) subtrees of _{1}, _{2}) _{3}, _{4}) _{0 }_{i }is the root of T_{i}, e_{i }is the edge connecting w_{i }with e_{0 }_{i }is the lca-mapping of T_{i}

An NNI operation is depicted in Figure _{1}, _{3}), (_{2}, _{4})). However, it can be easily defined and therefore it is omitted here. Observe that the NNI operation (without updating of lca-mappings) can be performed in constant time for both trees.
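The topological effect of an NNI around a single internal edge can be sketched in a few lines of Python. The sketch assumes the four subtrees attached to the edge are given as a nested tuple `((T1, T2), (T3, T4))`; the function name `nni_neighbours` is hypothetical, and the lca-mappings and cost updates treated in this section are deliberately left out.

```python
def nni_neighbours(quartet):
    """Return both NNI neighbours of ((T1, T2), (T3, T4)) around its centre edge."""
    (t1, t2), (t3, t4) = quartet
    return [
        ((t1, t3), (t2, t4)),  # exchange T2 and T3
        ((t1, t4), (t2, t3)),  # exchange T2 and T4
    ]


# The subtrees may themselves be arbitrary nested tuples.
example = (("A", ("B1", "B2")), ("C", "D"))
first, second = nni_neighbours(example)
```

Each call only rebuilds a constant number of tuples, which mirrors the observation that the interchange itself, without updating the mappings, takes constant time.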

The right part of Figure _{i }

**Lemma 1**.

We conclude that updating

For convenience, assume that the NNI operation replaces _{i }_{1}, _{4}} and {_{2}, _{3}} are semi-alternating. For two edges

**Lemma 2**. _{i }is replaced by

Proof: (EQ1) All edges in

**Lemma 3 **(NE1). _{i}
, e_{j}

Proof: The type of

**Lemma 4 **(NE2).

Proof: In this case

**Lemma 5**.

**(NE3) **_{i }and e_{j }are symmetric then

**(NE4) **_{j }is asymmetric and

**(NE5) **_{j }and

Proof: Note that {_{0}, _{i}
, e_{j}

First, we introduce a function for computing the cost differences. Consider three nodes ^{D }^{L }_{3}(_{4}(

**Lemma 6**. _{4}(_{1}, _{2}, _{3}, _{4})_{i }(for i

As mentioned the above lemma can be proved by comparing the rootings placed on the center edges in

**Lemma 7**. _{i }_{i}_{i }is replaced by _{3}(_{4}, _{3}, _{2}) _{3}(_{3}, _{4, }_{1}) _{3}(_{2}, _{1}, _{4}) _{3}(_{1}, _{2}, _{3})

Similarly to Lemma 6 we can prove Lemma 7 by comparing the rootings of _{i }

**Algorithm**. Optimal weighted cost for

General reconstruction problems

We present several approaches to problems of error correction and phylogeny reconstruction. Let us assume that

**Problem 1 **(**. Given a rooted species tree **

The

**Problem 2 **(

The complexity of the

In applications there is typically no need to search over all NNI variants of a gene tree. For instance, a good candidate for an NNI operation is

Software

The unrooted reconciliation algorithm

Software and datasets from our experiments are made freely available through

Experimental results and discussion

Data preparation

First, we inferred 4133 unrooted gene trees with branch lengths from nine yeast genomes contained in the Genolevures 3 data set

We aligned the protein sequences of each gene family by using the program TCoffee

**Yeasts phylogeny**. Species tree topologies. G3 - original phylogeny of Genolevures 3 data set

Inferring optimal species trees

The optimal species tree reconstructed with error corrections (1NNIST optimization problem) is depicted in Figure

From weak edges to species trees

In the previous experiment, the NNI operations were performed on almost every gene tree in the optimal solution and with no restrictions on the edges. In order to reconstruct the trees more accurately, we performed experiments for

Figures

**ω-1NNIST and**. A summary of

**ω-3NNIST experiments**. A summary of

**Branch lengths**. Histogram of branch lengths.

**Rejected gene trees**. The number of rejected trees as a function of

From trusted species tree to weak edges in gene trees - automated and manual curation

Assume that the set of unrooted gene trees and the rooted (trusted) species tree

Discussion

We present novel theoretical and practical results on the problem of error correction and phylogeny reconstruction. In particular, we describe a polynomial time and space algorithm that simultaneously solves the problem of correcting topological errors in unrooted gene trees and the problem of rooting unrooted gene trees. The algorithm allows us to efficiently perform experiments on truly large-scale datasets available for yeast genomes. Our experiments suggest that our algorithm can be used to (i) detect errors, (ii) infer a correct phylogeny of species in the presence of weak edges in gene trees, and (iii) help in tree curation procedures.

Conclusion

We introduced a novel polynomial time algorithm for error-corrected and unrooted gene tree reconciliation. Experiments on yeast genomes suggest that an implementation of our algorithm can greatly improve the accuracy of gene tree reconciliation, and thus help curate error-prone gene trees. Moreover, we use our error-corrected reconciliation to make the gene duplication problem, a standard application of gene tree reconciliation, more robust. We conjecture that the error-corrected gene duplication problem is intrinsically hard to solve, since the gene duplication problem is already NP-hard. Therefore, we introduced an effective heuristic for the error-corrected gene duplication problem. Our experimental results for a wide range of error-correction tests on the yeast phylogeny show that our error-corrected reconciliations result in improved predictions of invoked gene duplication and loss events, which in turn allow the inference of more accurate phylogenies.

The presented error correction is based on gene-species tree reconciliation using gene duplication and loss. However, there are other major evolutionary mechanisms that produce gene tree topologies inconsistent with the actual species tree topology, like horizontal gene transfer and deep coalescence. Gene tree reconciliation using these mechanisms is highly sensitive to topological error, similar to gene tree reconciliation under gene duplication and loss. Future work will focus on the development of algorithms that can also reconcile unrooted and erroneous gene trees using horizontal gene transfer and deep coalescence.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

PG and OE were responsible for algorithm design and writing the paper. PG implemented the programs, and performed the experimental evaluation and the analysis of the results. Both authors read and approved the final manuscript.

Acknowledgements

The reviewers have provided several valuable comments that have improved the presentation. This work was conducted in part with support from the Gene Tree Reconciliation Working Group at NIMBioS through NSF award #EF-0832858. PG was partially supported by the grant of MNiSW (N N301 065236) and OE was supported in part by NSF awards #0830012 and #10117189.

This article has been published as part of