Email updates

Keep up to date with the latest news and content from BMC Bioinformatics and BioMed Central.

This article is part of the supplement: Selected articles from the 7th International Symposium on Bioinformatics Research and Applications (ISBRA'11)

Open Access Proceedings

Efficient error correction algorithms for gene tree reconciliation based on duplication, duplication and loss, and deep coalescence

Ruchi Chaudhary1, J Gordon Burleigh2 and Oliver Eulenstein1*

Author Affiliations

1 Department of Computer Science, Iowa State University, Ames, IA 50011, USA

2 Department of Biology, University of Florida, Gainesville, FL 32611, USA

For all author emails, please log on.

BMC Bioinformatics 2012, 13(Suppl 10):S11  doi:10.1186/1471-2105-13-S10-S11


The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1471-2105/13/S10/S11


Published:25 June 2012

© 2012 Chaudhary et al. licensee BioMed Central Ltd.

This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

Gene tree - species tree reconciliation problems infer the patterns and processes of gene evolution within a species tree. Gene tree parsimony approaches seek the evolutionary scenario that implies the fewest gene duplications, duplications and losses, or deep coalescence (incomplete lineage sorting) events needed to reconcile a gene tree and a species tree. While a gene tree parsimony approach can be informative about genome evolution and phylogenetics, error in gene trees can profoundly bias the results.

Results

We introduce efficient algorithms that rapidly search local Subtree Prune and Regraft (SPR) or Tree Bisection and Reconnection (TBR) neighborhoods of a given gene tree to identify a topology that implies the fewest duplications, duplication and losses, or deep coalescence events. These algorithms improve on the current solutions by a factor of n for searching SPR neighborhoods and n2 for searching TBR neighborhoods, where n is the number of taxa in the given gene tree. They provide a fast error correction protocol for ameliorating the effects of gene tree error by allowing small rearrangements in the topology to improve the reconciliation cost. We also demonstrate a simple protocol to use the gene rearrangement algorithm to improve gene tree parsimony phylogenetic analyses.

Conclusions

The new gene tree rearrangement algorithms provide a fast method to address gene tree error. They do not make assumptions about the underlying processes of genome evolution, and they are amenable to analyses of large-scale genomic data sets. These algorithms are also easily incorporated into gene tree parsimony phylogenetic analyses, potentially producing more credible estimates of reconciliation cost.

Introduction

The availability of large-scale genomic data from a wide variety of taxa has revealed much incongruence between gene trees and the phylogeny of the species in which the genes evolve. This incongruence may be caused by evolutionary processes such as gene duplication and loss, deep coalescence, or lateral gene transfer. The variation in gene tree topologies can be used to infer the processes of genome evolution. Gene tree - species tree (GT-ST) reconciliation methods seek to map the history of gene trees into the context of species evolution and thus potentially link processes of gene evolution to phenotypic changes and diversification. Yet these methods can be confounded by error in the gene trees, which also may cause incongruence between the gene and species topologies. We introduce efficient algorithms to correct gene tree topologies based on the gene duplication, duplication and loss, or deep coalescence cost models. The algorithms work by identifying the small rearrangements in the gene trees that reduce the reconciliation cost. They are extremely fast and thus amenable to analyses of enormous genomic data sets.

Perhaps the most commonly used and computationally feasible approach to GT-ST reconciliation is gene tree parsimony, which seeks to infer the fewest evolutionary events (e.g., duplication, loss, coalescence, or lateral gene transfer) needed to reconcile a gene tree and species tree topology [1]. This approach also can be extended to infer species phylogenies, finding the species tree that implies the fewest evolutionary events implied by the gene trees (e.g., [2-4]). However, the gene trees often are estimated using heuristic methods from short sequence alignments, and consequently, there is often much error in the estimated gene tree topologies. Error in the gene trees creates more GT-ST incongruence and can radically affect GT-ST reconciliation analyses, implying far more duplications, duplications and losses, or deep coalescence events than actually exist. For example, Rasmussen and Kellis [5] estimated that error in gene tree reconstruction can lead to 2-3 fold overestimates of gene duplications and losses. Gene tree error also can erroneously imply large numbers of duplications near the root of the species tree [6,7], and it can mislead gene tree parsimony phylogenetic analyses (e.g., [8-10]).

Several approaches have been proposed to address gene tree error in GT-ST reconciliation. First, questionable nodes in a gene tree or nodes with low support may be collapsed prior to gene tree reconciliation, and the resulting non-binary gene trees may be reconciled with species trees [11-13]. Similarly, GT-ST reconciliations can use a distribution of gene tree topologies, such as bootstrap gene trees, rather than a single gene tree estimate [6,14,15]. Both of these approaches may help account for stochastic error and uncertainty in gene tree topologies, but they do not explicitly confront gene tree error. Methods also exist to simultaneously infer the gene tree topology and the gene tree reconciliation with a known species tree [5,16]. While these sophisticated statistical approaches appear very promising, they are computationally intensive, and it is unclear if they will be tractable for large-scale analyses. Another, perhaps a more computationally feasible, approach is to allow a limited number of local rearrangements in the gene tree topology if they reduced the reconciliation cost [17,18].

Previously [17,18] described a method to allow NNI-branch swaps on selected branches of a gene tree to reduce the reconciliation cost. Following [17,18], we address gene tree error in the reconciliation process by assuming that the correct gene tree can be found in a particular neighborhood of the given gene tree. We describe this approach for the gene duplication, duplication and loss, and deep coalescence models, which identify the fewest respective events implied from a given gene tree and given species tree. This neighborhood consists of all trees that are within one edit operation of the gene tree. While [17,18] use Nearest Neighbor Interchange (NNI) edit operations to define the neighborhood, we use the standard tree edit operations SPR [19,20] and TBR [19], which significantly extend upon the search space of the NNI neighborhood. The SPR and TBR local search problems find a tree in the SPR and TBR neighborhood of a given gene tree, respectively, that has the smallest reconciliation cost when reconciled with a given species tree. Using the algorithm from Zhang [21] the best known (naïve) runtimes are O(n3) for the SPR local search problem and O(n4) for the TBR local search problem, where n is the number of taxa in the given gene tree. These runtimes typically are prohibitively long for the computation of larger GT-ST reconciliations. We improve on these solutions by a factor of n for the SPR local search problem and a factor of n2 for the TBR local search problem. This makes the local search under the TBR edit operation as efficient as under the SPR edit operation, and it provides a high-speed gene tree error-correction protocol that is computationally feasible for large-scale genomic data sets.

We also evaluated the performance of our algorithms using the implementation of SPR based local search algorithms. Note, that the SPR neighborhood is properly contained in the TBR neighborhood for any given tree. Thus the performance of the SPR based program provides a conservative estimate of the performance of the TBR based program. We test our programs on a collection of 106 yeast gene trees, some of which contain hundreds of leaves, and we demonstrate how it can be easily incorporated into large-scale gene tree parsimony phylogenetic analyses.

Basic notation and preliminaries

Throughout this paper, the term tree refers to a rooted full binary (phylogenetic) tree.

Let T be a tree. The leaf set of T is denoted by Le(T). The set of all vertices of T is denoted by V(T) and the set of all edges by E(T). The root of T is denoted by Ro(T). The set of internal vertices of T is I(T):= V(T)\Le(T).

Given a vertex v V(T), we denote the parent of v by PaT(v). Let u := PaT(v). The edge that connects v with u is written as (u, v). The first element in the pair is always the parent of the second element. The set of all children of v is denoted by ChT(v) and the children are called siblings. For w ChT(v), the sibling of w is denoted by SbT(w).

We define ≤T to be the partial order on V(T) where xT y if y is a vertex on the path between Ro(T) to x, and write x <T y if x T y and x y. The least common ancestor of a non-empty subset L V(T), denoted as LCAT(L), is the unique smallest upper bound of L under ≤T. Given x, y V(T), dT(x, y) denotes the number of edges on the unique path between x and y in T.

Given U V(T), we denote by T(U) the unique rooted subtree of T that spans U with the minimum number of vertices. Furthermore, the restriction of T to U, denoted by T|U, is the rooted tree that is obtained from T(U) by suppressing all non-root vertices of degree two. The subtree of T rooted at u V(T), denoted by Tu, is defined to be T|U, for U:= {v Le(T): v T u}. Two trees T1 and T2 are called isomorphic if there exists a bijection between the vertex sets of T1 and T2 which maps a vertex u1 of T1 to vertex u2 of T2 if the subtree rooted at u1 in T1 has the same leaf set as the subtree rooted at u2 in T2. If an isomorphism exists between T1 and T2, we write T1 T2.

Given function f : A B, we denote by f(A') where A' ⊆ A a set of images of each element in A' under f.

The reconciliation cost models

A species tree is a tree that depicts the branching pattern representing the divergence of a set of species, whereas a gene tree is a tree that depicts the evolutionary history among the sequences encoding one gene (or gene family) for a given set of species. We assume that each leaf of the gene tree is labeled with the species from which that gene was sampled. Let G be a gene tree and S a species tree. In order to compare G with S, we require a mapping from each gene g V(G) to the most recent species in S that could have contained g.

Definition 1 (Mapping). The leaf-mapping ℒG,S : Le(G) → Le(S) is a surjection that maps each leaf gLe(G) to that unique leaf s Le(S) which has the same label as g. The extension G,S : V(G) → V(S) is the mapping defined by ℳG,S(g):= LCA(ℒG,S(Le(Gg))). For convenience, we write ℳ(g) instead of G,S(g) when G and S are clear from the context.

Definition 2 (Comparability). Given trees G and S, we say that G is comparable to S if a leaf-mapping ℒG,S(g) is well defined.

Throughout this paper we use the following terminology: G is a gene tree that is comparable to the species tree S through a leaf-mapping ℒG,S, and n is the number of taxa present in both input trees.

Now we define different reconciliation costs from G to S for a given mapping ℳG,S that extends ℒG,S. The reconciliation cost are based on the models of gene duplication [22,23], duplication-loss [21], and deep coalescence [21].

Definition 3 (Duplication cost).

The duplication cost from g V(G) to S, <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M1">View MathML</a>

The duplication cost from G to S, <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M2">View MathML</a>.

Definition 4 (Duplication-loss cost).

The loss cost from g V(G) to S,

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M3','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M3">View MathML</a>

• The duplication-loss cost from G to S, <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M4','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M4">View MathML</a>.

Definition 5 (Deep coalescence cost).

The number of lineages from g V(G) to h Ch(g) in S,

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M5','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M5">View MathML</a>

The deep coalescence cost from G to S, <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M6','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M6">View MathML</a>.

The error-correction problems

Here we give definitions for tree rearrangement operations TBR [19] and SPR [19,20], and then formulate the Error-Correction problems that were motivated in the introduction.

Definition 6 (Tree Bisection and Reconnection (TBR)). Let T be a tree. For this definition, we regard the planted tree Pl(T) as the tree obtained from adding the root edge{r, Ro(T)} to E(T), where r V(T).

Let e := (u, v) ∈ E(T), and X and Y be the connected components that are obtained by removing edge e from T such that v X and u Y. We define TBRT (v, x, y) for x ∈ X and y Y to be the tree that is obtained from Pl(T) by first deleting edge e, and then adjoining a new edge f between X and Y as follows:

1. If x Ro(X) then suppress Ro(X) and create a new root by subdividing edge (Pa(x), x).

2. Subdivide edges (Pa(y), y) by introducing a new vertex y'.

3. Re-connect components X and Y by adding edge f = (y', Ro(x).

4. Suppress the vertex u, and rename vertex y' as u.

5. Contract the root edge.

We say that, the tree TBRT (v, x, y) is obtained from T by a tree bisection and reconnection (TBR) operation that bisects the tree T into the components X and Y , and reconnects them above the nodes x and y. (See Figure 1.) We define the following neighborhoods for the TBR operation:

1. TBRG(v, x) := y∈Y TBRG(v, x, y)

2. TBRG(v) := x∈X TBRG(v, x)

3. TBRG := (u, v)∈E(G) TBRG(v)

thumbnailFigure 1. An TBR operation. Tree T' = TBRT(v, x, y) results from T after performing single TBR operation.

Definition 7 (Subtree Prune and Regrafting (SPR)). The SPR operation is defined as a special case of the TBR operation. Let e := (u, v) ∈ E(T), and X and Y be the connected components that are obtained by removing edge e from T such that v X and u Y. We define SPRT (v, y) for y Y to be TBRT (v, v, y). We say that the tree SPRT (v, y) is obtained from T by performing subtree prune and regraft (SPR) operation that prunes subtree Tv and regrafts it above y. (See Figure 2(a).)

thumbnailFigure 2. The NNI adjacency graph. (a) The tree <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M72','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M72">View MathML</a> is obtained from G by pruning and regrafting subtree Gv to the root of G. The vertex x V(G) is suppressed, and the new vertex above root in <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M73','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M73">View MathML</a> is named x. (b) Two NNI operations NNIG'(z') and NNIG'(z) produce left-child G'l and right-child G'r of G' in <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M74','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M74">View MathML</a>.

We define the following neighborhoods for the SPR operation:

1. SPRG(v) := y∈Y SPRG(v, y)

2. SPRG := (u, v)∈E(G) SPRG(v)

We now state the SPR based error-correction problems for duplication (D), duplication-loss (DL), and deep coalescence (DC). Let Γ ∈ {D, DL, DC}.

Problem 1 (SPR based error-correction for Γ (SEC-Γ))

Instance: A gene tree G and a species tree S.

Find: A gene tree G* ∈ SPRG such that <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M7','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M7">View MathML</a>.

The TBR based error-correction for Γ (TEC-Γ) problems are defined analogously to the SPR based error-correction for Γ (SEC-Γ) problems.

Solving the SEC-Γ problems

In this section we study the SPR based error-correction problems, for duplication (D), duplication-loss (DL), and deep coalescence (DC), in more detail. Our efficient solution for these problems are based on solving restricted versions of these problems efficiently. For each Γ ∈ {D, DL, DC} we first define a restricted version of the SEC-Γ problem, which we call the restricted SPR based error-correction for the Γ (R-SEC-Γ) problem.

Problem 2 (Restricted SPR based error-correction for (R-SEC-Γ))

Instance: A gene tree G, a species tree S, and v V(G).

Find: A gene tree G* ∈ SPRG(v) such that <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M8','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M8">View MathML</a>.

Observation 1. Let Γ ∈ {D, DL, DC}. Given a gene tree G and a species tree S, the SEC-Γ problem can be solved as follows: (i) solve the R-SEC-Γ problem for every v V(G) where v Ro(G), (ii) under all solutions found return a minimum scoring gene tree G*.

Naïvely, the R-SEC-Γ problem can be solved in Θ(n2) time by computing the cost <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M75','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M75">View MathML</a> for each G' ∈ SPRG(v). The cost for a given gene and species tree can be computed in Θ(n) time [21]. We introduce a novel algorithm for the R-SEC-Γ problem that improves by a factor of n on the naïve solution. This speedup is achieved by semi-ordering the trees in SPRG(v), for each v V(G), such that the score-difference of any two consecutive trees in this order can be computed in constant time.

Ordering the trees in SPRG(v)

Consider a graph on trees in SPRG(v), in which every two adjacent trees are one NNI [19] operation apart. We show that such a graph is a rooted full binary tree, after providing necessary definitions.

Definition 8 (Nearest Neighbor Interchange (NNI)). We define the NNI operation as a special case of the SPR operation. Let e E(T) where e := (u, v), and X and Y be the connected components that are obtained by removing edge e from T such that v X and u Y. We define NNIT (v) to be SPRT (v, y) for y:= Pa(u), and say that NNIT (v) is obtained from T by performing nearest neighbor interchange (NNI) operation that prunes subtree Tv and regrafts it above the parent of v's parent. (See Figure 2(b).)

Definition 9 (NNI distance). Let the NNI-distance, denoted as dNNI(T1, T2), between two trees T1 and T2 over n taxa be the minimum number of NNI operations required to transform T1 into T2.

Definition 10 (NNI-adjacency graph). The NNI-adjacency graph, denoted as <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M76','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M76">View MathML</a>, is the graph where V = SPRG(v) and {T1, T2} ∈ E if dNNI(T1, T2) = 1.

Lemma 1. <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M74','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M74">View MathML</a>is a tree.

Proof. We prove it by showing that there exists a unique path between every two vertices in <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M74','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M74">View MathML</a>. Let G', G'' ∈ V(<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M74','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M74">View MathML</a>), thus G', G'' ∈ SPRG(v). Let G':= SPRG(v, x1) and G'':= SPRG(v, x2). We use induction on dG(x1, x2). Let dG(x1, x2) = 1 and assume without loss of generality that x2 = PaG(x1). Thus, G'' = NNIG'' (Sb(x1)). So the hypothesis holds for dG(x1, x2) = 1. Assume now that the hypothesis is true for dG(x1, x2) ≤ k and suppose dG(x1, x2) = k + 1. Since G is a tree, there must be a unique path between x1 and x2; let y be a vertex on this path. Let dG(y, x1) = 1, and Gn := SPRG(v, y). If y = PaG(x1), then Gn = NNIG'(v); otherwise Gn = NNIG'(Sb(y)). Since dG(y, x2) = k, thus (by induction hypothesis) the hypothesis is valid for dG(x1, x2) = k + 1. □

Theorem 1. <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M74','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M74">View MathML</a>is a rooted full binary tree.

Proof. In view of Lemma 1, it suffices to show that except a unique vertex of degree 2 all other vertices in <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M74','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M74">View MathML</a> are of degree 1 or 3. Let G' ∈ V(<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M74','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M74">View MathML</a>), thus G' = SPRG(v, y) for some y V(G). There are three cases:

Case 1: y is a root. Let y1 ChG(y). Let G1 := SPRG(v, y1), thus <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M81','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M81">View MathML</a>). Hence {G1, G'} ∈ E(<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M74','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M74">View MathML</a>). Since |ChG(y)| = 2, G' must be a degree 2 vertex in <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M74','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M74">View MathML</a>.

Case 2: y is a leaf. Let y1 = PaG(y). Let G1 := SPRG(v, y1), thus G1 = NNIG'(v). Hence {G1, G'} ∈ E(<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M74','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M74">View MathML</a>), and consequently, G' is a degree 1 vertex in <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M74','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M74">View MathML</a>.

Case 3: y is an internal vertex. Let y1 = PaG(y) and y2 ChG(y). Let G1 := SPRG(v, y1), thus G1 = NNIG'(v). Let G2 := SPRG(v, y2), thus G' = NNIG2(v). Since y has one parent and two children in G, thus G' is a degree 3 vertex in <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M74','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M74">View MathML</a>.

This completes the proof.

The score difference of consecutive trees in <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M74','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M74">View MathML</a>

To solve the R-SEC-Γ problems we traverse tree <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M74','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M74">View MathML</a>. Two adjacent trees in V(<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M74','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M74">View MathML</a>) are one NNI operation apart. We show that <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M77','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M77">View MathML</a> score of a tree can be computed in constant time from the LCA computation of its adjacent tree.

Let e := (G', G'') be an edge in <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M74','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M74">View MathML</a>. Let x := Pa(v), y := Sb(v), and z, z' ∈ Ch(y) in G' (see Figure 2(b)). Without loss of generality, let G'':= NNIG'(z). (Observe, G'' is similar to <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M82','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M82">View MathML</a> of Figure 2(b).)

Lemma 2. ℳG'',S(y) = ℳG',S(x).

Proof. From NNI operation, v, z' ∈ ChG''(x) and z, x ChG''(y). Also, <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M83','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M83">View MathML</a>, so <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M84','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M84">View MathML</a>. Thus, <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M85','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M85">View MathML</a>. □

Lemma 3. ℳG'',S(w) = ℳG',S(w), for all w V(G')\{x, y}.

Proof. For <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M9','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M9">View MathML</a>, since <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M86','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M86">View MathML</a>, therefore ℳG',S(g) = ℳG'',S(g). Also, except for subtree <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M87','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M87">View MathML</a>, the rest of the tree remains the same in <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M88','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M88">View MathML</a>. Thus by Lemma 2, ℳG',S(PaG' (x)) = ℳG'',S(PaG''(y)). Inductively, ℳG',S(g) = ℳG'',S(g), for all g V(G')\V(G'x). □

Lemma 4. ℳG'',S(x) = LCA(ℳG',S(v), ℳG', S(z')).

Proof. From Lemma 3, ℳG'',S(v) = ℳG',S(v) and ℳG'',S(z') = ℳG',S(z'). Thus, ℳG'',S(x) = LCA(G'',S(v),ℳG'',S(z')) = LCA(ℳG',S(v),ℳG',S(z')). □

Lemma 5. <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M78','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M78">View MathML</a>, for all g V(G'')\{x, y} and Γ ∈ {D, DL, DC}.

Proof. The gene duplication and loss status of a vertex, and the number of lineages from a vertex to its children in G' can change in G'' if its mapping or mapping of any of its children changes in ℳG'',S. From Lemma 3, and also, since ℳG'',S(w) = ℳG',S(w), for w Ch(PaG'(x)), must have <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M79','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M79">View MathML</a>. Thus the Lemma follows. □

Let e := (G', G'') ∈ E(<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M74','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M74">View MathML</a>) and Γ ∈ {D, DL, DC}. We define <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M80','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M80">View MathML</a> with respect to the given species tree S. Observe that this score can be negative too. We study how Γe can be computed efficiently for each edge e in <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M74','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M74">View MathML</a>.

Theorem 2. <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M10','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M10">View MathML</a>.

Proof .<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M11','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M11">View MathML</a>

Definition 11. Let <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M12','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M12">View MathML</a>, and let PG' be a path from <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M13','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M13">View MathML</a>to G' in <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M74','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M74">View MathML</a>. For G', we define the score-difference <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M14','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M14">View MathML</a>as <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M15','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M15">View MathML</a>.

Theorem 3. For given S, G, and v V(G), the tree G' ∈ V(<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M74','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M74">View MathML</a>) is the output of a R-SECproblem iff <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M16','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M16">View MathML</a>.

Proof. Let <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M17','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M17">View MathML</a>. We prove that G' is the output of R-SEC-Γ problem. Since <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M18','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M18">View MathML</a>, thus G' gives the minimum normalized <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M77','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M77">View MathML</a> score over all trees in V(<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M74','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M74">View MathML</a>). Hence, G' must be the output of the R-SEC-Γ problem. The other direction follows similarly. □

The algorithm

We describe a general algorithm, called Algo-R-SEC-Γ, to solve the R-SEC-Γ problem for each Γ ∈ {D, DL, DC}. Initially Algo-R-SEC-Γ computes (i) the root vertex of the NNI-adjacency graph <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M74','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M74">View MathML</a>, which we call <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M19','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M19">View MathML</a>, by regrafting the subtree Gv above the root of G, (ii) the LCA mapping from <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M20','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M20">View MathML</a> to S, and (iii) the Γ score from <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M21','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M21">View MathML</a> to S. Then recursively Algo-R-SEC-Γ computes the LCA mapping and Γ score for every vertex G' in <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M74','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M74">View MathML</a> when the LCA mapping and Γ score of its parent vertex in <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M74','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M74">View MathML</a> is known. Algorithm 1 details Algo-R-SEC-Γ.

Algorithm 1 - Algo-R-SEC-Γ

Input: A gene tree G, a species tree S, and v V(G)

Output: A tree G* ∈ SPRG(v) such that <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M22','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M22">View MathML</a>

01. Compute <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M23','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M23">View MathML</a>by pruning Gv and regrafting at Ro(G)

02. Compute LCA mapping <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M24','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M24">View MathML</a>

03. Call <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M25','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M25">View MathML</a>

04. Set <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M26','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M26">View MathML</a>, BestScore = 0

05. Set <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M27','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M27">View MathML</a>, <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M28','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M28">View MathML</a>, <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M29','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M29">View MathML</a>, <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M30','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M30">View MathML</a>

06. For each <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M31','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M31">View MathML</a>in preorder traversal of <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M32','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M32">View MathML</a>, do

07.    If not backtracking, then

08.       Set x = PaG'(v), y = SbG'(v)

09.       Set G'' = NNIG'(SbG'(k))

10.       Set G''.S = ℳG',S, ℳG'',S(y) = ℳG',S(x)

11.       ℳG'',S(x) = LCA(ℳG',S(k),ℳG',S(v))

12.       Call <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M33','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M33">View MathML</a>

13.       <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M34','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M34">View MathML</a>

14.       If<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M35','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M35">View MathML</a>, then

15.          Set BestTree = G'', <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M36','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M36">View MathML</a>

16.       Else,

17.          Set <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M37','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M37">View MathML</a>

18.          Set G'' = NNIG'(v)

19.          Set <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M38','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M38">View MathML</a>

20.          Set <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M39','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M39">View MathML</a>

21.          Call <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M40','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M40">View MathML</a>

22.          Set <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M41','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M41">View MathML</a>

23.    Set <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M42','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M42">View MathML</a>

24. Return BestTree

Algorithm 2 - Algo-Comp-Score

Input: A gene tree G, a species tree S, and LCA mapping G,S

Output:<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M77','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M77">View MathML</a>(G,S)

01. score = 0

02. For each g I(G) in preorder traversal of G, do

03.    <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M43','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M43">View MathML</a>

04. If Γis DC, then

05.    score = score - |I(S)|

06. Return score

Algorithm 3 - Algo-G-Score

Input: A gene tree G, a species tree S, LCA mapping G,S, and g I(G)

Output:<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M77','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M77">View MathML</a>(G, S, g)

01. If Γ is D, then

02.    Ifℳ(g) ∈ ℳ(Ch(g)), then

03.       Return 1

04. ElseIf Γis DL, then

05    <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M44','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M44">View MathML</a>

06.    Ifℳ(g) ∈ ℳ(Ch(g)), then

07.       Return ls + 1

08. Else

09.    Return ls

10. Else //Γ is DC

11    Return<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M45','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M45">View MathML</a>

Lemma 6. The R-SEC-Γ problem is correctly solved by Algo-R-SEC-Γ.

Proof. Lemma 1-5 and Theorem 1-3 directly imply that in order to prove the correctness of algorithm Algo-R-SEC-Γ, it is sufficient to prove that it correctly returns G' of minimum <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M46','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M46">View MathML</a> among all G' ∈ V(<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M74','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M74">View MathML</a>). We will show that algorithm Algo-R-SEC- accounts each G' ∈ V(<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M74','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M74">View MathML</a>), correctly computes <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M47','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M47">View MathML</a> for Γ ∈ {D, DL, DC}, and returns the right G' as output.

From Definition 10, V(<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M74','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M74">View MathML</a>) = SPRG(v). In Algo-R-SEC-Γ, step 1 prunes subtree Gv and regrafts it above the root of G to create <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M48','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M48">View MathML</a>. Step 5 sets G' to <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M49','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M49">View MathML</a>. The for-loop in step 6 traverses subtree <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M50','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M50">View MathML</a> in preorder. For each traversed vertex <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M51','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M51">View MathML</a>, step 9 builds the tree G'':= SPRG(v, k) by applying NNI operation on the last build G''. Each for-loop iteration sets G' to the last build G'' in step 23. <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M52','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M52">View MathML</a> and G''s constitute all the trees in SPRG(v).

For <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M53','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M53">View MathML</a>, step 2 computes the LCA mapping and step 5 sets <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M54','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M54">View MathML</a> to zero. Following Lemma 2-4 and Theorem 2, step 10 and 11 update the LCA of G'' and step 12 computes <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M55','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M55">View MathML</a> by calling algorithm Algo-G-Score. Depending on Γ ∈ {D, DL, DC}, there are three cases:

Case 1: Γ is D. Algo-G-Score returns 1, if the vertex g ∈ V(G'') maps to the same vertex in S as any of its children maps to, otherwise 0.

Case 2: Γ is DL. Algo-G-Score computes losses by applying the formula of Definition 4. Further, it adds 1 if there is a duplication.

Case 3: Γ is DC. Algo-G-Score, returns the number of lineages from g to each of its children h Ch(g) in S. For each h Ch(g), depth of ℳ(g) is subtracted from depth of ℳ(h) to count number of edges between ℳ(g) and ℳ(h).

In Algo-R-SEC-Γ, step 13 computes <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M56','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M56">View MathML</a> by adding <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M57','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M57">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M58','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M58">View MathML</a>. When backtracking, steps 17-22 are executed to restore the right G' to compute the next unique G'' ∈ Ch<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M74','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M74">View MathML</a>(G). This ensures that the correct <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M59','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M59">View MathML</a> is computed for each G' ∈ V(<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M74','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M74">View MathML</a>).

In Algo-R-SEC-Γ, step 4 sets <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M60','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M60">View MathML</a> as the BestTree and <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M61','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M61">View MathML</a> as BestScore. Every time a new G'' ∈ Ch<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M74','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M74">View MathML</a>(G) is encountered, step 14 compares <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M62','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M62">View MathML</a> with BestScore, and updates BestTree with G'' of the minimum <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M63','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M63">View MathML</a>. After the for-loop, step 24 returns the BestTree. □

Lemma 7. The R-SECand SEC-Γ problems can be solved in Θ(n) and Θ(n2) time, respectively.

Proof. We will prove that the algorithm Algo-R-SEC-Γsolves the restricted SPR based error-correction problems for each Γ ∈ {D, DL, DC} in Θ(n) time. In Algo-R-SEC-Γ, step 1 takes constant time. Step 2 precomputes LCA values for species tree in O(n) time [24], and so, finds LCA mapping from <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M64','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M64">View MathML</a> to S in O(n) time in bottom-up manner. Step 3 computes the duplication, duplication-loss or deep coalescence score of <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M65','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M65">View MathML</a> and S by calling Algo-Comp-Score. In Algo-Comp-Score, step 1 and step 2 runs for O(1) and O(n) time, respectively. Step 3 calls Algo-G-Score in each iteration of for-loop. Algo-G-Score runs for O(1) time for Γ ∈ {D, DL, DC}.

When Γ is DC, steps 4 and 5 are further executed in Algo-Comp-Score for constant time. Thus in Algo-R-SEC-Γ, step 3 runs for O(n) time. Further, steps 4 and 5 take constant time. The loop of step 6 runs for Θ(n) time. If condition of step 7 is true, steps 8-10 executes in constant time. With precomputed LCA values from step 2, step 11 executes in constant time. Algo-G-Score runs for constant time for Γ ∈ {D, DL, DC}, and lets step 12 to execute in constant time. Further, steps 13-15 execute for constant time too. If the condition in step 7 is false, then steps 17-22 execute in constant time, similarly. Finally, step 23 runs for constant time, and hence, the R-SEC-Γ problem can be solved in Θ(n) time. From Observation 1, Algo-R-SEC-Γ is called Θ(n) time to solve SEC-Γ problem. Thus, the SEC-Γ problem can be solved in Θ(n2) time. □

Solving the TEC-Γ problems

In this section we study the TBR based error-correction problems, for duplication (D), duplication-loss (DL), and deep coalescence (DC). More precisely, we extend our solution for the SEC-Γ problems to solve the TEC-Γ problems. A TBR operation can be viewed as an SPR operation, except that the pruned subtree can be rerooted before it is regrafted. Our speed-up for the SEC-Γ problems is achieved by observing that the scores Γof any re-rooted pruned subtree and its remaining pruned tree are independent of each other. We define the R-TEC-Γ problems for the TEC-Γ problems, as we defined the R-SEC-Γ problems for the SEC-Γ problems. We will show that the R-TEC-Γ problems can be solved by solving two smaller problems separately and combining their solutions.

Definition 12. Let T be a tree and x V(T). RR(T, x) is defined to be the tree T , if x = Ro(T) or x Ch(Ro(T)). Otherwise, RR(T, x) is the tree obtained by suppressing Ro(T), and subdividing the edge (Pa(x), x) by the new root node.

Lemma 8. Given a tuple G, S, v〉, and G'':= TBRG(v, x, y), for x V(Gv), y V(G)\V(Gv). Then, <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M66','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M66">View MathML</a>and <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M67','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M67">View MathML</a>.

Proof. (⇒) Let G1 := TBRG(v, x1, y), for x1 V(Gv), and x1 x. Now observe that, <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M68','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M68">View MathML</a> Also, let G2 := TBRG(v, x, y1), for y1 V(G)\V(Gv), and y1 y. Observe that, <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M69','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M69">View MathML</a>. Thus, if G'' gives the minimum duplication, duplication-loss, or deep coalescence score among all trees in TBRG(v), then the score contribution of vertices in V(Gv) and V(G)\V(Gv) is independent. Now looking at vertices of G, the best score is achieved when Gv is rooted at x, i.e. <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M70','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M70">View MathML</a>; also the best score is achieved when RR(Gv, x) is regrafted at y, i.e., <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M71','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S10/S11/mathml/M71">View MathML</a>. This follows similarly. □

Lemma 8 implies that a tree in TBRG(v) with the minimum duplication, duplication-loss, or deep coalescence cost can be obtained by optimizing the rooting for the pruned subtree, and the regraft location, separately. A best rooting for the pruned subtree is linear time computable [17,25], and the solution to the R-SEC problem identifies a best regraft location in Θ(n) time. This allows to obtain a tree in TBRG(v) with the minimum duplication, duplication-loss, or deep coalescence cost by evaluating only Θ(n) trees. Thus the R-TEC-Γ problem can be solved in Θ(n) time. The TEC-Γ problem can be solved by calling the solution of R-TEC-Γ problem Θ(n) times, and Theorem 4 follows.

Theorem 4. The TEC-Γ problem can be solved in Θ(n2) time.

Experimental results

We tested the performance of the gene tree rearrangement algorithms on a set of 106 gene alignments containing sequences from 8 yeast taxa from Rokas et al. [26]. There is a well accepted phylogeny for the yeast species, and the data set has been used to test algorithms for gene tree parsimony based on the deep coalescence problem [27,28]. We constructed maximum likelihood gene trees for each gene using RAxML-VI-HPC version 7.0.4 [29], the gene trees were rooted with the outgroup Candida albicans. We used the new error correction algorithms to examine how much a single SPR rearrangement in the gene tree reduces the reconciliation cost based on deep coalescence and also gene duplications and losses. Over all genes the SPR error correction reduced the total deep coalescence cost from 151 to 53 (Table 1) and the duplication and loss cost from 481 to 175 (Table 2). Both the algorithms took only seconds to run for all 106 genes on a standard laptop.

Table 1. Error correction based on deep coalescence model

Table 2. Error correction based on duplication and loss model

We also implemented a protocol to use the gene rearrangement algorithm to correct for gene tree error in gene tree parsimony phylogenetic analyses. We first took a collection of input gene trees and performed a SPR species tree search using Duptree [30], which seeks the species tree with the minimum gene duplication cost. We used the duplication only cost (instead of duplications and losses) because when there is no complete sampling of all existing genes, the loss estimates may be inflated by missing sequences. After finding the locally optimal species tree, we used our SPR gene tree rearrangement algorithm to find gene tree topologies with a lower duplication cost. We then performed another SPR species tree search using Duptree, starting from the locally optimal species tree and using the new gene tree topologies. This search strategy is similar to re-rooting protocol in Duptree, which checks for better gene tree roots after a SPR species tree search [30,31]. We tested this protocol on data set of 6,084 genes (with a combined 81,525 leaves) from 14 seed plant taxa. This is the same data set used by [31], except that all gene tree clades containing sequences from a single species were collapsed to a single leaf. Our original SPR tree search found a species tree with 23,500 duplications. The SPR tree search after the gene tree rearrangements identified the same species tree, but the new gene trees had a reconciliation cost of only 18,213. This tree search protocol took just under 4 hours on a Mac Powerbook with a 2 GHz Intel Core 2 Duo processor and 2 GB memory.

Conclusion

GT-ST reconciliation provides a powerful approach to study the patterns and processes of gene and genome evolution. Yet it can be thwarted by the error that is an inherent part of gene tree inference. Any reliable method for GT-ST reconciliation must account for gene tree error; however, any useful method also must be computationally tractable for large-scale genomic data. We introduce fast and effective algorithms to correct error in the gene trees. These algorithms, based on SPR and TBR rearrangements, greatly extend upon the range of possible errors in the gene tree from existing algorithms [17,18], while remaining fast enough to use on data sets with thousands of genes. These algorithms can correct trees based on a broad variety of evolutionary factors that can cause conflict between gene trees and species trees, including gene duplication, duplications and losses, and deep coalescence.

Our analysis on 106 yeast gene trees demonstrates that even a single SPR correction on the gene trees can radically improve upon the reconciliation cost. Our results in the yeast analysis are very similar to the 2-3 fold improvement in implied duplications and losses reported from the parametric gene tree estimation and reconciliation method of Rasmussen and Kellis [5]. However, in contrast, to this computationally complex method, the gene tree rearrangement algorithm is extremely fast, does not require assumption about the rates of duplication and loss, and is also amenable to corrections based on deep coalescence and duplications as well as duplications and losses. We do not claim that the gene correction algorithms produce a more accurate reconciliation than these parametric methods. However, they do present an extremely fast and flexible alternative.

We also demonstrated that this error correction protocol could easily be incorporated into a gene tree parsimony phylogenetic analysis. Previous studies have emphasized that gene tree parsimony is sensitive to the topology of the input trees. For example, the species tree may differ whether the gene trees are made using parsimony or maximum likelihood [8,10]. In our study, although the gene tree rearrangement did not affect the species tree inference, it did greatly reduce the gene duplication reconciliation cost.

While the results of the experiments are promising, they also suggest several directions for future research. First, further investigation is needed to characterize the effects of error on gene tree topologies. For example, it seems likely that gene tree errors may extend beyond a single SPR or TBR neighborhood. Yet, if we allow unlimited rearrangements, the gene trees will simply converge on the species tree topology. One simple improvement may be to weight the possible gene tree rearrangements based on support for different clades in the gene tree. Thus, well-supported clades may be rarely or never be subject to rearrangement, while poorly supported clades may be subject to extensive rearrangements. Finally, these approaches implicitly assume that all differences between gene trees and species trees are due to either coalescence, duplications, or duplications and losses. Future work will seek to combine these objectives and also address lateral transfer.

Competing interests

The authors declare that they have no competing interests.

Author's contributions

RC was responsible for algorithm design and program implementation, and wrote major parts of the paper. JGB performed the experimental evaluation and the analysis of the results, and contributed to the writing of the paper. OE supervised the project, contributed to the algorithmic design and writing of the paper.

All authors read and approved the final manuscript.

Acknowledgements

The authors thank André Wehe for his generous support with the implementation. This work was conducted in parts with support from the Gene Tree Reconciliation Working Group at NIMBioS through NSF award EF-0832858, with additional support from the University of Tennessee. R.C. and O.E. were supported in parts by NSF awards #0830012 and #10117189.

This article has been published as part of BMC Bioinformatics Volume 13 Supplement 10, 2012: "Selected articles from the 7th International Symposium on Bioinformatics Research and Applications (ISBRA'11)". The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S10.

References

  1. Maddison WP: Gene Trees in Species Trees.

    Systematic Biology 1997, 46:523-536. Publisher Full Text OpenURL

  2. Goodman M, Czelusniak J, Moore GW, Romero-Herrera AE, Matsuda G: Fitting the gene lineage into its species lineage. A parsimony strategy illustrated by cladograms constructed from globin sequences.

    Systematic Zoology 1979, 28:132-163. Publisher Full Text OpenURL

  3. Guigó R, Muchnik I, Smith TF: Reconstruction of Ancient Molecular Phylogeny.

    Molecular Phylogenetics and Evolution 1996, 6(2):189-213. PubMed Abstract | Publisher Full Text OpenURL

  4. Slowinski JB, Knight A, Rooney AP: Inferring Species Trees from Gene Trees: A Phylogenetic Analysis of the Elapidae (Serpentes) Based on the Amino Acid Sequences of Venom Proteins.

    Molecular Phylogenetics and Evolution 1997, 8:349-362. PubMed Abstract | Publisher Full Text OpenURL

  5. Rasmussen MD, Kellis M: A Bayesian approach for fast and accurate gene tree reconstruction.

    Molecular Biology and Evolution 2011, 28:273-290. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  6. Burleigh JG, Bansal MS, Wehe A, Eulenstein O: Locating Large-Scale Gene Duplication Events through Reconciled Trees: Implications for Identifying Ancient Polyploidy Events in Plants.

    Journal of Computational Biology 2009, 16:1071-1083. PubMed Abstract | Publisher Full Text OpenURL

  7. Hahn MW: Bias in phylogenetic tree reconciliation methods: implications for vertebrate genome evolution.

    Genome Biology 2007, 8:R141. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  8. Burleigh JG, Bansal MS, Eulenstein O, Hartmann S, Wehe A, Vision TJ: Genome-scale phylogenetics: inferring the plant tree of life from 18,896 discordant gene trees.

    Systematic Biology 2011, 60(2):117-125. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  9. Huang H, Knowles LL: What Is the Danger of the Anomaly Zone for Empirical Phylogenetics?

    Systematic Biology 2009, 58:527-536. PubMed Abstract | Publisher Full Text OpenURL

  10. Sanderson MJ, McMahon MM: Inferring angiosperm phylogeny from EST data with widespread gene duplication.

    BMC Evolutionary Biology 2007., 7(suppl 1:S3) OpenURL

  11. Berglund-Sonnhammer A, Steffansson P, Betts MJ, Liberles DA: Optimal Gene Trees from Sequences and Species Trees Using a Soft Interpretation of Parsimony.

    Journal of Molecular Evolution 2006, 63:240-250. PubMed Abstract | Publisher Full Text OpenURL

  12. Vernot B, Stolzer M, Goldman A, Durand D: Reconciliation with non-binary species trees.

    Computational Systems Bioinformatics 2007, 53:441-452. OpenURL

  13. Yu Y, Warnow T, Nakhleh L: Algorithms for MDC-Based Multi-locus Phylogeny Inference. In RECOMB, Volume 6577 of Lecture Notes in Computer Science. Edited by Bafna V, Sahinalp SC. Springer; 2011:531-545. OpenURL

  14. Cotton JA, Page RDM: Going nuclear: gene family evolution and vertebrate phylogeny reconciled.

    P Roy Soc Lond B Biol 2002, 269:1555-1561. Publisher Full Text OpenURL

  15. Joly S, Bruneau A: Measuring Branch Support in Species Trees Obtained by Gene Tree Parsimony.

    Systematic Biology 2009, 58:100-113. PubMed Abstract | Publisher Full Text OpenURL

  16. Arvestad L, Berglund A, Lagergren J, Sennblad B: Gene tree reconstruction and orthology analysis based on an integrated model for duplications and sequence evolution.

    RECOMB 2004, 326-335. OpenURL

  17. Chen K, Durand D, Farach-Colton M: Notung: a program for dating gene duplications and optimizing gene family trees.

    Journal of Computational Biology 2000, 7:429-447. PubMed Abstract | Publisher Full Text OpenURL

  18. Durand D, Halldórsson BV, Vernot B: A Hybrid Micro-Macroevolutionary Approach to Gene Tree Reconstruction.

    Journal of Computational Biology 2006, 13(2):320-335. PubMed Abstract | Publisher Full Text OpenURL

  19. Allen BL, Steel M: Subtree transfer operations and their induced metrics on evolutionary trees.

    Annals of Combinatorics 2001, 5:1-13. Publisher Full Text OpenURL

  20. Bordewich M, Semple C: On the computational complexity of the rooted subtree prune and regraft distance.

    Annals of Combinatorics 2004, 8:409-423. OpenURL

  21. Zhang L: On a Mirkin-Muchnik-Smith Conjecture for Comparing Molecular Phylogenies.

    Journal of Computational Biology 1997, 4(2):177-187. PubMed Abstract | Publisher Full Text OpenURL

  22. Page RDM: Maps between trees and cladistic analysis of historical associations among genes, organisms, and areas.

    Systematic Biology 1994, 43:58-77. OpenURL

  23. Eulenstein O: Predictions of gene-duplications and their phylogenetic development.

    PhD thesis.

    University of Bonn, Germany 1998. [GMD Research Series No. 20/1998, ISSN: 1435-2699]

    OpenURL

  24. Bender MA, Farach-Colton M: The LCA Problem Revisited.

    LATIN 2000, 88-94. OpenURL

  25. Górecki P, Tiuryn J: Inferring phylogeny from whole genomes.

    ECCB (Supplement of Bioinformatics) 2006, 116-122. OpenURL

  26. Rokas A, Williams BL, King N, Carroll SB: Genome-scale approaches to resolving incongruence in molecular phylogenies.

    Nature 2003, 425(6960):798-804. PubMed Abstract | Publisher Full Text OpenURL

  27. Than C, Nakhleh L: Species tree inference by minimizing deep coalescences.

    PLoS Comput Biol 2009, 5(9):e1000501. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  28. Bansal MS, Burleigh JG, Eulenstein O: Efficient genome-scale phylogenetic analysis under the duplication-loss and deep coalescence cost models.

    BMC Bioinformatics 2010, 11(Suppl 1):S42. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  29. Stamatakis A: RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models.

    Bioinformatics 2006, 22(21):2688-2690. PubMed Abstract | Publisher Full Text OpenURL

  30. Wehe A, Bansal MS, Burleigh JG, Eulenstein O: DupTree: a program for large-scale phylogenetic analyses using gene tree parsimony.

    Bioinformatics 2008., 24(13) OpenURL

  31. Chang W, Burleigh JG, Fernández-Baca D, Eulenstein O: An ILP solution for the gene duplication problem.

    BMC Bioinformatics 2011, 12(Suppl 1):S14. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL