Département d'Informatique (DIRO), Université de Montréal, H3C 3J7, Canada

McGill Centre for Bioinformatics, McGill University, H3C 2B4, Canada

Abstract

Background

Reconciliation is the classical method for inferring a duplication and loss history from a set of extant genes. It is based upon the notion of embedding the gene tree into the species tree, the incongruence between the two indicating evidence for duplication and loss. However, results obtained by this method are highly dependent upon the considered species and gene trees. Thus, painstaking attention has been given to the development of methods for reconstructing accurate gene trees.

Results

This paper highlights the fact that errors in gene trees are not the only reason for inferring an erroneous duplication-loss history. More precisely, we prove that, under certain reasonable hypotheses based on the widely accepted link between function and sequence constraints, even a well-supported gene tree can yield a reconciliation that does not correspond to the true history. We then provide the theoretical underpinnings for a conservative approach to inferring histories given such gene trees. We apply our method to the mammalian interleukin-1 (IL) gene tree, which has been used as a model example to illustrate the role of reconciliation.

Introduction

Background

Duplication followed by modification is a major mechanism driving evolution. A significant obstacle obscuring our understanding of this mechanism is the inference of duplication and loss histories for a gene family. In 1979 Goodman et al.

If the gene tree represents the true evolutionary relationship on the gene set, then a reconciliation-based method is likely to give an accurate duplication and loss history. However, uncertainty in gene trees is a serious limitation to reconciliation. In particular, it has been reported that a few misplaced leaves in the gene tree can lead to a completely different history, possibly with significantly more duplications and losses.

However, even when the gene tree is statistically well-supported by the data, the reconciliation approach can lead to an erroneous duplication and loss history for the gene family. The reason is that the gene tree that best reflects the sequence similarity of the gene copies is not necessarily the true phylogeny for the gene family. In particular, homologous gene copies that are responsible for preserving an original ancestral function are likely to diverge at a lower rate than the copies that are not constrained by function. Such copies are therefore likely to appear as a subtree of the gene tree, even though they are not the evolutionarily most closely related copies. The link between function and sequence constraint has been largely accepted and reported in the literature.

Results

In this paper, we formally study the consequence of functional constraints on reconciliation. Our main result (Corollary 1) is a proof that there are certain simple conditions under which reconciliation fails, even when the gene tree is perfectly well supported (

This not only supports the need for efforts to efficiently find accurate gene trees and compute histories through probabilistic methods, but also raises the question, "what does the relationship between the gene and species tree tell us when the gene tree respects the isolocalization property?" We provide some foundational theory on the subject; we pose two fundamental problems associated with this question and provide an algorithm to solve the simpler of the two, under the duplication cost. Finally, we apply our methods to the mammalian interleukin-1 (IL) gene tree.

In the next two sections we introduce concepts related to reconciliation and distinguish between the various types of gene homology using the terminology recommended by Fitch

Preliminary concepts and notation

Trees

For our purposes, a genome is just a collection of genes. A _{1}, _{2}, _{3}, _{2}, _{3}}, where _{i}

**S is a species tree for Σ = {1, 2, 3}; H represents a history, consistent with (i.e., embedded in) the species tree S, with one duplication event preceding the speciation event leading to genomes 2 and 3**. Speciation events appear as bifurcations at obtuse angles, while duplication events appear at right angles. We represent the information on isorthology by positioning the retainer of parental function directly under the parental gene. Moreover, we label isorthologs with the same letter (all

Given a tree _{x }_{x }_{T }_{T }

Histories

We study the evolution of a family of genes taken from genomes

Given a set of genes Γ and a function _{1}, _{2}, _{3}, _{2}, _{3}}. Speciation events appear as bifurcations at obtuse angles, while duplication events appear at right angles. Losses are represented by dotted edges; the leaf labeled _{1 }is a loss. Both histories

As the true gene tree is unknown, phylogenetic information is usually inferred from molecular data. In this paper, we will distinguish between a _{1}, _{2}, _{3}, _{2}, _{3}}, whereas

Reconciliation

A _{1 }is implied by the duplication at the root, and the fact that there is only a single gene mapped to genome 1. Refer to

The parsimony criteria used to choose among the large set of possible reconciliations are the number of duplications (_{ℓ}_{r}_{ℓ }_{r }

Perspectives on homology

In the previous sections we vaguely referred to groups of genes from the same gene families as "homologous". In this section we solidify the notion of gene families by discussing the terminology related to homologous genes. There are many alternative definitions for homology and related concepts, the ambiguity being due to the many possible definitions of similarity between genes. Indeed, evolutionary, sequence, functional, or positional constraints give rise to definitions that are unfortunately not equivalent

**Definition 1 (homology) **

As the true history of genes is unknown, homology is usually inferred with some uncertainty, typically from amino acid or nucleotide similarity. Some confusion could remain about the definition of homology, since it is commonly believed that all modern genes originated from a single gene or some small number of genes. In this context, all or most genes are homologous to all or most others. For this reason we posit that the evolutionary definition of homology might also include a notion of time. Fortunately, this issue has no bearing on the results presented here.

The remainder of the definitions describe a hierarchy of homologous genes, implied by the true history of the genes.

**Definition 2 (orthology) **_{H}

As duplications may arise following a speciation event, the orthology relationship is not transitive. Thus it makes no sense to speak of sets/groups of orthologs. For example, in history _{1 }is orthologous to the other four genes but _{2 }and _{3 }are not orthologous to _{2 }or _{3}. This property is inherent to the evolutionary definition of orthology, which is not a definition about the functional relationship between genes (Definition 3), nor the positional or direct descendant relationship that we introduce in more detail below (following Definition 3). Thus, the term

**Definition 3 (isorthology) **

Isorthology has also been called _{i}_{i }_{j}

Notice that the notion of isorthology is different from that of "true exemplars"

**Definition 4 (paralogy) **_{H}

Duplications occur in an individual, and copies can be fixed or lost in the population. From a functional point of view, the two copies (source and sink) of a duplication may evolve in different ways

When reconciliation is not the right tool

The fundamental hypothesis behind reconciliation is that the gene tree reflects the true phylogeny of the gene family. Therefore, a strict prerequisite is to have both gene tree and species tree free from error. As demonstrated by many authors

In

Hypothesis 1

For convenience, we coin the term

As pseudofunctionalization and neofunctionalization do not occur simultaneously with duplication, the underlying assumption in Hypothesis 1 is that enough time has passed after the duplication event to differentiate the two gene copies. Also notice that this hypothesis does not prevent subsequent functional loss. For example in the history

Isolocalization

Errors in the gene tree are not the only reason for doubting reconciliation. A well-supported gene tree does not necessarily represent the true phylogeny for the gene family, as it is not necessarily the case that all genes have evolved at the same rate (

**Definition 5 (isolocalization property) **_{1 }_{2}_{G}_{1}_{2})_{1 }_{2}.

For example in Figure

The following theorem shows that reconciliation is not the correct tool for finding the true history when the isolocalization property holds. Say that a duplication node _{H}_{1}, _{2 }such that _{H}_{1}, _{2}) =

**Theorem 1 **

**Proof: **Let _{1}, _{2 }such that _{H }_{1}, _{2}) = _{H }_{1}, _{2}. Then _{1 }and _{2}, gene _{2 } □

The following is the contrapositive of Theorem 1.

**Corollary 1 (isolocalization confounds reconciliation) **

A more powerful but slightly more technical version of the corollary is the following. It highlights the fact that reconciliation could falter even when only part of the gene tree respects the isolocalization property.

**Corollary 2 **_{1}_{2}) _{2}, _{1}, _{2 }

Figure

Isorthology respecting histories

Corollary 1 shows that reconciliation is not the right tool when a subtree of the gene tree adheres to the isolocalization property, yet there must be some information in the gene tree and species tree relationship. For instance, we expect subtrees corresponding to isorthologs in a well-supported gene tree to agree with the species tree. The following hypothesis formalizes this concept, where a tree

Hypothesis 2

_{i}, a_{j}, a_{k}_{i}, a_{j}, and a_{k}, but relabeled by s_{i}_{j}_{k}

We now elaborate the connection between isorthologous genes and the LCA mapping. In what follows, nodes of

**Lemma 1 **_{i}, a_{j}_{i }is a gene in genome i, and a_{j }is a gene in genome j. Then the node lca_{G}_{i}_{j}

Define a

**Corollary 3 **

An

**Definition 6 (Isorthology Respecting History (IRH)) **

In Figure _{1 }and _{2}, but the pair does not.

Notice that Corollary 3 does not _{1}}, {_{2}}, {_{3}}},{{_{1}}, {_{2}, _{3}}}, or {{_{1}, _{2}, _{3}}}. We will call an

Optimization problems

Following Corollary 3, an isorthology respecting history appears as the most natural alternative to reconciliation. As many IRHs are possible for a given pair (

Minimum Isorthology Respecting History (MIRH)

**Input: **A gene tree

**Output: **A

A restricted version of the MIRH problem would consider the maximal speciation subtrees of

The MIRH problem, as stated, ignores all the information on duplication and speciation nodes of

**Definition 7 (Triplet Respecting History (TRH)) **

We can now formulate our second optimization problem.

Minimum Triplet Respecting History (MTRH)

**Input: **A gene tree

**Output: **A

Taking our model example in Figure

On reconstructing isorthology respecting histories

In this section we justify the following algorithm to solve MIRH under the duplication cost. Start with a forest that corresponds to some isorthology respecting partition for (

The groundwork for this approach has been established, where the implications of Hypothesis 2 linked isorthogroups with subtrees of speciation nodes in Corollary 3. The main result of this section is Theorem 2, and the supporting lemmas follow.

**Theorem 2 **

**Proof: **Lemma 2 tells us that we need exactly

A linear-time algorithm to compute the duplication cost of the MIRH will be given later.

Inferring duplications

**Lemma 2 **

The rest of the subsection is a proof of Lemma 2.

**Proof: **We begin by proving some useful facts about the relationship between duplications and isorthologs. Say that

**Lemma 3 (isorthologs by retainers) **

**Proof: **Assume that

**Lemma 4 (non-isorthologs by mutants) **

**Proof: **Assume that there does not exist a duplication

The following property follows directly from Lemmas 3 and 4.

**Property 1 **_{H}

Since

**Remark 1 **

An implication of Property 1 is that the root of any isorthologous subtree must be joined to

In the other direction, if there are

The algorithm

We wrap up the section with a description of our algorithm that computes the duplication cost of the MIRH. Construction of the tree is straightforward, and is not presented here.

1. Label the nodes of

2. Compute

(a) Label all ancestors of a duplication node in a post-order traversal of

(b) Label each speciation node as a root of a tree in

3. Return

The LCA mapping can be computed in linear time

Applications

A model example used by Page and Holmes

**Trees for mammals (S) and for mammalian interleukin-1 genes (solid lines in G) taken from **

The reconciliation and MTRH lead to four isorthogroups ({

In the MTRH of Figure

Discussion and future directions

Our theory is based on the assumption that enough time has passed to differentiate the products of each duplication (Hypothesis 2). In other words, even the most recent speciation events must be old enough to allow for clear formation of isorthogroups. Therefore, duplications that are inferred at the root by reconciliation will be lower in an MIRH. In terms of molecular clocks, this corresponds to an assumption of an infinite ratio between the molecular clock outside versus inside an isorthogroup. MTRH is more constrained in that it incorporates internal node information into the history. An alternative would be to allow for some constant ratio between rates. For example, an assumption that the substitution rate outside an isorthogroup is twice that inside an isorthogroup would yield a different history. In the example of Figure

Competing interests

The authors declare that they have no competing interests.

Acknowledgements

The first author would like to thank Rand Swenson for his help with the title.

This article has been published as part of