Département d'Informatique (DIRO), Université de Montréal, H3C 3J7, Canada

McGill Centre for Bioinformatics, McGill University, H3C 2B4, Canada

Abstract

Understanding the history of a gene family that evolves through duplication, speciation, and loss is a fundamental problem in comparative genomics. Features such as function, position, and structural similarity between genes are intimately connected to this history; relationships between genes such as orthology (genes related through a speciation event) or paralogy (genes related through a duplication event) are usually correlated with these features. For example, recent work has shown that in human and mouse there is a strong connection between function and inparalogs, the paralogs that were created since the speciation event separating the human and mouse lineages. Existing methods for detecting inparalogs either use information from only two species, or consider a set of species but rely on clustering methods. In this paper we present a graph-theoretic approach for finding lower bounds on the number of inparalogs for a given set of species; we pose an edge covering problem on the similarity graph and give an efficient 2/3-approximation as well as a faster heuristic. Because the physical position of inparalogs corresponding to recent speciations is unlikely to have changed since the duplication, we also use our predictions to estimate the types of duplications that have occurred in some vertebrates and Drosophila.

Introduction

Gene duplication and subsequent modification or loss is a fundamental biological process that is well known to create novel gene function

If the speciation in question is a relatively recent speciation then inparalogs represent recent duplications. Thus, they have been used to study properties of duplications under the assumption that the inparalogs have not had time to significantly diverge from the state directly following the duplication

This motivates the study of large-scale detection of inparalogs. Tree-based inference such as reconciliation is considered to be the most accurate and comprehensive way to infer gene relationships

Thus, many studies rely on tools based on pairwise similarity measures between genes. Although a daunting number of tools have been developed for orthology detection (due to its relationship to function)

In this paper we simultaneously consider the global information given by multiple genes in multiple genomes; this extra information affords us the power to detect less similar pairs of inparalogs, and provides robustness against gene loss. In particular, our approach gives a lower bound on the number of inparalog pairs, based on finding an "orthogonal edge cover" of the colored similarity graph proposed in Zheng et al.

The covering step of the method corresponds to finding a so-called maximum orthogonal edge cover of the graph, a problem first posed for finding functional ortholog sets

We apply our method to the genomes of human, chimpanzee, mouse, rat, zebrafish, pufferfish,

Inparalogs and multiple species

Given species

We generalize the definition of inparalogy to consider multiple species with a known phylogeny. For a set of species

**Definition 1**.

In the genealogy of Figure

**Inparalogs and multiple species**. (a) is a connected component of the similarity graph (see Section

Inparalogs and edge covers

InParanoid builds sets of inparalog pairs, which it must then merge according to an extensive set of rules. We forgo this complicated merging process by considering the pairwise similarities in a global fashion. Further, our method is robust to gene loss because we consider the genes from multiple genomes at once. Consider the graph

**Property 1** (orthogonality

An

**A connected component of the similarity graph**. See Section

Our method is based on the observation that inparalogs belong to orthology sets of size one, whereas in the absence of losses all other paralogs will be orthologous to at least one other gene. Figure

Thus, for a given subgraph of

1. find an orthogonal subgraph of

2. mark as inparalogs all uncovered vertices with high similarity to some other gene in the same genome.
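The two steps above can be sketched in Python. This is a minimal illustration, not the authors' implementation. It assumes that orthogonality means no connected component of the chosen edges contains two genes from the same genome, and it takes the edge set from step 1 as given. The names `components`, `is_orthogonal`, `mark_inparalogs`, `color`, and `intra_sim` are hypothetical.

```python
from collections import defaultdict


def components(vertices, edges):
    """Group vertices into connected components with a small union-find."""
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for u, v in edges:
        parent[find(u)] = find(v)
    groups = defaultdict(set)
    for v in vertices:
        groups[find(v)].add(v)
    return list(groups.values())


def is_orthogonal(color, cover_edges):
    """Assumed reading of orthogonality: every connected component of the
    chosen edges holds at most one gene per genome (color)."""
    covered = {v for edge in cover_edges for v in edge}
    for comp in components(covered, cover_edges):
        genomes = [color[v] for v in comp]
        if len(genomes) != len(set(genomes)):
            return False
    return True


def mark_inparalogs(color, cover_edges, intra_sim, threshold):
    """Step 2: return the uncovered genes that have a sufficiently similar
    gene in the same genome. `intra_sim` maps gene pairs to scores."""
    covered = {v for edge in cover_edges for v in edge}
    marked = set()
    for (u, v), score in intra_sim.items():
        if color[u] != color[v] or score < threshold:
            continue
        marked.update(x for x in (u, v) if x not in covered)
    return marked
```

For instance, with two human genes and one mouse gene, covering one human gene with the mouse gene leaves the other human gene uncovered, and it is marked if its similarity to the first clears the threshold.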

Step 1 corresponds to solving the maximum orthogonal edge cover problem. In Section

Chains of duplications

A chain of multiple duplications, each originating from the previous duplicate copy, will result in multiple uncovered vertices of a single color, as depicted for zebrafish in Figure

Further motivation

The simplest notion of inparalogy requires only a single genome and a measure of similarity between genes: the most closely related genes would then just be the proposed inparalogs. For example, Ezawa et al.

Figure

A slightly more general, but simpler, approach than that of InParanoid would consider the similarity graph for two genomes; in this case the graph is bipartite. Thus, a maximum matching on the weighted graph covers the maximum number of genes with the maximum amount of global similarity. The uncovered vertices are then candidates to be inparalogs; those that are similar enough to other genes are considered to be inparalogous to those genes. While this method may not suffer from the problem of lower similarity between inparalogs (illustrated in Figure

**Losses in the context of multiple genomes**. (a) is a connected component of the similarity graph (see Section

Thresholds

Step 1 of our algorithm calls for a subset of the edges that leaves a minimum number of vertices uncovered. This measure depends on neither the number nor the weight of the chosen edges; the maximum orthogonal partition problem is inherently unweighted. Our method therefore requires a threshold for interspecies similarity scores: every edge with a score above the threshold is considered significant. Similarly, to reduce false positives, the intraspecies similarity scores may use a different threshold.

An interspecies threshold that is too high yields an unweighted graph with very small components, and we lose the power of multiple-genome inference. An interspecies threshold that is too low may yield large components that have too many optimal solutions. We have not yet studied in detail which threshold is best. Instead, we chose conservative thresholds for both measures; all results reported in this paper use an interspecies threshold of 80 and an intraspecies threshold of 70.
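The thresholding step can be illustrated with a short Python sketch. It is hedged: the function name `threshold_graph`, the input layout (a dictionary from gene pairs to similarity scores), and the strict comparison are assumptions made for illustration; only the default values 80 and 70 come from the text.

```python
def threshold_graph(scores, color, inter_threshold=80, intra_threshold=70):
    """Split scored gene pairs into unweighted interspecies and
    intraspecies edge sets, keeping only scores above each threshold.

    scores: dict mapping (gene_a, gene_b) -> similarity score
    color:  dict mapping gene -> genome name
    """
    inter_edges, intra_edges = set(), set()
    for (u, v), score in scores.items():
        if color[u] == color[v]:
            # Same genome: candidate paralog edge (strict ">" is an assumption).
            if score > intra_threshold:
                intra_edges.add((u, v))
        elif score > inter_threshold:
            # Different genomes: edge of the colored similarity graph.
            inter_edges.add((u, v))
    return inter_edges, intra_edges
```

Raising `inter_threshold` removes interspecies edges and so shrinks the components, which is the trade-off described above.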

Maximum orthogonal edge cover

In this section we describe the algorithms for maximum orthogonal edge cover. Our 2/3-approximation algorithm runs

Take a set

**Definition 2**. _{i}S_{i }_{i}

Take a graph

**Definition 3**.

Consider the partition of

**Definition 4**.

A maximum orthogonal edge cover of

Let

**Input**: Undirected graph

**Solution**: An orthogonal edge cover

**Measure**: The number of vertices covered (

We present a 2/3-approximation algorithm for MAX-OREC. Our approach is to first compute edges that cover the maximum number of vertices for each color, while ignoring the orthogonality constraint. We then show that the connected components of this edge cover have a particular structure, allowing us to ensure orthogonality without removing too many edges.
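The first phase, covering as many vertices of one color as possible while ignoring orthogonality, can be sketched as a maximum bipartite matching. This sketch rests on assumptions, since the exact construction is not reproduced here: it matches the vertices of a chosen color against all vertices of other colors using the standard augmenting-path method (Kuhn's algorithm), and the names `max_matching_for_color`, `adj`, and `color` are hypothetical.

```python
def max_matching_for_color(adj, color, c):
    """Maximum matching between vertices of color `c` and all other vertices.

    adj:   dict mapping vertex -> iterable of neighbors (undirected graph)
    color: dict mapping vertex -> color (genome)
    Returns a set of (vertex_of_color_c, other_vertex) pairs.
    """
    match_of = {}  # other-colored vertex -> the color-c vertex matched to it

    def try_augment(u, seen):
        # Look for an augmenting path that starts at the color-c vertex u.
        for w in adj[u]:
            if color[w] == c or w in seen:
                continue
            seen.add(w)
            # w is free, or its current partner can be re-matched elsewhere.
            if w not in match_of or try_augment(match_of[w], seen):
                match_of[w] = u
                return True
        return False

    for u in adj:
        if color[u] == c:
            try_augment(u, set())
    return {(u, w) for w, u in match_of.items()}
```

The recursive search is adequate for small components; for large inputs an iterative or Hopcroft-Karp implementation would be a more robust choice.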

Bipartite matchings

Consider the bipartite graph

**Property 2**.

Now take a maximum orthogonal edge cover _{i }

**Lemma 1**. |*****)| ≤ |

If every connected component of

Covering bounded degree graphs

Consider the neighborhood of a particular vertex

Call a path in a component

**Lemma 2**.

Take a minimal edge cover

**Lemma 3**.

Lemmata 2 and 3 imply the following algorithm for finding an approximate orthogonal edge cover on a 2NL graph where

**Algorithm 1** getMAX-2NL-OREC(

**return**

Bringing things together

Say Algorithm 1 returns an edge cover

**Lemma 4**. |

So counting the number of odd-length paths gives us an idea of how far we could be from the optimal. Since the shortest possible odd-length path has three vertices, and two of them can be covered, we get the desired approximation guarantee.
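The arithmetic behind this guarantee can be made explicit. The following is a sketch under an assumption drawn from the counting argument above, namely that an odd-length path on $2k+1$ vertices ($k \ge 1$) leaves at most one vertex uncovered; the symbol $k$ is introduced here for illustration only.

```latex
\frac{\text{vertices covered}}{\text{vertices on the path}}
  \;\ge\; \frac{2k}{2k+1}
  \;\ge\; \frac{2}{3},
  \qquad k \ge 1 .
```

The last inequality is tight only for $k = 1$, the three-vertex path, because $2k/(2k+1)$ increases with $k$.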

**Lemma 5.**

Now, using Section

**Theorem 1**. |*****)| **- **

**Theorem 2.**

**Algorithm 2** getMAX-OREC(

**for** each color **do**

**end for**

**for** each component **do**

**end for**

**return**

Running time

The running time of Algorithm 1 is

A fast heuristic

We also developed a practical algorithm for MAX-OREC. It is simpler to implement, runs faster in practice, and performs better on dense graphs (see Section

1. compute

2. compute

The main difference from the approximation algorithm is that we do not compute the same maximum bipartite matchings.

Results and discussion

Experiments on simulated datasets

We implemented the 2/3-approximation algorithm and the heuristic in C++ and we applied them to simulated datasets in order to compare their performance. We generated random graphs using the

Figure

**Comparison of the performance of the 2/3-approximation algorithm and the heuristic**. The results are averaged over 100 random graphs of 2500 vertices (genes) and 5 colors (genomes). Left: Comparison of the number of uncovered vertices. The number of odd-length alternating paths is also shown. Right: Running time comparison.

Experiments on real datasets

In this section, we present an analysis of the inparalog pairs inferred by our approach on the genomes of human, chimpanzee, mouse, rat, zebrafish, pufferfish,

Creating the input graph

We used CoGe:SynMap

Modes of duplication and recent inparalogs

The most studied duplication mechanisms are whole genome duplication, tandem duplication and retrotransposition. Whole genome duplication has the effect of simultaneously doubling all the chromosomes of a genome. It has been shown that whole genome duplication has occurred at least once

Another mode of duplication that has received more attention in recent years is the one responsible for the creation of segmental duplications. It has been named duplicative transposition in

In order to better understand duplication mechanisms and study the relative rates of the different types of duplications, it is interesting to study recently created gene duplicates. For example, a study on recently emerged paralogs in human, mouse, zebrafish,

Analysis of the inparalog pairs

We identified inparalog pairs in the studied genomes and retrieved information on their physical distance and percent similarity. Figure

**Proportions of inparalog pairs inferred in the 8 species studied**.

Only mouse and

For all the species, a large fraction of the inparalog pairs are unlinked. This is especially true for zebrafish and pufferfish, where more than 80% of the inparalog pairs are located on different chromosomes. Interestingly, the majority of the unlinked pairs in the fish species have a low percent similarity. We hypothesize that this could be the result of ongoing fractionation after the fish-specific whole genome duplication. Human, chimpanzee and rat all have at least 10% of recent unlinked inparalog pairs (>95% similarity). This could be evidence of recent duplicative transpositions or retrotransposition. Older unlinked inparalog pairs (<95% similarity) do not necessarily correspond to older duplicative transposition events. For example, a scenario involving tandem duplication followed by genomic rearrangement events could have produced the same results.
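The categories used in this analysis can be tabulated with a few lines of Python. This is a hedged sketch rather than the analysis pipeline used for the reported results: it assumes that "linked" simply means both genes lie on the same chromosome, it treats similarity strictly above 95% as "recent", and the names `classify_pair` and `summarize` are hypothetical.

```python
from collections import Counter


def classify_pair(chrom_a, chrom_b, similarity, recent_cutoff=95.0):
    """Label one inparalog pair by linkage and by approximate age."""
    # Assumption: same chromosome means linked, otherwise unlinked.
    linkage = "linked" if chrom_a == chrom_b else "unlinked"
    # Pairs above the cutoff are treated as recent duplications.
    age = "recent" if similarity > recent_cutoff else "older"
    return linkage, age


def summarize(pairs):
    """Return the fraction of pairs in each (linkage, age) category.

    pairs: iterable of (chrom_a, chrom_b, percent_similarity) tuples
    """
    counts = Counter(classify_pair(a, b, s) for a, b, s in pairs)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}
```

Applied per species, `summarize` yields proportions of the kind discussed above, such as the share of unlinked pairs with low percent similarity.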

Conclusion

We presented a new graph-theoretic approach for the detection of inparalogs. Our method uses a maximum orthogonal edge cover on the similarity graph and then identifies inparalogs in the set of uncovered vertices. We developed a 2/3-approximation algorithm for this problem and a heuristic that was shown to run faster and to perform better on dense graphs. Note that our method is not suitable for finding orthologous gene relationships since our edge covers aggressively leave the minimum number of genes unmatched. Zheng et al.

We have shown compelling examples of why using the information for multiple species gives more accurate inparalog predictions and how our method allows us to infer inparalogs that would not have been found by other methods like InParanoid. We then presented an example of how we can use recent inparalogs to study modes of duplication. Our analysis of the genomes of human, chimpanzee, mouse, rat, zebrafish, pufferfish,

We did not show speed comparisons with other existing methods like InParanoid because our method was very fast on real data. The results on the real datasets were obtained in 10 seconds on a typical Linux workstation.

On the methodological side, possible improvements include algorithms that consider edge weights while finding an edge cover, as well as better preprocessing of the data. It remains an open question which other measures of similarity our method works best with.

On the evaluation side, we attempted large-scale comparisons against inparalogy given by reconciliation (Ensembl gene trees), but we could not automatically convert a statistically significant number of gene names from SynMap to Ensembl IDs. While computing statistics, such as the number of inparalog pairs shared with a method like InParanoid, is possible, directly comparing which method finds the correct inparalog relationships remains difficult, since few independent methods or bench experiments exist for finding such relationships.

Competing interests

The authors declare that they have no competing interests.

Acknowledgements

This article has been published as part of