Laboratory for Computational Biology and Bioinformatics, EPFL, Lausanne, Switzerland

Abstract

Computing the edit distance between two genomes under certain operations is a basic problem in the study of genome evolution. The double-cut-and-join (DCJ) model has formed the basis for most algorithmic research on rearrangements over the last few years. The edit distance under the DCJ model can be easily computed for genomes without duplicate genes. In this paper, we study the edit distance for genomes with duplicate genes under a model that includes DCJ operations, insertions and deletions. We prove that computing the edit distance is equivalent to finding the optimal cycle decomposition of the corresponding adjacency graph, and give an approximation algorithm with an approximation ratio of 1.5 +

Introduction

The combinatorics and algorithmics of genomic rearrangements have been the subject of much research since the problem was formulated in the 1990s

A basic problem in genome rearrangements is to compute the edit distance, i.e., the minimum number of operations needed to transform one genome into another. For unichromosomal genomes, Hannenhalli and Pevzner gave the first polynomial-time algorithm to compute the edit distance under signed inversions

All of the above algorithms for computing edit distances assume equal gene content and no duplicate genes. El-Mabrouk

In this paper, we focus on the problem of computing the edit distance between two genomes in the presence of duplications. We define the edit distance at the adjacency set level under a unit-cost model including DCJ operations, insertions and deletions (duplications are a special case of insertions). We reduce the problem of computing such an edit distance to finding the maximum number of certain cycles in the adjacency graph. Finally, we give a (1.5 +

Edit distance

We represent the genomes using the notations introduced by Bergeron _{h }_{t}_{t}b_{t}_{h}b_{t}_{t}b_{h}_{h}b_{h }_{h}b_{t }_{t}a_{h}_{t }_{h}

We define three operations on an adjacency set. The corresponding operations on the structure of the genome (relative positions and orientations of genes on chromosomes) are illustrated in Figure

**The effect of DCJ operations, insertions and deletions on the genomic structure**. (**a**), (**b**) and (**c**) represent DCJ operations; (**d**), (**e**), (**f**) and (**g**) represent insertions and deletions. In each subfigure, the central part represents the operation, and the left and right parts represent the genomic structures before and after it.

1.

2. _{h}g_{t }_{t}_{h}q_{h}_{t}q_{t}_{h}_{h}_{t}_{t}g_{h}_{t}_{h}

3. _{h}g_{t }_{t}_{h}q_{t}_{h}_{t}g_{h}_{t}_{h}

The _{1 }and _{2}, denoted as _{1}, _{2}), is the minimum number of operations (including DCJ operations, insertions and deletions) that transform _{1 }into _{2}. Here we use a unit-cost model, in which all operations have the same cost.

Note that the edit distance is defined at the adjacency set level. For genomes without duplicate genes, an adjacency set denotes a unique genomic structure. However, for genomes with duplicate genes, two genomes with different structures may share the same adjacency set, as illustrated in Figure _{1}, _{2}) defined above is a lower bound for the edit distance between the two genomic structures. Given two adjacency sets _{1 }and _{2 }from two genomes, let _{i }_{i}_{1}\_{2 }into _{1}; similarly, we produce _{2 }from _{2}\_{1}. Clearly, to transform _{1 }into _{2}, at least |_{1}| deletions and |_{2}| insertions are needed. The following theorem shows that these insertions and deletions are both necessary and sufficient.

**Two genomes with different structures share the same adjacency set**. Each edge in this figure represents a gene, and each node represents an adjacency.
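The observation that distinct genomic structures can share one adjacency set can be checked with a small sketch. The representation is an assumption made for illustration only: a genome is a list of `(signed_gene_list, is_circular)` pairs, an extremity is a string such as `"1h"` or `"1t"`, and the function name `adjacency_multiset` is hypothetical. The example genomes are not the ones drawn in the figure; they are a minimal pair chosen here, with gene 1 duplicated.

```python
from collections import Counter


def adjacency_multiset(genome):
    """Collect the adjacencies (and telomeres) of a genome as a multiset.

    Assumed input format: a list of (signed_gene_list, is_circular) pairs.
    An adjacency is a sorted pair of extremities; a telomere is a 1-tuple.
    """
    adjacencies = Counter()
    for chromosome, circular in genome:
        ends = []
        for g in chromosome:
            tail, head = f"{abs(g)}t", f"{abs(g)}h"
            # A gene read forwards runs tail -> head, backwards head -> tail.
            ends.append((tail, head) if g > 0 else (head, tail))
        if not ends:
            continue
        # The right end of each gene meets the left end of the next gene.
        for left_gene, right_gene in zip(ends, ends[1:]):
            adjacencies[tuple(sorted((left_gene[1], right_gene[0])))] += 1
        if circular:
            adjacencies[tuple(sorted((ends[-1][1], ends[0][0])))] += 1
        else:
            adjacencies[(ends[0][0],)] += 1
            adjacencies[(ends[-1][1],)] += 1
    return adjacencies


# One circular chromosome versus two circular chromosomes (gene 1 duplicated).
one_chromosome = [([1, 2, 1, 3], True)]
two_chromosomes = [([1, 2], True), ([1, 3], True)]
same = adjacency_multiset(one_chromosome) == adjacency_multiset(two_chromosomes)
```

Here `same` evaluates to `True` although the two structures differ, which is why a distance defined on adjacency sets can only bound the structural distance from below.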

**Theorem 1**. _{1 }_{2}_{1}_{2}_{1 }_{2}.

_{1}| deletions and more than |_{2}| insertions. Assume that _{1}_{2 }... _{m }^{0}^{1}^{2 }... ^{m }_{1 }in the process of transformation, where ^{0 }= _{1 }and ^{m }_{2}. Note that for any insertion (or deletion) beyond the |_{1}| deletions and |_{2}| insertions, there must be a matching deletion (or insertion) to preserve gene content. Thus every optimal series of operations has at least a pair of insertion and deletion on the same gene. Without loss of generality, assume _{i }_{h}g_{t }_{j }_{h}g_{t }_{i }_{j }_{h}g_{t}_{h}g_{t }_{i }_{j}_{i }_{h }_{t}^{k }^{k}^{k}^{k }^{k }^{k}_{k }_{k }_{k }_{k }^{k }^{k}_{k}_{k }^{k}^{-1}^{k}^{-1}, ^{k}^{-1}^{k}^{-1}

**Building a new series of operations to replace **_{i }_{k }_{+ 1}for _{j }_{j}

Since _{k }^{k }^{k}^{-1}. Besides, we have ^{k }

Recall that _{j }_{h}, bg_{t}_{h }_{t }_{h }_{t }_{h}, bg_{t}_{h}g_{t}^{j}^{-1}^{j}^{-1}, _{h}g_{t}^{j}^{-1}_{h}, q^{j}^{-1}_{t}

Adjacency graph decomposition

Given two adjacency sets _{1 }and _{2 }from two genomes, their corresponding _{1 }∪ _{2}, _{2 }∪ _{1}, _{1 }∪ _{2 }and _{2 }∪ _{1 }are linked by _{1 }∪ _{2 }and _{2 }∪ _{1 }have the same set of extremities; we use _{1 }= _{2 }= ∅, and each vertex in the adjacency graph has degree 2, which means that the adjacency graph consists of vertex-disjoint cycles and paths. We define the _{1 }= _{2 }= ∅ implies there exists an optimal solution without insertions or deletions; thus _{1}, _{2}) is just the minimum number of DCJ operations needed to transform _{1 }into _{2}. When _{1 }has been transformed into _{2}, the corresponding adjacency graph consists only of cycles of length 2 and paths of length 1. Since each DCJ operation can increase the number of cycles by at most 1, or increase the number of odd-length paths by at most 2, and we can always find such an operation when _{1 }and _{2 }are different, we have _{1}, _{2})=
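For the duplicate-free case described above, the counting argument can be sketched in code. This is a minimal sketch under stated assumptions, not the authors' implementation: it uses the standard closed form of Bergeron, Mixtacki and Stoye, N - (C + I/2), where N is the number of genes, C the number of cycles and I the number of odd-length paths in the adjacency graph, and its notation may differ from the formula in the text. The genome format (a list of `(signed_gene_list, is_circular)` pairs) and all function names are hypothetical. Components are traced on extremities, each of which corresponds to one edge of the adjacency graph.

```python
def gene_ends(chromosome):
    """Return the (left, right) extremities of each gene in reading order."""
    ends = []
    for g in chromosome:
        tail, head = (abs(g), "t"), (abs(g), "h")
        ends.append((tail, head) if g > 0 else (head, tail))
    return ends


def partner_map(genome):
    """Map each extremity to the one it is adjacent to; telomeres are absent."""
    partner = {}
    for chromosome, circular in genome:
        ends = gene_ends(chromosome)
        links = [(a[1], b[0]) for a, b in zip(ends, ends[1:])]
        if circular and ends:
            links.append((ends[-1][1], ends[0][0]))
        for x, y in links:
            partner[x] = y
            partner[y] = x
    return partner


def dcj_distance(genome_a, genome_b):
    """DCJ distance N - (C + I/2) for equal gene content without duplicates."""
    genes_a = sorted(abs(g) for chrom, _ in genome_a for g in chrom)
    genes_b = sorted(abs(g) for chrom, _ in genome_b for g in chrom)
    if genes_a != genes_b or len(set(genes_a)) != len(genes_a):
        raise ValueError("equal gene content without duplicates is required")
    pa, pb = partner_map(genome_a), partner_map(genome_b)
    seen, cycles, odd_paths = set(), 0, 0
    for gene in genes_a:
        for side in ("t", "h"):
            start = (gene, side)
            if start in seen:
                continue
            seen.add(start)
            stack, component = [start], []
            while stack:  # collect one connected component of extremities
                x = stack.pop()
                component.append(x)
                for partner in (pa, pb):
                    y = partner.get(x)
                    if y is not None and y not in seen:
                        seen.add(y)
                        stack.append(y)
            if all(x in pa and x in pb for x in component):
                cycles += 1  # no telomere in the component: a cycle
            elif len(component) % 2 == 1:
                odd_paths += 1  # a path with an odd number of edges
    return len(genes_a) - cycles - odd_paths // 2
```

For example, `dcj_distance([([1, 2, 3], False)], [([1, -2, 3], False)])` counts one cycle of length 4 and two paths of length 1, giving 3 - 1 - 1 = 1, the single inversion of gene 2.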

In the presence of duplicate genes, the adjacency graph may contain vertices with degree larger than 2, so that there may be multiple ways of choosing vertex-disjoint cycles and paths that cover all vertices, as illustrated in Figure _{1 }or all in _{2}), since adjacencies in _{1 }and adjacencies in _{2 }do not have common extremities and thus cannot be linked in the adjacency graph. Now we show how to perform DCJ operations, insertions and deletions to transform _{1 }into _{2 }based on a decomposition of the corresponding adjacency graph.

**An example of adjacency graph with duplicate genes**. (**a**) Structures of the two genomes. (**b**) Adjacency graph. (**c**) A decomposition with 2 cycles. (**d**) A decomposition with only 1 cycle. Diamonds and rectangles represent

**Lemma 1**. _{1 }_{2}_{1 }∪ _{2}, _{2 }∪ _{1}, _{1 }_{2}_{1}_{2}_{1}_{2}

_{1 }has been transformed into _{2}. In the following, we will prove that an

For a _{1 }and _{2}, we first perform _{2}, we choose one of its non-_{1 }and perform an insertion to create one more _{2 }are handled, we transform the cycle of length ℓ into one of length ℓ - 2

**Examples of performing operations under the guidance of decomposition**. In each subfigure, the upper part shows the transformation of the adjacency graph; the lower part shows the corresponding change in the genomic structure.

For a _{1}, we can perform ℓ/2 deletions to remove the adjacencies in _{1}. For a _{2}, we can first insert a gene as initial operand, then perform ℓ/2 - 1 insertions to create ℓ/2 cycles of length 2--see Figure

For a path with odd length ℓ, we need (ℓ - 1)/2 operations, and for a path with even length ℓ, we need ℓ/2 operations--see Figure

In sum, there are |_{1}| deletions, |_{2}| insertions and _{1}| - |_{2}| DCJ operations.
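The per-component counts used in this proof can be written down as a small helper. The path counts follow the statement above ((ℓ - 1)/2 operations for an odd-length path, ℓ/2 for an even-length path). The cycle count ℓ/2 - 1 is an assumption made here: it applies only to a cycle that needs no insertion or deletion, and is derived from the step that shortens a cycle of length ℓ to one of length ℓ - 2 with one operation. Function names are hypothetical.

```python
def operations_for_path(length):
    """Operations needed for a path with `length` edges (counts from the proof)."""
    if length < 1:
        raise ValueError("a path has at least one edge")
    if length % 2 == 1:
        return (length - 1) // 2
    return length // 2


def dcj_operations_for_cycle(length):
    """DCJ operations for a cycle that needs no insertion or deletion.

    Assumption: each DCJ splits off one cycle of length 2, so a cycle of
    even length `length` is resolved after length/2 - 1 operations.
    """
    if length < 2 or length % 2 == 1:
        raise ValueError("cycles have even length of at least 2")
    return length // 2 - 1


# One cycle of length 4 plus two paths of length 1, as produced by a single
# inversion between two linear chromosomes, needs exactly one operation.
total = (dcj_operations_for_cycle(4)
         + operations_for_path(1)
         + operations_for_path(1))
```

Cycles of length 2 and paths of length 1 cost nothing under these counts, which matches the description of the fully transformed adjacency graph.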

Lemma 1 states that any decomposition of the adjacency graph gives an upper bound on the edit distance. The following lemma shows that an optimal decomposition also provides a lower bound.

**Lemma 2**. _{1 }∪ _{2}_{2 }∪ _{1}_{D }and o_{D }

_{1}| deletions and |_{2}| insertions. Summing over all Δ_{P }_{1}|) is the sum of the number of _{1 }has been transformed into _{2}. Define _{DCJ }_{INS }_{DEL }_{P }_{P}, P

We prove Δ_{P }_{P }_{P }_{P}_{σ}_{A"}_{σ}_{A"}_{σ}_{A′}_{σ}_{A'}_{P}_{σ}_{A'}_{σ}_{A'}_{γ}_{A'}_{γ}_{A'}_{σ}_{A"}_{σ}_{A"}_{γ}_{A'}_{γ}_{A'}_{P}

If _{σ}_{A"}_{σ}_{A"}_{γ}_{A'}_{γ}_{A'}_{σ}_{A"}_{σ}_{A"}_{γ}_{A'}_{γ}_{A'}

If _{σ}_{A"}_{σ}_{A"}_{γ}_{A'}_{γ}_{A'}_{σ}_{A"}_{σ}_{A"}_{γ}_{A'}_{γ}_{A'}

If _{σ}_{A"}_{σ}_{A"}_{γ}_{A'}_{γ}_{A'}

Combining Lemma 1 and Lemma 2, we have the following theorem.

**Theorem 2**. _{1 }∪ _{2}, _{2 }∪ _{1}, _{D }and o_{D }are the numbers of helpful cycles and odd-length paths in D, respectively

Approximation algorithm

We design an approximation algorithm by using techniques employed on the problem of B

To make use of such an algorithm, we must remove telomeres and keep only cycles in the adjacency graph. This can be done by introducing _{1 }and _{2 }with 2_{1 }and 2_{2 }telomeres respectively, we replace each telomere _{1 }<_{2}, we must add (_{2 }- _{1}) null adjacencies _{1 }in order to balance the degrees. The corresponding adjacency graph is constructed in the same way as in the case without null extremities: two adjacencies are linked by one edge if they share one extremity, and by two edges if they share two extremities. Now we prove that this "telomere-removal" operation does not change _{1}, _{2}).

**Theorem 3**. Let _{1 }_{2 }_{1 }_{2 }

_{1 }∪ _{2}, _{2 }∪ _{1}, _{1 }and _{2 }and _{1 }of them contain two telomeres in _{1 }and _{2 }of them contain two telomeres in _{2}. Suppose _{1 }and _{2 }contain 2_{1 }and 2_{2 }telomeres respectively (w.l.o.g., assume _{1 }≤ _{2}). Since an odd-length path contains one telomere in each adjacency set while an even-length path contains two telomeres in one adjacency set, we have _{1 }= 2_{1 }and _{2 }= 2_{2}. We can perform the following modifications on _{1}, arbitrarily choose one even-length path with both telomeres in _{2 }and link these two paths to form a _{2 }- _{1 }even-length paths, use _{2 }- _{1 }= _{2 }- _{1 }null adjacencies _{2 }helpful cycles in this decomposition of _{2}) - _{2 }= _{1}, _{2}). Now we prove

**One example of the "telomere-removal" and "telomere-recovery" process**. Thick circles represent adjacencies containing null extremities, and thick lines represent edges connecting null extremities.
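The "telomere-removal" step can be sketched as follows. Because the notation in the text is incomplete, the details are assumptions: each telomere `x` is taken to become an adjacency pairing `x` with a null extremity, and the adjacency set with fewer telomeres receives null adjacencies (two null extremities each) until both sets contain the same number of null extremities. The label `TAU`, the tuple encoding (1-tuples for telomeres, 2-tuples for adjacencies) and the function names are hypothetical.

```python
from collections import Counter

TAU = "tau"  # hypothetical label for the null extremity


def count_telomeres(adjacencies):
    """Number of telomeres (1-tuples) in an adjacency multiset."""
    return sum(n for key, n in adjacencies.items() if len(key) == 1)


def remove_telomeres(adj1, adj2):
    """Cap every telomere with TAU and balance the two sets (assumed rule)."""
    def cap(adjacencies):
        capped = Counter()
        for key, n in adjacencies.items():
            capped[(key[0], TAU) if len(key) == 1 else key] += n
        return capped

    t1, t2 = count_telomeres(adj1), count_telomeres(adj2)
    capped1, capped2 = cap(adj1), cap(adj2)
    # Telomeres come in pairs (two per linear chromosome), so the
    # difference is even and each null adjacency absorbs two of them.
    missing = abs(t1 - t2) // 2
    if missing:
        (capped1 if t1 < t2 else capped2)[(TAU, TAU)] += missing
    return capped1, capped2


def tau_degree(adjacencies):
    """Total number of null extremities appearing in an adjacency multiset."""
    return sum(n * key.count(TAU) for key, n in adjacencies.items())


# One linear copy of gene 1 versus two linear copies of gene 1.
a1 = Counter({("1t",): 1, ("1h",): 1})
a2 = Counter({("1t",): 2, ("1h",): 2})
c1, c2 = remove_telomeres(a1, a2)
```

In this example `c1` gains one null adjacency, after which both capped sets contain four null extremities and no telomeres remain.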

**Two cases of the adjacency graph with more than 2 edges that are linked through τ**. Dashed lines might represent more than one edge.

In summary, based on Theorems 2 and 3, we have established the equivalence between the problem of computing the edit distance and that of finding a valid decomposition with a maximum number of

Now we give the approximation algorithm and prove that its approximation ratio is 1.5 +

**Approximation Algorithm**

**Input: **Two adjacency sets _{1 }and _{2 }from two genomes

**Output: **A series of operations to transform _{1 }into _{2}.

**Step 1 **Add null adjacencies to _{1 }and _{2 }to obtain

**Step 2 **Collect all

**Step 3 **Remove the adjacencies covered by cycles in

**Step 4 **Remove the null adjacencies of cycles in _{1 }into _{2 }according to Lemma 1 guided by these cycles and paths.

The running time of the above algorithm is dominated by the time complexity of the (2 +

**Theorem 4**.

_{1}, _{2}) =

Conclusion

We studied the edit distance problem for two genomes under a unit-cost model including DCJ operations, insertions (including duplications) and deletions. We proved that this problem is equivalent to finding the maximum number of

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

MS and YL conceived the idea, performed the analysis, and wrote the manuscript. All authors read and approved the final manuscript.

Acknowledgements

We thank Bernard Moret for helpful discussions.

This article has been published as part of