IME, Universidade Federal Fluminense, Niterói, Brazil

Inmetro - Instituto Nacional de Metrologia, Qualidade e Tecnologia, Duque de Caxias, 25250-020, Brazil

Abstract

Background

The double-cut-and-join (DCJ) is a model that is able to efficiently sort a genome into another, generalizing the typical mutations (inversions, fusions, fissions, translocations) to which genomes are subject, but allowing the existence of circular chromosomes at the intermediate steps. In the general model many circular chromosomes can coexist in some intermediate step. However, when the compared genomes are linear, it is more plausible to use the so-called restricted DCJ model, in which a circular chromosome is reincorporated immediately after its creation. These two consecutive DCJ operations, which create and reincorporate a circular chromosome, mimic a transposition or a block-interchange. When the compared genomes have the same content, it is known that the genomic distance for the restricted DCJ model is the same as the distance for the general model. If the genomes have unequal contents, in addition to DCJ it is necessary to consider indels, which are insertions and deletions of DNA segments. Linear-time algorithms were proposed to compute the distance and to find a sorting scenario in a general, unrestricted DCJ-indel model that considers DCJ and indels.

Results

In the present work we consider the restricted DCJ-indel model for sorting linear genomes with unequal contents. We allow DCJ operations and indels with the following constraint: if a circular chromosome is created by a DCJ, it has to be reincorporated in the next step (no other DCJ or indel can be applied between the creation and the reincorporation of a circular chromosome). We then develop a sorting algorithm and give a tight upper bound for the restricted DCJ-indel distance.

Conclusions

We have given a tight upper bound for the restricted DCJ-indel distance. The question whether this bound can be reduced so that both the general and the restricted DCJ-indel distances are equal remains open.

Background

The distance between two genomes is often computed using only the common content, which occurs in both genomes. Such distance takes into consideration only

When comparing genomes with the same content and without duplicated DNA segments, it is already known that the genomic distance for the restricted DCJ model is the same as the distance for the general model and can be computed in linear time

**(i) An optimal sorting sequence in the general DCJ model - many circular chromosomes can coexist in the intermediate species**. (ii) An optimal sorting sequence in the restricted DCJ model - a circular chromosome is immediately reincorporated after its excision. The distance is always the same for both general and restricted DCJ models.

If the genomes have unequal contents, in addition to DCJ operations it is necessary to consider

The general DCJ-indel model has the flexibility of assigning different positive costs to DCJ and indel operations

**(i) An optimal sorting sequence in the general DCJ-indel model - many circular chromosomes can coexist in the intermediate species**. (ii) An optimal sorting sequence in the restricted DCJ-indel model - a circular chromosome is immediately reincorporated after its excision. Although the number of steps in (i) and (ii) is the same, the question whether the distance is the same for both general and restricted DCJ-indel models is open. (The common content of the initial and the final genomes is represented in black, while the content exclusive to the initial genome is represented in red.)

This paper is organized as follows. In the remainder of this section we recall some key concepts of the DCJ-indel model with distinct operation costs

The DCJ model

A linear genome is composed of linear chromosomes and can be represented by a set of strings as follows. For each chromosome

Given two linear genomes

Given two genomes ^{t }^{h }_{i }_{1 }and _{2 }in

DCJ operations

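Consider a DCJ $\rho$ applied to the adjacencies $V_1 = \gamma_1\ell_1\ell_4\gamma_4$ and $V_2 = \gamma_3\ell_3\ell_2\gamma_2$, which creates $U_1 = \gamma_1\ell_1\ell_2\gamma_2$ and $U_2 = \gamma_3\ell_3\ell_4\gamma_4$. We represent such an operation as $\rho = (\{\gamma_1\ell_1|\ell_4\gamma_4, \gamma_3\ell_3|\ell_2\gamma_2\} \to \{\gamma_1\ell_1|\ell_2\gamma_2, \gamma_3\ell_3|\ell_4\gamma_4\})$. The two adjacencies $V_1$ and $V_2$ are called the sources, while $U_1$ and $U_2$ are called the resultants of $\rho$. Any of the labels $\ell_1$, $\ell_2$, $\ell_3$ and $\ell_4$ can be equal to $\varepsilon$ (the empty label), and any of the extremities $\gamma_1$, $\gamma_2$, $\gamma_3$ and $\gamma_4$ can be equal to ○ (a telomere). A DCJ operation can correspond to several rearrangement events, such as an inversion, a translocation, a fusion or a fission.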
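The effect of a DCJ on two labeled adjacencies can be illustrated with a small Python sketch. The representation is an assumption made for illustration only: an adjacency is a triple (left extremity, label, right extremity), a label is a string with one character per unique marker, `"o"` stands for a telomere, and the names `Adjacency` and `dcj` are hypothetical. Only one of the two possible ways of rejoining the four open ends is modelled, and marker orientation is ignored.

```python
from typing import NamedTuple, Tuple

TELOMERE = "o"  # illustrative stand-in for the telomere symbol


class Adjacency(NamedTuple):
    left: str   # left extremity, or TELOMERE
    label: str  # unique markers between the extremities ("" if clean)
    right: str  # right extremity, or TELOMERE


def dcj(v1: Adjacency, cut1: int,
        v2: Adjacency, cut2: int) -> Tuple[Adjacency, Adjacency]:
    """Cut v1 after cut1 label characters and v2 after cut2, then rejoin.

    The left part of v1 is joined to the right part of v2, and the left
    part of v2 is joined to the right part of v1.
    """
    u1 = Adjacency(v1.left, v1.label[:cut1] + v2.label[cut2:], v2.right)
    u2 = Adjacency(v2.left, v2.label[:cut2] + v1.label[cut1:], v1.right)
    return u1, u2


# Two sources and the two resultants of one DCJ.
sources = (Adjacency("1h", "xy", "4t"), Adjacency("3h", "zw", "2t"))
resultants = dcj(sources[0], 1, sources[1], 1)
```

In this sketch, applying `dcj` to the two resultants with the same cut positions restores the two sources, which is the sense in which a DCJ can be inverted.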

Adjacency graph and the DCJ distance

Given two genomes

**For genomes A and B, the graph has one BB and two AB-paths**.

Components with 3 or more vertices need to be reduced - by applying DCJ operations - to components with only 2 vertices, that can be cycles or _{DCJ }

**Theorem 1**([**3**])
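As an illustration of how the adjacency graph yields the DCJ distance, the following Python sketch handles the simplest setting: two linear genomes with the same content and no unique markers. It assumes the formula commonly stated in the DCJ literature, d = n - (c + b/2), where n is the number of common markers, c the number of cycles and b the number of AB-paths. The input encoding (lists of signed integers) and the function names are hypothetical choices for this sketch.

```python
from collections import defaultdict


def adjacencies(genome):
    """Return the adjacencies and telomeres of a linear genome.

    A genome is a list of non-empty linear chromosomes; a chromosome is a
    list of non-zero signed integers (the sign gives the orientation).
    Each vertex is a tuple with one extremity (telomere) or two (adjacency).
    """
    vertices = []
    for chromosome in genome:
        ends = []
        for g in chromosome:
            tail, head = (abs(g), "t"), (abs(g), "h")
            # +g is read tail -> head, -g is read head -> tail.
            ends.extend([tail, head] if g > 0 else [head, tail])
        vertices.append((ends[0],))                  # left telomere
        for i in range(1, len(ends) - 1, 2):
            vertices.append((ends[i], ends[i + 1]))  # inner adjacency
        vertices.append((ends[-1],))                 # right telomere
    return vertices


def dcj_distance(genome_a, genome_b):
    """DCJ distance n - (c + b/2), assuming equal content (sketch only)."""
    va, vb = adjacencies(genome_a), adjacencies(genome_b)
    offset = len(va)
    where_b = {ext: offset + j for j, v in enumerate(vb) for ext in v}
    parent = list(range(len(va) + len(vb)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # One edge per extremity, linking the A-vertex and B-vertex holding it.
    for i, v in enumerate(va):
        for ext in v:
            parent[find(i)] = find(where_b[ext])

    n_vertices, n_edges = defaultdict(int), defaultdict(int)
    for x in range(len(parent)):
        n_vertices[find(x)] += 1
    for i, v in enumerate(va):
        n_edges[find(i)] += len(v)

    # A cycle has as many edges as vertices; a path has one edge fewer.
    cycles = sum(1 for r in n_vertices if n_edges[r] == n_vertices[r])
    # A path with an even number of vertices ends in A on one side, B on the other.
    ab_paths = sum(1 for r in n_vertices
                   if n_edges[r] < n_vertices[r] and n_vertices[r] % 2 == 0)
    n = sum(len(chromosome) for chromosome in genome_a)
    return n - cycles - ab_paths // 2
```

For instance, `dcj_distance([[1, -2, 3]], [[1, 2, 3]])` evaluates one inversion, and `dcj_distance([[1, 2, 3, 4]], [[1, 3, 2, 4]])` evaluates a block-interchange, which needs two DCJs (an excision followed by a reincorporation).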

The DCJ-indel model with distinct costs

Although the DCJ-model was defined in the previous sections for genomes with unequal contents, only the common markers were handled. In this section we explain how to deal with unique markers, that are markers which occur only in genome

Indel operations

In order to deal with unique markers, we need operations that change the content of a genome. These operations can be an

Given $\ell_3 \neq \varepsilon$, the deletion of $\ell_3$ from the adjacency $\gamma_1\ell_1\ell_3\ell_2\gamma_2$ is represented as $(\gamma_1\ell_1|\ell_3|\ell_2\gamma_2 \to \gamma_1\ell_1|\ell_2\gamma_2)$, while the insertion of $\ell_3$ in the adjacency $\gamma_1\ell_1\ell_2\gamma_2$ is represented as $(\gamma_1\ell_1|\ell_2\gamma_2 \to \gamma_1\ell_1|\ell_3|\ell_2\gamma_2)$. One or both extremities among $\gamma_1$ and $\gamma_2$ can be equal to ○ (a telomere), and one or both labels among $\ell_1$ and $\ell_2$ can be equal to $\varepsilon$.
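A deletion and an insertion on a labeled adjacency can be sketched in Python as follows. The encoding is an assumption for illustration: an adjacency is a plain triple (left extremity, label, right extremity), the label is a string with one character per unique marker, and the function names `delete` and `insert` are hypothetical.

```python
def delete(adjacency, start, end):
    """Remove the non-empty label segment label[start:end]."""
    left, label, right = adjacency
    if not 0 <= start < end <= len(label):
        raise ValueError("the deleted segment must be a non-empty part of the label")
    return (left, label[:start] + label[end:], right)


def insert(adjacency, position, segment):
    """Insert a non-empty segment into the label at the given position."""
    left, label, right = adjacency
    if not segment or not 0 <= position <= len(label):
        raise ValueError("need a non-empty segment and a position inside the label")
    return (left, label[:position] + segment + label[position:], right)


# The extremities are untouched; only the label between them changes.
shorter = delete(("1h", "abc", "2t"), 1, 2)
restored = insert(shorter, 1, "b")
```

The two operations are inverses of each other in this sketch: inserting the deleted segment at the same position restores the original adjacency.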

Given two genomes

Runs, indel-potential and the DCJ-indel distance

Let us recall the concept of

**An AB-path with 3 runs (extracted from Figure 3)**. The first and the second runs are compact, while the third run is long and composed of three vertices.

A set of labels of one genome can be accumulated with DCJs. In particular, when we apply optimal DCJs on only one component of the adjacency graph, we can accumulate an entire run into a single adjacency _{DCJ }_{DCJ }_{DCJ}_{c∈AG(A, B) }_{DCJ}

Runs can be merged by DCJ operations. Consequently, during the optimal DCJ-sorting of a component

**Two optimal sequences for DCJ-sorting an AB-path with Λ = 3 (the cuts of each DCJ in each sequence are represented by "|")**. In (i) the overall number of runs in the resulting components is three, while in (ii) the resulting components have only two runs. Indeed, in this case, the best we can have is the indel-potential λ = 2.

The indel-potential of a component depends only on its number of runs:

**Proposition 1 **(
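The dependence of the indel-potential on the number of runs can be made concrete with a short Python sketch. It assumes the closed form usually given in the DCJ-indel literature, namely λ = 0 for a component without runs and λ = ⌈(Λ + 1)/2⌉ otherwise; this agrees with the example above, where Λ = 3 gives λ = 2. The input encoding (a string with one character, "A" or "B", per labeled vertex in component order) and the function names are hypothetical.

```python
def count_runs(labeled_genomes, is_cycle=False):
    """Count the runs of a component.

    labeled_genomes lists, in the order they appear along the component,
    the genome ("A" or "B") of each labeled vertex; clean vertices are
    simply omitted. A run is a maximal block of labeled vertices that
    belong to the same genome.
    """
    if not labeled_genomes:
        return 0
    changes = sum(1 for x, y in zip(labeled_genomes, labeled_genomes[1:]) if x != y)
    runs = changes + 1
    # In a cycle the first and last blocks are adjacent and may be one run.
    if is_cycle and runs > 1 and labeled_genomes[0] == labeled_genomes[-1]:
        runs -= 1
    return runs


def indel_potential(runs):
    """Assumed closed form: 0 if there are no runs, else ceil((runs + 1) / 2)."""
    return 0 if runs == 0 else (runs + 2) // 2


# An AB-path whose labeled vertices read A, A, B, B, A has three runs.
path_runs = count_runs("AABBA")
path_potential = indel_potential(path_runs)
```

The integer expression `(runs + 2) // 2` is just the ceiling of (runs + 1)/2 written without floating-point arithmetic.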

Let _{0 }and _{1 }be, respectively, the sum of the indel-potentials for the components of the adjacency graph before and after a DCJ operation _{1 }- _{0}. If

**Lemma 1 **(

Recombinations

Until this point, we have explored the possible effects of any DCJ that is applied to two adjacencies belonging to a single component of the graph. However, there is another type of DCJ that must be considered. A DCJ operation

**This recombination is a neutral DCJ that has Δλ = -2 (we represent only the labels of the adjacencies; the cuts of the recombination are represented by "/" and "\")**.

Although many different recombinations can occur, it is possible to explore the space of recombinations in linear time and compute the maximum deduction that we can obtain with respect to the upper bound of Lemma 1

Results

In this section we develop a restricted DCJ-indel sorting algorithm, from which we can derive an upper bound for the restricted DCJ-indel distance.

Chained operations

Let us generalize to the DCJ-indel model a concept introduced in _{1}_{2 }... _{n-1}_{n }_{i }and _{i+1 }of _{i+1 }is a resultant of _{i}_{i+1 }use as a source a resultant from _{i}_{i }_{i+1 }are said to be

Bi-directional approach

Although in general a sorting algorithm is conceived to follow a single direction, in which all operations are applied on the initial genome, here we design a bi-directional algorithm, in which some operations are applied on genome ^{-1 }= (_{1}_{2 }... _{n }

**Proposition 2 **(_{1 }_{2 }

**(i) Two sequences of lengths 3 and 2, sorting **. (ii) A corresponding sequence of length 5 sorting A into B. (Unique markers are represented in red.).
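The bi-directional idea can be sketched in Python under simplifying assumptions: a genome is abstracted as a set of adjacency names, an operation is a pair (sources, resultants) of sets, and inverting an operation swaps the two. If one sequence sorts the first genome into an intermediate genome and another sequence sorts the second genome into the same intermediate genome, then the first sequence followed by the reversed inverse of the second sorts the first genome into the second. All names below are hypothetical.

```python
def apply_op(state, op):
    """Replace the sources of op by its resultants in the current genome."""
    sources, resultants = op
    if not sources <= state:
        raise ValueError("sources are not present in the current genome")
    return (state - sources) | resultants


def invert(op):
    """The inverse operation turns the resultants back into the sources."""
    sources, resultants = op
    return (resultants, sources)


def run(state, sequence):
    """Apply a sequence of operations from left to right."""
    for op in sequence:
        state = apply_op(state, op)
    return state


def merge_bidirectional(s1, s2):
    """Join s1 (applied forward) with s2 inverted and read backwards."""
    return list(s1) + [invert(op) for op in reversed(s2)]


genome_a = frozenset({"a", "b", "c"})
genome_b = frozenset({"x", "z", "w"})
s1 = [(frozenset({"a", "b"}), frozenset({"x", "y"}))]  # sorts A into I
s2 = [(frozenset({"z", "w"}), frozenset({"y", "c"}))]  # sorts B into I
combined = merge_bidirectional(s1, s2)                 # sorts A into B
```

The length of the combined sequence is the sum of the lengths of the two partial sequences, as in the figure, where sequences of lengths 3 and 2 give a sequence of length 5.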

Accumulating x splitting labels

A DCJ that accumulates labels is always applied to two labeled adjacencies and results in a clean adjacency and an adjacency containing the concatenation of the labels of the original adjacencies. In general, we can represent such an accumulating DCJ as $\rho = (\{\gamma_1\ell_1|\gamma_4, \gamma_3|\ell_2\gamma_2\} \to \{\gamma_1\ell_1|\ell_2\gamma_2, \gamma_3|\gamma_4\})$. If

The inverse of an accumulating DCJ $\rho$ is $\rho^{-1} = (\{\gamma_1\ell_1|\ell_2\gamma_2, \gamma_3|\gamma_4\} \to \{\gamma_1\ell_1|\gamma_4, \gamma_3|\ell_2\gamma_2\})$. Observe that, if $\rho^{-1}$ is applied on

Accumulating and splitting DCJ operations

**Operation**

**Direction**

**Effect**

**Inverse**

Accumulate labels of an

Inversely split label of a

Accumulate labels of a

Inversely split label of an

Accumulation-deletion x insertion-split

Let _{1 }= _{1}_{1}_{2}_{2 }... _{i}x_{i }_{j}x_{j }_{n-1}_{n-1}_{n }_{1 }and _{n }are labeled, each _{k }_{k
}_{i }and _{j }(1 ≤ _{1 }are _{i }and _{j }are labeled and all vertices between _{i }and _{j }in _{1 }are clean. We can apply an accumulating DCJ on the two partners _{i }and _{j}_{i-j}, reducing _{1 }to _{2 }= _{1}_{1}_{2}_{2 }... _{i-1}_{i-1}_{i-j}_{j}_{j+1}_{j+1 }... _{n-1}_{n-1}_{n}_{2}, reducing _{2 }to _{3}, and so on. Assuming that the initial _{1 }has _{m}_{1}. Observe that all labeled vertices will be used in some accumulating DCJ, until the compact-run _{m }

As an example, take _{1 }= _{1}_{1}_{2}, _{1 }= _{2}_{3}, _{2 }= _{3}_{2}_{4}, _{2 }= _{4}_{5}, _{3 }= _{5}_{3}_{6}, _{3 }= _{6}_{7}, _{4 }= _{7}_{4}_{8}, with all _{k }_{1 }= _{1}_{1}_{2}_{2}_{3}_{3}_{4 }be a _{2 }and _{3}, creating _{2-3 }= _{3}_{2}_{3}_{6 }and _{4}_{5}, reducing _{1 }to _{2 }= _{1}_{1}_{2-3}_{3}_{4}. We then apply another DCJ of type _{1 }and _{2-3}, creating _{1-2-3 }= _{1}_{1}_{2}_{3}_{6 }and _{2}_{3}, reducing _{2 }to _{3 }= _{1-2-3}_{3}_{4}. Finally, we apply a DCJ of type _{1-2-3 }and _{4}, creating _{1-2-3-4 }= _{1}_{1}_{2}_{3}_{4}_{8 }and _{6}_{7}, reducing _{3 }to _{4 }= _{1-2-3-4}. If we follow the accumulation of a run, considering only the labeled vertices, we obtain a rooted tree that is built from the leafs to the root (see Figure
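The accumulation in this example can be traced with a short Python sketch that follows only the labels of the labeled vertices. It is a simplification: extremities and clean vertices are left out, each accumulating DCJ is reduced to the concatenation of the labels of two neighbouring partners, and the function name `accumulate_run` and its `picks` argument are hypothetical.

```python
def accumulate_run(labels, picks):
    """Accumulate the labels of a run into a single label.

    labels: labels of the labeled vertices, in the order of the run.
    picks:  for each step, the index i of the left partner; the partners
            at positions i and i + 1 are merged by one accumulating DCJ.
    Returns the remaining labels and the list of merges performed, each
    recorded as (left label, right label, merged label).
    """
    current = list(labels)
    steps = []
    for i in picks:
        merged = current[i] + current[i + 1]
        steps.append((current[i], current[i + 1], merged))
        current[i:i + 2] = [merged]
    return current, steps


# Same order as in the example: partners 2 and 3 first, then 1 with 2-3,
# then 1-2-3 with 4.
final, steps = accumulate_run(["l1", "l2", "l3", "l4"], [1, 0, 0])
# Reading the merges backwards gives the order of the inverted-split.
split_order = list(reversed(steps))
```

The recorded merges form the rooted tree of the accumulation: each merge is an internal node whose children are the two partners, and the last merge is the root.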

**The rooted tree of an accumulation of a **. Inversely, the rooted tree of an inverted-split of a

The inversion of the run accumulation described in the example above is the inverted-split of the label of the compact-run _{4 }= _{1-2-3-4 }into the labeled vertices _{1 }= _{1}_{1}_{2}, _{2 }=_{3}_{2}_{4}, _{3 }= _{5}_{3}_{6 }and _{4 }= _{7}_{4}_{8}. We start by applying a _{1-2-3-4 }= _{1}_{1}_{2}_{3}_{4}_{8 }and _{6 }_{7}, obtaining _{1-2-3 }= _{1}_{1}_{2}_{3}_{6 }and _{4 }= _{7}_{4}_{8}. We then apply a _{1-2-3 }and _{2}_{3}, obtaining _{2 }= _{1}_{1}_{1 }and _{2-3 }= _{3}_{2}_{3}_{6}. Finally we apply a _{2-3 }and _{4}_{5}, obtaining _{2 }= _{3}_{2}_{4 }and _{3 }=_{5}_{3}_{6}. If we follow the inverted-split of a run, considering only the labeled vertices, we obtain a rooted tree that is built from the root to the leafs (see Figure

An indel does not have to occur while a circular chromosome exists

We now show that an indel need not be applied while a circular chromosome exists.

Proposition 3 shows that an insertion can always be "moved up" in a DCJ-indel sorting sequence.

**Proposition 3 **_{1}_{2 }... _{n-1}_{n }be a DCJ-indel sequence sorting genome A into genome B, such that, for an integer _{i }is a DCJ operation and ρ_{i+1 }_{i }ρ_{i+1 }_{1}_{2}, _{1 }_{2 }_{1}_{2 }... _{i-1}_{1}_{2}_{i+ 2 }... _{n-1}_{n }is also a DCJ-indel sequence sorting genome A into genome B

_{i }and _{i}_{1 }= _{i+1 }and _{2 }= _{i}_{i }_{i+1 }are chained.

Observe that a DCJ in any optimal sorting scenario either accumulates or does not change the composition of runs. Take $\rho_i = (\{\gamma_1\ell_1|\gamma_4, \gamma_3|\ell_2\gamma_2\} \to \{\gamma_1\ell_1|\ell_2\gamma_2, \gamma_3|\gamma_4\})$. Furthermore, since an insertion in any optimal sequence is performed without breaking any existing label, without loss of generality, take $\rho_{i+1} = (\gamma_1\ell_1\ell_2|\gamma_2 \to \gamma_1\ell_1\ell_2|\ell_3|\gamma_2)$. Then $\rho_i\rho_{i+1}$ could be replaced by $\rho'_1 = (\gamma_3\ell_2|\gamma_2 \to \gamma_3\ell_2|\ell_3|\gamma_2)$ followed by $\rho'_2 = (\{\gamma_1\ell_1|\gamma_4, \gamma_3|\ell_2\ell_3\gamma_2\} \to \{\gamma_1\ell_1|\ell_2\ell_3\gamma_2, \gamma_3|\gamma_4\})$.
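The exchange argument of this proof can be checked on a concrete instance with a small Python sketch. The encoding is an assumption for illustration: an adjacency is a triple (left extremity, label, right extremity), labels are strings, and `dcj` and `insert` are hypothetical helpers. The sketch compares "accumulating DCJ, then insertion" with "insertion, then accumulating DCJ".

```python
def dcj(v1, cut1, v2, cut2):
    """Join the left part of v1 to the right part of v2, and vice versa."""
    u1 = (v1[0], v1[1][:cut1] + v2[1][cut2:], v2[2])
    u2 = (v2[0], v2[1][:cut2] + v1[1][cut1:], v1[2])
    return u1, u2


def insert(adjacency, position, segment):
    """Insert a segment into the label at the given position."""
    left, label, right = adjacency
    return (left, label[:position] + segment + label[position:], right)


v1 = ("g1", "a", "g4")  # labeled source, cut after its label
v2 = ("g3", "b", "g2")  # labeled source, cut before its label

# Original order: the DCJ accumulates "a" and "b", then "c" is inserted
# at the end of the accumulated label.
u1, u2 = dcj(v1, 1, v2, 0)
dcj_then_insertion = (insert(u1, len(u1[1]), "c"), u2)

# Moved-up order: "c" is inserted at the end of the label of v2 first,
# then the same cuts are applied.
insertion_then_dcj = dcj(v1, 1, insert(v2, len(v2[1]), "c"), 0)
```

Both orders produce the same pair of adjacencies in this instance, so the insertion can be performed before the DCJ without changing the result or the number of steps.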

Similarly, a deletion can always be "moved down" in a DCJ-indel sorting sequence.

**Proposition 4 **_{1}_{2 }... _{n-1}_{n }_{i }is a deletion and ρ_{i+1 }_{i}ρ_{i+1 }_{1}_{2}, _{1 }_{2 }_{1}_{2 }... _{i-1}_{1}_{2}_{i+2} ... _{n-1}_{n }is also a DCJ-indel sequence sorting genome A into genome B

From the previous propositions we observe that finding a position to perform an indel poses no difficulty for the design of a restricted DCJ-indel sorting algorithm. The challenge is to determine the DCJ part of the sorting sequence, so that each circular chromosome is reincorporated immediately after its creation and the indel-potential is achieved for each component.

Restricted DCJ-indel sorting

Basically, our approach disregards recombinations and sorts the components of the graph separately, using optimal DCJ operations to achieve the minimum number of indels per component, that is given by the indel-potential. In this way, we achieve the distance given by the upper bound of Lemma 1, as we will see in the remainder of this section.

Capping

Disregarding recombinations, we can first perform the genome capping, a technique that helps us to avoid difficulties and special cases produced by telomeres: we adjoin new markers (caps) to the ends of the chromosomes (and new chromosomes composed of caps only, if necessary) so that we do not change the distance and we do not have to worry about telomeres

Merging runs in cycles

An important step of the DCJ-indel sorting is to merge runs in cycles with at least 4 runs, so that the indel-potential for each cycle is achieved.

**Proposition 5 **

Chromosome reincorporation

In the restricted sorting of linear genomes a circular chromosome has to be immediately reincorporated after its excision - these two consecutive operations mimic either a transposition or a block-interchange

Suppose that a DCJ performed an excision of a circular chromosome. Let (_{1}, _{2}) be a pair of vertices, such that _{1 }and _{2 }are in the same genome and belong to the same cycle in _{1 }is an adjacency at the circular chromosome and _{2 }is an adjacency at a linear chromosome. The pair (_{1}, _{2}) is called a link. Since _{1 }and _{2 }are in the same cycle, a chromosome reincorporation can always be done by applying a DCJ on the two vertices _{1 }and _{2 }

The cycle to which a link (_{1}, _{2}) belongs is called a

The two vertices _{1 }and _{2 }of a link in a connection cycle _{1 }and _{2 }is given by the number of edges in the shortest path connecting them. Since both _{1 }and _{2 }are in the same genome, this distance is always even and positive. If the distance between _{1 }and _{2 }is 2, _{1 }and _{2 }have a common neighbor, and (_{1}, _{2}) is called a

**Proposition 6 **

_{1}_{1}_{2}_{2 }... _{n}_{n }be a connection cycle in _{1}, ..., _{n }are in _{1}, ..., _{n }are in _{i}, v_{j}_{i }is in the circular chromosome and _{j }is in a linear chromosome of _{k}, i _{i }and _{j }belonging to the circular chromosome. Then (_{k}, v_{k+1}) is a short-link. □
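The argument behind the existence of a short-link can be sketched in Python. The sketch assumes a simplified view of a connection cycle: only the vertices of the genome that holds the circular chromosome are listed, in cyclic order, each with a flag telling whether it lies on the circular chromosome. Two consecutive entries share a neighbour in the other genome and are therefore at distance 2. The function name and the encoding are hypothetical.

```python
def find_short_link(in_circular):
    """Return a short-link as (index on circular, index on linear), or None.

    in_circular[k] is True if the k-th vertex (among the vertices of one
    genome, in cyclic order along the cycle) belongs to the circular
    chromosome, and False if it belongs to a linear chromosome.
    """
    n = len(in_circular)
    for k in range(n):
        nxt = (k + 1) % n
        # Walking around the cycle, the flag has to change somewhere as
        # soon as both kinds of vertices are present.
        if in_circular[k] != in_circular[nxt]:
            return (k, nxt) if in_circular[k] else (nxt, k)
    return None  # all vertices are on the same kind of chromosome


link = find_short_link([True, True, False, False])
```

A result of `None` means that the cycle contains no link at all, so it is not a connection cycle in the sense used here.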

In order to find out whether the indel-potential of the connection cycle _{1}, _{2}), basically we need to analyze how the connection cycle _{1 }and _{2 }in

We focus on the short-links only. Let (_{1},_{2}) be a short-link in a connection cycle _{1 }= _{1}_{1}_{2 }and _{2 }= _{3}_{2}_{4 }(_{1 }and _{2 }can be equal to _{2}_{3}_{3 }be the common neighbor of _{1 }and _{2 }(_{3 }can also be equal to _{1}, _{2}) = ({_{1}, _{2}} → {_{1}, _{2}}), such that _{1 }= _{2}_{3 }and _{2 }= _{1}_{1}_{2}_{4}. Observe that _{1}, _{2}) always extracts _{1 }into a cycle, and accumulates the labels of _{1 }and _{2 }into a new vertex _{2}, which is extracted into a cycle with the remaining vertices of

1.

2. _{1}, _{2}) be a short-link in _{1 }and _{2 }is a compact-run. An optimal DCJ _{1}, _{2}) extracts the compact-run _{1}, _{2}) preserves the indel-potential of

3. _{1}, _{2}) is not a gap nor is separated by a compact-run, only one possibility remains: the common neighbor _{1 }and _{2 }is labeled and belongs to a long-run _{1}, _{2}) splits

Although the overall indel-potential seems to be increased, the DCJ described above is an inverted-split of type

It is important to guarantee that, after applying a DCJ that inversely splits a run _{1 }and another DCJ that inversely splits another run _{2}, the runs _{1 }and _{2 }are not merged. We do this by simply extracting the residual part of an inversely split run into a new cycle. Furthermore, during the merging or accumulation of runs, a run

We can always reincorporate the circular chromosome with a DCJ applied to any short-link (_{1}, _{2}), except if _{1}, _{2}) splits a run _{1}, _{2}) cannot be chained with a previous inverted-split of

After an excision, suppose that the circular chromosome is in genome

**Proposition 7 **

The sorting algorithm and an upper bound for the restricted DCJ-indel distance

We put everything together in Algorithm 1 (Additional file

Conclusions

In this work we have presented a method to compute a restricted DCJ-indel sequence of operations that sorts a linear genome into another linear genome. This method leads to a tight upper bound for the restricted DCJ-indel distance. The general DCJ-indel distance can be computed exactly and is a lower bound for the restricted DCJ-indel distance. However, the question whether these two bounds coincide, meaning that both distances are equal, remains open.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

PHS, MDVB, RM and SD have elaborated the model, proved the results and written the paper.

**Algorithm 1 **Restricted sorting of genome A into B with optimal DCJs and indels

**Input: **Two linear genomes

**Output: **A restricted sequence of DCJ and indel operations sorting

cap genomes

[MERGING:]

**if **there is a cycle **then**

**while ****do**

extract

**if **a circular chromosome was created **then**

find a short-link (_{1}, _{2}); [

**if **(_{1}, _{2}) is a gap or a compact-run **then**

apply the optimal DCJ _{1}, _{2});

**else**

let _{1 }be the run that would be inversely split by _{1}, _{2});

**if **_{1}, _{2}) is the first inverted-split of _{1 }**then**

apply the optimal DCJ _{1}, _{2});

let _{2 }be the residual part of _{1};

**if **_{2 }is in a cycle with more runs **then**

_{2}; [_{2 }from its cycle in the next step

**else**

[_{1 }

find a link (_{1}, _{2}) such that _{1 }is a vertex created by a previous inverted-split of _{1};

apply the optimal DCJ _{1}, _{2});

**if ****then**

[ACCUMULATING:

**while **there is a long-run r in **do**

apply an optimal DCJ accumulating the labels of two partners of

**if **a circular chromosome was created **then**

find a short-link (_{1}, _{2}); [

**if **(_{1}, _{2}) is a gap or a compact-run **then**

apply the optimal DCJ _{1}, _{2});

**else**

let _{1 }be the run that would be inversely split by _{1}, _{2});

**if **_{1}, _{2}) is the first inverted-split of _{1 }**then**

apply the optimal DCJ _{1}, _{2});

**else**

[_{1 }

find a link (_{1}, _{2}) such that x_{1 }is a vertex created by a previous inverted-split of _{1}; [

apply the optimal DCJ _{1}, _{2});

[DCJ-SORTING:

**while **there is cycle **do**

extract a cycle from

**if **a circular chromosome was created **then**

find a short-link (_{1}, _{2}); [

[

apply the optimal DCJ _{1}, _{2});

invert all DCJs applied on genome

insert each

move up insertions that occur in circular chromosomes;

delete all

Acknowledgements

This research was partially supported by the Brazilian research agencies CNPq and FAPERJ.

This article has been published as part of