LIFL, UMR 8022 CNRS, Université Lille 1, INRIA Lille Nord Europe, Villeneuve d’Ascq, France

Abstract

Background

Segmental duplications in genomes have been studied for many years. Recently, several studies have highlighted a biological phenomenon called

Results

In this paper, we introduce and study a combinatorial problem, inspired by the breakpoint-duplication phenomenon, called the

Conclusions

We present the

Introduction

Gene duplication is an important source of variation in genomes. Recently, several studies have highlighted biological evidence for abundant segmental duplications that occur around the breakpoints of rearrangement events in the evolution of eukaryotes.

In mammals, evidence for a strong association between duplications, genomic instability and large-scale chromosomal rearrangements in primate evolution was first reported in

The association between segmental duplications and regions of breaks of synteny was also reported in the Drosophila species group. In

A rearrangement event is an operation that modifies the organization of a given genome by cutting the genome at some points called

In this paper, we are interested in using the segmental duplications of a given present-day genome that has undergone breakpoint-duplication rearrangements, in order to reconstruct a non-duplicated ancestral genome. We formally define the breakpoint-duplication phenomenon, and introduce a combinatorial problem called the

In the following, we study the Genome Dedoubling problem under the Double-Cut-and-Join (DCJ) and the reversal rearrangement models. In Section **Methods**, we formally present breakpoint-duplication (BD) rearrangements and the Genome Dedoubling problem. We show that the problem can always be regarded as a dedoubling problem on totally duplicated genomes. In Section **Genome dedoubling by DCJ**, we study the problem under the DCJ model, first on multichromosomal and then on unichromosomal genomes. We prove the NP-completeness of the problems by reduction to an APX-complete problem, and provide algorithms with a linear time complexity, except for an APX-complete part that is 2-approximable. In Section **Genome dedoubling by reversal**, we study the problem under the reversal model on oriented genomes, making use of some results of the Hannenhalli-Pevzner (HP) theory. In Section **Application**, an application for the reconstruction of a non-duplicated ancestor of

Methods

In this section we give the main definitions and notations of duplicated genomes and rearrangements. Next, we generalize the definitions of rearrangements in order to introduce a formal definition of

Duplicated genomes

A genome consists of linear or circular chromosomes that are composed of genomic markers. Markers are represented by signed integers, such that the sign of a marker indicates its orientation in the chromosome. By convention, – –

Each genome contains at most two occurrences of each marker. Two copies of the same marker in a genome are called paralogs. If a marker
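The representation described above can be illustrated with a minimal sketch. It assumes that a genome is stored as a list of chromosomes, each a Python list of signed integers; the function names `marker_counts` and `paralogous_markers` are hypothetical and are not taken from the paper.

```python
from collections import Counter


def marker_counts(genome):
    """Count the occurrences of each marker, ignoring orientation (sign)."""
    return Counter(abs(m) for chromosome in genome for m in chromosome)


def paralogous_markers(genome):
    """Return the set of markers that occur exactly twice in the genome.

    Raises ValueError if a marker occurs more than twice, since the
    genomes considered here contain at most two copies of each marker.
    """
    counts = marker_counts(genome)
    too_many = sorted(m for m, c in counts.items() if c > 2)
    if too_many:
        raise ValueError(f"markers with more than two occurrences: {too_many}")
    return {m for m, c in counts.items() if c == 2}


# Illustrative genome (an assumption, not an example from the paper):
# two chromosomes in which markers 2 and 3 each have two copies.
genome = [[1, -2, 3, 2], [4, -3, 5]]
print(sorted(paralogous_markers(genome)))  # [2, 3]
```

Counting on absolute values treats the two orientations of a marker as the same marker, which matches the convention that the sign only encodes orientation.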

**Definition 1 **

For example,

An

**Definition 2**

For example, ^{R}^{R}

Rearrangement

A rearrangement operation on a given genome cuts a set of adjacencies of the genome called _{▲}, and the new adjacencies are indicated in the genome by dots.
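The idea that a rearrangement cuts some adjacencies and creates new ones can be illustrated with a reversal on a single linear chromosome. The following is a minimal sketch, assuming a chromosome stored as a Python list of signed integers and ignoring chromosome ends; the names `reversal` and `cut_adjacencies` are hypothetical and are not taken from the paper.

```python
def reversal(chromosome, i, j):
    """Reverse the segment chromosome[i..j] (inclusive) and flip its signs."""
    if not 0 <= i <= j < len(chromosome):
        raise ValueError("expected 0 <= i <= j < len(chromosome)")
    segment = [-m for m in reversed(chromosome[i:j + 1])]
    return chromosome[:i] + segment + chromosome[j + 1:]


def cut_adjacencies(chromosome, i, j):
    """List the adjacencies cut by the reversal of chromosome[i..j].

    Adjacencies involving a chromosome end are ignored in this sketch.
    """
    cut = []
    if i > 0:
        cut.append((chromosome[i - 1], chromosome[i]))
    if j < len(chromosome) - 1:
        cut.append((chromosome[j], chromosome[j + 1]))
    return cut


# Illustrative chromosome (an assumption, not an example from the paper).
chromosome = [1, 2, -3, 4, 5]
print(cut_adjacencies(chromosome, 1, 2))  # [(1, 2), (-3, 4)]
print(reversal(chromosome, 1, 2))         # [1, 3, -2, 4, 5]
```

In this example the reversal cuts the adjacencies (1 2) and (–3 4) and creates the new adjacencies (1 3) and (–2 4).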

A

A

A

The

Breakpoint-duplication rearrangements

We now generalize the definitions of rearrangement operations to account for possible duplications at their breakpoints.

A

• first adding marker

• then applying a DCJ operation that cuts adjacencies

A

• first adding markers

• then applying a DCJ operation that cuts adjacencies

**Definition 3 **

In the sequel, if some markers are duplicated by a BD-DCJ operation, they are indicated in bold font in the initial genome. For example, the following rearrangement is a 2-BD-DCJ operation that acts on adjacencies (–2 –1) and (4 –3), and duplicates markers 2 and 4. The intermediate step resulting in the duplication of markers 2 and 4 is shown above the arrow.

To summarize, a BD-DCJ operation consists of a

**Definition 4 **

For example, the following rearrangement is a BD-reversal, and more precisely a 1-BD-DCJ operation, that acts on adjacencies (2 –1) and (–3 4) and duplicates marker 2.

A

**Definition 5 **

We now give an obvious but useful property that allows a BD-DCJ scenario to be reduced to a DCJ scenario.

**Proposition 1**^{R}

**Proof.** Let ^{R}^{R}

For example, in the following, a BD-reversal scenario of length 4 between

Genome dedoubling problem

We now state the genome dedoubling problems considered in this paper.

**Genome dedoubling problem:**

Given a duplicated genome _{dcj}_{rev}

**Proposition 2 **

The next proposition describes a further reduction of the genome dedoubling problem on a duplicated genome

**Proposition 3 **^{T} obtained from G by replacing every maximal subsequence of non-duplicated markers beginning with a marker

**Proof.** See proof in Additional file

**Supplemental proofs** Additional file 1 is a PDF file containing the proofs of Proposition 3, Property 1, Lemma 1, and Property 3.


For example, solving the DCJ (resp. reversal) genome dedoubling problem on ^{T}

In the sequel, Sections **Genome dedoubling by DCJ** and **Genome dedoubling by reversal** focus on the problem of finding a dedoubled genome

Results

In this section, we first study the Genome Dedoubling Problem under the DCJ model. Next, we study the problem under the reversal model on oriented genomes described in the Hannenhalli-Pevzner (HP) theory on sorting by reversal

Genome dedoubling by DCJ

In this section, _{dcj}

Dedoubled adjacency graph

**Definition 6**

An example of dedoubled adjacency graph is depicted in Fig.

The adjacency graph of


Note that all vertices in

Given a couple of paralogous markers

General sorting

In this section, we prove the following theorem:

**Theorem 1**_{i} be the maximum size of a subset of non-duplicated pairwise independent cycles in_{dcj}_{i}.

For example, in Fig. _{dcj}

**Property 1 **

_{i} of a set of non-duplicated pairwise independent cycles in the graph

_{i}

_{i} of a set of non-duplicated pairwise independent cycles by

**Proof.** See proof in Additional file

Algorithm 1 is an algorithm that provides a _{i}

We now have all the prerequisites to prove Theorem 1. The proof can be found in Additional file

**Lemma 1**

**Proof.** See proof in Additional file

From Lemma 1, the complexity of the Genome Dedoubling problem by DCJ follows immediately.

**Corollary 1**_{i} that is 2-approximable.

Sorting between linear unichromosomal genomes

In this section, we search for a minimum length DCJ scenario that transforms a duplicated genome consisting of a single linear chromosome into a dedoubled genome consisting of a single linear chromosome. The results of this section will then be used in the next section for the study of the Genome Dedoubling problem under the reversal model.

In this section and the sequel,

**Definition 7**

A DCJ operation that merges a cycle

Note that if

In the following, we always denote by _{i}

**Property 2**_{i}

**Proof.** See proof in Additional file

From Property 2, we then have the following lemma.

**Lemma 2**

**Proof.** See proof in Additional file

From Property 2 and Lemma 2, we immediately have the following complexity.

**Corollary 2**_{i}_{i} that is 2-approximable.

Genome dedoubling by reversal

We build and use a graph that behaves like the

Dedoubled overlap graph

For any couple

**Definition 8**

An example of dedoubled overlap graph is depicted in Fig.

a. The overlap graph of


The adjacency graph of


The overlap graph of


The vertex

The overlap graph of

A connected component of the graph

Given an oriented vertex

**Property 3**

**Proof.** See proof in Additional file

In the sequel, we focus on sorting oriented genomes using reversal dedoubling scenarios. A totally duplicated genome

Sorting an oriented valid-path genome

In this section, we consider an oriented valid-path genome

**Theorem 2**_{rev}

**Proof.** See proof in Additional file **■**

Sorting an oriented non-valid-path genome

In this section,

An edge

**Lemma 3 **

**Proof.** See proof in Additional file

**Theorem 3 **_{rev}

**Proof.** See proof in Additional file

From Lemma 1 and Property 2, the complexity of the Genome Dedoubling problem by reversal on oriented genomes follows immediately.

**Corollary 3**_{i} that is 2-approximable.

Application

We applied Algorithm 2 to reconstruct an ancestral chromosome for the chromosome 2 of

**Experimental results** Additional file 2 is a PDF file containing a description of an application of the methods to real Drosophila data.


Conclusion

In this paper, we introduced the Genome Dedoubling problem. In the DCJ rearrangement model, we showed that the problem is NP-complete in both the multichromosomal and the linear unichromosomal cases, by reduction to an APX-complete problem. For both cases, we described an algorithm that solves the problem in linear time, except for an APX-complete part that is 2-approximable. We also presented some results on the Genome Dedoubling problem by reversal, providing an algorithm that solves the problem on oriented genomes in quadratic time, except for an APX-complete part that is 2-approximable. The case of unoriented genomes in the reversal model will be treated in a future paper. Unsurprisingly, partial results obtained so far tend to show that the general distance formula can be written as _{rev}

The second obvious extension of the present work, as in the Genome Halving problem theory

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

The work was divided into four steps: 1) Formal introduction and reduction of the Genome Dedoubling problem. 2) Design of the study. 3) Study of the Genome Dedoubling problem by DCJ. 4) Study of the Genome Dedoubling problem by reversal on oriented genomes. AT participated in 2) and 3), and carried out 4). JSV participated in 2), 3) and 4). AO carried out 1), 2) and 3), and participated in 4). All authors participated in writing the manuscript, and approved the final manuscript.

Acknowledgements

We would like to thank Anne Bergeron for her useful comments on the breakpoint-duplication phenomenon, and the anonymous reviewers of the paper for their useful comments on the first version of the document.

This article has been published as part of