Department of Mathematics, Simon Fraser University, Burnaby BC, V5A1S6, Canada

Department of Computer Science, University of British Columbia, Vancouver BC, V6T1Z4, Canada

INRIA Rhône-Alpes, 655 avenue de l'Europe, F-38344 Montbonnot, France

Laboratoire de Biométrie et Biologie Évolutive, CNRS and Université de Lyon 1, 43 boulevard du 11 novembre 1918, F-69622 Villeurbanne, France

Genome Informatics, Faculty of Technology and Institute for Bioinformatics, Center for Biotechnology (CeBiTec), Bielefeld University, 33594 Bielefeld, Germany

Abstract

Background

Recovering the structure of ancestral genomes can be formalized in terms of properties of binary matrices such as the Consecutive-Ones Property (C1P). The

Result

We show that, when restricted to binary matrices of degree two, which correspond to adjacencies, the genomic characters used in most ancestral genome reconstruction methods, this relaxed version of the Linearization Problem is polynomially solvable using a reduction to a matching problem. This result holds in the more general case where columns have bounded multiplicity, which models possibly duplicated ancestral genes. We also prove that for matrices with rows of degrees 2 and 3, without multiplicity and without weights on the rows, the problem is NP-complete, thus tracing sharp tractability boundaries.

Conclusion

As happened for the breakpoint median problem, which is also used in ancestral genome reconstruction, relaxing the definition of a genome turns an intractable problem into a tractable one. The relaxation suits some biological contexts, such as bacterial genomes with several replicons, possibly partially assembled. The algorithms can also be used as heuristics for hard variants. More generally, this work opens a way to better understand linearization results for ancestral genome structure inference.

Introduction

Genomes, meant as the linear organization of genes along chromosomes, have been successively modelled by several mathematical objects. Sturtevant and Tan

In order to scale up and handle the dozens of available genomes, another model was needed. Bergeron, Mixtacki and Stoye

An additional relaxation consists in allowing any graph, and not only a matching, to model genomes. Ancestral genome reconstruction methods often first compute sets of ancestral adjacencies (neighborhood relations between two genes)

Nevertheless, biological applications in general require linear genomes, which raises the question of

According to the definition of a linear structure, this can be described by some variant of the Consecutive-Ones property (C1P) of binary matrices

To the best of our knowledge, there is currently no tractability result known for the Linearization Problem. Currently all methods

In the present paper, we prove that the Linearization Problem for weighted adjacencies, when ancestral genomes can have several circular and linear chromosomes, is tractable. We prove this in a more general case, where multiple copies of columns are allowed. Here, instead of a permutation of the columns, one asks for a sequence on the alphabet of columns, containing at most

We show that this corresponds to finding a maximum weight

Results

A few definitions are needed to prove the two main results of this paper: (1) a polynomial algorithm for the linearization of degree 2 matrices with columns with multiplicity and weighted rows; and (2) an NP-completeness proof for the linearization of matrices with rows of degrees 2 and 3, even if all multiplicities and row weights are equal to one.

The

A binary matrix (or submatrix)

Given maximum copy numbers

The

These two problems are classical and have been defined independently from comparative genomics, but model well the linearization of genomes with linear chromosomes, or a single circular chromosome, respectively. But the general case would better be modelled by the following. A matrix is

MAX-ROW-component-mCi1P

**Input**. A matrix with maximum copy numbers assigned to all columns and weighted rows;

**Output**. A subset of rows of maximum cumulative weight such that the obtained submatrix is

Note that the problem is equivalent if some sequences are not required to be circular, so it also handles the case where both circular and linear chromosomes are allowed. It is a relaxation of the previous problems, so its NP-hardness does not follow from theirs. In fact, the problem for degree 2 matrices (adjacencies) turns out to be polynomially solvable, as we show in the next subsection.
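To make the consecutive-ones notions behind these problem statements concrete, the following is a minimal brute-force sketch for very small matrices without multiplicities. It simply tries every column order, so it is exponential and only meant as an illustration. The function names (`has_c1p`, `has_ci1p`) and the list-of-lists matrix encoding are assumptions made for this example, not notation from the paper.

```python
from itertools import permutations


def is_linear_run(bits):
    """True if the 1s of a row form one contiguous block (linear order)."""
    ones = [i for i, b in enumerate(bits) if b]
    return not ones or ones[-1] - ones[0] + 1 == len(ones)


def is_circular_run(bits):
    """True if the 1s of a row form one contiguous arc (circular order)."""
    # Count cyclic 0 -> 1 transitions; bits[i - 1] wraps around for i == 0.
    rises = sum(1 for i in range(len(bits)) if bits[i] and not bits[i - 1])
    return rises <= 1


def _has_order(matrix, row_ok):
    """Try every column permutation (exponential; illustration only)."""
    n_cols = len(matrix[0]) if matrix else 0
    for order in permutations(range(n_cols)):
        if all(row_ok([row[c] for c in order]) for row in matrix):
            return True
    return False


def has_c1p(matrix):
    """Consecutive-Ones Property: some linear column order works."""
    return _has_order(matrix, is_linear_run)


def has_ci1p(matrix):
    """Circular-Ones Property: some circular column order works."""
    return _has_order(matrix, is_circular_run)


# Three adjacencies forming a triangle on columns a, b, c.
triangle = [[1, 1, 0], [0, 1, 1], [1, 0, 1]]
print(has_c1p(triangle), has_ci1p(triangle))
```

The triangle of adjacencies cannot be laid out on a single linear chromosome, but it fits a circular one, which is the kind of instance the relaxed problem accepts.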

A solution for matrices of degree two with weighted rows and multiplicities

For a degree 2 matrix _{M }_{M}_{M}_{M}_{M }

A 2

**Lemma 1**

We give a sketch of the proof. For more details, we refer the reader to

Conversely, assume _{G' }(_{0 }and for each _{G' }(_{0}, _{G' }(_{0 }then this cyclic walk satisfies conditions (i) and (ii) for vertices in _{0}, then after omitting all occurrences of _{0 }we obtain a cyclic walk satisfying conditions (i) and (ii) for vertices in

It follows that solutions to the MAX-ROW-component-mCi1P for matrix _{M}^{3/2})), where
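As a hedged illustration of the degree-based view of degree 2 matrices, the sketch below treats each row as an edge between its two columns and checks that no column is used by more than twice its maximum copy number of rows. This degree bound is our reading of the characterization in this subsection and is stated here as an assumption; for copy number one it reduces to the standard fact that a graph of maximum degree two is a disjoint union of paths and cycles. The function name, the pair-based row encoding, and the assumption that rows are distinct are all choices made for this example.

```python
from collections import Counter


def satisfies_degree_bound(adjacencies, copy_number):
    """Check the (assumed) degree condition for a degree 2 matrix.

    adjacencies: iterable of distinct pairs (a, b), one per row.
    copy_number: dict mapping a column to its maximum copy number;
                 columns that are missing default to 1.
    """
    degree = Counter()
    for a, b in adjacencies:
        degree[a] += 1
        degree[b] += 1
    return all(degree[v] <= 2 * copy_number.get(v, 1) for v in degree)


rows = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "d")]
print(satisfies_degree_bound(rows, {}))          # column a is used 3 times
print(satisfies_degree_bound(rows, {"a": 2}))    # two copies of a allowed
```

With one copy of every column, `a` would need three neighbours; allowing two copies of `a` lifts the bound to four.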

Given an edge weighted graph _{1}, _{2}, ..., _{f(x) }be in _{x }_{y }_{1}, _{x}_{f(x)}, _{x}_{x}_{y}_{1}, _{y}_{f(y)}, _{y}

**Reduction used to transform the maximum weight f-matching problem to the maximum weight matching problem**. Edge weights are all one, unless otherwise indicated, and

**Property 1**

An unweighted version of this property was shown in

Since a maximum weight matching can be found in time ^{3/2}) algorithms for the maximum weight
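The reduction from maximum weight f-matching to maximum weight matching can be sketched as follows. Each vertex x is replaced by f(x) copies, and each edge {x, y} by two gadget vertices joined to each other and to all copies of x and y, respectively. The weight assignment used here (every gadget edge of an original edge e gets weight w(e), assuming non-negative weights) is one choice under which the optimum of the matching instance equals the total edge weight plus the optimum f-matching weight; it is an assumption of this sketch and not necessarily the weighting shown in the figure. The brute-force solvers are exponential and serve only to check the correspondence on a toy instance; all names are hypothetical.

```python
from itertools import combinations


def reduce_f_matching(edges, f):
    """Build the gadget graph. edges: list of (x, y, w); f: vertex -> capacity."""
    gadget = []
    for idx, (x, y, w) in enumerate(edges):
        ex, ey = ("e", idx, x), ("e", idx, y)       # two vertices per edge
        gadget.append((ex, ey, w))                   # "edge not selected"
        gadget.extend(((x, i), ex, w) for i in range(f[x]))  # copies of x
        gadget.extend(((y, j), ey, w) for j in range(f[y]))  # copies of y
    return gadget


def brute_force_max_matching(edges):
    """Maximum weight of a matching, by exhaustive search (tiny inputs only)."""
    best = 0

    def extend(start, used, total):
        nonlocal best
        best = max(best, total)
        for i in range(start, len(edges)):
            u, v, w = edges[i]
            if u not in used and v not in used:
                extend(i + 1, used | {u, v}, total + w)

    extend(0, frozenset(), 0)
    return best


def brute_force_max_f_matching(edges, f):
    """Maximum weight edge subset with every vertex x in at most f[x] edges."""
    best = 0
    for r in range(len(edges) + 1):
        for subset in combinations(edges, r):
            deg = {}
            for x, y, _ in subset:
                deg[x] = deg.get(x, 0) + 1
                deg[y] = deg.get(y, 0) + 1
            if all(deg[v] <= f[v] for v in deg):
                best = max(best, sum(w for _, _, w in subset))
    return best


edges = [("a", "b", 3), ("b", "c", 2), ("a", "c", 4), ("a", "d", 1)]
f = {"a": 2, "b": 1, "c": 1, "d": 1}
total = sum(w for _, _, w in edges)
print(brute_force_max_matching(reduce_f_matching(edges, f)) - total)
```

In practice the exhaustive search would be replaced by a polynomial-time maximum weight matching routine applied to the gadget graph.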

Intractability for matrices of degree larger than two

This tractability result does not generalize to matrices of higher degree: the MAX-ROW-component-Ci1P Problem is already NP-complete for unweighted matrices with rows of degrees 2 and 3. Note that the result for unweighted matrices also implies NP-completeness for the cases where rows are weighted and/or columns have multiplicities.

We will first show that the following hypergraph covering problem is NP-complete. Here we say that a hypergraph _{2}, _{3}), where _{2 }(resp., _{3}) is its set of 2-edges (resp., 3-edges). We denote the ^{S}

**Definition 1 **_{2}, _{3}) _{2 }∪ _{3}

**c**(

**c**(_{2 }**c**(_{3};

**c**(

**c**(

Informally, a graph covering of a 2,3-uniform hypergraph is a graph constructed by picking an edge from each 2-edge, and a pair of edges from each 3-edge.
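The informal description above can be turned into a small enumeration, given here as a sketch under assumptions. Each 2-edge contributes itself, each 3-edge contributes two of the three possible edges on its vertices, and the covering graph is taken to be the union of the chosen edges as a set, so an edge chosen twice counts once; that last point is our assumption. The check `is_paths_and_cycles` uses the fact that a simple graph of maximum degree two is a disjoint union of paths and cycles. All function names are hypothetical.

```python
from itertools import combinations, product


def coverings(two_edges, three_edges):
    """Yield every graph covering as a set of frozenset edges."""
    choices = []
    for h in three_edges:
        pairs = [frozenset(p) for p in combinations(sorted(h), 2)]
        choices.append(list(combinations(pairs, 2)))   # 3 ways per 3-edge
    base = {frozenset(e) for e in two_edges}
    for pick in product(*choices):
        graph = set(base)
        for chosen_pair in pick:
            graph.update(chosen_pair)
        yield graph


def is_paths_and_cycles(graph):
    """True if every vertex has degree at most two."""
    degree = {}
    for edge in graph:
        for v in edge:
            degree[v] = degree.get(v, 0) + 1
    return all(d <= 2 for d in degree.values())


def has_path_cycle_covering(two_edges, three_edges):
    """Exhaustively search for a covering made of paths and cycles."""
    return any(is_paths_and_cycles(g) for g in coverings(two_edges, three_edges))


print(has_path_cycle_covering([("a", "b")], [("a", "b", "c")]))
print(has_path_cycle_covering([("a", "b"), ("a", "c"), ("a", "d")], []))
```

The first hypergraph admits the path a, b, c; in the second, vertex `a` has three neighbours in every covering, so at least one hyperedge would have to be removed.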

**Problem 1 (The 2,3-Uniform Hypergraph Covering by Cycles and Paths by Edge Removal Problem (23UCR Problem))**

Here we will show that Problem 1, the 23UCR Problem, is NP-complete. Later in this section we will show that this implies that the MAX-ROW-component-Ci1P Problem is NP-complete for matrices with rows of degrees 2 and 3. First, we must define the following NP-complete version of 3SAT, which we will use to show NP-completeness of Problem 1.

**Problem 2 (The 3SAT(2,3) Problem)**

We show that this version of 3SAT is NP-complete using a very similar proof to the one in

**Theorem 1**

^{1}, ^{2},..., ^{k}^{i}

This "cycle" of implications (2-clauses) on ^{1},..., ^{k}^{i}^{i}
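The occurrence-splitting step with a cycle of implications can be sketched as below. This is only an illustration of the general technique, under assumptions: literals are encoded as non-zero integers (DIMACS style), every variable with at least two occurrences is split, and the exact normal form required by Problem 2 is not enforced. The brute-force satisfiability test is exponential and is included only to check that satisfiability is preserved on tiny formulas; all names are hypothetical.

```python
from itertools import product


def split_occurrences(clauses):
    """Give each occurrence of a variable its own copy, tied by 2-clauses."""
    occurrences = {}
    for ci, clause in enumerate(clauses):
        for li, lit in enumerate(clause):
            occurrences.setdefault(abs(lit), []).append((ci, li))
    new_clauses = [list(c) for c in clauses]
    next_var = max(occurrences, default=0) + 1
    for var, occ in occurrences.items():
        if len(occ) < 2:
            continue                      # a single occurrence needs no copies
        copies = [var] + list(range(next_var, next_var + len(occ) - 1))
        next_var += len(occ) - 1
        for copy, (ci, li) in zip(copies, occ):
            sign = 1 if clauses[ci][li] > 0 else -1
            new_clauses[ci][li] = sign * copy
        for i, copy in enumerate(copies):  # cycle: copy_i implies copy_{i+1}
            new_clauses.append([-copy, copies[(i + 1) % len(copies)]])
    return new_clauses


def satisfiable(clauses):
    """Exhaustive satisfiability test (tiny formulas only)."""
    variables = sorted({abs(l) for c in clauses for l in c})
    for values in product([False, True], repeat=len(variables)):
        assign = dict(zip(variables, values))
        if all(any(assign[abs(l)] == (l > 0) for l in c) for c in clauses):
            return True
    return False


phi = [[1, 2, 3], [-1, 2], [1, -2], [-1, -2, -3]]
psi = split_occurrences(phi)
print(satisfiable(phi), satisfiable(psi))
```

Because the implications form a cycle, all copies of a variable must take the same truth value, so the new formula is satisfiable exactly when the original is, and each copy occurs three times: once in an original clause and twice in the cycle.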

We now show that the 23UCR Problem is NP-complete by reduction from 3SAT(2,3).

**Theorem 2**

Given a 3SAT(2,3) formula _{1}, ..., _{n}_{ϕ }_{ϕ }_{2 }+ 3_{3 }such vertices). The design of _{2 }+

Figure ^{1 }and ^{2}, and its one negative occurrence ¬^{1}, ^{2},

(a) The variable gadget for variable ^{1}, ^{2 }and ¬

**(a) The variable gadget for variable x with literal vertices x** (b) 2-clause gadget with literal vertices

Figure

**Lemma 2**

For the variable gadget corresponding to

**Two coverings of the variable gadget for x, when x is set to: (a) false, or (b) true in the assignment**.

For a 2-clause (resp., 3-clause)

**A covering of the (a) 2-clause gadget; and (b) 3-clause gadget, where literal p is satisfied**.

In the above covering, since exactly _{2 }+

"⇐" Now we show that if

For hypergraph

(i) in every clause gadget, at least one literal vertex is selected, and

(ii) for every

We call a graph

**Observation 1**

In the remainder of this lemma, we will give a set of transformations that converts a valid covering into an expected behavior covering while preserving the validity of the covering at each step. Assume that we have a valid covering of _{ϕ}

We say that a variable gadget is _{ϕ }

**Claim 1 **_{ϕ }into a valid covering that contains no undecided variable gadgets

_{ϕ}_{ϕ}

First, assume that the auxiliary hyperedge is removed in an undecided variable gadget. The set of possible configurations that the gadget can be in is depicted on the left in Figure

**The transformation in the case when the auxiliary hyperedge of the variable gadget is removed**.

**(a) a 3-edge with a double arrow pointing to two vertices**. (b)-(c) the two configurations that are represented by (a).

We can transform any configuration of Figure _{ϕ}^{1}, ^{2 }and ¬x) affected by the transformation has only decreased or remained the same, it follows that the covering of _{ϕ }

Hence, we can assume that the auxiliary hyperedge is present in any undecided variable gadget. Without loss of generality we can then assume that any configuration of the undecided variable gadget must be in one of the two forms depicted on the left in Figure

**Two sets of possible configurations of an undecided variable gadget and the corresponding transformation of the covering**. (Note that if edge {

We have the following claim.

**Claim 2 **_{ϕ}

_{ϕ}_{ϕ}

By Claims 1 and 2, at least _{2 }hyperedges have been removed from the variable and 2-clause gadgets, and since in any valid covering this is the maximum number of hyperedges which can be removed, we have the following corollary.

**Corollary 1 **
_{ϕ }into a valid covering where:

_{ϕ}

We have the following claim.

**Claim 3 **_{ϕ }into an expected behavior valid covering

_{ϕ }

**Two sets of possible configurations of a decided variable gadget without expected behavior and the corresponding transformations**.

Secondly, in the valid covering of _{ϕ }

**The only possible configuration of a 2-clause gadget without expected behavior and the corresponding transformation**.

Thirdly, in the valid covering of _{ϕ }_{ϕ}

It follows by Observation 1 and Claim 3, that if _{ϕ }

Finally, since the construction of _{ϕ }

Let the component-Ci1P by Row Removal Problem be the corresponding decision version of the MAX-ROW-component-Ci1P Problem as follows.

**Problem 3 (The component-Ci1P by Row Removal Problem)**

We now show that the component-Ci1P by Row Removal Problem is NP-complete for matrices with rows of degrees 2 and 3.

The following lemma shows the correspondence between the component-Ci1P by Row Removal Problem for matrices with rows of degrees 2 and 3 and the 23UCR Problem. A 2,3-uniform hypergraph _{H }

**Lemma 3 **_{H }has the component-Ci1P after removing at most k rows

_{H }_{H }_{H }

Conversely, assume that each component _{1},...,_{|C|}} of the submatrix of _{H }_{H}

By Theorem 2 and Lemma 3 it follows that the component-Ci1P by Row Removal Problem is NP-complete for matrices with rows of degrees 2 and 3. Since this decision problem is NP-complete, it follows that the MAX-ROW-component-Ci1P Problem is also NP-complete for matrices with rows of degrees 2 and 3.

**Theorem 3**

Discussion/Conclusion

There are exact optimization

We report here two results: (1) a polynomial variant of the Linearization Problem, when the output allows paths and cycles and a maximum number of copies per gene, in the case of degree 2 matrices with weighted rows; and (2) an NP-completeness proof of the same problem for matrices with rows of degrees 2 and 3, even when multiplicities and weights are equal to one.

It is not the first time that a slight change in the formulation of a problem dramatically changes its computational status

Moreover, considering genomes composed of linear and circular segments is appropriate for bacterial genomes, where linear segments can be seen as segments of incompletely recovered circular chromosomes. Currently no ancestral genome reconstruction method is able to handle bacterial genomes with plasmids; existing methods are restricted to eukaryotes or to bacterial genomes with a single circular chromosome. For example, Darling

Furthermore, genes are often duplicated in genomes, and in the absence of a precise and efficient phylogenetic context, which is still absent for bacteria (no ancestral genome reconstruction method is able to handle horizontal transfers for example), a multi-copy family translates into a multiplicity in the problem statements.

The ability to obtain such genomes in polynomial time from adjacencies also opens interesting perspectives for phylogenetic scaffolding of extant bacterial genomes

These applications are left as future work.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

JM, MP, RW, CC and ET formalized and solved the linearization problems and wrote the paper.

Acknowledgements

We thank Jens Stoye for useful discussions. JM and CC are funded by NSERC Discovery Grants. MP is funded by a Marie Curie Fellowship from the Alain Bensoussan program of ERCIM. ET is funded by the Agence Nationale pour la Recherche, Ancestrome project ANR-10-BINF-01-01.

This article has been published as part of