Department of Computer Science, Brown University, Providence (RI), USA

Abstract

Many cancer genome sequencing efforts are underway with the goal of identifying the somatic mutations that drive cancer progression. A major difficulty in these studies is that tumors are typically heterogeneous, with individual cells in a tumor having different complements of somatic mutations. However, nearly all DNA sequencing technologies sequence DNA from multiple cells, thus resulting in measurement of mutations from a mixture of genomes. Genome rearrangements are a major class of somatic mutations in many tumors, and the novel adjacencies (i.e. breakpoints) resulting from these rearrangements are readily detected from DNA sequencing reads. However, the assignment of each rearrangement, or adjacency, to an individual cancer genome in the mixture is not known. Moreover, the quantity of DNA sequence reads may be insufficient to measure all rearrangements in all genomes in the tumor. Motivated by this application, we formulate the k-minimum completion problem (

Introduction

Nearly all current genome sequencing studies sequence the DNA from a population of cells rather than from single cells. This is because present DNA sequencing technologies cannot sequence the DNA in a single cell without bias-inducing DNA amplification steps. In the majority of applications, sequencing such a population of cells is not problematic because the DNA in every cell is nearly identical. However, there are two notable exceptions: metagenomics (e.g. environmental sequencing or microbiome studies) and cancer sequencing. In the former case, the genomic differences between cells are due to the presence of mixtures of microorganisms in the sample. In the latter case, the genomic differences between cells are due to somatic mutations that accumulate in individual tumor cells during the progression of cancer.

In this paper, we formulate the problem of inferring the organization of each genome present in a mixture in the case where: (1) the individual genomes result from an unknown sequence of genome rearrangements applied to a known (reference) genome; (2) the adjacencies (breakpoints) of the genomes in the mixture are measured. This situation arises in cancer genome studies, where somatic structural aberrations (including inversions, translocations, duplications, deletions, or other rearrangements of large pieces of DNA) induce novel adjacencies, also called breakpoints, that join two nucleotides in the cancer genome that are not contiguous in the normal genome. In current cancer sequencing projects, these novel adjacencies are determined from alignments of paired-end reads from cancer DNA to the reference human genome.
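To make the notion of a novel adjacency concrete, the following is a minimal sketch and is not taken from the paper. It assumes that a chromosome is written as a signed gene order, that each gene g has a tail and a head extremity (encoded here as the hypothetical tuples `(g, "t")` and `(g, "h")`), and that an adjacency is an unordered pair of extremities. The function names and the toy genomes are illustrative only.

```python
def ends(gene):
    """Return the (left, right) extremities of a signed gene in reading order."""
    g = abs(gene)
    # A forward gene is read tail -> head; a reversed gene is read head -> tail.
    return ((g, "t"), (g, "h")) if gene > 0 else ((g, "h"), (g, "t"))


def adjacencies(chromosome, circular=True):
    """Return the set of adjacencies of one chromosome given as signed genes."""
    n = len(chromosome)
    last = n if circular else n - 1  # a linear chromosome has no wrap-around adjacency
    adj = set()
    for i in range(last):
        right = ends(chromosome[i])[1]
        left = ends(chromosome[(i + 1) % n])[0]
        adj.add(frozenset((right, left)))
    return adj


# Toy example (assumed data): the tumor genome is the reference with genes 2..3 inverted.
reference = [1, 2, 3, 4]
tumor = [1, -3, -2, 4]

# Novel adjacencies are those present in the tumor but absent from the reference.
novel = adjacencies(tumor) - adjacencies(reference)
```

In this toy case the inversion creates exactly two novel adjacencies, one at each end of the inverted segment; the adjacency between genes 2 and 3 inside the segment is unchanged.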

We formulate the

We emphasize that the

In the following sections, we first provide a precise formulation of the

Definitions and problem statement

In this section we present some preliminary definitions and give the formal definition of

A _{h }_{t}


**Genome and genome graph**. (a) A genome

The

A _{h}_{t}

As described above, a paired-end sequencing experiment provides the adjacencies

A _{S}_{A}_{B}

The genome graph is related to the

Our knowledge about a multi-genome can be incomplete. For example, a tumor is a mixture of different cancer genomes, and during the sequencing process, we obtain a

If

We use a distance function to distinguish between different completions. A

where

We now define the

**-Minimum Completion Problem (****-MCP) **Given a

As written, the

For two genomes

where

_{DCJ }_{1}, _{2}) =

Related work

In comparison to other genome rearrangement problems considered in the literature, the

Regarding the second feature, several authors have considered the problem of inferring missing adjacencies in a manner that optimizes a genome rearrangement distance. Notably,

Regarding the third feature, the genome median problem considers the problem of finding an ancestral genome that minimizes the distance between three given genomes

Results

In this section we first consider 1-MCP. We present linear-time algorithms that solve 1-MCP in the cases where: (i) the measured, incomplete genome has a single circular or linear chromosome; (ii) there are no restrictions on the chromosomal content of the measured, incomplete genome.
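For intuition about what such an algorithm must compute, the following is an exhaustive-search sketch for tiny instances; it is not the linear-time algorithm of this paper. It assumes that 1-MCP with a circular uni-chromosomal condition asks for a set of added adjacencies that turns the partial genome into a single circular chromosome at minimum DCJ distance from the reference, and it uses the standard formula for circular genomes, distance = n − c, where n is the number of genes and c is the number of cycles in the breakpoint graph. The extremity encoding `(g, "t")` / `(g, "h")` and all function names are hypothetical.

```python
def matchings(items):
    """Yield every perfect matching of a list with an even number of items."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for i, partner in enumerate(rest):
        for m in matchings(rest[:i] + rest[i + 1:]):
            yield [(first, partner)] + m


def partner_map(adjs):
    """Map each extremity to the extremity it is adjacent to."""
    p = {}
    for a, b in adjs:
        p[a], p[b] = b, a
    return p


def is_single_circle(adjs, n):
    """True if the adjacencies describe exactly one circular chromosome on n genes."""
    p = partner_map(adjs)
    start = x = (1, "t")
    seen = 0
    while True:
        g, e = x
        x = p[(g, "h" if e == "t" else "t")]  # cross the gene, then follow the adjacency
        seen += 1
        if x == start:
            return seen == n


def count_cycles(adjs_a, adjs_b):
    """Count alternating cycles in the breakpoint graph of two circular genomes."""
    pa, pb = partner_map(adjs_a), partner_map(adjs_b)
    unseen, cycles = set(pa), 0
    while unseen:
        start = x = unseen.pop()
        cycles += 1
        while True:
            y = pa[x]
            unseen.discard(y)
            x = pb[y]
            if x == start:
                break
            unseen.discard(x)
    return cycles


def solve_1mcp_c(partial, reference, n):
    """Brute force: return (distance, completion) with minimum n - c, or None."""
    used = {x for adj in partial for x in adj}
    free = [(g, e) for g in range(1, n + 1) for e in "th" if (g, e) not in used]
    best = None
    for extra in matchings(free):
        completion = list(partial) + extra
        if not is_single_circle(completion, n):
            continue
        d = n - count_cycles(completion, reference)
        if best is None or d < best[0]:
            best = (d, completion)
    return best


# Toy instance (assumed data): reference is the circular genome 1 2 3 4,
# and only the two novel adjacencies of an inversion were measured.
reference = [((1, "h"), (2, "t")), ((2, "h"), (3, "t")),
             ((3, "h"), (4, "t")), ((4, "h"), (1, "t"))]
partial = [((1, "h"), (3, "h")), ((2, "t"), (4, "t"))]
best = solve_1mcp_c(partial, reference, 4)
```

The search enumerates all perfect matchings of the free extremities, so its running time grows super-exponentially with the number of missing adjacencies. It is useful only as a correctness check on small examples.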

Next we prove that the unrestricted

1-MCP

Here, we consider the unrestricted 1-MCP and two restricted versions of 1-MCP: (1) the completed genome must consist of a single circular chromosome, denoted 1-MCP_{c}; (2) the completed genome must consist of a single linear chromosome, denoted 1-MCP_{ℓ}. We first show that the unrestricted version is solvable in linear time. Then, we show that 1-MCP_{c} can be solved in linear time. Finally, we prove a relation between 1-MCP_{c} and 1-MCP_{ℓ}, which implies that 1-MCP_{ℓ} is also solvable in linear time.

Note that 1-MCP_{ℓ }is a variation of the Block Ordering Problem (BOP) considered in _{ℓ }is simpler than the algorithm for the BOP in _{c }

We begin with the unrestricted 1-MCP, where we have the following result.

_{1}, . . ., _{r }_{i}_{i}


**Possible mixture trees when k = 1, 2**. (a) The only topology in 1-MCP. (b) Branch-tree and (c) path-tree topologies in 2-MCP.


**The breakpoint graph **. Breakpoint graph

1-MCP_{c}

Here we consider 1-MCP_{c}

The first constraint on partitioning of

The second constraint on partitioning of

We define the _{h}, g_{t}


**Adding adjacencies to a partial genome **. (a) The breakpoint graph

To find a completion of the partial genome ^{u }^{v }^{u }^{v }

(i)

(ii) If

(iii) If

Therefore, adding the adjacency {^{u }^{v }^{u }^{v }

If {_{c }_{b}_{b}

where _{b}

_{b}_{b}

Now suppose _{b}_{u }_{v }

Let _{b}_{b}

Thus, by induction hypothesis

Considering the above cases we have:

(i) After _{u }_{v }_{b}_{b}_{b}

(ii) After _{b}_{b}_{b}

(iii) After _{b}_{b}_{b}

By calculations above, choosing a pair {_{b}_{b}

We call a pair {_{c}

_{c}_{c }

_{c}

A linear time (in number of genes) algorithm for solving 1-MCP_{c }

_{c }

**Algorithm 1**: Solving 1-MCP_{c}

**Input **: Partial genome

**Output**: A 1-completion

1 **begin**

2 Construct the free-extremities graph

3

4 **while ****do**

5

6

7

8 **while ****do**

9

10

11

12 Add the single remaining excluded edge in

13 Output the resulting circular uni-chromosomal genome

14 **end**

1**-MCP**_{ℓ}**: linear uni-chromosomal completion**

In this section we consider 1-MCP under the chromosomal condition that the genome is linear and uni-chromosomal. We refer to this restricted problem as 1-MCP_{ℓ}. We relate solutions of 1-MCP_{ℓ} to solutions of 1-MCP_{c}.

Recall that _{c}_{ℓ}. The following theorem relates the solutions of 1-MCP_{c }_{ℓ}.

_{c }_{ℓ}, respectively. From any solution _{c }_{ℓ}. Also, from any solution _{ℓ }we obtain a solution _{c}

_{c}


**Relating 1-MCP_{c }and 1-MCP _{ℓ}**. (a) The breakpoint graph

where the last inequality follows from the definition of

Now suppose _{ℓ}, so

Thus by (3) and (4) we have _{c }_{ℓ }that are obtained from

Now suppose

Notice that the function

1-MCP_{ℓ} is solvable in linear time.

_{c }_{ℓ }by viewing _{c}_{c }_{ℓ}. Since all of these steps are done in linear time (in number of genes), the proof is complete. □

(3

In the unrestricted case of the

Let

The following proposition shows the relation between the edge-coloring of a genome graph and the edge color classes of the corresponding breakpoint graph.

Using the same argument as in Proposition 1 we have:

Now, in the following theorem we show a relation between the edge-coloring of a genome graph and the

(⇐) Now assume that _{1}, . . ., _{k }_{i }

Now, by Theorem 4 and using the following two classic theorems, we show that deciding whether there exists a valid solution to a (

Note that in Corollary 4 we only considered the unrestricted version of

2**-MCP**

In this section, we prove that the unrestricted 2-MCP, and the restricted 2-MCP where all chromosomes are circular (i.e.,

In order to provide the proof of this theorem, we need the following lemmas.

_{1}, . . ., _{m }

_{1 }and _{2 }are maximal (and circular) and we cannot add any edge to them. Also, for each _{ij }_{j}

which shows that the _{DCJ}_{DCJ }

Let Δ = ℓ_{1} ∨ ℓ_{2} ∨ ℓ_{3} be a clause (disjunction) of three literals. Define

By using basic Boolean rules we have Δ ⇔ ∨_{S∈ℓ(Δ)}
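The equivalence can be checked mechanically. The sketch below assumes that ℓ(Δ) denotes the seven conjunctions of three literals that fix a truth value for each of ℓ_{1}, ℓ_{2}, ℓ_{3} with at least one literal true; the definition is not recoverable from the text here, so this reading is an assumption. The code verifies by truth table that a three-literal clause is equivalent to the disjunction of those seven conjunctions.

```python
from itertools import product


def clause(l1, l2, l3):
    """The original clause: a disjunction of three literals."""
    return l1 or l2 or l3


# One sign pattern per conjunction: every pattern except "all literals false".
patterns = [p for p in product([True, False], repeat=3) if any(p)]


def disjunction_of_conjunctions(l1, l2, l3):
    """True if some conjunction (sign pattern) matches the literal values exactly."""
    return any((l1, l2, l3) == p for p in patterns)


# Compare the two formulas on all eight truth assignments of the literals.
equivalent = all(
    clause(*v) == disjunction_of_conjunctions(*v)
    for v in product([True, False], repeat=3)
)
```

Because the seven conjunctions are mutually exclusive, any satisfying assignment makes exactly one of them true, which is what lets each conjunction be handled separately.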

Now, suppose _{1}, . . ., Δ_{m. }_{j}_{j }

Now, consider an instance **remove **the edges in the other matching (see Figure


**Representing variables with cycles**. (a) A variable represented by a cycle, (b) a


**Representing conjunctions with cycles**. (a) Three cycles representing the literals

Let ℓ(_{1}), ℓ(_{2}), ℓ(_{3}) be three literals of variables _{1}, _{2}, _{3}, and Δ = (ℓ(_{1}) ∧ ℓ(_{2}) ∧ ℓ(_{3})) be a conjunction in

1. For each _{i}_{i}_{i}_{i}

2. Add three new edges, called

It is easy to see that an assignment _{i}

If the literals of a variable appear in at most

Combining Lemmas 2-4 gives the proof of Theorem 7.

_{1}, . . ., _{3m+1 }where each _{i }_{δi}_{δi}

We end this section by considering the restricted version of _{c}_{∅}. If opt(_{c}_{∅}) are the _{DCJ}_{c }_{∅}, respectively, then:

_{c }_{∅ }versions of

_{c }_{∅}, since there is no restriction in _{c}_{∅}). Second, for each solution to _{∅ }if the resulting genomes are not circular we can add new edges to the genomes and make them circular. By adding the new edges the number of cycles in the breakpoint graph does not decrease which implies that the _{DCJ}_{∅}). Therefore, these circular genomes form a solution of _{∅}. So opt(_{c}_{∅}) completing the proof. □

Combining this theorem and Theorem 7 we have

_{c }

Discussion and conclusion

In this paper we introduced the

There are numerous further directions to pursue. As noted in the introduction, the model described in this paper does not consider all the complexities of cancer genome sequencing. Most importantly, copy number aberrations (duplications and deletions) and errors in the measured adjacencies are features of cancer genome sequencing that should be addressed.

To handle errors, one might consider weighted versions of the

Another direction is to derive approximation algorithms. In the

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

All authors contributed equally to this work.

Acknowledgements

We thank the anonymous referees for helpful comments on an earlier version of this manuscript. This work was supported by a CAREER Award from the National Science Foundation (#1053753). In addition, BJR is supported by a Career Award at the Scientific Interface from the Burroughs Wellcome Fund and an Alfred P. Sloan Research Fellowship.

This article has been published as part of