Abstract
Many cancer genome sequencing efforts are underway with the goal of identifying the somatic mutations that drive cancer progression. A major difficulty in these studies is that tumors are typically heterogeneous, with individual cells in a tumor having different complements of somatic mutations. However, nearly all DNA sequencing technologies sequence DNA from multiple cells, thus resulting in measurement of mutations from a mixture of genomes. Genome rearrangements are a major class of somatic mutations in many tumors, and the novel adjacencies (i.e. breakpoints) resulting from these rearrangements are readily detected from DNA sequencing reads. However, the assignment of each rearrangement, or adjacency, to an individual cancer genome in the mixture is not known. Moreover, the quantity of DNA sequence reads may be insufficient to measure all rearrangements in all genomes in the tumor. Motivated by this application, we formulate the kminimum completion problem (kMCP). In this problem, we aim to reconstruct k genomes derived from a single reference genome, given partial information about the adjacencies present in the mixture of these genomes. We show that the 1MCP is solvable in linear time in the cases where: (i) the measured, incomplete genome has a single circular or linear chromosome; (ii) there are no restrictions on the chromosomal content of the measured, incomplete genome. We also show that the kMCP problem, for k ≥ 3 in general, and the 2MCP problem with the doublecutandjoin (DCJ) distance are NPcomplete, when there are no restriction on the chromosomal structure of the measured, incomplete genome. These results lay the foundation for future algorithmic studies of the kMCP and the application of these algorithms to real cancer sequencing data.
Introduction
Nearly all current genome sequencing studies sequence the DNA from a population of cells rather than from single cells. This is because present DNA sequencing technologies cannot sequence the DNA in a single cell without biasinducing DNA amplification steps. In the majority of applications, sequencing such a population of cells is not problematic because the DNA in every cell is nearly identical. However, there are two notable examples: metagenomics (e.g. environmental sequencing or microbiome studies) and cancer sequencing. In the former case, the genomic differences between cells are due to the presence of mixtures of microorganisms in the sample. In the latter case, the genomic differences between cells are due to somatic mutations that accumulate in individual tumor cells during the progression of cancer [1].
In this paper, we formulate the problem of inferring the organization of each genome present in a mixture in the case where: (1) the individual genomes result from an unknown sequence of genome rearrangements from a known (reference) genome; (2) the adjacencies (breakpoints) of the genomes in the mixture are measured. This situation arises in cancer genome studies where somatic structural aberrations (including inversions, translocations, duplications, deletions, or other rearrangements of large pieces of DNA) induce novel adjacencies, also called breakpoints, that join in the cancer genome two noncontiguous nucleotides from the normal genome. In current cancer sequencing projects, these novel adjacencies are determined from alignments of pairedend reads from cancer DNA to the reference human genome [2,3]. However, these approaches generally do not measure all adjacencies present in the tumor. For example, the quantity of DNA sequence reads (coverage) may be insufficient to measure all adjacencies in all genomes in the tumor, particularly adjacencies that are present in a minority of cancer cells. Moreover, alignment of reads to repetitive regions is challenging, particularly for short reads produced by current sequencing technologies, and thus some adjacencies may not be reliably measured.
We formulate the kMinimum Completion Problem (kMCP) of determining the k genomes present in a mixture from a set of measured adjacencies that minimize the total distance between the reference genome and the k measured (i.e. cancer) genomes. The kMCP is a general problem that encompasses different subproblems that depend on the genomic distance used and the desired chromosomal content of the measured genomes. We show the following results: (1) A linear time algorithm for the 1MCP in the double cut and join (DCJ) distance [4] when the desired genome has no restrictions on its chromosomal content; (2) A linear time algorithm for the 1MCP in the DCJ distance when the desired genome has a single circular or linear chromosome; (3) the kMCP is NPcomplete for any distance when k ≥ 3; and (4) the 2MCP with DCJ distance is NPcomplete when the desired genome has no restrictions on its chromosomal content, or when the desired genome has all circular chromosomes.
We emphasize that the kMCP does not model all the issues arising in cancer sequencing: in particular, we restrict attention to copyneutral structural variants, and ignore single nucleotide mutations, small indels, or other large copy number aberrations. Single nucleotide mutations and small indels can be addressed separately as they do not produce novel adjacencies of the type studied in kMCP. Copy number aberrations are common in cancer, but appropriate handling of these mutations when measured in a heterogeneous mixture introduces an entirely different set of challenges: e.g. a deletion of a genomic segment in half of the cells in the mixture with a duplication of the same segment in the other half of the cells will be difficult to distinguish from no copy number change. Finally, we assume that all measured adjacencies are real, while in fact there are likely to be false positive adjacencies. Extending the model to consider these additional complexities is left for future work.
In following sections, we first provide a precise formulation of the kMCP and describe related work. Then, we provide algorithms and proofs of the complexity of the problem in various cases.
Definitions and problem statement
In this section we present some preliminary definitions and give the formal definition of kMCP.
A gene g is an oriented sequence of nucleotides, with two extremities: a head g_{h }and a tail g_{t}. An adjacency is an unordered pair of gene extremities. A genome on n genes is a set of adjacencies such that each of the 2n gene extremities in is a member of at most one adjacency in . The gene extremities which are not members of any adjacency in are called telomeres of , and we denote the set of all telomeres by (Figure 1a). Through this work, we assume that the genes of a genome are distinct.
Figure 1. Genome and genome graph. (a) A genome on the set of genes {1, 2, 3, 4, 5} with two chromosomes (one linear and one circular). . (b) The genome graph (black edges) of with additional edges (dotted) connecting the extremities of the same gene. There is one cycle component and one path component.
The genome graph of a genome is a graph whose labeled vertices are the gene extremities in , and whose edge set is . We denote the genome graph of by . Because each extremity is in at most one adjacency of , the graph is a matching graph (not necessarily perfect). Note that the genome graph is uniquely determined by the genome, and conversely. For convenience, we also define the augmented genome graph to be the genome graph augmented with additional edges connecting extremities of the same gene, i.e., is the graph whose labeled vertices are the gene extremities in G, and whose edge set is .
A chromosome of is the set of all adjacencies and telomeres of gene extremities in a connected component of the augmented genome graph (Figure 1b). A chromosome is linear (resp. circular) if the corresponding connected component is a path (resp. cycle) (Figure 1b). Note that an adjacency {g_{h}, g_{t}} represents a circular chromosome with the single gene g. A genome is circular or linear if all of its chromosomes are circular or linear, and we say it is mixed if it has both circular and linear chromosomes. A genome is unichromosomal if it has only one chromosome, and it is multichromosomal, otherwise. A chromosomal condition is a condition on the number or type of chromosomes in a genome. For example we can describe the structure of a genome by two chromosomal conditions: being (i) unichromosomal, and (ii) circular.
As described above a pairedend sequencing experiment provides the adjacencies of the sequenced genome relative to the genes from a reference genome. However, our knowledge about a genome's adjacencies is typically incomplete. For a set of chromosomal conditions, a partial genome on n genes is a set of adjacencies such that there exists a set of pairs of gene extremities such that is a genome with chromosomal condition . When is clear in the context we will say partialgenome instead of partial genome. The problems we study below involve adding the missing adjacencies in partial genomes to complete them into genomes with chromosomal condition . Sometimes we have an idea about the number or the structure of chromosomes in a genome. We define a completion of a partial genome relative to these chromosomal conditions. If is a genome, we say provided . A of a partial genome is a genome with and satisfying the conditions in . When is clear in the context, we just say completion instead of .
A multigenome is a mixture of genomes with the same set of genes. Formally, the multigenome formed from genomes is a multiset obtained from , the disjoint union of (For a multiset S and an element r, if c_{S}(r) is the number of copies of r in S, the disjoint union of two multisets is a multiset in which each element r appears c_{A}(r) + c_{B}(r) times.). Note that the partition of the adjacencies in into is not known. There is a corresponding genome graph, a multigraph whose vertices are the gene extremities, and whose edge set is the multiset . We denote the genome graph of a multigenome by .
The genome graph is related to the breakpoint graph in genome rearrangement studies. The breakpoint graph of the genomes is an edgecolored multigraph whose labeled vertices are the 2n gene extremities and whose edges are all the adjacencies in , with each edge assigned a color according to its genome of origin. Thus, the only difference between the breakpoint graph and the genome graph is the lack of edgecoloring in the latter, reflecting our inability to measure the origin of each adjacency.
Our knowledge about a multigenome can be incomplete. For example a tumor is a mixture of different cancer genomes, and during sequencing process, we obtain a mixture of adjacencies from these genomes. We represent the mixtures of adjacencies by a partial multigenome. A partial multigenome is a multiset , where each is partial genome. We define the genome graph of a partial multigenome analogously to a multigenome.
If k is a positive integer and is a partial multigenome, a kcompletion of is a family of k genomes , such that . Note that existence of a completion for a partial (multi) genome is dependent on the structure of the partial (multi) genome and the chromosomal conditions. Also, the existence of a completion does not imply its uniqueness.
We use a distance function to distinguish between different completions. A distance function on pairs of genomes (with the same set of genes), is a measure of dissimilarity between the genomes. Having selected a pairwise distance function we must define a distance between the k genomes in a mixture. Motivated by the fact that the different cancer genomes in a tumor are obtained by somatic genome rearrangements from a healthy genome, we model the evolution of the cancer genomes by a rooted tree in which all the cancer genomes are descendants of the healthy one. Suppose represents a healthy genome, and a mixture of k cancer genomes obtained by rearrangements of the genome . A mixture tree is a rooted tree on such that the root vertex is and k genomes in are (some of) the vertices in . If ϕ is a distance function on a pair of genomes, then the ϕvalue of , denoted by is defined as follows:
where E is the set of edges in .
We now define the kMinimum Completion Problem.
kMinimum Completion Problem (kMCP) Given a partial multigenome , a positive integer k, a reference genome , and a distance function ϕ, find a kcompletion and a mixture tree such that is minimum over all kcompletions and mixture trees. If no kcompletion exists for , we say that this kMCP does not have a valid solution. We say the kMCP is unrestricted if , and is restricted, otherwise.
As written, the kMCP is a general problem that encompasses many subproblems depending on chromosomal condition set and the distance ϕ. Common distances in genome rearrangement studies include the breakpoint distance [5], the HannenhalliPevzner distance [6] (which generalizes the reversal distance [7]), and the doublecutandjoin (DCJ) distance [4]. Below we will use the DCJ distance, which approximates the other distances [8].
For two genomes and on the same set of n genes, their doublecutandjoin (DCJ) distance, denoted by , is equal to
where is the number of cycles in and is the number of paths in B with odd number of vertices [8].
Remark. When at least one of the are circular we have and d_{DCJ }(G_{1}, G_{2}) = n  c. Thus, having a larger number of cycles in their breakpoint graph is equivalent to having a smaller distance.
Related work
In comparison to other genome rearrangement problems considered in the literature, the kMCP has three distinguishing features. (1) The input is a mixture of adjacencies from multiple genomes and the genome of origin of each adjacency is unknown. (2) The set of adjacencies is incomplete: not every adjacency from every genome in the mixture is measured. (3) The ancestral relationships between the genomes in the mixture are unknown, and might include both "ancestral" and "present day" genomes. Some of these features have been considered individually in other work, but to our knowledge no previous work has considered all three together. The first feature bears some resemblance to the genome halving problem [9] of finding the doubled ancestor genome by minimizing a rearrangement distance. This problem and further generalizations to polyploidization [10] involves partitioning (or coloring) adjacencies to minimize a rearrangement distance. However, in general no adjacencies are missing and the distance is pairwise (i.e., no tree) in contrast to the 2MCP.
Regarding the second feature, several authors have considered the problem of inferring missing adjacencies in a manner that optimizes a genome rearrangement distance. Notably, [11] and [12] consider the problem of computing reversal distance between pairs of partially assembled genomes that are provided as unordered sequences of contigs. These problems were motivated by limitations in DNA sequence technologies that result in most wholegenome assemblies being highly fragmented and comprised of contigs whose relative ordering is unknown. These problems are variations of the 1MCP, where the reference genome also has missing adjacencies. In particular, [12] orient sets of contigs from two genomes in such a way that the number of cycles in the breakpoint graph of the resulting genomes is maximized, which they note "has been shown to approximate very well the reversal distance between them." However, there is no work on extending this analysis to more than two genomes.
Regarding the third feature, the genome median problem considers the problem of finding an ancestral genome that minimizes the distance between three given genomes [5,13]. This is different from kMCP in that the three individual genomes are known (rather than mixed) and the genomes are complete with no missing adjacencies. Also, in the median problem the topology of the phylogenetic tree has been already inferred, while in kMCP we have to find an optimal topology for the phylogenetic tree as well.
Results
In this section we first consider the 1MCP problem. We present linear time algorithms that solve 1MCP in the cases where: (i) the measured, incomplete genome has a single circular or linear chromosome; (ii) there are no restrictions on the chromosomal content of the measured, incomplete genome.
Next we prove that the unrestricted kMCP is NPcomplete when k ≥ 3 for any distance function ϕ. Finally, we show that the unrestricted 2MCP, and the restricted 2MCP where all chromosomes are circular (i.e., ), are NPcomplete for DCJ distance.
1MCP
Here, we consider the unrestricted 1MCP and two restricted versions of 1MCP problem: (1) the chromosomal condition set is {circular, unichromosomal}, which we denote by 1MCP_{c}; (2) the chromosomal condition set is {linear, unichromosomal}, which we denote by 1MCP_{ℓ}. We first show that unrestricted version is linearly tractable. Then, we show that we can solve the 1MCP_{c }in linear time. Finally, we prove a relation between 1MCP_{c }and, 1MCP_{ℓ }which implies that 1MCP_{ℓ }is also solvable in linear time.
Note that 1MCP_{ℓ }is a variation of the Block Ordering Problem (BOP) considered in [12]. In our terminology, the BOP considers two partial genomes, and aims to complete both partial genomes into linear, unichromosomal genomes such that the pairwise distance between the completed genome is minimal. In [12], Gaul and Blanchette provide a linear algorithm for BOP. The algorithm we present for 1MCP_{ℓ }is simpler than the algorithm for the BOP in [12], and our algorithm is obtained from a straightforward algorithm (Algorithm 1 below) which solves 1MCP_{c }in linear time.
We begin with the unrestricted 1MCP, where we have the following result.
Theorem 1. The unrestricted 1MCP with DCJ distance is linearly tractable.
Proof. In 1MCP we have a single partial genome and a reference genome (see Figure 2a). Since both and are matchings over the gene extremities, their breakpoint graph consists of some paths and cycles. Suppose P_{1}, . . ., P_{r }are all the paths such that the first and their last edges are adjacencies in . An optimal completion for can be obtained by adding an edge to which connects the end points of each P_{i}, for 1 ≤ i ≤ r (see Figure 3), since we only can add edges between the vertices which are not incident with any edge in , i.e., the end vertices of P_{i}'s. Note that adding other possible edges just create longer paths in . □
Figure 2. Possible mixture trees when k = 1, 2. (a) The only topology in 1MCP. (b) Branchtree and (c) pathtree topologies in 2MCP.
Figure 3. The breakpoint graph with possible edges to be added to the adjacencies of . Breakpoint graph consisting of paths and cycles. Thick edges are in , and thin edges are in . Dashed edges are the edges that should be added to for the paths whose first and last edges are in .
1MCP_{c}: circular unichromosomal completion
Here we consider 1MCP_{c}, the restricted 1MCP for a partial genome that we wish to complete to a circular unichromosomal genome . We assume that is not already a circular unichromosomal genome. Thus has a set of free extremities, i.e., the extremities that are not in any adjacency in . Equivalently, is the set of vertices of degree 0 in the genome graph . Finding the completion corresponds to finding a partition of into pairs of extremities, i.e., into adjacencies. However, this partition cannot be arbitrary as the adjacencies defined by the partition must satisfy two constraints: (1) The resulting genome is circular unichromosomal, meaning that the augmented genome graph has exactly one component, a cycle. Note that has only path components, since and . (2) The resulting genome must minimize the distance between the reference genome and .
The first constraint on partitioning of is that joining extremities at ends of a same path in by an edge, which we call an excluded edge, creates a cycle. This cycle must be selected carefully to obtain a unichromosomal genome. We define to be the set of all excluded edges.
The second constraint on partitioning of is provided by our desire to minimize the distance between the reference genome and . For the DCJ distance, we must maximize the number of cycles in the breakpoint graph . Adding an edge to increases the number of cycles in B if and only if the edge connects the endpoints of a same path in B. We call such an edge a desired edge and denote by the set of all desired edges. Now we combine these two constraints into a graph.
We define the freeextremities graph, to be a bicolored graph, whose vertex set is , and whose edge set is . The edges from are colored blue and the edges from are colored red. Note that R is a multigraph, and R consists of even cycles. This is because both and are perfect matchings on : since both and {{g_{h}, g_{t}}  g is a gene in } are perfect matchings on the set of all gene extremities. The restriction of these perfect matchings to are and . See Figure 4b. Thus, we have
Figure 4. Adding adjacencies to a partial genome to solve the 1MCP_{c}. (a) The breakpoint graph . Gray edges indicate adjacencies of , black edges indicate adjacencies of , and the dotted edges connect extremities of the same gene. The set of free vertices is . (b) The freeextremities graph consists of two even cycles. Blue edges are desired edges and red edges are excluded edges . (c) The resulting breakpoint graph after adding adjacency {1_{h}, 4_{h}}. (d) The resulting freeextremities graph after update(R, {1_{h}, 4_{h}}). The vertices 1_{h }and 4_{h }are no longer free extremities and thus are removed during update(R, {1_{h}, 4_{h}}).
To find a completion of the partial genome we select pairs {u, v} of free extremities from and add them as adjacencies to . Respecting the constraints encoded in the freeextremities graph R, we define a transformation update(R, {u, v}) that records the effect of adding adjacency {u, v} to (Figure 4). In particular, since u and v are free vertices of , there are paths and in B with an endpoint equal to u and v, respectively. Similarly, there are paths and in having an endpoint equal to u and v, respectively. We may have or . By the definition of , and are represented by blue edges b^{u }and b^{v }in R incident to u and v. Similarly by the definition of , and are represented by red edges r^{u }and r^{v }in R incident to u and v. Adding the adjacency {u, v} to will have the following effects on B and :
(i) u and v are no longer free vertices.
(ii) If then these paths merge into one path in B ∪ {u, v}. Otherwise these paths merge to create a cycle in B ∪{u, v}, and the number of cycles in the breakpoint graph increases by one.
(iii) If these paths merge into one path in . Otherwise these paths merge into a cycle in . In the latter case, we should add {u, v} as an adjacency if and only if . This is because adding {u, v} creates a cycle component in (i.e., a circular chromosome) and if there are other free vertices any subsequent completion will not be unichromosomal.
Therefore, adding the adjacency {u, v} to will have three corresponding effects on R: removing the vertices u and v from R based on (i) above, identifying b^{u }and b^{v }based on (ii) above, and identifying r^{u }and r^{v }based on (iii) above. We denote this process of updating R by update(R, {u, v}). Figure 4 gives an illustration of this process.
If {u, v} is a blue edge in R, then update(R, {u, v}) increases the number of cycles in the breakpoint graph B by one. Hence, to find a solution to 1MCP_{c }we want to perform update(R, {u, v}) transformations with as many blue edges as possible. On the other hand, adding new adjacencies has to merge the paths in the graph in such a way that we end with a genome with exactly one circular chromosome. Let M_{b}(R) be the maximum possible number of update transformations using blue edges for the graph R. The following theorem provides the exact value of M_{b}(R).
Theorem 2. Suppose is a partial genome, is a reference genome, and is their freeextremities graph. We have
where N_{b}(R) is the number of blue edges, and c(R) is the number of cycles in R.
Proof. We prove the theorem by induction on N_{b}(R). Suppose N_{b}(R) = 1. Then necessarily R consists of a cycle of length 2 with one blue and one red edge, and c(R) = 1. Thus, we update the graph R with the unique (and the only possible) blue edge obtaining
Now suppose N_{b}(R) >1. Then , since . Suppose u, , and , i.e., there is no red edge between u and v in R. Then, we have the following three cases for u and v: (i) u and v are from different cycles C_{u }and C_{v }in R, (ii) u and v are connected with a blue edge in a cycle C of R, or (iii) u and v are nonneighboring vertices in a cycle C of R.
Let R' = update(R, {u, v}) be the freeextremities graph after the update. Since u and v are incident with blue edges in R, after update(R, {u, v}) the number of blue edges decreases by one, i.e., N_{b}(R') = N_{b}(R)  1.
Thus, by induction hypothesis
Considering the above cases we have:
(i) After update(R, {u, v}), C_{u }and C_{v }will shrink into one cycle, and c(R') = c(R)  1. Thus by (2), M_{b}(R') = N_{b}(R)  c(R) + 1. By choosing such an edge we can update R with N_{b}(R)  c(R) + 1 blue edges.
(ii) After update(R, {u, v}), C shrinks into a smaller cycle, and c(R') = c(R). Thus, by (2), M_{b}(R') = N_{b}(R)  c(R). Since {u, v} is a blue edge, we can update R with N_{b}(R)  c(R) + 1 blue edges.
(iii) After update(R, {u, v}), C splits into two smaller cycles. Thus c(R') = c(R) + 1. Thus, by (2), M_{b}(R') = N_{b}(R)  c(R)  1. So by choosing {u, v} we can update R with N_{b}(R)  c(R)  1 blue edges.
By calculations above, choosing a pair {u, v} satisfying cases (i) or (ii) will result in a greater number of update moves with blue edges, than choosing a pair satisfies the case (iii). Moreover, considering pairs {u, v} from cases (i) and (ii) gives M_{b}(R) = N_{b}(R)  c(R) + 1. □
We call a pair {u, v} (which may or may not be an edge in R) satisfying case (i) or (ii) in the proof of Theorem 2 an optimal adjacency. Optimal adjacencies play an important role in finding a solution of 1MCP_{c}: updating the freeextremities graph with these adjacencies results in the maximum number of blue edges used in update transformations. We have the following important corollary to this theorem.
Corollary 1. Suppose is a partial genome and is a reference genome. Adding any optimal adjacency to leads to a solution for 1MCP_{c}. In other words, for any optimal adjacency e, there exists a solution for 1MCP_{c }which includes e as an adjacency.
Proof. By Theorem 2, adding any optimal adjacency to will allow the maximum number of blue edges in the update process. Since each update transformation on a blue edge increases the number of cycles in the breakpoint graph by one, a sequence of update transformations on optimal adjacencies gives a solution to 1MCP_{c}. Hence, if is the resulting completion of , we obtain the maximum number of cycles in the breakpoint graph . □
A linear time (in number of genes) algorithm for solving 1MCP_{c }adds optimal adjacencies according to cases (i) and (ii) in Theorem 2, and is shown in Algorithm 1. The following corollary is an immediate consequence of Corollary 1 and Algorithm 1.
Corollary 2. The 1MCP_{c }is solvable in linear time.
Algorithm 1: Solving 1MCP_{c}
Input : Partial genome and reference genome A.
Output: A 1completion that is circular unichromosomal and maximizes .
1 begin
2 Construct the freeextremities graph ;
4 while c(R) >1 do
5 u, v ← select two vertices from different cycles in R;
7 R ← update (R, {u, v});
8 while the number of blue edges in R >1 do
9 u, v ← select two vertices connected via a blue edge in R;
11 R ← update (R, {u, v});
12 Add the single remaining excluded edge in to ;
13 Output the resulting circular unichromosomal genome ;
14 end
1MCP_{ℓ}: linear unichromosomal completion
In this section we consider the 1MCP with chromosomal condition of a linear unichromosomal genome. We refer to this restricted problem as 1MCP_{ℓ}. We relate solutions of 1MCP_{ℓ }to solutions of 1MCP_{c}. Combined with the results in the previous section, we derive a linear time algorithm for 1MCP_{ℓ}.
Recall that is the number of alternating cycles in the breakpoint graph , for any solution of 1MCP_{c}. Similarly, we define to be the number of alternating cycles in , for any solution of 1MCP_{ℓ}. The following theorem relates the solutions of 1MCP_{c }to the solutions of 1MCP_{ℓ}.
Theorem 3. Let be a partial genome, be a circular unichromosomal genome, and be a linear unichromosomal genome obtained from by removing an adjacency e. Suppose and are the reference genomes in 1MCP_{c }and 1MCP_{ℓ}, respectively. From any solution to 1MCP_{c }we obtain a solution for 1MCP_{ℓ}. Also, from any solution to 1MCP_{ℓ }we obtain a solution for 1MCP_{c}. Moreover, , where
Proof. First, suppose e is not in any cycle in the graph , and hence θ(e) = 1. Let be a solution to 1MCP_{c}, and let be a linear unichromosomal genome obtained from by removing an adjacency , such that f and e are in the same cycle in . Note that such edge f exists, since e is not in any cycle in but it is in a cycle of . See Figure 5. Both gr and are perfect matchings as and are both circular. Removing the edges e and f from will decrease the number of cycles by exactly one since e and f are in a same cycle in . Hence , and we have,
Figure 5. Relating 1MCP_{c }and 1MCP_{ℓ}. (a) The breakpoint graph ; black edges are and and gray edges are . The edge e = {1_{t}, 6_{h}} is the only edge in . Since e is not in a cycle component of B, we have θ(e) = 1. (b) The breakpoint graph , where is a completion of and a solution to 1MCP_{c}. The adjacency f is in and shown by a gray dashed edge. B' has two cycles, and removing e and f decreases the number of cycles by one.
where the last inequality follows from the definition of as the largest number of cycles in any linear chromosomal completion of .
Now suppose is a solution to 1MCP_{ℓ}, so . Assume . Let be the circular unichromosomal genome obtained by adding to . Note that there is at least one path component in which becomes a cycle after adding the edges f' to and e to . Hence, , and we have
Thus by (3) and (4) we have , which implies that and . This means that and are solutions to 1MCP_{c }and 1MCP_{ℓ }that are obtained from and , respectively, which completes the proof for the case θ(e) = 1.
Now suppose e is in a cycle in , and thus θ(e) = 2. Using the same argument above, we have since we cannot find such edge f and the number of cycles in decreases by two, when we remove an edge from (to obtain a linear genome), and e from (to obtain the genome ). Also, , as adding the excluded edges of and will increase the number of cycles by 2. Thus, for this case we have □
Notice that the function θ depends only on the partial genome and the reference genome , and not on the completion . Also, it is easy to see that θ is computable in linear time (in number of genes). We have the following corollary.
Corollary 3. The 1MCP_{ℓ }is solvable in linear time.
Proof. Suppose is a partial genome and is a linear chromosomal reference genome. Since is linear and unichromosomal, . Assume that . Let be the circular unichromosomal genome obtained by adding e to . Using Algorithm 1 we obtain a solution for 1MCP_{c }with as the reference genome. Then by Theorem 3, we can transform the solution to a linear unichromosomal completion in linear time in the following way: If there exists an edge such that f and e are in the same cycle of the breakpoint graph , i.e. θ(e) = 1, remove f from . Otherwise θ(e) = 2 and we remove an arbitrary edge from to make a linear unichromosomal genome. Therefore, we obtain a solution to 1MCP_{ℓ }by viewing as a partial genome for a 1MCP_{c}, solving the problem, and converting the solution of 1MCP_{c }into a solution for 1MCP_{ℓ}. Since all of these steps are done in linear time (in number of genes), the proof is complete. □
(3 ≤ k)MCP
In the unrestricted case of the kMCP, the completion of a partial genome is always possible as we can add adjacencies and telomeres arbitrarily to the partial genome, since there is no restriction on the number and type of chromosomes in the resulting genome. The hardness of showing the existence of a kcompletion derives from the fact that finding a kcompletion for the partial multigenome results in a proper edge coloring for the genome graph of the partial multigenome.
Let G = (V, E) be a graph. We define the edgechromatic number of G, denoted χ'(G), to be the minimum number of colors required to obtain an edgecoloring of G. For each edgecoloring of G a color class is a set of all edges with a specific color. A color class defines a matching in the graph since no two edges of the same color share a vertex.
The following proposition shows the relation between the edgecoloring of a genome graph and the edge color classes of the corresponding breakpoint graph.
Proposition 1. If is a multigenome of k genomes then .
Proof. Suppose is a mixture of k genomes . Then the breakpoint graph can be partitioned into the sets of adjacencies, and each can be considered as color class. So the edges of B can be colored with k colors. Since B and are isomorphic, we have . □
Using the same argument as in Proposition 1 we have:
Lemma 1. If is a partial multigenome of k partial genomes then .
Now, in the following theorem we show a relation between the edgecoloring of a genome graph and the kcompletion of the corresponding partial multigenomes.
Theorem 4. Let be a partial multigenome. Then has an unrestricted kcompletion if and only if , for any positive integer k.
Proof. (⇒) If has a k completion, then it can be considered as a partial multigenome of k genomes. Then by Lemma 1 we have .
(⇐) Now assume that . Hence, we can color the edges of with k colors. If C_{1}, . . ., C_{k }are the color classes of G, we have . Each C_{i }is a matching in the graph , and is a set of adjacencies among the gene extremities. So we can define a partial genome by adjacencies . The color classes partition the edges of into k matchings, and we have . Since there is no restriction on the completions, taking any completion for each results in a a kcompletion for ; because . □
Now, by Theorem 4 and using the following two classic theorems, we show that deciding whether there exists a valid solution to a (k ≥ 3)MCP is NPcomplete. For a graph G let Δ(G) be the maximum degree of G.
Theorem 5 (Vizing [14]). If G is a simple graph, χ'(G) = Δ(G) or Δ(G) + 1.
Theorem 6 (Holyler [15]). For a graph G, deciding whether χ'(G) = Δ(G) or Δ(G) + 1 is NPcomplete, if Δ(G) ≥ 3.
Corollary 4. If k ≥ 3, deciding whether there exists a valid solution to the unrestricted kMCP is NPcomplete.
Proof. In order to prove this corollary we reduce the edgecoloring problem to kMCP. Suppose G = (V, E) is a simple graph and k = Δ(G) ≥ 3. If V  is not even, add an isolated vertex so that the number of vertices in G is 2n for some positive integer n. Consider these 2n vertices as gene extremities of a set of n genes. Now, G defines a partial multigenome on these n genes, since the kMCP is unrestricted and any graph can be considered as a partial multigenome with no restriction on the chromosomal structure of its partial genomes. If there is a polynomial algorithm for kMCP, we can input to this algorithm as the partial multigenome, along with an arbitrary distance function ϕ and a healthy reference . First, suppose the algorithm gives a valid output. Since the algorithm is polynomial, we can find a kcompletion for in polynomial time, and by Theorem 4, we can find an edge coloring of G with k colors in polynomial time.This implies that the χ'(G) ≤ k. Now if the algorithm does not give a valid output, by Theorem 4 we have χ'(G) > k. This implies that the kMCP is NPcomplete, since the genome graph of a partial multigenome is always a multigraph and the class of simple graphs is a subset of the class of multigraphs. □
Note that in Corollary 4 we only considered the unrestricted version of kMCP. This allows us to assume that for each (multi) graph G there exists a partial multigenome such that G and are isomorphic.Thus, if  for all partial multigenomes } and if is the set of all multigraphs, then . However, one can restrict the kMCP by taking a set of chromosomal conditions. Consequently we may have such that the new restricted kMCP is polynomially tractable for all partial multigenomes (whose genome graph is in ).
Corollary 5. If k ≥ 3, then the unrestricted kMCP is NPcomplete.
Proof. Since in solving a kMCP we need to find a kcompletion for its partial multigenome, by Corollary 4 the proof is complete. □
2MCP
In this section, we prove that the unrestricted 2MCP, and the restricted 2MCP where all chromosomes are circular (i.e., ), are NPcomplete for DCJ distance. The NPcompleteness of the unrestricted 2MCP is done by a reduction from MAX 3AND problem. The MAX 3AND is a satisfiability problem, where given a set of conjunctions, each with 3 literals, the goal is to determine an assignment of Boolean value to each variable that maximizes the number of satisfied conjunctions. Note that in 2MCP there are only two possible topologies for the mixture tree: the branchtree and pathtree (Figure 2b, c).
Theorem 7. The unrestricted 2MCP with DCJ distance is NPcomplete.
In order to provide the proof of this theorem, we need the following lemmas.
Lemma 2. Suppose is a partial multigenome whose genome graph, , consists of m cycles C_{1}, . . ., C_{m }with even lengths, and is a reference genome which consists of ℓ edges (i.e., it has ℓ adjacencies). Assume that there are ℓ' cycles among the cycles in such that no edge in A is connected to any of their vertices. If ℓ' >2ℓ then in every solution to the 2MCP, the optimal mixture tree is a pathtree.
Proof. Note that in 2MCP there are only two possible topologies for the mixture tree: the branchtree and pathtree (Figure 2b, c). Since the degree of each vertex in is two, if we partition the edges of into two perfect matchings and . Therefore, for any 2completion we have and , since G_{1 }and G_{2 }are maximal (and circular) and we cannot add any edge to them. Also, for each we have , where M_{ij }is a perfect matching on vertices of C_{j}. Obviously, . We have for i = 1, 2, since has ℓ edges and each of them can be in at most one cycle in . Therefore,
which shows that the d_{DCJ}value of a path tree is smaller than the d_{DCJ }value of a branch tree, and completes the proof. □
Lemma 3. Any MAX 3SAT instance is reducible to a MAX 3AND instance. Moreover, MAX 3AND is NPcomplete.
Proof. Let Δ = ℓ_{1 }V ℓ_{2 }V ℓ_{3 }be a clause (disjunction) of three literals. Define
By using basic Boolean rules we have Δ ⇔ V_{S∈ℓ(Δ) }S.
Now, suppose is a MAX 3SAT instance which has m clauses Δ_{1}, . . ., Δ_{m. }Let be an instance of MAX 3AND which consists of all the conjunctions in Since for every assignment to the variables at most one conjunction in L(Δ_{j}), 1 ≤ j ≤ m, is satisfied and this happens if and only if Δ_{j }is satisfied, then every optimal assignment to the variables in will be also an optimal assignment to the variables in . Therefore, MAX 3SAT is reducible to MAX 3AND, which implies that MAX 3AND is NPcomplete, as MAX 3SAT is NPcomplete [16]. □
Now, consider an instance of the MAX 3AND problem. We show how to represent by a genome graph and a reference genome, to make a reduction from MAX 3AND to 2MCP. Suppose we represent a variable x with a cycle C of even length, which we will call a variablecycle (see Figure 6a). This cycle has exactly two perfect matchings. We label one of these the true matching, T(x), and the other one the false matching, F(x) (see Figure 6b, c). We represent an assignment to a variable by choosing one of the matchings T(x) and F(x) and remove the edges in the other matching (see Figure 7).
Figure 6. Representing variables with cycles. (a) A variable represented by a cycle, (b) a true matching, and (b) a false matching.
Figure 7. Representing conjunctions with cycles. (a) Three cycles representing the literals , , and z, and the conjunction edges (bold) for a conjunction . (b) For x = y = false and z = true we obtain the conjunctioncycle Δ of length 6. (c) Any other assignment (e.g., x = true) destroys the conjunction cycle.
Let ℓ(x_{1}), ℓ(x_{2}), ℓ(x_{3}) be three literals of variables x_{1}, x_{2}, x_{3}, and Δ = (ℓ(x_{1}) Λ ℓ(x_{2}) Λ ℓ(x_{3})) be a conjunction in . A conjunctioncycle of Δ is a cycle which is obtained as follows:
1. For each i ∈ {1, 2, 3} consider an edge in T(x_{i}) if ℓ(x_{i}) = x_{i}. If take an edge in F(x_{i}).
2. Add three new edges, called conjunctionedges, to the three edges we chose in the previous step, and build a cycle of length 6. This cycle is a conjunctioncycle of Δ.
It is easy to see that an assignment α to x_{i}'s satisfy the conjunction Δ if and only if the corresponding matching assignment to α keeps all the edges in the conjunctioncycle of Δ. We call a representation of a MAX 3AND instance with cycles and conjunctioncycles a graphical representation of .
If the literals of a variable appear in at most t conjunctions, and the variablecycles have length at least 4t, then by choosing the edges of conjunctioncycles properly, we have a graphical representation of a MAX 3AND instance, where no edge in a variablecycle is incident with two conjunction edges from different conjunctioncycles. This implies the following lemma:
Lemma 4. For each MAX 3AND instance there exists a graphical representation such that any assignments to the variables in which maximizes the number of satisfied conjunctions, induces a matching assignment that maximizes the number of conjunctioncycles, and vice versa.
Combining Lemmas 24 gives the proof of Theorem 7.
Proof of Theorem 7. Since the MAX 3AND is NPcomplete by Lemma 3, it suffices to reduce the MAX 3AND problem to the 2MCP. Suppose is a MAX 3AND instance. Assume has m conjunctions. We can add 3m + 1 new conjunctions δ_{1}, . . ., δ_{3m+1 }where each δ_{i }consists of a new single variable x_{δi}; obviously in any optimal assignment the value of all the x_{δi}'s should be true. Now by Lemma 4, there is a graphical representation such that finding an optimal assignment in is equivalent to finding a matching for each variablecycle such that the number of preserved conjunctioncycles are maximized. Note that there are 3m conjunctionedges and 3m + 1 variablecycles which are not connected to any conjunctionedge. Now, consider all the vertices in as gene extremities, and all the edges in the variablecycles as the adjacencies of a partial multigenome G. Also, consider all the conjunctionedges as the adjacencies of a reference healthy genome . In the 2MCP problem with partial multigenome G and reference healthy genome , the optimal tree is forced to be a pathtree by Lemma 2 (Figure 2). Therefore, in the optimal solution of this 2MCP, should be a genome such that the number of cycles in the breakpoint graph is maximized, i.e., the number of conjunctioncycles are maximized. Since is a union of perfect matchings of the variablecycles (see the proof of Lemma 2) it induces an assignment for the variables which maximizes the number satisfied conjunctions, and this completes the proof. □
We end this section by considering the restricted version of kMCP, where the chromosomal condition set is {circular}, i.e. all genomes have all circular chromosomes. We denote this restricted version by kMCP_{c}, and the unrestricted version of kMCP by kMCP_{∅}. If opt(kMCP_{c}) and opt(kMCP_{∅}) are the d_{DCJ}value of a solution to kMCP_{c }and kMCP_{∅}, respectively, then:
Theorem 8. For the kMCP_{c }and kMCP_{∅ }versions of kMCP with DCJ distance we have
Proof. First note that each solution to kMCP_{c }is also a solution of kMCP_{∅}, since there is no restriction in kMCP. Hence, opt(kMCP_{c}) ≥ opt(kMCP_{∅}). Second, for each solution to kMCP_{∅ }if the resulting genomes are not circular we can add new edges to the genomes and make them circular. By adding the new edges the number of cycles in the breakpoint graph does not decrease which implies that the d_{DCJ}value of the newly obtained genomes is not larger than opt(kMCP_{∅}). Therefore, these circular genomes form a solution of kMCP_{∅}. So opt(kMCP_{c}) ≤ opt(kMCP_{∅}) completing the proof. □
Combining this theorem and Theorem 7 we have
Corollary 6. If k ≥ 2, then kMCP_{c }with DCJ distance is NPcomplete.
Discussion and conclusion
In this paper we introduced the kMinimum Completion Problem (kMCP) motivated by the type of data produced in current cancer genome sequencing studies. We showed the following results. (1) A linear time algorithm for the unrestricted 1MCP; (2) a linear time algorithm for the restricted versions 1MCP where the genomes are circular or linear; i.e. the chromosomal condition set is {circular, unichromosomal} or is {linear, unichromosomal}; (3) the unrestricted kMCP is NPcomplete for any distance when k ≥ 3; and (4) the 2MCP with DCJ distance is NPcomplete in the unrestricted version and with the condition that all chromosomes are circular, i.e. . These results lay the foundation for future algorithmic studies of the kMCP and the application of these algorithms to real cancer sequencing data.
There are numerous further directions to pursue. As noted in the introduction, the model described in this paper does not consider all the complexities of cancer genome sequencing: most importantly copy number aberrations (duplications and deletions) and errors in the measured adjacencies are important features of cancer genome sequencing and should be addressed.
To handle errors, one might consider weighted versions of the kMCP where adjacencies have a weight corresponding to the confidence in the measurement. Regarding the current model, further work is needed on different chromosomal conditions, genomic distances, or other constraints on the relationships between the genomes in the mixture. For example, the case of linear chromosomes demands further attention, as human chromosomes are linear, although circular chromosomes have been observed in cancer [17]. Similarly, one may impose an upper bound on the number of chromosomes. One may also place restrictions on the structure of the mixture tree.
Another direction is to derive approximation algorithms. In the kMCP we aim to minimize distance over all possible kcompletion and mixture trees simultaneously. However, by separating the completion and distance optimization steps, one may employ techniques that have developed for other problems. For example, one may try to first complete the partial multigenomes using some clustering techniques, as have been employed in metagenomic studies [18]. With complete genomes, one could then try to find optimal mixture trees rooted at the reference genome. Depending on the allowed structure of the mixture tree, techniques from genome rearrangement phylogeny problems may be employed. For example, in the case of 2MCP, if the complete genomes are the leaves of the mixture tree, then the problem becomes the median problem (with the reference genome genome as the third genome) [5,13]. Alternatively, if the genomes are the vertices of the mixture tree, then the tree construction problem becomes the problem of finding a minimum spanning tree, which is in generally easier. In between these extremes, where some of the genomes in the mixture are the leaves and some are intermediate nodes (ancestors), the problem becomes a Steiner tree problem. In the cancer application, any of these cases might provide useful approximations, as the process of clonal evolution of cancer [1] might mean that cells at intermediate stages of cancer progression remain present in the tumor.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
All authors contributed equally to this work.
Acknowledgements
We thank the anonymous referees for helpful comments on an earlier version of this manuscript. This work was supported by a CAREER Award from the National Science Foundation (#1053753). In addition, BJR is supported by a Career Award from the Scientific Interface from the Burroughs Wellcome Fund and an Alfred P. Sloan Research Fellowship.
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 19, 2012: Proceedings of the Tenth Annual Research in Computational Molecular Biology (RECOMB) Satellite Workshop on Comparative Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S19
References

Nowell PC: The clonal evolution of tumor cell populations.
Science 1976, 194(4260):2328. PubMed Abstract  Publisher Full Text

Raphael BJ, Volik S, Collins C, Pevzner PA: Reconstructing tumor genome architectures.
Bioinformatics 2003, 19(Suppl 2):i162171. PubMed Abstract  Publisher Full Text

Meyerson M, Gabriel S, Getz G: Advances in understanding cancer genomes through secondgeneration sequencing.
Nat Rev Genet 2010, 11(10):685696. PubMed Abstract  Publisher Full Text

Yancopoulos S, Attie O, Friedberg R: Efficient sorting of genomic permutations by translocation, inversion and block interchange.
Bioinformatics 2005, 21(16):33403346. PubMed Abstract  Publisher Full Text

Tannier E, Zheng C, Sankoff D: Multichromosomal median and halving problems under different genomic distances.
BMC Bioinformatics 2009., 10 PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Hannenhalli S, Pevzner PA: Transforming Men into Mice (Polynomial Algorithm for Genomic Distance Problem). In FOCS. IEEE Computer Society; 1995:581592.

Hannenhalli S, Pevzner PA: Transforming Cabbage into Turnip: Polynomial Algorithm for Sorting Signed Permutations by Reversals.
J ACM 1999, 46:127. Publisher Full Text

Bergeron A, Mixtacki J, Stoye J: A new linear time algorithm to compute the genomic distance via the double cut and join distance.
Theor Comput Sci 2009, 410(51):53005316. Publisher Full Text

ElMabrouk N, Sankoff D: The Reconstruction of Doubled Genomes.
SIAM J Comput 2003, 32(3):754792. Publisher Full Text

Warren R, Sankoff D: Genome Aliquoting Revisited. In RECOMBCG, Volume 6398 of Lecture Notes in Computer Science. Edited by Tannier E. Springer; 2010:112. PubMed Abstract  Publisher Full Text

Zheng C, Lenert A, Sankoff D: Reversal distance for partially ordered genomes.
ISMB (Supplement of Bioinformatics) 2005, 502508. PubMed Abstract  Publisher Full Text

Gaul É, Blanchette M: Ordering Partially Assembled Genomes Using Gene Arrangements. In Comparative Genomics, Volume 4205 of Lecture Notes in Computer Science. Edited by Bourque G, ElMabrouk N. Springer; 2006:113128.

Xu AW: A Fast and Exact Algorithm for the Median of Three ProblemA Graph Decomposition Approach. In RECOMBCG, Volume 5267 of Lecture Notes in Computer Science. Edited by Nelson CE, Vialette S. Springer; 2008:184197.

Vizing VG: On an estimate of the chromatic class of a pgraph. (Russian).

Holyer I: The NPCompleteness of EdgeColoring.
SIAM J Comput 1981, 10(4):718720. Publisher Full Text

Cook SA: The Complexity of TheoremProving Procedures. In STOC. Edited by Harrison MA, Banerji RB, Ullman JD. ACM; 1971:151158.

Raphael BJ, Pevzner PA: Reconstructing tumor amplisomes.
ISMB/ECCB (Supplement of Bioinformatics) 2004, 265273. PubMed Abstract  Publisher Full Text

Wu YW, Ye Y: A Novel AbundanceBased Algorithm for Binning Metagenomic Sequences Using ℓTuples. In RECOMB, Volume 6044 of Lecture Notes in Computer Science. Edited by Berger B. Springer; 2010:535549.