Email updates

Keep up to date with the latest news and content from BMC Systems Biology and BioMed Central.

This article is part of the supplement: Proceedings of the 23rd International Conference on Genome Informatics (GIW 2012)

Open Access Proceedings

A fast and accurate algorithm for single individual haplotyping

Minzhu Xie1*, Jianxin Wang2 and Tao Jiang3

Author Affiliations

1 College of Physics and Information Science, Hunan Normal University, Changsha 410081, P. R. China

2 School of Information Science and Engineering, Central South University, Changsha 410083, P. R. China

3 Department of Computer Science and Engineering, University of California, Riverside, CA 92521, USA

For all author emails, please log on.

BMC Systems Biology 2012, 6(Suppl 2):S8  doi:10.1186/1752-0509-6-S2-S8

The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1752-0509/6/S2/S8


Published:12 December 2012

© 2012 Xie et al.; licensee BioMed Central Ltd.

This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

Due to the difficulty in separating two (paternal and maternal) copies of a chromosome, most published human genome sequences only provide genotype information, i.e., the mixed information of the underlying two haplotypes. However, phased haplotype information is needed to completely understand complex genetic polymorphisms and to increase the power of genome-wide association studies for complex diseases. With the rapid development of DNA sequencing technologies, reconstructing a pair of haplotypes from an individual's aligned DNA fragments by computer algorithms (i.e., Single Individual Haplotyping) has become a practical haplotyping approach.

Results

In the paper, we combine two measures "errors corrected" and "fragments cut" and propose a new optimization model, called Balanced Optimal Partition (BOP), for single individual haplotyping. The model generalizes two existing models, Minimum Error Correction (MEC) and Maximum Fragments Cut (MFC), and could be made either model by using some extreme parameter values. To solve the model, we design a heuristic dynamic programming algorithm H-BOP. By limiting the number of intermediate solutions at each iteration to an appropriately chosen small integer k, H-BOP is able to solve the model efficiently.

Conclusions

Extensive experimental results on simulated and real data show that when k = 8, H-BOP is generally faster and more accurate than a recent state-of-art algorithm ReFHap in haplotype reconstruction. The running time of H-BOP is linearly dependent on some of the key parameters controlling the input size and H-BOP scales well to large input data. The code of H-BOP is available to the public for free upon request to the corresponding author.

Background

Each human somatic cell contains 23 pairs of chromosomes, and there are about 0.5% differences between the DNA sequences of two copies of each chromosome [1]. The dominant DNA differences are single nucleotide polymorphisms (SNPs). Identification of the combination of alleles at the SNP loci on the same chromosome copy, i.e., haplotyping, is needed to fully understand the human genetic variation patterns and enhance the power of genome-wide association studies for complex diseases [2,3]. Currently, it is expensive and labor-intensive to separate two copies of chromosomes by biological techniques [4], and most published human individuals' genomes contain only the mixed information, i.e., genotype information, of the underlying two copies of chromosomes [5]. Therefore, to reduce the cost, accurate and fast computational haplotyping methods are of urgent importance.

There have been many computational haplotyping models [6-8] and they can be grouped into two main classes: haplotype inference and haplotype assembly [6]. Haplotype inference is to phase the haplotypes of individuals in a pedigree or a population from their genotypes. Computer algorithms of haplotype inference have been used in the International HapMap Project [9] and the 1000 Genomes Project [5] to identify haplotypes. Haplotype assembly is also called Single Individual Haplotyping (SIH). SIH assembles a pair of haplotypes from an individual's aligned DNA sequence fragments. With the dramatically dropped cost of human whole genome sequencing, more and more human individual's DNAs have been sequenced. With mate-pairs sequencing and read length improvements of the next-generation sequencing technologies, and with the development of new sequencing technologies, SIH methods have been used to build haplotype-resolved genome of human beings [10,11]. When there are enough DNA sequence fragments that cover two or more consecutive variant loci, SIH builds longer and more accurate haplotype blocks than haplotype inference does [12].

Since Lancia et al., first formalized the SIH problem [13], many optimization models and algorithms have been introduced to solve the problem [7,14-25]. The main models are MEC (minimum error correction) [24], MFR (minimum fragment removal), and MSR (minimum SNP removal) [13]. Recently, Duitama et al., proposed a new model MFC (maximum fragments cut) [16]. Most of the models are NP-hard and APX-hard [16,21,26], and their exact algorithms run in time exponentially dependend on at least one input parameter [15,17,19-21]. Therefore, a large number of heuristic algorithms have been designed to deal with the problem [16,22,23,25]. According to [16], one of the most accurate heuristic algorithms is HapCUT [25], while ReFHap [16] runs much faster than HapCUT without loss of accuracy. In this paper, we consider both quality measures "errors corrected" and "fragments cut", and propose a new optimization model, called Balanced Optimal Partition (BOP), for the SIH problem. The model generalizes the most popular model MEC and the recent model MFC. In fact, it could be made either model by setting some parameters to extreme values. To solve the model, we propose a dynamic programming algorithm H-BOP. By limiting the number of intermediate solutions at each iteration to an appropriately chosen small integer k, H-BOP is able to solve the model efficiently. The time complexity of H-BOP linearly depends on some of the key parameters controlling the input size and the algorithm scales well to large input data.

Results and discussion

We use a public available Java package SIH [27] to test the performance of H-BOP. The package contains a simulated data generator and implements algorithm ReFHap [12,16]. The simulated data are generated according to five parameters: number of SNPs (haplotype length) n, number of fragments m, average fragment length l, sequencing error rate e, and gap rate g. In our experiments, since we only consider heterozygous SNPs, for each data set, a haplotype h1 containing n SNPs is generated randomly first and then the other haplotype h2 is obtained by flipping each allele of h1. For each haplotype, m/2 fragments are randomly sampled from the haplotype and their lengths follow a normal distribution with mean l and variance 1. Finally for each fragment, every allele is flipped with probability e to introduce sequencing errors and, except at the first and last positions, every allele is deleted with probability g to introduce gaps. Given fragments generated as above, the average call coverage c is calculated by dividing the total number of alleles of the fragments by the haplotype length n. Please see [16] for more details.

Among many algorithms for the SIH problem, HapCUT and ReFHap are two of the most accurate heuristic algorithms [12,16]. Since ReFHap is much faster than HapCUT, we only compare our algorithm with ReFHap. We implemented our algorithm H-BOP in Java and embedded it in the java package SIH and tested the accuracies, phased haplotype lengths and running time of H-BOP and ReFHap on simulated data and a real data set provided by [12]. All tests are carried out on a Windows 7 64 bit PC (3GHz CPU, 4GB RAM). To measure the haplotype reconstruction accuracy of an SIH algorithm, the hamming distance between the reconstructed haplotype pair and the real haplotype pair was previously used widely in the literature [14,23]. However, a recent study [28] showed it over-penalizes simple switch errors. Therefore, as in [12,16], we use switch errors to measure the accuracy of an algorithm. A switch error is an inconsistency between the reconstructed haplotype pair and the real haplotype pair over two contiguous SNPs. There may be some SNP sites where an algorithm is unable to determine the alleles of a haplotype. The phased haplotype length is defined as the number of SNP sites where the alleles of the reconstructed haplotype pair are determined. And the number of switch errors divided by the phased haplotype length is called switch error rate.

If the allele of a fragment f at a SNP site s is known, we say f covers s. When there are no fragments covering two consecutive loci, it is not possible to determine the haplotype containing these consecutive loci for all SIH models. Therefore, for each test we divide a haplotype into blocks according to the input fragments as in [16]. A block corresponds to a connected component of a graph G = (V, E) where V is the set of the SNP sites and there is a edge between two SNP sites s1 and s2 if and only if there is a fragment covers both s1 and s2. The switch errors of an algorithm are the sum of the switch errors in all blocks. In the following simulation tests, the haplotype length n = 100, the gap rate g = 0.1 and each result is the average over 100 repeated experiments if there is no explicit specification.

Parameters of the algorithm H-BOP

There are two parameters w and k in H-BOP. The parameter w is a weighting factor. H-BOP tries to seek a solution with the minimum number of errors corrected when w = 0, and a solution with a maximum cut of the weighted conflict graph corresponding to the input fragments [16] when w is set sufficiently large. k is the maximum number of intermediate solutions that we will keep at each iteration of H-BOP. When k is large enough, H-BOP in fact becomes an exact algorithm. To choose appropriate values for w and k, we test H-BOP on different combinations of w and k.

In Figure 1, the haplotype length n = 100 and the average fragment length l = 3. In Figure 1(a), the number of fragments m = 140 (c = 4.25) and k = 16. In Figures 1(b) and 1(c), m = 210 (c = 6.42) and w = 0.1. Figure 1(a) shows that when the sequencing error rate e increases from 0.01 to 0.05, the switch errors of H-BOP increases accordingly. The switch errors of H-BOP is larger when w = 0 than those when w > 0 with e unchanged, which is obvious when e = 0.05. It indicates that the optimal objective minimum errors corrected leads large switch errors when sequencing error rate is high. When w = 0.01 and 0.1, the switch errors of H-BOP are smallest. Figure 1(b) shows that when k increases from 1 to 8, the switch errors decrease accordingly. When k increases from 8 to 64, there are no significant improvements in switch errors of H-BOP. When sequencing error rates are high, exact optimal solutions may incur large switch errors and Figure 1(b) indicates that when k increases from 8 to 64, switch errors of H-BOP increases accordingly. Figure 1(c) shows that the running time of H-BOP increases linearly with k when the haplotype length n, number of fragments m and average fragment length l remain fixed. In the following tests, we set w = 0.1 and k = 8 for H-BOP.

thumbnailFigure 1. Performances of H-BOP with different parameter values. Here, the haplotype length n = 100, the gap rate g = 0.1 and the average fragment length l = 3. (a) The number of fragments m = 140 and k is 16. (b) w = 0.1 and m = 211. (c) w = 0.1, m = 211 and e = 0.01.

Simulation results

We changed the sequencing error rate e, the average fragment length l and the number of fragments m to generate different fragment data sets, and compared the performance of H-BOP and ReFHap. Figure 2 shows that H-BOP and ReFHap are both accurate and there are only several switch errors in reconstructing haplotypes of 100 SNPs. The accuracies of both algorithms decrease with the increase of e and improves with the increase of m. These results are consistent with those in [16], which claim that the accuracy of an SIH algorithm increases with decreasing sequencing error rate and increasing call coverage. In a half of the total 48 cases shown in Figure 2, H-BOP produces fewer switch errors than ReFHap, especially when l = 3 and e ≥ 0.02. In the other half, H-BOP presents a few more switch errors than ReFHap only in two cases (i.e., Figure 2(a), m = 140, e = 0.005 and Figure 2(b), m = 111, e = 0.02). In the remaining 22 cases, H-BOP has the same switch errors as ReFHap.

thumbnailFigure 2. Switch errors of ReFHap and H-BOP with varying e, m and l. The parameters k and w of H-BOP are set as 8 and 0.1, respectively, and the haplotype length n is again set as 100.

The phased haplotype lengths of both algorithms generally increase with decreasing e and increasing m. Table 1 shows that when l = 3 and m < 351, or when e = 0.05 (except when l = 10 and m > 44), H-BOP is able to phase more SNPs than ReFHap. In other cases the phased haplotype lengths of H-BOP and ReFHap are equal.

Table 1. Average phased haplotype lengths of ReFHap and H-BOP

We set e = 0.01 and varied the haplotype length n, the number of fragments m and the average fragment length l to compare the running time of H-BOP and ReFHap (Figure 3). When n = 100, l = 3 and m increases from 100 to 400, the running time of H-BOP increases linearly, but the running time of ReFHap increases sharply (Figure 3(a)). When m reaches 400, the running time of H-BOP is only about 4 seconds, while that of ReFHap is about 468 seconds. When n increases from 50 to 200 while m = 200 and l = 3, the average call coverage decreases and the running time of both algorithms decreases accordingly (Figure 3(b)). When m = 200, n = 100 and l increases from 3 to 9, the running time of ReFHap increases significantly while that of H-BOP increases slowly and remains less than 5 seconds (Figure 3(c)). When the number of fragments and the average fragment length increase, the average call coverage c increases accordingly. When c is large, H-BOP runs much faster than ReFHap.

thumbnailFigure 3. Running time of ReFHap and H-BOP. (a) The number of fragments m varies with n = 100 and l = 3. (b) The haplotype length n varies with m = 200 and l = 3. (c) The average fragment length l varies with m = 200 and n = 100.

Results on real data

We downloaded a real data set from the SIH website [27], which contains the aligned sorted fosmid-based NGS DNA sequence fragments and gold-standard haplotypes of a HapMap trio child, NA12878 [12]. The total heterozygous SNP sites of the data are 1,704,166, the total fragments are 285,341, the average fragment length is 18.03, and the average call coverage is 3.02. Since the coverage is low, there are many consecutive heterozygous SNP site pairs not covered by any fragments, and hence the 23 pairs of chromosomes are divided into 17,839 haplotype blocks. Due to the low coverage, both H-BOP and ReFHap ran very fast and completed the reconstruction of haplotypes for all 23 pairs of chromosomes in about half a minute. The total phased haplotype lengths of H-BOP and ReFHap are 1,563,741 and 1,562,402 respectively. Compared with the gold-standard haplotypes, the total switch errors of the haplotypes built by H-BOP and ReFHap are 21,859 and 21,835 respectively. Though the switch errors of H-BOP is larger than that of ReFHap, the switch error rates of H-BOP and ReFHap are both 0.014.

To account for both completeness and quality, Duitama et al. [12] proposed an alternative measure QAN50 (quality adjusted N50). QAN50 is calculated as follows [12]:

(1) Break every haplotype block into the longest possible segments containing no switch errors.

(2) Calculate span (in reference base pairs) from the first phased SNP to the last phased SNP for every segment.

(3) Adjust each span by multiplying the span with phased SNPs ratio (the number of phased SNPs divided by the number of total SNPs) inside the segment (to correct for un-phased SNPs).

(4) Sort segments from the largest to the smallest adjusted span.

(5) Traverse the list and count the number of phased SNPs. When the count is more than a half of the total number of SNPs, the adjust span of the current segment is QAN50.

Clearly, algorithms with larger QAN50 values are more desirable. The QAN50 values of H-BOP and ReFHap on the above real data set are 114,261.09 and 113,831.67, respectively.

Conclusions

Haplotyping is regarded as one of the hardest challenges in personal genome sequencing. Though some single molecule sequencing technologies have been developed, they are still too expensive and labor-consuming. Computer algorithms are widely used to reconstruct haplotypes in personal genome sequencing. SIH uses computer algorithms to build a pair of haplotypes from an individual's aligned DNA sequence fragments. There are many different combinatorial optimization models for SIH, among which MEC is the most popular and MFC is the most recent introduction [16]. In this paper, we combine the two quality measures "errors corrected" and "fragments cut" used in MEC and MFC and introduce a new model BOP. We design a heuristic dynamic programming algorithm H-BOP to solve the model. By setting appropriate parameters of the algorithm, H-BOP is accurate and fast. Extensive simulation experiments show that H-BOP is generally faster and more accurate in assembling haplotypes than a recent state-of-art algorithm ReFHap. When the average fragment length is small and sequencing error rate is relatively high, H-BOP is significantly more accurate than ReFHap. The running time of H-BOP increases linearly as the average call coverage c increases, and H-BOP runs much faster than ReFHap when c is large. The test on a real data set also shows that H-BOP is superior to ReFHap considering both completeness and quality of the reconstructed haplotypes.

Methods

Formulation and problem

With the input of aligned DNA sequence fragments derived from a pair of chromosomes, SIH tries to reconstruct a pair of haplotypes of their underlying chromosomes. If there are no sequencing errors and for any two consecutive (but not necessarily adjacent) heterozygous SNP loci, there is at least one fragment covering both, we can easily determine the linkage relationship between two consecutive heterozygous SNPs and thus SIH is easy. However, sequencing errors are unavoidable which makes the problem complicated. In the following, we first introduce some notations and definitions similar to those in [15,16,20,23], and then propose a new optimization model.

Since we only consider the alleles at SNP loci in the SIH problem, the input aligned fragments are encoded as an m × n SNP matrix M [15,16,25], where m is the number of fragments and n the number of SNP loci. An entry at the ith row and the jth column of M is denoted as M[i, j]. M[i, j] takes a value from {0, 1, }, where '0' (or '1') encodes the major allele (or the minor allele, respectively) in the population and '' represents an unknown allele. As in previous work [16,25], we assume that all SNP loci are heterozygous and every fragment covers at least two heterozygous SNP loci. If this is not the case, a simple preprocessing as in [22] and [29] can be used to remove homozygous loci and fragments covering only one heterozygous SNP locus. In the remainder of the section, the ith row of M is equivalent to the ith fragment and the j column of M is equivalent to the jth SNP locus without explicit specification for briefness.

Let a, b ∈ {0, 1, −} and define

<a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M1">View MathML</a>

(1)

Given an m × n SNP matrix M, let the underlying haplotype pair be <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M2">View MathML</a>. The allele at the jth SNP locus of H1 (or H2) is denoted as H1[j] (or H2[j], respectively). Notice that since all SNP loci of interest are heterozygous, for any j ∈ {1, ..., n}, H1[j] ≠ H2[j], i.e. when H1[j] = 0, H2[j] = 1 and when H1[j] = 1, H2[j] = 0. Let fi denote the ith fragment (i.e. the ith row of M). c(M[i1, j], M[i2, j]) = 1 means that <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M3','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M3">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M4','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M4">View MathML</a>conflict at SNP locus j (i.e. column j), and that if both fragments come from the same chromosome, either M[i1, j] or M[i2, j] is a sequencing error. Similarly, c(M[i, j], Hp[j]) = 1 (p = 1 or 2) means that fi and Hp conflict at SNP locus j, and that if fi comes from the chromosome with haplotype Hp, M[i, j] must be a sequencing error.

Let <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M5','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M5">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M6','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M6">View MathML</a> and define

<a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M7','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M7">View MathML</a>

(2)

If the real haplotype pair is <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M8','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M8">View MathML</a>, it is easy to verify that the number of sequencing errors in the input fragments is at least <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M9','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M9">View MathML</a>, which we call the errors corrected measure. Based on the principle of parsimony, it is a natural optimization objective that to minimize the number of errors and hence Minimum Error Correction [20,23,24] is the most popular model for SIH.

Minimum Error Correction (MEC): Given an m × n SNP matrix M, find a haplotype pair <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M8','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M8">View MathML</a> such that <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M9','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M9">View MathML</a> is minimized.

Based on the underlying haplotype pair <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M10','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M10">View MathML</a>, it is easy to partition the fragments into two groups G1, G2 according to the following rule: For each fragment fi, if sc(H1, fi) < sc(H2, fi), add fi to G1, otherwise add fi to G2. Let <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M11','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M11">View MathML</a> denote the partition (G1, G2) obtained by the above rule. A partition <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M12','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M12">View MathML</a> of a fragment set R is encoded as a map such that <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M13','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M13">View MathML</a> (or 1) if the fragment f in R belongs to G1 (or G2, respectively).

Conversely, given a partition <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M14','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M14">View MathML</a>, a haplotype pair <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M10','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M10">View MathML</a> can be constructed as follows. Let Ng,v[j] be the number of fragments of Gg whose allele at the jth locus is v for g = 1, 2 and v = 0, 1. For each SNP locus j, if N1,1[j] + N2,0[j] ≤ N1,0[j] + N2,1[j], H1[j] = 0 and H2[j] = 1; otherwise, H1[j] = 1 and H2[j] = 0. Let <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M15','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M15">View MathML</a> denote the haplotype pair obtained by the above method.

Theorem 1 Given an SNP matrix M and a haplotype pair <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M16','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M16">View MathML</a>.

There is another formulation of MEC equivalent to the above one: Given an SNP matrix M, find a partition <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M17','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M17">View MathML</a> of the rows in M such that <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M18','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M18">View MathML</a> is minimized. In the following, <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M18','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M18">View MathML</a> is called the errors corrected measure of <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M17','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M17">View MathML</a>. While MEC aims to find a partition such that the conflict between the fragments in the same group is minimized, Maximum Fragments Cut (MFC) [16] aims to find a partition such that the conflict between the fragments in G1 and the fragments in G2 is maximized.

Let a, b ∈ {0, 1, } and define

<a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M19','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M19">View MathML</a>

(3)

Let <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M3','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M3">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M4','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M4">View MathML</a> be two rows of M and define <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M20','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M20">View MathML</a>. As in [16], we convert an m × n SNP matrix M into a weighted complete graph <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M21','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M21">View MathML</a>, where V is the set of rows in M and the weight of the edge between two rows <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M3','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M3">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M4','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M4">View MathML</a> is <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M22','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M22">View MathML</a>. Therefore, a partition <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M17','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M17">View MathML</a> of the rows in M corresponds a cut of <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M23','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M23">View MathML</a>. Given an SNP matrix M and a partition <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M17','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M17">View MathML</a>, the fragments cut measure is defined as:

<a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M24','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M24">View MathML</a>

(4)

Maximum Fragments Cut (MFC) [16]: Given an SNP matrix M, find a partition <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M17','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M17">View MathML</a> of the rows of M such that <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M25','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M25">View MathML</a> is maximized.

To take into account both the conflict between the two groups and the conflict between the fragments within the same group, we introduce a new score combining the above errors corrected measure and the fragments cut measure. Given a partition <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M17','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M17">View MathML</a> of the rows of M, define the partition score as

<a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M26','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M26">View MathML</a>

(5)

where w is a weight factor that is used to adjust the weight of the fragments cut measure. In the following we propose a new optimization model for the SIH problem.

Balanced Optimal Partition (BOP): Given an SNP matrix M, find a partition <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M17','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M17">View MathML</a> of the rows in M such that <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M27','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M27">View MathML</a> is minimized. A solution to BOP of M is denoted by BOP(M), i.e. a partition with the minimum partition score.

Note that when w = 0, BOP becomes MEC which has been proved NP-hard [24] and APX-hard [26]. Therefore, BOP is NP-hard and APX-hard.

Algorithm

Given an m × n SNP matrix M, there are 2m-1 different partitions of m rows in M. Therefore, when m is large, it is impractical to enumerate all possible partitions and choose one with the minimum partition score. To solve the BOP model of M efficiently, we propose a dynamic programming algorithm in the subsection. We first consider the first row of M, then the first two rows and so on until we have considered all rows of M.

We need some definitions and notations. Let M [1..i, :] denote the SNP matrix consisting of only the first i rows of M. The first and last columns at which the ith row of M takes non '' values are denoted by l(i) and r(i), respectively. For a column j, if l(i) ≤ j ≤ r(i), row i spans column j. In the following, we assume that all the rows of M are sorted such that if i1 < i2, l(i1) < l(i2) or l(i1) = l(i2) ∧ r(i1) ≤ r(i2). Let R(i) denote the row set containing the rows in M[1..i, :] that span column l(i).

Let <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M17','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M17">View MathML</a> be a partition of a row set R and <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M28','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M28">View MathML</a> a partition of a subset <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M29','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M29">View MathML</a> of R. If for every row i R', <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M30','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M30">View MathML</a>, <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M31','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M31">View MathML</a> is called the projection of <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M17','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M17">View MathML</a> on R' and <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M17','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M17">View MathML</a> is called an extension of <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M32','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M32">View MathML</a> on R. Fix a row i and let <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M17','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M17">View MathML</a> be a partition of R(i). <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M33','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M33">View MathML</a> is an optimal extension of <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M17','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M17">View MathML</a>, if the following conditions hold: (1) <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M34','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M34">View MathML</a> is an extension of <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M17','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M17">View MathML</a> on the row set R = {1, ..., i}; (2) for any possible extension <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M35','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M35">View MathML</a> of <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M17','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M17">View MathML</a> on R, <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M36','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M36">View MathML</a>.

Given a partition <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M17','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M17">View MathML</a> of R(i), let <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M37','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M37">View MathML</a> denote an optimal extension of <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M17','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M17">View MathML</a>. We call <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M38','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M38">View MathML</a> the partition score of <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M17','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M17">View MathML</a> and denote it as <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M39','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M39">View MathML</a> for briefness.

Theorem 2 For an m × n SNP matrix M, let <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M17','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M17">View MathML</a>be a partition of <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M40','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M40">View MathML</a> is a solution to BOP of M if the following condition holds: for any possible partition <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M41','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M41">View MathML</a>of R(i), <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M42','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M42">View MathML</a>.

Consider the submatrix containing only the first row of M. Since R(1) contains only one row, i.e. R(1) = {1}, there are only two possible partitions <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M43','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M43">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M44','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M44">View MathML</a> of R(1) (<a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M43','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M43">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M44','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M44">View MathML</a> are in fact equivalent): <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M45','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M45">View MathML</a> iff <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M46','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M46">View MathML</a>. It is easy to verify that the following equalities hold for i = 1; 2.

<a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M47','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M47">View MathML</a>

(6)

After <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M48','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M48">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M49','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M49">View MathML</a> have been calculated for every possible partition <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M17','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M17">View MathML</a> of R(i), we consider the submatrix containing the first i + 1 rows of M. Let Rc(i, i + 1) = R(i) ∩ R(i + 1), and we calculate <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M50','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M50">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M51','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M51">View MathML</a> for every possible partition <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M52','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M52">View MathML</a> of R(i + 1) according to the following method. Let q be the number of the rows in R(i) but not in Rc(i, i + 1), i.e. q = |R(i)−Rc(i, i + 1)|. For a partition <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M53','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M53">View MathML</a> of Rc(i, i + 1), there are 2q distinct extensions of <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M54','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M54">View MathML</a> on R(i). Suppose <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M55','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M55">View MathML</a> is the one whose partition score is the minimum among all 2q extensions. Then <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M56','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M56">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M57','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M57">View MathML</a> can be computed with the following equations:

<a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M58','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M58">View MathML</a>

(7)

Since the rows in M are sorted, it is easy to verify that R(i + 1) = Rc(i, i + 1) ∪ {i + 1}, and that the number of all possible distinct partitions of R(i + 1) is two times that of Rc(i, i + 1). For each partition <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M59','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M59">View MathML</a> of Rc(i, i + 1), there are two distinct corresponding partitions <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M60','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M60">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M61','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M61">View MathML</a> of R(i + 1): for each l Rc(i, i + 1), <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M62','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M62">View MathML</a>, but <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M63','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M63">View MathML</a>. Optimal extensions of <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M64','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M64">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M65','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M65">View MathML</a> and their partition scores can be calculated with the following equations:

<a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M66','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M66">View MathML</a>

(8)

<a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M67','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M67">View MathML</a>

(9)

<a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M68','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M68">View MathML</a>

(10)

<a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M69','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M69">View MathML</a>

(11)

In Equation (10) (or 11), <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M70','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M70">View MathML</a> is the difference between the errors corrected measures of <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M71','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M71">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M72','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M72">View MathML</a>; and <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M73','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M73">View MathML</a> is the difference between the fragments cut measures of <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M74','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M74">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M75','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M75">View MathML</a>. The values <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M76','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M76">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M77','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M77">View MathML</a> are calculated by calling the following functions <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M78','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M78">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M79','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M79">View MathML</a>, respectively.

   <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M80','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M80">View MathML</a>

    { δ = 0;

      for j = l(i + 1),..., r(i + 1) do

      { if M[i + 1, j] = = '-' then continue;

         N1,0 = N2,0 = N1,1 = N2,1 = 0;

         for each row l Rc(i, i + 1) do

         { if M[l, j] = = '-' then continue;

            v = M[l, j], g = <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M97','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M97">View MathML</a>,Ng,v + +; }

         if N1,1 + N2,0 N1,0 + N2,1 then

         { δ = δ - (N1,1 + N2,0); }

         else { δ = δ - (N1,0 + N2,1); }

         v = M[i + 1, j], g = <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M98','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M98">View MathML</a>, Ng,v + +;

         if N1,1 + N2,0 N1,0 + N2,1 then

         { δ = δ + (N1,1 + N2,0); }

         else { δ = δ + (N1,0 + N2,1); }

      }

      return δ; }

   <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M81','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M81">View MathML</a>

   { δ = 0, g0 = <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M98','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M98">View MathML</a>;

      for j = l(i + 1), ..., r(i + 1) do

      { if M[i + 1, j] = = '-' then continue;

         for each row l Rc(i, i + 1) do

         { if M[l, j] = = '-' then continue;

            v = M[l, j], g = <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M97','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M97">View MathML</a>;

            if g == g0 then continue;

            if v == M[i + 1, j] then δ - -;

      else δ + +; }

   }

   return δ; }

When <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M82','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M82">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M83','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M83">View MathML</a> are known for every possible partition <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M84','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M84">View MathML</a> of R(m), a solution to BOP of M is easily obtained by using the following formula:

<a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M85','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M85">View MathML</a>

(12)

Based on the above equations, we can construct an exact dynamic programming algorithm for BOP. However, the complexity of this exact algorithm increases exponentially with the number of rows in R(i), which implies that the algorithm is impractical when the call coverage is large. Let <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M86','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M86">View MathML</a> be the projection on R(i) of the global optimal partition of M. If the partition score of <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M87','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M87">View MathML</a> is among the k smallest ones of all possible partitions of R(i), we only need to compute k partitions of R(i) whose partition scores are the smallest in each iteration without losing the global optimal partition at the end. Based on the above idea, we propose a heuristic algorithm H-BOP whose pseudo-code is shown in Figure 4. In the algorithm, a partition <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M88','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M88">View MathML</a> of a row set is encoded by a binary number P, and <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M89','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M89">View MathML</a> is represented by the ith bit of P. Therefore, the number set {0, ..., 2q 1} encodes all possible partitions of a row set containing q rows (in fact, there are only 2q-1 different partitions, and in our implement of H-BOP, we use a binary number of q − 1 bits to encode a partition to save time and space). In each iteration of H-BOP, for each row i we maintain a binary max heap H to store the candidate partitions of R(i) whose partition scores are among the k smallest. The heap H can store at most k elements, and each element L of H contains a partition <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M90','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M90">View MathML</a>, <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M91','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M91">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M92','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M92">View MathML</a>, which are denoted as L.P, L. <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M93','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M93">View MathML</a>, and L.s respectively. The value <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M94','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M94">View MathML</a> of each element in the heap is larger than or equal to those of its two children. When the number of elements of H is smaller than its expected size (i.e. k), and a new element L arrives, the element is inserted into H and H is adjusted to maintain the max heap property. When the number of elements of H reaches k, L is compared with the root r of H. If L.s < r.s, the root is replaced by L and H is adjusted accordingly; otherwise, the new element is discarded. The above operation is denoted by H.insert(L).

thumbnailFigure 4. H-BOP Algorithm.

The time complexity of H-BOP is O(mkk1k2), and the space complexity is O(mk1 + mk), where <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M95','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M95">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M96','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1752-0509/6/S2/S8/mathml/M96">View MathML</a>.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

MX designed the algorithm, performed the computational experiments, and drafted the manuscript. TJ and JW helped to draft and polish the manuscript. All authors read and approved the manuscript.

Acknowledgements

This research was supported by the National Natural Science Foundation of China under Grant Nos. 61070145 and 61232001.

This article has been published as part of BMC Systems Biology Volume 6 Supplement 2, 2012: Proceedings of the 23rd International Conference on Genome Informatics (GIW 2012). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcsystbiol/supplements/6/S2.

References

  1. Levy S, et al.: The diploid genome sequence of an individual human.

    PLoS Biology 2007, 5(10):e254-e254. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  2. Tewhey R, Bansal V, Torkamani A, Topol EJ, Schork NJ: The importance of phase information for human genomics.

    Nat Rev Genet 2011, 12(3):215-23. PubMed Abstract | Publisher Full Text OpenURL

  3. Casey JP, et al.: A novel approach of homozygous haplotype sharing identifies candidate genes in autism spectrum disorder.

    Hum Genet 2012, 131(4):565-79. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  4. Fan HC, Wang J, Potanina A, Quake SR: Whole-genome molecular haplotyping of single cells.

    Nat Biotechnol 2011, 29:51-7. PubMed Abstract | Publisher Full Text OpenURL

  5. The 1000 Genomes Project Consortium: A map of human genome variation from population-scale sequencing.

    Nature 2010, 467(7319):1061-73. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  6. Zhang XS, Wang RS, Wu LY, Chen L: Models and algorithms for haplotyping problem.

    Current Bioinformatics 2006, 1:105-114. Publisher Full Text OpenURL

  7. Xie M, Wang J, Chen J, Wu J, Liu X: Computational models and algorithms for the single individual haplotyping problem.

    Current Bioinformatics 2010, 5:18-28. Publisher Full Text OpenURL

  8. Browning SR, Browning BL: Haplotype phasing: existing methods and new developments.

    Nat Rev Genet 2011, 12(10):703-14. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  9. The International HapMap Consortium: A haplotype map of the human genome.

    Nature 2005, 437(7063):1299-1320. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  10. Kitzman JO, et al.: Haplotype-resolved genome sequencing of a Gujarati Indian individual.

    Nat Biotechnol 2011, 29:59-63. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  11. Suk EK, et al.: A comprehensively molecular haplotype-resolved genome of a European individual.

    Genome Res 2011, 21(10):1672-85. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  12. Duitama J, et al.: Fosmid-based whole genome haplotyping of a HapMap trio child: evaluation of Single Individual Haplotyping techniques.

    Nucleic Acids Res 2012, 40(5):2041-53. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  13. Lancia G, Bafna V, Istrail S, Lippert R, Schwartz R: SNPs problems, complexity and algorithms. In Proc. Ann. European Symp. on Algorithms (ESA), Volume 2161 of Lecture Notes in Computer Science. Edited by auf der Heide FM. Berlin/Heidelberg: Springer; 2001:182-193. OpenURL

  14. Geraci F: A comparison of several algorithms for the single individual SNP haplotyping reconstruction problem.

    Bioinformatics 2010, 26(18):2217-25. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  15. Xie M, Wang J, Chen J: A model of higher accuracy for the individual haplotyping problem based on weighted SNP fragments and genotype with errors.

    Bioinformatics 2008, 24(13):i105-13. PubMed Abstract | Publisher Full Text OpenURL

  16. Duitama J, Huebsch T, McEwen G, Suk EK, Hoehe MR: ReFHap: a reliable and fast algorithm for single individual haplotyping. In Proceedings of the First ACM international Conference on Bioinformatics and Computational Biology. Niagara Falls, New York: ACM; 2010:160-169. OpenURL

  17. He D, Choi A, Pipatsrisawat K, Darwiche A, Eskin E: Optimal algorithms for haplotype assembly from whole-genome sequence data.

    Bioinformatics 2010, 26(12):i183-90. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  18. Bansal V, Bafna V: HapCUT: an efficient and accurate algorithm for the haplotype assembly problem.

    Bioinformatics 2008, 24(16):i153-9. PubMed Abstract | Publisher Full Text OpenURL

  19. Wang J, Xie M, Chen J: A practical exact algorithm for the individual haplotyping problem MEC/GI.

    Algorithmica 2010, 56(3):283-296. Publisher Full Text OpenURL

  20. Xie M, Wang J, Chen J: A practical parameterised algorithm for the individual haplotyping problem MLF.

    Mathematical Structures in Computer Science 2010, 20(5):851-863. Publisher Full Text OpenURL

  21. Bafna V, Istrail S, Lancia G, Rizzi R: Polynomial and APX-hard cases of the individual haplotyping problem.

    Theoretical Computer Science 2005, 335:109-125. Publisher Full Text OpenURL

  22. Panconesi A, Sozio M: Fast hare: a fast heuristic for single individual SNP haplotype reconstruction. In Proc. WABI, Volume 3240 of Lecture Notes in Computer Science. Edited by Jonassen I, Kim J. Berlin/Heidelberg: Springer; 2004:266-277. OpenURL

  23. Wang RS, Wu LY, Li ZP, Zhang XS: Haplotype reconstruction from SNP fragments by minimum error correction.

    Bioinformatics 2005, 21(10):2456-2462. PubMed Abstract | Publisher Full Text OpenURL

  24. Lippert R, Schwartz R, Istrail S: Algorithmic strategies for the single nucleotide polymorphism haplotype assembly problem.

    Brief. Bioinform 2002, 3:1-9. OpenURL

  25. Genovese LM, Geraci F, Pellegrini M: SpeedHap: an accurate heuristic for the single individual SNP haplotyping problem with many gaps, high reading error rate and low coverage.

    IEEE/ACM Trans Comput Biol Bioinform 2008, 5(4):492-502. PubMed Abstract | Publisher Full Text OpenURL

  26. Cilibrasi R, van Iersel L, Kelk S, Tromp J: The complexity of the single individual SNP haplotyping problem.

    Algorithmica 2007, 49:13-36. Publisher Full Text OpenURL

  27. SIH [http://www.molgen.mpg.de/~genetic-variation/SIH/] webcite

  28. Lo C, Bashir A, Bansal V, Bafna V: Strobe sequence design for haplotype assembly.

    BMC Bioinformatics 2011, 12(Suppl 1):S24. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  29. Xie M, Wang J: An improved (and practical) parameterized algorithm for the individual haplotyping problem MFR with mate-pairs.

    Algorithmica 2008, 52(2):250-266. Publisher Full Text OpenURL