Abstract
Background
Recent studies have shown that the patterns of linkage disequilibrium observed in human populations have a blocklike structure, and a small subset of SNPs (called tag SNPs) is sufficient to distinguish each pair of haplotype patterns in the block. In reality, some tag SNPs may be missing, and we may fail to distinguish two distinct haplotypes due to the ambiguity caused by missing data.
Results
We show there exists a subset of SNPs (referred to as robust tag SNPs) which can still distinguish all distinct haplotypes even when some SNPs are missing. The problem of finding minimum robust tag SNPs is shown to be NPhard. To find robust tag SNPs efficiently, we propose two greedy algorithms and one linear programming relaxation algorithm. The experimental results indicate that (1) the solutions found by these algorithms are quite close to the optimal solution; (2) the genotyping cost saved by using tag SNPs can be as high as 80%; and (3) genotyping additional tag SNPs for tolerating missing data is still costeffective.
Conclusion
Genotyping robust tag SNPs is more practical than just genotyping the minimum tag SNPs if we can not avoid the occurrence of missing data. Our theoretical analysis and experimental results show that the performance of our algorithms is not only efficient but the solution found is also close to the optimal solution.
Background
In recent years, Single Nucleotide Polymorphisms (SNPs) have become the preferred marker for association studies of genetic diseases or traits. A set of linked SNPs on one chromosome is called a haplotype. Recent studies have shown that the patterns of Linkage Disequilibrium (LD) observed in human populations have a blocklike structure [4,13]. The chromosome recombination only takes place at some low LD regions called recombination hotspots. The high LD region between these hotspots is often referred to as a "haplotype block." Within a haplotype block, there is little or even no recombination occurred, and the SNPs in the block tend to be inherited together. Due to the low haplotype diversity within a block, the information carried by these SNPs is highly redundant. Thus, a small subset of SNPs (called "tag SNPs") is sufficient to distinguish each pair of patterns in the block [7,13,1719]. Haplotype blocks with corresponding tag SNPs are quite useful and costeffective for association studies as it does not require genotyping all SNPs. Many studies have tried to find the minimum set of tag SNPs in a haplotype block. In a largescale study of human Chromosome 21, Patil et al. [13] developed a greedy algorithm to partition the haplotypes into 4,135 blocks with 4,563 tag SNPs. Zhang et al. [1719] used a dynamic programming approach to reduce the numbers of blocks and tag SNPs to 2,575 and 3,562, respectively. Bafna et al. [1] showed that the problem of minimizing tag SNPs is NPhard and gave efficient algorithms for special cases of this problem.
Figure 1. The influence of missing data and auxiliary tag SNPs. (A) A haplotype block defined by 12 SNPs and 4 haplotype patterns. Each column represents a haplotype pattern and each row represents a SNP locus. The black and grey boxes stand for the major and minor alleles at each SNP locus, respectively. (B) Tag SNPs genotyped without missing data. (C) Tag SNPs genotyped with missing data. (D) The auxiliary tag SNP S_{5 }for h_{2}. (E) The auxiliary tag SNP S_{8 }for h_{3}.
In reality, a SNP may not be genotyped and considered to be missing data (i.e., we fail to obtain the allele configuration of the SNP) if it does not pass the threshold of data quality [13,16,19,20]. These missing data may cause ambiguity when using the minimum set of tag SNPs to distinguish an unknown haplotype sample. Figure 1 illustrates the influence of missing data when identifying haplotype samples. In this figure, a haplotype block (see Figure 1 (A)) defined by 12 SNPs and 4 haplotype patterns is presented (from the public haplotype data of human Chromosome 21 [13]). We follow the same assumption as previous studies that all SNPs are diallelic (i.e., taking on only two values) [1,13]. Suppose we select SNPs S_{1 }and S_{12 }as tag SNPs. The haplotype sample h_{1 }is identified as haplotype pattern P_{3 }unambiguously (see Figure 1 (B)). Consider haplotype samples h_{2 }and h_{3 }with one missing tag SNP (see Figure 1 (C)). h_{2 }can be identified as haplotype patterns P_{2 }or P_{3}, and h_{3 }can be identified as P_{1 }or P_{3}. As a result, these missing tag SNPs result in ambiguity when distinguishing unknown haplotype samples.
Although we can not avoid the occurrence of missing data, the remaining SNPs within the haplotype block may provide abundant information to resolve the ambiguity. For example, if we regenotype an additional SNP S_{5 }for h_{2 }(see Figure 1 (D)), h_{2 }is identified as haplotype pattern P_{3 }unambiguously. On the other hand, if SNP S_{8 }is regenotyped (see Figure 1 (E)), h_{3 }is also identified unambiguously. These additional SNPs are referred to as "auxiliary tag SNPs," which can be found from the remaining SNPs in the block and are able to resolve the ambiguity caused by missing data.
Alternatively, instead of regenotyping auxiliary tag SNPs whenever encountering missing data, we work on a set of SNPs which is not affected by the occurrence of missing data. Figure 2 illustrates a set of SNPs which can tolerate one missing SNP. Suppose we select SNPs S_{1}, S_{5}, S_{8}, and S_{12 }to be genotyped. Note that no matter which SNP is missing, each of the 16 missing patterns can be distinguished by the remaining three SNPs. Therefore, all haplotype samples with one missing SNP can still be identified unambiguously. We refer to these SNPs as "robust tag SNPs," which are able to tolerate a number of missing data. The important feature of robust tag SNPs is that although they consume more SNPs than the "tag SNPs" defined in previous studies, they guarantee that all haplotype samples with a number of missing data can be distinguished unambiguously. When the occurrence of missing data is frequent, the cost of regenotyping processes can be reduced by robust tag SNPs.
Figure 2. The robust tag SNPs. A set of robust tag SNPs for tolerating one missing tag SNP.
This paper focuses on the problem of finding robust tag SNPs to tolerate a number of missing data. Throughout this paper, we denote m as the maximum number of missing SNPs to be tolerated, which corresponds to different missing rates in different genotyping experiments. And we wish to find a minimum set of robust tag SNPs which can distinguish each pair of haplotypes even when up to m SNPs are missing. We assume that the haplotype phases and block partition are available as the input. Numerous methods have been developed to infer haplotypes from genotype data [12,14,15]. Several algorithms have also been proposed to find the block partition [4,13,17]. The problem of finding minimum robust tag SNPs is shown to be NPhard (See Theorem 1). To find robust tag SNPs efficiently, we propose two greedy algorithms and one linear programming (LP) relaxation algorithm. The proposed algorithms have been implemented and tested on a variety of simulated and empirical data. We also analyze the efficiency and solutions of these algorithms. An algorithm for finding auxiliary tag SNPs is described assuming robust tag SNPs have been computed in advance.
Results
We propose two greedy algorithms which select the robust tag SNPs one by one in different greedy manners. In addition, we reformulate this problem as an integer programming problem and design an LPrelaxation algorithm to solve this problem. The greedy and LPrelaxation algorithms are able to find solutions within factors of (m + 1) , , and O(m ln K) of the optimal solution respectively, where m is the maximum number of missing SNPs allowed and K is the number of haplotype patterns in the block.
We have implemented the first and second greedy algorithms in JAVA [see Additional files 1 and 2]. The LPrelaxation algorithm has been implemented in Perl [see 3], where the LP problem is solved via a program called "lp_solve" [11]. The LPrelaxation algorithm is a randomized method. Thus, this program is repeated for 10 times to explore different solutions and the best solution among them is chosen as the output.
Additional File 1. The program for the first greedy algorithm. The Greedyl.zip file is compressed using WinZip and contains the JAVA source code for the first greedy algorithm.
Format: ZIP Size: 3KB Download file
Additional File 2. The program for the second greedy algorithm. The Greedy2.zip file is compressed using WinZip and contains the JAVA source code for the second greedy algorithm.
Format: ZIP Size: 3KB Download file
Additional File 3. The program for the iterative LPrelaxation algorithm. The ILP.zip file is compressed using WinZip and contains the Perl script for the iterative LPrelaxation algorithm.
Format: ZIP Size: 89KB Download file
In order to evaluate the solutions and efficiency of our algorithms, we also implement a program in JAVA (referred to as "OPT") which uses a brute force method to find the optimal solution. For a given data set of N SNPs, the OPT program examines all possible solutions (i.e., all subsets of , , ..., and ). The minimum subset of SNPs that can tolerate m missing SNPs is chosen as the output. Due to the NPhardness of this problem, the OPT program fails to output the optimal solution within a reasonable period of time in many data sets. As a consequence, we skip some impossible solution space to speed up this program by the following two observations: (1) the solutions with less than or equal to m SNPs are the impossible ones since m SNPs might be missing; and (2) for a data set containing K haplotype patterns, the minimum number of SNPs required to distinguish each of them is at least log K (see Lemma 2). As a result, we can examine the possible solutions only for subsets of , ..., and . By searching possible solutions from small subsets to large ones, the OPT program can stop and output the optimal solution immediately when a subset that can tolerate m missing SNPs is found.
Results on simulated data
Theoretically, all SNPs will reach complete linkage equilibrium after sufficient chromosome recombination takes place. We first generate 100 data sets containing short haplotypes which simulate this bottleneck model [12,14,15]. Each data set consists of 10 haplotypes with 20 SNPs. These haplotypes are created by randomly assigning the major or minor alleles at each SNP locus. Let m be the number of missing SNPs allowed and S_{a }be the average number of robust tag SNPs over 100 data sets. Figure 3 (a) plots S_{a }with respect to m (roughly corresponding to SNP missing rates from 0% to 33%). When m = 0, all programs find the same number of SNPs as the optimal solution. The iterative LPrelaxation algorithm slightly outperforms the others as m increases. When m > 6, more than 20 SNPs are required to tolerate missing data. Thus, no data sets contain enough SNPs for solutions.
Figure 3. Experimental results on random data. (a) Results from data sets containing 10 haplotypes and 20 SNPs. (b) Results from data sets containing 10 haplotypes and 40 SNPs.
We then generate 100 data sets containing long haplotypes. Each data set is composed of 10 haplotypes with 40 SNPs. Figure 3 (b) illustrates the experimental results on these long data sets (corresponding to SNP missing rates from 0% to 37%). The optimal solutions for m > 2 can not be found by the OPT program within a reasonable period of time (after one week computation) and are not shown in this figure. It is because the possible solutions in long data sets are too large to enumerate. On the other hand, both greedy and iterative LPrelaxation algorithms run in polynomial time and always output a solution efficiently. In this experiment, both greedy algorithms slightly outperforms the iterative LPrelaxation algorithm. In addition, the number of missing SNPs allowed is larger than those in short data sets. For example, to tolerate 10 missing SNPs (i.e., m = 10), all programs output less than 28 SNPs. The remaining SNPs in each data set are still sufficient to tolerate more missing SNPs.
Hudson (2002) [10] provides a program which can simulate a set of haplotypes under the assumption of neutral evolution and uniformly distributed recombination rate using the coalescent model. We use Hudson's program to generate 100 short data sets with 10 haplotypes and 20 SNPs and 100 long data sets with 10 haplotypes and 40 SNPs. Figure 4 (a) shows the experimental results on Hudson's short data sets (corresponding to SNP missing rates from 0% to 23%). The number of missing SNPs allowed are less than that of random data. It is because Hudson's program generates coalescent haplotypes which are similar to each other. As a result, many SNPs can not be used to distinguish haplotypes and the amount of tag SNPs is inadequate to tolerate larger missing SNPs. In this experiment, we observe that the iterative LPrelaxation algorithm finds solutions quite close to the optimal solutions and slightly outperforms the other two algorithms.
Figure 4. Experimental results on Hudson's data. (a) Results from data sets containing 10 haplotypes and 20 SNPs. (b) Results from data sets containing 10 haplotypes and 40 SNPs.
Figure 4 (b) illustrates the experimental results on long data sets generated by Hudson's program (corresponding to SNP missing rates from 0% to 29%). The optimal solutions for m > 2 again can not be found by the OPT program within a reasonable period of time. In this experiment, the performance of the first greedy and iterative LPrelaxation algorithms are similar, and they slightly outperform the second greedy algorithm as m becomes large.
Results on real data
We also test these programs on two real data sets: (1) public haplotype data of human Chromosome 21 released by Patil et al. [13]; and (2) a 500 KB region on human Chromosome 5q31 which may contain a genetic variant related to the Crohn disease by Daly et al. [4]. Patil's data include 20 haplotypes of 24,047 SNPs spanning over about 32.4 MB, which are partitioned into 4,135 haplotype blocks. By genotyping 103 SNPs with minor allele frequency at least 5%, Daly et al. partition the 500 KB region into 11 haplotype blocks. Each haplotype block in these real data sets contains different numbers of SNPs and haplotypes (e.g., from several SNPs to hundreds of SNPs). When m increases, some short blocks may not contain enough SNPs for tolerating missing data (e.g., m > the number of SNPs in a block). As a consequence, S_{a }here stands for the average number of robust tag SNPs over those blocks still containing solutions.
Figure 5 (a) shows the experimental results on Patil's 4,135 blocks. Because there are many long blocks in Patil's data (e.g., more than one hundred SNPs), the optimal solution for m > 2 can not be found within a reasonable period of time. The experimental result indicates that all algorithms find similar number of robust tag SNPs when m is small. The LPrelaxation algorithm slightly outperforms the others as m increases.
Figure 5. Experimental results on real data. (a) Results from Patil's Chromosome 21 data, (b) Results from Daly's Chromosome 5q31 data.
Figure 5 (b) illustrates the experimental results on Daly's 11 blocks. Because the haplotype blocks partitioned by Daly et al. are very short (e.g., most blocks contain less than 12 SNPs), all optimal solutions still can be found. The solutions found by each algorithm is almost the same as optimal solutions. Theoretically, S_{a }should grow monotonically as m increases. But due to the small number of blocks in Daly's data set, S_{a }does not grow smoothly when m increases from 2 to 3. To explain this phenomenon, we report the detailed result of the first greedy algorithm in Table 1. For each of the 11 blocks, the number of robust tag SNPs found with respect to different values of m is listed in the table. Note that as mentioned before, some blocks may not contain enough SNPs for tolerating large missing data as m increases. When m increases from 2 to 3, Blocks 4 and 10 (which consumes 8 and 5 SNPs) do not contain enough SNPs for a solution and are discarded. As a result, S_{a }(for m = 3) is computed only using Blocks 1 and 2 and the value is lower than the previous one (i.e., from 4.75 to 4). This phenomenon is not shown in Figure 5 (a) because it is amortized by thousands of blocks in Patil's data set.
Table 1. The detailed result of first greedy algorithm on Daly's 11 blocks.
Discussion
In terms of efficiency, the first and second greedy algorithms are faster than the LPrelaxation algorithm. The greedy algorithms usually returns a solution in seconds and the LPrelaxation algorithm requires about half minute for a solution. It is because the running time of LPrelaxation algorithm is bounded by the time of solving the LP problem. Furthermore, this LPrelaxation algorithm is repeated for 10 times to explore 10 different solutions. The OPT program for searching the optimal solution is apparently slower than the others. The optimal solution usually can not be found within a reasonable period of time if the size of the block becomes large. ¿From our empirical study, the optimal solution can be found in reasonable time by the OPT program if the block contains less than 20 SNPs (e.g., the short random data sets). But for those large data sets with more than 40 SNPs, the OPT program is significantly outperformed by the approximation algorithms (e.g., fail to output a solution within one week computation).
Assuming no missing data (i.e., m = 0), we compare the solutions found by each algorithm with the optimal solution. Table 2 lists the numbers of total tag SNPs found by each algorithm in previous experiments. In the experiments on random and Daly's data, the solution found by each algorithm is as good as the optimal solution. In the experiments on Hudson's and Patil's data, these algorithms still find solutions quite close to the optimal solution. For example, the approximation ratios of these algorithms are only and , respectively.
Table 2. The number of total tag SNPs found by each algorithm. The percentage of tag SNPs with respect to total SNPs is shown in parentheses.
We then analyze the genotyping cost that can be saved by using tag SNPs. In Table 2, the percentage of tag SNPs in each data set is shown in parentheses. The experimental results indicate that the cost of genotyping tag SNPs is significantly reduced in comparison with genotyping all SNPs in a block. For example, in Patil's data, we only need to genotype about 19% of tag SNPs in each block, which saves about 81% genotyping cost.
The tradeoffs between the number of additional tag SNPs required and the number of missing SNPs allowed are discussed in the following. In practice, missing data in the genotyping experiment are usually limited to certain missing rate. We transform the maximum number of missing SNPs allowed into maximum missing rates allowed by calculating the percentage of m with respect to the number of robust tag SNPs. Table 3 lists the results of the first greedy algorithm applied on random and Hudson's long data sets. The number of additional tag SNPs grows with respect to m linearly. However, we observe that the maximum missing rate allowed grows slowly as m becomes large. This is because more additional tag SNPs are required in order to tolerate more missing SNPs. But under the same SNP missing rate, genotyping these additional tag SNPs may also increase the number of missing SNPs, which reduces the power of robust tag SNPs. On the positive side, when m is small, the corresponding maximum missing rate allowed is sufficient for most genotyping experiments since their missing rates are usually less than 10%. For example, the robust tag SNPs with m = 1 are sufficient to tolerate 10% missing SNPs, and they only requires at most 3 additional SNPs. As a result, genotyping additional tag SNPs for tolerating missing data is costeffective under the current genotyping environment.
Table 3. The tradeoffs between additional tag SNPs required and maximum missing rates allowed. These results come from the first greedy algorithm applied on random and Hudson's data sets with 40 SNPs.
In reality, not all haplotypes are of equal importance or confidence. When selecting robust tag SNPs, it might be desirable to weight them according to their population frequency. To incorporate the frequency of haplotypes into this problem, there are two possible ways:
1. It can be easily done by discarding the rare haplotypes and retain the common haplotypes as the input of our algorithms. This approach would not require modification to our algorithms. But the retained common haplotypes will be processed as equally weighted.
2. Our algorithms try to find a set of SNPs such that each pair of haplotypes are distinguished by a threshold of at least (m + 1) SNPs. A simplest way to weight the haplotypes is choosing different thresholds for each pair of haplotypes according to their population frequency. The haplotype pairs with higher frequency can then be assigned with more tag SNPs than the lower ones by our algorithms.
Conclusion
In this paper, we show there exists a set of robust tag SNPs which is able to tolerate a number of missing data. Our study indicates that genotyping robust tag SNPs is more practical than genotyping minimum tag SNPs for association studies if we can not avoid the occurrence of missing data. We describe two greedy and one LPrelaxation approximation algorithms for finding robust tag SNPs. Our experimental results and theoretical analysis show that these algorithms are not only efficient but the solutions found are also close to the optimal solution. In terms of genotyping cost, we observe that the genotyping cost saved by using robust tag SNPs is significant, and genotyping additional tag SNPs to tolerate missing data is still costeffective. One future direction is to assign weights to different types of SNPs (e.g., SNPs in coding or noncoding regions), and design algorithms for the selection of weighted tag SNPs.
Software availability
Project name: efficient algorithms for utilizing SNP information.
Project home page: http://www.csie.ntu.edu.tw/~kmchao/tools/Robust_Tag_SNP webcite
Operating system: the implemented greedy algorithms are platform independent, and the implemented LPrelaxation algorithm runs on the Windows operating system.
Programming language: the greedy algorithms are implemented in JAVA, and the LPrelaxation algorithm is implemented in Perl.
Methods
Assume we are given a haplotype block containing N SNPs and K haplotype patterns. This block is denoted by an N × K binary matrix M_{h }(see Figure 6 (A)). Define M_{h}[i,j] ∈ {1,2} for each i ∈ [1, N] and j ∈ [1, K], where 1 and 2 represent the major and minor alleles, respectively. In reality, the haplotype block may also contain missing data. This formulation can be easily extended to handle missing data by treating them as wild card symbols. To simplify the presentation of this paper, we will assume no missing data in the block. Let C be the set of all SNPs in M_{h}. The robust tag SNPs C' ⊆ C are a subset of SNPs which is able to distinguish each pair of haplotype patterns unambiguously when at most m SNPs are missing. Note that the missing data may occur at any SNP locus and thus create different missing patterns (see Figure 2). For any haplotype pattern with up to m missing SNPs, the set of robust tag SNPs C' is required to distinguish all of them unambiguously.
Figure 6. Reformulation of the MRTS problem. (A) The haplotype matrix M_{h }containing N SNPs and K haplotype patterns. (B) The bipartite graph corresponding to M_{h}.
To distinguish a haplotype pattern unambiguously, each pair of patterns must be distinguished by at least one SNP in C'. For example (see Figure 6 (A)), we say patterns P_{1 }and P_{2 }can be distinguished by SNP S_{2 }since M_{h}[2,1] ≠ M_{h}[2,2]. A formal definition of this problem is given below.
Problem: Minimum Robust Tag SNPs (MRTS)
Input: An N × K matrix M_{h }and an integer m.
Output: The minimum subset of SNPs C' ⊆ C which satisfies:
(1) for each pair of patterns P_{i }and P_{j}, these is a SNP S_{k }∈ C' such that M_{h}[k, i] ≠ M_{h}[k, j];
(2) when at most m SNPs are discarded from C' arbitrarily, (1) still holds.
We then reformulate MRTS to a variant of the set covering problem [6]. Each SNP S_{k }∈ C (i.e., the kth row in M_{h}) is reformulated to a set = {(i, j)  M[k, i] ≠ M[k, j] and i <j}. For example, suppose the kth row in M_{h }is {1,1,1,2}. The corresponding set = {(1,4), (2,4), (3,4)}. In other words, stores the pairs of patterns distinguished by SNP S_{k}. Define P as the set that contains all pairs of patterns (i.e., P = {(i,j)  1 ≤ i <j ≤ K} = {(1,2), (1,3), ..., (K  l,K)}).
Consider each element in P and each reformulated set of C as nodes in an undirected bipartite graph (see Figure 6 (B). If SNP S_{k }can distinguish patterns P_{i }and P_{j }(i.e., (i,j) ∈ ), there is an edge connecting the nodes (i, j) and . The following lemma implies that each pair of patterns must be distinguished by at least (m + 1) SNPs to tolerate m missing SNPs.
Lemma 1. C' ⊆ C is the set of robust tag SNPs which allows at most m missing SNPs iff each node in P has at least (m + 1) edges connecting to each node in C'.
Proof. Let C' be the set of robust tag SNPs which allows at most m missing SNPs. Suppose patterns P_{i }and P_{j }are distinguished by only m SNPs in C' (i.e., (i, j) has only m edges connecting to nodes in C'). However, if these m SNPs are all missing, no other SNPs in C' are able to distinguish patterns P_{i }and P_{j}, which is a contradiction. Thus, each pair of patterns must be distinguished by at least (m + 1) SNPs, which implies that each node in P must have at least (m + 1) edges connecting to nodes in C'. The proof of the other direction is similar.
In the following, we give a lower bound regarding the minimum number of robust tag SNPs required, which is used to skip some solution space by the OPT program.
Lemma 2. Given K haplotype patterns, the minimum number of robust tag SNPs required is at least log K.
Proof. Recall that the value of a SNP is binary. The maximum number of distinct haplotypes which can be distinguished by N SNPs is at most 2^{N}. As a result, for a given data set containing K haplotype patterns, the minimum number of SNPs required is at least log K.
The following theorem shows the NPhardness of the MRTS problem, which implies there is no polynomial time algorithm to find the optimal solution of MRTS.
Theorem 1. The MRTS problem is NPhard.
Proof. When m = 0, MRTS is the same as the original problem of finding minimum tag SNPs, which is known as the minimum test set problem [6,17]. Since the minimum test set problem is NPhard and can be reduced to a special case of MRTS, MRTS is NPhard.
The first greedy algorithm
To solve MRTS efficiently, we propose a greedy algorithm which returns a solution not too larger than the optimal solution. By Lemma 1, to tolerate m missing tag SNPs, we need to find a subset of SNPs C' ⊆ C such that each pair of patterns in P is distinguished by at least (m + 1) SNPs in C'. Assume that the SNPs selected by this algorithm are stored in a (m + 1) × P table (see Figure 7 (A)). Initially, each grid in the table is empty. Once a SNP S_{k}, (that can distinguish patterns P_{i }and P_{j}) is selected, one grid of the column (i, j) is filled in with S_{k}, and we say that this grid is covered by S_{k}.
Figure 7. An example of the first greedy algorithm. The SNPs S_{1}, S_{4}, S_{2}, and S_{3 }are selected by the first greedy algorithm. (A) The table that stores each selected SNP.
This greedy algorithm works by covering the grids from the first row to the (m + 1)th row, and greedily selects a SNP which covers most uncovered grids in the ith row at each iteration. In other words, while working on the ith row, a SNP is selected if its reformulated set S' maximizes S' ∩ R_{i }, where R_{i }is the set of uncovered grids at the ith row.
Figure 7 illustrates an example for this algorithm to tolerate one missing tag SNP (i.e., m = 1). The SNPs S_{1}, S_{4}, S_{2}, and S_{3 }are selected in order. When all grids in this table are covered, each pair of patterns is distinguished by (m + 1) SNPs in the corresponding column. Thus, the SNPs in this table are the robust tag SNPs which can tolerate up to m missing SNPs. The pseudo code of this greedy algorithm is given below.
Algorithm: FlRSTGREEDYALGORITHM (C, P, m)
1 R_{i }← P, ∀i ∈ [1, m + 1]
2 C' ← φ
3 for i = 1 to m + 1 do
4 while R_{i }≠ φ do
5 select and remove a SNP S from C that maximizes S' ∩ R_{i}
6 C' ← C' ∪ S
7 j ← i
8 while S' ≠ φ and j ≤ m + 1 do
9 S_{tmp }← S' ∩ R_{j }//S_{tmp }is a temporary variable for holding the result of S' ∩ R_{i}
10 R_{j }← R_{j } S_{tmp}
11 S' ← S'  S_{tmp}
12 j ← j + l
13 endwhile
14 endwhile
15 endfor
16 return C'
The time complexity of this algorithm is analyzed as follows. At Line 4, the number of iterations of the intermediate loop is bounded by R_{i} ≤ P. Within the loop body (Lines 5–13), Line 5 takes O(CP) because we need to check all SNPs in C and examine the uncovered grids of R_{i}. The inner loop (Lines 8–13) takes only O(S'). Thus, the entire program runs in O(mCP^{2}).
We now show the solution C' returned by the first greedy algorithm is not too larger than the optimal solution C*. Suppose the algorithm selects the kth SNP when working on the ith row. Let  be the number of grids in the ith row covered by the kth selected SNP (i.e.,  = S' ∩ R_{i}; see Line 5 in FIRSTGREEDYALGORITHM). For example (see Figure 7), since the second selected SNP (i.e., S_{4}) covers two grids in the first row. We incur 1 unit of cost to each selected SNP, and spread this cost among the grids in [3]. In other words, each grid at the ith row and jth column is assigned a cost (see Figure 8), where
Figure 8. Analysis of the first greedy algorithm. This figure shows the cost
of each grid for the first greedy algorithm.
Since each selected SNP is assigned 1 unit of cost, the sum of for each grid in the table is equal to C',
i.e.,
Let be the number of uncovered grids in the ith row before the kth iteration (i.e., (k  1) SNPs have been selected by the algorithm). For example (see Figure 8), since two grids in the first row are still uncovered before the second SNP is selected. Define as the set of iterations used by the algorithm when working on the ith row. For example (see Figure 8), since this algorithm works on the second row in the third and fourth iterations. We can rewrite (1) as
Lemma 3. The kth selected SNP has .
Proof. Suppose the algorithm is working on the ith row at the beginning of the kth iteration. Let be the set of SNPs in C* (the optimal solution) that has been selected by the algorithm before the kth iteration, and the set of remaining SNPs in C* be . We claim that there exists a SNP in which can cover at least grids in the ith row. Otherwise (i.e., each SNP in covers less than grids), all SNPs in will cover less than grids in the ith row. But since , this implies that C* can not cover all grids in , which is a contradiction. Because all SNPs in are candidates to the greedy algorithm, the kth selected SNP must cover at least grids in the ith row, which implies since C* ≥  and . □
Theorem 2. The first greedy algorithm gives a solution of (m + 1) approximation.
Proof. Define the dth harmonic number as and H(0) = 0. By (2) and Lemma 3,
By (1) and (3), we get
The second greedy algorithm
This section describes the second greedy algorithm which returns a solution of better approximation than that of the first greedy algorithm. Let R_{i }be the set of uncovered grids at the ith row. Unlike the rowbyrow manner of the first greedy algorithm, this algorithm greedily selects a SNP that covers most uncovered grids in the table (i.e., its reformulated set S' maximizing S' ∩ (R_{1 }∪ ... ∪ R_{m+1})). Let T be the collection of R_{i }(i.e., T is the set of all uncovered grids in the table). If the grids in the ith row are all covered (i.e., R_{i }= φ), R_{i }is removed from T. This algorithm runs until T = φ (i.e., all grids in the table are covered).
Figure 9 illustrates an example for this algorithm with m set to 1. The SNPs S_{1}, S_{2}, S_{4}, and S_{5 }are selected in order. Since this algorithm runs until all grids are covered, the set of SNPs in this table is able to tolerate m missing tag SNPs. The pseudo code of this algorithm is given below.
Figure 9. An example of the second greedy algorithm. The SNPs S_{1}, S_{2}, S_{4}, and S_{5 }are selected by the second greedy algorithm. (A) The table that stores each selected SNP.
Algorithm: SECONDGREEDYALGORITHM (C, P, m)
1 R_{i }← P, ∀ i ∈ [1, m + 1]
2 T ← {R_{1}, R_{2},... ,R_{m+1}}
3 C' ← φ
4 while T ≠ φ do
5 select and remove a SNP S from C that maximizes S' ∩ (R_{1 }∪ ... ∪ R_{m+1})
6 C' ← C' ∪ S
7 for each R_{i }∈ T and S' ≠ ø do
8 S_{tmp }← S' ∩ R_{i }// S_{tmp }is a temporary variable for holding the result of S' ∩ R_{i}
9 R_{i }← R_{i } S_{tmp}
10 S' ← S'  S_{tmp}
11 if R_{i }= φ then T ← T  R_{i}
12 endfor
13 end while
14 return C'
The time complexity of this algorithm is analyzed as follows. At Line 4, the number of iterations of the loop is bounded by O(T) = O(mP). Within the loop, Line 5 takes O(CP) time because we need to check each SNP in C and examine if it can cover any uncovered grid in each column. The inner loop (Lines 7–12) is bounded by O(S') <O(P). Thus, the running time of this program is O(mCP^{2}).
We now evaluate the solution returned by the second greedy algorithm. Let C' and C* be the set of SNPs selected by this algorithm and the optimal solution, respectively. Let  be the number of grids in the table covered by the kth selected SNP. For example (see Figure 9),  = 4 since the second selected SNP (i.e., S_{2}) covers four grids in the table. Define T_{k }as the number of uncovered grids in the table before the kth iteration. We have the following lemma similar to Lemma 3.
Lemma 4. The kth selected SNP has .
Proof. The proof is similar to that of Lemma 3. Let be the set of remaining SNPs in C* which has not been selected before the kth iteration. We claim that there exists a SNP in which can cover at least grids in the table. Otherwise, we can get the same contradiction (i.e., C* fails to cover all grids) as in Lemma 3. Since C* ≥  and T_{k1 }≤ T_{k}, we have . □
Theorem 3. The second greedy algorithm gives a solution of approximation.
Proof. Each grid at the ith row and jth column is assigned a cost (see Figure 10) if it is covered by the kth selected SNP. The sum of for each grid is
Figure 10. Analysis of the second greedy algorithm. This figure shows the cost
of each grid for the second greedy algorithm.
By (4), we have
The iterative LPrelaxation algorithm
In practice, a probabilistic approach is sometimes more useful since the randomization can explore different solutions. In this section, we reformulate the MRTS problem to an Integer Programming (IP) problem. Based on the IP problem, we propose an iterative Linear Programming (LP)relaxation algorithm. The iterative LPrelaxation algorithm is described below.
Step 1. Given a haplotype block containing N SNPs and K haplotype patterns. Let {x_{1},x_{2}, ...,x_{N}} be the set of integer variables for the N SNPs, where x_{k }= 1 if the SNP S_{k }is selected and x_{k }= 0 otherwise. Define D(P_{i}, P_{j}) as the set of SNPs which are able to distinguish P_{i }and P_{j }patterns. By Lemma 1, to allow at most m missing SNPs, each pair of patterns must be distinguished by at least (m + 1) SNPs. Therefore, for each set D(P_{i}, P_{j}), at least (m + 1) SNPs have to be selected to distinguish P_{i }and P_{j }patterns. As a consequence, the MRTS problem can be formulated as the following IP problem:
Step 2. Since solving the IP problem is NPhard [6], we relax the integer constraint of x_{k}, and the IP problem becomes an LP problem defined as follows:
The above LP problem can be solved in polynomial time by efficient algorithms such as the interior point method (Forsgren et al., 2002) [5].
Step 3. Let {y_{1}, y_{2}, ..., y_{N}} be the set of linear solutions obtained from (6), where 0 ≤ y_{k }≤ 1. We construct the corresponding integer solutions {x_{1}, x_{2}, ..., x_{N}} by the following randomized rounding method:
Note that the constructed integer solutions do not necessary satisfy all inequalities in (5). The randomized rounding method simply assigns x_{k }to 1 or 0 using the value of y_{k }as the likelihood, regardless of the inequalities in (5).
Step 4. We check whether the integer solutions constructed in Step 3 satisfy all inequalities in (5) or not.
Case 1. If some inequalities in (5) are still unsatisfied, we repeat Steps 1, 2, and 3 only for those unsatisfied inequalities until all of them are satisfied.
Case 2. If all inequalities in (5) are satisfied, we construct a final solution by setting x_{k }= 1 if x_{k }is assigned to 1 in any one of the iterations and setting x_{k }= 0 otherwise.
We now evaluate the solution returned by the iterative LPrelaxation algorithm. The selection of each SNP is considered as a Bernoulli random variable x_{k }taking values 1 (or 0) with probability y_{k }(or 1  y_{k}). Let X_{i,j }be the sum of random variables in one inequality of (5), i.e.,
By (6), the expected value of X_{i,j }(after randomized rounding) is
Lemma 5. The probability that an inequality in (5) is not satisfied after randomized rounding is less than .
Proof. The probability that an inequality in (5) is not satisfied is P[X_{i,j }<m + 1] = P[X_{i,j }≤ m]. By the Chernoff bound (i.e., P[X ≤ (1  θ) E[X]] ≤ ), we have
By (7), we know E[X_{i,j}] ≤ m + 1. Since the righthand side of (8) decreases when E[X_{i,j}] > m, we can replace E[X_{i},_{j}] with (m + 1) to obtain an upper bound, i.e.,
Theorem 4. The iterative LPrelaxation algorithm gives a solution of O(m ln K) approximation.
Proof. Suppose this algorithm runs for t iterations. The probability that all inequalities in (5) are satisfied after t iterations is
When t = 2(m + 1) , the algorithm stops and returns a solution with probability e^{1}. Define OPT(IP) and OPT(LP) as the optimal solutions of the IP problem and the LP problem, respectively. Since the solution space of LP includes that of IP,
OPT(LP) ≤ OPT(IP).
Let the set of solutions returned in t iterations be {Z_{1}, Z_{2},...,Z_{t}}.
Note that we repeat this algorithm only for those unsatisfied inequalities. Thus, E[Z_{1}] ≥ E[Z_{2}] ≥ ... ≥ E[Z_{t}]. Let x_{p }denote the final solution obtained in Step 4. The expected final solution is
With a high probability, the iterative LPrelaxation algorithm stops after O(m ln K) iterations and finds a solution of O(m ln K) approximation. □
An algorithm for finding auxiliary tag SNPs
This section describes an algorithm for finding auxiliary tag SNPs assuming robust tag SNPs have been computed in advance. Given a haplotype block M_{h }containing N SNPs and K haplotypes, we define C_{tag }⊆ C as the set of tag SNPs genotyped from a haplotype sample with some missing data. This haplotype sample may fail to be distinguished because of the ambiguity caused by missing data. We wish to find the minimum number of auxiliary tag SNPs from the remaining SNPs in the block to resolve the ambiguity. A formal definition of this problem is given below.
Problem: Minimum Auxiliary Tag SNPs (MATS)
Input: An N × K matrix M_{h}, and a set of SNPs C_{tag }genotyped from a sample with missing data.
Output: The minimum subset of SNPs C_{aux }⊆ C  C_{tag }such that each pair of ambiguous patterns can be distinguished by SNPs in C_{aux}.
The following theorem shows the NPhardness of the MATS problem.
Theorem 5. The MATS problem is NPhard.
Proof. Consider that all SNPs in C_{tag }are missing. This special case of the MATS problem becomes finding the minimum tag SNPs from C  C_{tag}, which is already known to be NPhard [17]. Therefore, MATS is also NPhard. □
Although the MATS problem is NPhard, we show that auxiliary tag SNPs can be found efficiently when robust tag SNPs have been computed in advance. Without loss of generality, assume that these robust tag SNPs are stored in an (m + 1) × P table T_{r }(see Figure 11 (A)).
Figure 11. An example of finding auxiliary tag SNPs. The SNP S_{1 }is missing and SNP S_{4 }is the auxiliary tag SNP for h_{2}. (A) The table that stores the set of robust tag SNPs.
Step 1. The patterns that match the haplotype sample are stored into a set A. For example (see Figure 11), if we genotype SNPs S_{1}, S_{2}, and S_{3 }for the sample h_{2 }and the SNP S_{1 }is missing, patterns P_{1 }and P_{3 }both match h_{2}. Thus, A = {P_{1}, P_{3}}
Step 2. If A = 1, the sample is identified unambiguously and we are done (e.g., h_{1 }in Figure 11). If A > 1 (e.g., h_{2}), for each pair of ambiguous patterns in A (e.g., P_{1 }and P_{3}), traverse the corresponding column in T_{r}, find the next unused SNP (e.g., S_{4}), and add the SNP to C_{aux}. As a result, the SNPs in C_{aux }can distinguish each pair of ambiguous patterns, which are the auxiliary tag SNPs for the haplotype sample.
The worst case of this algorithm is that all SNPs in C_{tag }are missing data, and we need to traverse each column in T_{r}. Thus, the running time of this algorithm is O(T_{r}) = O(mP).
Authors' contributions
YTH and KMC design and implement the greedy algorithms. KZ and TC design and implement the iterative LP relaxation algorithm. All authors write and approve the manuscript.
Acknowledgements
We thank the referees for their valuable comments that resulted in numerous improvements in the presentation. YaoTing Huang and KunMao Chao were supported in part by NSC grants 932213E002029 and 942213E002091 from the National Science Council, Taiwan. Ting Chen was supported in part by NIH CEGS: Implications of Haplotype Structure in the Human Genome, Grant No. P50 HG002790.
References

Bafna V, Halldórsson BV, Schwartz R, Clark AG, Istrail S: Haplotypes and informative SNP selection algorithms: don't block out information.

Carlson CS, Eberle MA, Rieder MJ, Yi Q, Kruglyak L, Nickerson DA: Selecting a maximally informative set of singlenucleotide polymorphisms for association analyses using linkage disequilibrium.
Am J Hum Genet 2004, 74:106120. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Cormen TH, Leiserson CE, Rivest RL, Stein C: Introduction to algorithms. The MIT Press; 2001.

Daly MJ, Rioux JD, Schaffner SF, Hudson TJ, Lander ES: Highresolution haplotype structure in the human genome.
Nat Genet 2001, 29(2):229232. PubMed Abstract  Publisher Full Text

Forsgren A, Gill PE, Wright MH: Interior methods for nonlinear optimization.

Garey MR, Johnson DS: Computers and intractability. Freeman, New York; 1979.

Halldórsson BV, Bafna V, Lippert R, Schwartz R, Vega FM, Clark AG, Istrail S: Optimal haplotype blockfree selection of tagging SNPs for genomewide association studies.
Genome Research 2004, 16331640. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Halperin E, Eskin E: Haplotype reconstruction from genotype data using imperfect phylogeny.

Hinds DA, Stuve LL, Nilsen GB, Halperin E, Eskin E, Ballinger DG, Frazer KA, Cox DR: Wholegenome patterns of common DNA variation in three human populations.
Science 2005, 307:10721079. PubMed Abstract  Publisher Full Text

Hudson RR: Generating samples under a WrightFisher neutral model of genetic variation.
Bioinformatics 2002, 18:337338. PubMed Abstract  Publisher Full Text

LP Solve [http://www.cs.sunysb.edu/~algorith/implement/lpsolve/implement.shtml] webcite

Niu T, Qin Z, Xu X, Liu JS: Bayesian haplotype inference for multiple linked singlenucleotide polymorphisms.
Am J Hum Genet 2002, 70:157159. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Patil N, Berno AJ, Hinds DA, Barrett WA, Doshi JM, Hacker CR, Kautzer CR, Lee DH, Marjoribanks C, McDonough DP, Nguyen BT, Norris MC, Sheehan JB, Shen N, Stern D, Stokowski RP, Thomas DJ, Trulson MO, Vyas KR, Frazer KA, Fodor SP, Cox DR: Blocks of limited haplotype diversity revealed by highresolution scanning of human chromosome 21.
Science 2001, 294:17191723. PubMed Abstract  Publisher Full Text

Stephens M, Donnelly P: A comparison of bayesian methods for haplotype reconstruction from population genotype data.
Am J Hum Genet 2003, 73:11621169. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Wang L, Xu Y: Haplotype inference by maximum parsimony.
Bioinformatics 2003, 19(14):17731780. PubMed Abstract  Publisher Full Text

Yang Y, Zhang J, Hoh J, Matsuda F, Xu P, Lathrop M, Ott J: Efficiency of singlenucleotide polymorphism haplotype estimation from pooled DNA.

Zhang K, Deng M, Chen T, Waterman MS, Sun F: A dynamic programming algorithm for haplotype partitioning.
Proc Nat Acad Sci 2002, 99(11):73357339. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Zhang K, Sun F, Waterman MS, Chen T: Haplotype block partition with limited resources and applications to human chromosome 21 haplotype data.
Am J Hum Genet 2003, 73:6373. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Zhang K, Qin ZS, Liu JS, Chen T, Waterman MS, Sun F: Haplotype block partition and tag SNP selection using genotype data and their applications to association studies.
Genome Research 2004, 14:908916. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Zhao JH, Lissarrague S, Essioux L, Sham PC: GENECOUNTING: haplotype analysis with missing genotypes.
Bioinformatics 2002, 18:16941695. PubMed Abstract  Publisher Full Text