<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1471-2105-6-263</ui>
   <ji>1471-2105</ji>
   <fm>
      <dochead>Methodology article</dochead>
      <bibl>
         <title>
            <p>Selecting additional tag SNPs for tolerating missing data in genotyping</p>
         </title>
         <aug>
            <au id="A1">
               <snm>Huang</snm>
               <fnm>Yao-Ting</fnm>
               <insr iid="I1"/>
               <email>ythuang@acb.csie.ntu.edu.tw</email>
            </au>
            <au id="A2">
               <snm>Zhang</snm>
               <fnm>Kui</fnm>
               <insr iid="I3"/>
               <email>kzhang@ms.soph.uab.edu</email>
            </au>
            <au id="A3">
               <snm>Chen</snm>
               <fnm>Ting</fnm>
               <insr iid="I4"/>
               <email>tingchen@usc.edu</email>
            </au>
            <au id="A4" ca="yes">
               <snm>Chao</snm>
               <fnm>Kun-Mao</fnm>
               <insr iid="I1"/>
               <insr iid="I2"/>
               <email>kmchao@csie.ntu.edu.tw</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan</p>
            </ins>
            <ins id="I2">
               <p>Graduate Institute of Networking and Multimedia, National Taiwan University, Taipei, Taiwan</p>
            </ins>
            <ins id="I3">
               <p>Section on Statistical Genetics, Department of Biostatistics, University of Alabama at Birmingham, USA</p>
            </ins>
            <ins id="I4">
               <p>Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA</p>
            </ins>
         </insg>
         <source>BMC Bioinformatics</source>
         <issn>1471-2105</issn>
         <pubdate>2005</pubdate>
         <volume>6</volume>
         <issue>1</issue>
         <fpage>263</fpage>
         <url>http://www.biomedcentral.com/1471-2105/6/263</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">16259642</pubid>
               <pubid idtype="doi">10.1186/1471-2105-6-263</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>26</day>
               <month>5</month>
               <year>2005</year>
            </date>
         </rec>
         <acc>
            <date>
               <day>01</day>
               <month>11</month>
               <year>2005</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>01</day>
               <month>11</month>
               <year>2005</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2005</year>
         <collab>Huang et al; licensee BioMed Central Ltd.</collab>
         <note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>Recent studies have shown that the patterns of linkage disequilibrium observed in human populations have a block-like structure, and a small subset of SNPs (called tag SNPs) is sufficient to distinguish each pair of haplotype patterns in the block. In reality, some tag SNPs may be missing, and we may fail to distinguish two distinct haplotypes due to the ambiguity caused by missing data.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>We show there exists a subset of SNPs (referred to as robust tag SNPs) which can still distinguish all distinct haplotypes even when some SNPs are missing. The problem of finding minimum robust tag SNPs is shown to be NP-hard. To find robust tag SNPs efficiently, we propose two greedy algorithms and one linear programming relaxation algorithm. The experimental results indicate that (1) the solutions found by these algorithms are quite close to the optimal solution; (2) the genotyping cost saved by using tag SNPs can be as high as 80%; and (3) genotyping additional tag SNPs for tolerating missing data is still cost-effective.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusion</p>
               </st>
               <p>Genotyping robust tag SNPs is more practical than genotyping only the minimum tag SNPs when missing data cannot be avoided. Our theoretical analysis and experimental results show that our algorithms are efficient and that the solutions they find are close to the optimal solution.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>In recent years, <it>Single Nucleotide Polymorphisms </it>(SNPs) have become the preferred marker for association studies of genetic diseases or traits. A set of linked SNPs on one chromosome is called a <it>haplotype</it>. Recent studies have shown that the patterns of <it>Linkage Disequilibrium </it>(LD) observed in human populations have a block-like structure <abbrgrp><abbr bid="B4">4</abbr><abbr bid="B13">13</abbr></abbrgrp>. Chromosome recombination takes place only in certain low-LD regions called recombination hotspots. The high-LD region between these hotspots is often referred to as a "haplotype block." Within a haplotype block, little or no recombination has occurred, and the SNPs in the block tend to be inherited together. Due to the low haplotype diversity within a block, the information carried by these SNPs is highly redundant. Thus, a small subset of SNPs (called "tag SNPs") is sufficient to distinguish each pair of patterns in the block <abbrgrp><abbr bid="B7">7</abbr><abbr bid="B13">13</abbr><abbr bid="B17">17</abbr><abbr bid="B18">18</abbr><abbr bid="B19">19</abbr></abbrgrp>. Haplotype blocks with corresponding tag SNPs are useful and cost-effective for association studies because they do not require genotyping all SNPs. Many studies have tried to find the minimum set of tag SNPs in a haplotype block. In a large-scale study of human Chromosome 21, Patil <it>et al</it>. <abbrgrp><abbr bid="B13">13</abbr></abbrgrp> developed a greedy algorithm to partition the haplotypes into 4,135 blocks with 4,563 tag SNPs. Zhang <it>et al</it>. <abbrgrp><abbr bid="B17">17</abbr><abbr bid="B18">18</abbr><abbr bid="B19">19</abbr></abbrgrp> used a dynamic programming approach to reduce the numbers of blocks and tag SNPs to 2,575 and 3,562, respectively. Bafna <it>et al</it>. <abbrgrp><abbr bid="B1">1</abbr></abbrgrp> showed that the problem of minimizing tag SNPs is NP-hard and gave efficient algorithms for special cases of this problem.</p>
         <fig id="F1">
            <title>
               <p>Figure 1</p>
            </title>
            <caption>
               <p>The influence of missing data and auxiliary tag SNPs</p>
            </caption>
            <text>
               <p><b>The influence of missing data and auxiliary tag SNPs</b>. (A) A haplotype block defined by 12 SNPs and 4 haplotype patterns. Each column represents a haplotype pattern and each row represents a SNP locus. The black and grey boxes stand for the major and minor alleles at each SNP locus, respectively. (B) Tag SNPs genotyped without missing data. (C) Tag SNPs genotyped with missing data. (D) The auxiliary tag SNP <it>S</it><sub>5 </sub>for <it>h</it><sub>2</sub>. (E) The auxiliary tag SNP <it>S</it><sub>8 </sub>for <it>h</it><sub>3</sub>.</p>
            </text>
            <graphic file="1471-2105-6-263-1" hint_layout="single"/>
         </fig>
         <p>In practice, a SNP may fail to be genotyped and is then treated as missing data (i.e., we fail to obtain the allele configuration of the SNP) if it does not pass the threshold of data quality <abbrgrp><abbr bid="B13">13</abbr><abbr bid="B16">16</abbr><abbr bid="B19">19</abbr><abbr bid="B20">20</abbr></abbrgrp>. These missing data may cause ambiguity when using the minimum set of tag SNPs to identify an unknown haplotype sample. Figure <figr fid="F1">1</figr> illustrates the influence of missing data when identifying haplotype samples. In this figure, a haplotype block (see Figure <figr fid="F1">1 (A)</figr>) defined by 12 SNPs and 4 haplotype patterns is presented (from the public haplotype data of human Chromosome 21 <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>). We follow the same assumption as previous studies that all SNPs are diallelic (i.e., taking on only two values) <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B13">13</abbr></abbrgrp>. Suppose we select SNPs <it>S</it><sub>1 </sub>and <it>S</it><sub>12 </sub>as tag SNPs. The haplotype sample <it>h</it><sub>1 </sub>is identified as haplotype pattern <it>P</it><sub>3 </sub>unambiguously (see Figure <figr fid="F1">1 (B)</figr>). Now consider haplotype samples <it>h</it><sub>2 </sub>and <it>h</it><sub>3</sub>, each with one missing tag SNP (see Figure <figr fid="F1">1 (C)</figr>). <it>h</it><sub>2 </sub>can be identified as haplotype pattern <it>P</it><sub>2 </sub>or <it>P</it><sub>3</sub>, and <it>h</it><sub>3 </sub>can be identified as <it>P</it><sub>1 </sub>or <it>P</it><sub>3</sub>. Consequently, missing tag SNPs cause ambiguity when identifying unknown haplotype samples.</p>
         <p>Although we cannot avoid the occurrence of missing data, the remaining SNPs within the haplotype block may provide abundant information to resolve the ambiguity. For example, if we re-genotype an additional SNP <it>S</it><sub>5 </sub>for <it>h</it><sub>2 </sub>(see Figure <figr fid="F1">1 (D)</figr>), <it>h</it><sub>2 </sub>is identified as haplotype pattern <it>P</it><sub>3 </sub>unambiguously. Similarly, if SNP <it>S</it><sub>8 </sub>is re-genotyped (see Figure <figr fid="F1">1 (E)</figr>), <it>h</it><sub>3 </sub>is also identified unambiguously. These additional SNPs are referred to as "auxiliary tag SNPs," which can be found from the remaining SNPs in the block and are able to resolve the ambiguity caused by missing data.</p>
         <p>Alternatively, instead of re-genotyping auxiliary tag SNPs whenever missing data are encountered, we work on a set of SNPs which is not affected by the occurrence of missing data. Figure <figr fid="F2">2</figr> illustrates a set of SNPs which can tolerate one missing SNP. Suppose we select SNPs <it>S</it><sub>1</sub>, <it>S</it><sub>5</sub>, <it>S</it><sub>8</sub>, and <it>S</it><sub>12 </sub>to be genotyped. Note that no matter which SNP is missing, each of the 16 missing patterns (4 haplotype patterns, each with one of the 4 SNPs missing) can be distinguished by the remaining three SNPs. Therefore, all haplotype samples with one missing SNP can still be identified unambiguously. We refer to these SNPs as "robust tag SNPs," which are able to tolerate a certain amount of missing data. The important feature of robust tag SNPs is that although they consume more SNPs than the "tag SNPs" defined in previous studies, they guarantee that all haplotype samples with up to a given number of missing SNPs can be distinguished unambiguously. When missing data occur frequently, robust tag SNPs can reduce the cost of re-genotyping.</p>
         <fig id="F2">
            <title>
               <p>Figure 2</p>
            </title>
            <caption>
               <p>The robust tag SNPs</p>
            </caption>
            <text>
               <p><b>The robust tag SNPs</b>. A set of robust tag SNPs for tolerating one missing tag SNP.</p>
            </text>
            <graphic file="1471-2105-6-263-2" hint_layout="single"/>
         </fig>
         <p>This paper focuses on the problem of finding robust tag SNPs to tolerate a certain amount of missing data. Throughout this paper, we denote by <it>m </it>the maximum number of missing SNPs to be tolerated, which corresponds to different missing rates in different genotyping experiments. We wish to find a minimum set of robust tag SNPs which can distinguish each pair of haplotypes even when up to <it>m </it>SNPs are missing. We assume that the haplotype phases and block partition are available as the input. Numerous methods have been developed to infer haplotypes from genotype data <abbrgrp><abbr bid="B12">12</abbr><abbr bid="B14">14</abbr><abbr bid="B15">15</abbr></abbrgrp>. Several algorithms have also been proposed to find the block partition <abbrgrp><abbr bid="B4">4</abbr><abbr bid="B13">13</abbr><abbr bid="B17">17</abbr></abbrgrp>. The problem of finding minimum robust tag SNPs is shown to be NP-hard (see Theorem 1). To find robust tag SNPs efficiently, we propose two greedy algorithms and one linear programming (LP) relaxation algorithm. The proposed algorithms have been implemented and tested on a variety of simulated and empirical data. We also analyze the efficiency and solutions of these algorithms. An algorithm for finding auxiliary tag SNPs is described assuming robust tag SNPs have been computed in advance.</p>
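         <p>The robustness requirement can be checked directly. The following is a minimal Python sketch, not the authors' implementation. It assumes that haplotype patterns are encoded as equal-length strings of 0 (major allele) and 1 (minor allele), and it relies on the observation that a SNP set tolerates any <it>m </it>missing SNPs exactly when every pair of patterns differs at <it>m </it>+ 1 or more of the selected SNPs. The function name is_robust and the toy patterns are hypothetical and are not taken from Figure 1.</p>

```python
from itertools import combinations

# Hypothetical toy block: 4 haplotype patterns over 6 diallelic SNPs
# ("0" = major allele, "1" = minor allele). Not the data of Figure 1.
PATTERNS = ["000000", "010110", "101010", "111100"]


def is_robust(patterns, snps, m):
    """Return True if the chosen SNP indices tolerate any m missing SNPs.

    Assumption: this holds exactly when every pair of patterns differs
    at m + 1 or more of the chosen SNPs, so that removing any m of them
    still leaves a SNP that separates the pair.
    """
    snps = list(snps)
    for p, q in combinations(patterns, 2):
        differing = sum(1 for s in snps if p[s] != q[s])
        if m >= differing:
            return False
    return True


if __name__ == "__main__":
    print(is_robust(PATTERNS, [0, 1], 0))        # True
    print(is_robust(PATTERNS, [0, 1], 1))        # False
    print(is_robust(PATTERNS, [0, 1, 2, 3], 1))  # True
```

         <p>In this toy block, the set {0, 1} distinguishes all patterns only when no SNP is missing, whereas {0, 1, 2, 3} still does so after any one of its SNPs is lost.</p>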
      </sec>
      <sec>
         <st>
            <p>Results</p>
         </st>
         <p>We propose two greedy algorithms which select the robust tag SNPs one by one in different greedy manners. In addition, we reformulate this problem as an integer programming problem and design an LP-relaxation algorithm to solve this problem. The greedy and LP-relaxation algorithms are able to find solutions within factors of (<it>m </it>+ 1) <m:math name="1471-2105-6-263-i1" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mtext>ln</m:mtext><m:mfrac><m:mrow><m:mi>K</m:mi><m:mo stretchy="false">(</m:mo><m:mi>K</m:mi><m:mo>&#8722;</m:mo><m:mn>1</m:mn><m:mo stretchy="false">)</m:mo></m:mrow><m:mn>2</m:mn></m:mfrac></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacqqGSbaBcqqGUbGBdaWcaaqaaiabdUealjabcIcaOiabdUealjabgkHiTiabigdaXiabcMcaPaqaaiabikdaYaaaaaa@363F@</m:annotation></m:semantics></m:math>, <m:math name="1471-2105-6-263-i2" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mtext>ln((</m:mtext><m:mi>m</m:mi><m:mo>+</m:mo><m:mn>1</m:mn><m:mo stretchy="false">)</m:mo><m:mfrac><m:mrow><m:mi>K</m:mi><m:mo stretchy="false">(</m:mo><m:mi>K</m:mi><m:mo>&#8722;</m:mo><m:mn>1</m:mn><m:mo stretchy="false">)</m:mo></m:mrow><m:mn>2</m:mn></m:mfrac><m:mo stretchy="false">)</m:mo></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacqqGSbaBcqqGUbGBcqqGOaakcqqGOaakcqWGTbqBcqGHRaWkcqaIXaqmcqGGPaqkdaWcaaqaaiabdUealjabcIcaOiabdUealjabgkHiTiabigdaXiabcMcaPaqaaiabikdaYaaacqGGPaqkaaa@3CD6@</m:annotation></m:semantics></m:math>, and <it>O</it>(<it>m </it>ln <it>K</it>) of the optimal solution respectively, where <it>m </it>is the maximum number of missing SNPs allowed and <it>K </it>is the number of haplotype patterns in the block.</p>
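         <p>To make the idea of selecting robust tag SNPs one by one concrete, the sketch below shows a generic greedy multi-cover heuristic in Python. It is an illustrative assumption and is not claimed to reproduce either of the two greedy algorithms exactly. Each pair of haplotype patterns is given a residual demand of <it>m </it>+ 1, and at every step the SNP that separates the most pairs with unmet demand is selected. The function name greedy_robust_tags and the toy patterns are hypothetical.</p>

```python
from itertools import combinations

# Hypothetical toy block: 4 haplotype patterns over 6 diallelic SNPs.
PATTERNS = ["000000", "010110", "101010", "111100"]


def greedy_robust_tags(patterns, m):
    """Greedily pick SNP indices until every pair of patterns is
    separated by m + 1 chosen SNPs; return None if that is impossible."""
    n_snps = len(patterns[0])
    pairs = list(combinations(range(len(patterns)), 2))
    need = {pair: m + 1 for pair in pairs}  # residual demand per pair
    remaining = set(range(n_snps))
    chosen = []

    def gain(s):
        # Number of pairs with unmet demand that SNP s separates.
        return sum(1 for (i, j) in pairs
                   if need[(i, j)] > 0 and patterns[i][s] != patterns[j][s])

    while any(need.values()):
        # Ties are broken in favour of the smallest SNP index.
        best = max(sorted(remaining), key=gain, default=None)
        if best is None or gain(best) == 0:
            return None  # no remaining SNP helps: block is too short
        chosen.append(best)
        remaining.discard(best)
        for (i, j) in pairs:
            if need[(i, j)] > 0 and patterns[i][best] != patterns[j][best]:
                need[(i, j)] -= 1
    return sorted(chosen)


if __name__ == "__main__":
    for m in range(4):
        print(m, greedy_robust_tags(PATTERNS, m))
```

         <p>On this toy block the heuristic returns larger SNP sets as <it>m </it>grows and reports that no solution exists once the block no longer contains enough informative SNPs, which mirrors the behaviour described for short blocks.</p>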
         <p>We have implemented the first and second greedy algorithms in JAVA [see Additional files <supplr sid="S1">1</supplr> and <supplr sid="S2">2</supplr>]. The LP-relaxation algorithm has been implemented in Perl [see <supplr sid="S3">Additional file 3</supplr>], where the LP problem is solved via a program called "lp_solve" <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>. The LP-relaxation algorithm is a randomized method. Thus, the program is run 10 times to explore different solutions, and the best solution among them is chosen as the output.</p>
         <suppl id="S1">
            <title>
               <p>Additional File 1</p>
            </title>
            <text>
               <p><b>The program for the first greedy algorithm</b>. The Greedy1.zip file is compressed using WinZip and contains the JAVA source code for the first greedy algorithm.</p>
            </text>
            <file name="1471-2105-6-263-S1.zip">
               <p>Click here for file</p>
            </file>
         </suppl>
         <suppl id="S2">
            <title>
               <p>Additional File 2</p>
            </title>
            <text>
               <p><b>The program for the second greedy algorithm</b>. The Greedy2.zip file is compressed using WinZip and contains the JAVA source code for the second greedy algorithm.</p>
            </text>
            <file name="1471-2105-6-263-S2.zip">
               <p>Click here for file</p>
            </file>
         </suppl>
         <suppl id="S3">
            <title>
               <p>Additional File 3</p>
            </title>
            <text>
               <p><b>The program for the iterative LP-relaxation algorithm</b>. The ILP.zip file is compressed using WinZip and contains the Perl script for the iterative LP-relaxation algorithm.</p>
            </text>
            <file name="1471-2105-6-263-S3.zip">
               <p>Click here for file</p>
            </file>
         </suppl>
         <p>In order to evaluate the solutions and efficiency of our algorithms, we also implement a program in JAVA (referred to as "OPT") which uses a brute force method to find the optimal solution. For a given data set of <it>N </it>SNPs, the OPT program examines all possible solutions (i.e., all subsets of <m:math name="1471-2105-6-263-i3" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mrow><m:mo>(</m:mo><m:mrow><m:mtable><m:mtr><m:mtd><m:mi>N</m:mi></m:mtd></m:mtr><m:mtr><m:mtd><m:mn>1</m:mn></m:mtd></m:mtr></m:mtable></m:mrow><m:mo>)</m:mo></m:mrow></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaadaqadaqaauaabeqaceaaaeaacqWGobGtaeaacqaIXaqmaaaacaGLOaGaayzkaaaaaa@3059@</m:annotation></m:semantics></m:math>, <m:math name="1471-2105-6-263-i4" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mrow><m:mo>(</m:mo><m:mrow><m:mtable><m:mtr><m:mtd><m:mi>N</m:mi></m:mtd></m:mtr><m:mtr><m:mtd><m:mn>2</m:mn></m:mtd></m:mtr></m:mtable></m:mrow><m:mo>)</m:mo></m:mrow></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaadaqadaqaauaabeqaceaaaeaacqWGobGtaeaacqaIYaGmaaaacaGLOaGaayzkaaaaaa@305B@</m:annotation></m:semantics></m:math>, ..., and <m:math name="1471-2105-6-263-i5" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mrow><m:mo>(</m:mo><m:mrow><m:mtable><m:mtr><m:mtd><m:mi>N</m:mi></m:mtd></m:mtr><m:mtr><m:mtd><m:mi>N</m:mi></m:mtd></m:mtr></m:mtable></m:mrow><m:mo>)</m:mo></m:mrow></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaadaqadaqaauaabeqaceaaaeaacqWGobGtaeaacqWGobGtaaaacaGLOaGaayzkaaaaaa@308E@</m:annotation></m:semantics></m:math>). The smallest subset of SNPs that can tolerate <it>m </it>missing SNPs is chosen as the output. Due to the NP-hardness of this problem, the OPT program fails to output the optimal solution within a reasonable period of time for many data sets. We therefore prune infeasible parts of the solution space to speed up this program, based on the following two observations: (1) solutions with at most <it>m </it>SNPs are infeasible since <it>m </it>SNPs might be missing; and (2) for a data set containing <it>K </it>haplotype patterns, the number of SNPs required to distinguish each of them is at least log <it>K </it>(see Lemma 2). As a result, we need to examine only subsets of <m:math name="1471-2105-6-263-i6" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mrow><m:mo>(</m:mo><m:mrow><m:mtable><m:mtr><m:mtd><m:mi>N</m:mi></m:mtd></m:mtr><m:mtr><m:mtd><m:mrow><m:mi>m</m:mi><m:mo>+</m:mo><m:mi>log</m:mi><m:mo>&#8289;</m:mo><m:mtext>&#8201;</m:mtext><m:mtext>&#8201;</m:mtext><m:mi>K</m:mi></m:mrow></m:mtd></m:mtr></m:mtable></m:mrow><m:mo>)</m:mo></m:mrow></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaadaqadaqaauaabeqaceaaaeaacqWGobGtaeaacqWGTbqBcqGHRaWkcyGGSbaBcqGGVbWBcqGGNbWzcaaMc8UaaGPaVlabdUealbaaaiaawIcacaGLPaaaaaa@3A01@</m:annotation></m:semantics></m:math>, <m:math name="1471-2105-6-263-i7" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mrow><m:mo>(</m:mo><m:mrow><m:mtable><m:mtr><m:mtd><m:mi>N</m:mi></m:mtd></m:mtr><m:mtr><m:mtd><m:mrow><m:mi>m</m:mi><m:mo>+</m:mo><m:mi>log</m:mi><m:mo>&#8289;</m:mo><m:mtext>&#8201;</m:mtext><m:mtext>&#8201;</m:mtext><m:mi>K</m:mi><m:mo>+</m:mo><m:mn>1</m:mn></m:mrow></m:mtd></m:mtr></m:mtable></m:mrow><m:mo>)</m:mo></m:mrow></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaadaqadaqaauaabeqaceaaaeaacqWGobGtaeaacqWGTbqBcqGHRaWkcyGGSbaBcqGGVbWBcqGGNbWzcaaMc8UaaGPaVlabdUealjabgUcaRiabigdaXaaaaiaawIcacaGLPaaaaaa@3BD3@</m:annotation></m:semantics></m:math> ..., and <m:math name="1471-2105-6-263-i5" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mrow><m:mo>(</m:mo><m:mrow><m:mtable><m:mtr><m:mtd><m:mi>N</m:mi></m:mtd></m:mtr><m:mtr><m:mtd><m:mi>N</m:mi></m:mtd></m:mtr></m:mtable></m:mrow><m:mo>)</m:mo></m:mrow></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaadaqadaqaauaabeqaceaaaeaacqWGobGtaeaacqWGobGtaaaacaGLOaGaayzkaaaaaa@308E@</m:annotation></m:semantics></m:math>. By searching possible solutions from small subsets to large ones, the OPT program can stop and output the optimal solution immediately when a subset that can tolerate <it>m </it>missing SNPs is found.</p>
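         <p>The pruned exhaustive search can be sketched in Python as follows. This is a minimal illustration rather than the OPT program itself. It assumes that log <it>K </it>is taken to base 2 (SNPs are diallelic) and rounded up, that patterns are equal-length strings of 0 and 1, and that a subset tolerates <it>m </it>missing SNPs when every pair of patterns differs at <it>m </it>+ 1 or more of its SNPs. The names tolerates and opt_robust_tags and the toy patterns are hypothetical.</p>

```python
from itertools import combinations
from math import ceil, log2

# Hypothetical toy block: 4 haplotype patterns over 6 diallelic SNPs.
PATTERNS = ["000000", "010110", "101010", "111100"]


def tolerates(patterns, snps, m):
    """True if every pair of patterns differs at m + 1 or more SNPs in snps."""
    return all(sum(p[s] != q[s] for s in snps) > m
               for p, q in combinations(patterns, 2))


def opt_robust_tags(patterns, m):
    """Enumerate subsets from small to large and return the first
    (hence smallest) one that tolerates m missing SNPs, or None."""
    n_snps = len(patterns[0])
    # Pruning: skip every size below m + ceil(log2 K), since after
    # losing m SNPs at least log2 K must remain to separate K patterns.
    start = m + ceil(log2(len(patterns)))
    for size in range(start, n_snps + 1):
        for subset in combinations(range(n_snps), size):
            if tolerates(patterns, subset, m):
                return list(subset)  # stop at the first feasible subset
    return None


if __name__ == "__main__":
    for m in range(4):
        print(m, opt_robust_tags(PATTERNS, m))
```

         <p>Because subsets are visited in order of increasing size, the first feasible subset is optimal. The number of subsets still grows exponentially with the number of SNPs, which is consistent with the long running times reported for blocks with many SNPs.</p>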
         <sec>
            <st>
               <p>Results on simulated data</p>
            </st>
            <p>Theoretically, all SNPs will reach complete linkage equilibrium after sufficient chromosome recombination takes place. We first generate 100 data sets containing short haplotypes which simulate this bottleneck model <abbrgrp><abbr bid="B12">12</abbr><abbr bid="B14">14</abbr><abbr bid="B15">15</abbr></abbrgrp>. Each data set consists of 10 haplotypes with 20 SNPs. These haplotypes are created by randomly assigning the major or minor alleles at each SNP locus. Let <it>m </it>be the number of missing SNPs allowed and <it>S</it><sub><it>a </it></sub>be the average number of robust tag SNPs over 100 data sets. Figure <figr fid="F3">3 (a)</figr> plots <it>S</it><sub><it>a </it></sub>with respect to <it>m </it>(roughly corresponding to SNP missing rates from 0% to 33%). When <it>m </it>= 0, all programs find the same number of SNPs as the optimal solution. The iterative LP-relaxation algorithm slightly outperforms the others as <it>m </it>increases. When <it>m </it>> 6, more than 20 SNPs are required to tolerate missing data. Thus, no data sets contain enough SNPs for solutions.</p>
            <fig id="F3">
               <title>
                  <p>Figure 3</p>
               </title>
               <caption>
                  <p>Experimental results on random data</p>
               </caption>
               <text>
                  <p><b>Experimental results on random data</b>. (a) Results from data sets containing 10 haplotypes and 20 SNPs. (b) Results from data sets containing 10 haplotypes and 40 SNPs.</p>
               </text>
               <graphic file="1471-2105-6-263-3" hint_layout="single"/>
            </fig>
            <p>We then generate 100 data sets containing long haplotypes. Each data set is composed of 10 haplotypes with 40 SNPs. Figure <figr fid="F3">3 (b)</figr> illustrates the experimental results on these long data sets (corresponding to SNP missing rates from 0% to 37%). The optimal solutions for <it>m </it>> 2 cannot be found by the OPT program within a reasonable period of time (even after one week of computation) and are not shown in this figure. This is because the possible solutions in long data sets are too numerous to enumerate. On the other hand, both the greedy and the iterative LP-relaxation algorithms run in polynomial time and always output a solution efficiently. In this experiment, both greedy algorithms slightly outperform the iterative LP-relaxation algorithm. In addition, the number of missing SNPs allowed is larger than that in the short data sets. For example, to tolerate 10 missing SNPs (i.e., <it>m </it>= 10), all programs output fewer than 28 SNPs. The remaining SNPs in each data set are still sufficient to tolerate more missing SNPs.</p>
            <p>Hudson (2002) <abbrgrp><abbr bid="B10">10</abbr></abbrgrp> provides a program which can simulate a set of haplotypes under the assumption of neutral evolution and uniformly distributed recombination rate using the coalescent model. We use Hudson's program to generate 100 short data sets with 10 haplotypes and 20 SNPs and 100 long data sets with 10 haplotypes and 40 SNPs. Figure <figr fid="F4">4 (a)</figr> shows the experimental results on Hudson's short data sets (corresponding to SNP missing rates from 0% to 23%). The number of missing SNPs allowed is smaller than that for random data. This is because Hudson's program generates coalescent haplotypes which are similar to each other. As a result, many SNPs cannot be used to distinguish haplotypes, and the number of tag SNPs is inadequate to tolerate a larger number of missing SNPs. In this experiment, we observe that the iterative LP-relaxation algorithm finds solutions quite close to the optimal solutions and slightly outperforms the other two algorithms.</p>
            <fig id="F4">
               <title>
                  <p>Figure 4</p>
               </title>
               <caption>
                  <p>Experimental results on Hudson's data</p>
               </caption>
               <text>
                  <p><b>Experimental results on Hudson's data</b>. (a) Results from data sets containing 10 haplotypes and 20 SNPs. (b) Results from data sets containing 10 haplotypes and 40 SNPs.</p>
               </text>
               <graphic file="1471-2105-6-263-4" hint_layout="single"/>
            </fig>
            <p>Figure <figr fid="F4">4 (b)</figr> illustrates the experimental results on long data sets generated by Hudson's program (corresponding to SNP missing rates from 0% to 29%). The optimal solutions for <it>m </it>> 2 again cannot be found by the OPT program within a reasonable period of time. In this experiment, the performance of the first greedy algorithm and that of the iterative LP-relaxation algorithm are similar, and both slightly outperform the second greedy algorithm as <it>m </it>becomes large.</p>
         </sec>
         <sec>
            <st>
               <p>Results on real data</p>
            </st>
            <p>We also test these programs on two real data sets: (1) public haplotype data of human Chromosome 21 released by Patil <it>et al</it>. <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>; and (2) a 500 KB region on human Chromosome 5q31, studied by Daly <it>et al</it>. <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>, which may contain a genetic variant related to Crohn's disease. Patil's data include 20 haplotypes of 24,047 SNPs spanning about 32.4 MB, which are partitioned into 4,135 haplotype blocks. By genotyping 103 SNPs with minor allele frequency of at least 5%, Daly <it>et al</it>. partition the 500 KB region into 11 haplotype blocks. The haplotype blocks in these real data sets contain different numbers of SNPs and haplotypes (e.g., from several SNPs to hundreds of SNPs). When <it>m </it>increases, some short blocks may not contain enough SNPs for tolerating missing data (e.g., <it>m </it>> the number of SNPs in a block). As a consequence, <it>S</it><sub><it>a </it></sub>here stands for the average number of robust tag SNPs over those blocks that still contain solutions.</p>
            <p>Figure <figr fid="F5">5 (a)</figr> shows the experimental results on Patil's 4,135 blocks. Because there are many long blocks in Patil's data (e.g., more than one hundred SNPs), the optimal solution for <it>m </it>> 2 cannot be found within a reasonable period of time. The experimental results indicate that all algorithms find a similar number of robust tag SNPs when <it>m </it>is small. The LP-relaxation algorithm slightly outperforms the others as <it>m </it>increases.</p>
            <fig id="F5">
               <title>
                  <p>Figure 5</p>
               </title>
               <caption>
                  <p>Experimental results on real data</p>
               </caption>
               <text>
                  <p><b>Experimental results on real data</b>. (a) Results from Patil's Chromosome 21 data, (b) Results from Daly's Chromosome 5q31 data.</p>
               </text>
               <graphic file="1471-2105-6-263-5" hint_layout="single"/>
            </fig>
            <p>Figure <figr fid="F5">5 (b)</figr> illustrates the experimental results on Daly's 11 blocks. Because the haplotype blocks partitioned by Daly <it>et al</it>. are very short (e.g., most blocks contain fewer than 12 SNPs), all optimal solutions can still be found. The solutions found by each algorithm are almost the same as the optimal solutions. Theoretically, <it>S</it><sub><it>a </it></sub>should grow monotonically as <it>m </it>increases. But due to the small number of blocks in Daly's data set, <it>S</it><sub><it>a </it></sub>does not grow smoothly when <it>m </it>increases from 2 to 3. To explain this phenomenon, we report the detailed result of the first greedy algorithm in Table <tblr tid="T1">1</tblr>. For each of the 11 blocks, the number of robust tag SNPs found with respect to different values of <it>m </it>is listed in the table. As mentioned before, some blocks may not contain enough SNPs to tolerate a large number of missing SNPs as <it>m </it>increases. When <it>m </it>increases from 2 to 3, Blocks 4 and 10 (which consume 8 and 5 SNPs, respectively) do not contain enough SNPs for a solution and are discarded. As a result, <it>S</it><sub><it>a </it></sub>(for <it>m </it>= 3) is computed using only Blocks 1 and 2, and its value is lower than the previous one (i.e., it drops from 4.75 to 4). This phenomenon does not appear in Figure <figr fid="F5">5 (a)</figr> because it is averaged out over the thousands of blocks in Patil's data set.</p>
            <tbl id="T1" hint_layout="double">
               <title>
                  <p>Table 1</p>
               </title>
               <caption>
                  <p>Detailed results of the first greedy algorithm on Daly's 11 blocks.</p>
               </caption>
               <tblbdy cols="13">
                  <r>
                     <c ca="center">
                        <p>Block ID</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                     <c ca="center">
                        <p>3</p>
                     </c>
                     <c ca="center">
                        <p>4</p>
                     </c>
                     <c ca="center">
                        <p>5</p>
                     </c>
                     <c ca="center">
                        <p>6</p>
                     </c>
                     <c ca="center">
                        <p>7</p>
                     </c>
                     <c ca="center">
                        <p>8</p>
                     </c>
                     <c ca="center">
                        <p>9</p>
                     </c>
                     <c ca="center">
                        <p>10</p>
                     </c>
                     <c ca="center">
                        <p>11</p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>S</it>
                           <sub>
                              <it>a</it>
                           </sub>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="13">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p><it>m </it>= 0</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                     <c ca="center">
                        <p>3</p>
                     </c>
                     <c ca="center">
                        <p>3</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                     <c ca="center">
                        <p>3</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                     <c ca="center">
                        <p>23/11 = 2.09</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p><it>m </it>= 1</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>f</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>5</p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>f</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>3</p>
                     </c>
                     <c ca="center">
                        <p>5</p>
                     </c>
                     <c ca="center">
                        <p>4</p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>f</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>3</p>
                     </c>
                     <c ca="center">
                        <p>3</p>
                     </c>
                     <c ca="center">
                        <p>27/8 = 3.375</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p><it>m </it>= 2</p>
                     </c>
                     <c ca="center">
                        <p>3</p>
                     </c>
                     <c ca="center">
                        <p>3</p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>f</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>8</p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>f</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>f</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>f</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>f</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>f</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>5</p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>f</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>19/4 = 4.75</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p><it>m </it>= 3</p>
                     </c>
                     <c ca="center">
                        <p>4</p>
                     </c>
                     <c ca="center">
                        <p>4</p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>f</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>f</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>f</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>f</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>f</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>f</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>f</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>f</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>f</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>8/2 = 4</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p><it>m </it>= 4</p>
                     </c>
                     <c ca="center">
                        <p>5</p>
                     </c>
                     <c ca="center">
                        <p>5</p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>f</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>f</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>f</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>f</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>f</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>f</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>f</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>f</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>f</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>10/2 = 5</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p><it>m </it>= 5</p>
                     </c>
                     <c ca="center">
                        <p>6</p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>f</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>f</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>f</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>f</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>f</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>f</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>f</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>f</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>f</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>f</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>6/1 = 6</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p><it>m </it>= 6</p>
                     </c>
                     <c ca="center">
                        <p>7</p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>f</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>f</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>f</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>f</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>f</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>f</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>f</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>f</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>f</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>f</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>7/1 = 7</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p><it>f</it>: the block does not contain enough SNPs to tolerate <it>m </it>missing SNPs</p>
               </tblfn>
            </tbl>
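            <p>The <it>S</it><sub><it>a </it></sub>column of Table <tblr tid="T1">1</tblr> can be recomputed with a few lines of Python. The sketch below is only an illustration: the per-block counts are transcribed from the table, entries marked <it>f </it>are represented by None, and the names TABLE_1, average_tag_snps and S_A are chosen here for the example and do not come from the released software. It shows that the average is taken over the blocks that still admit a solution, which is why the value can drop when blocks are discarded.</p>

```python
# Per-block robust tag SNP counts transcribed from Table 1
# (first greedy algorithm, Daly's 11 blocks).
# None marks an entry listed as "f": the block has too few SNPs
# to tolerate m missing SNPs and is left out of the average.
F = None
TABLE_1 = {
    0: [1, 1, 2, 3, 3, 2, 3, 2, 2, 2, 2],
    1: [2, 2, F, 5, F, 3, 5, 4, F, 3, 3],
    2: [3, 3, F, 8, F, F, F, F, F, 5, F],
    3: [4, 4, F, F, F, F, F, F, F, F, F],
    4: [5, 5, F, F, F, F, F, F, F, F, F],
    5: [6, F, F, F, F, F, F, F, F, F, F],
    6: [7, F, F, F, F, F, F, F, F, F, F],
}


def average_tag_snps(counts):
    """Average the counts over the blocks that have a solution."""
    solved = [c for c in counts if c is not None]
    return sum(solved) / len(solved)


# S_a for every value of m in the table.
S_A = {m: average_tag_snps(row) for m, row in TABLE_1.items()}

for m, value in S_A.items():
    solved = sum(1 for c in TABLE_1[m] if c is not None)
    print(f"m = {m}: {solved} blocks solved, S_a = {value:.3f}")
```

            <p>With these inputs, the average for <it>m </it>= 2 is taken over four blocks and the average for <it>m </it>= 3 over only two, which reproduces the non-monotonic step discussed above.</p>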
         </sec>
      </sec>
      <sec>
         <st>
            <p>Discussion</p>
         </st>
         <p>In terms of efficiency, the first and second greedy algorithms are faster than the LP-relaxation algorithm. The greedy algorithms usually return a solution in seconds, whereas the LP-relaxation algorithm requires about half a minute. This is because the running time of the LP-relaxation algorithm is bounded by the time needed to solve the LP problem. Furthermore, the LP-relaxation algorithm is repeated 10 times to explore 10 different solutions. The OPT program, which searches for the optimal solution, is clearly slower than the others. The optimal solution usually cannot be found within a reasonable period of time once the blocks become large. In our empirical study, the OPT program finds the optimal solution in reasonable time if the block contains fewer than 20 SNPs (e.g., the short random data sets). For the large data sets with more than 40 SNPs, however, the OPT program is significantly outperformed by the approximation algorithms (e.g., it fails to output a solution within one week of computation).</p>
         <p>Assuming no missing data (i.e., <it>m </it>= 0), we compare the solutions found by each algorithm with the optimal solution. Table <tblr tid="T2">2</tblr> lists the numbers of total tag SNPs found by each algorithm in previous experiments. In the experiments on random and Daly's data, the solution found by each algorithm is as good as the optimal solution. In the experiments on Hudson's and Patil's data, these algorithms still find solutions quite close to the optimal solution. For example, the approximation ratios of these algorithms are only <m:math name="1471-2105-6-263-i8" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mfrac><m:mrow><m:mn>472</m:mn></m:mrow><m:mrow><m:mn>443</m:mn></m:mrow></m:mfrac><m:mo>&#8776;</m:mo><m:mn>1.07</m:mn></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaadaWcaaqaaiabisda0iabiEda3iabikdaYaqaaiabisda0iabisda0iabiodaZaaacqGHijYUcqaIXaqmcqGGUaGlcqaIWaamcqaI3aWnaaa@37F1@</m:annotation></m:semantics></m:math> and <m:math name="1471-2105-6-263-i9" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mfrac><m:mrow><m:mn>4657</m:mn></m:mrow><m:mrow><m:mn>4595</m:mn></m:mrow></m:mfrac><m:mo>&#8776;</m:mo><m:mn>1.01</m:mn></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaadaWcaaqaaiabisda0iabiAda2iabiwda1iabiEda3aqaaiabisda0iabiwda1iabiMda5iabiwda1aaacqGHijYUcqaIXaqmcqGGUaGlcqaIWaamcqaIXaqmaaa@39EB@</m:annotation></m:semantics></m:math>, respectively.</p>
         <tbl id="T2" hint_layout="double">
            <title>
               <p>Table 2</p>
            </title>
            <caption>
               <p>The number of total tag SNPs found by each algorithm. The percentage of tag SNPs with respect to total SNPs is shown in parentheses.</p>
            </caption>
            <tblbdy cols="7">
               <r>
                  <c>
                     <p/>
                  </c>
                  <c cspan="2" ca="center">
                     <p>Random data</p>
                  </c>
                  <c cspan="2" ca="center">
                     <p>Hudson's data</p>
                  </c>
                  <c ca="center">
                     <p>Patil's data</p>
                  </c>
                  <c ca="center">
                     <p>Daly's data</p>
                  </c>
               </r>
               <r>
                  <c cspan="7">
                     <hr/>
                  </c>
               </r>
               <r>
                  <c ca="center">
                     <p>Total blocks</p>
                  </c>
                  <c ca="center">
                     <p>100</p>
                  </c>
                  <c ca="center">
                     <p>100</p>
                  </c>
                  <c ca="center">
                     <p>100</p>
                  </c>
                  <c ca="center">
                     <p>100</p>
                  </c>
                  <c ca="center">
                     <p>4135</p>
                  </c>
                  <c ca="center">
                     <p>11</p>
                  </c>
               </r>
               <r>
                  <c ca="center">
                     <p>Total SNPs</p>
                  </c>
                  <c ca="center">
                     <p>2000</p>
                  </c>
                  <c ca="center">
                     <p>4000</p>
                  </c>
                  <c ca="center">
                     <p>2000</p>
                  </c>
                  <c ca="center">
                     <p>4000</p>
                  </c>
                  <c ca="center">
                     <p>24047</p>
                  </c>
                  <c ca="center">
                     <p>103</p>
                  </c>
               </r>
               <r>
                  <c cspan="7">
                     <hr/>
                  </c>
               </r>
               <r>
                  <c ca="center">
                     <p>1<sup><it>st </it></sup>Greedy</p>
                  </c>
                  <c ca="center">
                     <p>400 (20%)</p>
                  </c>
                  <c ca="center">
                     <p>400 (10%)</p>
                  </c>
                  <c ca="center">
                     <p>509 (25.5%)</p>
                  </c>
                  <c ca="center">
                     <p>472 (11.8%)</p>
                  </c>
                  <c ca="center">
                     <p>4610 (19.2%)</p>
                  </c>
                  <c ca="center">
                     <p>23 (22.3%)</p>
                  </c>
               </r>
               <r>
                  <c ca="center">
                     <p>2<sup><it>nd </it></sup>Greedy</p>
                  </c>
                  <c ca="center">
                     <p>400 (20%)</p>
                  </c>
                  <c ca="center">
                     <p>400 (10%)</p>
                  </c>
                  <c ca="center">
                     <p>509 (25.5%)</p>
                  </c>
                  <c ca="center">
                     <p>472 (11.8%)</p>
                  </c>
                  <c ca="center">
                     <p>4610 (19.2%)</p>
                  </c>
                  <c ca="center">
                     <p>23 (22.3%)</p>
                  </c>
               </r>
               <r>
                  <c ca="center">
                     <p>LP-relaxation</p>
                  </c>
                  <c ca="center">
                     <p>400 (20%)</p>
                  </c>
                  <c ca="center">
                     <p>400 (10%)</p>
                  </c>
                  <c ca="center">
                     <p>509 (25.5%)</p>
                  </c>
                  <c ca="center">
                     <p>471 (11.8%)</p>
                  </c>
                  <c ca="center">
                     <p>4657 (19.4%)</p>
                  </c>
                  <c ca="center">
                     <p>23 (22.3%)</p>
                  </c>
               </r>
               <r>
                  <c ca="center">
                     <p>OPT</p>
                  </c>
                  <c ca="center">
                     <p>400 (20%)</p>
                  </c>
                  <c ca="center">
                     <p>400 (10%)</p>
                  </c>
                  <c ca="center">
                     <p>492 (24.6%)</p>
                  </c>
                  <c ca="center">
                     <p>443 (11.1%)</p>
                  </c>
                  <c ca="center">
                     <p>4595 (19.1%)</p>
                  </c>
                  <c ca="center">
                     <p>23 (22.3%)</p>
                  </c>
               </r>
            </tblbdy>
         </tbl>
         <p>We then analyze the genotyping cost that can be saved by using tag SNPs. In Table <tblr tid="T2">2</tblr>, the percentage of tag SNPs in each data set is shown in parentheses. The experimental results indicate that genotyping only the tag SNPs costs significantly less than genotyping all SNPs in a block. For example, in Patil's data, only about 19% of the SNPs need to be genotyped as tag SNPs, which saves about 81% of the genotyping cost.</p>
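         <p>The percentages and approximation ratios quoted from Table <tblr tid="T2">2</tblr> follow from simple arithmetic, sketched below in Python. The totals are transcribed from the table. The dictionary layout, the data set labels and the function names are assumptions made for this example only.</p>

```python
# Totals transcribed from Table 2:
# data set label -> (total SNPs, tag SNPs found by each algorithm).
TABLE_2 = {
    "Random (2000 SNPs)": (2000, {"greedy1": 400, "greedy2": 400, "lp": 400, "opt": 400}),
    "Random (4000 SNPs)": (4000, {"greedy1": 400, "greedy2": 400, "lp": 400, "opt": 400}),
    "Hudson (2000 SNPs)": (2000, {"greedy1": 509, "greedy2": 509, "lp": 509, "opt": 492}),
    "Hudson (4000 SNPs)": (4000, {"greedy1": 472, "greedy2": 472, "lp": 471, "opt": 443}),
    "Patil": (24047, {"greedy1": 4610, "greedy2": 4610, "lp": 4657, "opt": 4595}),
    "Daly": (103, {"greedy1": 23, "greedy2": 23, "lp": 23, "opt": 23}),
}


def tag_fraction(dataset, algorithm):
    """Fraction of all SNPs that the algorithm selects as tag SNPs."""
    total, tags = TABLE_2[dataset]
    return tags[algorithm] / total


def genotyping_saving(dataset, algorithm):
    """Fraction of SNPs that no longer need to be genotyped."""
    return 1.0 - tag_fraction(dataset, algorithm)


def approximation_ratio(dataset, algorithm):
    """Size of the algorithm's solution relative to the optimal one."""
    _, tags = TABLE_2[dataset]
    return tags[algorithm] / tags["opt"]


for name in TABLE_2:
    print(
        f"{name}: first greedy selects {tag_fraction(name, 'greedy1'):.1%} "
        f"of the SNPs, ratio to OPT = {approximation_ratio(name, 'greedy1'):.3f}"
    )
```
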
         <p>We next discuss the tradeoff between the number of additional tag SNPs required and the number of missing SNPs allowed. In practice, missing data in a genotyping experiment are usually limited to a certain missing rate. We therefore convert the maximum number of missing SNPs allowed into a maximum missing rate allowed by calculating <it>m </it>as a percentage of the number of robust tag SNPs. Table <tblr tid="T3">3</tblr> lists the results of the first greedy algorithm applied to the random and Hudson's long data sets. The number of additional tag SNPs grows roughly linearly with <it>m</it>. However, the maximum missing rate allowed grows slowly as <it>m </it>becomes large. This is because more additional tag SNPs are required in order to tolerate more missing SNPs, and, under the same SNP missing rate, genotyping these additional tag SNPs also increases the expected number of missing SNPs, which reduces the power of the robust tag SNPs. On the positive side, when <it>m </it>is small, the corresponding maximum missing rate allowed is sufficient for most genotyping experiments, since their missing rates are usually less than 10%. For example, the robust tag SNPs with <it>m </it>= 1 are sufficient to tolerate 10% missing SNPs, and they require only about 2 to 3 additional SNPs on average (Table <tblr tid="T3">3</tblr>). As a result, genotyping additional tag SNPs to tolerate missing data is cost-effective under the current genotyping environment.</p>
         <tbl id="T3" hint_layout="double">
            <title>
               <p>Table 3</p>
            </title>
            <caption>
               <p>The tradeoffs between additional tag SNPs required and maximum missing rates allowed. These results come from the first greedy algorithm applied to the random and Hudson's data sets with 40 SNPs.</p>
            </caption>
            <tblbdy cols="8">
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="center">
                     <p>
                        <it>m</it>
                     </p>
                  </c>
                  <c ca="center">
                     <p>0</p>
                  </c>
                  <c ca="center">
                     <p>1</p>
                  </c>
                  <c ca="center">
                     <p>2</p>
                  </c>
                  <c ca="center">
                     <p>3</p>
                  </c>
                  <c ca="center">
                     <p>4</p>
                  </c>
                  <c ca="center">
                     <p>5</p>
                  </c>
               </r>
               <r>
                  <c cspan="8">
                     <hr/>
                  </c>
               </r>
               <r>
                  <c ca="center">
                     <p>Random data (40 SNPs)</p>
                  </c>
                  <c ca="center">
                     <p>average number of robust tag SNPs</p>
                  </c>
                  <c ca="center">
                     <p>4</p>
                  </c>
                  <c ca="center">
                     <p>6</p>
                  </c>
                  <c ca="center">
                     <p>8.51</p>
                  </c>
                  <c ca="center">
                     <p>10.47</p>
                  </c>
                  <c ca="center">
                     <p>12.89</p>
                  </c>
                  <c ca="center">
                     <p>14.92</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="center">
                     <p>corresponding SNP missing rate</p>
                  </c>
                  <c ca="center">
                     <p>0</p>
                  </c>
                  <c ca="center">
                     <p>16.7%</p>
                  </c>
                  <c ca="center">
                     <p>23.5%</p>
                  </c>
                  <c ca="center">
                     <p>28.6%</p>
                  </c>
                  <c ca="center">
                     <p>31.0%</p>
                  </c>
                  <c ca="center">
                     <p>33.5%</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="center">
                     <p>average number of extra tag SNPs</p>
                  </c>
                  <c ca="center">
                     <p>0</p>
                  </c>
                  <c ca="center">
                     <p>2</p>
                  </c>
                  <c ca="center">
                     <p>4.51</p>
                  </c>
                  <c ca="center">
                     <p>6.47</p>
                  </c>
                  <c ca="center">
                     <p>8.89</p>
                  </c>
                  <c ca="center">
                     <p>10.92</p>
                  </c>
               </r>
               <r>
                  <c cspan="8">
                     <hr/>
                  </c>
               </r>
               <r>
                  <c ca="center">
                     <p>Hudson's data (40 SNPs)</p>
                  </c>
                  <c ca="center">
                     <p>average number of robust tag SNPs</p>
                  </c>
                  <c ca="center">
                     <p>4.72</p>
                  </c>
                  <c ca="center">
                     <p>7.71</p>
                  </c>
                  <c ca="center">
                     <p>11.28</p>
                  </c>
                  <c ca="center">
                     <p>14.67</p>
                  </c>
                  <c ca="center">
                     <p>18.23</p>
                  </c>
                  <c ca="center">
                     <p>21.67</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="center">
                     <p>corresponding SNP missing rate</p>
                  </c>
                  <c ca="center">
                     <p>0</p>
                  </c>
                  <c ca="center">
                     <p>13.0%</p>
                  </c>
                  <c ca="center">
                     <p>17.7%</p>
                  </c>
                  <c ca="center">
                     <p>20.4%</p>
                  </c>
                  <c ca="center">
                     <p>21.9%</p>
                  </c>
                  <c ca="center">
                     <p>23.1%</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="center">
                     <p>average number of extra tag SNPs</p>
                  </c>
                  <c ca="center">
                     <p>0</p>
                  </c>
                  <c ca="center">
                     <p>2.99</p>
                  </c>
                  <c ca="center">
                     <p>6.56</p>
                  </c>
                  <c ca="center">
                     <p>9.95</p>
                  </c>
                  <c ca="center">
                     <p>13.51</p>
                  </c>
                  <c ca="center">
                     <p>16.95</p>
                  </c>
               </r>
            </tblbdy>
         </tbl>
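         <p>The conversion used in Table <tblr tid="T3">3</tblr> can be sketched as follows. The averages are transcribed from the table. The names ROBUST_TAGS, REPORTED_RATE, max_missing_rate and extra_tag_snps are assumptions made for this example. Because the inputs are rounded averages, the recomputed rates should be expected to agree with the table only to within rounding.</p>

```python
# Average number of robust tag SNPs for m = 0..5, transcribed from Table 3.
ROBUST_TAGS = {
    "random": [4, 6, 8.51, 10.47, 12.89, 14.92],
    "hudson": [4.72, 7.71, 11.28, 14.67, 18.23, 21.67],
}

# Missing rates (in percent) as reported in Table 3, for comparison.
REPORTED_RATE = {
    "random": [0, 16.7, 23.5, 28.6, 31.0, 33.5],
    "hudson": [0, 13.0, 17.7, 20.4, 21.9, 23.1],
}


def max_missing_rate(m, robust_tags):
    """Maximum missing rate allowed: m as a fraction of the robust tag SNPs."""
    return m / robust_tags


def extra_tag_snps(tags_by_m, m):
    """Additional tag SNPs needed compared with the m = 0 solution."""
    return tags_by_m[m] - tags_by_m[0]


for name, tags_by_m in ROBUST_TAGS.items():
    for m, tags in enumerate(tags_by_m):
        rate = 100 * max_missing_rate(m, tags)
        extra = extra_tag_snps(tags_by_m, m)
        print(f"{name}, m = {m}: rate = {rate:.1f}%, extra tag SNPs = {extra:.2f}")
```

         <p>The rate grows more slowly than <it>m </it>because the denominator, the number of robust tag SNPs, grows together with <it>m</it>.</p>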
         <p>In reality, not all haplotypes are of equal importance or confidence. When selecting robust tag SNPs, it might be desirable to weight them according to their population frequency. To incorporate the frequency of haplotypes into this problem, there are two possible ways:</p>
         <p>1. The simplest way is to discard the rare haplotypes and retain only the common haplotypes as the input to our algorithms. This approach requires no modification to our algorithms, but the retained common haplotypes are all treated as equally weighted.</p>
         <p>2. Our algorithms try to find a set of SNPs such that each pair of haplotypes is distinguished by at least (<it>m </it>+ 1) SNPs. The simplest way to weight the haplotypes is to choose a different threshold for each pair of haplotypes according to their population frequency. Our algorithms then assign more tag SNPs to haplotype pairs with higher frequency than to those with lower frequency.</p>
      </sec>
      <sec>
         <st>
            <p>Conclusion</p>
         </st>
         <p>In this paper, we show that there exists a set of robust tag SNPs that can tolerate a given number of missing data. Our study indicates that genotyping robust tag SNPs is more practical than genotyping minimum tag SNPs for association studies if the occurrence of missing data cannot be avoided. We describe two greedy algorithms and one LP-relaxation approximation algorithm for finding robust tag SNPs. Our experimental results and theoretical analysis show that these algorithms are efficient and that the solutions found are close to the optimal solution. In terms of genotyping cost, we observe that the cost saved by using robust tag SNPs is significant, and that genotyping additional tag SNPs to tolerate missing data is still cost-effective. One future direction is to assign weights to different types of SNPs (e.g., SNPs in coding or non-coding regions), and design algorithms for the selection of weighted tag SNPs.</p>
      </sec>
      <sec>
         <st>
            <p>Software availability</p>
         </st>
         <p><b>Project name: </b>efficient algorithms for utilizing SNP information.</p>
         <p>
            <b>Project home page: </b>
            <url>http://www.csie.ntu.edu.tw/~kmchao/tools/Robust_Tag_SNP</url>
         </p>
         <p><b>Operating system: </b>the implemented greedy algorithms are platform independent, and the implemented LP-relaxation algorithm runs on the Windows operating system.</p>
         <p><b>Programming language: </b>the greedy algorithms are implemented in Java, and the LP-relaxation algorithm is implemented in Perl.</p>
      </sec>
      <sec>
         <st>
            <p>Methods</p>
         </st>
         <p>Assume we are given a haplotype block containing <it>N </it>SNPs and <it>K </it>haplotype patterns. This block is denoted by an <it>N </it>&#215; <it>K </it>binary matrix <it>M</it><sub><it>h </it></sub>(see Figure <figr fid="F6">6 (A)</figr>). Define <it>M</it><sub><it>h</it></sub>[<it>i</it>,<it>j</it>] &#8712; {1,2} for each <it>i </it>&#8712; [1, <it>N</it>] and <it>j </it>&#8712; [1, <it>K</it>], where 1 and 2 represent the major and minor alleles, respectively. In reality, the haplotype block may also contain missing data. This formulation can be easily extended to handle missing data by treating them as wild card symbols. To simplify the presentation of this paper, we will assume no missing data in the block. Let <it>C </it>be the set of all SNPs in <it>M</it><sub><it>h</it></sub>. The robust tag SNPs <it>C' </it>&#8838; <it>C </it>are a subset of SNPs that can distinguish each pair of haplotype patterns unambiguously when at most <it>m </it>SNPs are missing. Note that the missing data may occur at any SNP locus and thus create different missing patterns (see Figure <figr fid="F2">2</figr>). The set of robust tag SNPs <it>C' </it>is required to distinguish the haplotype patterns unambiguously under every missing pattern with up to <it>m </it>missing SNPs.</p>
         <fig id="F6">
            <title>
               <p>Figure 6</p>
            </title>
            <caption>
               <p>Reformulation of the MRTS problem</p>
            </caption>
            <text>
               <p><b>Reformulation of the MRTS problem</b>. (A) The haplotype matrix <it>M</it><sub><it>h </it></sub>containing <it>N </it>SNPs and <it>K </it>haplotype patterns. (B) The bipartite graph corresponding to <it>M</it><sub><it>h</it></sub>.</p>
            </text>
            <graphic file="1471-2105-6-263-6" hint_layout="single"/>
         </fig>
         <p>To distinguish a haplotype pattern unambiguously, each pair of patterns must be distinguished by at least one SNP in <it>C'</it>. For example (see Figure <figr fid="F6">6 (A)</figr>), we say patterns <it>P</it><sub>1 </sub>and <it>P</it><sub>2 </sub>can be distinguished by SNP <it>S</it><sub>2 </sub>since <it>M</it><sub><it>h</it></sub>[2,1] &#8800; <it>M</it><sub><it>h</it></sub>[2,2]. A formal definition of this problem is given below.</p>
         <sec>
            <st>
               <p>Problem: Minimum Robust Tag SNPs (MRTS)</p>
            </st>
            <p><b>Input: </b>An <it>N </it>&#215; <it>K </it>matrix <it>M</it><sub><it>h </it></sub>and an integer <it>m</it>.</p>
            <p><b>Output: </b>The minimum subset of SNPs <it>C' </it>&#8838; <it>C </it>which satisfies:</p>
            <p>(1) for each pair of patterns <it>P</it><sub><it>i </it></sub>and <it>P</it><sub><it>j</it></sub>, there is a SNP <it>S</it><sub><it>k </it></sub>&#8712; <it>C' </it>such that <it>M</it><sub><it>h</it></sub>[<it>k</it>, <it>i</it>] &#8800; <it>M</it><sub><it>h</it></sub>[<it>k</it>, <it>j</it>];</p>
            <p>(2) when at most <it>m </it>SNPs are discarded from <it>C' </it>arbitrarily, (1) still holds.</p>
            <p>We then reformulate MRTS to a variant of the <it>set covering problem </it><abbrgrp><abbr bid="B6">6</abbr></abbrgrp>. Each SNP <it>S</it><sub><it>k </it></sub>&#8712; <it>C </it>(i.e., the <it>k</it>-th row in <it>M</it><sub><it>h</it></sub>) is reformulated to a set <m:math name="1471-2105-6-263-i10" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>S</m:mi><m:mi>k</m:mi><m:mo>'</m:mo></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacqWGtbWudaqhaaWcbaGaem4AaSgabaGaei4jaCcaaaaa@303F@</m:annotation></m:semantics></m:math> = {(<it>i</it>, <it>j</it>) | <it>M</it><sub><it>h</it></sub>[<it>k</it>, <it>i</it>] &#8800; <it>M</it><sub><it>h</it></sub>[<it>k</it>, <it>j</it>] and <it>i </it>&lt;<it>j</it>}. For example, suppose the <it>k</it>-th row in <it>M</it><sub><it>h </it></sub>is {1,1,1,2}. The corresponding set <m:math name="1471-2105-6-263-i10" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>S</m:mi><m:mi>k</m:mi><m:mo>'</m:mo></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacqWGtbWudaqhaaWcbaGaem4AaSgabaGaei4jaCcaaaaa@303F@</m:annotation></m:semantics></m:math> = {(1,4), (2,4), (3,4)}. In other words, <m:math name="1471-2105-6-263-i10" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>S</m:mi><m:mi>k</m:mi><m:mo>'</m:mo></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacqWGtbWudaqhaaWcbaGaem4AaSgabaGaei4jaCcaaaaa@303F@</m:annotation></m:semantics></m:math> stores the pairs of patterns distinguished by SNP <it>S</it><sub><it>k</it></sub>. Define <it>P </it>as the set that contains all pairs of patterns (i.e., <it>P </it>= {(<it>i</it>,<it>j</it>) | 1 &#8804; <it>i </it>&lt;<it>j </it>&#8804; <it>K</it>} = {(1,2), (1,3), ..., (<it>K </it>- 1, <it>K</it>)}).</p>
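            <p>The reformulation can be illustrated with a short Python sketch. The function names <it>reformulate </it>and <it>all_pairs </it>are hypothetical and are not part of the released software. The sketch assumes the matrix is given as a list of rows, one per SNP, and reports pattern indices starting from 1 to match the notation above.</p>
            <p><![CDATA[
```python
from itertools import combinations


def reformulate(matrix):
    """Map each SNP (row of the N x K matrix) to the set of pattern pairs it distinguishes.

    Pairs are 1-based (i, j) with i < j, matching the notation in the text.
    """
    sets = []
    for row in matrix:
        pairs = {(i + 1, j + 1)
                 for i, j in combinations(range(len(row)), 2)
                 if row[i] != row[j]}
        sets.append(pairs)
    return sets


def all_pairs(k):
    """The set P of all pattern pairs (i, j) with 1 <= i < j <= K."""
    return set(combinations(range(1, k + 1), 2))


# The row {1,1,1,2} from the text distinguishes pattern 4 from patterns 1, 2 and 3.
example_sets = reformulate([[1, 1, 1, 2]])
print(sorted(example_sets[0]))  # [(1, 4), (2, 4), (3, 4)]
print(len(all_pairs(4)))        # 6
```
]]></p>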
            <p>Consider each element in <it>P </it>and each reformulated set of <it>C </it>as nodes in an undirected bipartite graph (see Figure <figr fid="F6">6 (B)</figr>). If SNP <it>S</it><sub><it>k </it></sub>can distinguish patterns <it>P</it><sub><it>i </it></sub>and <it>P</it><sub><it>j </it></sub>(i.e., (<it>i</it>,<it>j</it>) &#8712; <m:math name="1471-2105-6-263-i10" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>S</m:mi><m:mi>k</m:mi><m:mo>'</m:mo></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacqWGtbWudaqhaaWcbaGaem4AaSgabaGaei4jaCcaaaaa@303F@</m:annotation></m:semantics></m:math>), there is an edge connecting the nodes (<it>i</it>, <it>j</it>) and <m:math name="1471-2105-6-263-i10" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>S</m:mi><m:mi>k</m:mi><m:mo>'</m:mo></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacqWGtbWudaqhaaWcbaGaem4AaSgabaGaei4jaCcaaaaa@303F@</m:annotation></m:semantics></m:math>. The following lemma implies that each pair of patterns must be distinguished by at least (<it>m </it>+ 1) SNPs to tolerate <it>m </it>missing SNPs.</p>
            <p><b>Lemma 1. </b><it>C' </it>&#8838; <it>C is the set of robust tag SNPs which allows at most <it>m </it>missing SNPs iff each node in P has at least </it>(<it>m </it>+ 1) <it>edges connecting to nodes in C'</it>.</p>
            <p><it>Proof. </it>Let <it>C' </it>be the set of robust tag SNPs which allows at most <it>m </it>missing SNPs. Suppose patterns <it>P</it><sub><it>i </it></sub>and <it>P</it><sub><it>j </it></sub>are distinguished by only <it>m </it>SNPs in <it>C' </it>(i.e., (<it>i</it>, <it>j</it>) has only <it>m </it>edges connecting to nodes in <it>C'</it>). However, if these <it>m </it>SNPs are all missing, no other SNPs in <it>C' </it>are able to distinguish patterns <it>P</it><sub><it>i </it></sub>and <it>P</it><sub><it>j</it></sub>, which is a contradiction. Thus, each pair of patterns must be distinguished by at least (<it>m </it>+ 1) SNPs, which implies that each node in <it>P </it>must have at least (<it>m </it>+ 1) edges connecting to nodes in <it>C'</it>. The proof of the other direction is similar. &#160;&#160;</p>
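            <p>The equivalence stated in Lemma 1 can be checked by brute force on small inputs. The following Python sketch is an illustration only: the function names and the toy matrix are assumptions made for this example, and SNPs and patterns are indexed from 0. One function tests the definition directly by discarding SNPs, and the other counts the edges of each pair node.</p>
            <p><![CDATA[
```python
from itertools import combinations


def distinguishes_all(matrix, snps):
    """Condition (1): every pair of patterns differs at some SNP in `snps`."""
    k = len(matrix[0])
    return all(any(matrix[s][i] != matrix[s][j] for s in snps)
               for i, j in combinations(range(k), 2))


def robust_by_definition(matrix, snps, m):
    """Condition (2): condition (1) still holds after discarding any m SNPs.

    Discarding exactly m SNPs (or all of them if fewer are selected) is enough,
    because discarding fewer SNPs can only make the patterns easier to distinguish.
    """
    snps = list(snps)
    drop = min(m, len(snps))
    return all(distinguishes_all(matrix, [s for s in snps if s not in gone])
               for gone in combinations(snps, drop))


def robust_by_degree(matrix, snps, m):
    """Lemma 1: every pair node has at least m + 1 edges to the selected SNPs."""
    k = len(matrix[0])
    return all(sum(matrix[s][i] != matrix[s][j] for s in snps) >= m + 1
               for i, j in combinations(range(k), 2))


# Hypothetical toy block with 4 SNPs (rows) and 3 haplotype patterns (columns).
toy = [[1, 2, 2], [1, 1, 2], [1, 2, 1], [2, 1, 1]]
print(robust_by_definition(toy, [0, 1, 2], 1), robust_by_degree(toy, [0, 1, 2], 1))
# True True
```
]]></p>
            <p>In this toy block the second and third patterns differ at only two SNPs, so no selection can tolerate two missing SNPs, and both functions report this.</p>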
            <p>In the following, we give a lower bound on the minimum number of robust tag SNPs required, which the OPT program uses to skip part of the solution space.</p>
            <p><b>Lemma 2. </b><it>Given K haplotype patterns, the minimum number of robust tag SNPs required is at least </it>log<sub>2 </sub><it>K</it>.</p>
            <p><it>Proof. </it>Recall that the value of a SNP is binary. At most 2<sup><it>N</it></sup> distinct haplotypes can be distinguished by <it>N </it>SNPs. As a result, for a given data set containing <it>K </it>haplotype patterns, the minimum number of SNPs required is at least log<sub>2 </sub><it>K</it>. &#160;&#160;&#160;</p>
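            <p>As a small worked illustration of Lemma 2, the bound can be computed with integer arithmetic, taking the logarithm to base 2 as the counting argument in the proof suggests. The function name below is hypothetical. How the OPT program applies the bound is not detailed here; one plausible use, stated as an assumption, is to start an exhaustive search at subsets of this size.</p>
            <p><![CDATA[
```python
def min_tag_snp_lower_bound(k):
    """Smallest n with 2**n >= k, i.e. ceil(log2(k)), computed without floating point."""
    if k < 1:
        raise ValueError("k must be a positive number of haplotype patterns")
    return (k - 1).bit_length()


for k in (2, 4, 5, 8, 9):
    print(k, min_tag_snp_lower_bound(k))
# 2 1
# 4 2
# 5 3
# 8 3
# 9 4
```
]]></p>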
            <p>The following theorem shows the NP-hardness of the MRTS problem, which implies that there is no polynomial-time algorithm for finding the optimal solution of MRTS unless P = NP.</p>
            <p><b>Theorem 1. </b><it>The MRTS problem is NP-hard</it>.</p>
            <p><it>Proof. </it>When <it>m </it>= 0, MRTS is the same as the original problem of finding minimum tag SNPs, which is known as the <it>minimum test set </it>problem <abbrgrp><abbr bid="B6">6</abbr><abbr bid="B17">17</abbr></abbrgrp>. Since the minimum test set problem is NP-hard and can be reduced to a special case of MRTS, MRTS is NP-hard. &#160;&#160;&#160;</p>
         </sec>
         <sec>
            <st>
               <p>The first greedy algorithm</p>
            </st>
            <p>To solve MRTS efficiently, we propose a greedy algorithm which returns a solution not much larger than the optimal solution. By Lemma 1, to tolerate <it>m </it>missing tag SNPs, we need to find a subset of SNPs <it>C' </it>&#8838; <it>C </it>such that each pair of patterns in <it>P </it>is distinguished by at least (<it>m + </it>1) SNPs in <it>C'</it>. Assume that the SNPs selected by this algorithm are stored in an (<it>m </it>+ 1) &#215; |<it>P</it>| table (see Figure <figr fid="F7">7 (A)</figr>). Initially, each grid in the table is empty. Once a SNP <it>S</it><sub><it>k </it></sub>(that can distinguish patterns <it>P</it><sub><it>i </it></sub>and <it>P</it><sub><it>j</it></sub>) is selected, one grid of the column (<it>i</it>, <it>j</it>) is filled in with <it>S</it><sub><it>k</it></sub>, and we say that this grid is <it>covered </it>by <it>S</it><sub><it>k</it></sub>.</p>
            <fig id="F7">
               <title>
                  <p>Figure 7</p>
               </title>
               <caption>
                  <p>An example of the first greedy algorithm</p>
               </caption>
               <text>
                  <p><b>An example of the first greedy algorithm</b>. The SNPs <it>S</it><sub>1</sub>, <it>S</it><sub>4</sub>, <it>S</it><sub>2</sub>, and <it>S</it><sub>3 </sub>are selected by the first greedy algorithm. (A) The table that stores each selected SNP.</p>
               </text>
               <graphic file="1471-2105-6-263-7" hint_layout="single"/>
            </fig>
            <p>This greedy algorithm covers the grids row by row, from the first row to the (<it>m </it>+ 1)-th row. At each iteration, it greedily selects a SNP which covers the most uncovered grids in the current (<it>i</it>-th) row. In other words, while working on the <it>i</it>-th row, a SNP is selected if its reformulated set <it>S' </it>maximizes |<it>S' </it>&#8745; <it>R</it><sub><it>i </it></sub>|, where <it>R</it><sub><it>i </it></sub>is the set of uncovered grids at the <it>i</it>-th row.</p>
            <p>Figure <figr fid="F7">7</figr> illustrates an example for this algorithm to tolerate one missing tag SNP (i.e., <it>m </it>= 1). The SNPs <it>S</it><sub>1</sub>, <it>S</it><sub>4</sub>, <it>S</it><sub>2</sub>, and <it>S</it><sub>3 </sub>are selected in order. When all grids in this table are covered, each pair of patterns is distinguished by (<it>m + </it>1) SNPs in the corresponding column. Thus, the SNPs in this table are the robust tag SNPs which can tolerate up to <it>m </it>missing SNPs. The pseudo code of this greedy algorithm is given below.</p>
            <p><b>Algorithm: </b>FIRST-GREEDY-ALGORITHM (<it>C</it>, <it>P</it>, <it>m</it>)</p>
            <p>1 <it>R</it><sub><it>i </it></sub>&#8592; <it>P</it>, &#8704;<it>i </it>&#8712; [1, <it>m </it>+ 1]</p>
            <p>2 <it>C' </it>&#8592; <it>&#966;</it></p>
            <p>3 <b>for </b><it>i </it>= 1 to <it>m </it>+ 1 <b>do</b></p>
            <p>4 &#160;&#160;&#160;<b>while </b><it>R</it><sub><it>i </it></sub>&#8800; <it>&#966; </it><b>do</b></p>
            <p>5 &#160;&#160;&#160;&#160;&#160;&#160;select and remove a SNP <it>S </it>from <it>C </it>that maximizes |<it>S' </it>&#8745; <it>R</it><sub><it>i</it></sub>|</p>
            <p>6 &#160;&#160;&#160;&#160;&#160;&#160;<it>C' </it>&#8592; <it>C' </it>&#8746; {<it>S</it>}</p>
            <p>7 &#160;&#160;&#160;&#160;&#160;&#160;<it>j </it>&#8592; <it>i</it></p>
            <p>8 &#160;&#160;&#160;&#160;&#160;&#160;<b>while </b><it>S' </it>&#8800; <it>&#966; </it><b>and </b><it>j </it>&#8804; <it>m </it>+ 1 <b>do</b></p>
            <p>9 &#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;<it>S</it><sub><it>tmp </it></sub>&#8592; <it>S' </it>&#8745; <it>R</it><sub><it>j </it></sub>//<it>S</it><sub><it>tmp </it></sub>is a temporary variable for holding the result of <it>S' </it>&#8745; <it>R</it><sub><it>j</it></sub></p>
            <p>10 &#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;<it>R</it><sub><it>j </it></sub>&#8592; <it>R</it><sub><it>j </it></sub>- <it>S</it><sub><it>tmp</it></sub></p>
            <p>11 &#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;<it>S' </it>&#8592; <it>S' </it>- <it>S</it><sub><it>tmp</it></sub></p>
            <p>12 &#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;<it>j </it>&#8592; <it>j </it>+ 1</p>
            <p>13 &#160;&#160;&#160;&#160;&#160;&#160;<b>endwhile</b></p>
            <p>14 &#160;&#160;&#160;<b>endwhile</b></p>
            <p>15 <b>endfor</b></p>
            <p>16 <b>return </b><it>C'</it></p>
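            <p>The pseudocode above can be sketched in Python as follows. This is a minimal illustration rather than the released Java implementation. The function name, the 0-based indices, the tie-breaking by lowest SNP index, and the explicit error raised when no remaining SNP covers an uncovered grid are assumptions added for this example; the pseudocode does not specify them.</p>
            <p><![CDATA[
```python
from itertools import combinations


def first_greedy(matrix, m):
    """Sketch of FIRST-GREEDY-ALGORITHM; returns 0-based indices of the selected SNPs."""
    k = len(matrix[0])
    pairs = set(combinations(range(k), 2))                   # P
    # Reformulated set S' of every SNP that is still available in C.
    cover = {s: {(i, j) for (i, j) in pairs if row[i] != row[j]}
             for s, row in enumerate(matrix)}
    uncovered = [set(pairs) for _ in range(m + 1)]           # R_1 .. R_(m+1)
    selected = []                                            # C'
    for i in range(m + 1):
        while uncovered[i]:
            # Line 5: choose the available SNP covering the most uncovered grids of row i.
            best = max(cover, key=lambda s: len(cover[s] & uncovered[i]), default=None)
            if best is None or not (cover[best] & uncovered[i]):
                # Added check: the input cannot tolerate m missing SNPs.
                raise ValueError("no remaining SNP covers an uncovered grid")
            rest = cover.pop(best)
            selected.append(best)
            # Lines 7-13: place each pair of S' in the first row >= i where it is uncovered.
            j = i
            while rest and j <= m:
                placed = rest & uncovered[j]
                uncovered[j] -= placed
                rest -= placed
                j += 1
    return selected


# Hypothetical toy block with 4 SNPs (rows) and 3 haplotype patterns (columns).
toy = [[1, 2, 2], [1, 1, 2], [1, 2, 1], [2, 1, 1]]
print(first_greedy(toy, 1))  # [0, 1, 2]
```
]]></p>
            <p>On this toy block, the first two selected SNPs complete the first row of the table and the third completes the second row, so every pair of patterns is distinguished by two selected SNPs.</p>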
            <p>The time complexity of this algorithm is analyzed as follows. At Line 4, the number of iterations of the intermediate loop is bounded by |<it>R</it><sub><it>i</it></sub>| &#8804; |<it>P</it>|. Within the loop body (Lines 5&#8211;13), Line 5 takes <it>O</it>(|<it>C</it>||<it>P</it>|) because we need to check all SNPs in <it>C </it>and examine the uncovered grids of <it>R</it><sub><it>i</it></sub>. The inner loop (Lines 8&#8211;13) takes only <it>O</it>(|<it>S'</it>|). Thus, the entire program runs in <it>O</it>(<it>m</it>|<it>C</it>||<it>P</it>|<sup>2</sup>).</p>
            <p>We now show that the solution <it>C' </it>returned by the first greedy algorithm is not much larger than the optimal solution <it>C*</it>. Suppose the algorithm selects the <it>k</it>-th SNP when working on the <it>i</it>-th row. Let |<m:math name="1471-2105-6-263-i11" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>S</m:mi><m:mi>k</m:mi><m:mi>c</m:mi></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacqWGtbWudaqhaaWcbaGaem4AaSgabaGaem4yamgaaaaa@30B8@</m:annotation></m:semantics></m:math>| be the number of grids in the <it>i</it>-th row covered by the <it>k</it>-th selected SNP (i.e., |<m:math name="1471-2105-6-263-i11" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>S</m:mi><m:mi>k</m:mi><m:mi>c</m:mi></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacqWGtbWudaqhaaWcbaGaem4AaSgabaGaem4yamgaaaaa@30B8@</m:annotation></m:semantics></m:math>| = |<it>S</it>' &#8745; <it>R</it><sub><it>i</it></sub>|; see Line 5 in FIRST-GREEDY-ALGORITHM). For example (see Figure <figr fid="F7">7</figr>), <m:math name="1471-2105-6-263-i12" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>S</m:mi><m:mn>2</m:mn><m:mi>c</m:mi></m:msubsup><m:mo>=</m:mo><m:mn>2</m:mn></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacqWGtbWudaqhaaWcbaGaeGOmaidabaGaem4yamgaaOGaeyypa0JaeGOmaidaaa@324D@</m:annotation></m:semantics></m:math> since the second selected SNP (i.e., <it>S</it><sub>4</sub>) covers two grids in the first row. We incur 1 unit of cost to each selected SNP, and spread this cost among the grids in <m:math name="1471-2105-6-263-i11" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>S</m:mi><m:mi>k</m:mi><m:mi>c</m:mi></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacqWGtbWudaqhaaWcbaGaem4AaSgabaGaem4yamgaaaaa@30B8@</m:annotation></m:semantics></m:math><abbrgrp><abbr bid="B3">3</abbr></abbrgrp>. In other words, each grid at the <it>i-</it>th row and <it>j</it>-th column is assigned a cost <m:math name="1471-2105-6-263-i13" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>C</m:mi><m:mi>j</m:mi><m:mi>i</m:mi></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacqWGdbWqdaqhaaWcbaGaemOAaOgabaGaemyAaKgaaaaa@30A2@</m:annotation></m:semantics></m:math> (see Figure <figr fid="F8">8</figr>), where</p>
            <fig id="F8">
               <title>
                  <p>Figure 8</p>
               </title>
               <caption>
                  <p>Analysis of the first greedy algorithm</p>
               </caption>
               <text>
                  <p><b>Analysis of the first greedy algorithm</b>. This figure shows the cost <m:math name="1471-2105-6-263-i13" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>C</m:mi><m:mi>j</m:mi><m:mi>i</m:mi></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacqWGdbWqdaqhaaWcbaGaemOAaOgabaGaemyAaKgaaaaa@30A2@</m:annotation></m:semantics></m:math> of each grid for the first greedy algorithm.</p>
               </text>
               <graphic file="1471-2105-6-263-8" hint_layout="single"/>
            </fig>
            <p>
               <m:math name="1471-2105-6-263-i14" xmlns:m="http://www.w3.org/1998/Math/MathML">
                  <m:semantics>
                     <m:mrow>
                        <m:msubsup>
                           <m:mi>C</m:mi>
                           <m:mi>j</m:mi>
                           <m:mi>i</m:mi>
                        </m:msubsup>
                        <m:mo>=</m:mo>
                        <m:mrow>
                           <m:mo>{</m:mo>
                           <m:mrow>
                              <m:mtable columnalign="left">
                                 <m:mtr columnalign="left">
                                    <m:mtd columnalign="left">
                                       <m:mrow>
                                          <m:mstyle scriptlevel="+1">
                                             <m:mfrac>
                                                <m:mn>1</m:mn>
                                                <m:mrow>
                                                   <m:mrow>
                                                      <m:mo>|</m:mo>
                                                      <m:mrow>
                                                         <m:msubsup>
                                                            <m:mi>S</m:mi>
                                                            <m:mi>k</m:mi>
                                                            <m:mi>c</m:mi>
                                                         </m:msubsup>
                                                      </m:mrow>
                                                      <m:mo>|</m:mo>
                                                   </m:mrow>
                                                </m:mrow>
                                             </m:mfrac>
                                          </m:mstyle>
                                       </m:mrow>
                                    </m:mtd>
                                    <m:mtd columnalign="left">
                                       <m:mrow>
                                          <m:mtext>if&#160;the&#160;algorithm&#160;selects&#160;the&#160;</m:mtext>
                                          <m:mi>k</m:mi>
                                          <m:mtext>-th&#160;SNP&#160;when&#160;covering&#160;the&#160;</m:mtext>
                                          <m:mi>i</m:mi>
                                          <m:mtext>-th&#160;row;</m:mtext>
                                       </m:mrow>
                                    </m:mtd>
                                 </m:mtr>
                                 <m:mtr columnalign="left">
                                    <m:mtd columnalign="left">
                                       <m:mrow>
                                          <m:mtext>&#8201;</m:mtext>
                                          <m:mtext>&#8201;</m:mtext>
                                          <m:mn>0</m:mn>
                                       </m:mrow>
                                    </m:mtd>
                                    <m:mtd columnalign="left">
                                       <m:mrow>
                                          <m:mtext>otherwise</m:mtext>
                                          <m:mtext>.</m:mtext>
                                       </m:mrow>
                                    </m:mtd>
                                 </m:mtr>
                              </m:mtable>
                           </m:mrow>
                        </m:mrow>
                     </m:mrow>
                     <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacqWGdbWqdaqhaaWcbaGaemOAaOgabaGaemyAaKgaaOGaeyypa0ZaaiqaaeaafaqaaeGacaaabaWaaSqaaSqaaiabigdaXaqaamaaemaabaGaem4uam1aa0baaWqaaiabdUgaRbqaaiabdogaJbaaaSGaay5bSlaawIa7aaaaaOqaaiabbMgaPjabbAgaMjabbccaGiabbsha0jabbIgaOjabbwgaLjabbccaGiabbggaHjabbYgaSjabbEgaNjabb+gaVjabbkhaYjabbMgaPjabbsha0jabbIgaOjabb2gaTjabbccaGiabbohaZjabbwgaLjabbYgaSjabbwgaLjabbogaJjabbsha0jabbohaZjabbccaGiabbsha0jabbIgaOjabbwgaLjabbccaGiabdUgaRjabb2caTiabbsha0jabbIgaOjabbccaGiabbofatjabb6eaojabbcfaqjabbccaGiabbEha3jabbIgaOjabbwgaLjabb6gaUjabbccaGiabbogaJjabb+gaVjabbAha2jabbwgaLjabbkhaYjabbMgaPjabb6gaUjabbEgaNjabbccaGiabbsha0jabbIgaOjabbwgaLjabbccaGiabdMgaPjabb2caTiabbsha0jabbIgaOjabbccaGiabbkhaYjabb+gaVjabbEha3jabbUda7aqaaiaaykW7caaMc8UaeGimaadabaGaee4Ba8MaeeiDaqNaeeiAaGMaeeyzauMaeeOCaiNaee4DaCNaeeyAaKMaee4CamNaeeyzauMaeeOla4caaaGaay5Eaaaaaa@9D18@</m:annotation>
                  </m:semantics>
               </m:math>
            </p>
            <p>Since each selected SNP is assigned 1 unit of cost, the sum of <m:math name="1471-2105-6-263-i13" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>C</m:mi><m:mi>j</m:mi><m:mi>i</m:mi></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacqWGdbWqdaqhaaWcbaGaemOAaOgabaGaemyAaKgaaaaa@30A2@</m:annotation></m:semantics></m:math> for each grid in the table is equal to |<it>C'</it>|,</p>
            <p>i.e.,</p>
            <p>
               <m:math name="1471-2105-6-263-i15" xmlns:m="http://www.w3.org/1998/Math/MathML">
                  <m:semantics>
                     <m:mrow>
                        <m:mo>|</m:mo>
                        <m:mi>C</m:mi>
                        <m:mo>'</m:mo>
                        <m:mo>|</m:mo>
                        <m:mtext>&#8201;</m:mtext>
                        <m:mtext>&#8201;</m:mtext>
                        <m:mo>=</m:mo>
                        <m:mstyle displaystyle="true">
                           <m:munderover>
                              <m:mo>&#8721;</m:mo>
                              <m:mrow>
                                 <m:mi>i</m:mi>
                                 <m:mo>=</m:mo>
                                 <m:mn>1</m:mn>
                              </m:mrow>
                              <m:mrow>
                                 <m:mi>m</m:mi>
                                 <m:mo>+</m:mo>
                                 <m:mn>1</m:mn>
                              </m:mrow>
                           </m:munderover>
                           <m:mrow>
                              <m:mstyle displaystyle="true">
                                 <m:munderover>
                                    <m:mo>&#8721;</m:mo>
                                    <m:mrow>
                                       <m:mi>j</m:mi>
                                       <m:mo>=</m:mo>
                                       <m:mn>1</m:mn>
                                    </m:mrow>
                                    <m:mrow>
                                       <m:mstyle scriptlevel="+1">
                                          <m:mfrac>
                                             <m:mrow>
                                                <m:mi>K</m:mi>
                                                <m:mo stretchy="false">(</m:mo>
                                                <m:mi>K</m:mi>
                                                <m:mo>&#8722;</m:mo>
                                                <m:mn>1</m:mn>
                                                <m:mo stretchy="false">)</m:mo>
                                             </m:mrow>
                                             <m:mn>2</m:mn>
                                          </m:mfrac>
                                       </m:mstyle>
                                    </m:mrow>
                                 </m:munderover>
                                 <m:mrow>
                                    <m:msubsup>
                                       <m:mi>C</m:mi>
                                       <m:mi>j</m:mi>
                                       <m:mi>i</m:mi>
                                    </m:msubsup>
                                 </m:mrow>
                              </m:mstyle>
                           </m:mrow>
                        </m:mstyle>
                        <m:mo>.</m:mo>
                        <m:mtext>&#160;&#160;&#160;&#160;&#160;</m:mtext>
                        <m:mrow>
                           <m:mo>(</m:mo>
                           <m:mn>1</m:mn>
                           <m:mo>)</m:mo>
                        </m:mrow>
                     </m:mrow>
                     <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacqGG8baFcqWGdbWqcqGGNaWjcqGG8baFcaaMc8UaaGPaVlabg2da9maaqahabaWaaabCaeaacqWGdbWqdaqhaaWcbaGaemOAaOgabaGaemyAaKgaaaqaaiabdQgaQjabg2da9iabigdaXaqaamaaleaameaacqWGlbWscqGGOaakcqWGlbWscqGHsislcqaIXaqmcqGGPaqkaeaacqaIYaGmaaaaniabggHiLdaaleaacqWGPbqAcqGH9aqpcqaIXaqmaeaacqWGTbqBcqGHRaWkcqaIXaqma0GaeyyeIuoakiabc6caUiaaxMaacaWLjaWaaeWaaeaacqaIXaqmaiaawIcacaGLPaaaaaa@537C@</m:annotation>
                  </m:semantics>
               </m:math>
            </p>
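            <p>Equation (1) can be checked numerically on a small input. The Python sketch below reruns the first greedy algorithm while recording the cost charged to each grid, using exact fractions. The function name, the 0-based indices and the toy matrix are assumptions made for this illustration.</p>
            <p><![CDATA[
```python
from fractions import Fraction
from itertools import combinations


def greedy_with_costs(matrix, m):
    """Run the first greedy algorithm and record the cost charged to every grid."""
    k = len(matrix[0])
    pairs = sorted(combinations(range(k), 2))
    cover = {s: {p for p in pairs if row[p[0]] != row[p[1]]}
             for s, row in enumerate(matrix)}
    uncovered = [set(pairs) for _ in range(m + 1)]
    # cost[(i, p)] is the cost of the grid at row i and column p; it stays 0 unless
    # the grid is covered by the SNP selected while the algorithm works on row i.
    cost = {(i, p): Fraction(0) for i in range(m + 1) for p in pairs}
    selected = []
    for i in range(m + 1):
        while uncovered[i]:
            if not cover:
                raise ValueError("input cannot tolerate m missing SNPs")
            best = max(cover, key=lambda s: len(cover[s] & uncovered[i]))
            rest = cover.pop(best)
            gain = rest & uncovered[i]      # grids of row i covered by this SNP
            if not gain:
                raise ValueError("input cannot tolerate m missing SNPs")
            for p in gain:                  # spread 1 unit of cost over these grids
                cost[(i, p)] = Fraction(1, len(gain))
            selected.append(best)
            j = i
            while rest and j <= m:
                placed = rest & uncovered[j]
                uncovered[j] -= placed
                rest -= placed
                j += 1
    return selected, cost


# Hypothetical toy block with 4 SNPs (rows) and 3 haplotype patterns (columns).
toy = [[1, 2, 2], [1, 1, 2], [1, 2, 1], [2, 1, 1]]
selected, cost = greedy_with_costs(toy, 1)
print(len(selected), sum(cost.values()))  # 3 3
```
]]></p>
            <p>Grids that a SNP covers in rows below the one being worked on receive cost 0, which is why the total over the table equals the number of selected SNPs.</p>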
            <p>Let <m:math name="1471-2105-6-263-i16" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>R</m:mi><m:mi>k</m:mi><m:mi>i</m:mi></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacqWGsbGudaqhaaWcbaGaem4AaSgabaGaemyAaKgaaaaa@30C2@</m:annotation></m:semantics></m:math> be the number of uncovered grids in the <it>i</it>-th row before the <it>k</it>-th iteration (i.e., (<it>k </it>- 1) SNPs have been selected by the algorithm). For example (see Figure <figr fid="F8">8</figr>), <m:math name="1471-2105-6-263-i17" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>R</m:mi><m:mn>2</m:mn><m:mn>1</m:mn></m:msubsup><m:mo>=</m:mo><m:mn>2</m:mn></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacqWGsbGudaqhaaWcbaGaeGOmaidabaGaeGymaedaaOGaeyypa0JaeGOmaidaaa@31EC@</m:annotation></m:semantics></m:math> since two grids in the first row are still uncovered before the second SNP is selected. Define <m:math name="1471-2105-6-263-i18" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>C</m:mi><m:mi>i</m:mi><m:mo>'</m:mo></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacqWGdbWqdaqhaaWcbaGaemyAaKgabaGaei4jaCcaaaaa@301B@</m:annotation></m:semantics></m:math> as the set of iterations used by the algorithm when working on the <it>i</it>-th row. For example (see Figure <figr fid="F8">8</figr>), <m:math name="1471-2105-6-263-i19" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>C</m:mi><m:mn>2</m:mn><m:mo>'</m:mo></m:msubsup><m:mo>=</m:mo><m:mo>{</m:mo><m:mn>3</m:mn><m:mo>,</m:mo><m:mn>4</m:mn><m:mo>}</m:mo></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacqWGdbWqdaqhaaWcbaGaeGOmaidabaGaei4jaCcaaOGaeyypa0Jaei4EaSNaeG4mamJaeiilaWIaeGinaqJaeiyFa0haaa@368C@</m:annotation></m:semantics></m:math> since the algorithm works on the second row in the third and fourth iterations. We can rewrite (1) as</p>
            <p>
               <m:math name="1471-2105-6-263-i20" xmlns:m="http://www.w3.org/1998/Math/MathML">
                  <m:semantics>
                     <m:mrow>
                        <m:mstyle displaystyle="true">
                           <m:munderover>
                              <m:mo>&#8721;</m:mo>
                              <m:mrow>
                                 <m:mi>i</m:mi>
                                 <m:mo>=</m:mo>
                                 <m:mn>1</m:mn>
                              </m:mrow>
                              <m:mrow>
                                 <m:mi>m</m:mi>
                                 <m:mo>+</m:mo>
                                 <m:mn>1</m:mn>
                              </m:mrow>
                           </m:munderover>
                           <m:mrow>
                              <m:mstyle displaystyle="true">
                                 <m:munderover>
                                    <m:mo>&#8721;</m:mo>
                                    <m:mrow>
                                       <m:mi>j</m:mi>
                                       <m:mo>=</m:mo>
                                       <m:mn>1</m:mn>
                                    </m:mrow>
                                    <m:mrow>
                                       <m:mstyle scriptlevel="+1">
                                          <m:mfrac>
                                             <m:mrow>
                                                <m:mi>K</m:mi>
                                                <m:mo stretchy="false">(</m:mo>
                                                <m:mi>K</m:mi>
                                                <m:mo>&#8722;</m:mo>
                                                <m:mn>1</m:mn>
                                                <m:mo stretchy="false">)</m:mo>
                                             </m:mrow>
                                             <m:mn>2</m:mn>
                                          </m:mfrac>
                                       </m:mstyle>
                                    </m:mrow>
                                 </m:munderover>
                                 <m:mrow>
                                    <m:msubsup>
                                       <m:mi>C</m:mi>
                                       <m:mi>j</m:mi>
                                       <m:mi>i</m:mi>
                                    </m:msubsup>
                                 </m:mrow>
                              </m:mstyle>
                           </m:mrow>
                        </m:mstyle>
                        <m:mo>=</m:mo>
                        <m:mstyle displaystyle="true">
                           <m:munderover>
                              <m:mo>&#8721;</m:mo>
                              <m:mrow>
                                 <m:mi>i</m:mi>
                                 <m:mo>=</m:mo>
                                 <m:mn>1</m:mn>
                              </m:mrow>
                              <m:mrow>
                                 <m:mi>m</m:mi>
                                 <m:mo>+</m:mo>
                                 <m:mn>1</m:mn>
                              </m:mrow>
                           </m:munderover>
                           <m:mrow>
                              <m:mstyle displaystyle="true">
                                 <m:munder>
                                    <m:mo>&#8721;</m:mo>
                                    <m:mrow>
                                       <m:mi>k</m:mi>
                                       <m:mo>&#8712;</m:mo>
                                       <m:msubsup>
                                          <m:mi>C</m:mi>
                                          <m:mi>i</m:mi>
                                          <m:mo>'</m:mo>
                                       </m:msubsup>
                                    </m:mrow>
                                 </m:munder>
                                 <m:mrow>
                                    <m:mrow>
                                       <m:mo>(</m:mo>
                                       <m:mrow>
                                          <m:msubsup>
                                             <m:mi>R</m:mi>
                                             <m:mrow>
                                                <m:mi>k</m:mi>
                                                <m:mo>&#8722;</m:mo>
                                                <m:mn>1</m:mn>
                                             </m:mrow>
                                             <m:mi>i</m:mi>
                                          </m:msubsup>
                                          <m:mo>&#8722;</m:mo>
                                          <m:msubsup>
                                             <m:mi>R</m:mi>
                                             <m:mi>k</m:mi>
                                             <m:mi>i</m:mi>
                                          </m:msubsup>
                                       </m:mrow>
                                       <m:mo>)</m:mo>
                                    </m:mrow>
                                 </m:mrow>
                              </m:mstyle>
                           </m:mrow>
                        </m:mstyle>
                        <m:mfrac>
                           <m:mn>1</m:mn>
                           <m:mrow>
                              <m:mo>|</m:mo>
                              <m:msubsup>
                                 <m:mi>S</m:mi>
                                 <m:mi>k</m:mi>
                                 <m:mi>c</m:mi>
                              </m:msubsup>
                              <m:mo>|</m:mo>
                           </m:mrow>
                        </m:mfrac>
                        <m:mo>.</m:mo>
                        <m:mtext>&#160;&#160;&#160;&#160;&#160;</m:mtext>
                        <m:mrow>
                           <m:mo>(</m:mo>
                           <m:mn>2</m:mn>
                           <m:mo>)</m:mo>
                        </m:mrow>
                     </m:mrow>
                     <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaadaaeWbqaamaaqahabaGaem4qam0aa0baaSqaaiabdQgaQbqaaiabdMgaPbaaaeaacqWGQbGAcqGH9aqpcqaIXaqmaeaadaWcbaadbaGaem4saSKaeiikaGIaem4saSKaeyOeI0IaeGymaeJaeiykaKcabaGaeGOmaidaaaqdcqGHris5aaWcbaGaemyAaKMaeyypa0JaeGymaedabaGaemyBa0Maey4kaSIaeGymaedaniabggHiLdGccqGH9aqpdaaeWbqaamaaqafabaWaaeWaaeaacqWGsbGudaqhaaWcbaGaem4AaSMaeyOeI0IaeGymaedabaGaemyAaKgaaOGaeyOeI0IaemOuai1aa0baaSqaaiabdUgaRbqaaiabdMgaPbaaaOGaayjkaiaawMcaaaWcbaGaem4AaSMaeyicI4Saem4qam0aa0baaWqaaiabdMgaPbqaaiabcEcaNaaaaSqab0GaeyyeIuoaaSqaaiabdMgaPjabg2da9iabigdaXaqaaiabd2gaTjabgUcaRiabigdaXaqdcqGHris5aOWaaSaaaeaacqaIXaqmaeaacqGG8baFcqWGtbWudaqhaaWcbaGaem4AaSgabaGaem4yamgaaOGaeiiFaWhaaiabc6caUiaaxMaacaWLjaWaaeWaaeaacqaIYaGmaiaawIcacaGLPaaaaaa@7177@</m:annotation>
                  </m:semantics>
               </m:math>
            </p>
            <p><b>Lemma 3. </b><it>The k-th selected SNP has </it><m:math name="1471-2105-6-263-i21" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mrow><m:mo>|</m:mo><m:mrow><m:msubsup><m:mi>S</m:mi><m:mi>k</m:mi><m:mi>c</m:mi></m:msubsup></m:mrow><m:mo>|</m:mo></m:mrow><m:mo>&#8805;</m:mo><m:mfrac><m:mrow><m:msubsup><m:mi>R</m:mi><m:mrow><m:mi>k</m:mi><m:mo>&#8722;</m:mo><m:mn>1</m:mn></m:mrow><m:mi>i</m:mi></m:msubsup></m:mrow><m:mrow><m:mo>|</m:mo><m:mi>C</m:mi><m:mo>*</m:mo><m:mo>|</m:mo></m:mrow></m:mfrac></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaadaabdaqaaiabdofatnaaDaaaleaacqWGRbWAaeaacqWGJbWyaaaakiaawEa7caGLiWoacqGHLjYSdaWcaaqaaiabdkfasnaaDaaaleaacqWGRbWAcqGHsislcqaIXaqmaeaacqWGPbqAaaaakeaacqGG8baFcqWGdbWqcqGGQaGkcqGG8baFaaaaaa@40A0@</m:annotation></m:semantics></m:math>.</p>
            <p><it>Proof. </it>Suppose the algorithm is working on the <it>i</it>-th row at the beginning of the <it>k</it>-th iteration. Let <m:math name="1471-2105-6-263-i22" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>C</m:mi><m:mi>k</m:mi><m:mo>*</m:mo></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacqWGdbWqdaqhaaWcbaGaem4AaSgabaGaeiOkaOcaaaaa@3025@</m:annotation></m:semantics></m:math> be the set of SNPs in <it>C* </it>(the optimal solution) that have been selected by the algorithm before the <it>k</it>-th iteration, and let the set of remaining SNPs in <it>C* </it>be <m:math name="1471-2105-6-263-i23" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>C</m:mi><m:mover accent="true"><m:mi>k</m:mi><m:mo>&#175;</m:mo></m:mover><m:mo>*</m:mo></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacqWGdbWqdaqhaaWcbaGafm4AaSMbaebaaeaacqGGQaGkaaaaaa@303D@</m:annotation></m:semantics></m:math>. We claim that there exists a SNP in <m:math name="1471-2105-6-263-i23" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>C</m:mi><m:mover accent="true"><m:mi>k</m:mi><m:mo>&#175;</m:mo></m:mover><m:mo>*</m:mo></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacqWGdbWqdaqhaaWcbaGafm4AaSMbaebaaeaacqGGQaGkaaaaaa@303D@</m:annotation></m:semantics></m:math> which can cover at least <m:math name="1471-2105-6-263-i24" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mfrac><m:mrow><m:msubsup><m:mi>R</m:mi><m:mi>k</m:mi><m:mi>i</m:mi></m:msubsup></m:mrow><m:mrow><m:mrow><m:mo>|</m:mo><m:mrow><m:msubsup><m:mi>C</m:mi><m:mover accent="true"><m:mi>k</m:mi><m:mo>&#175;</m:mo></m:mover><m:mo>*</m:mo></m:msubsup></m:mrow><m:mo>|</m:mo></m:mrow></m:mrow></m:mfrac></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaadaWcaaqaaiabdkfasnaaDaaaleaacqWGRbWAaeaacqWGPbqAaaaakeaadaabdaqaaiabdoeadnaaDaaaleaacuWGRbWAgaqeaaqaaiabcQcaQaaaaOGaay5bSlaawIa7aaaaaaa@3797@</m:annotation></m:semantics></m:math> grids in the <it>i</it>-th row. Otherwise (i.e., each SNP in <m:math name="1471-2105-6-263-i23" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>C</m:mi><m:mover accent="true"><m:mi>k</m:mi><m:mo>&#175;</m:mo></m:mover><m:mo>*</m:mo></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacqWGdbWqdaqhaaWcbaGafm4AaSMbaebaaeaacqGGQaGkaaaaaa@303D@</m:annotation></m:semantics></m:math> covers less than <m:math name="1471-2105-6-263-i24" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mfrac><m:mrow><m:msubsup><m:mi>R</m:mi><m:mi>k</m:mi><m:mi>i</m:mi></m:msubsup></m:mrow><m:mrow><m:mrow><m:mo>|</m:mo><m:mrow><m:msubsup><m:mi>C</m:mi><m:mover accent="true"><m:mi>k</m:mi><m:mo>&#175;</m:mo></m:mover><m:mo>*</m:mo></m:msubsup></m:mrow><m:mo>|</m:mo></m:mrow></m:mrow></m:mfrac></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaadaWcaaqaaiabdkfasnaaDaaaleaacqWGRbWAaeaacqWGPbqAaaaakeaadaabdaqaaiabdoeadnaaDaaaleaacuWGRbWAgaqeaaqaaiabcQcaQaaaaOGaay5bSlaawIa7aaaaaaa@3797@</m:annotation></m:semantics></m:math> grids), all SNPs in <m:math name="1471-2105-6-263-i23" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>C</m:mi><m:mover accent="true"><m:mi>k</m:mi><m:mo>&#175;</m:mo></m:mover><m:mo>*</m:mo></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacqWGdbWqdaqhaaWcbaGafm4AaSMbaebaaeaacqGGQaGkaaaaaa@303D@</m:annotation></m:semantics></m:math> will cover less than <m:math name="1471-2105-6-263-i25" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mo stretchy="false">(</m:mo><m:mfrac><m:mrow><m:msubsup><m:mi>R</m:mi><m:mi>k</m:mi><m:mi>i</m:mi></m:msubsup></m:mrow><m:mrow><m:mrow><m:mo>|</m:mo><m:mrow><m:msubsup><m:mi>C</m:mi><m:mover accent="true"><m:mi>k</m:mi><m:mo>&#175;</m:mo></m:mover><m:mo>*</m:mo></m:msubsup></m:mrow><m:mo>|</m:mo></m:mrow></m:mrow></m:mfrac><m:mo>&#215;</m:mo><m:mrow><m:mo>|</m:mo><m:mrow><m:msubsup><m:mi>C</m:mi><m:mover accent="true"><m:mi>k</m:mi><m:mo>&#175;</m:mo></m:mover><m:mo>*</m:mo></m:msubsup></m:mrow><m:mo>|</m:mo></m:mrow><m:mo>=</m:mo><m:msubsup><m:mi>R</m:mi><m:mi>k</m:mi><m:mi>i</m:mi></m:msubsup><m:mo stretchy="false">)</m:mo></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacqGGOaakdaWcaaqaaiabdkfasnaaDaaaleaacqWGRbWAaeaacqWGPbqAaaaakeaadaabdaqaaiabdoeadnaaDaaaleaacuWGRbWAgaqeaaqaaiabcQcaQaaaaOGaay5bSlaawIa7aaaacqGHxdaTdaabdaqaaiabdoeadnaaDaaaleaacuWGRbWAgaqeaaqaaiabcQcaQaaaaOGaay5bSlaawIa7aiabg2da9iabdkfasnaaDaaaleaacqWGRbWAaeaacqWGPbqAaaGccqGGPaqkaaa@473F@</m:annotation></m:semantics></m:math> grids in the <it>i</it>-th row. But since <m:math name="1471-2105-6-263-i26" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>C</m:mi><m:mi>k</m:mi><m:mo>*</m:mo></m:msubsup><m:mo>&#8746;</m:mo><m:msubsup><m:mi>C</m:mi><m:mover accent="true"><m:mi>k</m:mi><m:mo>&#175;</m:mo></m:mover><m:mo>*</m:mo></m:msubsup><m:mo>=</m:mo><m:mi>C</m:mi><m:mo>*</m:mo></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacqWGdbWqdaqhaaWcbaGaem4AaSgabaGaeiOkaOcaaOGaeSOkIuLaem4qam0aa0baaSqaaiqbdUgaRzaaraaabaGaeiOkaOcaaOGaeyypa0Jaem4qamKaeiOkaOcaaa@37EB@</m:annotation></m:semantics></m:math>, this implies that <it>C* </it>cannot cover all grids in <m:math name="1471-2105-6-263-i16" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>R</m:mi><m:mi>k</m:mi><m:mi>i</m:mi></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacqWGsbGudaqhaaWcbaGaem4AaSgabaGaemyAaKgaaaaa@30C2@</m:annotation></m:semantics></m:math>, which is a contradiction. Because all SNPs in <m:math name="1471-2105-6-263-i23" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>C</m:mi><m:mover accent="true"><m:mi>k</m:mi><m:mo>&#175;</m:mo></m:mover><m:mo>*</m:mo></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacqWGdbWqdaqhaaWcbaGafm4AaSMbaebaaeaacqGGQaGkaaaaaa@303D@</m:annotation></m:semantics></m:math> are candidates for the greedy algorithm, the <it>k</it>-th selected SNP must cover at least <m:math name="1471-2105-6-263-i24" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mfrac><m:mrow><m:msubsup><m:mi>R</m:mi><m:mi>k</m:mi><m:mi>i</m:mi></m:msubsup></m:mrow><m:mrow><m:mrow><m:mo>|</m:mo><m:mrow><m:msubsup><m:mi>C</m:mi><m:mover accent="true"><m:mi>k</m:mi><m:mo>&#175;</m:mo></m:mover><m:mo>*</m:mo></m:msubsup></m:mrow><m:mo>|</m:mo></m:mrow></m:mrow></m:mfrac></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaadaWcaaqaaiabdkfasnaaDaaaleaacqWGRbWAaeaacqWGPbqAaaaakeaadaabdaqaaiabdoeadnaaDaaaleaacuWGRbWAgaqeaaqaaiabcQcaQaaaaOGaay5bSlaawIa7aaaaaaa@3797@</m:annotation></m:semantics></m:math> grids in the <it>i</it>-th row, which implies <m:math name="1471-2105-6-263-i21" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mrow><m:mo>|</m:mo><m:mrow><m:msubsup><m:mi>S</m:mi><m:mi>k</m:mi><m:mi>c</m:mi></m:msubsup></m:mrow><m:mo>|</m:mo></m:mrow><m:mo>&#8805;</m:mo><m:mfrac><m:mrow><m:msubsup><m:mi>R</m:mi><m:mrow><m:mi>k</m:mi><m:mo>&#8722;</m:mo><m:mn>1</m:mn></m:mrow><m:mi>i</m:mi></m:msubsup></m:mrow><m:mrow><m:mo>|</m:mo><m:mi>C</m:mi><m:mo>*</m:mo><m:mo>|</m:mo></m:mrow></m:mfrac></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaadaabdaqaaiabdofatnaaDaaaleaacqWGRbWAaeaacqWGJbWyaaaakiaawEa7caGLiWoacqGHLjYSdaWcaaqaaiabdkfasnaaDaaaleaacqWGRbWAcqGHsislcqaIXaqmaeaacqWGPbqAaaaakeaacqGG8baFcqWGdbWqcqGGQaGkcqGG8baFaaaaaa@40A0@</m:annotation></m:semantics></m:math> since |<it>C*</it>| &#8805; |<m:math name="1471-2105-6-263-i23" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>C</m:mi><m:mover accent="true"><m:mi>k</m:mi><m:mo>&#175;</m:mo></m:mover><m:mo>*</m:mo></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacqWGdbWqdaqhaaWcbaGafm4AaSMbaebaaeaacqGGQaGkaaaaaa@303D@</m:annotation></m:semantics></m:math>| and <m:math name="1471-2105-6-263-i27" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mrow><m:mo>|</m:mo><m:mrow><m:msubsup><m:mi>R</m:mi><m:mi>k</m:mi><m:mi>i</m:mi></m:msubsup></m:mrow><m:mo>|</m:mo></m:mrow><m:mo>&#8804;</m:mo><m:mrow><m:mo>|</m:mo><m:mrow><m:msubsup><m:mi>R</m:mi><m:mrow><m:mi>k</m:mi><m:mo>&#8722;</m:mo><m:mn>1</m:mn></m:mrow><m:mi>i</m:mi></m:msubsup></m:mrow><m:mo>|</m:mo></m:mrow></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaadaabdaqaaiabdkfasnaaDaaaleaacqWGRbWAaeaacqWGPbqAaaaakiaawEa7caGLiWoacqGHKjYOdaabdaqaaiabdkfasnaaDaaaleaacqWGRbWAcqGHsislcqaIXaqmaeaacqWGPbqAaaaakiaawEa7caGLiWoaaaa@3EC0@</m:annotation></m:semantics></m:math>.  &#9633;</p>
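            <p>The averaging step in the proof of Lemma 3 can be checked numerically on a single row. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name greedy_row_cover, the dictionary TOY_COVER, and the representation of each SNP as the set of grids it covers are all hypothetical. At every step the sketch asserts that the greedily chosen SNP covers at least (number of grids still uncovered) / |C*| new grids, where |C*| is the size of some known cover of the row.</p>
            <p>
```python
def greedy_row_cover(grids, snp_cover, optimal_size):
    """Greedily cover one row and check the averaging bound at each step.

    grids: iterable of grid labels in the row (toy input, an assumption).
    snp_cover: dict mapping a SNP name to the set of grids it covers.
    optimal_size: size of some known cover C* of the row.
    """
    uncovered = set(grids)
    chosen = []
    while uncovered:
        # Greedy choice: the SNP covering the most still-uncovered grids.
        # Sorting the names makes tie-breaking deterministic.
        best = max(
            sorted(snp_cover),
            key=lambda s: len(snp_cover[s].intersection(uncovered)),
        )
        gain = len(snp_cover[best].intersection(uncovered))
        if gain == 0:
            raise ValueError("the given SNPs cannot cover this row")
        # Averaging bound: gain is at least len(uncovered) / optimal_size.
        assert gain * optimal_size >= len(uncovered)
        chosen.append(best)
        uncovered.difference_update(snp_cover[best])
    return chosen


# Hypothetical toy row with six grids; {s1, s2} is a cover of size 2.
TOY_COVER = {
    "s1": {1, 2, 3},
    "s2": {4, 5, 6},
    "s3": {1, 4},
    "s4": {2, 5},
    "s5": {3, 6},
}

print(greedy_row_cover(range(1, 7), TOY_COVER, optimal_size=2))
```
            </p>
            <p>The assertion holds for any valid cover size because the grids still uncovered are all covered by the SNPs of that cover, so at least one of them, and hence the greedy choice, must cover an average share.</p>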
            <p><b>Theorem 2. </b><it>The first greedy algorithm gives an </it>(<it>m </it>+ 1) <m:math name="1471-2105-6-263-i1" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mtext>ln</m:mtext><m:mfrac><m:mrow><m:mi>K</m:mi><m:mo stretchy="false">(</m:mo><m:mi>K</m:mi><m:mo>&#8722;</m:mo><m:mn>1</m:mn><m:mo stretchy="false">)</m:mo></m:mrow><m:mn>2</m:mn></m:mfrac></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacqqGSbaBcqqGUbGBdaWcaaqaaiabdUealjabcIcaOiabdUealjabgkHiTiabigdaXiabcMcaPaqaaiabikdaYaaaaaa@363F@</m:annotation></m:semantics></m:math> <it>approximation</it>.</p>
            <p><it>Proof. </it>Define the <it>d-</it>th harmonic number as <m:math name="1471-2105-6-263-i28" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mi>H</m:mi><m:mo stretchy="false">(</m:mo><m:mi>d</m:mi><m:mo stretchy="false">)</m:mo><m:mo>=</m:mo><m:mstyle displaystyle="true"><m:msubsup><m:mo>&#8721;</m:mo><m:mrow><m:mi>i</m:mi><m:mo>=</m:mo><m:mn>1</m:mn></m:mrow><m:mi>d</m:mi></m:msubsup><m:mrow><m:mfrac><m:mn>1</m:mn><m:mi>i</m:mi></m:mfrac></m:mrow></m:mstyle></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciGacaGaaeqabaqabeGadaaakeaacqWGibascqGGOaakcqWGKbazcqGGPaqkcqGH9aqpdaaeWaqaamaalaaabaGaeGymaedabaGaemyAaKgaaaWcbaGaemyAaKMaeyypa0JaeGymaedabaGaemizaqganiabggHiLdaaaa@3ACF@</m:annotation></m:semantics></m:math> and <it>H</it>(0) = 0. By (2) and Lemma 3,</p>
            <p>
               <m:math name="1471-2105-6-263-i29" xmlns:m="http://www.w3.org/1998/Math/MathML">
                  <m:semantics>
                     <m:mtable>
                        <m:mtr>
                           <m:mtd>
                              <m:mstyle displaystyle="true">
                                 <m:munderover>
                                    <m:mo>&#8721;</m:mo>
                                    <m:mrow>
                                       <m:mi>i</m:mi>
                                       <m:mo>=</m:mo>
                                       <m:mn>1</m:mn>
                                    </m:mrow>
                                    <m:mrow>
                                       <m:mi>m</m:mi>
                                       <m:mo>+</m:mo>
                                       <m:mn>1</m:mn>
                                    </m:mrow>
                                 </m:munderover>
                                 <m:mrow>
                                    <m:mstyle displaystyle="true">
                                       <m:munderover>
                                          <m:mo>&#8721;</m:mo>
                                          <m:mrow>
                                             <m:mi>j</m:mi>
                                             <m:mo>=</m:mo>
                                             <m:mn>1</m:mn>
                                          </m:mrow>
                                          <m:mrow>
                                             <m:mstyle scriptlevel="+1">
                                                <m:mfrac>
                                                   <m:mrow>
                                                      <m:mi>K</m:mi>
                                                      <m:mo stretchy="false">(</m:mo>
                                                      <m:mi>K</m:mi>
                                                      <m:mo>&#8722;</m:mo>
                                                      <m:mn>1</m:mn>
                                                      <m:mo stretchy="false">)</m:mo>
                                                   </m:mrow>
                                                   <m:mn>2</m:mn>
                                                </m:mfrac>
                                             </m:mstyle>
                                          </m:mrow>
                                       </m:munderover>
                                       <m:mrow>
                                          <m:msubsup>
                                             <m:mi>C</m:mi>
                                             <m:mi>j</m:mi>
                                             <m:mi>i</m:mi>
                                          </m:msubsup>
                                       </m:mrow>
                                    </m:mstyle>
                                 </m:mrow>
                              </m:mstyle>
                              <m:mo>=</m:mo>
                              <m:mstyle displaystyle="true">
                                 <m:munderover>
                                    <m:mo>&#8721;</m:mo>
                                    <m:mrow>
                                       <m:mi>i</m:mi>
                                       <m:mo>=</m:mo>
                                       <m:mn>1</m:mn>
                                    </m:mrow>
                                    <m:mrow>
                                       <m:mi>m</m:mi>
                                       <m:mo>+</m:mo>
                                       <m:mn>1</m:mn>
                                    </m:mrow>
                                 </m:munderover>
                                 <m:mrow>
                                    <m:mstyle displaystyle="true">
                                       <m:munder>
                                          <m:mo>&#8721;</m:mo>
                                          <m:mrow>
                                             <m:mi>k</m:mi>
                                             <m:mo>&#8712;</m:mo>
                                             <m:msubsup>
                                                <m:mi>C</m:mi>
                                                <m:mi>i</m:mi>
                                                <m:mo>'</m:mo>
                                             </m:msubsup>
                                          </m:mrow>
                                       </m:munder>
                                       <m:mrow>
                                          <m:mrow>
                                             <m:mo>(</m:mo>
                                             <m:mrow>
                                                <m:msubsup>
                                                   <m:mi>R</m:mi>
                                                   <m:mrow>
                                                      <m:mi>k</m:mi>
                                                      <m:mo>&#8722;</m:mo>
                                                      <m:mn>1</m:mn>
                                                   </m:mrow>
                                                   <m:mi>i</m:mi>
                                                </m:msubsup>
                                                <m:mo>&#8722;</m:mo>
                                                <m:msubsup>
                                                   <m:mi>R</m:mi>
                                                   <m:mi>k</m:mi>
                                                   <m:mi>i</m:mi>
                                                </m:msubsup>
                                             </m:mrow>
                                             <m:mo>)</m:mo>
                                          </m:mrow>
                                       </m:mrow>
                                    </m:mstyle>
                                 </m:mrow>
                              </m:mstyle>
                              <m:mstyle scriptlevel="+1">
                                 <m:mfrac>
                                    <m:mn>1</m:mn>
                                    <m:mrow>
                                       <m:mrow>
                                          <m:mo>|</m:mo>
                                          <m:mrow>
                                             <m:msubsup>
                                                <m:mi>S</m:mi>
                                                <m:mi>k</m:mi>
                                                <m:mi>c</m:mi>
                                             </m:msubsup>
                                          </m:mrow>
                                          <m:mo>|</m:mo>
                                       </m:mrow>
                                    </m:mrow>
                                 </m:mfrac>
                              </m:mstyle>
                              <m:mo>&#8804;</m:mo>
                              <m:mstyle displaystyle="true">
                                 <m:munderover>
                                    <m:mo>&#8721;</m:mo>
                                    <m:mrow>
                                       <m:mi>i</m:mi>
                                       <m:mo>=</m:mo>
                                       <m:mn>1</m:mn>
                                    </m:mrow>
                                    <m:mrow>
                                       <m:mi>m</m:mi>
                                       <m:mo>+</m:mo>
                                       <m:mn>1</m:mn>
                                    </m:mrow>
                                 </m:munderover>
                                 <m:mrow>
                                    <m:mstyle displaystyle="true">
                                       <m:munder>
                                          <m:mo>&#8721;</m:mo>
                                          <m:mrow>
                                             <m:mi>k</m:mi>
                                             <m:mo>&#8712;</m:mo>
                                             <m:msubsup>
                                                <m:mi>C</m:mi>
                                                <m:mi>i</m:mi>
                                                <m:mo>'</m:mo>
                                             </m:msubsup>
                                          </m:mrow>
                                       </m:munder>
                                       <m:mrow>
                                          <m:mrow>
                                             <m:mo>(</m:mo>
      