<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art><ui>1471-2105-11-66</ui><ji>1471-2105</ji><fm>
<dochead>Research article</dochead>
<bibl>
<title>
<p>FastTagger: an efficient algorithm for genome-wide tag SNP selection using multi-marker linkage disequilibrium</p>
</title>
<aug>
<au ca="yes" id="A1"><snm>Liu</snm><fnm>Guimei</fnm><insr iid="I1"/><email>liugm@comp.nus.edu.sg</email></au>
<au id="A2"><snm>Wang</snm><fnm>Yue</fnm><insr iid="I2"/><email>wangyue@nus.edu.sg</email></au>
<au id="A3"><snm>Wong</snm><fnm>Limsoon</fnm><insr iid="I1"/><email>wongls@comp.nus.edu.sg</email></au>
</aug>
<insg>
<ins id="I1"><p>Department of Computer Science, National University of Singapore, Singapore</p></ins>
<ins id="I2"><p>NUS Graduate School for Integrative Science and Engineering, National University of Singapore, Singapore</p></ins>
</insg>
<source>BMC Bioinformatics</source>
<issn>1471-2105</issn>
<pubdate>2010</pubdate>
<volume>11</volume>
<issue>1</issue>
<fpage>66</fpage>
<url>http://www.biomedcentral.com/1471-2105/11/66</url>
<xrefbib><pubidlist><pubid idtype="doi">10.1186/1471-2105-11-66</pubid><pubid idtype="pmpid">20113476</pubid></pubidlist></xrefbib>
</bibl>
<history><rec><date><day>18</day><month>8</month><year>2009</year></date></rec><acc><date><day>29</day><month>1</month><year>2010</year></date></acc><pub><date><day>29</day><month>1</month><year>2010</year></date></pub></history>
<cpyrt><year>2010</year><collab>Liu et al; licensee BioMed Central Ltd.</collab><note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note></cpyrt>
<abs>
<sec>
<st>
<p>Abstract</p>
</st>
<sec>
<st>
<p>Background</p>
</st>
<p>The human genome contains millions of common single nucleotide polymorphisms (SNPs), and these SNPs play an important role in understanding the association between genetic variations and human diseases. Many SNPs show correlated genotypes, or linkage disequilibrium (LD), so it is not necessary to genotype all SNPs for an association study. Many algorithms have been developed to find a small subset of SNPs, called tag SNPs, that are sufficient to infer all the other SNPs. Algorithms based on the <it>r</it>
<sup>2 </sup>LD statistic have gained popularity because <it>r</it>
<sup>2 </sup>is directly related to statistical power to detect disease associations. Most existing <it>r</it>
<sup>2 </sup>based algorithms use pairwise LD. Recent studies show that multi-marker LD can help further reduce the number of tag SNPs. However, existing tag SNP selection algorithms based on multi-marker LD are both time-consuming and memory-consuming. They cannot work on chromosomes containing more than 100 k SNPs using length-3 tagging rules.</p>
</sec>
<sec>
<st>
<p>Results</p>
</st>
<p>We propose an efficient algorithm called FastTagger to calculate multi-marker tagging rules and select tag SNPs based on multi-marker LD. FastTagger uses several techniques to reduce running time and memory consumption. Our experimental results show that FastTagger is several times faster than existing multi-marker based tag SNP selection algorithms, while also consuming much less memory. As a result, FastTagger can work on chromosomes containing more than 100 k SNPs using length-3 tagging rules.</p>
<p>FastTagger also produces smaller sets of tag SNPs than existing multi-marker based algorithms, and the reduction ratio ranges from 3% to 9% when length-3 tagging rules are used. The generated tagging rules can also be used for genotype imputation. We studied the prediction accuracy of individual rules, and the average accuracy is above 96% when <it>r</it>
<sup>2 </sup>&#8805; 0.9.</p>
</sec>
<sec>
<st>
<p>Conclusions</p>
</st>
<p>Generating multi-marker tagging rules is a computation-intensive task, and it is the bottleneck of existing multi-marker based tag SNP selection methods. FastTagger is a practical and scalable algorithm to solve this problem.</p>
</sec>
</sec>
</abs>
</fm><bdy>
<sec>
<st>
<p>Background</p>
</st>
<p>A single-nucleotide polymorphism (SNP) is a DNA sequence variation occurring when a single nucleotide (A, T, C, or G) in the genome differs between members of a species (or between paired chromosomes in an individual). SNPs are the most common genetic variations in the human genome, and they are very important for understanding the genetic basis of common diseases. Millions of SNPs are present in the human genome, and this enormous number presents a challenging problem for genome-wide association studies. It has been observed that adjacent SNPs are often highly correlated. To reduce genotyping cost, many algorithms have been developed to select the smallest set of SNPs such that all the other SNPs can be inferred from them. The selected SNPs are called <it>tag SNPs</it>.</p>
<p>Existing tag SNP selection methods can be classified into two categories: block based methods <abbrgrp>
<abbr bid="B1">1</abbr>
<abbr bid="B2">2</abbr>
<abbr bid="B3">3</abbr>
<abbr bid="B4">4</abbr>
<abbr bid="B5">5</abbr>
<abbr bid="B6">6</abbr>
<abbr bid="B7">7</abbr>
</abbrgrp> and genome-wide approaches <abbrgrp>
<abbr bid="B8">8</abbr>
<abbr bid="B9">9</abbr>
<abbr bid="B10">10</abbr>
<abbr bid="B11">11</abbr>
<abbr bid="B12">12</abbr>
<abbr bid="B13">13</abbr>
</abbrgrp>. Block based methods rely on a predefined haplotype block structure. The blocks are separated by recombination hot-spots, and there are few recombinations within a block, so the haplotypes within a block are usually of low diversity. These methods then attempt to select a subset of SNPs that can discriminate all common haplotypes within each block. The genome-wide tag SNP selection algorithms do not need to partition the whole chromosome into blocks; instead, they utilize linkage disequilibrium among nearby SNPs to find tag SNPs. Among the genome-wide approaches, those based on the <it>r</it>
<sup>2 </sup>linkage disequilibrium statistic have gained increasing popularity recently because <it>r</it>
<sup>2 </sup>is directly related to statistical power to detect disease associations <abbrgrp>
<abbr bid="B14">14</abbr>
</abbrgrp>.</p>
<p>Algorithm LD-select <abbrgrp>
<abbr bid="B9">9</abbr>
</abbrgrp> is the first algorithm using the <it>r</it>
<sup>2 </sup>LD statistic to select tag SNPs, and it employs a greedy approach to find tag SNPs. Following it, several other algorithms based on the <it>r</it>
<sup>2 </sup>statistic have been developed. FESTA <abbrgrp>
<abbr bid="B12">12</abbr>
</abbrgrp> breaks down large marker sets into disjoint pieces, where exhaustive searches can replace the greedy algorithm, thus leading to smaller tag SNP sets. MultiPop-TagSelect <abbrgrp>
<abbr bid="B15">15</abbr>
</abbrgrp> and REAPER <abbrgrp>
<abbr bid="B11">11</abbr>
</abbrgrp> apply LD-select to multiple populations. LRTag <abbrgrp>
<abbr bid="B13">13</abbr>
</abbrgrp> uses a Lagrangian relaxation algorithm to find tag SNPs across multiple populations. All these algorithms use pairwise LD between SNPs.</p>
<p>Recent studies have shown that multi-marker LD can help further reduce the number of tag SNPs needed <abbrgrp>
<abbr bid="B16">16</abbr>
<abbr bid="B17">17</abbr>
<abbr bid="B18">18</abbr>
</abbrgrp>, and several algorithms have been developed to select tag SNPs based on multi-marker <it>r</it>
<sup>2 </sup>statistics <abbrgrp>
<abbr bid="B19">19</abbr>
<abbr bid="B20">20</abbr>
<abbr bid="B21">21</abbr>
</abbrgrp>. These algorithms find association rules of the form {<it>SNP</it>
<sub>1</sub>, &#8943;, <it>SNP</it>
<sub>
<it>k</it>
</sub>} &#8594; <it>SNP</it>
<sub>
<it>x</it>
</sub>, where <it>k </it>&#8804; 3, <it>SNP</it>
<sub>
<it>x </it>
</sub>&#8713; {<it>SNP</it>
<sub>1</sub>, &#8943;, <it>SNP</it>
<sub>
<it>k</it>
</sub>} and the <it>r</it>
<sup>2 </sup>statistic between the left hand side and the right hand side of the rule is no less than a predefined threshold. Their results show that the multi-marker LD model can reduce the number of tag SNPs significantly compared with pairwise algorithms. However, existing multi-marker based algorithms are both time-consuming and memory-consuming. Most of the time is spent on calculating multi-marker <it>r</it>
<sup>2 </sup>statistics. Furthermore, an excessive number of multi-marker association rules may be generated when <it>k </it>&#8805; 3, which incurs high memory consumption when using these rules to select tag SNPs. It takes hundreds of hours for the MultiTag algorithm <abbrgrp>
<abbr bid="B19">19</abbr>
<abbr bid="B20">20</abbr>
</abbrgrp> to finish on chromosomes containing around 30 k SNPs. The MMTagger algorithm <abbrgrp>
<abbr bid="B21">21</abbr>
</abbrgrp> needs several hours to finish, but it consumes more than 1 GB of memory. MMTagger cannot work on chromosomes with more than 100 k SNPs when <it>k </it>&#8805; 3. In this paper, we propose a multi-marker LD based tag SNP selection algorithm called FastTagger. FastTagger employs several techniques to reduce running time and memory consumption: (1) It merges nearby equivalent SNPs together to reduce the number of multi-marker association rules to be tested. (2) It prunes redundant rules to reduce the number of rules generated. (3) If there are too many rules, it uses a heuristic to skip some of them, that is, a rule is skipped if the right hand side of the rule has already been covered a sufficient number of times. (4) If the total size of the rules generated exceeds the memory size, it divides the chromosome into chunks, and then finds tag SNPs within each chunk. This last technique allows FastTagger to work on chromosomes containing more than 100 k SNPs with as little as 50 MB of memory.</p>
</sec>
<sec>
<st>
<p>Methods</p>
</st>
<p>In this section, we first describe how to calculate multi-marker <it>r</it>
<sup>2 </sup>statistics, and then present the FastTagger algorithm. The FastTagger algorithm consists of two steps. In the first step, it generates tagging rules, and in the second step, it uses a greedy approach to select tag SNPs using rules generated in the first step.</p>
<sec>
<st>
<p>Multi-marker tagging rules</p>
</st>
<p>Most SNPs have only two alleles, so we consider only bi-allelic SNPs. Given a population, the allele with the higher frequency in the population is called the major allele, and the allele with the lower frequency is called the minor allele. We use uppercase letters to denote the major alleles of SNPs, and lowercase letters to denote the minor alleles. SNPs that are far apart from each other are usually not linked. We therefore require that the distance between every pair of SNPs in a rule must not exceed a predefined distance threshold <it>max_dist</it>.</p>
<p>Given <it>k </it>SNPs <it>S </it>= {<it>SNP</it>
<sub>1</sub>, <it>SNP</it>
<sub>2</sub>, &#8943;, <it>SNP</it>
<sub>
<it>k</it>
</sub>}, there are 2<sup>
<it>k </it>
</sup>possible haplotypes over the <it>k </it>loci. To calculate the <it>r</it>
<sup>2 </sup>statistic of rule <it>S </it>&#8594; <it>SNP</it>
<sub>
<it>x</it>
</sub>, we need to divide the 2<sup>
<it>k </it>
</sup>haplotypes into two non-empty groups and map the two groups to the two alleles of <it>SNP</it>
<sub>
<it>x</it>
</sub>. MultiTag <abbrgrp>
<abbr bid="B19">19</abbr>
</abbrgrp> and MMTagger <abbrgrp>
<abbr bid="B21">21</abbr>
</abbrgrp> use different methods to do the mapping.</p>
<sec>
<st>
<p>The one-vs-the-rest model</p>
</st>
<p>MultiTag uses this model. In total, there are <inline-formula>
<graphic file="1471-2105-11-66-i1.gif"/>
</inline-formula> - 2 possible ways to group the 2<sup>
<it>k </it>
</sup>haplotypes into two non-empty groups. MultiTag considers only 2<sup>
<it>k </it>
</sup>ways such that one group contains only one haplotype, and the other group contains all the other haplotypes. It calculates the <it>r</it>
<sup>2 </sup>statistics for all the 2<sup>
<it>k </it>
</sup>groupings, and then selects the one with the highest <it>r</it>
<sup>2 </sup>statistic.</p>
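<p>The one-vs-the-rest grouping can be illustrated with a minimal sketch. It assumes phased haplotypes coded as 0/1, with the left hand side given as one tuple per chromosome and the right hand side as one allele per chromosome; the function names are hypothetical and this is not MultiTag's actual implementation. Only haplotypes observed in the data are tried, since an unobserved haplotype yields a constant indicator.</p>

```python
def r2_binary(x, y):
    """r^2 between two 0/1 sequences of equal length (0.0 if either is constant)."""
    n = len(x)
    px = sum(x) / n
    py = sum(y) / n
    pxy = sum(1 for a, b in zip(x, y) if a == 1 and b == 1) / n
    denom = px * (1 - px) * py * (1 - py)
    if denom == 0:
        return 0.0
    return (pxy - px * py) ** 2 / denom


def one_vs_rest_r2(left_haps, target):
    """Best r^2 over groupings that put one haplotype against all the others.

    left_haps: list of tuples, the alleles of the k left hand side SNPs
               on each chromosome.
    target:    list of 0/1 alleles of the right hand side SNP.
    """
    best = 0.0
    for hap in set(left_haps):
        # Indicator marker: 1 where the chromosome carries this haplotype.
        indicator = [1 if h == hap else 0 for h in left_haps]
        best = max(best, r2_binary(indicator, target))
    return best
```

<p>In this sketch, a right hand side allele that appears exactly on the chromosomes carrying one particular left hand side haplotype gives an <it>r</it><sup>2</sup> of 1.</p>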
</sec>
<sec>
<st>
<p>The co-occurrence model</p>
</st>
<p>MMTagger does the mapping based on the co-occurrences of the alleles of the SNPs on the left hand side and the alleles of the SNP on the right hand side. Let <it>H </it>be a haplotype over the SNP set <it>S </it>on the left hand side, <it>A </it>and <it>a </it>be the two alleles of <it>SNP</it>
<sub>
<it>x </it>
</sub>on the right hand side, and <it>f</it>(<it>H</it>) be the frequency of <it>H</it>. We use <it>f</it>(<it>HA</it>) to denote the frequency of <it>H </it>and <it>SNP</it>
<sub>
<it>x </it>
</sub>= <it>A </it>occurring together, and <it>f</it>(<it>Ha</it>) to denote the frequency of <it>H </it>and <it>SNP</it>
<sub>
<it>x </it>
</sub>= <it>a </it>occurring together. If <it>f </it>(<it>HA</it>) &gt; <it>f </it>(<it>Ha</it>), we map haplotype <it>H </it>to allele <it>A </it>of <it>SNP</it>
<sub>
<it>x</it>
</sub>, otherwise we map haplotype <it>H </it>to allele <it>a </it>of <it>SNP</it>
<sub>
<it>x</it>
</sub>. Let <it>H</it>
<sub>
<it>A </it>
</sub>be the set of haplotypes mapped to allele <it>A</it>, and <it>H</it>
<sub>
<it>a </it>
</sub>be the set of haplotypes mapped to allele <it>a</it>. We convert SNP set <it>S </it>to a bi-allelic marker with two "alleles" <it>H</it>
<sub>
<it>A </it>
</sub>and <it>H</it>
<sub>
<it>a</it>
</sub>. Then we can calculate the <it>r</it>
<sup>2 </sup>statistic between <it>S </it>and <it>SNP</it>
<sub>
<it>x </it>
</sub>as follows.</p>
<p>
<display-formula id="M1">
<graphic file="1471-2105-11-66-i2.gif"/>
</display-formula>
</p>
<p>where <it>P</it>(<it>H</it>
<sub>
<it>A</it>
</sub>), <it>P </it>(<it>H</it>
<sub>
<it>a</it>
</sub>), <it>P </it>(<it>A</it>), <it>P </it>(<it>a</it>) and <it>P </it>(<it>H</it>
<sub>
<it>A</it>
</sub>
<it>A</it>) are the relative frequencies of <it>H</it>
<sub>
<it>A</it>
</sub>, <it>H</it>
<sub>
<it>a</it>
</sub>, <it>A</it>, <it>a </it>and <it>H</it>
<sub>
<it>A</it>
</sub>
<it>A </it>respectively.</p>
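<p>The co-occurrence mapping can be sketched as follows. The sketch assumes that the displayed formula is the standard two-locus statistic, <it>r</it><sup>2</sup> = (<it>P</it>(<it>H</it><sub><it>A</it></sub><it>A</it>) - <it>P</it>(<it>H</it><sub><it>A</it></sub>)<it>P</it>(<it>A</it>))<sup>2</sup> / (<it>P</it>(<it>H</it><sub><it>A</it></sub>)<it>P</it>(<it>H</it><sub><it>a</it></sub>)<it>P</it>(<it>A</it>)<it>P</it>(<it>a</it>)), which is consistent with the quantities listed above. The 0/1 coding of alleles and the function name are assumptions made for illustration only.</p>

```python
from collections import Counter


def cooccurrence_r2(left_haps, target):
    """r^2 between a SNP set and a target SNP under the co-occurrence model.

    left_haps: list of tuples, the haplotype H over the SNP set S on each
               chromosome.
    target:    list of alleles of SNP_x, coded 1 for allele A and 0 for a.
    """
    n = len(left_haps)
    count_with_A = Counter()  # f(HA), as counts
    count_with_a = Counter()  # f(Ha), as counts
    for hap, allele in zip(left_haps, target):
        if allele == 1:
            count_with_A[hap] += 1
        else:
            count_with_a[hap] += 1

    # Map H to allele A if f(HA) > f(Ha); otherwise H is mapped to allele a.
    mapped_to_A = {h for h in set(left_haps)
                   if count_with_A[h] > count_with_a[h]}

    p_HA = sum(1 for h in left_haps if h in mapped_to_A) / n
    p_A = sum(target) / n
    p_HA_A = sum(1 for h, t in zip(left_haps, target)
                 if h in mapped_to_A and t == 1) / n

    denom = p_HA * (1 - p_HA) * p_A * (1 - p_A)
    if denom == 0:
        return 0.0
    return (p_HA_A - p_HA * p_A) ** 2 / denom
```

<p>For example, if the target allele is <it>A</it> exactly when the two left hand side SNPs carry different alleles, neither SNP alone is correlated with the target, but the pair tags it with <it>r</it><sup>2</sup> = 1 under this sketch.</p>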
<p>We implemented both models in the FastTagger algorithm, and let users choose which model they want to use.</p>
<p>If the <it>r</it>
<sup>2 </sup>statistic between <it>S </it>and <it>SNP</it>
<sub>
<it>x </it>
</sub>is no less than a predefined threshold <it>min_r</it>2, we say that <it>SNP</it>
<sub>
<it>x </it>
</sub>can be tagged by <it>S</it>, and <it>R </it>: <it>S </it>&#8594; <it>SNP</it>
<sub>
<it>x </it>
</sub>is a <it>tagging rule</it>. With the increase of the size of <it>S</it>, the haplotypes of <it>S </it>partition the whole dataset into finer and finer groups. In an extreme case, every haplotype of <it>S </it>occurs at most once. In this case, the association between haplotypes of <it>S </it>and alleles of <it>SNP</it>
<sub>
<it>x </it>
</sub>becomes unreliable. To prevent over-fitting, we put a constraint on the size of <it>S</it>: the size of <it>S </it>must not exceed a predefined threshold <it>max_size</it>.</p>
<p>The <it>r</it>
<sup>2 </sup>statistics can be calculated from phased haplotype data directly. If the SNP data are in the form of unphased genotype data, we can use existing haplotype inference algorithms such as PHASE <abbrgrp>
<abbr bid="B22">22</abbr>
</abbrgrp> to convert genotype data into phased haplotype data. We can also estimate <it>k</it>-marker haplotype frequencies directly from genotype data without phasing using the algorithms described in <abbrgrp>
<abbr bid="B23">23</abbr>
<abbr bid="B24">24</abbr>
</abbrgrp>. The second approach is used in algorithm LD-select <abbrgrp>
<abbr bid="B9">9</abbr>
</abbrgrp>.</p>
</sec>
</sec>
<sec>
<st>
<p>Generating tagging rules</p>
</st>
<p>To generate all the tagging rules, we need to enumerate all the SNP sets that satisfy the maximum distance constraint and maximum size constraint, and then calculate the <it>r</it>
<sup>2 </sup>statistics between these SNP sets and their nearby SNPs. The search space can be enormously large when the number of SNPs is large. We use several techniques to reduce the number of rules to be tested.</p>
<sec>
<st>
<p>Merging equivalent SNPs</p>
</st>
<p>Given two SNPs <it>SNP</it>
<sub>
<it>i </it>
</sub>and <it>SNP</it>
<sub>
<it>j</it>
</sub>, if <it>r</it>
<sup>2</sup>(<it>SNP</it>
<sub>
<it>i</it>
</sub>, <it>SNP</it>
<sub>
<it>j</it>
</sub>) = 1, which means that <it>SNP</it>
<sub>
<it>i </it>
</sub>and <it>SNP</it>
<sub>
<it>j </it>
</sub>can tag each other perfectly, then we say that <it>SNP</it><sub><it>i</it></sub> and <it>SNP</it><sub><it>j</it></sub> are equivalent. Two equivalent SNPs always have the same <it>r</it>
<sup>2 </sup>statistics with other SNPs, thus the computation cost of the rules involving them can be shared by merging them together.</p>
<p>For each group of merged equivalent SNPs, a representative SNP is picked to represent this group. FastTagger generates tagging rules between representative SNPs only. The tagging rules generated in this way are called representative tagging rules. One representative tagging rule can actually represent multiple rules. Therefore, by merging equivalent SNPs, we are not only saving computation cost, but also reducing storage overhead.</p>
<p>Note that not every rule represented by a representative tagging rule is valid. Some of them may not satisfy the distance constraint. Equivalent SNPs that are separated by more than <it>max_dist </it>bases cannot appear in the same rule, and merging them together can produce many false rules. To reduce the number of false rules, FastTagger only merges equivalent SNPs that are within a distance of <it>max_dist</it>.</p>
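<p>One possible way to merge equivalent SNPs within <it>max_dist</it> is sketched below. It relies on the fact that two polymorphic bi-allelic SNPs have <it>r</it><sup>2</sup> = 1 exactly when their allele vectors over the haplotypes are identical or exact complements. The grouping strategy (a group is anchored at its first SNP and closed once a later SNP lies more than <it>max_dist</it> away) and all names are assumptions for illustration; the paper does not specify these details.</p>

```python
def merge_equivalent_snps(positions, alleles, max_dist):
    """Group equivalent SNPs that lie within max_dist bases of each other.

    positions: base position of each SNP.
    alleles:   one 0/1 tuple per SNP, giving its allele on every haplotype.
    Returns a list of groups; each group is a list of SNP indices in
    position order, and its first member can serve as the representative.
    """
    open_group = {}  # canonical allele vector -> index of its latest group
    groups = []
    order = sorted(range(len(positions)), key=lambda i: positions[i])
    for idx in order:
        vec = tuple(alleles[idx])
        # Canonical form: flip the coding so that the first entry is 0.
        # Identical and complementary vectors then share one key.
        key = vec if vec[0] == 0 else tuple(1 - v for v in vec)
        g = open_group.get(key)
        if g is not None and max_dist >= positions[idx] - positions[groups[g][0]]:
            groups[g].append(idx)
        else:
            open_group[key] = len(groups)
            groups.append([idx])
    return groups
```

<p>Because SNPs are visited in position order and each group is bounded by the distance to its first member, every pair of SNPs in a group is at most <it>max_dist</it> bases apart.</p>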
</sec>
<sec>
<st>
<p>Pruning redundant tagging rules</p>
</st>
<p>If a SNP <it>SNP</it>
<sub>
<it>x </it>
</sub>can be tagged by a SNP set <it>S</it>, then any rule <it>S' </it>&#8594; <it>SNP</it>
<sub>
<it>x </it>
</sub>such that <it>S' </it>is a proper superset of <it>S </it>is redundant. FastTagger generates only non-redundant tagging rules to reduce running time and memory consumption, and the definition of non-redundant rules is given as follows:</p>
<p>
<b>Definition 1 (Non-redundant tagging rule) </b>
<it>Given a rule S &#8594; SNP</it>
<sub>
<it>x </it>
</sub>
<it>such that SNP</it>
<sub>
<it>x </it>
</sub>
<it>can be tagged by S, if there does not exist another rule S' &#8594; SNP</it>
<sub>
<it>x </it>
</sub>
<it>such that S' is a proper subset of S and SNP</it>
<sub>
<it>x </it>
</sub>
<it>can be tagged by S', then S &#8594; SNP</it>
<sub>
<it>x </it>
</sub>
<it>is called a non-redundant tagging rule</it>.</p>
<p>To prune redundant rules, before calculating the <it>r</it>
<sup>2 </sup>statistic between <it>S </it>and <it>SNP</it>
<sub>
<it>x</it>
</sub>, FastTagger checks whether there exists a subset <it>S' </it>of <it>S </it>such that <it>SNP</it>
<sub>
<it>x </it>
</sub>can be tagged by <it>S'</it>. FastTagger uses a depth-first strategy to enumerate SNP sets. This search strategy is adopted from a frequent generator mining algorithm <abbrgrp>
<abbr bid="B25">25</abbr>
</abbrgrp>, and it ensures that all the tagging rules whose left hand side is a subset of <it>S </it>are generated before <it>S </it>is processed.</p>
<p>Many tagging rules can be generated. To speed up the check operation, FastTagger divides the generated tagging rules into groups based on their right hand side SNP, that is, rules with the same right hand side SNP are in the same group. FastTagger then uses a hash map to index the rules in the same group, and the hashing key is the left hand side of the rules. To check whether <it>S </it>&#8594; <it>SNP</it>
<sub>
<it>x </it>
</sub>is redundant, FastTagger searches the hash map of <it>SNP</it>
<sub>
<it>x </it>
</sub>for the subsets of <it>S</it>. If there is a subset of <it>S </it>in the hash map of <it>SNP</it>
<sub>
<it>x</it>
</sub>, the rule is redundant; otherwise, the <it>r</it>
<sup>2 </sup>statistic of the rule is calculated.</p>
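<p>The redundancy check can be sketched with one hash set of left hand sides per right hand side SNP. The class and method names are hypothetical, and the sketch assumes, as guaranteed by the enumeration order described above, that all rules over subsets of <it>S</it> have already been added before <it>S</it> is checked.</p>

```python
from itertools import combinations


class RuleIndex:
    """Index of tagging rules, grouped by right hand side SNP."""

    def __init__(self):
        # right hand side SNP -> set of left hand sides (as frozensets)
        self.by_rhs = {}

    def add(self, lhs, rhs):
        """Record the tagging rule lhs -> rhs."""
        self.by_rhs.setdefault(rhs, set()).add(frozenset(lhs))

    def is_redundant(self, lhs, rhs):
        """True if some proper, non-empty subset of lhs already tags rhs."""
        known = self.by_rhs.get(rhs)
        if not known:
            return False
        items = sorted(lhs)
        # Proper subsets have between 1 and len(lhs) - 1 members.
        for size in range(1, len(items)):
            for subset in combinations(items, size):
                if frozenset(subset) in known:
                    return True
        return False
```

<p>With left hand sides limited to three SNPs, at most six subsets are looked up per candidate rule, so the check is cheap compared with computing the <it>r</it><sup>2</sup> statistic.</p>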
</sec>
<sec>
<st>
<p>Skipping rules</p>
</st>
<p>Even though merging equivalent SNPs and removing redundant tagging rules can reduce the number of tagging rules significantly, it is still possible that a large number of tagging rules are generated in the first step, which incurs high memory consumption in the second step. FastTagger uses a heuristic to further reduce the number of tagging rules generated: if a SNP <it>SNP</it>
<sub>
<it>x </it>
</sub>occurs at the right hand side of tagging rules a sufficient number of times, then <it>SNP</it>
<sub>
<it>x </it>
</sub>will not be considered as a right hand side candidate in future rule generation. The rationale behind this heuristic is that if a SNP can be tagged by many other SNPs, then during the tag SNP selection process, the SNP has a high probability of being covered by the selected tag SNPs.</p>
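<p>The skipping heuristic amounts to a per-SNP counter, as in the sketch below. The threshold name <it>max_cover</it> and the class are hypothetical; the paper does not name the parameter or state its value.</p>

```python
from collections import Counter


class SkipTracker:
    """Tracks how often each SNP has appeared as a right hand side."""

    def __init__(self, max_cover):
        # max_cover is an assumed name for the "enough times" threshold.
        self.max_cover = max_cover
        self.times_tagged = Counter()

    def record_rule(self, rhs):
        """Call once for every tagging rule generated with this rhs."""
        self.times_tagged[rhs] += 1

    def should_skip(self, rhs):
        """True once rhs has been tagged by at least max_cover rules."""
        return self.times_tagged[rhs] >= self.max_cover
```

<p>During rule generation, a candidate right hand side for which <it>should_skip</it> returns true would simply not be tested again.</p>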
</sec>
</sec>
<sec>
<st>
<p>Selecting tag SNPs using a greedy approach</p>
</st>
<p>Finding the smallest set of tag SNPs is computationally expensive. FastTagger uses a greedy approach similar to the one proposed in <abbrgrp>
<abbr bid="B9">9</abbr>
<abbr bid="B19">19</abbr>
</abbrgrp> to find a near optimal set of tag SNPs.</p>
<p>Let <it>C </it>be the set of candidate tag SNPs, <it>T </it>be the set of tag SNPs selected, and <it>V </it>be the set of SNPs not yet covered. A SNP is covered if either it is a tag SNP or it can be tagged by some SNP set <it>S </it>such that <it>S </it>&#8838; <it>T</it>. Initially, <it>C </it>and <it>V </it>contain all the SNPs, and <it>T </it>is empty.</p>
<p>FastTagger first identifies those SNPs that do not appear at the right hand side of any tagging rule; these SNPs must be selected as tag SNPs. FastTagger puts them into <it>T </it>and removes them from <it>C </it>and from <it>V</it>. The remaining SNPs in <it>V </it>that can be tagged by some SNP set <it>S </it>such that <it>S </it>&#8838; <it>T </it>are then removed from <it>V </it>too.</p>
<p>Next, for each SNP <it>SNP</it>
<sub>
<it>i </it>
</sub>&#8712; <it>C</it>, FastTagger finds the set of SNPs in <it>V </it>that are covered by <it>SNP</it>
<sub>
<it>i</it>
</sub>. A SNP <it>SNP</it>
<sub>
<it>j </it>
</sub>in <it>V </it>is covered by <it>SNP</it>
<sub>
<it>i </it>
</sub>if <it>SNP</it>
<sub>
<it>j </it>
</sub>is not tagged by any subset of <it>T </it>and there exists a subset <it>S </it>of <it>T </it>such that <it>SNP</it>
<sub>
<it>j </it>
</sub>is tagged by <it>S </it>&#8746; {<it>SNP</it>
<sub>
<it>i</it>
</sub>}.</p>
<p>FastTagger then picks a SNP from <it>C </it>that covers the largest number of SNPs in <it>V </it>as a tag SNP. This newly picked tag SNP is put into <it>T </it>and removed from <it>C</it>. All the SNPs that are covered by it including itself are removed from <it>V</it>. This process is repeated until <it>V </it>is empty, that is, all the SNPs have been covered. In each iteration, in order to find the set of SNPs covered by every candidate tag SNP in <it>C</it>, FastTagger needs to keep the tagging rules in memory. However, the number of rules generated can be very large. It is possible that the total size of tagging rules is too large to fit into the main memory. To solve this problem, we can break the whole chromosome into several chunks such that the rules over every chunk can fit into the main memory. We then select tag SNPs within each chunk.</p>
<p>When selecting tag SNPs within each chunk, only those tagging rules whose SNPs all fall into this chunk are used. To also utilize the rules across chunks, we allow two adjacent chunks to overlap. The length of the overlap is determined by the <it>max_dist </it>threshold: the SNPs in one chunk that are within <it>max_dist </it>bases of the first SNP of the next chunk are also included in the next chunk, since they can tag or be tagged by SNPs in the next chunk. FastTagger finds tag SNPs chunk by chunk from left to right. The tag SNPs selected in the current chunk that also belong to the next chunk are passed on to the next chunk as tag SNPs. Note that if the distance between two adjacent SNPs is larger than <it>max_dist</it>, then these two SNPs are used as a breakpoint even if there is enough memory. The reason is that two such SNPs cannot tag each other or each other's neighbors.</p>
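<p>The chunking scheme can be sketched as follows. Since the paper sizes chunks by the memory needed for the rules, the sketch substitutes a simple cap on the number of new SNPs per chunk (<it>max_snps</it>, a hypothetical parameter) for that memory bound; the function name is likewise an assumption.</p>

```python
def split_into_chunks(positions, max_snps, max_dist):
    """Split sorted SNP positions into overlapping chunks.

    Returns a list of (start, end) index ranges with end exclusive.
    Each chunk adds at most max_snps new SNPs (max_snps must be at least 1),
    never spans a gap larger than max_dist, and starts with the SNPs of the
    previous chunk that lie within max_dist of its first new SNP.
    """
    n = len(positions)
    chunks = []
    start = 0       # first index of the chunk, including the overlap
    core_start = 0  # first SNP that is new in this chunk
    while n > core_start:
        end = core_start
        while n > end and max_snps > end - core_start:
            # A gap larger than max_dist is always a breakpoint.
            if end > core_start and positions[end] - positions[end - 1] > max_dist:
                break
            end += 1
        chunks.append((start, end))
        core_start = end
        if n > end:
            # Extend the next chunk backwards over the overlap region.
            start = end
            while (start > chunks[-1][0]
                   and max_dist >= positions[end] - positions[start - 1]):
                start -= 1
    return chunks
```

<p>When a breakpoint is caused by a gap larger than <it>max_dist</it>, the overlap is empty, which matches the observation that SNPs on either side of such a gap cannot appear in the same rule.</p>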
<p>Using the above method, FastTagger can work on chromosomes containing more than 100 k SNPs with as little as 50 MB of memory, whereas existing algorithms consume more than 1 GB of memory even on chromosomes containing around 30 k SNPs.</p>
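<p>The greedy selection over <it>C</it>, <it>T </it>and <it>V </it>described in this section can be sketched for a single chunk as follows. This is a simplified illustration, not the FastTagger implementation: it rescans all rules for every candidate, breaks ties by SNP identifier, and uses hypothetical names throughout.</p>

```python
def greedy_tag_selection(snps, rules):
    """Greedy tag SNP selection from tagging rules.

    snps:  iterable of sortable SNP identifiers.
    rules: iterable of (lhs, rhs) pairs, meaning the SNP set lhs tags rhs.
    Returns the set T of selected tag SNPs.
    """
    snps = list(snps)
    rules = [(frozenset(lhs), rhs) for lhs, rhs in rules]
    taggable = {rhs for _, rhs in rules}

    # SNPs that never appear on a right hand side must be tag SNPs.
    tags = {s for s in snps if s not in taggable}      # T
    candidates = set(snps) - tags                      # C
    uncovered = set(snps) - tags                       # V

    def tagged_by(tag_set):
        """Uncovered SNPs tagged by a rule whose lhs lies within tag_set."""
        return {rhs for lhs, rhs in rules
                if rhs in uncovered and tag_set >= lhs}

    uncovered -= tagged_by(tags)
    while uncovered:
        best, best_gain = None, set()
        for cand in sorted(candidates):
            gain = tagged_by(tags | {cand})
            if cand in uncovered:
                gain = gain | {cand}  # a tag SNP covers itself
            if best is None or len(gain) > len(best_gain):
                best, best_gain = cand, gain
        tags.add(best)
        candidates.discard(best)
        uncovered -= best_gain
    return tags
```

<p>Every uncovered SNP remains a candidate that covers at least itself, so each iteration removes at least one SNP from <it>V </it>and the loop terminates.</p>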
</sec>
</sec>
<sec>
<st>
<p>Results and Discussion</p>
</st>
<p>In this section, we study the performance of FastTagger. We conducted the experiments on a PC with a 2.33 GHz Intel(R) Core(TM) Duo CPU and 3.25 GB of memory running Fedora 7. All code was compiled using g++. The source code and executables of the FastTagger algorithm can be found in Additional file <supplr sid="S1">1</supplr>. We obtained the datasets from HapMap release 21 <url>http://hapmap.ncbi.nlm.nih.gov/downloads/phasing/2006-07_phaseII/phased/</url> and project ENCODE <url>http://hapmap.ncbi.nlm.nih.gov/downloads/phasing/2005-03_phaseI/ENCODE/</url>. There are 4 populations and 10 regions in the ENCODE project. Here, we report the overall results on the ten regions for each population. The results on individual regions can be found in Additional file <supplr sid="S2">2</supplr>. From HapMap release 21, we selected 6 chromosomes: chr1, chr2, chr3, chr19, chr21 and chr22, and used the Han Chinese plus Japanese population. Table <tblr tid="T1">1</tblr> shows the number of SNPs with <it>MAF </it>&#8805; 5% in the datasets. In all the experiments, we set <it>max_dist </it>to 100 k bases, and select only those SNPs with <it>MAF </it>&#8805; 5%.</p>
<suppl id="S1">
<title>
<p>Additional file 1</p>
</title>
<text>
<p>File "FastTagger.zip" contains the source code and executables of the FastTagger program, both for Linux and Windows. Please read file "FastTagger.readme" on how to use the program.</p>
</text>
<file name="1471-2105-11-66-S1.ZIP">
   <p>Click here for file</p>
</file>
</suppl>
<suppl id="S2">
<title>
<p>Additional file 2</p>
</title>
<text>
<p>File "FastTagger-sup.xls" contains additional experiment results, and it is a Microsoft Excel file.</p>
</text>
<file name="1471-2105-11-66-S2.XLS">
   <p>Click here for file</p>
</file>
</suppl>
<tbl id="T1"><title><p>Table 1</p></title><caption><p>Datasets. </p></caption><tblbdy cols="6">
      <r>
         <c ca="center">
            <p>
               <b>datasets</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>#SNPs</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>#Rep SNPs</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>datasets</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>#SNPs</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>#Rep SNPs</b>
            </p>
         </c>
      </r>
      <r>
         <c cspan="6">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>ENCODE CEU</p>
         </c>
         <c ca="center">
            <p>7,221</p>
         </c>
         <c ca="center">
            <p>2,484</p>
         </c>
         <c ca="center">
            <p>chr2</p>
         </c>
         <c ca="center">
            <p>169,905</p>
         </c>
         <c ca="center">
            <p>85,807</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>ENCODE HCB</p>
         </c>
         <c ca="center">
            <p>6,430</p>
         </c>
         <c ca="center">
            <p>2,286</p>
         </c>
         <c ca="center">
            <p>chr3</p>
         </c>
         <c ca="center">
            <p>135,058</p>
         </c>
         <c ca="center">
            <p>71,244</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>ENCODE JPT</p>
         </c>
         <c ca="center">
            <p>6,216</p>
         </c>
         <c ca="center">
            <p>2,196</p>
         </c>
         <c ca="center">
            <p>chr19</p>
         </c>
         <c ca="center">
            <p>28,931</p>
         </c>
         <c ca="center">
            <p>17,807</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>ENCODE YRI</p>
         </c>
         <c ca="center">
            <p>7,963</p>
         </c>
         <c ca="center">
            <p>4,408</p>
         </c>
         <c ca="center">
            <p>chr21</p>
         </c>
         <c ca="center">
            <p>28,914</p>
         </c>
         <c ca="center">
            <p>15,644</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>chr1</p>
         </c>
         <c ca="center">
            <p>149,716</p>
         </c>
         <c ca="center">
            <p>78,893</p>
         </c>
         <c ca="center">
            <p>chr22</p>
         </c>
         <c ca="center">
            <p>26,595</p>
         </c>
         <c ca="center">
            <p>15,553</p>
         </c>
      </r>
   </tblbdy><tblfn>
      <p>The "#Rep SNPs" column is the number of representative SNPs with a merging window size of 100 k.</p>
   </tblfn></tbl>
<sec>
<st>
<p>Comparison with other algorithms</p>
</st>
<p>The first experiment is to compare FastTagger with LRTag <abbrgrp>
<abbr bid="B13">13</abbr>
</abbrgrp>, MMTagger <abbrgrp>
<abbr bid="B21">21</abbr>
</abbrgrp> and MultiTag <abbrgrp>
<abbr bid="B19">19</abbr>
</abbrgrp>. LRTag uses only pair-wise LD to find tag SNPs, and it has been shown to outperform LD-select and FESTA. Hence we choose LRTag as a representative of the pairwise algorithms. MMTagger and MultiTag both use multi-marker LD to find tag SNPs. We obtained the programs from their respective authors. FastTagger used all the techniques described previously except the skipping rules technique. LRTag takes pre-computed pairwise <it>r</it>
<sup>2 </sup>statistics as input, so the running time of LRTag includes only tag SNP selection time. We report the results at <it>min_r</it>2 = 0.95 here; results at <it>min_r</it>2 = 0.9 and <it>min_r</it>2 = 0.8 can be found in the supplementary materials. For all four algorithms, the selected tag SNPs cover the whole region of interest.</p>
<p>We first compare FastTagger with LRTag and MultiTag when using pairwise LD to find tag SNPs. Table <tblr tid="T2">2</tblr> shows the running time and the number of tag SNPs selected by the three algorithms. The running time is measured in minutes. FastTagger is several times faster than LRTag even though LRTag only needs to pick tag SNPs from pre-computed pairwise <it>r</it>
<sup>2 </sup>statistics while FastTagger needs to compute pairwise <it>r</it>
<sup>2 </sup>statistics in addition to selecting tag SNPs. Both algorithms are orders of magnitude faster than MultiTag. Among the three algorithms, LRTag produces the smallest number of tag SNPs, but the difference is very small. Overall, FastTagger generates 0.31% more tag SNPs than LRTag when <it>min_r</it>2 = 0.95, and MultiTag generates 1.77% more tag SNPs than FastTagger. LRTag uses a Lagrangian relaxation algorithm to select tag SNPs instead of the greedy approach used in the other algorithms, which is why it generates fewer tag SNPs.</p>
<tbl id="T2"><title><p>Table 2</p></title><caption><p>Comparison of running time and number of tag SNPs selected when pairwise LD is used. </p></caption><tblbdy cols="8">
      <r>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>
               <b><it>min_r</it>2</b>
            </p>
         </c>
         <c cspan="3" ca="center">
            <p>
               <b>Running time (minutes)</b>
            </p>
         </c>
         <c cspan="3" ca="center">
            <p>
               <b>#tag SNPs</b>
            </p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>
               <b>FastTagger</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>LRTag</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>MultiTag</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>FastTagger</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>LRTag</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>MultiTag</b>
            </p>
         </c>
      </r>
      <r>
         <c cspan="8">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>ENCODE CEU</p>
         </c>
         <c ca="center">
            <p>0.95</p>
         </c>
         <c ca="center">
            <p>0.003</p>
         </c>
         <c ca="center">
            <p>0.016</p>
         </c>
         <c ca="center">
            <p>10.4</p>
         </c>
         <c ca="center">
            <p>2144</p>
         </c>
         <c ca="center">
            <p>2127</p>
         </c>
         <c ca="center">
            <p>2136</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>ENCODE HCB</p>
         </c>
         <c ca="center">
            <p>0.95</p>
         </c>
         <c ca="center">
            <p>0.003</p>
         </c>
         <c ca="center">
            <p>0.014</p>
         </c>
         <c ca="center">
            <p>7.5</p>
         </c>
         <c ca="center">
            <p>2065</p>
         </c>
         <c ca="center">
            <p>2055</p>
         </c>
         <c ca="center">
            <p>2061</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>ENCODE JPT</p>
         </c>
         <c ca="center">
            <p>0.95</p>
         </c>
         <c ca="center">
            <p>0.003</p>
         </c>
         <c ca="center">
            <p>0.013</p>
         </c>
         <c ca="center">
            <p>6.6</p>
         </c>
         <c ca="center">
            <p>1996</p>
         </c>
         <c ca="center">
            <p>1990</p>
         </c>
         <c ca="center">
            <p>1996</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>ENCODE YRI</p>
         </c>
         <c ca="center">
            <p>0.95</p>
         </c>
         <c ca="center">
            <p>0.004</p>
         </c>
         <c ca="center">
            <p>0.008</p>
         </c>
         <c ca="center">
            <p>41.6</p>
         </c>
         <c ca="center">
            <p>4115</p>
         </c>
         <c ca="center">
            <p>4107</p>
         </c>
         <c ca="center">
            <p>4109</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>chr1</p>
         </c>
         <c ca="center">
            <p>0.95</p>
         </c>
         <c ca="center">
            <p>0.076</p>
         </c>
         <c ca="center">
            <p>0.242</p>
         </c>
         <c ca="center">
            <p>26.2</p>
         </c>
         <c ca="center">
            <p>62190</p>
         </c>
         <c ca="center">
            <p>61988</p>
         </c>
         <c ca="center">
            <p>63391</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>chr2</p>
         </c>
         <c ca="center">
            <p>0.95</p>
         </c>
         <c ca="center">
            <p>0.088</p>
         </c>
         <c ca="center">
            <p>0.293</p>
         </c>
         <c ca="center">
            <p>30.2</p>
         </c>
         <c ca="center">
            <p>66026</p>
         </c>
         <c ca="center">
            <p>65822</p>
         </c>
         <c ca="center">
            <p>67236</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>chr3</p>
         </c>
         <c ca="center">
            <p>0.95</p>
         </c>
         <c ca="center">
            <p>0.070</p>
         </c>
         <c ca="center">
            <p>0.222</p>
         </c>
         <c ca="center">
            <p>25.1</p>
         </c>
         <c ca="center">
            <p>55895</p>
         </c>
         <c ca="center">
            <p>55713</p>
         </c>
         <c ca="center">
            <p>56972</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>chr19</p>
         </c>
         <c ca="center">
            <p>0.95</p>
         </c>
         <c ca="center">
            <p>0.015</p>
         </c>
         <c ca="center">
            <p>0.032</p>
         </c>
         <c ca="center">
            <p>3.6</p>
         </c>
         <c ca="center">
            <p>14777</p>
         </c>
         <c ca="center">
            <p>14744</p>
         </c>
         <c ca="center">
            <p>15014</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>chr21</p>
         </c>
         <c ca="center">
            <p>0.95</p>
         </c>
         <c ca="center">
            <p>0.015</p>
         </c>
         <c ca="center">
            <p>0.040</p>
         </c>
         <c ca="center">
            <p>6.0</p>
         </c>
         <c ca="center">
            <p>12455</p>
         </c>
         <c ca="center">
            <p>12435</p>
         </c>
         <c ca="center">
            <p>12658</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>chr22</p>
         </c>
         <c ca="center">
            <p>0.95</p>
         </c>
         <c ca="center">
            <p>0.014</p>
         </c>
         <c ca="center">
            <p>0.033</p>
         </c>
         <c ca="center">
            <p>7.9</p>
         </c>
         <c ca="center">
            <p>12690</p>
         </c>
         <c ca="center">
            <p>12652</p>
         </c>
         <c ca="center">
            <p>12932</p>
         </c>
      </r>
   </tblbdy><tblfn>
      <p>The running time of LRTag includes only tag SNP selection time, while the running time of FastTagger and MultiTag includes both rule generation time and tag SNP selection time. MMTagger is excluded from this table because the MMTagger program provided by its authors cannot use pairwise LD to find tag SNPs.</p>
   </tblfn></tbl>
<p>Table <tblr tid="T3">3</tblr> shows the running time and the number of tag SNPs selected by FastTagger, MMTagger and MultiTag when multi-marker LD is used. We implemented both models in FastTagger, and denote them as Fast-COOC (the co-occurrence model) and Fast-1vsR (the one-vs-the-rest model). MultiTag took an extremely long time on the 6 chromosomes when <it>max_size </it>= 3, so its results for that setting are not reported. When <it>max_size </it>= 2, we divided chr1, chr2 and chr3 into 20 chunks, and chr19, chr21 and chr22 into 5 chunks, ran MultiTag on each chunk, and combined the results. MMTagger terminated abnormally on chr1, chr2 and chr3 when <it>max_size </it>= 3 because too many rules were generated. To work around this problem, we divided the three chromosomes into 10 chunks, ran MMTagger on each chunk, and combined the results.</p>
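<p>The chunking workaround described above can be sketched as follows. This is only an illustration under stated assumptions: the text does not say how the chunks were formed, so the sketch assumes contiguous, near-equal chunks of a position-ordered SNP list, and toy_selector is a hypothetical placeholder for an external tagger such as MultiTag or MMTagger.</p>

```python
def split_into_chunks(snps, n_chunks):
    """Split a position-ordered SNP list into n_chunks contiguous pieces.

    Near-equal chunk sizes are an assumption; the text does not state
    how chunk boundaries were chosen.
    """
    base, extra = divmod(len(snps), n_chunks)
    # The first `extra` chunks receive one additional SNP each.
    sizes = [base + 1] * extra + [base] * (n_chunks - extra)
    chunks, start = [], 0
    for size in sizes:
        chunks.append(snps[start:start + size])
        start += size
    return chunks


def select_tags_chunked(snps, n_chunks, select_tags):
    """Run select_tags on each chunk independently and pool the tag SNPs."""
    tags = []
    for chunk in split_into_chunks(snps, n_chunks):
        tags.extend(select_tags(chunk))
    return tags


def toy_selector(chunk):
    # Hypothetical stand-in for a real tagger: keeps every other SNP.
    return chunk[::2]


snps = [f"rs{i}" for i in range(1, 11)]
print(select_tags_chunked(snps, 3, toy_selector))
```

<p>Because each chunk is processed in isolation, LD between SNPs that fall in different chunks cannot be exploited, so the combined tag set may be somewhat larger than one selected on the whole chromosome.</p>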
<tbl id="T3"><title><p>Table 3</p></title><caption><p>Comparison of running time and number of tag SNPs selected when multi-marker LD is used. </p></caption><tblbdy cols="11">
      <r>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>
               <b>
                  <it>max_size</it>
               </b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b><it>min_r</it>2</b>
            </p>
         </c>
         <c cspan="4" ca="center">
            <p>
               <b>Running time (minutes)</b>
            </p>
         </c>
         <c cspan="4" ca="center">
            <p>
               <b>#tag SNPs</b>
            </p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c>
            <p/>
         </c>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>
               <b>Fast-COOC</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>MMTagger</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>Fast-1vsR</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>MultiTag</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>Fast-COOC</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>MMTagger</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>Fast-1vsR</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>MultiTag</b>
            </p>
         </c>
      </r>
      <r>
         <c cspan="11">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>ENCODE CEU</p>
         </c>
         <c ca="center">
            <p>2</p>
         </c>
         <c ca="center">
            <p>0.95</p>
         </c>
         <c ca="center">
            <p>0.038</p>
         </c>
         <c ca="center">
            <p>0.041</p>
         </c>
         <c ca="center">
            <p>0.048</p>
         </c>
         <c ca="center">
            <p>&#8805;10 hours</p>
         </c>
         <c ca="center">
            <p>1282</p>
         </c>
         <c ca="center">
            <p>1282</p>
         </c>
         <c ca="center">
            <p>1291</p>
         </c>
         <c ca="center">
            <p>1371</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>ENCODE HCB</p>
         </c>
         <c ca="center">
            <p>2</p>
         </c>
         <c ca="center">
            <p>0.95</p>
         </c>
         <c ca="center">
            <p>0.032</p>
         </c>
         <c ca="center">
            <p>0.032</p>
         </c>
         <c ca="center">
            <p>0.042</p>
         </c>
         <c ca="center">
            <p>&#8805;10 hours</p>
         </c>
         <c ca="center">
            <p>1305</p>
         </c>
         <c ca="center">
            <p>1328</p>
         </c>
         <c ca="center">
            <p>1308</p>
         </c>
         <c ca="center">
            <p>1424</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>ENCODE JPT</p>
         </c>
         <c ca="center">
            <p>2</p>
         </c>
         <c ca="center">
            <p>0.95</p>
         </c>
         <c ca="center">
            <p>0.029</p>
         </c>
         <c ca="center">
            <p>0.028</p>
         </c>
         <c ca="center">
            <p>0.038</p>
         </c>
         <c ca="center">
            <p>&#8805;10 hours</p>
         </c>
         <c ca="center">
            <p>1234</p>
         </c>
         <c ca="center">
            <p>1258</p>
         </c>
         <c ca="center">
            <p>1240</p>
         </c>
         <c ca="center">
            <p>1349</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>ENCODE YRI</p>
         </c>
         <c ca="center">
            <p>2</p>
         </c>
         <c ca="center">
            <p>0.95</p>
         </c>
         <c ca="center">
            <p>0.181</p>
         </c>
         <c ca="center">
            <p>0.188</p>
         </c>
         <c ca="center">
            <p>0.245</p>
         </c>
         <c ca="center">
            <p>&#8805;60 hours</p>
         </c>
         <c ca="center">
            <p>2575</p>
         </c>
         <c ca="center">
            <p>2618</p>
         </c>
         <c ca="center">
            <p>2579</p>
         </c>
         <c ca="center">
            <p>2770</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>chr1</p>
         </c>
         <c ca="center">
            <p>2</p>
         </c>
         <c ca="center">
            <p>0.95</p>
         </c>
         <c ca="center">
            <p>1.13</p>
         </c>
         <c ca="center">
            <p>5.84</p>
         </c>
         <c ca="center">
            <p>1.40</p>
         </c>
         <c ca="center">
            <p>&#8805;7 days</p>
         </c>
         <c ca="center">
            <p>43202</p>
         </c>
         <c ca="center">
            <p>43483</p>
         </c>
         <c ca="center">
            <p>43306</p>
         </c>
         <c ca="center">
            <p>43462</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>chr2</p>
         </c>
         <c ca="center">
            <p>2</p>
         </c>
         <c ca="center">
            <p>0.95</p>
         </c>
         <c ca="center">
            <p>1.32</p>
         </c>
         <c ca="center">
            <p>7.21</p>
         </c>
         <c ca="center">
            <p>1.63</p>
         </c>
         <c ca="center">
            <p>&#8805;7 days</p>
         </c>
         <c ca="center">
            <p>44135</p>
         </c>
         <c ca="center">
            <p>44556</p>
         </c>
         <c ca="center">
            <p>44225</p>
         </c>
         <c ca="center">
            <p>49289</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>chr3</p>
         </c>
         <c ca="center">
            <p>2</p>
         </c>
         <c ca="center">
            <p>0.95</p>
         </c>
         <c ca="center">
            <p>1.14</p>
         </c>
         <c ca="center">
            <p>5.11</p>
         </c>
         <c ca="center">
            <p>1.41</p>
         </c>
         <c ca="center">
            <p>&#8805;7 days</p>
         </c>
         <c ca="center">
            <p>37881</p>
         </c>
         <c ca="center">
            <p>38206</p>
         </c>
         <c ca="center">
            <p>37952</p>
         </c>
         <c ca="center">
            <p>39300</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>chr19</p>
         </c>
         <c ca="center">
            <p>2</p>
         </c>
         <c ca="center">
            <p>0.95</p>
         </c>
         <c ca="center">
            <p>0.176</p>
         </c>
         <c ca="center">
            <p>0.343</p>
         </c>
         <c ca="center">
            <p>0.218</p>
         </c>
         <c ca="center">
            <p>&#8805;30 hours</p>
         </c>
         <c ca="center">
            <p>11151</p>
         </c>
         <c ca="center">
            <p>11192</p>
         </c>
         <c ca="center">
            <p>11160</p>
         </c>
         <c ca="center">
            <p>11747</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>chr21</p>
         </c>
         <c ca="center">
            <p>2</p>
         </c>
         <c ca="center">
            <p>0.95</p>
         </c>
         <c ca="center">
            <p>0.287</p>
         </c>
         <c ca="center">
            <p>0.473</p>
         </c>
         <c ca="center">
            <p>0.359</p>
         </c>
         <c ca="center">
            <p>&#8805;60 hours</p>
         </c>
         <c ca="center">
            <p>8543</p>
         </c>
         <c ca="center">
            <p>8627</p>
         </c>
         <c ca="center">
            <p>8564</p>
         </c>
         <c ca="center">
            <p>9103</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>chr22</p>
         </c>
         <c ca="center">
            <p>2</p>
         </c>
         <c ca="center">
            <p>0.95</p>
         </c>
         <c ca="center">
            <p>0.370</p>
         </c>
         <c ca="center">
            <p>0.567</p>
         </c>
         <c ca="center">
            <p>0.468</p>
         </c>
         <c ca="center">
            <p>&#8805;100 hours</p>
         </c>
         <c ca="center">
            <p>8970</p>
         </c>
         <c ca="center">
            <p>9025</p>
         </c>
         <c ca="center">
            <p>8993</p>
         </c>
         <c ca="center">
            <p>9533</p>
         </c>
      </r>
      <r>
         <c cspan="11">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>ENCODE CEU</p>
         </c>
         <c ca="center">
            <p>3</p>
         </c>
         <c ca="center">
            <p>0.95</p>
         </c>
         <c ca="center">
            <p>1.28</p>
         </c>
         <c ca="center">
            <p>3.69</p>
         </c>
         <c ca="center">
            <p>1.85</p>
         </c>
         <c ca="center">
            <p>&#8805;50 hours</p>
         </c>
         <c ca="center">
            <p>972</p>
         </c>
         <c ca="center">
            <p>1017</p>
         </c>
         <c ca="center">
            <p>1151</p>
         </c>
         <c ca="center">
            <p>1244</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>ENCODE HCB</p>
         </c>
         <c ca="center">
            <p>3</p>
         </c>
         <c ca="center">
            <p>0.95</p>
         </c>
         <c ca="center">
            <p>1.26</p>
         </c>
         <c ca="center">
            <p>3.40</p>
         </c>
         <c ca="center">
            <p>1.93</p>
         </c>
         <c ca="center">
            <p>&#8805;80 hours</p>
         </c>
         <c ca="center">
            <p>1003</p>
         </c>
         <c ca="center">
            <p>1034</p>
         </c>
         <c ca="center">
            <p>1170</p>
         </c>
         <c ca="center">
            <p>1170</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>ENCODE JPT</p>
         </c>
         <c ca="center">
            <p>3</p>
         </c>
         <c ca="center">
            <p>0.95</p>
         </c>
         <c ca="center">
            <p>1.06</p>
         </c>
         <c ca="center">
            <p>2.74</p>
         </c>
         <c ca="center">
            <p>1.60</p>
         </c>
         <c ca="center">
            <p>&#8805;50 hours</p>
         </c>
         <c ca="center">
            <p>958</p>
         </c>
         <c ca="center">
            <p>1002</p>
         </c>
         <c ca="center">
            <p>1129</p>
         </c>
         <c ca="center">
            <p>1244</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>ENCODE YRI</p>
         </c>
         <c ca="center">
            <p>3</p>
         </c>
         <c ca="center">
            <p>0.95</p>
         </c>
         <c ca="center">
            <p>11.6</p>
         </c>
         <c ca="center">
            <p>36.7</p>
         </c>
         <c ca="center">
            <p>17.4</p>
         </c>
         <c ca="center">
            <p>&#8805;14 days</p>
         </c>
         <c ca="center">
            <p>1848</p>
         </c>
         <c ca="center">
            <p>1927</p>
         </c>
         <c ca="center">
            <p>2165</p>
         </c>
         <c ca="center">
            <p>2516</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>chr1</p>
         </c>
         <c ca="center">
            <p>3</p>
         </c>
         <c ca="center">
            <p>0.95</p>
         </c>
         <c ca="center">
            <p>34.9</p>
         </c>
         <c ca="center">
            <p>137.3</p>
         </c>
         <c ca="center">
            <p>49.6</p>
         </c>
         <c ca="center">
            <p>-</p>
         </c>
         <c ca="center">
            <p>35556</p>
         </c>
         <c ca="center">
            <p>38185</p>
         </c>
         <c ca="center">
            <p>40534</p>
         </c>
         <c ca="center">
            <p>-</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>chr2</p>
         </c>
         <c ca="center">
            <p>3</p>
         </c>
         <c ca="center">
            <p>0.95</p>
         </c>
         <c ca="center">
            <p>42.9</p>
         </c>
         <c ca="center">
            <p>166.9</p>
         </c>
         <c ca="center">
            <p>60.8</p>
         </c>
         <c ca="center">
            <p>-</p>
         </c>
         <c ca="center">
            <p>35502</p>
         </c>
         <c ca="center">
            <p>38372</p>
         </c>
         <c ca="center">
            <p>41129</p>
         </c>
         <c ca="center">
            <p>-</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>chr3</p>
         </c>
         <c ca="center">
            <p>3</p>
         </c>
         <c ca="center">
            <p>0.95</p>
         </c>
         <c ca="center">
            <p>39.3</p>
         </c>
         <c ca="center">
            <p>154.6</p>
         </c>
         <c ca="center">
            <p>55.5</p>
         </c>
         <c ca="center">
            <p>-</p>
         </c>
         <c ca="center">
            <p>30695</p>
         </c>
         <c ca="center">
            <p>33041</p>
         </c>
         <c ca="center">
            <p>35305</p>
         </c>
         <c ca="center">
            <p>-</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>chr19</p>
         </c>
         <c ca="center">
            <p>3</p>
         </c>
         <c ca="center">
            <p>0.95</p>
         </c>
         <c ca="center">
            <p>4.34</p>
         </c>
         <c ca="center">
            <p>16.6</p>
         </c>
         <c ca="center">
            <p>6.25</p>
         </c>
         <c ca="center">
            <p>-</p>
         </c>
         <c ca="center">
            <p>9444</p>
         </c>
         <c ca="center">
            <p>10032</p>
         </c>
         <c ca="center">
            <p>10546</p>
         </c>
         <c ca="center">
            <p>-</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>chr21</p>
         </c>
         <c ca="center">
            <p>3</p>
         </c>
         <c ca="center">
            <p>0.95</p>
         </c>
         <c ca="center">
            <p>9.91</p>
         </c>
         <c ca="center">
            <p>37.7</p>
         </c>
         <c ca="center">
            <p>14.4</p>
         </c>
         <c ca="center">
            <p>-</p>
         </c>
         <c ca="center">
            <p>6929</p>
         </c>
         <c ca="center">
            <p>7404</p>
         </c>
         <c ca="center">
            <p>7935</p>
         </c>
         <c ca="center">
            <p>-</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>chr22</p>
         </c>
         <c ca="center">
            <p>3</p>
         </c>
         <c ca="center">
            <p>0.95</p>
         </c>
         <c ca="center">
            <p>16.5</p>
         </c>
         <c ca="center">
            <p>65.3</p>
         </c>
         <c ca="center">
            <p>24.4</p>
         </c>
         <c ca="center">
            <p>-</p>
         </c>
         <c ca="center">
            <p>7327</p>
         </c>
         <c ca="center">
            <p>7788</p>
         </c>
         <c ca="center">
            <p>8392</p>
         </c>
         <c ca="center">
            <p>-</p>
         </c>
      </r>
   </tblbdy><tblfn>
      <p>Fast-COOC represents the FastTagger algorithm using the co-occurrence model, and Fast-1vsR represents the FastTagger algorithm using the one-vs-the-rest model. <it>max_size </it>is the maximum number of SNPs on the left hand side of a tagging rule. For the MMTagger algorithm, we divided chr1, chr2 and chr3 into 10 chunks when <it>max_size </it>= 3, ran MMTagger on each chunk, and then combined the results. For the MultiTag algorithm, we divided chr1, chr2 and chr3 into 20 chunks, and chr19, chr21 and chr22 into 5 chunks, when <it>max_size </it>= 2. When <it>max_size </it>= 3, MultiTag took too long to finish on the 6 chromosomes, so no MultiTag results are reported for them.</p>
   </tblfn></tbl>
<p>Table <tblr tid="T3">3</tblr> shows that, under the same <it>min_r</it>2 threshold, the multi-marker model reduces the number of tag SNPs significantly compared with the pairwise model (Table <tblr tid="T2">2</tblr>). The number of tag SNPs is reduced by more than 30% when <it>max_size </it>= 2, and by more than 40% when <it>max_size </it>= 3. However, calculating multi-marker <it>r</it>
<sup>2 </sup>statistics is much more expensive than computing pairwise <it>r</it>
<sup>2</sup>. Compared with its pairwise running time, FastTagger is more than 10 times slower when <it>max_size </it>= 2, and hundreds of times slower when <it>max_size </it>= 3.</p>
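<p>As a rough check of these figures, the sketch below recomputes the reduction in tag SNPs and the slowdown from the values in Tables 2 and 3. It assumes that the reduction is measured on counts pooled over all ten data sets and that the multi-marker counts are those of Fast-COOC; individual data sets can fall below the pooled figure. The slowdown is shown for chr1 only.</p>

```python
# FastTagger tag SNP counts at min_r2 = 0.95, in table row order
# (ENCODE CEU, HCB, JPT, YRI, then chr1, chr2, chr3, chr19, chr21, chr22).
pairwise = [2144, 2065, 1996, 4115, 62190, 66026, 55895, 14777, 12455, 12690]  # Table 2
cooc_size2 = [1282, 1305, 1234, 2575, 43202, 44135, 37881, 11151, 8543, 8970]  # Table 3
cooc_size3 = [972, 1003, 958, 1848, 35556, 35502, 30695, 9444, 6929, 7327]     # Table 3


def percent_reduction(counts, baseline):
    """Pooled percentage reduction of counts relative to baseline (an assumed reading)."""
    return 100.0 * (sum(baseline) - sum(counts)) / sum(baseline)


reduction_size2 = percent_reduction(cooc_size2, pairwise)
reduction_size3 = percent_reduction(cooc_size3, pairwise)

# chr1 running times in minutes: pairwise (Table 2) and Fast-COOC (Table 3).
chr1_minutes = {"pairwise": 0.076, "max_size=2": 1.13, "max_size=3": 34.9}
slowdown_size2 = chr1_minutes["max_size=2"] / chr1_minutes["pairwise"]
slowdown_size3 = chr1_minutes["max_size=3"] / chr1_minutes["pairwise"]

print(f"Pooled reduction, max_size = 2: {reduction_size2:.1f}%")
print(f"Pooled reduction, max_size = 3: {reduction_size3:.1f}%")
print(f"chr1 slowdown, max_size = 2: {slowdown_size2:.0f}x")
print(f"chr1 slowdown, max_size = 3: {slowdown_size3:.0f}x")
```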
<p>On the ENCODE regions, FastTagger and MMTagger take a similar amount of time when <it>max_size </it>= 2; when <it>max_size </it>= 3, FastTagger is 2-3 times faster than MMTagger. On the 6 chromosomes, FastTagger is 2-6 times faster than MMTagger. Both algorithms are orders of magnitude faster than MultiTag. On every data set, the number of tag SNPs selected by FastTagger under the co-occurrence model is smaller than or equal to that selected by MMTagger and MultiTag.</p>
<p>Table <tblr tid="T4">4</tblr> shows the maximum memory usage of FastTagger and MMTagger with <it>min_r</it>2 = 0.95 and <it>max_size </it>= 3. MMTagger consumes much more memory than FastTagger, which is why it cannot handle large chromosomes such as chr1, chr2 and chr3 when <it>max_size </it>= 3.</p>
<tbl id="T4"><title><p>Table 4</p></title><caption><p>Memory usage of FastTagger and MMTagger. </p></caption><tblbdy cols="6">
      <r>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>
               <b>FastTagger</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>MMTagger</b>
            </p>
         </c>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>
               <b>FastTagger</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>MMTagger</b>
            </p>
         </c>
      </r>
      <r>
         <c cspan="6">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>chr1</p>
         </c>
         <c ca="center">
            <p>94.41 MB</p>
         </c>
         <c ca="center">
            <p>-</p>
         </c>
         <c ca="center">
            <p>chr19</p>
         </c>
         <c ca="center">
            <p>30.29 MB</p>
         </c>
         <c ca="center">
            <p>657 MB</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>chr2</p>
         </c>
         <c ca="center">
            <p>287.50 MB</p>
         </c>
         <c ca="center">
            <p>-</p>
         </c>
         <c ca="center">
            <p>chr21</p>
         </c>
         <c ca="center">
            <p>74.99 MB</p>
         </c>
         <c ca="center">
            <p>1210 MB</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>chr3</p>
         </c>
         <c ca="center">
            <p>119.72 MB</p>
         </c>
         <c ca="center">
            <p>-</p>
         </c>
         <c ca="center">
            <p>chr22</p>
         </c>
         <c ca="center">
            <p>50.20 MB</p>
         </c>
         <c ca="center">
            <p>1216 MB</p>
         </c>
      </r>
   </tblbdy><tblfn>
      <p>The co-occurrence model is used in FastTagger. <it>min_r</it>2 = 0.95, <it>max_size </it>= 3.</p>
   </tblfn></tbl>
<p>Table <tblr tid="T3">3</tblr> also shows that the co-occurrence model generates a smaller set of tag SNPs than the one-vs-the-rest model. The reason is that more rules are generated under the co-occurrence model, as shown in Table <tblr tid="T5">5</tblr>. When <it>max_size </it>= 2, the two models generate similar numbers of rules and therefore similar numbers of tag SNPs. When <it>max_size </it>= 3, the co-occurrence model generates 3-4 times more rules than the one-vs-the-rest model, so it needs far fewer tag SNPs to tag all the other SNPs. The co-occurrence model also consumes much more memory when <it>max_size </it>= 3, as shown in the last two columns of Table <tblr tid="T5">5</tblr>.</p>
<tbl id="T5"><title><p>Table 5</p></title><caption><p>The number of tagging rules generated under the two models using the FastTagger algorithm (<it>min_r</it>2 = 0.9).</p></caption><tblbdy cols="6">
      <r>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>
               <b>
                  <it>max_size</it>
               </b>
            </p>
         </c>
         <c cspan="2" ca="center">
            <p>
               <b>#rules</b>
            </p>
         </c>
         <c cspan="2" ca="center">
            <p>
               <b>memory</b>
            </p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>
               <b>Fast-COOC</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>Fast-1vsR</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>Fast-COOC</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>Fast-1vsR</b>
            </p>
         </c>
      </r>
      <r>
         <c cspan="6">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>chr19</p>
         </c>
         <c ca="center">
            <p>2</p>
         </c>
         <c ca="center">
            <p>121,122</p>
         </c>
         <c ca="center">
            <p>120,627</p>
         </c>
         <c ca="center">
            <p>6.63 MB</p>
         </c>
         <c ca="center">
            <p>6.63 MB</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>chr21</p>
         </c>
         <c ca="center">
            <p>2</p>
         </c>
         <c ca="center">
            <p>169,864</p>
         </c>
         <c ca="center">
            <p>168,936</p>
         </c>
         <c ca="center">
            <p>11.43 MB</p>
         </c>
         <c ca="center">
            <p>11.43 MB</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>chr22</p>
         </c>
         <c ca="center">
            <p>2</p>
         </c>
         <c ca="center">
            <p>156,134</p>
         </c>
         <c ca="center">
            <p>155,223</p>
         </c>
         <c ca="center">
            <p>8.14 MB</p>
         </c>
         <c ca="center">
            <p>8.13 MB</p>
         </c>
      </r>
      <r>
         <c cspan="6">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>chr19</p>
         </c>
         <c ca="center">
            <p>3</p>
         </c>
         <c ca="center">
            <p>1,421,519</p>
         </c>
         <c ca="center">
            <p>377,773</p>
         </c>
         <c ca="center">
            <p>38.69 MB</p>
         </c>
         <c ca="center">
            <p>13.29 MB</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>chr21</p>
         </c>
         <c ca="center">
            <p>3</p>
         </c>
         <c ca="center">
            <p>2,713,338</p>
         </c>
         <c ca="center">
            <p>657,767</p>
         </c>
         <c ca="center">
            <p>101.11 MB</p>
         </c>
         <c ca="center">
            <p>29.92 MB</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>chr22</p>
         </c>
         <c ca="center">
            <p>3</p>
         </c>
         <c ca="center">
            <p>2,590,826</p>
         </c>
         <c ca="center">
            <p>573,738</p>
         </c>
         <c ca="center">
            <p>67.28 MB</p>
         </c>
         <c ca="center">
            <p>19.21 MB</p>
         </c>
      </r>
   </tblbdy></tbl>
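<p>As a quick arithmetic check on Table 5, the ratio between the rule counts of the two models at max_size = 3 can be recomputed directly from the table. The short Python sketch below does this; the dictionary and function names are illustrative only and are not part of FastTagger.</p>

```python
from typing import Dict, Tuple

# Rule counts copied from Table 5 for max_size = 3 (min_r2 = 0.9).
# Each value is (Fast-COOC, Fast-1vsR).
RULES_MAX_SIZE_3: Dict[str, Tuple[int, int]] = {
    "chr19": (1_421_519, 377_773),
    "chr21": (2_713_338, 657_767),
    "chr22": (2_590_826, 573_738),
}


def rule_ratios(table: Dict[str, Tuple[int, int]]) -> Dict[str, float]:
    """Return, per chromosome, co-occurrence rules divided by one-vs-the-rest rules."""
    return {chrom: cooc / one_vs_rest for chrom, (cooc, one_vs_rest) in table.items()}


if __name__ == "__main__":
    for chrom, ratio in rule_ratios(RULES_MAX_SIZE_3).items():
        print(f"{chrom}: {ratio:.2f}")
```

<p>The same helper can be applied to the memory columns of Table 5 by replacing the counts with the memory figures.</p>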
</sec>
<sec>
<st>
<p>The effectiveness of the techniques used in FastTagger</p>
</st>
<p>This experiment studies the effectiveness of the techniques used by FastTagger in reducing running time and memory consumption. We used the co-occurrence model in this experiment because it generates more rules and demands more memory than the one-vs-the-rest model. The baseline FastTagger algorithm in this experiment uses the same two techniques as in the previous experiment: merging equivalent SNPs and pruning redundant tagging rules. Table <tblr tid="T6">6</tblr> shows the running time and memory consumption of the baseline algorithm, together with the number of tag SNPs and tagging rules it generates, on chr19, chr21 and chr22 when <it>max_size </it>= 3 and <it>min_r</it>2 = 0.95.</p>
<tbl id="T6"><title><p>Table 6</p></title><caption><p>Baseline algorithm: merging equivalent SNPs and pruning redundant rules, no skipping rules. </p></caption><tblbdy cols="5">
      <r>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>
               <b>time</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>#tag SNPs</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>mem</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>#rules</b>
            </p>
         </c>
      </r>
      <r>
         <c cspan="5">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>chr19</p>
         </c>
         <c ca="center">
            <p>4.34</p>
         </c>
         <c ca="center">
            <p>9444</p>
         </c>
         <c ca="center">
            <p>30.29 MB</p>
         </c>
         <c ca="center">
            <p>
               <it>951,392</it>
            </p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>chr21</p>
         </c>
         <c ca="center">
            <p>9.91</p>
         </c>
         <c ca="center">
            <p>6929</p>
         </c>
         <c ca="center">
            <p>74.99 MB</p>
         </c>
         <c ca="center">
            <p>
               <it>1,747,900</it>
            </p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>chr22</p>
         </c>
         <c ca="center">
            <p>16.5</p>
         </c>
         <c ca="center">
            <p>7327</p>
         </c>
         <c ca="center">
            <p>50.20 MB</p>
         </c>
         <c ca="center">
            <p>
               <it>1,658,769</it>
            </p>
         </c>
      </r>
   </tblbdy><tblfn>
      <p>The co-occurrence model is used. <it>max_size </it>= 3, <it>min_r</it>2 = 0.95.</p>
   </tblfn></tbl>
<p>The "#Rep SNPs" column in Table <tblr tid="T1">1</tblr> shows the number of representative SNPs after merging equivalent SNPs using a window size of 100 k. Merging reduces the number of SNPs by around half. We also tried larger window sizes for merging equivalent SNPs, but they did not achieve much further reduction. The reduction in the number of SNPs greatly reduces the number of rules to be tested. Table <tblr tid="T7">7</tblr> shows the performance of FastTagger without merging equivalent SNPs. Without merging, FastTagger generates an excessive number of tagging rules, around 20 times as many as with merging, and thus takes much longer and consumes much more memory. There is also a slight increase in the number of tag SNPs selected.</p>
<tbl id="T7"><title><p>Table 7</p></title><caption><p>Baseline algorithm WITHOUT merging equivalent SNPs.</p></caption><tblbdy cols="5">
      <r>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>
               <b>time</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>#tag SNPs</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>mem</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>#rules</b>
            </p>
         </c>
      </r>
      <r>
         <c cspan="5">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>chr19</p>
         </c>
         <c ca="center">
            <p>31.4</p>
         </c>
         <c ca="center">
            <p>9476</p>
         </c>
         <c ca="center">
            <p>209.83 MB</p>
         </c>
         <c ca="center">
            <p>
               <it>17,798,798</it>
            </p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>chr21</p>
         </c>
         <c ca="center">
            <p>72.3</p>
         </c>
         <c ca="center">
            <p>6959</p>
         </c>
         <c ca="center">
            <p>555.42 MB</p>
         </c>
         <c ca="center">
            <p>
               <it>35,278,021</it>
            </p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>chr22</p>
         </c>
         <c ca="center">
            <p>90.5</p>
         </c>
         <c ca="center">
            <p>7342</p>
         </c>
         <c ca="center">
            <p>340.59 MB</p>
         </c>
         <c ca="center">
            <p>
               <it>30,954,495</it>
            </p>
         </c>
      </r>
   </tblbdy><tblfn>
      <p> The co-occurrence model is used. <it>max_size </it>= 3, <it>min_r</it>2 = 0.95.</p>
   </tblfn></tbl>
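<p>The idea of merging equivalent SNPs can be illustrated with a minimal Python sketch. It rests on assumptions that are not stated in this section: two SNPs are treated as equivalent when their 0/1 allele patterns over the haplotypes are identical or exactly complementary (which corresponds to a pairwise r^2 of 1 for bi-allelic SNPs), the window is measured in base pairs, and each SNP is compared only with the most recent representative that has the same pattern. The function and variable names are hypothetical, and the actual FastTagger implementation may differ.</p>

```python
from typing import Dict, Sequence, Tuple


def canonical(alleles: Sequence[int]) -> Tuple[int, ...]:
    """Flip allele labels if needed so that the pattern starts with 0.

    Two bi-allelic SNPs whose patterns are identical or complementary
    share the same canonical pattern.
    """
    if alleles[0] == 0:
        return tuple(alleles)
    return tuple(1 - a for a in alleles)


def merge_equivalent(positions: Sequence[int],
                     alleles: Sequence[Sequence[int]],
                     window: int = 100_000) -> Dict[int, int]:
    """Greedy merge of equivalent SNPs (assumed rule, see the lead-in).

    positions must be sorted in ascending order. Returns a mapping from
    each SNP index to the index of its representative SNP.
    """
    rep_of: Dict[int, int] = {}
    last_rep: Dict[Tuple[int, ...], int] = {}  # pattern -> latest representative
    for i, (pos, pattern) in enumerate(zip(positions, alleles)):
        key = canonical(pattern)
        j = last_rep.get(key)
        if j is not None and window >= pos - positions[j]:
            rep_of[i] = j          # merge into an existing representative
        else:
            rep_of[i] = i          # start a new group
            last_rep[key] = i
    return rep_of
```

<p>Only the representatives (the SNPs mapped to themselves) would then take part in rule generation, which is why the number of candidate rules shrinks so sharply when roughly half of the SNPs are merged away.</p>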
<p>Table <tblr tid="T8">8</tblr> shows the performance of FastTagger without pruning redundant rules. Pruning redundant rules reduces the number of rules generated by a factor of about 3, and thus reduces the maximum memory usage of FastTagger by more than half. Although identifying redundant rules reduces the search space, it also incurs some overhead. Hence the running time of FastTagger does not decrease when redundant rules are pruned.</p>
<tbl id="T8"><title><p>Table 8</p></title><caption><p>Baseline algorithm WITHOUT pruning redundant rules. </p></caption><tblbdy cols="5">
      <r>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>
               <b>time</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>#tag SNPs</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>mem</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>#rules</b>
            </p>
         </c>
      </r>
      <r>
         <c cspan="5">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>chr19</p>
         </c>
         <c ca="center">
            <p>4.24</p>
         </c>
         <c ca="center">
            <p>9439</p>
         </c>
         <c ca="center">
            <p>75.70 MB</p>
         </c>
         <c ca="center">
            <p>
               <it>3,048,090</it>
            </p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>chr21</p>
         </c>
         <c ca="center">
            <p>9.60</p>
         </c>
         <c ca="center">
            <p>6942</p>
         </c>
         <c ca="center">
            <p>191.86 MB</p>
         </c>
         <c ca="center">
            <p>
               <it>5,643,004</it>
            </p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>chr22</p>
         </c>
         <c ca="center">
            <p>15.8</p>
         </c>
         <c ca="center">
            <p>7327</p>
         </c>
         <c ca="center">
            <p>130.19 MB</p>
         </c>
         <c ca="center">
            <p>
               <it>5,563,473</it>
            </p>
         </c>
      </r>
   </tblbdy><tblfn>
      <p>The co-occurrence model is used. <it>max_size </it>= 3, <it>min_r</it>2 = 0.95.</p>
   </tblfn></tbl>
<p>Table <tblr tid="T9">9</tblr> shows the performance of FastTagger when the skipping rules technique is used. Here, once a SNP has appeared on the right hand side of at least 5 rules, it is no longer considered as a right hand side. With this technique, the number of rules generated is reduced by more than half, and the running time and memory usage of FastTagger are also reduced. The number of tag SNPs selected increases slightly, but it is still smaller than that generated by the MMTagger algorithm.</p>
<tbl id="T9"><title><p>Table 9</p></title><caption><p>Baseline algorithm with skipping rules: if a SNP appears in the right hand side no less than 5 times, the SNP will not be considered as right hand side any more. </p></caption><tblbdy cols="5">
      <r>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>
               <b>time</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>#tag SNPs</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>mem</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>#rules</b>
            </p>
         </c>
      </r>
      <r>
         <c cspan="5">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>chr19</p>
         </c>
         <c ca="center">
            <p>3.66</p>
         </c>
         <c ca="center">
            <p>9550</p>
         </c>
         <c ca="center">
            <p>18.61 MB</p>
         </c>
         <c ca="center">
            <p>
               <it>461,139</it>
            </p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>chr21</p>
         </c>
         <c ca="center">
            <p>8.06</p>
         </c>
         <c ca="center">
            <p>7086</p>
         </c>
         <c ca="center">
            <p>40.74 MB</p>
         </c>
         <c ca="center">
            <p>
               <it>754,084</it>
            </p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>chr22</p>
         </c>
         <c ca="center">
            <p>13.5</p>
         </c>
         <c ca="center">
            <p>7447</p>
         </c>
         <c ca="center">
            <p>28.62 MB</p>
         </c>
         <c ca="center">
            <p>
               <it>755,309</it>
            </p>
         </c>
      </r>
   </tblbdy><tblfn>
      <p>The co-occurrence model is used. <it>max_size </it>= 3, <it>min_r</it>2 = 0.95.</p>
   </tblfn></tbl>
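<p>The skipping rules technique described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not FastTagger's code: rules are represented as hypothetical (left hand side, right hand side) pairs, the candidates are assumed to have already passed the <it>min_r</it>2 threshold, and the threshold of 5 is taken from the experiment in Table 9.</p>

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# A rule is (left hand side SNP ids, right hand side SNP id); hypothetical layout.
Rule = Tuple[Tuple[int, ...], int]


def keep_rules_with_skipping(candidates: Iterable[Rule],
                             max_rhs_count: int = 5) -> List[Rule]:
    """Keep a rule only while its right hand side SNP has been used
    fewer than max_rhs_count times; later rules for that SNP are skipped."""
    rhs_count: Dict[int, int] = defaultdict(int)
    kept: List[Rule] = []
    for lhs, rhs in candidates:
        if rhs_count[rhs] >= max_rhs_count:
            continue  # this SNP already has enough rules that can tag it
        kept.append((lhs, rhs))
        rhs_count[rhs] += 1
    return kept
```

<p>In this sketch the check is applied to rules that have already been generated. Presumably the saving in running time comes from applying the same check before a candidate rule is evaluated, so that no work is spent on right hand sides that are already well covered.</p>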
<p>We also tested FastTagger under a memory constraint, limiting the maximum memory that FastTagger can use to 50 MB. We used three large chromosomes, chr1, chr2 and chr3, in this experiment; each contains more than 100 k SNPs. Table <tblr tid="T10">10</tblr> shows that even with as little as 50 MB of memory, FastTagger can still work on chromosomes with more than 100 k SNPs. There is only a tiny increase in its running time and in the number of tag SNPs generated.</p>
<tbl id="T10"><title><p>Table 10</p></title><caption><p>Performance of Fast-COOC when memory size is restricted to 50 MB (<it>max_size </it>= 3, <it>min_r</it>2 = 0.95)</p></caption><tblbdy cols="7">
      <r>
         <c>
            <p/>
         </c>
         <c cspan="3" ca="center">
            <p>
               <b>No memory constraint</b>
            </p>
         </c>
         <c cspan="3" ca="center">
            <p>
               <b>mem = 50 MB</b>
            </p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>
               <b>time</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>#tag SNPs</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>mem</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>time</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>#tag SNPs</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>#chunks</b>
            </p>
         </c>
      </r>
      <r>
         <c cspan="7">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>chr1</p>
         </c>
         <c ca="center">
            <p>34.9</p>
         </c>
         <c ca="center">
            <p>35556</p>
         </c>
         <c ca="center">
            <p>94.41 MB</p>
         </c>
         <c ca="center">
            <p>35.14</p>
         </c>
         <c ca="center">
            <p>35561</p>
         </c>
         <c ca="center">
            <p>16</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>chr2</p>
         </c>
         <c ca="center">
            <p>42.9</p>
         </c>
         <c ca="center">
            <p>35502</p>
         </c>
         <c ca="center">
            <p>287.50 MB</p>
         </c>
         <c ca="center">
            <p>43.14</p>
         </c>
         <c ca="center">
            <p>35518</p>
         </c>
         <c ca="center">
            <p>21</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>chr3</p>
         </c>
         <c ca="center">
            <p>39.3</p>
         </c>
         <c ca="center">
            <p>30695</p>
         </c>
         <c ca="center">
            <p>119.72 MB</p>
         </c>
         <c ca="center">
            <p>39.3</p>
         </c>
         <c ca="center">
            <p>30706</p>
         </c>
         <c ca="center">
            <p>15</p>
         </c>
      </r>
   </tblbdy></tbl>
</sec>
<sec>
<st>
<p>Portability and prediction accuracy</p>
</st>
<p>Multi-marker models partition the combinations of the alleles on the left hand side into two groups, and then map these two groups to the two alleles on the right hand side. Compared with the pairwise model, multi-marker models are more prone to over-fitting. Here we use three populations in HapMap, namely the Han Chinese population (HCB), the Japanese population (JPT) and the Caucasian population (CEU), to study the portability and prediction accuracy of tagging rules of different lengths. We use chr19 in this experiment. We first generate tagging rules from one population, and then calculate the <it>r</it><sup>2</sup> statistic and prediction accuracy of these rules in the other populations. The prediction accuracy of a rule is defined as the proportion of alleles of the SNP on the right hand side that are correctly predicted by the alleles of the SNPs on the left hand side. The results reported below are for rules generated from individuals in the Han Chinese population and evaluated using individuals in the other two populations. In all three populations, we consider only those SNPs with MAF &#8805; 5%.</p>
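<p>The two quantities evaluated in this experiment can be made concrete with a small Python sketch. It assumes phased haplotypes coded as 0/1 alleles, a hypothetical dictionary group_of that maps each left hand side allele combination to the predicted right hand side allele, and the standard pairwise r^2 formula applied to the grouped left hand side treated as a single bi-allelic marker. All names are illustrative and the code is not taken from FastTagger.</p>

```python
from typing import Dict, Sequence, Tuple

Haplotype = Sequence[int]            # one 0/1 allele per SNP
Grouping = Dict[Tuple[int, ...], int]  # allele combination -> predicted allele


def predict_rhs(hap: Haplotype, lhs: Sequence[int], group_of: Grouping) -> int:
    """Predict the right hand side allele from the left hand side alleles."""
    return group_of[tuple(hap[i] for i in lhs)]


def rule_accuracy(haps: Sequence[Haplotype], lhs: Sequence[int],
                  rhs: int, group_of: Grouping) -> float:
    """Proportion of right hand side alleles that are predicted correctly."""
    correct = sum(1 for h in haps if predict_rhs(h, lhs, group_of) == h[rhs])
    return correct / len(haps)


def rule_r2(haps: Sequence[Haplotype], lhs: Sequence[int],
            rhs: int, group_of: Grouping) -> float:
    """r^2 between the grouped left hand side and the right hand side SNP."""
    n = len(haps)
    x = [predict_rhs(h, lhs, group_of) for h in haps]
    y = [h[rhs] for h in haps]
    p_x = sum(x) / n
    p_y = sum(y) / n
    p_xy = sum(a * b for a, b in zip(x, y)) / n
    denom = p_x * (1 - p_x) * p_y * (1 - p_y)
    if denom == 0:
        return 0.0  # one of the two markers is monomorphic in this sample
    return (p_xy - p_x * p_y) ** 2 / denom
```

<p>Evaluating a rule in another population then amounts to calling both functions with the same lhs, rhs and group_of but with that population's haplotypes, which is how a rule learned in one population can lose r^2 and accuracy elsewhere.</p>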
<p>Figures <figr fid="F1">1</figr>, <figr fid="F2">2</figr> and <figr fid="F3">3</figr> show the distribution, in the three populations, of the <it>r</it><sup>2</sup> values of the rules generated from the Han Chinese population using the two multi-marker models. Table <tblr tid="T11">11</tblr> shows the average <it>r</it><sup>2</sup> of the rules in the three populations. For all three lengths, the average <it>r</it><sup>2</sup> of the rules in the Japanese population and the Caucasian population is lower than that in the Chinese population. The decrease is larger for length-2 and length-3 rules than for length-1 rules: from the Chinese to the Japanese population, the average <it>r</it><sup>2</sup> drops from 0.978 to 0.942 for length-1 rules, but from 0.952 to 0.790 for length-3 rules under the co-occurrence model. This indicates that longer rules are more prone to over-fitting than shorter rules for both models. The <it>r</it><sup>2</sup> values of the rules are much lower in the Caucasian population than in the Japanese population, which is consistent with the genetic differences between the three populations.</p>
<tbl id="T11"><title><p>Table 11</p></title><caption><p>Average <it>r</it><sup>2 </sup>and prediction accuracy of rules of different lengths on three populations. </p></caption><tblbdy cols="11">
      <r>
         <c>
            <p/>
         </c>
         <c>
            <p/>
         </c>
         <c cspan="3" ca="center">
            <p>
               <b>#rules</b>
            </p>
         </c>
         <c cspan="3" ca="center">
            <p>
               <b>average <it>r</it><sup>2</sup></b>
            </p>
         </c>
         <c cspan="3" ca="center">
            <p>
               <b>average accuracy</b>
            </p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>
               <b>len</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>model</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>HCB</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>JPT</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>CEU</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>HCB</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>JPT</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>CEU</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>HCB</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>JPT</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>CEU</b>
            </p>
         </c>
      </r>
      <r>
         <c cspan="11">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>1</p>
         </c>
         <c ca="center">
            <p>pairwise</p>
         </c>
         <c ca="center">
            <p>85961</p>
         </c>
         <c ca="center">
            <p>84123</p>
         </c>
         <c ca="center">
            <p>69083</p>
         </c>
         <c ca="center">
            <p>0.978</p>
         </c>
         <c ca="center">
            <p>0.942</p>
         </c>
         <c ca="center">
            <p>0.865</p>
         </c>
         <c ca="center">
            <p>0.995</p>
         </c>
         <c ca="center">
            <p>0.989</p>
         </c>
         <c ca="center">
            <p>0.966</p>
         </c>
      </r>
      <r>
         <c cspan="11">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>2</p>
         </c>
         <c ca="center">
            <p>co-occurrence</p>
         </c>
         <c ca="center">
            <p>1563176</p>
         </c>
         <c ca="center">
            <p>1472654</p>
         </c>
         <c ca="center">
            <p>1014934</p>
         </c>
         <c ca="center">
            <p>0.965</p>
         </c>
         <c ca="center">
            <p>0.878</p>
         </c>
         <c ca="center">
            <p>0.745</p>
         </c>
         <c ca="center">
            <p>0.993</p>
         </c>
         <c ca="center">
            <p>0.977</p>
         </c>
         <c ca="center">
            <p>0.938</p>
         </c>
      </r>
      <r>
         <c cspan="11">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>2</p>
         </c>
         <c ca="center">
            <p>one-vs-the-rest</p>
         </c>
         <c ca="center">
            <p>1560181</p>
         </c>
         <c ca="center">
            <p>1469765</p>
         </c>
         <c ca="center">
            <p>1012699</p>
         </c>
         <c ca="center">
            <p>0.965</p>
         </c>
         <c ca="center">
            <p>0.881</p>
         </c>
         <c ca="center">
            <p>0.753</p>
         </c>
         <c ca="center">
            <p>0.993</p>
         </c>
         <c ca="center">
            <p>0.977</p>
         </c>
         <c ca="center">
            <p>0.940</p>
         </c>
      </r>
      <r>
         <c cspan="11">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>3</p>
         </c>
         <c ca="center">
            <p>co-occurrence</p>
         </c>
         <c ca="center">
            <p>26182522</p>
         </c>
         <c ca="center">
            <p>24495802</p>
         </c>
         <c ca="center">
            <p>16064120</p>
         </c>
         <c ca="center">
            <p>0.952</p>
         </c>
         <c ca="center">
            <p>0.790</p>
         </c>
         <c ca="center">
            <p>0.665</p>
         </c>
         <c ca="center">
            <p>0.990</p>
         </c>
         <c ca="center">
            <p>0.960</p>
         </c>
         <c ca="center">
            <p>0.913</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>3</p>
         </c>
         <c ca="center">
            <p>one-vs-the-rest</p>
         </c>
         <c ca="center">
            <p>7074493</p>
         </c>
         <c ca="center">
            <p>6269985</p>
         </c>
         <c ca="center">
            <p>3955224</p>
         </c>
         <c ca="center">
            <p>0.970</p>
         </c>
         <c ca="center">
            <p>0.791</p>
         </c>
         <c ca="center">
            <p>0.659</p>
         </c>
         <c ca="center">
            <p>0.994</p>
         </c>
         <c ca="center">
            <p>0.970</p>
         </c>
         <c ca="center">
            <p>0.919</p>
         </c>
      </r>
   </tblbdy><tblfn>
      <p>The rules are generated from the Han Chinese population with <it>min_r</it>2 = 0.9. Some rules may become invalid in the other two populations because the MAF of some SNPs in those populations may be smaller than 5%. When only pairwise LD is used, all algorithms generate the same set of rules. When multi-markers are considered, FastTagger-COOC and MMTagger generate the same set of rules using the co-occurrence model; FastTagger-1vsR and MultiTag generate the same set of rules using the one-vs-the-rest model.</p>
   </tblfn></tbl>
<fig id="F1"><title><p>Figure 1</p></title><caption><p>Portability of length-1 rules</p></caption><text>
   <p><b>Portability of length-1 rules</b>. The rules are generated from the Han Chinese population with <it>min_r</it>2 = 0.9, and they are then validated on the other two datasets as well.</p>
</text><graphic file="1471-2105-11-66-1" hint_layout="single"/></fig>
<fig id="F2"><title><p>Figure 2</p></title><caption><p>Portability of length-2 rules</p></caption><text>
   <p><b>Portability of length-2 rules</b>. The rules are generated from the Han Chinese population with <it>min_r</it>2 = 0.9.</p>
</text><graphic file="1471-2105-11-66-2" hint_layout="single"/></fig>
<fig id="F3"><title><p>Figure 3</p></title><caption><p>Portability of length-3 rules</p></caption><text>
   <p><b>Portability of length-3 rules</b>. The rules are generated from the Han Chinese population with <it>min_r</it>2 = 0.9.</p>
</text><graphic file="1471-2105-11-66-3" hint_layout="single"/></fig>
<p>The same trend is observed for prediction accuracy (Figures <figr fid="F4">4</figr>, <figr fid="F5">5</figr> and <figr fid="F6">6</figr>). Even though the rules are generated from the Chinese population, their accuracy in the Japanese population is always above 80%. Even for length-3 rules, 94% of the rules generated using the co-occurrence model and 97.4% of the rules generated using the one-vs-the-rest model have an accuracy of at least 90% in the Japanese population. The average accuracy of length-3 rules is above 96% for both models in the Japanese population (Table <tblr tid="T11">11</tblr>). The average accuracy of the rules in the Caucasian population is lower than that in the Japanese population, but it is still above 91% even for length-3 rules. We believe that if we use individuals from the same population to do the testing, the average <it>r</it><sup>2</sup> and accuracy should be even higher. As for the two models, they generate a similar number of length-2 rules, while the co-occurrence model generates about 3.5 times more length-3 rules than the one-vs-the-rest model. The length-3 rules generated using the one-vs-the-rest model have a higher average accuracy than those generated using the co-occurrence model in both populations (0.970 vs 0.960 in JPT and 0.919 vs 0.913 in CEU), while their average <it>r</it><sup>2</sup> is nearly the same (0.791 vs 0.790 in JPT and 0.659 vs 0.665 in CEU). However, since far fewer rules are generated under the one-vs-the-rest model, it needs more tag SNPs than the co-occurrence model to cover all the other SNPs, as shown in Table <tblr tid="T3">3</tblr>.</p>
<fig id="F4"><title><p>Figure 4</p></title><caption><p>Prediction accuracy of length-1 rules</p></caption><text>
   <p><b>Prediction accuracy of length-1 rules</b>. The rules are generated from the Han Chinese population with <it>min_r</it>2 = 0.9.</p>
</text><graphic file="1471-2105-11-66-4" hint_layout="single"/></fig>
<fig id="F5"><title><p>Figure 5</p></title><caption><p>Prediction accuracy of length-2 rules</p></caption><text>
   <p><b>Prediction accuracy of length-2 rules</b>. The rules are generated from the Han Chinese population with <it>min_r</it>2 = 0.9.</p>
</text><graphic file="1471-2105-11-66-5" hint_layout="single"/></fig>
<fig id="F6"><title><p>Figure 6</p></title><caption><p>Prediction accuracy of length-3 rules</p></caption><text>
   <p><b>Prediction accuracy of length-3 rules</b>. The rules are generated from the Han Chinese population with <it>min_r</it>2 = 0.9.</p>
</text><graphic file="1471-2105-11-66-6" hint_layout="single"/></fig>
</sec>
</sec>
<sec>
<st>
<p>Conclusions</p>
</st>
<p>In this paper, we have presented an efficient algorithm called FastTagger for genome-wide tag SNP selection using multi-marker LD. FastTagger uses several techniques to reduce running time and memory consumption. Our experimental results show that FastTagger is several times faster than existing tag SNP selection algorithms using multi-marker models, and that it consumes much less memory at the same time. This allows FastTagger to work on chromosomes containing more than 100 k SNPs, where existing algorithms using multi-marker models usually fail. FastTagger also selects fewer tag SNPs than existing algorithms using multi-marker LD. Our experimental results also show that merging equivalent SNPs is the most effective technique for reducing running time and memory consumption.</p>
<p>We implemented two multi-marker models in the FastTagger algorithm. The one-vs-the-rest model generates rules with higher average <it>r</it><sup>2</sup> and higher average accuracy than the co-occurrence model under the same parameter settings. However, it generates far fewer length-3 rules than the co-occurrence model, and thus requires more tag SNPs to cover all the other SNPs.</p>
<p>We compared the portability and prediction accuracy of rules of different lengths. The results show that shorter rules have better portability and higher prediction accuracy than longer rules. Nevertheless, length-3 rules generated from the Chinese population still achieve an average accuracy of 96% in the Japanese population for both models.</p>
<p>In our experiments, we calculated prediction accuracy for individual rules. When these rules are used to make predictions on unobserved SNPs, one SNP may be predicted by multiple rules, and the predictions of these rules may conflict with one another. In future work, we will study how to resolve such conflicts and make consensus predictions for unobserved SNPs.</p>
</sec>
<sec>
<st>
<p>Availability and requirements</p>
</st>
<p indent="1">&#8226; <b>Project name</b>: Pattern spaces &amp; data mining algorithms for pharmacogenomics</p>
<p indent="1">&#8226; <b>Project home page</b>: <url>http://www.comp.nus.edu.sg/~wongls/projects/snp-analysis/index.html</url>
</p>
<p indent="1">&#8226; <b>Grant</b>: A*STAR SERC PSF 072-101-0016</p>
<p indent="1">&#8226; <b>Operating system(s)</b>: Linux or Windows</p>
<p indent="1">&#8226; <b>Programming language</b>: C++</p>
<p indent="1">&#8226; <b>Other requirements</b>: none</p>
<p indent="1">&#8226; <b>License</b>: FreeBSD for academic use</p>
<p indent="1">&#8226; <b>Any restrictions to use by non-academics</b>: Licence needed</p>
</sec>
<sec>
<st>
<p>Authors' contributions</p>
</st>
<p>Guimei Liu designed and implemented the FastTagger algorithm, and wrote this manuscript. Yue Wang participated in discussion of the proposed method and conducted the experiments. Limsoon Wong gave advice on the design of the algorithm and the manuscript. All authors read and approved the final manuscript.</p>
</sec>
</bdy><bm>
<ack>
<sec>
<st>
<p>Acknowledgements</p>
</st>
<p>This work was supported in part by an A*STAR grant SERC 072 101 0016 (Liu, Wong) and an NUS NGS scholarship (Wang).</p>
</sec>
</ack>
<refgrp><bibl id="B1"><title><p>Haplotype tagging for the identification of common disease genes</p></title><aug><au><snm>Johnson</snm><fnm>G</fnm></au></aug><source>Nature Genetics</source><pubdate>2001</pubdate><volume>29</volume><fpage>233</fpage><lpage>237</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1038/ng1001-233</pubid><pubid idtype="pmpid" link="fulltext">11586306</pubid></pubidlist></xrefbib></bibl><bibl id="B2"><title><p>Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21</p></title><aug><au><snm>Patil</snm><fnm>N</fnm></au></aug><source>Science</source><pubdate>2001</pubdate><volume>294</volume><issue>5547</issue><fpage>1719</fpage><lpage>1723</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1126/science.1065573</pubid><pubid idtype="pmpid" link="fulltext">11721056</pubid></pubidlist></xrefbib></bibl><bibl id="B3"><title><p>The Structure of Haplotype Blocks in the Human Genome</p></title><aug><au><snm>Gabriel</snm><fnm>SB</fnm></au><au><snm>Schaffner</snm><fnm>SF</fnm></au><au><snm>Nguyen</snm><fnm>H</fnm></au><au><snm>Moore</snm><fnm>JM</fnm></au><au><snm>Roy</snm><fnm>J</fnm></au><au><snm>Blumenstiel</snm><fnm>B</fnm></au><au><snm>Higgins</snm><fnm>J</fnm></au><au><snm>DeFelice</snm><fnm>M</fnm></au><au><snm>Lochner</snm><fnm>A</fnm></au><au><snm>Faggart</snm><fnm>M</fnm></au><au><snm>Liu-Cordero</snm><fnm>SN</fnm></au><au><snm>Rotimi</snm><fnm>C</fnm></au><au><snm>Adeyemo</snm><fnm>A</fnm></au><au><snm>Cooper</snm><fnm>R</fnm></au><au><snm>Ward</snm><fnm>R</fnm></au><au><snm>Lander</snm><fnm>ES</fnm></au><au><snm>Daly</snm><fnm>MJ</fnm></au><au><snm>Altshuler</snm><fnm>D</fnm></au></aug><source>Science</source><pubdate>2002</pubdate><volume>296</volume><issue>5576</issue><fpage>2225</fpage><lpage>2229</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1126/science.1069424</pubid><pubid idtype="pmpid" link="fulltext">12029063</pubid></pubidlist></xrefbib></bibl><bibl id="B4"><title><p>Selection of 
Minimum Subsets of Single Nucleotide Polymorphisms to Capture Haplotype Block Diversity</p></title><aug><au><snm>Avi-Itzhak</snm><fnm>HI</fnm></au><au><snm>Su</snm><fnm>X</fnm></au><au><snm>de la Vega</snm><fnm>FM</fnm></au></aug><source>Pacific Symposium on Biocomputing</source><pubdate>2003</pubdate><fpage>466</fpage><lpage>477</lpage><xrefbib><pubid idtype="pmpid">12603050</pubid></xrefbib></bibl><bibl id="B5"><title><p>Minimal haplotype tagging</p></title><aug><au><snm>Sebastiani</snm><fnm>P</fnm></au></aug><source>Proc Natl Acad Sci</source><pubdate>2003</pubdate><volume>100</volume><fpage>9900</fpage><lpage>9905</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1073/pnas.1633613100</pubid><pubid idtype="pmcid">187880</pubid><pubid idtype="pmpid" link="fulltext">12900503</pubid></pubidlist></xrefbib></bibl><bibl id="B6"><title><p>Haplotype Tagging Single Nucleotide Polymorphisms and Association Studies</p></title><aug><au><snm>Thompson</snm><fnm>D</fnm></au><au><snm>Stram</snm><fnm>D</fnm></au><au><snm>Goldgar</snm><fnm>D</fnm></au><au><snm>Witte</snm><fnm>JS</fnm></au></aug><source>Human Heredity</source><pubdate>2003</pubdate><volume>56</volume><fpage>48</fpage><lpage>55</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1159/000073732</pubid><pubid idtype="pmpid" link="fulltext">14614238</pubid></pubidlist></xrefbib></bibl><bibl id="B7"><title><p>Haplotype Block Partitioning and Tag SNP Selection Using Genotype Data and Their Applications to Association Studies</p></title><aug><au><snm>Zhang</snm><fnm>K</fnm></au><au><snm>Qin</snm><fnm>ZS</fnm></au><au><snm>Liu</snm><fnm>JS</fnm></au><au><snm>Chen</snm><fnm>T</fnm></au><au><snm>Waterman</snm><fnm>MS</fnm></au><au><snm>Sun</snm><fnm>F</fnm></au></aug><source>Genome Research</source><pubdate>2004</pubdate><volume>14</volume><fpage>908</fpage><lpage>916</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1101/gr.1837404</pubid><pubid idtype="pmcid">479119</pubid><pubid idtype="pmpid" 
link="fulltext">15078859</pubid></pubidlist></xrefbib></bibl><bibl id="B8"><title><p>Optimal haplotype block-free selection of tagging SNPs for genome-wide association studies</p></title><aug><au><snm>Halldorsson</snm><fnm>B</fnm></au></aug><source>Genome Research</source><pubdate>2004</pubdate><volume>14</volume><fpage>1633</fpage><lpage>1640</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1101/gr.2570004</pubid><pubid idtype="pmcid">509273</pubid><pubid idtype="pmpid" link="fulltext">15289481</pubid></pubidlist></xrefbib></bibl><bibl id="B9"><title><p>Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium</p></title><aug><au><snm>Carlson</snm><fnm>C</fnm></au></aug><source>The American Journal of Human Genetics</source><pubdate>2004</pubdate><volume>74</volume><fpage>106</fpage><lpage>120</lpage><xrefbib><pubid idtype="doi">10.1086/381000</pubid></xrefbib></bibl><bibl id="B10"><title><p>Tag SNP selection in genotype data for maximizing SNP prediction accuracy</p></title><aug><au><snm>Halperin</snm><fnm>E</fnm></au></aug><source>Bioinformatics</source><pubdate>2005</pubdate><volume>21</volume><fpage>195</fpage><lpage>203</lpage><xrefbib><pubid idtype="doi">10.1093/bioinformatics/bti1021</pubid></xrefbib></bibl><bibl id="B11"><title><p>The whole genome tagSNP selection and transferability among HapMap populations</p></title><aug><au><snm>Magi</snm><fnm>R</fnm></au><au><snm>Kaplinski</snm><fnm>L</fnm></au><au><snm>Remm</snm><fnm>M</fnm></au></aug><source>Pacific Symposium on Biocomputing</source><pubdate>2006</pubdate><volume>11</volume><fpage>535</fpage><lpage>543</lpage><xrefbib><pubid idtype="doi">full_text</pubid></xrefbib></bibl><bibl id="B12"><title><p>An efficient comprehensive search algorithm for tagSNP selection using linkage disequilibrium 
criteria</p></title><aug><au><snm>Qin</snm><fnm>Z</fnm></au><au><snm>Gopalakrishnan</snm><fnm>S</fnm></au><au><snm>Abecasis</snm><fnm>G</fnm></au></aug><source>Bioinformatics</source><pubdate>2006</pubdate><volume>22</volume><issue>2</issue><fpage>220</fpage><lpage>225</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/bti762</pubid><pubid idtype="pmpid" link="fulltext">16269414</pubid></pubidlist></xrefbib></bibl><bibl id="B13"><title><p>Efficient algorithms for genome-wide tagSNPs selection across populations via linkage disequilibrium criterion</p></title><aug><au><snm>Liu</snm><fnm>L</fnm></au><au><snm>Wu</snm><fnm>Y</fnm></au><au><snm>Lonardi</snm><fnm>S</fnm></au><au><snm>Jiang</snm><fnm>T</fnm></au></aug><source>Proc. of 6th Annual International Conference on Computational Systems Bioinformatics</source><pubdate>2007</pubdate><fpage>67</fpage><lpage>78</lpage><xrefbib><pubid idtype="doi">full_text</pubid></xrefbib></bibl><bibl id="B14"><title><p>Linkage Disequilibrium in Humans: Models and Data</p></title><aug><au><snm>Pritchard</snm><fnm>JK</fnm></au><au><snm>Przeworski</snm><fnm>M</fnm></au></aug><source>Am J Hum Genet</source><pubdate>2001</pubdate><volume>69</volume><fpage>1</fpage><lpage>14</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1086/321275</pubid><pubid idtype="pmcid">1226024</pubid><pubid idtype="pmpid" link="fulltext">11410837</pubid></pubidlist></xrefbib></bibl><bibl id="B15"><title><p>Efficient selection of tagging single-nucleotide polymorphisms in multiple populations</p></title><aug><au><snm>Howie</snm><fnm>B</fnm></au></aug><source>Human Genetics</source><pubdate>2006</pubdate><volume>120</volume><fpage>58</fpage><lpage>68</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1007/s00439-006-0182-5</pubid><pubid idtype="pmpid" link="fulltext">16680432</pubid></pubidlist></xrefbib></bibl><bibl id="B16"><title><p>Efficiency and power in genetic association 
studies</p></title><aug><au><snm>Bakker</snm><fnm>PD</fnm></au><au><snm>Yelensky</snm><fnm>R</fnm></au><au><snm>Pe'er</snm><fnm>I</fnm></au><au><snm>Gabriel</snm><fnm>SB</fnm></au><au><snm>Daly</snm><fnm>MJ</fnm></au><au><snm>Altshuler</snm><fnm>D</fnm></au></aug><source>Nature Genetics</source><pubdate>2005</pubdate><volume>37</volume><fpage>1217</fpage><lpage>1223</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1038/ng1669</pubid><pubid idtype="pmpid" link="fulltext">16244653</pubid></pubidlist></xrefbib></bibl><bibl id="B17"><title><p>Evaluating and improving power in whole-genome association studies using fixed marker sets</p></title><aug><au><snm>Pe'er</snm><fnm>I</fnm></au></aug><source>Nature Genetics</source><pubdate>2006</pubdate><volume>38</volume><fpage>663</fpage><lpage>667</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1038/ng1816</pubid><pubid idtype="pmpid" link="fulltext">16715096</pubid></pubidlist></xrefbib></bibl><bibl id="B18"><title><p>A new framework for the selection of tag SNPs by multimarker haplotypes</p></title><aug><au><snm>Huang</snm><fnm>YT</fnm></au><au><snm>Chao</snm><fnm>KM</fnm></au></aug><source>Journal of Biomedical Informatics</source><pubdate>2008</pubdate><volume>41</volume><issue>6</issue><fpage>953</fpage><lpage>961</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1016/j.jbi.2008.04.003</pubid><pubid idtype="pmpid" link="fulltext">18490200</pubid></pubidlist></xrefbib></bibl><bibl id="B19"><title><p>Genome-wide selection of tag SNPs using multiple-marker correlation</p></title><aug><au><snm>Hao</snm><fnm>K</fnm></au></aug><source>Bioinformatics</source><pubdate>2007</pubdate><volume>23</volume><issue>23</issue><fpage>3178</fpage><lpage>3184</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/btm496</pubid><pubid idtype="pmpid" link="fulltext">18006555</pubid></pubidlist></xrefbib></bibl><bibl id="B20"><title><p>LdCompare: rapid computation of single and multiple marker <it>r</it><sup>2 </sup>and 
genetic coverage</p></title><aug><au><snm>Hao</snm><fnm>K</fnm></au><au><snm>Di</snm><fnm>X</fnm></au><au><snm>Cawley</snm><fnm>S</fnm></au></aug><source>Bioinformatics</source><pubdate>2007</pubdate><volume>23</volume><issue>2</issue><fpage>252</fpage><lpage>254</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/btl574</pubid><pubid idtype="pmpid" link="fulltext">17148510</pubid></pubidlist></xrefbib></bibl><bibl id="B21"><title><p>A New Model of Multi-Marker Correlation for Genome-wide Tag SNP Selection</p></title><aug><au><snm>Wang</snm><fnm>WB</fnm></au><au><snm>Jiang</snm><fnm>T</fnm></au></aug><source>Proc. of the International Conference on Genome Informatics</source><pubdate>2008</pubdate></bibl><bibl id="B22"><title><p>A new statistical method for haplotype reconstruction from population data</p></title><aug><au><snm>Stephens</snm><fnm>M</fnm></au><au><snm>Smith</snm><fnm>N</fnm></au><au><snm>Donnelly</snm><fnm>P</fnm></au></aug><source>The American Journal of Human Genetics</source><pubdate>2001</pubdate><volume>68</volume><fpage>978</fpage><lpage>989</lpage><xrefbib><pubid idtype="doi">10.1086/319501</pubid></xrefbib></bibl><bibl id="B23"><title><p>Estimation of linkage disequilibrium in randomly mating populations</p></title><aug><au><snm>Hill</snm><fnm>W</fnm></au></aug><source>Heredity</source><pubdate>1974</pubdate><volume>33</volume><issue>2</issue><fpage>229</fpage><lpage>239</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1038/hdy.1974.89</pubid><pubid idtype="pmpid">4531429</pubid></pubidlist></xrefbib></bibl><bibl id="B24"><title><p>Tests for association of gene frequencies at several loci in random mating diploid populations</p></title><aug><au><snm>Hill</snm><fnm>W</fnm></au></aug><source>Biometrics</source><pubdate>1975</pubdate><volume>31</volume><issue>4</issue><fpage>881</fpage><lpage>888</lpage></bibl><bibl id="B25"><title><p>A new concise representation of frequent itemsets using generators and a positive 
border</p></title><aug><au><snm>Liu</snm><fnm>G</fnm></au><au><snm>Li</snm><fnm>J</fnm></au><au><snm>Wong</snm><fnm>L</fnm></au></aug><source>Knowl Inf Syst</source><pubdate>2008</pubdate><volume>17</volume><fpage>35</fpage><lpage>56</lpage><xrefbib><pubid idtype="doi">10.1007/s10115-007-0111-5</pubid></xrefbib></bibl></refgrp>
</bm></art>
