Fast and accurate haplotype frequency estimation for large haplotype vectors from pooled DNA data

Iliadis, Alexandros; Anastassiou, Dimitris; Wang, Xiaodong

doi:10.1186/1471-2156-13-94

Research article
Open access
Published: 30 October 2012

BMC Genetics volume 13, Article number: 94 (2012)

Abstract

Background

Typically, the first phase of a genome wide association study (GWAS) includes genotyping across hundreds of individuals and validation of the most significant SNPs. Allelotyping of pooled genomic DNA is a common approach to reduce the overall cost of the study. Knowledge of haplotype structure can provide additional information to single locus analyses. Several methods have been proposed for estimating haplotype frequencies in a population from pooled DNA data.

Results

We introduce a technique for haplotype frequency estimation in a population from pooled DNA samples, focusing on datasets containing a small number of individuals per pool (2 or 3 individuals) and a large number of markers. We compare our method with the publicly available state-of-the-art algorithms HIPPO and HAPLOPOOL on datasets with varying numbers of pools and markers. We demonstrate that our algorithm provides improvements in terms of accuracy and computational time over competing methods for large numbers of markers while demonstrating comparable performance for smaller numbers of markers. Our method is implemented in the "Tree-Based Deterministic Sampling Pool" (TDSPool) package, which is available for download at http://www.ee.columbia.edu/~anastas/tdspool.

Conclusions

Using a tree-based deterministic sampling technique, we present an algorithm for haplotype frequency estimation from pooled data. Our method demonstrates superior performance in datasets with a large number of markers and could be the method of choice for haplotype frequency estimation in such datasets.

Background

In recent years large genetic association studies involving hundreds or thousands of individuals have become increasingly available, providing opportunities for biological and medical discoveries. In these studies, hundreds of thousands of SNPs are genotyped for the cases and the controls, and discrepancies between the haplotype distributions indicate an association between a genetic region and the disease. Typically, the first phase of a GWAS includes genotyping across hundreds of individuals and validation of the most significant SNPs. One possible approach to reducing the overall cost of GWAS is to replace individual genotyping in phase I with allelotyping of pooled genomic DNA [1–6]. Here, equimolar amounts of DNA are mixed into one sample prior to the amplification and sequencing steps. After genotyping, the frequency of an allele in each position is given [5].

Rather than examining SNPs independent of each other, simultaneously considering the values of multiple SNPs within haplotypes (combinations of alleles at multiple loci in individual chromosomes) can improve the power of detecting associations with disease and is also of general interest with the pooled data. To facilitate haplotype-based association analysis it is necessary to estimate haplotype frequencies from pooled DNA data.

A variety of algorithms have been suggested to estimate haplotype frequencies from pooled data. Available methods fall into two large categories. The first category consists of methods that focus on accurate solutions for small pool sizes (2 or 3 individuals per pool) and considerably large genotype segments. Many well-known approaches that focus on small pool sizes use an expectation-maximization (EM) algorithm for maximizing the multinomial likelihood [7–9]. Pirinen et al. [10] extended the gold standard PHASE algorithm [11] to the case of pooled data. They introduced a novel step in the Markov Chain Monte Carlo (MCMC) scheme, during which the haplotypes within each pool were shuffled to simulate individuals on which the original PHASE algorithm could be run to estimate the haplotypes. A method based on perfect phylogeny, HAPLOPOOL, was suggested in [12] and was supplemented with the EM algorithm and linear regression in order to combine haplotype segments. HAPLOPOOL has demonstrated superior performance in terms of accuracy and computational time with respect to the competing EM algorithms.

The second category consists of methods that focus on large pools (on the order of hundreds of individuals per pool) and considerably smaller genotype segments. For this scenario, Zhang et al. [13] first proposed a method (PoooL) for estimating haplotype frequencies using a normal approximation for the distribution of pooled allele counts. Imposing a set of linear constraints, they transformed the EM algorithm to a constrained maximum entropy problem, which they solved using the iterative scaling method. Kuk et al. [14] improved the PoooL methodology by using the ratio of normal densities approximation in the EM, which resulted in the AEM method. Gasbarra et al. [15] introduced a Bayesian haplotype frequency estimation method combining the pooled allele frequency data with prior database knowledge about the set of existing haplotypes in the population. Finally, HIPPO [16] used a multinormal approximation of the likelihood and a reversible-jump Markov chain Monte Carlo (RJMCMC) algorithm to estimate the existing haplotypes in the population and their frequencies. The HIPPO framework is also able to accommodate prior database knowledge for the existing haplotypes in the population and has demonstrated improvements in performance over the approximate EM algorithm [16]. In this study we will therefore compare our proposed algorithm with the top performing methods from each category as discussed above, namely HIPPO and HAPLOPOOL.

Naturally, pooling techniques are more prone to errors and offer fewer possibilities for assessing the quality of the data than individual genotyping. As argued and discussed by Kirkpatrick et al. [12], pooling errors have a much greater effect on larger pool sizes than on small pool sizes with respect to the number of incorrect allele calls and the subsequent haplotype estimation. Specifically, if σ is the error standard deviation (SD) in the estimates of allele frequencies, 2σ should be less than the difference between allowable frequency estimates in order for clustering algorithms to be able to correct the error. As more individuals are included in each pool, the difference between allowable allele frequencies decreases, which results in a higher percentage of incorrect calls. For example, in pools of two individuals, where the difference between allowable frequency calls is 0.25 (0, 0.25, 0.5, 0.75, 1), an accuracy of σ < 0.125 will ensure a low rate of incorrect calls (<1%).

In a recent study, Kuk et al. [17] examined the efficiency of pooling relative to no pooling using asymptotic statistical theory. They found that under linkage equilibrium (which is not a typical case) pooling suffers a loss in efficiency when there are more than three independent loci (2³ haplotypes) and up to four individuals per pool, whereas accuracy decreases with increasing pool size and number of loci. Rare alleles or linkage disequilibrium (LD) (or both) decrease the number of haplotypes that appear with non-negligible frequencies and thus pooling could remain efficient for larger haplotype blocks. In general, pooling could still remain more efficient in the case where only a small number of haplotypes can occur with appreciable frequency, as also suggested in Barratt et al. [18], as long as the pool size is kept considerably small.

In this paper we propose a new tree-based deterministic sampling method (TDSPool) for haplotype frequency estimation from pooled DNA data. Our method specifically focuses on small pool sizes and can handle arbitrarily large block sizes. In our study, we examine real data focusing on dense SNP areas, in which only a small number of haplotypes appear with appreciable frequency, so that our scenarios are within the limits of Kuk et al. [17]. We demonstrate that using our methodology we can achieve improved performance over existing state-of-the-art methods in datasets with a large number of markers.

Results

In order to compare the accuracy of frequency estimation between the different methods and under the different scenarios examined, we compared the predicted haplotype frequencies from a given method, f, to the gold-standard frequencies, g, observed in the actual population. The measure we used was the χ² distance between the two distributions, which is simply the value of the χ² statistic with g as the expected distribution, i.e., $\chi^2(f, g) = \sum_{i=1}^{d} (f_i - g_i)^2 / g_i$, where d is the number of gold-standard haplotypes [12].
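The χ² distance can be sketched in a few lines of Python. This is only an illustration: the function name `chi2_distance`, the dictionary representation and the example frequencies are assumptions made here, not part of any of the evaluated packages. Following the definition above, the sum runs over the gold-standard haplotypes only, so a haplotype predicted by a method but absent from g does not contribute directly.

```python
def chi2_distance(f, g):
    """Chi-square distance between estimated (f) and gold-standard (g) frequencies.

    f and g map haplotype strings to frequencies. The sum runs over the
    haplotypes present in g (all assumed to have g > 0); a gold-standard
    haplotype missing from f is treated as having estimated frequency 0.
    """
    return sum((f.get(h, 0.0) - g_h) ** 2 / g_h for h, g_h in g.items())


# Made-up frequencies, used only to exercise the function.
gold = {"000": 0.5, "011": 0.3, "110": 0.2}
estimate = {"000": 0.45, "011": 0.35, "110": 0.2}
print(chi2_distance(estimate, gold))
```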

Datasets

To examine the performance of our methodology we have considered in our experiments real datasets for which estimates of the haplotype frequencies were already available and which cover a variety of dataset sizes.

We first simulated data using the three-locus haplotypes and their associated frequencies from the dataset of Jain et al. [19] as the true distribution (Table 1). The haplotypes and their frequencies had been estimated using the EM algorithm from a set of 135 individuals genotyped on three SNPs, and these estimates were used as the true haplotype distribution. We simulated datasets with a variable number of pools, T = 50, 75, 100 and 150. In each pool, each individual was assigned a pair of haplotypes drawn at random according to the haplotype distribution. We created pools with two different pool sizes, 2 and 3 individuals per pool. For each number of pools and each pool size we created 100 datasets for our simulation.
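A minimal sketch of this simulation step is given below, assuming that haplotypes are drawn independently for every chromosome in a pool. The function name `simulate_pools`, the seed handling and the example haplotype frequencies are hypothetical; the frequencies are placeholders and are not the values of Table 1.

```python
import random


def simulate_pools(haplotypes, freqs, n_pools, pool_size, seed=0):
    """Simulate pooled genotypes from a known haplotype distribution.

    Returns a list of (drawn_haplotypes, pool_genotype) pairs, where the
    pool genotype is the count of allele 1 at each locus.
    """
    rng = random.Random(seed)
    n_loci = len(haplotypes[0])
    pools = []
    for _ in range(n_pools):
        # Each of the pool_size individuals carries two haplotypes.
        drawn = rng.choices(haplotypes, weights=freqs, k=2 * pool_size)
        genotype = [sum(int(h[j]) for h in drawn) for j in range(n_loci)]
        pools.append((drawn, genotype))
    return pools


# Placeholder distribution (not the values of Table 1).
pools = simulate_pools(["000", "011", "110"], [0.5, 0.3, 0.2],
                       n_pools=50, pool_size=2)
print(pools[0])
```

Only the pool genotypes would be passed to an estimation method; the drawn haplotypes are kept here so that the simulated truth can be inspected.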

Table 1 Haplotypes and their estimated frequencies for the 3 loci dataset

Next, we considered two more cases with a larger number of loci. In the second case, which has L = 10 loci, we generated data according to the haplotype frequencies of the AGT gene considered in Yang et al. [9]. The haplotypes and their respective frequencies are given in Table 2. The procedure for creating datasets and pools was identical to the three loci case.

Table 2 Haplotypes and their estimated frequencies for the 10 loci dataset

The third dataset consisted of SNPs from the first 7Mb (742 kb to 7124.8 kb) of the HapMap CEU population (HapMap 3 release 2- Phasing data). This chromosomal region was partitioned based on physical distance into disjoint blocks of 15 kb. The resulting blocks had a varying number of markers, ranging from 2 to 28. For our purposes we have considered only the datasets that had more than 10 SNPs and fewer than 20 (which was the maximum number of loci so that HAPLOPOOL could produce estimates within a reasonable amount of time), which resulted in selecting a total of 80 blocks. On each block the parental haplotypes and their estimated frequencies were used as the true haplotype distribution. As in the previous cases, in each block two different pool sizes, 2 and 3 individuals per pool, and four different numbers of pools per dataset were considered.

Frequency estimation

We have examined the accuracy of our method and compared it against HIPPO and HAPLOPOOL on the three datasets described in our previous subsection. In all experiments considered in this subsection the DNA pools were simulated assuming no missing data or measurement error. The performance of the methods is shown in Figure 1.

For the 3 and 10 loci datasets the result presented is the average χ² distance over 100 simulation experiments, whereas for the HapMap dataset the result presented is the average χ² distance over the 80 datasets considered. For the 3 loci dataset it can be seen that TDSPool and HAPLOPOOL produced similar accuracy. For the remaining two datasets, with larger numbers of loci, TDSPool demonstrated superior performance. For the HapMap dataset only TDSPool and HAPLOPOOL were evaluated, since the maximum number of loci HIPPO can handle without prior knowledge of the major haplotypes in the population is 10. At the same time, even though HAPLOPOOL can in principle handle larger datasets, due to excessive computational time for datasets with 24 and 28 loci we restricted our comparisons to datasets between 10 and 20 loci. We note here as well that, since HIPPO is based on a central limit theorem, its approximation is likely to be better in large pools than in the small ones on which we focus in our study.

From our experiments we can also see that the number of pools affected accuracy: all algorithms demonstrated improved performance with an increasing number of pools in the dataset.

Noise and missing data

In the previous subsection we have evaluated the performance of our method by simulating DNA pools without missing data and measurement errors. However, in allelotyping pooled DNA, allele frequencies may not be estimated properly in some practical situations and the data are consequently missing or have measurement errors.

In order to measure the effect of genotype error on the accuracy of the haplotype frequency estimation and evaluate the performance of our method under such scenarios, we have simulated genotyping error by adding a Gaussian error with SD σ to each called allele frequency. Suppose we denote the correct allele frequency at SNP j in pool i as c_ij. The perturbed allele frequency is given by $\hat{c}_{ij} = c_{ij} + x$, where x ∼ N(0, σ²). After simulating these perturbed allele frequencies, we discretize the resulting frequencies to produce perturbed allele counts that are consistent with the number of haplotypes in each pool. We have considered a variety of values for σ, ranging from 0 to 0.06, similar to Kirkpatrick et al. [12]. The perturbed datasets examined were derived from the unperturbed datasets used in the previous subsection with the procedure described above. The results are shown in Figure 2. Due to space limitations we give the results only when the number of pools is 75, but the shape of the figures is similar for the remaining numbers of pools examined in our previous subsection.
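The perturbation step can be sketched as follows. The text does not state the exact discretization rule, so this sketch assumes rounding to the nearest allowable allele count and clipping to the range 0 to 2N; the function name `perturb_counts` is hypothetical.

```python
import random


def perturb_counts(counts, pool_size, sigma, rng):
    """Add Gaussian noise to the allele frequencies of one pool and re-discretize.

    counts holds the true counts of allele 1 at each locus (0..2N).
    Assumption: discretization rounds to the nearest allowable count and
    clips the result to the range [0, 2N].
    """
    n_hap = 2 * pool_size
    noisy = []
    for c in counts:
        freq = c / n_hap + rng.gauss(0.0, sigma)  # c_ij + x, x ~ N(0, sigma^2)
        k = round(freq * n_hap)                   # nearest allowable count
        noisy.append(min(n_hap, max(0, k)))
    return noisy


rng = random.Random(0)
print(perturb_counts([0, 1, 2, 4], pool_size=2, sigma=0.06, rng=rng))
```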

For a small number of loci, HAPLOPOOL achieves the best performance. However, for larger datasets TDSPool outperforms all competing methods.

Furthermore, we have evaluated the performance of our methodology using missing data. We have randomly masked 1% and 2% of the SNPs on the 10 loci datasets and estimated the accuracy. As shown in Figure 3, missing SNPs result in small losses in accuracy and, as expected, the error decreases with an increasing number of pools.

Timing results

The computational times for all datasets are displayed in Table 3. All methods were run with their default parameters. Specifically, for HIPPO the default number of iterations was 100000, and for TDSPool the default number of streams (as will be defined in the "Methods" section) used throughout our experiments was 50. Based on these results, HIPPO was the slowest method on all datasets, running more than 20 times slower than the remaining two algorithms on the ten loci dataset. For the three loci dataset all methods were able to estimate the haplotype frequencies within six seconds. For the ten loci dataset HAPLOPOOL and TDSPool were still able to produce the results in less than three seconds, whereas HIPPO required more than 58 seconds to finish. For the HapMap datasets both TDSPool and HAPLOPOOL were again able to finish the procedure within four seconds. In the ten loci and HapMap datasets TDSPool demonstrated better performance compared to HAPLOPOOL when the number of pools in each dataset was more than 75. Therefore, for all practical applications all methods are fast enough for researchers to use.

Table 3 Timing results

Discussion

We have introduced a new algorithm for estimating haplotype frequencies from datasets with pooled DNA samples and we have compared it with existing available packages. We have shown that for datasets with a small number of loci our algorithm has comparable performance to state-of-the-art methods in terms of accuracy and computational time, but it demonstrates superior performance for datasets with a larger number of loci.

Our method specifically focuses on small pool sizes and we have demonstrated the performance on pools of two or three individuals. In our experiments we have partitioned pooled genotype vectors into blocks of 4 SNPs as described in the "Partition-Ligation" subsection. We have chosen to partition the pooled genotypes every 4 SNPs so that computations are performed fast and we avoid cases with a huge number of solutions. Partitioning the dataset every 3 SNPs had negligible impact on the accuracy of our results (results not shown), whereas partitioning every 5 SNPs can in general produce block pool genotypes with thousands of solutions, especially when missing data occur.

In the framework developed by Pirinen [16], which had resulted in HIPPO, the algorithm was able to accommodate prior database information on existing haplotypes in a population. Similarly, our methodology offers a framework that can easily incorporate prior knowledge in the form of known haplotypes from the same population as that from which the target pools were created. When such existing haplotypes are known (such as those available from the HapMap), they can be easily introduced in the form of a prior for the counts in the TDSPool algorithm. The presence of the extra information will improve the frequency estimation accuracy in the target population.

Conclusions

We have introduced a new algorithm for estimating haplotype frequencies from pooled DNA samples using a Tree-Based Deterministic sampling scheme. Algorithms for haplotype frequency estimation from pooled data fall into two categories. The first category consists of algorithms that focus on accurate solutions for small pool sizes and allow for considerably large genotype segments, and the second category consists of algorithms that focus on small segments but allow for a large number of individuals per pool. We have compared our methodology with state-of-the-art algorithms from each category, namely HAPLOPOOL and HIPPO. We have focused on scenarios and datasets in which the use of pooling data is suggested for haplotype frequency estimation according to the study of Kuk et al. [17]. Specifically, our method focuses on scenarios where pools contain 2 or 3 individuals, and we have shown that for such scenarios our method demonstrates comparable or better performance compared with competing algorithms for a small number of loci and outperforms these algorithms for a large number of loci. Furthermore, our TDSPool methodology provides a straightforward framework for incorporating prior database knowledge into the haplotype frequency estimation.

Methods

In the beginning of the section we introduce some notation. We then present the prior and posterior distribution given the data and derive the state update equations for the TDSPool estimator. We further present the modified partition-ligation procedure adjusted for the pooled data so that we are able to handle larger haplotype vectors and we finally give a summary of the proposed procedure.

Definitions and notation

Suppose we are given a set of pooled DNA measurements on L diallelic loci. We denote the two alleles at each locus by 0 and 1, for convenience of representation. Following the common notation, we use the counts of allele 1 as the measurement at each locus on each pooled DNA sample; these counts can be converted from the estimated allele frequencies and constitute the pool genotype. Therefore, if the size of a pool is N individuals, the count at each locus can vary between 0 and 2N.

Suppose that we have T such pools, each of size N_t, t = 1, …, T. We denote α_t = {α_t¹, …, α_t^L} to be the pool genotype of the t-th pool, where α_tⁱ ∈ {0, …, 2N_t}. Suppose also that A_t = {α₁, …, α_t} is the set of pool genotypes of pools up to and including pool t, and let A denote the full set of pool genotypes. In pool t we denote the haplotypes occurring in that pool as h_t = {h_t,1, …, h_t,2N_t}, where h_t,i ∈ {0, 1}^L is a binary string of length L and the minor allele is present in position j in haplotype i if h_t,i,j = 0. We further define H_t = {h₁, …, h_t}, similarly to A_t, as the set of haplotypes for each genotype pool up to and including pool t. A schematic representation of the dataset and the notation used is given in Figure 4.

Let us also define Z = {z₁, …, z_M}, where z_m ∈ {0, 1}^L is a binary string of length L in which 0 and 1 correspond to the two alleles at each locus, as the set containing all haplotype vectors of length L that are consistent with any pool genotype in the set A. To obtain Z from the given dataset A, we first enumerate for each α_i, i = 1, …, T, the subset ψ_i = {h_i¹, …, h_i^Y} that contains all possible haplotype assignments which are consistent with α_i. The set Z is then given simply by Z = ∪_{i=1}^{T} ψ_i. A set of population haplotype frequencies θ = {θ₁, …, θ_M} is also associated with the set Z of all possible haplotype vectors, where θ_m is the probability with which the haplotype z_m occurs in the total population.
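The enumeration of ψ_i and the construction of Z can be illustrated with a brute-force sketch. It assumes that a haplotype assignment is an unordered collection of 2N binary strings whose per-locus sums equal the pool genotype, and that all pools have the same size; the function names are hypothetical and the actual TDSPool implementation may enumerate solutions differently. The cost grows quickly with the number of loci, which is one reason for working on short blocks.

```python
from itertools import combinations, product


def consistent_configurations(genotype, pool_size):
    """All unordered sets of 2N haplotypes whose allele-1 counts match genotype."""
    n_hap = 2 * pool_size
    # For each locus, every way of choosing which haplotypes carry allele 1.
    per_locus = [list(combinations(range(n_hap), count)) for count in genotype]
    configs = set()
    for choice in product(*per_locus):
        haps = ["".join("1" if i in carriers else "0" for carriers in choice)
                for i in range(n_hap)]
        configs.add(tuple(sorted(haps)))  # order within a pool is irrelevant
    return sorted(configs)


def candidate_haplotypes(genotypes, pool_size):
    """The set Z: every haplotype appearing in some consistent configuration."""
    return sorted({h
                   for g in genotypes
                   for cfg in consistent_configurations(g, pool_size)
                   for h in cfg})


# One individual per pool (2 haplotypes), two loci, one copy of allele 1 at each.
print(consistent_configurations([1, 1], pool_size=1))
```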

Probabilistic model

Assuming random mating in the population, it is clear that the number of each unique haplotype in H is drawn from a multinomial distribution based on the haplotype frequency θ [20]. This leads us to the use of the Dirichlet distribution as the prior distribution for θ [21], so that θ ∼ D(ρ₁, …, ρ_M), with mean $E\{\theta_i\} = \frac{\rho_i}{\sum_{j=1}^{M} \rho_j}$.

Before we calculate the posterior distribution for θ we note here that

$$
p(\alpha_t \mid h_t = (h_{t,1}, \dots, h_{t,2N_t})) =
\begin{cases}
1 & \text{if } \alpha_t \text{ and } h_t \text{ are consistent} \\
0 & \text{otherwise}
\end{cases}
$$

and similarly

$$
p(A_t \mid H_t) =
\begin{cases}
1 & \text{if } A_t \text{ and } H_t \text{ are consistent} \\
0 & \text{otherwise}
\end{cases}
$$

Calculating the posterior distribution for θ we have:

$$
\begin{aligned}
p(\theta \mid A_t, H_t, Z) &\propto p(\alpha_t \mid h_t = (h_{t,1}, \dots, h_{t,2N_t}), \theta, A_{t-1}, H_{t-1}) \\
&\quad \times p(h_t = (h_{t,1}, \dots, h_{t,2N_t}) \mid \theta, A_{t-1}, H_{t-1}, Z) \, p(\theta \mid A_{t-1}, H_{t-1}) \\
&\propto p(h_t = (h_{t,1}, \dots, h_{t,2N_t}) \mid \theta, Z) \, p(\theta \mid A_{t-1}, H_{t-1}, Z) \\
&\propto \prod_{i=1}^{2N_t} \theta_{h_{t,i}} \prod_{m=1}^{M} \theta_m^{\rho_m(t-1) - 1} \\
&\propto \prod_{m=1}^{M} \theta_m^{\rho_m(t-1) - 1 + \sum_{i=1}^{2N_t} I(z_m - h_{t,i})} \\
&\propto D\Big(\rho_1(t-1) + \sum_{i=1}^{2N_t} I(z_1 - h_{t,i}), \dots, \rho_M(t-1) + \sum_{i=1}^{2N_t} I(z_M - h_{t,i})\Big)
\end{aligned}
$$

(1)

where we denote by ρ_m(t), m = 1, …, M, the parameters of the distribution of θ after the t-th pool, and I(z_m − h_t,i), with i = 1, …, 2N_t, is the indicator function which equals 1 when z_m − h_t,i is a vector of zeros, and 0 otherwise.

We have shown that the posterior distribution for θ is also Dirichlet with parameters as given in (1) and depends only on the sufficient statistics, T_t = {ρ_m(t), 1 ≤ m ≤ M} which can be easily updated based on T_t−1, h_t, α_t as given by (1) i.e. T_t = T_t(T_t−1, h_t, α_t).
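In computational terms, the update in (1) amounts to adding the number of copies of each haplotype observed in pool t to the corresponding Dirichlet parameter. A small sketch, with hypothetical function names and an illustrative symmetric prior ρ_m(0) = 1 (the prior values are an assumption made for this example):

```python
from collections import Counter


def update_dirichlet(rho, pool_haplotypes):
    """Return rho(t) given rho(t-1) and the 2N_t haplotypes assigned to pool t."""
    new_rho = dict(rho)
    for h, n in Counter(pool_haplotypes).items():
        new_rho[h] += n  # one count per copy of haplotype h in the pool
    return new_rho


def posterior_mean(rho):
    """Mean of the Dirichlet distribution, E{theta_m} = rho_m / sum_j rho_j."""
    total = sum(rho.values())
    return {h: r / total for h, r in rho.items()}


rho0 = {"00": 1.0, "01": 1.0, "10": 1.0, "11": 1.0}   # illustrative prior
rho1 = update_dirichlet(rho0, ("00", "00", "01", "11"))
print(rho1, posterior_mean(rho1))
```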

Inference problem

Following the notation used in the previous subsections, we can summarize the frequency estimation problem as follows: given A = {α₁, …, α_T}, the set of observed pool genotype vectors, and Z = {z₁, …, z_M}, the set of haplotypes compatible with the pool genotypes in A, we wish to infer H = {h₁, …, h_T}, the unknown haplotypes in each pool, and θ = {θ₁, …, θ_M}, the frequencies of all the haplotypes occurring in the population.

Computational algorithm (TDSPool)

Similar to traditional Sequential Monte Carlo (SMC) methods, we assume that by the time we have processed pool genotype α_t−1 we have K sets of solution streams (i.e. sets of candidate haplotypes for pools 1, …, t−1) and their associated weights $\{(H_{t-1}^{(k)}, w_{t-1}^{(k)}), k = 1, \dots, K\}$, properly weighted with respect to the posterior distribution p(H_t−1|A_t−1).

Given the set of solution streams and the associated weights we approximate the distribution p(H_t−1|A_t−1) as follows:

$$
\hat{p}(H_{t-1} \mid A_{t-1}) = \frac{1}{W_{t-1}} \sum_{k=1}^{K} w_{t-1}^{(k)} \, I(H_{t-1} - H_{t-1}^{(k)})
$$

(2)

where $W_{t-1} = \sum_{k=1}^{K} w_{t-1}^{(k)}$ and I(·) is the indicator function such that I(x − y) = 1 for x = y and I(x − y) = 0 otherwise.

When we process the pool genotype t we would like to make an online inference of the haplotypes H_t based on the pool genotypes A_t. Let us further assume that there are K^ext possible haplotype solutions compatible with the genotype of the t-th pool, i.e., h_tⁱ, i = 1, …, K^ext.

Before we move to the derivation of the state update equation, we note that in the following we will use the fact that for the unknown parameters θ, as we have shown in the "Probabilistic model" subsection, under certain assumptions the prior and posterior distributions are Dirichlet and depend only on a set of sufficient statistics T_t = T_t(T_t−1, h_t, α_t).

Therefore, from Bayes’ theorem we have:

$$
\begin{aligned}
p(H_t \mid A_t, Z) &\propto p(\alpha_t \mid H_t, A_{t-1}) \, p(h_t \mid H_{t-1}, A_{t-1}, Z) \, p(H_{t-1} \mid A_{t-1}, Z) \\
&\propto p(H_{t-1} \mid A_{t-1}, Z) \int p(\alpha_t \mid h_t, \theta) \, p(\theta \mid h_t, H_{t-1}, A_{t-1}, Z) \, d\theta \int p(h_t \mid H_{t-1}, \theta, Z) \, p(\theta \mid T_{t-1}, Z) \, d\theta \\
&\propto p(H_{t-1} \mid A_{t-1}, Z) \int p(h_t \mid H_{t-1}, \theta, Z) \, p(\theta \mid T_{t-1}, Z) \, d\theta \\
&\propto p(H_{t-1} \mid A_{t-1}, Z) \int \Big(\prod_{i=1}^{2N_t} \theta_{h_{t,i}}\Big) \, p(\theta \mid T_{t-1}, Z) \, d\theta \\
&\propto p(H_{t-1} \mid A_{t-1}, Z) \, E_{\theta \mid T_{t-1}} \Big\{\prod_{i=1}^{2N_t} \theta_{h_{t,i}}\Big\} \\
&\propto p(H_{t-1} \mid A_{t-1}, Z) \left[\frac{\prod_{i=1}^{2N_t} \rho_{h_{t,i}}(t-1)}{\big(\sum_{m=1}^{M} \rho_m(t-1)\big)^{2N_t}}\right]
\end{aligned}
$$

(3)

where $\rho_{h_{t,i}}(t-1) = \{\rho_{z_m}(t-1) : h_{t,i} = z_m\}$.

Assuming that we have approximated p(H_t−1|A_t−1) as in (2), we can approximate p(H_t|A_t) using (3) as $\hat{p}^{ext}(H_t \mid A_t) = \frac{1}{W_t^{ext}} \sum_{k=1}^{K} \sum_{i=1}^{K^{ext}} w_t^{(k,i)} \, I(H_t - [H_{t-1}^{(k)}, (h_{t,1}^{i}, \dots, h_{t,2N_t}^{i})])$.

The weight update formula is given by

$$
w_t^{(k,i)} \propto w_{t-1}^{(k)} \, \frac{\prod_{j=1}^{2N_t} \rho^{(k)}_{h^{i}_{t,j}}(t-1)}{\big(\sum_{m=1}^{M} \rho_m^{(k)}(t-1)\big)^{2N_t}}
$$

(4)
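The weight update (4) can be written directly as a product over the 2N_t haplotypes of a candidate extension, using the Dirichlet parameters that stream k carried before pool t. The sketch below uses a hypothetical function name and made-up parameter values.

```python
def extension_weight(prev_weight, rho, pool_haplotypes):
    """Unnormalized weight of extending a stream with one pool configuration.

    rho holds the stream's parameters rho_m(t-1); every haplotype in the
    pool contributes a factor rho_h(t-1) / sum_m rho_m(t-1), as in (4).
    """
    total = sum(rho.values())
    weight = prev_weight
    for h in pool_haplotypes:
        weight *= rho[h] / total
    return weight


rho = {"00": 3.0, "01": 1.0, "10": 1.0, "11": 3.0}   # made-up stream state
print(extension_weight(1.0, rho, ("00", "11")))      # candidate extension 1
print(extension_weight(1.0, rho, ("01", "10")))      # candidate extension 2
```

With these values the first extension receives the larger weight, since it reuses haplotypes that the stream has already encountered.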

Partition-Ligation

In the partition phase the dataset is divided into small segments of consecutive loci. Once the blocks are phased, they are ligated together using a modified extension of the Partition-Ligation (PL) method [21] for the case of pooled data.

In our current implementation, to be able to derive all possible solution combinations for each pool genotype efficiently, we have decided to keep the maximum block length to 4 SNPs. Clearly, the more SNPs are included in a block the more information about the LD patterns we can capture, but at the same time the number of possible combinations increases and becomes prohibitive for more than 5 SNPs. For our experiments, in a dataset with L loci we have considered ⌊L/4⌋ blocks of 4 consecutive loci, and the remaining SNPs were treated as a separate block.
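The partition step described above is simple enough to sketch directly. The function name `partition_loci` and the 0-based locus indexing are choices made for this illustration.

```python
def partition_loci(n_loci, block_len=4):
    """Split loci 0..n_loci-1 into consecutive blocks of block_len loci.

    Any loci left over after the full blocks form one final, shorter block.
    """
    return [list(range(start, min(start + block_len, n_loci)))
            for start in range(0, n_loci, block_len)]


print(partition_loci(10))
```

For a dataset with 10 loci this yields two blocks of 4 loci and one block holding the remaining 2 loci.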

The result of phasing for each block is a set of haplotype solutions for each pool genotype. Two neighbouring blocks are ligated by creating merged solutions for each pool genotype from all combinations of the block solutions, one from each block. When creating a merged solution for a pool genotype from the two separate solutions (one from each block), since we do not know which haplotypes belong to the same chromosome, all different possible assignments are examined. The TDSPool algorithm is then repeated in the same manner as it was for the individual blocks.
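To illustrate the merging of two block solutions for a single pool, the sketch below pairs every haplotype of the left block with a haplotype of the right block in all possible ways and keeps the distinct results. It is a brute-force illustration under the assumption that a block solution is an unordered collection of 2N haplotype strings; the function name is hypothetical and the TDSPool package may organize this step differently.

```python
from itertools import permutations


def merge_block_solutions(left, right):
    """All distinct merged solutions for one pool from two block solutions.

    left and right each hold the 2N haplotype strings of one block. Since
    it is unknown which left haplotype lies on the same chromosome as which
    right haplotype, every pairing is tried.
    """
    merged = set()
    for perm in permutations(right):
        merged.add(tuple(sorted(l + r for l, r in zip(left, perm))))
    return sorted(merged)


print(merge_block_solutions(("00", "11"), ("0", "1")))
```

Each merged solution can then be weighted in the same way as the solutions of an individual block.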

Furthermore, the order in which the individual blocks are ligated is not predetermined. We first ligate the blocks that would produce in each step the minimum entropy ligation. This procedure allows us to ligate first the most homogeneous blocks so that we have more certainty in the solutions that we produce while moving in the ligation procedure.

Summary of the proposed algorithm

Routine 1

Set the current number of streams m = 1. Define K as the maximum number of streams allowed. Define H₀¹ = {}.
For t = 1, 2, …
- Find the K^ext possible haplotype configurations compatible with the pool genotype of the t-th pool.
- For k = 1, 2, …, m and j = 1, …, K^ext:
  - Enumerate all possible particle extensions $H_t^{(k,j)} = [H_{t-1}^{(k)}, (h_{t,1}^{j}, \dots, h_{t,2N_t}^{j})]$.
  - For all j, compute the weights w_t^(k,j) according to (4).
- Select and preserve M = min(K, m·K^ext) distinct sample streams {H_t^(k), k = 1, …, M} with the highest importance weights {w_t^(k), k = 1, …, M} from the set {H_t^(k,j), w_t^(k,j), k = 1, …, m, j = 1, …, K^ext}.
- Update the number of counts of each encountered haplotype in each stream.
- Set m = M.
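The steps of Routine 1 can be put together in a compact Python sketch for a single block. Several details are assumptions made for this illustration and are not taken from the TDSPool package: the symmetric prior value `prior`, the renormalization of the weights after each selection step, the brute-force enumeration of configurations, and the final weighted-average estimate in `estimate_frequencies`. All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations, product


def consistent_configurations(genotype, pool_size):
    """All unordered sets of 2N haplotypes matching a pool genotype."""
    n_hap = 2 * pool_size
    per_locus = [list(combinations(range(n_hap), c)) for c in genotype]
    configs = set()
    for choice in product(*per_locus):
        haps = ["".join("1" if i in carriers else "0" for carriers in choice)
                for i in range(n_hap)]
        configs.add(tuple(sorted(haps)))
    return sorted(configs)


def tds_routine(genotypes, pool_size, max_streams=50, prior=1.0):
    """Tree-based deterministic sampling over the pools of one block."""
    per_pool = [consistent_configurations(g, pool_size) for g in genotypes]
    # Z: every haplotype that appears in some consistent configuration.
    z = sorted({h for cfgs in per_pool for cfg in cfgs for h in cfg})
    # A stream is (weight, configurations so far, Dirichlet parameters rho).
    streams = [(1.0, [], {h: prior for h in z})]
    for cfgs in per_pool:
        extended = []
        for weight, history, rho in streams:
            total = sum(rho.values())
            for cfg in cfgs:
                w = weight
                for h in cfg:
                    w *= rho[h] / total          # weight update, equation (4)
                extended.append((w, history, rho, cfg))
        # Deterministic selection: keep the highest-weight extensions.
        extended.sort(key=lambda e: e[0], reverse=True)
        kept = extended[:max_streams]
        norm = sum(e[0] for e in kept)           # assumption: renormalize
        streams = []
        for w, history, rho, cfg in kept:
            new_rho = dict(rho)
            for h, n in Counter(cfg).items():
                new_rho[h] += n                  # count update, equation (1)
            streams.append((w / norm, history + [cfg], new_rho))
    return streams


def estimate_frequencies(streams):
    """Assumed estimator: weight-averaged Dirichlet means over the streams."""
    freqs = Counter()
    for w, _, rho in streams:
        total = sum(rho.values())
        for h, r in rho.items():
            freqs[h] += w * r / total
    return dict(freqs)


streams = tds_routine([[0, 0], [2, 2], [1, 1]], pool_size=1)
print(streams[0][1], estimate_frequencies(streams))
```

In this toy input the first two pools are unambiguous, and the third pool is resolved in favour of the haplotypes already seen, which is the behaviour the weight update is meant to produce.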

TDSPool ALGORITHM

Partition the genotype dataset G into B subsets.
For b = 1,…,B , apply Routine 1 so that all segments are phased and for each one keep all the solutions contained in the top K particles.
Until all blocks are ligated, repeat the following
- Find the blocks that, if ligated, would produce the minimum entropy.
- Ligate the blocks, following the procedure described in the Partition-Ligation section.

References

Bansal A, van den Boom D, Kammerer S, Honisch C, Adam G, Cantor CR, Kleyn P, Braun A: Association testing by DNA pooling: an effective initial screen. Proc Natl Acad Sci U S A. 2002, 99 (26): 16871-16874. 10.1073/pnas.262671399.
Barcellos LF, Klitz W, Field LL, Tobias R, Bowcock AM, Wilson R, Nelson MP, Nagatomi J, Thomson G: Association mapping of disease loci, by use of a pooled DNA genomic screen. Am J Hum Genet. 1997, 61 (3): 734-747. 10.1086/515512.
Norton N, Williams NM, O'Donovan MC, Owen MJ: DNA pooling as a tool for large-scale association studies in complex traits. Ann Med. 2004, 36 (2): 146-152. 10.1080/07853890310021724.
Pearson JV, Huentelman MJ, Halperin RF, Tembe WD, Melquist S, Homer N, Brun M, Szelinger S, Coon KD, Zismann VL: Identification of the genetic basis for complex disorders by use of pooling-based genomewide single-nucleotide-polymorphism association studies. Am J Hum Genet. 2007, 80 (1): 126-139. 10.1086/510686.
Sham P, Bader JS, Craig I, O'Donovan M, Owen M: DNA Pooling: a tool for large-scale association studies. Nat Rev Genet. 2002, 3 (11): 862-871.
Zuo Y, Zou G, Zhao H: Two-stage designs in case–control association analysis. Genetics. 2006, 173 (3): 1747-1760. 10.1534/genetics.105.042648.
Ito T, Chiku S, Inoue E, Tomita M, Morisaki T, Morisaki H, Kamatani N: Estimation of haplotype frequencies, linkage-disequilibrium measures, and combination of haplotype copies in each pool by use of pooled DNA data. Am J Hum Genet. 2003, 72 (2): 384-398. 10.1086/346116.
Wang S, Kidd KK, Zhao H: On the use of DNA pooling to estimate haplotype frequencies. Genet Epidemiol. 2003, 24 (1): 74-82. 10.1002/gepi.10195.
Yang Y, Zhang J, Hoh J, Matsuda F, Xu P, Lathrop M, Ott J: Efficiency of single-nucleotide polymorphism haplotype estimation from pooled DNA. Proc Natl Acad Sci U S A. 2003, 100 (12): 7225-7230. 10.1073/pnas.1237858100.
Pirinen M, Kulathinal S, Gasbarra D, Sillanpaa MJ: Estimating population haplotype frequencies from pooled DNA samples using PHASE algorithm. Genet Res (Camb). 2008, 90 (6): 509-524. 10.1017/S0016672308009877.
Stephens M, Scheet P: Accounting for decay of linkage disequilibrium in haplotype inference and missing-data imputation. Am J Hum Genet. 2005, 76 (3): 449-462. 10.1086/428594.
Kirkpatrick B, Armendariz CS, Karp RM, Halperin E: HAPLOPOOL: improving haplotype frequency estimation through DNA pools and phylogenetic modeling. Bioinformatics. 2007, 23 (22): 3048-3055. 10.1093/bioinformatics/btm435.
Zhang H, Yang HC, Yang Y: PoooL: an efficient method for estimating haplotype frequencies from large DNA pools. Bioinformatics. 2008, 24 (17): 1942-1948. 10.1093/bioinformatics/btn324.
Kuk AY, Zhang H, Yang Y: Computationally feasible estimation of haplotype frequencies from pooled DNA with and without Hardy-Weinberg equilibrium. Bioinformatics. 2009, 25 (3): 379-386. 10.1093/bioinformatics/btn623.
Gasbarra D, Kulathinal S, Pirinen M, Sillanpaa MJ: Estimating haplotype frequencies by combining data from large DNA pools with database information. IEEE/ACM Trans Comput Biol Bioinform. 2011, 8 (1): 36-44.
Pirinen M: Estimating population haplotype frequencies from pooled SNP data using incomplete database information. Bioinformatics. 2009, 25 (24): 3296-3302. 10.1093/bioinformatics/btp584.
Kuk AY, Xu J, Yang Y: A study of the efficiency of pooling in haplotype estimation. Bioinformatics. 2010, 26 (20): 2556-2563. 10.1093/bioinformatics/btq492.
Barratt BJ, Payne F, Rance HE, Nutland S, Todd JA, Clayton DG: Identification of the sources of error in allele frequency estimations from pooled DNA indicates an optimal experimental design. Ann Hum Genet. 2002, 66 (Pt 5–6): 393-405.
Jain S, Tang X, Narayanan CS, Agarwal Y, Peterson SM, Brown CD, Ott J, Kumar A: Angiotensinogen gene polymorphism at −217 affects basal promoter activity and is associated with hypertension in African-Americans. J Biol Chem. 2002, 277 (39): 36889-36896. 10.1074/jbc.M204732200.
Excoffier L, Slatkin M: Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Mol Biol Evol. 1995, 12 (5): 921-927.
Niu T, Qin ZS, Xu X, Liu JS: Bayesian haplotype inference for multiple linked single-nucleotide polymorphisms. Am J Hum Genet. 2002, 70 (1): 157-169. 10.1086/338446.

Author information

Authors and Affiliations

Center for Computational Biology and Bioinformatics and Department of Electrical Engineering, Columbia University, New York, NY, USA
Alexandros Iliadis, Dimitris Anastassiou & Xiaodong Wang

Corresponding author

Correspondence to Xiaodong Wang.

Additional information

Authors’ contributions

All authors contributed equally to this work. All authors read and approved the final manuscript.

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Iliadis, A., Anastassiou, D. & Wang, X. Fast and accurate haplotype frequency estimation for large haplotype vectors from pooled DNA data. BMC Genet 13, 94 (2012). https://doi.org/10.1186/1471-2156-13-94

Received: 30 May 2012
Accepted: 09 October 2012
Published: 30 October 2012
DOI: https://doi.org/10.1186/1471-2156-13-94
