Abstract
Background
Pooling is a cost-effective way to collect data for genetic association studies, particularly for rare genetic variants. It is of interest to estimate the haplotype frequencies, which contain more information than single-locus statistics. By viewing the pooled genotype data as incomplete data, the expectation-maximization (EM) algorithm is the natural algorithm to use, but it is computationally intensive. A recent proposal to reduce the computational burden is to make use of database information to form a list of frequently occurring haplotypes, and to restrict the haplotypes to come from this list only in implementing the EM algorithm. There is, however, the danger of using an incorrect list, and in some applications there may not be enough database information to form a list externally.
Results
We investigate the possibility of creating an internal list from the data at hand. One way to form such a list is to collapse the observed total minor allele frequencies to “zero” or “at least one”, which is shown to have the desirable effect of amplifying the haplotype frequencies. To improve coverage, we propose ways to add and remove haplotypes from the list, and a benchmarking method to determine the frequency threshold for removing haplotypes. Simulation results show that the EM estimates based on a suitably augmented and trimmed collapsed data list (ATCDL) perform satisfactorily. In two scenarios involving 25 and 32 loci respectively, the EM-ATCDL estimates outperform the EM estimates based on other lists as well as the collapsed data maximum likelihood estimates.
Conclusions
The proposed augmented and trimmed CD list is a useful list for the EM algorithm to base upon in estimating the haplotype distributions of rare variants. It can handle more markers and larger pool sizes than existing methods, and the resulting EM-ATCDL estimates are more efficient than the EM estimates based on other lists.
Keywords:
Collapsed data; EM algorithm; Haplotype frequency estimation; Haplotype list; Pooled genotype data; Rare variants
Background
In statistical genetics, the haplotype distribution is the joint distribution of the allele types at, say, L loci. We will focus on biallelic loci in this article so that each haplotype vector is a vector of binary values, and the haplotype distribution is a multivariate binary distribution. The importance of haplotypes is well documented [1-3] and reinforced more recently by the works of Muers [4] and Tewhey et al.[5]. By incorporating linkage disequilibrium information from multiple loci, haplotype-based inference can lead to more powerful tests of genetic association than single-locus analyses. Haplotype distributions are usually estimated from individual genotype data, where a genotype is the sum of the maternal and paternal haplotype vectors of an individual. As reviewed by Niu [6] and Marchini et al.[7], statistical approaches to haplotype inference based on individual genotype data are effective and cost-efficient. These include the expectation-maximization (EM) type algorithms for finding maximum likelihood estimates (MLE) [8], and the Bayesian PHASE algorithm [9]. Since DNA pooling is a popular and cost-effective way of collecting data in genetic association studies [10-14], the EM algorithm and its variants have been extended by various authors [15-18] to handle pooled genotype data (i.e., the sum of all 2k haplotype vectors of all k individuals in a pool), whereas Pirinen et al.[19], Gasbarra et al.[20] and Pirinen [21] have extended Bayesian algorithms using Markov chain Monte Carlo (MCMC) or reversible jump MCMC schemes. Also from a Bayesian perspective, Iliadis et al.[22] conduct deterministic tree-based sampling instead of MCMC sampling, but their algorithm is feasible for small pool sizes only, even though the block size can be arbitrary. Despite the falling costs of genotyping, the popularity of the pooling strategy has not waned, with Kim et al.[23] and Liang et al.[24] advocating the use of pooling for next-generation sequencing data.
The importance of pooling increases with the recent surge of interest in rare variant analysis based on resequencing data [25] to explain missing heritability [26] and diseases that cannot be explained by common variants. Roach et al.[27] predict that “haplotypes that include rare alleles … will play an increasingly important role in understanding biology, health, and disease”. Perhaps more so than in the analysis of common variants, pooling has an important role to play in the analysis of rare variants. This is because the standard methods for testing genetic association are underpowered for rare variants due to insufficient sample size as only a small percentage of study subjects would carry a rare mutation, and pooling is a way to increase the chance of observing a rare mutation. By using a pooling design, we could include more individuals in a study at the same genotyping cost. The study by Kuk et al.[28] shows that pooling does not lead to much loss of estimation efficiency relative to no pooling when the alleles are rare.
Our focus is on developing computationally feasible EM-type algorithms to estimate haplotype frequencies of rare variants from pooled genotype data. There are two main impediments to the use of the EM algorithm in estimating a haplotype distribution from pooled genotype data. First, the number of putative haplotypes grows exponentially with the number of loci. Secondly, things get worse as the pool size increases, since the number of individual haplotype configurations compatible with the observed pool totals quickly becomes astronomical. As a result, the EM algorithm can only be applied to data with a small to moderate number of loci and pool size. For example, Gasbarra et al.[20] commented that without prior knowledge or restriction on the possible haplotypes, existing algorithms cannot handle the case of 21 loci with pool size 6. We have recorded running times of 1862 and 2900 seconds on an Intel(R) Core(TM) desktop when the traditional EM algorithm is applied to pooled genotype data with 12 loci for 74/37 pools of size 2/4 each. Gasbarra et al.[20] advocate the use of database information to create a list of frequently occurring haplotypes. By combining this idea of using database information to create a list with a normal approximation [17] for the density of the pooled allele frequencies, Pirinen [21] proposed an AEML (Approximate EM with List) algorithm which runs much faster than the unrestricted EM algorithm.
We do not assume the existence of an external list for two reasons. First, database information for rare alleles is currently still lacking. Secondly, an EM-type algorithm restricted to a list is sensitive to the correct choice and completeness of the external list used. Instead, we use the data on hand to construct an internal list. Motivated by the collapsed data estimation method developed by Kuk et al.[29], which only keeps track of whether an allele count is “0” or “≥1”, we propose a collapsed data (CD) list of possible haplotypes. It will be shown in the Methods section that for rare genetic variants, the CD list has inflated probabilities of capturing the true underlying haplotypes. To improve coverage, we augment the CD list by adding those haplotypes with only one “1” (i.e., only one rare variant occurs) to arrive at an augmented CD (ACD) list. The EM algorithm restricted to the ACD list still does not perform satisfactorily in our simulation studies, apparently due to the inclusion of too many false haplotypes. In response, we propose an ATCD (augmented and trimmed CD) list, where those haplotypes with estimated frequencies lower than a threshold at each iteration of the algorithm are removed from the list. We propose a method to select the threshold by benchmarking the resulting EM estimate of the frequency of the ancestral haplotype of all zeros (i.e., no variant occurs) against the corresponding estimate obtained using the collapsed data method of Kuk et al.[29]. To assess the performance of the various estimators, we simulate genotype data resembling those collected for the 148 obese individuals in the CRESCENDO cohort study (http://clinicaltrials.gov/ct/show/NCT00263042), at 25 loci near the MGLL gene on chromosome 3, and 32 loci near the FAAH gene on chromosome 1. The EM estimates based on the CD list and the ACD list do not perform well in the simulation study.
In particular, they overestimate the frequency of the ancestral haplotype of all zeros. The EM estimates based on the ATCD list, on the other hand, perform very well. In the two scenarios involving 25 and 32 loci, the EM-ATCDL estimates outperform the EM estimates based on other lists as well as the collapsed data maximum likelihood estimates. We conclude that the augmented and trimmed CD list is a useful list for the EM algorithm to base upon in estimating the haplotype distributions of rare variants.
Results
To identify rare genetic variants associated with obesity, investigators of the CRESCENDO cohort study obtained resequenced data for 148 obese persons and 150 controls around two genes known to be involved in endocannabinoid metabolism: FAAH on chromosome 1, and MGLL on chromosome 3. There are 31 Kbp of resequenced data near the FAAH gene, and 157 Kbp near the MGLL region. Bhatia et al.[30] discovered two 5 Kbp regions enriched in rare variants (RVs) located just upstream of the FAAH and MGLL genes respectively, with 32 RVs in the first region, and 25 RVs in the second region. To estimate the underlying haplotype distributions, we apply the algorithms proposed in this paper, as well as the EM with a list (EML) method described in Kuk et al.[29], where the list is determined combinatorially. The collapsed data maximum likelihood estimates (CDMLE) are also computed. To save space, we only report the estimates based on the obese individuals, which is the more interesting case, as there are very few mutations among the control subjects. Table 1 reports the CDMLEs, as well as the estimates obtained using the EML, EM-CDL (EM with CD list) and EM-ATCDL (EM with augmented and trimmed CD list) algorithms for the 25 loci case. The estimates on the left panel (k=1) are based on individual genotype data, whereas the right panel (k=2) estimates are based on pooled genotype data that result from grouping the 148 obese individuals into 74 pools of size 2 each. Obviously, the estimates based on 148 pools of size 1 (i.e., individual genotype data) should be more reliable than those based on 74 pools of size 2, and so we use the estimates on the left panel of Table 1 as the benchmark. It is interesting to note that as the pool size k increases to 2, the CDMLE, EML and EM-CDL estimates remove some haplotypes that are assigned probabilities in the k=1 case, and in their place, some other haplotypes not present in the k=1 case are assigned probabilities in the k=2 case.
We will see later in the Methods section that it is an inherent property of the CD list to include extraneous false haplotypes as the pool size increases. By augmenting and trimming the CD list in the proposed way, the EM-ATCDL estimates based on k=1 and 2 are much more comparable, with similar support, which is desirable.
Table 1. Haplotype frequency estimates in the MGLL region using data from 148 obese individuals
Table 2 reports the running times of the various algorithms. It can be seen that the EML algorithm takes longer to run than EM-CDL and EM-ATCDL, and is computationally prohibitive (takes longer than 10 hours on an Intel(R) Core(TM) 2 desktop) when the pool size is k=4 in both the 25 and 32 loci cases. Both EM-CDL and EM-ATCDL remain computationally feasible when k=4. Understandably, EM-CDL is a bit faster to run as no augmentation and trimming is involved.
Table 2. Running times of EM algorithms based on different lists
To facilitate comparison of estimators in situations similar to those under which the original data were collected, we simulate haplotype data from the MGLL region (25 loci) and FAAH region (32 loci) according to the haplotype distributions listed as “true” in Tables 3 and 4. These distributions are actually the haplotype distributions estimated using EM-CDL from the individual genotype data of the 148 cases of the CRESCENDO cohort study, but we will treat them as the true distributions in our simulation study. Thus there are only 22 possible haplotypes for the 25 loci case, and 32 haplotypes for the 32 loci case. After generating the haplotypes, we form n pools of 2k haplotypes each (n=100,200; k=1,2,3,4) and the resulting pooled genotype data are treated as the observed data from which estimates are constructed. The results reported in Tables 3 and 4 are based on 100 simulations. The gold standard that we use is the EM-PL estimator, which assumes knowledge of the perfect list (i.e., knowing exactly which haplotypes y have f(y)>0). Because the perfect list is used, the EM algorithm in this case will yield the MLE based on the pooled genotype data. We will not have such knowledge in reality, and so our real interest is in comparing the performance of the following estimators: CDMLE (collapsed data MLE), EML (EM with combinatorially determined list), EM-CDL (CD list), EM-ACDL (augmented CD list), EM-ATCDL (augmented and trimmed CD list), and EM-TCDL (CD list with trimming and no augmentation). For removing haplotypes from both the ATCD and TCD lists, we try threshold values from 0.0001 to 0.002 in steps of 0.0001, and select the threshold that yields an estimate of f(0) as close as possible to the collapsed data MLE of f(0). Based on the study of Kuk et al.[29], the collapsed data MLE of f(0) seems to be a reasonable benchmark to use. In fact, we can see from Tables 3 and 4 that its average (over 100 simulations) is always close to the average of the gold standard EM-PL estimate of f(0), and this lends further support to its use as a benchmark.
We have simulated data for k=1,2,3,4. As the EML algorithm takes too long to run (see Table 2), we compute the EML estimates for k=1 and 2 only. To save space, we only report the results for k=2 and 4 in Tables 3 and 4. The results for EM-CDL and EM-ACDL are close, and so we tabulate the results of EM-CDL only. In order not to make the tables unduly long, we tabulate only the averages of the estimates f̂(y) for those y with f(y)>0, together with the sum of f̂(y) over the remaining y’s, as well as the averages over simulations of the sum of squared errors ∑_{y∈Ω}{f̂(y)−f(y)}^{2}, Ω={0,1}^{L}, for the various estimators of f(y). To supplement Tables 3 and 4, we plot the simulated averages of the sum of squared errors against pool size k for all 7 estimators, including EM-PL.
Table 3. Average estimates of haplotype frequencies for a 25 loci case
Table 4. Average estimates of haplotype frequencies for a 32 loci case
It can be seen from Tables 3 and 4 that EM-CDL overestimates the frequency f(0) of the ancestral haplotype quite severely, and it has the largest sum of squared errors among all the estimators. The performance of EML is very similar to that of EM-CDL (both unsatisfactory) but its computational cost is much higher. It suffers from assigning small probabilities to too many false haplotypes. For example, for the 25 loci case with n=100, k=2, the EML list on average contains 116 haplotypes even though the true distribution is concentrated on 22 haplotypes. The total probability that the EML estimator attaches to haplotypes outside of the true 22 is only 0.0248 on average. This foretells the need for trimming, a point we will come back to later.
Augmenting the CD list did not help much, as the results for EM-ACDL are almost the same as those of EM-CDL when n=200, and only slightly better when n=100 (not shown in the tables, but this can be seen from Figures 1 and 2). Trimming in addition to augmenting the CD list improved things a lot, as demonstrated by the good results of EM-ATCDL in both Tables 3 and 4. From Figures 1 and 2, we can see that EM-ATCDL is clearly the best estimator among those considered, other than the perfect list estimator, which is not a legitimate estimator. Since augmenting alone did not improve results much, but trimming in addition to augmenting did, we were curious to see whether trimming alone would work. As expected, we can see from Tables 3 and 4 that the TCD list (trimming without augmentation) is on average shorter than the ATCD list. Consequently, the TCD list misses more true haplotypes, and the sum of probabilities of the missed haplotypes is higher for EM-TCDL than for EM-ATCDL, more so for the 32 loci case and when the number of pools is 100 rather than 200. In particular, the sum of probabilities of the missed haplotypes for the 32 loci case with k=4 is 0.0798 (after averaging over simulations) when n=100, and improves slightly to 0.0553 when n=200. The corresponding figures for EM-ATCDL are 0.0328 and 0.0222. In terms of sum of squared errors, EM-TCDL is also inferior to EM-ATCDL for the 32 loci case, particularly when n=100.
Figure 1. Expected sum of squared errors of various haplotype frequency estimators for a 25 loci case. Expected sum of squared errors of various haplotype frequency estimators (EM-CDL: EM with CD list; EM-ACDL: augmented CD list; EML: EM with combinatorially determined list; CDMLE: collapsed data MLE; EM-TCDL: CD list with trimming and no augmentation; EM-ATCDL: augmented and trimmed CD list; EM-PL: EM with perfect list) based on 100 simulations of (a) n=100 and (b) n=200 pools of k individuals each when the true haplotype distribution over 25 loci is as given in Table 3.
Figure 2. Expected sum of squared errors of various haplotype frequency estimators for a 32 loci case. Expected sum of squared errors of various haplotype frequency estimators (EM-CDL: EM with CD list; EM-ACDL: augmented CD list; EML: EM with combinatorially determined list; CDMLE: collapsed data MLE; EM-TCDL: CD list with trimming and no augmentation; EM-ATCDL: augmented and trimmed CD list; EM-PL: EM with perfect list) based on 100 simulations of (a) n=100 and (b) n=200 pools of k individuals each when the true haplotype distribution over 32 loci is as given in Table 4.
The collapsed data MLE advocated by Kuk et al.[29] behaves very similarly to the gold standard EM-PL estimator in terms of bias or expected value, but it suffers from having a larger variance, especially for larger pool sizes. In contrast, the EM-CDL estimates have small variance but large bias. By benchmarking against the CDMLE, the EM-ATCDL estimates have smaller bias than EM-CDL and smaller variance than the CDMLE. The main advantage of the collapsed data MLE is its simplicity and small bias. As shown by Kuk et al.[29], the loss in efficiency due to collapsing the pooled genotype data locus-wise to just “0” and “≥1” is not large for small pool sizes (especially when k=1, which corresponds to individual genotype data) and rare alleles, but it is better to use EM-ATCDL if k≥2.
To further assess whether our benchmarking method of determining the threshold for the removal of haplotypes is reasonable, we also compute the EM-ATCDL estimates based on fixed thresholds in our simulation study to find out which threshold is “optimal”. Figures 3 and 4 depict the averages of the sum of squared errors ∑_{y∈Ω}{f̂(y)−f(y)}^{2}, Ω={0,1}^{L}, over 100 simulations for the EM-ATCDL estimates as a function of the threshold value. The position of the “optimal” threshold which minimizes the averaged sum of squared errors is depicted by the vertical dashed line, whereas the average of the adaptively chosen thresholds (obtained by minimizing the distance between the EM-ATCDL estimate of f(0) and the collapsed data MLE of f(0) over the grid 0.0001 to 0.002 in steps of 0.0001) is depicted by the dotted vertical line. It can be seen that the averages of the adaptively chosen thresholds are quite close to the “optimal” thresholds, which lends support to the proposed adaptive method.
Figure 3. Expected sum of squared errors of the EM-ATCDL estimator with fixed threshold (25 loci case). Expected sum of squared errors of the EM-ATCDL estimator for various choices of the threshold based on 100 simulations of n=200 pools of (a) k=2 and (b) k=4 individuals each when the true haplotype distribution over 25 loci is as given in Table 3; optimal threshold: the threshold obtained by minimizing the averaged sum of squared errors; average adaptive threshold: the average of the adaptively chosen thresholds obtained by minimizing the distance between the EM-ATCDL estimate of f(0) and the collapsed data MLE of f(0) over the grid 0.0001 to 0.002 in steps of 0.0001.
Figure 4. Expected sum of squared errors of the EM-ATCDL estimator with fixed threshold (32 loci case). Expected sum of squared errors of the EM-ATCDL estimator for various choices of the threshold based on 100 simulations of n=200 pools of (a) k=2 and (b) k=4 individuals each when the true haplotype distribution over 32 loci is as given in Table 4; optimal threshold: the threshold obtained by minimizing the averaged sum of squared errors; average adaptive threshold: the average of the adaptively chosen thresholds obtained by minimizing the distance between the EM-ATCDL estimate of f(0) and the collapsed data MLE of f(0) over the grid 0.0001 to 0.002 in steps of 0.0001.
Discussion and conclusions
The EM algorithm for estimating haplotype frequencies from pooled genotype data is computationally infeasible when the number of loci and/or the pool size is large, due to the combinatorial challenge of finding all possible haplotypes that are compatible with the observed pool totals. Gasbarra et al.[20] raised the possibility of using database information to form a list of frequently occurring haplotypes, and by restricting attention to only those haplotypes on such a list, Pirinen [21] made the EM algorithm much more viable. The success of the EM with a list method is, however, dependent on the correctness of the list used. In the absence of an external list of possible haplotypes, especially for rare alleles for which there is not a lot of database information, and to protect against using the wrong list, we look at the feasibility of using the data at hand to create an internal list of possible haplotypes to be fed into the EM algorithm. Motivated by the collapsed data method studied by Kuk et al.[29], we propose a CD list with amplified haplotype frequencies. This alone does not work well, but with appropriate augmentation and trimming, the resulting EM-ATCDL algorithm performs very well in our simulation study. It should be pointed out that even though the ATCD list originates from the CD list, which is based on collapsed data (a further reduction of the pooled genotype data), the EM-ATCDL estimates themselves are computed using the pooled data, which explains why they are better than the collapsed data MLEs. The simulation results also suggest that augmenting the collapsed data list alone, or trimming the list alone, is not good enough, and it is necessary to do both. The average lengths of the various lists are also shown in Tables 3 and 4. We can see that the average length of the ATCD list ranges from 20 (k=1,n=100) to 30 (k=4,n=200) for the 25 loci case, and from 28 (k=1,n=100) to 36 (k=4,n=200) for the 32 loci case.
Without using a list, there are 2^{25}≈3.4×10^{7} and 2^{32}≈4.3×10^{9} possible haplotypes. Thus by using the ATCD list, we can restrict our attention to only 20 to 40 haplotypes, hence the huge savings in running time. It can also be seen from Tables 3 and 4 that making a list longer does not guarantee better results, as the EML and CD lists are much longer than the ATCD list but the resulting estimates are much worse. What seems important is to add the right haplotypes and remove unnecessary ones. If an imperfect external list exists, then a sensible hybrid method is to combine it with the collapsed data list to form a union list, which can be further augmented and trimmed using the techniques described in this paper.
Currently we are only adding haplotypes with a single “1” to the list, which seems reasonable for the study of rare variants, but one can conceivably also add haplotypes with two 1’s to the list. This will increase the number of possibilities substantially during the first iteration of the EM algorithm, but most of these haplotypes will be removed after one iteration.
The signs are promising that the use of the ATCD list can push the limit of the EM algorithm in terms of the number of loci and the pool size that it can handle. This method is particularly well suited to estimating the haplotype distributions of rare variants, which are of substantial current interest. Note that our method does not require sampling, and is shown in our simulation study to work for the case of 32 loci and pool size 4, which is beyond the scope of most sampling-based methods, MCMC or deterministic.
Methods
Definitions and notation
Focusing on biallelic loci, the two possible alleles at each locus can be represented by “1” (the minor or variant allele) and “0” (the major allele). As a result, the alleles at selected loci of a chromosome can be represented by a binary haplotype vector. Since human chromosomes come in pairs, there are 2 haplotype vectors for each individual, one maternal, and one paternal. Suppose we have n pools of k individuals each so that there are K=2k haplotypes within each pool. Denote by Y_{ij}=(Y_{1ij},⋯,Y_{Lij}) the j^{th} haplotype in the i^{th} pool, where i=1,⋯,n, j=1,⋯,K, and L is the number of loci to be genotyped. Assuming Hardy-Weinberg equilibrium, the nK haplotype vectors are independent and identically distributed with probability function

f(y)=f(y_{1},⋯,y_{L})=P(Y_{1ij}=y_{1},⋯,Y_{Lij}=y_{L}) (1)

for every L-tuple y=(y_{1},⋯,y_{L}) belonging to the Cartesian product Ω={0,1}^{L}. With pooling, the observed data are the pool totals

T_{i}=(T_{1i},⋯,T_{Li}), where T_{li}=∑_{j=1}^{K}Y_{lij}, l=1,⋯,L, i=1,⋯,n.
The probability function p(t_{1},⋯,t_{L}) of each pool total is given by the K-fold convolution of the haplotype probability function f(y_{1},⋯,y_{L}), and so the likelihood based on the observed pooled data is highly intractable and difficult to maximize directly.
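To make the convolution concrete, the following sketch (a toy illustration of ours, not the authors' implementation; all names are our own) computes p(t) exactly for a small haplotype distribution by convolving f with itself K=2k times:

```python
def pool_total_dist(f, k):
    """Distribution p(t) of the pool total T, the sum of K = 2k i.i.d.
    haplotype vectors drawn from f (a dict mapping 0/1-tuples to
    probabilities). Computed as the K-fold convolution of f; feasible
    for toy sizes only, which is exactly the intractability point."""
    K = 2 * k
    L = len(next(iter(f)))
    p = {(0,) * L: 1.0}                 # point mass at the zero total
    for _ in range(K):                  # convolve with f, K times
        new = {}
        for t, pt in p.items():
            for y, fy in f.items():
                s = tuple(a + b for a, b in zip(t, y))
                new[s] = new.get(s, 0.0) + pt * fy
        p = new
    return p

# Single-locus example: f(0)=0.7, f(1)=0.3, pool size k=1 (K=2 haplotypes).
p = pool_total_dist({(0,): 0.7, (1,): 0.3}, k=1)
```

The state space of p grows like (K+1)^{L}, which illustrates why direct maximization of the pooled-data likelihood is impractical for realistic L and k.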
Kuk et al.[29] defined the collapsed data via indicator functions as

Z_{i}=(Z_{1i},⋯,Z_{Li}), where Z_{li}=I(T_{li}≥1), l=1,⋯,L, i=1,⋯,n.

Note that what Z_{i} does is to collapse each total allele frequency to either “0” (coded as 0) or “at least 1” (coded as 1), as done in classical group testing [31]. From here on, we will call {Y_{ij},i=1,⋯,n,j=1,⋯,K} the complete haplotype data (usually not observed); {T_{i},i=1,⋯,n} the pooled genotype data (which reduces to individual genotype data if the pool size is 1); and {Z_{i},i=1,⋯,n} the collapsed data. In this paper, we refer to k as the pool size, not K.
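As a quick illustration of this two-stage data reduction (a sketch with toy values of our choosing; the variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, L = 4, 2, 6            # pools, individuals per pool, loci (toy sizes)
K = 2 * k                    # haplotypes per pool

# Complete haplotype data Y (usually unobserved): K binary vectors per pool.
Y = rng.binomial(1, 0.1, size=(n, K, L))

# Pooled genotype data: per-locus totals T_i within each pool.
T = Y.sum(axis=1)            # shape (n, L), entries in 0..K

# Collapsed data: each total becomes 0 ("no variant") or 1 ("at least one").
Z = (T >= 1).astype(int)     # shape (n, L), binary
```

Each reduction discards information: T loses the phase of the haplotypes, and Z additionally loses the allele counts above one.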
The collapsed data maximum likelihood estimator
Suppressing its dependence on the pool size k, let

g(z)=P(Z_{i}=z), z∈Ω,

be the probability function of the collapsed data, and

g_{0}(Λ)=P(Z_{li}=0,l∈Λ)

the probability of zero pool totals at a subset Λ of the L loci. Kuk et al.[29] argued that the MLE of g_{0}(Λ) based on the collapsed data Z_{1},⋯,Z_{n} is given by

ĝ_{0}(Λ)=n_{Z0}(Λ)/n,

where n_{Z0}(Λ) is the number of pools with zero allele counts at the positions specified by Λ.
By making use of the relationship g_{0}(Λ)=f_{0}(Λ)^{2k}, where f_{0}(Λ)=P(Y_{lij}=0,l∈Λ), and the inclusion-exclusion principle, Kuk et al.[29] obtained

f̂(y)=∑_{j=0}^{m}(−1)^{j}∑_{S⊆Λ^{′}(y),|S|=j}{n_{Z0}(Λ(y)∪S)/n}^{1/(2k)}

as the collapsed data MLE of f(y), where Λ(y)={l:y_{l}=0} stores the positions of the 0’s in y=(y_{1},⋯,y_{L}), with complement Λ^{′}(y)={l:y_{l}=1}, which stores the positions of the 1’s, and m is the number of 1’s in the haplotype vector y.
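A direct transcription of this estimator (our own sketch; the helper and variable names are ours) makes the inclusion-exclusion over the positions of the 1's explicit:

```python
from itertools import combinations
import numpy as np

def cdmle(Z, y, k):
    """Collapsed data MLE of f(y) from the (n, L) binary collapsed-data
    matrix Z, for pool size k. Uses the invariance-based estimate
    f0_hat(Lam) = {n_Z0(Lam)/n}^(1/(2k)) inside an inclusion-exclusion
    sum over subsets S of the positions of the 1's in y."""
    n, L = Z.shape
    zeros = [l for l in range(L) if y[l] == 0]     # Lambda(y)
    ones = [l for l in range(L) if y[l] == 1]      # Lambda'(y)

    def f0_hat(lam):
        if not lam:                                 # empty set: probability 1
            return 1.0
        n_z0 = int(np.sum(Z[:, lam].sum(axis=1) == 0))  # all-zero pools on lam
        return (n_z0 / n) ** (1.0 / (2 * k))

    est = 0.0
    for j in range(len(ones) + 1):                  # subsets of size j, sign (-1)^j
        for S in combinations(ones, j):
            est += (-1) ** j * f0_hat(zeros + list(S))
    return est
```

For example, with individual genotype data (k=1) and collapsed rows [[0,0],[0,0],[1,0],[0,1]], the estimate of f(0,0) is (2/4)^{1/2}≈0.707, illustrating the de-pooling exponent 1/(2k).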
Since the collapsed data is a reduction of the pooled data, the collapsed data MLE is less efficient than the pooled data MLE. Kuk et al.[29] showed that the loss of estimation efficiency due to the collapsing of pooled data is not large for rare variants and small pool size. However, if the pool size is moderate or large, which is recommended from the cost saving point of view, an estimator based on the original pooled data without collapsing can be substantially more efficient than the collapsed data MLE. This is why we want to modify the EM algorithm for finding the pooled data MLE to make it computationally feasible.
The EM algorithm based on the collapsed data list with augmentation and trimming
If the individual haplotypes Y_{ij}, i=1,⋯,n, j=1,⋯,K, were actually observed, then the population haplotype distribution could be estimated simply by the empirical haplotype distribution. In other words, the so-called complete data MLE of f(y), y∈Ω, is

f̂(y)=m(y)/(nK), (2)

where m(y) is the number of times y appears among the Y_{ij}. The E-step of the EM algorithm involves taking the conditional expectation of m(y) given the observed data and the current estimates f̂(y), y∈Ω, to get

m̂(y)=∑_{i=1}^{n}∑_{(y_{1},⋯,y_{K}):y_{1}+⋯+y_{K}=T_{i}}#{j:y_{j}=y}∏_{j=1}^{K}f̂(y_{j})/p(T_{i}), (3)

where

p(t)=∑_{(y_{1},⋯,y_{K}):y_{1}+⋯+y_{K}=t}∏_{j=1}^{K}f̂(y_{j})

is the current estimate of the probability function of the pool total. Since the complete data multinomial likelihood belongs to the exponential family, the M-step can be carried out analytically to yield the updating formula

f̂_{new}(y)=m̂(y)/(nK), (4)

which is just (2) with m(y) replaced by the imputed value m̂(y).
The Estep of the EM algorithm is very time consuming. As one can see from (3), it involves finding all possible underlying haplotype vectors that sum up to the observed pool total. The combinatorial problem is greatly reduced if we can restrict the possible haplotypes to come from a relatively short list.
Let R⊂Ω be a reduced list of possible haplotypes obtained by whatever method. The generic EM with a list algorithm operates in the same way as the EM algorithm described above except that the updating formula (4) is only applied to y∈R⊂Ω, and Ω is replaced by R under the summation symbols in Equation (3).
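A brute-force rendition of this generic EM-with-a-list scheme (our own sketch, feasible only for short lists and small pools; the function and variable names are ours, not the authors') enumerates, for each pool, the multisets of K haplotypes from R compatible with the observed total:

```python
from itertools import combinations_with_replacement
from math import factorial, prod
import numpy as np

def em_with_list(T, R, k, iters=50):
    """EM for haplotype frequencies from pool totals T (n x L array),
    with haplotypes restricted to the candidate list R (tuples of 0/1),
    pool size k. The E-step enumerates all multisets of K = 2k
    haplotypes from R whose component-wise sum equals each pool total."""
    K = 2 * k
    n = len(T)
    f = {y: 1.0 / len(R) for y in R}                  # uniform start
    for _ in range(iters):
        m = {y: 0.0 for y in R}                        # imputed counts m_hat(y)
        for t in map(tuple, T):
            configs, probs = [], []
            for cfg in combinations_with_replacement(R, K):
                if tuple(np.sum(cfg, axis=0)) != t:
                    continue                           # incompatible with T_i
                counts = {y: cfg.count(y) for y in set(cfg)}
                perms = factorial(K)                   # multinomial coefficient
                for c in counts.values():
                    perms //= factorial(c)
                probs.append(perms * prod(f[y] ** c for y, c in counts.items()))
                configs.append(counts)
            total = sum(probs)
            if total == 0:
                continue                               # pool not explained by R
            for counts, p in zip(configs, probs):
                for y, c in counts.items():
                    m[y] += (p / total) * c            # E-step contribution
        f = {y: m[y] / (n * K) for y in R}             # M-step update
    return f
```

For a single locus with list R={(0,),(1,)}, pool totals [0,1,2,0] and k=1, every pool total determines its configuration uniquely, and the algorithm recovers the empirical allele frequency 3/8.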
Kuk et al.[29] described a combinatorial method to arrive at a reduced list R, but the resulting EML algorithm is still very time consuming. As can be seen from Table 2, the EML algorithm is not feasible for pool sizes larger than 2. Thus there is a need for alternative methods to arrive at a reduced list. Motivated by the fact that the collapsed data MLE f̂(y) is positive only for “those haplotypes y which coincide with at least one of the collapsed data vectors Z_{i},i=1,⋯,n, in the sample”, it seems sensible to apply the EM algorithm with haplotypes restricted to this list, which we call the CD list.

Let y be a non-ancestral haplotype (i.e., y≠0, the vector of all zeros) with frequency f(y)>0. The probability that it is captured in a list of n randomly sampled haplotypes is 1−{1−f(y)}^{n} (≈1−e^{−nf(y)} if f(y) is small and n is large), whereas the probability that it is captured by the CD list constructed from n pools of k individuals each is 1−{1−g(y)}^{n}≈1−e^{−ng(y)}. Thus if g(y)>f(y), the probability that y is captured by the CD list is higher than the probability that it is captured by direct sampling of haplotypes (not to mention the extra cost incurred in resolving the phase ambiguity to sample the haplotypes directly), and by increasing the number of pools n, we can make the capture probability arbitrarily large. For example, if we want the CD list to capture y with probability at least 1−ε, all we have to do is to solve 1−e^{−ng(y)}≥1−ε (after Poisson approximation) to get n≥ln(1/ε)/g(y). A sufficient condition for g(y) to be greater than f(y) is given below.
Lemma 1
Let y be non-ancestral. Then g(y)>f(y) if f(0)>{1/(2k)}^{1/(2k−1)}.
Proof
A sufficient condition for Z=y is that one of the 2k haplotype vectors Y_{1},⋯,Y_{2k} in a pool of k individuals is equal to y=(y_{1},⋯,y_{L}), and the other 2k−1 haplotype vectors are all zero vectors. Thus g(y)≥2kf(y)f(0)^{2k−1}, and the lemma follows. □
The values of for various choices of the pool size k are given in Table 5. Thus if the alleles are sufficiently rare in the sense that f(0) is larger than the threshold given in Table 5, then there is a better chance of capturing each nonancestral haplotype with f(y)>0 by the CD list than by direct sampling of haplotypes. This is achieved by redistributing the probability of the ancestral haplotype in the process of pooling and collapsing. In other words, the reason why it is possible to have g(y)>f(y) for nonancestral y is because g(0)=f(0)^{2k}<f(0). We cannot have g(y)>f(y) for all haplotypes y because both g(y) and f(y) must sum to 1. Table 6 shows how the probabilities are being redistributed for a 25loci case. The true haplotype distribution f(y) is listed in column 1 of Table 6, whereas the distributions g(y) of the collapsed data for various pool sizes are given in the subsequent columns. For nonancestral y, we can see from Table 6 that g(y)>f(y), and more so when the pool size is increased (up to a point), which is good news for the CD list. For example, f(1,0,⋯,0)=0.0509, whereas g(1,0,⋯,0)=0.0839 when k=1, and continues to increase to 0.1143 and 0.1169 when the pool size is increased to 2 and 3. We are particularly interested in the capability of the CD list in capturing haplotypes with multiple 1’s. For the last haplotype listed in Table 6 (which contains five 1’s), f(y)=0.0034, but g(y) is 0.0097 when the pool size is 3. Thus if we have n=200 pools (which is one setting of our simulation study) of k=3 individuals each, the probability that this haplotype is captured by the CD list is 0.8577=1−(1−0.0097)^{200}≈1−e^{−200(0.0097)}=0.8563. But g(y) will also assign positive probabilities to some haplotypes y even though f(y)=0 since , which is why we propose to trim the CD list. To see how g(y) can be positive even though f(y)=0, consider the following case with just 2 loci. Suppose f(1,0)>0, f(0,1)>0, but f(1,1)=0. 
By pooling k individuals together, it is obviously possible to have total allele counts T_{1}≥1,T_{2}≥1 at both loci, and hence (Z_{1},Z_{2})=(1,1), which means that (1,1) will appear on the CD list even though f(1,1)=0.
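The threshold in the lemma can be made explicit: g(y)>f(y) is guaranteed whenever 2k f(0)^{2k−1}>1, i.e. f(0)>(2k)^{−1/(2k−1)}. The following Python sketch (ours, for illustration; the numbers g(y)=0.0097, n=200, k=3 are taken from the text) checks these thresholds and the capture-probability calculation above.

```python
import math

def threshold(k):
    """Smallest f(0) for which the lemma guarantees g(y) > f(y):
    solve 2k * f(0)^(2k-1) = 1 for f(0)."""
    return (2 * k) ** (-1.0 / (2 * k - 1))

def capture_prob(g, n):
    """Probability that a haplotype with collapsed-data frequency g
    appears in at least one of n independent pools."""
    return 1 - (1 - g) ** n

for k in (1, 2, 3, 4):
    print(f"k={k}: f(0) must exceed {threshold(k):.4f}")

print(round(capture_prob(0.0097, 200), 4))       # exact: 0.8577
print(round(1 - math.exp(-200 * 0.0097), 4))     # approximation: 0.8563
```

For k=1 the threshold is 0.5, and it rises slowly with k, so the condition is mild for rare variants, where f(0) is typically close to 1.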
Table 5. Sufficient conditions for nonancestral haplotype frequencies to be increased by collapsing data
Table 6. Induced collapsed data frequencies
The CD list misses some haplotypes with f(y)>0, while some other haplotypes with f(y)=0 are erroneously included. This suggests that the CD list needs to be augmented as well as trimmed. Since we are focusing on rare variants, we augment the CD list by adding all vectors with a single “1” that are not already on it, so at most L haplotypes are added. Starting the EM iterations from the augmented CD list, we remove a haplotype from the list if its estimated frequency at the current EM iteration falls below a threshold. The threshold is selected (typically over a grid) as the one yielding an estimate of the ancestral haplotype frequency f(0) closest to the collapsed data MLE of f(0), which should be a reasonable benchmark.
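The list construction just described can be sketched in a few lines of Python (an illustrative sketch, not the authors' implementation; the function names are ours). The benchmark uses the relation g(0)=f(0)^{2k}, so a natural collapsed data MLE of f(0) is the 2k-th root of the observed fraction of all-zero collapsed pools.

```python
def collapse(pool_totals):
    """Collapse a pool's total minor allele counts T_1..T_L to the
    0/1 vector Z with Z_l = 1 iff T_l >= 1."""
    return tuple(int(t >= 1) for t in pool_totals)

def augmented_cd_list(all_pool_totals, L):
    """CD list (distinct collapsed vectors), augmented with every
    single-'1' haplotype to improve coverage of rare variants."""
    cd = {collapse(t) for t in all_pool_totals}
    singles = {tuple(int(i == j) for i in range(L)) for j in range(L)}
    return cd | singles

def benchmark_f0(all_pool_totals, k):
    """Collapsed data MLE of the ancestral frequency f(0): since
    g(0) = f(0)^(2k), take the 2k-th root of the fraction of pools
    whose collapsed vector is all zero."""
    zs = [collapse(t) for t in all_pool_totals]
    n0 = sum(all(z == 0 for z in v) for v in zs)
    return (n0 / len(zs)) ** (1.0 / (2 * k))
```

A grid of trimming thresholds would then be scanned, retaining the threshold whose EM estimate of f(0) over the trimmed list comes closest to `benchmark_f0`.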
Availability
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
AYCK and XL contributed 40% each and JX contributed 20% to this work. All authors read and approved the final manuscript.
Acknowledgements
The authors would like to thank two referees for their helpful comments, which helped improve the paper.