Abstract
Background
Haplotype analysis has gained increasing attention in the context of association studies of disease genes and drug responsivities over the last years. The potential use of haplotypes has led to the initiation of the HapMap project which is to investigate haplotype patterns in the human genome in different populations. Haplotype inference and frequency estimation are essential components of this endeavour.
Results
We present a twostage method to estimate haplotype frequencies in pedigrees, which includes haplotyping stage and estimation stage. In the haplotyping stage, we propose a linear time algorithm to determine all zerorecombinant haplotype configurations for each pedigree. In the estimation stage, we use the expectationmaximization (EM) algorithm to estimate haplotype frequencies based on these haplotype configurations. The experiments demonstrate that our method runs much faster and gives more credible estimates than other popular haplotype analysis software that discards the pedigree information.
Conclusion
Our method suggests that pedigree information is of great importance in haplotype analysis. It can be used to speedup estimation process, and to improve estimation accuracy as well. The result also demonstrates that the whole haplotype configuration space can be substituted by the space of zerorecombinant haplotype configurations in haplotype frequency estimation, especially when the considered haplotype block is relatively short.
Background
The modelling of human genetic variation is critical to the understanding of the genetic basis for complex diseases. Single nucleotide polymorphisms (SNPs) are the most frequent form of variation. The Human Genome Project and other largescale efforts have identified millions of SNP markers. Although each marker can be analyzed independently, it is more informative to analyze them in groups. Therefore, it is useful to analyze haplotypes (haploid genotypes), which are sequences of linked markers on a single chromosome. In diploid organisms, such as human beings, chromosomes come in pairs, and experiments often yield genotype information, which blend haplotype information for chromosome pairs. There is growing evidence that, in order to better characterize the role of a candidate gene, full haplotype information should be exploited instead of using only genotype information. Unfortunately, it is both timeconsuming and expensive to derive haplotype information experimentally. This explains the increasing interest in inferring haplotype information, or haplotyping, computationally. In fact, the potential use of haplotypes has led to the initiation of the HapMap project which is to investigate haplotype patterns in the human genome in different populations. Haplotype inference and frequency estimation are essential components of this endeavour.
Genotype data can be with or without any pedigree information, the first category is called population genotype data while the second one is pedigree genotype data. A large number of algorithms have been designed to estimate haplotype frequencies based on population data [14]. Among them, EM algorithms are most popular due to their interpretability and stability.
For any given pedigree genotype data, we can certainly discard the pedigree information and simply take the genotype sequences as the input of EM estimation algorithms for population data. However, it is well accepted that information obtained by analyzing pedigree genotype data is more reliable: the constraint provided by other members in a pedigree would force one genotype to settle on a unique haplotype pair as being most probable.
Here we propose a twostage method to estimate haplotype frequencies in pedigrees. The first stage is the haplotyping stage, which finds out all feasible haplotype configurations for each pedigree. In the second estimation stage, we use EM algorithm to estimate haplotype frequencies in pedigrees based on the haplotype configurations inferred in the former stage.
In general, haplotyping pedigrees need consider the entire solution space of all possible consistent haplotype configurations. However, the genomic DNA can be partitioned into long blocks such that recombinations within each block are rare or even nonexistent [5,6]. Thus it is believed that haplotype configurations with fewer recombinations should be preferred in haplotype inference [79]. When the region of interest is so small that the expected number of recombinations in the pedigree data is very close to zero, the solution space of all consistent haplotype configurations can be replaced by that of zero recombination (provided it is nonempty) to estimate haplotype frequencies. It is because the contribution of the solutions of recombinations to the overall likelihood becomes so small compared to those of zero recombination while they bring considerable complexity to the computation. Thus, we are interested in finding the consistent haplotype configurations of zero recombination.
Wijsman [10] proposed a 20rule algorithm, and O'Connell [11] described a genotypeelimination algorithm, both of which can be used to find out zerorecombinant haplotype configurations for pedigrees. Recently, Li and Jiang [8,9] showed that it could be solved in polynomial time. Here we propose an algorithm to find out zerorecombinant haplotype configurations in linear time using a technique called HCLlinkage analysis.
In the second stage, we use the EM algorithm to estimate haplotype frequencies based on haplotype configurations obtained from the haplotyping stage. We employ the HardyWeinberg Equilibrium to obtain the probabilities of founder genotypes and use a genetic model [12] to deduce the transmitted probabilities of nonfounders. While the likelihood of each configuration is computed by multiplying the probabilities of each genotype, the frequency of each haplotype that appears in the configuration is calculated by a genecounting method.
We implement all the algorithms in a C software package named HANAP (Haplotype ANAlysis in Pedigrees) and test its effectiveness and efficiency both on simulated and real data sets. The experimental results show that, our method runs much faster than the direct frequency estimation software that discards the pedigree information. Moreover, because our method utilizes such information, the estimation is more reliable.
Methods
Haplotyping stage: haplotyping algorithm based on HCLlinkage analysis
Excoffier's EM algorithm was widely applied in haplotype analysis [14,15]. Unfortunately, it should calculate the frequencies of all possible haplotype pairs consistent to each given genotype, which is unbearable in storage when the haplotype length grows to more than 20 [16]. O'Connell [10] showed that genetic information from relatives could be used to resolve one genotype's ambiguity, and thus reduce the number of haplotypes that should be considered. However, O'Connell's method had an exponential time complexity. Recently, Li and Jiang [8,9] showed that, for any genotype in a given pedigree, its ambiguity could be solved in cubic time (O(m^{3}n^{3})), where n is the number of members in the pedigree and m is the number of loci in each genotype. Here we present a socalled HCLlinkage analysis method to do haplotyping in linear time (O(mn)).
HCLlinkage definition
Trios are simple pedigrees that contain only a pair of parents and a child. A consistent zerorecombinant haplotype configuration for a general pedigree should also be a consistent zerorecombinant haplotype configuration when restricted to each trio in this pedigree. Given trio T = (F, M, C), here F is the father, M is the mother and C is the child, suppose that locus i of F, M, and C have alleles {a, b}, {c, d} and {e, f} (note that {e, f} ⊂ ({a, b} ∪ {c, d})). The genotype information of C can be homozygous or heterozygous. If it is homozygous (e = f), then it is clear that the paternal allele and the maternal allele are the same (e or f). The situation becomes complicated if it is a heterozygous site (e ≠ f). Table 1 lists out all possible situations. We can see that given the locus genotype information for the three members, it may or may not be possible to determine the paternal allele and the maternal allele for the child. We call a locus ambiguous if its inheritance relationship cannot be resolved.
Table 1. Imputing the paternal allele and the maternal allele for the child at a single locus
In fact, for any trio, ignoring the ambiguous loci, the consistent (partial) haplotype configuration for the unambiguous loci is unique and specifies a linkage of alleles on some heterozygous loci for each node in the trio. We define such linkage as HCLlinkages (linkages of Haplotype Configuration on the nonambiguous Loci).
Definition
An HCLlinkage ψ is a quadruplet <v, RE, LS, PH> defined on node v and specified by the unique consistent (partial) haplotype configuration within a trio that contains v. Here v denotes the node to which the HCLlinkage belongs. LS = {a_{1},...,a_{l}} is the set of heterozygous loci where the haplotype configuration has been inferred. PH = {ph, ph'} = {(h_{1}...h_{l}), (h_{1}'...h_{l}')} records the two (partial) haplotypes imputed on these loci. RE = (R, R') denotes that ph (respectively ph') is inherited from or will be passed on to the node in set R(R').
An HCLlinkage describes the partial haplotype configuration of a node and the inheritance relationship between the parents and their children. Under our definition, we can conclude that every haplotype configuration should be consistent with any HCLlinkage specified by each trio in the pedigree.
Merge and transfer operations over HCLlinkages
In the case of multiple generations and multiple children, loci on one node may be linked by different HCLlinkages. HCLlinkages of the same node should be merged if they can. There are three cases when merging two HCLlinkages ψ_{1 }= <v, (R_{1}, R_{1}'), LS_{1}, {ph_{1}, ph_{1}'}> and ψ_{2 }= <v, (R_{2}, R_{2}'), LS_{2}, {ph_{2}, ph_{2}'}> on node v.
Case (1): (R_{1 }∪ R_{1}') ∩ (R_{2 }∪ R_{2}') ≠ Φ
l.a) R_{1 }∩ R_{2 }≠ Φ or R_{1}' ∩ R_{2}' ≠ Φ, it means that both ph_{1 }and ph_{2 }are from the nodes in R_{1 }and R_{2 }said ph_{1}' and ph_{2}' are from the nodes in R_{1}' and R_{2}', so ph_{1 }and ph_{2 }should be on the same haplotype, and ph_{1}' and ph_{2}' on the other: i) LS_{1 }∩ LS_{2 }= Φ, or LS_{1 }∩ LS_{2 }≠ Φ but ph_{1 }equals ph_{2 }when restricted to loci in LS_{1 }∩ LS_{2}, it means that ψ_{1 }and ψ_{2 }are compatible. In this case, they should be merged to ψ = <v, (R_{1 }∪ R_{2}, R_{1}' ∪ R_{2}'), LS_{1 }∪ LS_{2}, {ph_{1 }∪ ph_{2}, ph_{1}' ∪ ph_{2}'}>, here ph_{1 }∪ ph_{2 }denote a longer partial haplotype, which alleles equal to those of ph_{1 }and ph_{2 }when restricted to loci in LS_{1 }and LS_{2}; ii) LS_{1 }∩ LS_{2 }≠ Φ and ph_{1 }doesn't equal ph_{2 }when restricted to LS_{1 }∩ LS_{2}, it means that ψ_{1 }and ψ_{2 }are incompatible, i.e. no haplotype configuration can satisfy the two HCLlinkages in the same time.
1.b) R_{1 }∩ R_{2}' ≠ Φ or R_{1}' ∩ R_{2 }≠ Φ, it means that ph_{1 }and ph_{2}' should be on the same haplotype, and ph_{1}' and ph_{2 }on the other. Similarly, ψ_{1 }and ψ_{2 }can be merged to ψ = <v, (R_{1 }∪ R_{2}', R_{1}' ∪ R_{2}), LS_{1 }∪ LS_{2}, {ph_{1 }∪ ph_{2}', ph_{1}' ∪ ph_{2}}> when they are compatible.
Case (2): (R_{1 }∪ R_{1}') ∩ (R_{2 }∪ R_{2}') = Φ, but LS_{1 }∩ LS_{2 }≠ Φ,
2.a) ph_{1 }equals ph_{2 }(and ph_{1}' equals ph_{2}' consequently) or ph_{1 }equals ph_{2}' (then ph_{1}' equals ph_{2}) when restricted to LS_{1 }∩ LS_{2}, it means that ψ_{1 }and ψ_{2 }are compatible, in this case, they should be merged to ψ = <v, (R_{1 }∪ R_{2}, R_{1}' ∪ R_{2}'), LS_{1 }∪ LS_{2}, {ph_{1 }∪ ph_{2}, ph_{1}' ∪ ph_{2}'}> or ψ = <v, (R_{1 }∪ R_{2}', R_{1}' ∪ R_{2}), LS_{1 }∪ LS_{2}, {ph_{1 }∪ ph_{2}', ph_{1}' ∪ ph_{2}}>.
2.b) Else, ph_{1 }doesn't equal ph_{2 }or ph_{2}' when restricted to LS_{1 }∩ LS_{2}, it means that ψ_{1 }and ψ_{2 }are incompatible.
Case (3): (R_{1 }∪ R_{1}') ∩ (R_{2 }∪ R_{2}') = Φ, and LS_{1 }∩ LS_{2 }= Φ,
In this case, ψ_{1 }and ψ_{2 }cannot be merged and both should be recorded in a HCLlinkage set Ψ_{v }for node v.
With the merge operation, we can define the normalizing of a set of HCLlinkages Ψ_{v}: normalizing a set Ψ_{v }of HCLlinkages on node v means repeatedly applying the merge operation for pairs of HCLlinkages in Ψ_{v }until, ∀ψ_{i}, ψ_{j }∈ Ψ_{v}, (R_{i }∪ R_{i}') ∩ (R_{i }∪ R_{i}') = Φ, and LS_{1 }∩ LS_{2 }= Φ. Ψ_{v }is then said to be normalized. From now on, if there is no further notice, Ψ_{v }should be normalized after any changes.
Like genetic information, HCLlinkages will be passed on from generations to generations. Without loss of generality, let us define the transfer of HCLlinkage information from child C to its parent F. The other case from F to C would be similar. Let Ψ_{C }and Ψ_{F }represent the normalized HCLlinkage sets of C and F respectively, and let HS be the set of homozygous loci of F. The transfer of Ψ_{C }from C to F results in changes to Ψ_{F}, where each ψ_{C }= <C, (R_{C}, R_{C}'), LS_{C}, {ph_{C}, ph_{C}'}> ∈ Ψ_{C }is transferred independently. There are two cases to consider.
Case (1): if F ∈ R_{C }(or respectively, F ∈ R_{C}'), add ψ_{F }= <F, ({C}, Φ), LS_{C } HS, {ph_{F}, ph_{F}'}> to Ψ_{F}, here ph_{F }equals the resulting partial haplotypes of ph_{C }(respectively ph_{C}') when restricted to loci in LS_{C } HS and ph_{F}' is the compensatory partial haplotypes of ph_{F }consistent to genotype g_{F}.
Case (2): else, F ∉ R_{C }∪ R_{C}': i) both ph_{C }and ph_{C}' are consistent with the partial genotype g_{F }when restricted to loci in LS_{C}, then add ψ_{F }= <F, (Φ, Φ), LS_{C } HS, {ph_{F}, ph_{F}'}> to Ψ_{F}, here ph_{F }and ph_{F}' equal the resulting partial haplotypes of ph_{C }and ph_{C}' when restricted to loci in LS_{C } HS; ii) ph_{C}' (respectively ph_{C}) is not consistent with the partial genotype g_{F }when restricted to loci in LS_{C}, then add ψ_{F }= <F, ({C}, Φ), LS_{C } HS, {ph_{F}, ph_{F}'}> to Ψ_{F}, here ph_{F }equals the resulting partial haplotypes of ph_{C }(respectively ph_{C}') when restricted to loci in LS_{C } HS and ph_{F}' is the compensatory partial haplotypes of ph_{F }consistent to genotype g_{F}. Note that at least one of ph_{C }and ph_{C}' should be consistent with the partial genotype g_{F}.
Remember that Ψ_{F }should be normalized whenever adding a new HCLlinkage to it. In the case of transferring an HCLlinkage ψ_{F }from F to C, resulting in adding ψ_{C }= <C, (R_{C}, R_{C}'), LS_{C}, {ph_{C}, ph_{C}'}> to Ψ_{C}, note that we should add M into R_{C}' whenever we have determined that F ∈ R_{C}.
Our merge and transfer operations will not bring more or lose any HCLlinkage information for building consistent haplotype configurations.
Main HCLlinkages analysis haplotyping algorithm
Before the algorithm, we preprocess each trio in the pedigree. Whenever a trio specifies an HCLlinkage for node v, it will be stored in the HCLlinkage set Ψ_{v}. The objective of the algorithm is to collect the complete HCLlinkage information for each node, which is accomplished by traversing the tree twice.
Firstly, we will convert the input pedigree into a rooted searching tree T (at an arbitrary node R) (Step 1). Then we traverse T in postorder to transfer and merge the HCLlinkage information for each node from its relatives (Step 2). We do this from the left lowest nuclear family F_{o}. The HCLlinkages in nuclear family F_{o }will be merged at both parents, and then be transferred to the root of the subtree. The same operations will be conducted in its parental nuclear family on HCLlinkages specified in this family as well as on those transferred from its child families. And at last, we collect all the HCLlinkages at the root R. In Step 3, we traverse T again in preorder and transfer the linkage in another direction from R to its farmost descendants.
After step 3, the HCLlinkage set of each node preserves all HCLlinkages in the pedigree. In step 4, we choose a node v arbitrarily. Set Ψ_{v }contains several HCLlinkages ψ_{1}, ψ_{2},...,ψ_{l }defined on disjoint locus set LS_{1}, LS_{2},...LS_{l}. When a set of loci are linked by one HCLlinkage, they can be viewed as a compound locus, and the two partial haplotypes can be viewed as two compound alleles. These "loci" (and "alleles") will be treated equally as the other heterozygous loci and homozygous loci that are not involved in any HCLlinkage. We arbitrarily select one allele from the two at each locus to form a haplotype; the other alleles form another haplotype. It is called an imputing schema. Whenever the haplotype configuration of one node is determined, it can be used to determine the configurations of its relatives, and those of the whole pedigree at last.
During our algorithm, Incompatibleness may occur when normalizing HCLlinkage set Ψ_{v}. Then we declare that there is no solution and exit from the algorithm immediately. Even in step 4, incompatibleness may still occur when applying the haplotypes of the parents to resolve the genotype of the children in the case that an individual node has multiple children. Figure 1 shows an example. The key point is, if it exists a consistent haplotype configuration for a nuclear family (F, M, C_{1}, C_{2},...,C_{d}), every arbitrary imputing schema s can output one feasible solution ζ. Contrarily, if one imputing schema ends with incompatibleness, other schemata will fail too. We will prove this in the appendix.
Figure 1. Incompatibleness when imputing alleles.
The time complexity and space complexity of our algorithm are both O(mn) where n is the number of the members in the pedigree and m is the length of the loci.
Frequency estimation stage
Suppose that we are given K pedigrees P = {P_{1}, P_{2},...,P_{K}}. Each P_{i }consists of n_{i }nodes v_{i,j }(1 ≤ i ≤ K, 1 ≤ j ≤ n_{i}), in which the first n_{i}' are founders. The genotype of node v_{i,j }(1 ≤ j ≤ n_{i}) is g_{i,j}. Suppose that there are π_{i }consistent solutions for pedigree P_{i }and the sth solution is: S_{s,i }= <S_{s,i,1}, S_{s,i,2},...,> (1 ≤ s ≤ π_{i}), where S_{s,i,j }= <α_{s,i,j,1}, β_{s,i,j,2}> is a haplotype pair of genotype g_{i,j}. All haplotypes appear in these solutions form a list of haplotypes H = {h_{1}, h_{2},...,h_{l}} with frequencies Θ = {θ_{1}, θ_{2},...,θ_{l}}, here θ_{1 }+ θ_{2 }+ ... + θ_{t }= 1 is what we want to estimate.
The likelihood of haplotype frequencies given the observed pedigree data is,
Under the assumption of random mating, the paternal haplotype configuration and the maternal haplotype configuration are independent, and the child's haplotype configuration is transmitted from its parents. We have:
Here Pr (S_{s,i,j}Θ) is the probability of haplotype configuration of the founder nodes, it can be computed using the HardyWeinberg Equilibrium. Pr(S_{s,i,j'}<, >) is the gamete transmission probabilities of haplotype configuration S_{s,i,j }with the parental haplotype configurations of and . It can be computed using a genetic model presented by Elston and Stewart [6].
EM algorithm estimates the haplotype frequencies Θ starting with the initial arbitrary values Θ^{(0) }= {θ_{1}^{(0)}, θ_{2}^{(0)},...,θ_{l}^{(0)}}. These initial values are used as if they were the unknown true frequencies to estimate solution frequencies Pr(S_{s,i}Θ) (the expectation step). These expected solution frequencies are used in turn to estimate haplotype frequencies at the next iteration Θ^{(1) }= {θi^{(1)}, θ_{2}^{(1)},...,θ^{(1)}} (the maximization step), and so on, until convergence is reached.
Suppose that in the rth iteration, Θ = Θ^{(r) }and we want to estimate Θ^{(r+1)}. Then we have:
Let δ_{i,j,t }be an indicator variable equalling the number of haplotype h_{t }appear in solution S_{s,i}. Then the haplotype frequencies can be computed using a genecounting method,
There are several ways to initialize the haplotype frequencies Θ = {θ_{1}, θ_{2},...,θ_{l}}. For instance, the initial haplotype frequencies can be chosen at random, or all haplotypes are equally frequent, i.e. Θ_{t}^{(0) }= 1/l (t = 1, 2,...,l). Or that all initial haplotype frequencies are equal to the product of the corresponding singlelocus allele frequencies (i.e., a complete linkage equilibrium). Also, we can set all feasible solutions for each pedigree to be equally likely, i.e. Pr(S_{s,i}Θ^{(0)}) = 1/π_{i}, (j = 1,2,...,π_{i}). We can even initialize the haplotype frequencies by counting their occurrence in all the feasible solutions. Since in practical applications the EM algorithm could be trapped in some local maximum, we recommend to restart the algorithm several times with different initial haplotype frequencies and better with a randomized additive perturbation.
The stopping (convergence) criterion is defined as the absolute value of the difference of Θ between consecutive iterations being less than some small value ε > 0.
Results
Simulated data set
In order to generate a pedigree genotype data set for simulation experiments, we generate a population of haplotypes H* first, where each locus of each haplotype is set to some allele according to the probability distribution function P. In our simulation, we generated haplotypes of SNP loci as well as haplotypes of microsatellite loci. For a biallelic SNP locus i, suppose that i happens to be one state with a probability of p_{i}, and to be the other state with a probability of (1  p_{i}). For a microsatellite loci, suppose it has w different alleles: a_{1}, a_{2},...,a_{w}, each appears with the probability of p_{1}, p_{2},...,p_{w }(p_{1 }+ p_{2 }+...+ p_{w }= 1).
Each founder node in any tested pedigree is arbitrarily assigned a pair of haplotypes according to their frequencies θ*. The two haplotypes of a nonfounder node are arbitrarily selected from those of its parents (one from the father, one from the mother). At last, the pair of haplotypes of the same node is blended to form a genotype corresponding to that node.
All experiments are conducted on a Windows server with 1.7G Hz CPU and 256 MB RAM. And for each parameter setting, 100 copies are randomly generated and the performance is evaluated by computing the average numbers in these 100 runs.
Running time of the haplotyping algorithm
One of the main contributions of our paper is to do haplotyping in linear time, so we firstly examine the running time with respect to different number of nodes of each pedigree (n) and different number of loci in each sequence (m).
Several different tree pedigree structures are used in the simulation, the first pedigree is Figure 1 in [15], which is a tree with 13 nodes. The second one is Figure 8 in [9], which is a tree with 29 nodes. The third one is a 21 node pedigree from Figure 5 of [15]. The results are given in Figure 2. It is obvious that our HCLanalysis haplotyping algorithm runs in linear time and thus could be applied to largescale haplotype analysis.
Figure 2. Running time of the haplotyping algorithm: (a) running time vs. number of nodes (m = 100); (b) running time vs. number of loci (n = 13).
Number of solutions
We compare the numbers of haplotypes that should be considered in the estimation stage, with and without the haplotyping stage. In our experiment, we set P_{1 }(p_{1 }= p_{2 }= 0.5) and P_{2 }(p_{1 }= 0.9, p_{2 }= 0.1) for SNP loci, and set w = 4, P_{3 }(p_{1 }= p_{2 }= p_{3 }= p_{4 }= 0.25) and P_{4 }(p_{1 }= 0.5, p_{2 }= p_{3 }= 0.2, p_{4 }= 0.1) for microsatellite loci. We let H* = 20, and θ_{1}* = 0.2, θ_{2}* = θ_{3}* = θ_{4}* = 0.1, θ_{5}* = θ_{6}* = θ_{7}* = θ_{8}* = 0.05, θ_{9}* = θ_{10}* = ... = θ_{20}* = 0.025.
When only trio pedigrees are considered, the average numbers of haplotypes are recorded in Table 2. We can see from the table that the numbers of haplotypes that should be estimated have been greatly reduced after the haplotyping stage (HANAP vs. directly), which will immediately bring the improvement on the running time.
Table 2. Comparison of number of haplotypes (H) on trio pedigrees
We also consider a more complex pedigree that contains 13 nodes (Figure 1 of [15]). The average numbers of haplotypes are recorded in Table 3. We find that the number of haplotypes that should be estimated is even much smaller. We also notice that the number of haplotypes is growing with the length of haplotypes and the number of pedigrees. However, it grows very slowly.
Table 3. Comparison of number of haplotypes (H) on a general pedigree
Running time of HANAP
EMDeCODER is a popular software using the EM algorithm to estimate the haplotype frequencies based on population data. As we have pointed out, it can be used to estimate haplotype frequencies in pedigrees, simply by discarding the pedigree information. Here we also compare the running times of HANAP and EMDeCODER.
Figure 3 shows their running times over different number of trios (k), length of haplotypes (m) and distributions of alleleprobability (P). We can learn from the figure that HANAP runs much quicker than EMDeCODER, and thus can be applied to much larger instances. We also notice that the running time of both HANAP and EMDeCODER increase exponentially with the length of haplotypes while increasing nearlinearly with the number of trios (the running time of EMDeCODER is not plotted in Figure 3(b) because haplotypes of length 100 are out of its capability).
Figure 4. The estimate deviation of HANAP and EMDeCODER: (a) deviation vs. length of haplotypes (K = 100); (b) deviation vs. number of trios (m = 100).
Accuracy rate of HANAP
We define a parameter Δ to incarnate the deviation of the estimate haplotype frequencies from the underlying ones. Because the simulation data are generated according to the Θ*, we recognize that as the underlying true frequencies. Suppose the estimate haplotype set is H^{E }with frequencies Θ^{E}. Compare H^{E }with H*. Suppose the estimate frequencies of the 20 haplotypes in H* are θ_{1}^{E}, θ_{2}^{E},...,θ_{20}^{E}. We let,
Figure 4 shows the deviation of the estimate of HANAP and EMDeCODER over different number of trios (k), length of haplotypes (m) and distributions of alleleprobability (P). We can learn from the figure that the deviation of HANAP is smaller than that of EMDeCODER, which means HANAP is more accurate. We have also noticed that the deviation of the estimate increases with the length of haplotypes, and decreases with the number of trios.
Figure 3. The running time of HANAP and EMDeCODER: (a) running time vs. length of haplotypes (K = 100); (b) running time vs. number of trios (m = 100).
Two real data sets
We also test the efficiency and accuracy of HANAP on two real data sets. The first real data set is from dbMHCABDR, a set of 122 trios. Each genotype of these trios contains 31 markers of the same position on chromosome 6, 10 of which are microsatellite markers and others are SNPs. We run HANAP to find the most frequent haplotypes (with frequencies larger than 0.01). A list of 20 haplotypes is found by HANAP. Their frequencies are shown in Table 4. It only takes HANAP 0.97 second to find these haplotypes while it is out of the capability of EMDeCODER.
Table 4. The frequencies of the 20 most frequent haplotypes found by HANAP
The second data set is from the CEPH database [17], which contains 65 families; each consists of only three generations, usually with four grandparents, two parents and a number of children. Figure 5 in [15] shows a typical family with 21 nodes.
A great portion of the alleles in this data set have not been identified, and will be viewed as missing data. We carefully selected a data set of 28 families (totally 482 nodes) on a block (48 markers) from chromosome 14 (452 markers in total) with no recombination. Both HANAP and PHASE (another widely used software package [3] based on the GS algorithm) are applied to this data set. HANAP inferred 36 haplotypes with frequency larger than 0.01 and PHASE inferred 39 ones, among which 31 are common. Although we are not sure which output is closer to the real cases, the running time of HANAP (13m24s) is extremely shorter than that of PHASE (21h14m).
Discussion
Complexities of the HCLlinkage analysis haplotyping algorithm
We show that the algorithm runs in O(mn) time and O(mn) space. The preprocess need calculate no more than 3n HCLlinkages in no more than n trios. Each HCLlinkage can be computed in O(m) time, so the preprocess can be done in O(mn) time. It takes step 1 O(n) time to construct the rooted tree. In step 2 and step 3, we have to traverse the whole tree, and visit each node for no more than constant times.
When we process the HCLlinkages from the left lowest nuclear family, we should merge the d_{1 }HCLlinkages at each parent node (if we can), it need O(d_{1}m) time, here d_{1 }is the number of children in this family. We need another O(d_{1}m) time to exchange the HCLlinkage information between the two parents and transfer that to its root R_{1 }in the search tree T. So we need O(d_{1}m) time in total to process this nuclear family. When we transfer the normalized HCLlinkage set = {ψ_{1}, ψ_{2},...ψ_{k}} to the upper nuclear families, we only need to remember that all ψ_{i }is coming from R_{1}, so for each ψ_{i}, it will only take O(1) time to process RE_{i}, and O(LS_{i}) time to process LS_{i }and PH_{i}. The summation time is no more than O(k + LS_{1 }+ LS_{2 }+...+ LS_{k}) = O(m) because LS_{i }are disjoint subsets of {1,2,...,m}. In other words, the HCLlinkages in one nuclear family won't increase the processing time of its adjacent families. So the total running time to process all nuclear familes is no more than O(d_{1}m + d_{2}m +...+ d_{x}m) = O(nm), here x is the number of nuclear families in the whole pedigree.
We need another O(mn) time to complete step 4. Therefore, the time complexity of this algorithm is O(mn).
For the computation, we need to maintain a data structure to store the HCLlinkage set Ψ_{v }for each node v; we can maintain the storage always below O(d_{i}m) for nuclear family F_{i}. So the space complexity of the algorithm is also O(d_{1}m + d_{2}m +...+ d_{x}m) = O(nm).
Effectiveness of the haplotyping phase
Excoffier used the EM algorithm to estimate haplotype frequencies while ignoring the pedigree information. Here we adopt a twostage method, which tries to reduce the number of possible haplotypes to be considered in the stage of estimation by utilizing the relatives' information to do haplotyping at first.
Suppose we are estimating haplotype frequencies in trios and each haplotype consists of m biallelic SNP loci. For locus i, suppose that i happens to be one state with a probability of p_{i}, and to be the other state with a probability of (1  p_{i}). Then locus i of the genotype is heterozygous with the probability of 2p_{i}(1  p_{i}). Suppose the expected value of p_{i }is p, then the genotype is expected to have 2p(1  p)·m heterozygous loci. As a consequence, a total number of 2^{2p(1p)·m1 }possible haplotype pairs is expected to be considered if we use the EM algorithm directly. However, the probability that locus i in a trio is ambiguous is 2p_{i}(1  p_{i})·2p_{i}(1  p_{i})·2/4 = 2p_{i}^{2}(1  p_{i})^{2}. So the expected number of possible haplotype configurations for the trio is . If p = 1/2, our method can handle λ = (2p(1  p))/(2p^{2}(1  p)^{2}) = 4 times longer genotypes than Excoffier's methods. Moreover, in most cases, the more frequent allele at one locus appears with a probability of more than 0.9, so our method usually can handle λ = 1/p(1  p) > 10 times longer genotypes.
Furthermore, if each locus of the haplotype is a microsatellite locus, and it has l different alleles: a_{1}, a_{2},...,a_{l}, each appears with the probability of p_{1}, p_{2},...,p_{l}. then the expected number of possible haplotype pairs for a genotype is , the expected number of feasible haplotype configurations for a trio is , so our method usually can handle times longer genotypes. For example, when l = 8, and p_{1 }= p_{2 }= ... = p_{8 }= 1/8, λ = 64, i.e. our method can be applied to cases of much larger scale.
Conclusion
We present a twostage method to do haplotyping and to estimate haplotype frequencies for pedigree genotype data in this paper. Given a set of pedigrees, it firstly determines all feasible haplotype configurations for each pedigree, then uses the EM algorithm to estimate the haplotype frequencies based on the inferred haplotype configurations. Because a large number of illegal haplotypes have been eliminated from the possible haplotype list, our method is both more efficient and more accurate. The experimental results show that, HANAP runs much faster than EMDeCODER, and thus can be applied to much larger scale of instances. Moreover, the deviation of the estimate of HANAP is smaller than that of EMDeCODER, which means it is more accurate.
Our method suggests that pedigree information is of great importance in haplotype analysis. It can be used to speedup estimation process, and to improve estimation accuracy as well. The result also demonstrates that whole haplotype configuration space can be substitute by the space of zerorecombinant haplotype configurations in haplotype frequency estimation, especially when the considered haplotype block is relatively short.
Authors' contributions
QZ proposed the whole estimation framework and both the haplotyping and the estimation algorithms, and wrote the manuscript. YZ implemented the software and helped writing the manuscript. GC and YX participated to the design of the study and wrote the manuscript. All authors read and approved the final manuscript.
Appendix
Correctness of the HCLlinkage analysis haplotyping algorithm
First, we point out that every consistent haplotype configuration for pedigree P should be consistent with all HCLlinkages calculated by the unique partial solutions within trios, and our merge and transfer operations keep this feature during the whole process, i.e. if ζ is a haplotype configuration for P, and ζ is consistent with all the HCLlinkages in the pedigree, it should also be consistent with those after the merge and transfer operations.
Obviously, HCLlinkages are sufficient to generate a consistent haplotype configuration within a trio. We prove that:
Lemma
If it exists consistent haplotype configuration for a nuclear family (F, M, C_{1}, C_{2},...,C_{d}), every imputing schema s of arbitrarily imputing the (compound) alleles at one node can output one feasible solution ζ. Contrarily, if one imputing schema ends with incompatibleness, there is no consistent haplotype configuration.
Proof
Firstly, HCLlinkages are necessary for constructing the consistent haplotype configurations, which means that if there is a feasible solution, it should correspond to one imputing schema.
Secondly, we show that if one imputing schema outputs a feasible solution, all the schemas output feasible solutions.
In particular, for a trio (F, M, C_{1}), without loss of generality, we suppose that Step 4 starts by imputing the "alleles" of node F. For a specified "locus" i, the two alleles of F, M and C_{1 }are denoted as (a, b), (c, d) and (e, f), and the partial haplotypes from locus 1 to locus (i  1) are denoted as (ph_{1}, ph_{2}), (ph_{3}, ph_{4}) and (ph_{5}, ph_{6}).
Suppose in schema s, allele a is imputed to ph_{1}, denoted as ph_{1 }← a, and ph_{2 }← b, ph_{3 }← c, ph_{4 }← d, ph_{5 }← e, ph_{6 }← f. If s outputs a consistent haplotype configuration for trio (F, M, C), we proof that s': ph_{2 }← a, ph_{1 }← b, ph_{4 }← c, ph_{3 }← d, ph_{6 }← e, ph_{5 }← f outputs another consistent haplotype configuration.
The loci from 1 to (i  1) can be viewed as a compound locus I. Let's refer to Table 1, because we can impute a or b to ph_{1}, either or both of "locus" I and i should be ambiguous. Else, they will be linked to a bigger compound "locus" in Step 2 or 3 by HCLlinkages. Whatever a or b will be linked with ph_{1}, we can not impute that arbitrarily in Step 4. Without loss of generality, we assume that locus i is ambiguous, which means that a ≠ b, and {a, b} = {c, d} = {e, f} (please refer to Table 1). We can prove that s' is also a consistent haplotype configuration by enumeration. Because nuclear family (F, M, C_{1}, C_{2},...,C_{d}) is the intersection of trio (F, M, C_{1}), (F, M, C_{2}),...,(F, M, C_{d}), the above prove shows that both both s: ph_{1 }← a, ph_{2 }← b, ph_{3 }← c, ph_{4 }← d and s': ph_{2 }← a, ph_{1 }← b, ph_{4 }← c, ph_{3 }← d will lead to a consistent haplotype configuration for the family.
We call s to s' a walk step by switching ph_{1 }← a to ph_{1 }← b. Obviously, for any two imputing schemata s_{1 }and s_{2}, we can transfer s_{1 }to s_{2 }by consecutive walk steps. So s_{1}and s_{2 }will lead to a consistent haplotype configuration or neither can.
This lemma indicates that our algorithm works in a nuclear family. We now can prove the correctness of our HCLlinkage analysis haplotyping algorithm by induction, i.e. if there is at least one feasible solution for a general pedigree, HCLlinkages are sufficient to generate all solutions.
Suppose that the root R has multiple child mating nodes: O_{1},...,O_{r}, each represents a nuclear family; and the algorithm works in all subtrees, i.e. all of the feasible solutions for those subtrees can be directly deduced from the HCLlinkages collected from the subtrees and stored at their roots. From the former lemma, we know that if incompatibleness of type II occurs, there is no feasible solution. We assume that there is no incompatibleness of type II and there are always feasible solutions for all subtrees. Suppose that haplotype configuration ζ is consistent with all the HCLlinkages at root R (and all the HCLlinkages in the P consequently). Then ζ should be consistent haplotype configuration when restricted to any nuclear family of O_{1},...,O_{r}, and it should also be consistent with the HCLlinkages at the root of lower subtrees of O_{1},...,O_{r}. By induction, ζ should be a feasible solution when restricted to any of these subtrees. So ζ is a consistent haplotype configuration of the whole pedigree P.
Acknowledgements
This work is supported by the National Science Foundation of China under the grant No.60533020.
This article has been published as part of BMC Bioinformatics Volume 7, Supplement 4, 2006: Symposium of Computations in Bioinformatics and Bioscience (SCBB06). The full contents of the supplement are available online at http://www.biomedcentral.com/14712105/7?issue=S4.
References

Excoffier L, Slatkin M: Maximumlikelihood estimation of molecular haplotype frequencies in a diploid population.
Mol Biol Evol 1995, 12:921927. PubMed Abstract  Publisher Full Text

Hawley ME, Kidd KK: HAPLO: a program using the EM algorithm to estimate the frequencies of multisite haplotypes.
J Hered 1995, 86(5):40911. PubMed Abstract

Stephens M, Smith NJ, Donnelly P: A new statistical method for haplotype reconstruction for population data.
Am J Hum Genet 2001, 68:978989. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Niu T, Qin ZS, Xu X, Liu JS: Bayesian haplotype inference for multiple linked single nucleotide polymorphisms.
Am J Hum Genet 2002, 70:157169. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Griffiths A, Gelbart W, Lewontin R, Miller J: Modern Genetic Analysis: Integrating Genes and Genomes. New York: W.H. Freeman and Company; 2002.

Cox R, Bouzekri N, et al.: Angiotensinlconverting enzyme (ACE) plasma concentration is influenced by multiple ACElinked quantitative trait nucleotides.
Hum Mol Genet 2002, 11:29692977. PubMed Abstract  Publisher Full Text

Qian D, Beckmann L: Minimum recombinant haplotyping in pedigrees.
Am J Hum Genet 2002, 70(6):14341445. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Li J, Jiang T: Efficient rulebased haplotyping algorithms for pedigree data.

Li J, Jiang T: Efficient inference of haplotypes from genotypes on a pedigree.

Wijsman EM: A deductive method of haplotype analysis in pedigrees.
Am J Hum Genet 1987, 41(3):356373. PubMed Abstract

O'Connell JR: Zerorecombinant haplotyping: applications to fine mapping using SNPs.
Genet Epidemiol 2000, 19(Suppl 1):S6470. PubMed Abstract

Elston RC, Stewart J: A general model for the genetic analysis of pedigree data.
Human Heredity 1971, 21:523542. PubMed Abstract

Fallin D, Schork NJ: Accuracy of haplotype frequency estimation for biallelic loci, via the expectation maximization algorithm for unphased diploid genotype data.
Am J Hum Genet 2000, 67:947959. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Zhao H, et al.: Transmission/disequilibrium tests using multiple tightly linked markers.
Am J Hum Genet 2000, 67:936946. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Zhang Q, Chin FYL, Shen H: Minimum ParentOffspring Recombination Haplotype Inference in Pedigrees.
Transactions on Computational Systems Biology LNBI 36800100 2005, 2:10012.

Zhang Q, Che H, Zhou Z, Chen G: Comparative study on different approaches to in silico haplotyping. In Technical report. Dept of Computer Science and Technology, University of Science and Technology of China; 2003.

The CEPH genotype database [http://www.cephb.fr/] webcite