Estimate haplotype frequencies in pedigrees

Zhang, Qiangfeng; Zhao, Yuzhong; Chen, Guoliang; Xu, Yun

doi:10.1186/1471-2105-7-S4-S5

Volume 7 Supplement 4

Symposium of Computations in Bioinformatics and Bioscience (SCBB06)

Research
Open access
Published: 12 December 2006

BMC Bioinformatics volume 7, Article number: S5 (2006)

Abstract

Background

Haplotype analysis has gained increasing attention in recent years in the context of association studies of disease genes and drug responsiveness. The potential use of haplotypes led to the initiation of the HapMap project, which investigates haplotype patterns in the human genome in different populations. Haplotype inference and frequency estimation are essential components of this endeavour.

Results

We present a two-stage method for estimating haplotype frequencies in pedigrees, consisting of a haplotyping stage and an estimation stage. In the haplotyping stage, we propose a linear-time algorithm that determines all zero-recombinant haplotype configurations for each pedigree. In the estimation stage, we use the expectation-maximization (EM) algorithm to estimate haplotype frequencies from these haplotype configurations. Experiments show that our method runs much faster and gives more reliable estimates than popular haplotype analysis software that discards the pedigree information.

Conclusion

Our results suggest that pedigree information is of great importance in haplotype analysis: it can be used both to speed up the estimation process and to improve estimation accuracy. The results also show that the whole haplotype configuration space can be replaced by the space of zero-recombinant haplotype configurations in haplotype frequency estimation, especially when the haplotype block under consideration is relatively short.

Background

The modelling of human genetic variation is critical to understanding the genetic basis of complex diseases. Single nucleotide polymorphisms (SNPs) are the most frequent form of variation. The Human Genome Project and other large-scale efforts have identified millions of SNP markers. Although each marker can be analyzed independently, it is more informative to analyze them in groups. It is therefore useful to analyze haplotypes (haploid genotypes), which are sequences of linked markers on a single chromosome. In diploid organisms, such as human beings, chromosomes come in pairs, and experiments often yield genotype information, which blends the haplotype information of a chromosome pair. There is growing evidence that, in order to better characterize the role of a candidate gene, full haplotype information should be exploited instead of genotype information alone. Unfortunately, deriving haplotype information experimentally is both time-consuming and expensive. This explains the increasing interest in inferring haplotype information computationally, a task known as haplotyping. In fact, the potential use of haplotypes led to the initiation of the HapMap project, which investigates haplotype patterns in the human genome in different populations. Haplotype inference and frequency estimation are essential components of this endeavour.

Genotype data may come without or with pedigree information: the former is called population genotype data and the latter pedigree genotype data. A large number of algorithms have been designed to estimate haplotype frequencies from population data [1–4]. Among them, EM algorithms are the most popular because of their interpretability and stability.

For any given pedigree genotype data, we can certainly discard the pedigree information and simply take the genotype sequences as the input of EM estimation algorithms for population data. However, it is well accepted that information obtained by analyzing pedigree genotype data is more reliable: the constraint provided by other members in a pedigree would force one genotype to settle on a unique haplotype pair as being most probable.

Here we propose a two-stage method to estimate haplotype frequencies in pedigrees. The first stage is the haplotyping stage, which finds all feasible haplotype configurations for each pedigree. In the second stage, the estimation stage, we use the EM algorithm to estimate haplotype frequencies from the haplotype configurations inferred in the first stage.

In general, haplotyping a pedigree requires considering the entire solution space of all consistent haplotype configurations. However, genomic DNA can be partitioned into long blocks such that recombinations within each block are rare or even nonexistent [5, 6]. It is therefore believed that haplotype configurations with fewer recombinations should be preferred in haplotype inference [7–9]. When the region of interest is so small that the expected number of recombinations in the pedigree data is very close to zero, the solution space of all consistent haplotype configurations can be replaced by that of zero-recombinant configurations (provided it is non-empty) for the purpose of estimating haplotype frequencies. The reason is that the contribution of configurations with recombinations to the overall likelihood is very small compared with that of zero-recombinant configurations, while they add considerable complexity to the computation. We are therefore interested in finding the consistent zero-recombinant haplotype configurations.

Wijsman [10] proposed a 20-rule algorithm, and O'Connell [11] described a genotype-elimination algorithm, both of which can be used to find zero-recombinant haplotype configurations for pedigrees. Recently, Li and Jiang [8, 9] showed that the problem can be solved in polynomial time. Here we propose an algorithm that finds zero-recombinant haplotype configurations in linear time using a technique called HCL-linkage analysis.

In the second stage, we use the EM algorithm to estimate haplotype frequencies from the haplotype configurations obtained in the haplotyping stage. We employ Hardy-Weinberg equilibrium to obtain the probabilities of founder genotypes and use a genetic model [12] to deduce the transmission probabilities of non-founders. The likelihood of each configuration is computed by multiplying the probabilities of its genotypes, and the frequency of each haplotype that appears in the configuration is calculated by a gene-counting method.

We implemented all the algorithms in a C software package named HANAP (Haplotype ANAlysis in Pedigrees) and tested its effectiveness and efficiency on both simulated and real data sets. The experimental results show that our method runs much faster than direct frequency estimation software that discards the pedigree information. Moreover, because our method uses this information, its estimates are more reliable.

Methods

Haplotyping stage: haplotyping algorithm based on HCL-linkage analysis

Excoffier's EM algorithm has been widely applied in haplotype analysis [14, 15]. Unfortunately, it must compute the frequencies of all possible haplotype pairs consistent with each given genotype, which requires a prohibitive amount of storage when the haplotype length grows beyond 20 loci [16]. O'Connell [11] showed that genetic information from relatives can be used to resolve the ambiguity of a genotype, and thus to reduce the number of haplotypes that must be considered. However, O'Connell's method has exponential time complexity. Recently, Li and Jiang [8, 9] showed that, for any genotype in a given pedigree, the ambiguity can be resolved in cubic time (O(m³n³)), where n is the number of members in the pedigree and m is the number of loci in each genotype. Here we present a method, called HCL-linkage analysis, that performs haplotyping in linear time (O(mn)).

HCL-linkage definition

Trios are simple pedigrees that contain only a pair of parents and a child. A consistent zero-recombinant haplotype configuration for a general pedigree must also be a consistent zero-recombinant haplotype configuration when restricted to each trio in the pedigree. Consider a trio T = (F, M, C), where F is the father, M is the mother and C is the child, and suppose that at locus i the members F, M and C have alleles {a, b}, {c, d} and {e, f} respectively (note that {e, f} ⊂ ({a, b} ∪ {c, d})). The genotype of C at this locus can be homozygous or heterozygous. If it is homozygous (e = f), the paternal allele and the maternal allele are clearly the same. The situation is more complicated at a heterozygous site (e ≠ f). Table 1 lists all possible situations. Given the locus genotypes of the three members, it may or may not be possible to determine the paternal allele and the maternal allele of the child. We call a locus ambiguous if its inheritance relationship cannot be resolved.
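The single-locus case analysis can be sketched in a few lines of Python. This is only an illustration of the idea, not the HANAP implementation: the function names `resolve_locus` and `is_ambiguous` are hypothetical, and genotypes are assumed to be given as 2-tuples of allele labels.

```python
from itertools import product

def resolve_locus(father, mother, child):
    """Return every (paternal allele, maternal allele) assignment for the
    child that is consistent with Mendelian inheritance at one locus.
    Each argument is a 2-tuple of alleles (an unordered genotype)."""
    options = set()
    for p, m in product(set(father), set(mother)):
        # The child must carry exactly one allele from each parent.
        if sorted((p, m)) == sorted(child):
            options.add((p, m))
    return options

def is_ambiguous(father, mother, child):
    """A locus is ambiguous when more than one assignment is possible."""
    return len(resolve_locus(father, mother, child)) > 1
```

An empty result signals a Mendelian inconsistency in the trio; a single element means the paternal and maternal alleles of the child are determined.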

Table 1 Imputing the paternal allele and the maternal allele for the child at a single locus

In fact, for any trio, ignoring the ambiguous loci, the consistent (partial) haplotype configuration for the unambiguous loci is unique and specifies a linkage of alleles on some heterozygous loci for each node in the trio. We define such linkage as HCL-linkages (linkages of Haplotype Configuration on the non-ambiguous Loci).

Definition

An HCL-linkage ψ is a quadruplet <v, RE, LS, PH> defined on node v and specified by the unique consistent (partial) haplotype configuration within a trio that contains v. Here v denotes the node to which the HCL-linkage belongs. LS = {a₁,...,a_l} is the set of heterozygous loci at which the haplotype configuration has been inferred. PH = {ph, ph'} = {(h₁...h_l), (h₁'...h_l')} records the two (partial) haplotypes imputed on these loci. RE = (R, R') denotes that ph (respectively ph') is inherited from, or will be passed on to, the nodes in set R (respectively R').

An HCL-linkage describes the partial haplotype configuration of a node and the inheritance relationship between the parents and their children. Under our definition, we can conclude that every haplotype configuration should be consistent with any HCL-linkage specified by each trio in the pedigree.
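As an illustration of the definition, the quadruplet can be held in a small record and built for the child of a trio by keeping only the heterozygous loci whose inheritance is resolved. The sketch below is an assumption-laden example rather than the authors' data structure: the class name `HCLLinkage`, the helper `child_linkage`, the default node names and the use of 0-based locus indices are all hypothetical.

```python
from dataclasses import dataclass
from itertools import product

@dataclass
class HCLLinkage:
    node: str            # v: the node the linkage belongs to
    r: frozenset         # R: relatives sharing partial haplotype ph
    r_prime: frozenset   # R': relatives sharing partial haplotype ph'
    ph: dict             # locus index -> allele on ph
    ph_prime: dict       # locus index -> allele on ph'

    @property
    def loci(self):
        """LS: the heterozygous loci covered by this linkage."""
        return set(self.ph)

def child_linkage(father, mother, child, names=("F", "M", "C")):
    """Build the child's linkage from per-locus genotypes (lists of 2-tuples)."""
    ph, ph_prime = {}, {}
    for i, (f, m, c) in enumerate(zip(father, mother, child)):
        if c[0] == c[1]:
            continue  # homozygous loci carry no phase information
        options = {(p, q) for p, q in product(set(f), set(m))
                   if sorted((p, q)) == sorted(c)}
        if not options:
            raise ValueError(f"Mendelian inconsistency at locus {i}")
        if len(options) == 1:          # unambiguous locus
            p, q = next(iter(options))
            ph[i], ph_prime[i] = p, q  # ph comes from the father, ph' from the mother
    return HCLLinkage(names[2], frozenset({names[0]}), frozenset({names[1]}),
                      ph, ph_prime)
```

For a child, R = {F} and R' = {M} record that ph was inherited from the father and ph' from the mother; ambiguous loci are simply left out of LS.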

Merge and transfer operations over HCL-linkages

In the case of multiple generations and multiple children, loci on one node may be linked by different HCL-linkages. HCL-linkages of the same node should be merged if they can. There are three cases when merging two HCL-linkages ψ₁ = <v, (R₁, R₁'), LS₁, {ph₁, ph₁'}> and ψ₂ = <v, (R₂, R₂'), LS₂, {ph₂, ph₂'}> on node v.

Case (1): (R₁ ∪ R₁') ∩ (R₂ ∪ R₂') ≠ Φ

1.a) R₁ ∩ R₂ ≠ Φ or R₁' ∩ R₂' ≠ Φ. This means that ph₁ and ph₂ are both shared with the nodes in R₁ and R₂, and ph₁' and ph₂' with the nodes in R₁' and R₂', so ph₁ and ph₂ must lie on the same haplotype, and ph₁' and ph₂' on the other:

i) LS₁ ∩ LS₂ = Φ, or LS₁ ∩ LS₂ ≠ Φ but ph₁ equals ph₂ when restricted to the loci in LS₁ ∩ LS₂. Then ψ₁ and ψ₂ are compatible and should be merged into ψ = <v, (R₁ ∪ R₂, R₁' ∪ R₂'), LS₁ ∪ LS₂, {ph₁ ∪ ph₂, ph₁' ∪ ph₂'}>, where ph₁ ∪ ph₂ denotes the longer partial haplotype whose alleles equal those of ph₁ and ph₂ when restricted to the loci in LS₁ and LS₂ respectively;

ii) LS₁ ∩ LS₂ ≠ Φ and ph₁ differs from ph₂ when restricted to LS₁ ∩ LS₂. Then ψ₁ and ψ₂ are incompatible, i.e. no haplotype configuration can satisfy both HCL-linkages at the same time.

1.b) R₁ ∩ R₂' ≠ Φ or R₁' ∩ R₂ ≠ Φ. This means that ph₁ and ph₂' must lie on the same haplotype, and ph₁' and ph₂ on the other. Similarly, when they are compatible, ψ₁ and ψ₂ can be merged into ψ = <v, (R₁ ∪ R₂', R₁' ∪ R₂), LS₁ ∪ LS₂, {ph₁ ∪ ph₂', ph₁' ∪ ph₂}>.

Case (2): (R₁ ∪ R₁') ∩ (R₂ ∪ R₂') = Φ, but LS₁ ∩ LS₂ ≠ Φ,

2.a) ph₁ equals ph₂ (and consequently ph₁' equals ph₂'), or ph₁ equals ph₂' (and then ph₁' equals ph₂), when restricted to LS₁ ∩ LS₂. Then ψ₁ and ψ₂ are compatible and should be merged into ψ = <v, (R₁ ∪ R₂, R₁' ∪ R₂'), LS₁ ∪ LS₂, {ph₁ ∪ ph₂, ph₁' ∪ ph₂'}> or ψ = <v, (R₁ ∪ R₂', R₁' ∪ R₂), LS₁ ∪ LS₂, {ph₁ ∪ ph₂', ph₁' ∪ ph₂}> respectively.

2.b) Otherwise, ph₁ equals neither ph₂ nor ph₂' when restricted to LS₁ ∩ LS₂, and ψ₁ and ψ₂ are incompatible.

Case (3): (R₁ ∪ R₁') ∩ (R₂ ∪ R₂') = Φ, and LS₁ ∩ LS₂ = Φ,

In this case, ψ₁ and ψ₂ cannot be merged, and both should be recorded in an HCL-linkage set Ψ_v for node v.

With the merge operation, we can define the normalization of a set of HCL-linkages: normalizing a set Ψ_v of HCL-linkages on node v means repeatedly applying the merge operation to pairs of HCL-linkages in Ψ_v until, ∀ψ_i, ψ_j ∈ Ψ_v with i ≠ j, (R_i ∪ R_i') ∩ (R_j ∪ R_j') = Φ and LS_i ∩ LS_j = Φ. Ψ_v is then said to be normalized. From now on, unless stated otherwise, Ψ_v is normalized after every change.
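The three merge cases and the normalization loop can be sketched as follows. This is a minimal illustration under stated assumptions, not the linear-time procedure of the paper: the names `Linkage`, `Incompatible`, `merge` and `normalize` are hypothetical, the node v is left implicit because both linkages belong to the same node, and the simple loop below may take quadratic time in the number of linkages.

```python
from typing import NamedTuple

class Linkage(NamedTuple):
    r: frozenset        # R: relatives sharing partial haplotype ph
    r_prime: frozenset  # R': relatives sharing partial haplotype ph'
    ph: dict            # locus -> allele on ph
    ph_prime: dict      # locus -> allele on ph'

class Incompatible(Exception):
    """No haplotype configuration satisfies both linkages."""

def _flip(x):
    # Swap the roles of ph and ph' (and of R and R').
    return Linkage(x.r_prime, x.r, x.ph_prime, x.ph)

def _agree(p, q, loci):
    return all(p[i] == q[i] for i in loci)

def merge(a, b):
    """Merge two linkages on the same node; return None if they stay separate."""
    shared = a.ph.keys() & b.ph.keys()
    same = bool(a.r & b.r) or bool(a.r_prime & b.r_prime)    # case 1.a
    cross = bool(a.r & b.r_prime) or bool(a.r_prime & b.r)   # case 1.b
    if same and cross:
        raise Incompatible("relatives force contradictory orientations")
    if cross:
        b = _flip(b)
    elif not same:
        if not shared:
            return None                    # case 3: keep both linkages
        if not _agree(a.ph, b.ph, shared):
            b = _flip(b)                   # case 2: try the other orientation
    if not _agree(a.ph, b.ph, shared):
        raise Incompatible("partial haplotypes disagree on shared loci")
    return Linkage(a.r | b.r, a.r_prime | b.r_prime,
                   {**a.ph, **b.ph}, {**a.ph_prime, **b.ph_prime})

def normalize(linkages):
    """Merge repeatedly until no two linkages in the set can be merged."""
    result = []
    for new in linkages:
        merged_something = True
        while merged_something:
            merged_something = False
            for i, old in enumerate(result):
                merged = merge(old, new)
                if merged is not None:
                    new = merged
                    del result[i]
                    merged_something = True
                    break
        result.append(new)
    return result
```

Only ph is compared on the shared loci: because every locus in LS is heterozygous, agreement of ph₁ with ph₂ implies agreement of ph₁' with ph₂'.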

Like genetic information, HCL-linkages are passed on from generation to generation. Without loss of generality, we define the transfer of HCL-linkage information from a child C to its parent F; the case from F to C is similar. Let Ψ_C and Ψ_F be the normalized HCL-linkage sets of C and F respectively, and let HS be the set of homozygous loci of F. The transfer of Ψ_C from C to F results in changes to Ψ_F, where each ψ_C = <C, (R_C, R_C'), LS_C, {ph_C, ph_C'}> ∈ Ψ_C is transferred independently. There are two cases to consider.

Case (1): F ∈ R_C (respectively F ∈ R_C'). Add ψ_F = <F, ({C}, Φ), LS_C - HS, {ph_F, ph_F'}> to Ψ_F, where ph_F equals ph_C (respectively ph_C') restricted to the loci in LS_C - HS, and ph_F' is the complementary partial haplotype of ph_F consistent with genotype g_F.

Case (2): F ∉ R_C ∪ R_C'. i) If both ph_C and ph_C' are consistent with the partial genotype g_F when restricted to the loci in LS_C, add ψ_F = <F, (Φ, Φ), LS_C - HS, {ph_F, ph_F'}> to Ψ_F, where ph_F and ph_F' equal ph_C and ph_C' restricted to the loci in LS_C - HS; ii) if ph_C' (respectively ph_C) is not consistent with the partial genotype g_F when restricted to the loci in LS_C, add ψ_F = <F, ({C}, Φ), LS_C - HS, {ph_F, ph_F'}> to Ψ_F, where ph_F equals ph_C (respectively ph_C') restricted to the loci in LS_C - HS and ph_F' is the complementary partial haplotype of ph_F consistent with genotype g_F. Note that at least one of ph_C and ph_C' must be consistent with the partial genotype g_F.

Recall that Ψ_F must be normalized whenever a new HCL-linkage is added to it. When an HCL-linkage ψ_F is transferred from F to C, so that ψ_C = <C, (R_C, R_C'), LS_C, {ph_C, ph_C'}> is added to Ψ_C, M should be added to R_C' whenever it has been determined that F ∈ R_C.
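The child-to-parent transfer can be sketched as a single function covering both cases. This is a hedged illustration only: the names `Linkage` and `transfer_to_parent` are hypothetical, the parent's genotype is assumed to be a dictionary from locus to a 2-tuple of alleles, and normalization of the receiving set is left to the caller.

```python
from typing import NamedTuple

class Linkage(NamedTuple):
    r: frozenset        # R
    r_prime: frozenset  # R'
    ph: dict            # locus -> allele on ph
    ph_prime: dict      # locus -> allele on ph'

def transfer_to_parent(psi_c, child, parent, g_parent):
    """Transfer one linkage of `child` to `parent`.
    g_parent maps each locus to the parent's genotype, a 2-tuple of alleles."""
    def consistent(ph):
        return all(ph[i] in g_parent[i] for i in ph)

    if parent in psi_c.r:                      # case 1: ph came from this parent
        source, link = psi_c.ph, frozenset({child})
    elif parent in psi_c.r_prime:              # case 1: ph' came from this parent
        source, link = psi_c.ph_prime, frozenset({child})
    else:                                      # case 2: origin not yet known
        ok, ok_prime = consistent(psi_c.ph), consistent(psi_c.ph_prime)
        if ok and ok_prime:                    # 2.i: either could be the parent's
            source, link = psi_c.ph, frozenset()
        elif ok:                               # 2.ii: only ph fits the parent
            source, link = psi_c.ph, frozenset({child})
        elif ok_prime:                         # 2.ii: only ph' fits the parent
            source, link = psi_c.ph_prime, frozenset({child})
        else:
            raise ValueError("neither partial haplotype fits the parent")
    if not consistent(source):
        raise ValueError("transferred haplotype contradicts the parent's genotype")

    ph, ph_prime = {}, {}
    for i, allele in source.items():
        x, y = g_parent[i]
        if x == y:
            continue                           # drop the parent's homozygous loci (HS)
        ph[i] = allele
        ph_prime[i] = y if allele == x else x  # complementary allele of g_parent
    return Linkage(link, frozenset(), ph, ph_prime)
```

In case 2.i the complementary haplotype computed from g_F coincides with ph_C' on LS_C - HS, because the child is heterozygous on those loci and both of its alleles occur in the parent's genotype.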

The merge and transfer operations neither introduce nor lose any HCL-linkage information needed to build consistent haplotype configurations.

Main HCL-linkages analysis haplotyping algorithm

Before the algorithm, we preprocess each trio in the pedigree. Whenever a trio specifies an HCL-linkage for node v, it will be stored in the HCL-linkage set Ψ_v. The objective of the algorithm is to collect the complete HCL-linkage information for each node, which is accomplished by traversing the tree twice.

First, we convert the input pedigree into a rooted search tree T, rooted at an arbitrary node R (Step 1). We then traverse T in post-order to transfer and merge the HCL-linkage information of each node from its relatives (Step 2), starting from the lowest leftmost nuclear family F_o. The HCL-linkages in nuclear family F_o are merged at both parents and then transferred to the root of the sub-tree. The same operations are carried out in its parental nuclear family, on the HCL-linkages specified in that family as well as on those transferred from its child families. Finally, all HCL-linkages are collected at the root R. In Step 3, we traverse T again, in pre-order, and transfer the linkages in the other direction, from R to its farthest descendants.
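The two traversals follow a common collect-then-distribute pattern, which the toy sketch below illustrates. It is deliberately abstract and makes several assumptions: the function name `two_pass` is hypothetical, the tree is given as an adjacency dictionary, and plain set union stands in for the transfer and merge operations of the actual algorithm.

```python
def two_pass(adjacency, root, local):
    """Propagate information over a tree in two sweeps.
    adjacency: node -> list of neighbouring nodes (an undirected tree)
    local:     node -> set of items initially known at that node"""
    # Step 1: root the tree; `order` lists every parent before its children.
    parent, order, stack = {root: None}, [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        for u in adjacency[v]:
            if u not in parent:
                parent[u] = v
                stack.append(u)

    info = {v: set(local[v]) for v in adjacency}
    # Step 2: upward sweep (children before parents) collects everything at the root.
    for v in reversed(order):
        if parent[v] is not None:
            info[parent[v]] |= info[v]
    # Step 3: downward sweep (parents before children) distributes it again.
    for v in order:
        if parent[v] is not None:
            info[v] |= info[parent[v]]
    return info
```

Each node is touched a constant number of times per sweep, which is the property the linear running time relies on.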

After Step 3, the HCL-linkage set of each node preserves all HCL-linkages in the pedigree. In Step 4, we choose a node v arbitrarily. The set Ψ_v contains several HCL-linkages ψ₁, ψ₂,...,ψ_l defined on disjoint locus sets LS₁, LS₂,...,LS_l. When a set of loci is linked by one HCL-linkage, the loci can be viewed as a compound locus, and the two partial haplotypes as two compound alleles. These "loci" (and "alleles") are treated in the same way as the other heterozygous and homozygous loci that are not involved in any HCL-linkage. We arbitrarily select one of the two alleles at each locus to form a haplotype; the remaining alleles form the other haplotype. Such a selection is called an imputing schema. Once the haplotype configuration of one node is determined, it can be used to determine the configurations of its relatives, and eventually those of the whole pedigree.
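Enumerating every imputing schema at one node can be sketched as below. The function name `haplotype_pairs` and the input layout are assumptions made for illustration: the genotype is a list of 2-tuples, and each linkage is given as a pair of dictionaries (ph, ph') over disjoint locus sets.

```python
from itertools import product

def haplotype_pairs(genotype, linkages):
    """List all haplotype pairs of one node that respect its linkages.
    genotype: list of (allele, allele) per locus
    linkages: list of (ph, ph_prime) dicts, locus -> allele, on disjoint loci"""
    n = len(genotype)
    linked = set()
    for ph, _ in linkages:
        linked |= ph.keys()

    base = {}                      # alleles fixed by homozygous loci
    units = list(linkages)         # compound loci with two compound alleles
    for i, (a, b) in enumerate(genotype):
        if a == b:
            base[i] = a
        elif i not in linked:
            units.append(({i: a}, {i: b}))   # a free heterozygous locus

    if not units:                  # fully homozygous genotype
        h = tuple(base[i] for i in range(n))
        return [(h, h)]

    pairs = []
    # Fix the orientation of the first unit so each unordered pair appears once.
    for flips in product([False, True], repeat=len(units) - 1):
        h1, h2 = dict(base), dict(base)
        for (x, y), flip in zip(units, (False,) + flips):
            if flip:
                x, y = y, x
            h1.update(x)
            h2.update(y)
        pairs.append((tuple(h1[i] for i in range(n)),
                      tuple(h2[i] for i in range(n))))
    return pairs
```

With k compound or free heterozygous "loci" the function returns 2^(k-1) pairs, compared with 2^(h-1) pairs for h heterozygous loci when no linkage information is used.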

During the algorithm, an incompatibility may occur when normalizing an HCL-linkage set Ψ_v. In that case we declare that there is no solution and exit immediately. Even in Step 4, an incompatibility may still occur when the haplotypes of the parents are used to resolve the genotypes of the children, in the case that a node has multiple children. Figure 1 shows an example. The key point is that if there exists a consistent haplotype configuration for a nuclear family (F, M, C₁, C₂,...,C_d), every imputing schema s yields a feasible solution ζ. Conversely, if one imputing schema ends with an incompatibility, all other schemata fail too. We prove this in the appendix.

The time complexity and space complexity of our algorithm are both O(mn), where n is the number of members in the pedigree and m is the number of loci.

Frequency estimation stage

Suppose that we are given K pedigrees P = {P₁, P₂,...,P_K}. Each P_i consists of n_i nodes v_i,j (1 ≤ i ≤ K, 1 ≤ j ≤ n_i), of which the first n_i' are founders. The genotype of node v_i,j is g_i,j. Suppose that there are π_i consistent solutions for pedigree P_i and that the s-th solution is S_s,i = <S_s,i,1, S_s,i,2,..., S_s,i,n_i> (1 ≤ s ≤ π_i), where S_s,i,j = <α_s,i,j,1, β_s,i,j,2> is a haplotype pair of genotype g_i,j. All haplotypes that appear in these solutions form a list of haplotypes H = {h₁, h₂,...,h_l} with frequencies Θ = {θ₁, θ₂,...,θ_l}, where θ₁ + θ₂ + ... + θ_l = 1; Θ is what we want to estimate.

The likelihood of haplotype frequencies given the observed pedigree data is,
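With π_i consistent solutions S_{s,i} for pedigree P_i, and treating the K pedigrees as independent, one natural way to write this likelihood is given below. It is a form consistent with the surrounding definitions, offered as a sketch rather than as the authors' exact expression.

```latex
L(\Theta) \;=\; \Pr(P \mid \Theta)
\;=\; \prod_{i=1}^{K} \; \sum_{s=1}^{\pi_i} \Pr\!\left(S_{s,i} \mid \Theta\right)
```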

Under the assumption of random mating, the paternal haplotype configuration and the maternal haplotype configuration are independent, and the child's haplotype configuration is transmitted from its parents. We have:
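A factorization consistent with this description is sketched below. The symbols f(j) and m(j), standing for the father and the mother of node v_{i,j}, are notation assumed here for illustration only.

```latex
\Pr\!\left(S_{s,i} \mid \Theta\right)
\;=\; \prod_{j=1}^{n_i'} \Pr\!\left(S_{s,i,j} \mid \Theta\right)
\;\cdot\; \prod_{j=n_i'+1}^{n_i}
\Pr\!\left(S_{s,i,j} \,\middle|\, S_{s,i,f(j)},\, S_{s,i,m(j)}\right)
```

The first product runs over the founders and the second over the non-founders of pedigree P_i.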

Here Pr(S_s,i,j|Θ) is the probability of the haplotype configuration of a founder node, which can be computed using Hardy-Weinberg equilibrium. The other term, the gamete transmission probability of the haplotype configuration S_s,i,j of a non-founder node given the haplotype configurations of its father and mother, can be computed using the genetic model presented by Elston and Stewart [12].

The EM algorithm estimates the haplotype frequencies Θ starting from arbitrary initial values Θ⁽⁰⁾ = {θ₁⁽⁰⁾, θ₂⁽⁰⁾,...,θ_l⁽⁰⁾}. These initial values are used as if they were the unknown true frequencies to estimate the solution frequencies Pr(S_s,i|Θ) (the expectation step). The expected solution frequencies are used in turn to estimate the haplotype frequencies of the next iteration, Θ⁽¹⁾ = {θ₁⁽¹⁾, θ₂⁽¹⁾,...,θ_l⁽¹⁾} (the maximization step), and so on, until convergence is reached.

Suppose that in the r-th iteration Θ = Θ^(r), and we want to estimate Θ^(r+1). Then we have:

Let δ_s,i,t be the number of times haplotype h_t appears in solution S_s,i. Then the haplotype frequencies can be computed using a gene-counting method,

There are several ways to initialize the haplotype frequencies Θ = {θ₁, θ₂,...,θ_l}. For instance, the initial haplotype frequencies can be chosen at random, or all haplotypes can be made equally frequent, i.e. θ_t⁽⁰⁾ = 1/l (t = 1, 2,...,l). Alternatively, each initial haplotype frequency can be set to the product of the corresponding single-locus allele frequencies (i.e., complete linkage equilibrium). We can also set all feasible solutions of each pedigree to be equally likely, i.e. Pr(S_s,i|Θ⁽⁰⁾) = 1/π_i (s = 1, 2,...,π_i), or initialize the haplotype frequencies by counting their occurrences in all the feasible solutions. Since in practice the EM algorithm can become trapped in a local maximum, we recommend restarting the algorithm several times with different initial haplotype frequencies, preferably with a randomized additive perturbation.

The stopping (convergence) criterion is defined as the absolute value of the difference of Θ between consecutive iterations being less than some small value ε > 0.
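The estimation stage as a whole can be sketched in Python as follows. This is a minimal illustration under explicit assumptions, not the HANAP code: all function names are hypothetical; each solution maps a node to an ordered pair (haplotype from father, haplotype from mother); founders follow Hardy-Weinberg proportions; under zero recombination a parent is assumed to transmit each of its two haplotypes with probability 1/2, a simplification of the Elston-Stewart model; gene counting is done over founder haplotypes only, since the likelihood depends on Θ only through the founders; and the frequencies start out uniform.

```python
def founder_prob(pair, theta):
    """Hardy-Weinberg probability of an unordered founder haplotype pair."""
    a, b = pair
    return theta[a] ** 2 if a == b else 2 * theta[a] * theta[b]

def transmit(parent_pair, h):
    """Probability that a parent passes haplotype h on (no recombination)."""
    return 0.5 * (parent_pair[0] == h) + 0.5 * (parent_pair[1] == h)

def solution_prob(solution, parents, theta):
    """solution: node -> (hap from father, hap from mother)
    parents:  non-founder node -> (father, mother)"""
    p = 1.0
    for node, pair in solution.items():
        if node in parents:
            f, m = parents[node]
            p *= transmit(solution[f], pair[0]) * transmit(solution[m], pair[1])
        else:
            p *= founder_prob(pair, theta)
    return p

def em(pedigrees, max_iter=200, eps=1e-9):
    """pedigrees: list of (parents, list of feasible solutions)."""
    haps = sorted({h for par, sols in pedigrees for s in sols
                   for node, pair in s.items() if node not in par
                   for h in pair})
    theta = {h: 1.0 / len(haps) for h in haps}          # uniform start
    for _ in range(max_iter):
        counts = dict.fromkeys(haps, 0.0)
        for par, sols in pedigrees:
            # E-step: posterior weight of each feasible solution.
            weights = [solution_prob(s, par, theta) for s in sols]
            total = sum(weights)
            for s, w in zip(sols, weights):
                for node, pair in s.items():
                    if node not in par:                 # founders only
                        for h in pair:
                            counts[h] += w / total
        # M-step: gene counting.
        n = sum(counts.values())
        new_theta = {h: c / n for h, c in counts.items()}
        converged = max(abs(new_theta[h] - theta[h]) for h in haps) < eps
        theta = new_theta
        if converged:
            break
    return theta
```

Restarting `em` from several perturbed starting points, as recommended above, would only require passing the initial frequencies in as an extra argument.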

Results

Simulated data set

In order to generate a pedigree genotype data set for the simulation experiments, we first generate a population of haplotypes H*, where each locus of each haplotype is set to an allele according to a probability distribution P. In our simulation, we generated haplotypes of SNP loci as well as haplotypes of micro-satellite loci. A biallelic SNP locus i takes one state with probability p_i and the other state with probability (1 - p_i). A micro-satellite locus has w different alleles a₁, a₂,...,a_w, which appear with probabilities p₁, p₂,...,p_w (p₁ + p₂ +...+ p_w = 1).

Each founder node in a tested pedigree is randomly assigned a pair of haplotypes according to their frequencies Θ*. The two haplotypes of a non-founder node are selected at random from those of its parents (one from the father, one from the mother). Finally, the pair of haplotypes of each node is blended to form the genotype of that node.
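For a single trio, this generation procedure can be sketched as below. The function names `blend` and `simulate_trio` are hypothetical, haplotypes are assumed to be tuples of allele labels, and the caller supplies a seeded `random.Random` instance so that runs are reproducible.

```python
import random

def blend(pair):
    """Turn a haplotype pair into an unphased genotype (one sorted tuple per locus)."""
    return [tuple(sorted(alleles)) for alleles in zip(*pair)]

def simulate_trio(haplotypes, freqs, rng):
    """Draw founders from the haplotype pool and transmit without recombination."""
    father = tuple(rng.choices(haplotypes, weights=freqs, k=2))
    mother = tuple(rng.choices(haplotypes, weights=freqs, k=2))
    child = (rng.choice(father), rng.choice(mother))   # one haplotype from each parent
    haps = {"F": father, "M": mother, "C": child}
    genotypes = {node: blend(pair) for node, pair in haps.items()}
    return haps, genotypes
```

Larger pedigrees follow the same pattern: founders are drawn from the pool and every non-founder is generated after both of its parents.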

All experiments were conducted on a Windows server with a 1.7 GHz CPU and 256 MB of RAM. For each parameter setting, 100 data sets were randomly generated, and performance was evaluated by averaging over these 100 runs.

Running time of the haplotyping algorithm

One of the main contributions of this paper is haplotyping in linear time, so we first examine the running time with respect to the number of nodes in each pedigree (n) and the number of loci in each sequence (m).

Several different tree pedigree structures are used in the simulation. The first pedigree is Figure 1 in [15], a tree with 13 nodes. The second is Figure 8 in [9], a tree with 29 nodes. The third is the 21-node pedigree from Figure 5 of [15]. The results are given in Figure 2. They show that the running time of our HCL-linkage analysis haplotyping algorithm grows linearly, so the algorithm can be applied to large-scale haplotype analysis.

Number of solutions

We compare the numbers of haplotypes that should be considered in the estimation stage, with and without the haplotyping stage. In our experiment, we set P₁ (p₁ = p₂ = 0.5) and P₂ (p₁ = 0.9, p₂ = 0.1) for SNP loci, and set w = 4, P₃ (p₁ = p₂ = p₃ = p₄ = 0.25) and P₄ (p₁ = 0.5, p₂ = p₃ = 0.2, p₄ = 0.1) for micro-satellite loci. We let |H*| = 20, and θ₁* = 0.2, θ₂* = θ₃* = θ₄* = 0.1, θ₅* = θ₆* = θ₇* = θ₈* = 0.05, θ₉* = θ₁₀* = ... = θ₂₀* = 0.025.

When only trio pedigrees are considered, the average numbers of haplotypes are recorded in Table 2. The table shows that the number of haplotypes that must be estimated is greatly reduced by the haplotyping stage (HANAP versus direct estimation), which immediately improves the running time.

Table 2 Comparison of number of haplotypes (|H|) on trio pedigrees

We also consider a more complex pedigree that contains 13 nodes (Figure 1 of [15]). The average numbers of haplotypes are recorded in Table 3. The number of haplotypes that must be estimated is smaller still. The number of haplotypes grows with the length of the haplotypes and the number of pedigrees, but only very slowly.

Table 3 Comparison of number of haplotypes (|H|) on a general pedigree

Running time of HANAP

EM-DeCODER is a popular software package that uses the EM algorithm to estimate haplotype frequencies from population data. As we have pointed out, it can be used to estimate haplotype frequencies in pedigrees simply by discarding the pedigree information. Here we compare the running times of HANAP and EM-DeCODER.

Figure 3 shows their running times for different numbers of trios (k), haplotype lengths (m) and allele-probability distributions (P). HANAP runs much faster than EM-DeCODER and can therefore be applied to much larger instances. The running times of both HANAP and EM-DeCODER increase exponentially with the length of the haplotypes and near-linearly with the number of trios (the running time of EM-DeCODER is not plotted in Figure 3(b) because haplotypes of length 100 are beyond its capability).

Accuracy rate of HANAP

We define a parameter Δ to quantify the deviation of the estimated haplotype frequencies from the underlying ones. Because the simulated data are generated according to Θ*, we take Θ* as the underlying true frequencies. Suppose the estimated haplotype set is H^E with frequencies Θ^E. Comparing H^E with H*, let the estimated frequencies of the 20 haplotypes in H* be θ₁^E, θ₂^E,...,θ₂₀^E. We let,

Figure 4 shows the deviation of the estimates of HANAP and EM-DeCODER for different numbers of trios (k), haplotype lengths (m) and allele-probability distributions (P). The deviation of HANAP is smaller than that of EM-DeCODER, which means that HANAP is more accurate. The deviation of the estimates increases with the length of the haplotypes and decreases with the number of trios.

Two real data sets

We also tested the efficiency and accuracy of HANAP on two real data sets. The first real data set is from dbMHC|ABDR, a set of 122 trios. Each genotype of these trios contains 31 markers of the same position on chromosome 6, 10 of which are micro-satellite markers and the others SNPs. We ran HANAP to find the most frequent haplotypes (with frequencies larger than 0.01). HANAP found a list of 20 haplotypes, whose frequencies are shown in Table 4. It took HANAP only 0.97 seconds to find these haplotypes, whereas the task is beyond the capability of EM-DeCODER.

Table 4 The frequencies of the 20 most frequent haplotypes found by HANAP

The second data set is from the CEPH database [17], which contains 65 families; each consists of only three generations, usually with four grandparents, two parents and a number of children. Figure 5 in [15] shows a typical family with 21 nodes.

A large proportion of the alleles in this data set have not been identified and are treated as missing data. We carefully selected a data set of 28 families (482 nodes in total) on a block of 48 markers with no recombination from chromosome 14 (452 markers in total). Both HANAP and PHASE (another widely used software package [3], based on the GS algorithm) were applied to this data set. HANAP inferred 36 haplotypes with frequency larger than 0.01 and PHASE inferred 39, of which 31 are common to both. Although we cannot tell which output is closer to the truth, the running time of HANAP (13m 24s) is far shorter than that of PHASE (21h 14m).

Discussion

Complexities of the HCL-linkage analysis haplotyping algorithm

We show that the algorithm runs in O(mn) time and O(mn) space. The preprocessing needs to compute no more than 3n HCL-linkages in no more than n trios. Each HCL-linkage can be computed in O(m) time, so the preprocessing can be done in O(mn) time. Step 1 takes O(n) time to construct the rooted tree. In Step 2 and Step 3, we traverse the whole tree and visit each node no more than a constant number of times.

When we process the HCL-linkages of the lowest leftmost nuclear family, we merge the d₁ HCL-linkages at each parent node (where possible), which takes O(d₁m) time, where d₁ is the number of children in this family. We need another O(d₁m) time to exchange the HCL-linkage information between the two parents and transfer it to the family's root R₁ in the search tree T. So we need O(d₁m) time in total to process this nuclear family. When we transfer the normalized HCL-linkage set {ψ₁, ψ₂,...,ψ_k} to the upper nuclear families, we only need to remember that every ψ_i comes from R₁, so for each ψ_i it takes only O(1) time to process RE_i and O(|LS_i|) time to process LS_i and PH_i. The total time is no more than O(k + |LS₁| + |LS₂| +...+ |LS_k|) = O(m) because the LS_i are disjoint subsets of {1,2,...,m}. In other words, the HCL-linkages of one nuclear family do not increase the processing time of its adjacent families. So the total running time to process all nuclear families is no more than O(d₁m + d₂m +...+ d_xm) = O(nm), where x is the number of nuclear families in the whole pedigree.

We need another O(mn) time to complete step 4. Therefore, the time complexity of this algorithm is O(mn).

For the computation, we maintain a data structure that stores the HCL-linkage set Ψ_v for each node v; the storage for nuclear family F_i can always be kept below O(d_i m). So the space complexity of the algorithm is also O(d₁m + d₂m +...+ d_xm) = O(nm).

Effectiveness of the haplotyping phase

Excoffier used the EM algorithm to estimate haplotype frequencies while ignoring the pedigree information. Here we adopt a two-stage method, which tries to reduce the number of possible haplotypes to be considered in the stage of estimation by utilizing the relatives' information to do haplotyping at first.

Suppose we are estimating haplotype frequencies in trios and each haplotype consists of m biallelic SNP loci. Locus i takes one state with probability p_i and the other state with probability (1 - p_i), so locus i of a genotype is heterozygous with probability 2p_i(1 - p_i). If the expected value of p_i is p, a genotype is expected to have 2p(1 - p)·m heterozygous loci. As a consequence, 2^(2p(1-p)·m - 1) possible haplotype pairs are expected to be considered if we use the EM algorithm directly. However, a locus in a trio is ambiguous only when both parents and the child are heterozygous, which happens with probability 2p_i(1 - p_i)·2p_i(1 - p_i)·2/4 = 2p_i²(1 - p_i)². So the expected number of possible haplotype configurations for the trio is on the order of 2^(2p²(1-p)²·m). Comparing the two exponents, our method can handle λ = (2p(1 - p))/(2p²(1 - p)²) times longer genotypes than Excoffier's method; for p = 1/2, λ = 4. Moreover, in most cases the more frequent allele at a locus appears with probability greater than 0.9, so our method can usually handle λ = 1/(p(1 - p)) > 10 times longer genotypes.

Furthermore, suppose each locus of the haplotype is a micro-satellite locus with l different alleles a₁, a₂,...,a_l, which appear with probabilities p₁, p₂,...,p_l. By the same argument, a genotype is heterozygous at a locus with probability 1 - (p₁² + p₂² +...+ p_l²), and a trio is ambiguous at a locus when both parents and the child carry the same heterozygous genotype, which happens with probability Σ_{a<b} 2p_a²p_b². The gain λ is the ratio of these two probabilities. For example, when l = 8 and p₁ = p₂ = ... = p₈ = 1/8, λ = 64, i.e. our method can be applied to cases of much larger scale.
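The gain λ quoted in this section can be checked numerically with the short sketch below. The function names are hypothetical, and the ambiguity probability is computed under the assumption stated in the comments, namely that a trio locus is ambiguous exactly when father, mother and child all carry the same heterozygous genotype.

```python
def het_prob(p):
    """Probability that a genotype is heterozygous at a locus
    whose alleles have frequencies p (a sequence summing to 1)."""
    return 1 - sum(x * x for x in p)

def ambiguous_prob(p):
    """Probability that a trio locus is ambiguous: both parents carry the
    same heterozygous genotype {a, b} (2*pa*pb each) and the child is
    also {a, b} (probability 1/2), summed over all allele pairs a < b."""
    k = len(p)
    return sum(2 * p[a] ** 2 * p[b] ** 2
               for a in range(k) for b in range(a + 1, k))

def gain(p):
    """lambda: how many times longer a genotype can be handled."""
    return het_prob(p) / ambiguous_prob(p)
```

For a biallelic locus the two functions reduce to 2p(1 - p) and 2p²(1 - p)², so `gain` reproduces the SNP figures as well as the micro-satellite example.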

Conclusion

In this paper we presented a two-stage method for haplotyping and for estimating haplotype frequencies from pedigree genotype data. Given a set of pedigrees, it first determines all feasible haplotype configurations for each pedigree and then uses the EM algorithm to estimate the haplotype frequencies from the inferred configurations. Because a large number of infeasible haplotypes are eliminated from the list of possible haplotypes, our method is both more efficient and more accurate. The experimental results show that HANAP runs much faster than EM-DeCODER and can therefore be applied to much larger instances. Moreover, the deviation of the estimates of HANAP is smaller than that of EM-DeCODER, which means that it is more accurate.

Our method suggests that pedigree information is of great importance in haplotype analysis. It can be used to speedup estimation process, and to improve estimation accuracy as well. The result also demonstrates that whole haplotype configuration space can be substitute by the space of zero-recombinant haplotype configurations in haplotype frequency estimation, especially when the considered haplotype block is relatively short.

Appendix

Correctness of the HCL-linkage analysis haplotyping algorithm

First, we point out that every consistent haplotype configuration for pedigree P must be consistent with all HCL-linkages calculated from the unique partial solutions within trios, and that our merge and transfer operations preserve this property during the whole process, i.e. if ζ is a haplotype configuration for P and ζ is consistent with all the HCL-linkages in the pedigree, it is also consistent with those obtained after the merge and transfer operations.

Obviously, HCL-linkages are sufficient to generate a consistent haplotype configuration within a trio. We prove that:

Lemma

If there exists a consistent haplotype configuration for a nuclear family (F, M, C₁, C₂,...,C_d), every imputing schema s that arbitrarily imputes the (compound) alleles at one node outputs a feasible solution ζ. Conversely, if one imputing schema ends with an incompatibility, there is no consistent haplotype configuration.

Proof

Firstly, HCL-linkages are necessary for constructing the consistent haplotype configurations, which means that if there is a feasible solution, it should correspond to one imputing schema.

Secondly, we show that if one imputing schema outputs a feasible solution, all the schemas output feasible solutions.

In particular, for a trio (F, M, C₁), without loss of generality, we suppose that Step 4 starts by imputing the "alleles" of node F. For a specified "locus" i, the two alleles of F, M and C₁ are denoted as (a, b), (c, d) and (e, f), and the partial haplotypes from locus 1 to locus (i - 1) are denoted as (ph₁, ph₂), (ph₃, ph₄) and (ph₅, ph₆).

Suppose in schema s, allele a is imputed to ph₁, denoted as ph₁ ← a, and ph₂ ← b, ph₃ ← c, ph₄ ← d, ph₅ ← e, ph₆ ← f. If s outputs a consistent haplotype configuration for trio (F, M, C₁), we prove that s': ph₂ ← a, ph₁ ← b, ph₄ ← c, ph₃ ← d, ph₆ ← e, ph₅ ← f outputs another consistent haplotype configuration.

The loci from 1 to (i - 1) can be viewed as a compound locus I. Referring to Table 1, because we can impute either a or b to ph₁, at least one of "locus" I and "locus" i must be ambiguous. Otherwise, they would have been linked into a bigger compound "locus" in Step 2 or 3 by HCL-linkages; whichever of a or b was linked with ph₁, we could not impute it arbitrarily in Step 4. Without loss of generality, we assume that locus i is ambiguous, which means that a ≠ b and {a, b} = {c, d} = {e, f} (please refer to Table 1). We can prove by enumeration that s' also outputs a consistent haplotype configuration. Because nuclear family (F, M, C₁, C₂,...,C_d) consists of the trios (F, M, C₁), (F, M, C₂),...,(F, M, C_d), which share the parents F and M, the above proof shows that both s: ph₁ ← a, ph₂ ← b, ph₃ ← c, ph₄ ← d and s': ph₂ ← a, ph₁ ← b, ph₄ ← c, ph₃ ← d lead to a consistent haplotype configuration for the family.

We call the move from s to s' a walk step, obtained by switching ph₁ ← a to ph₁ ← b. Obviously, any imputing schema s₁ can be transformed into any other imputing schema s₂ by consecutive walk steps. So either both s₁ and s₂ lead to a consistent haplotype configuration, or neither does.
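
The enumeration argument can be illustrated for a single trio. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the helper names are hypothetical, the child's first haplotype is assumed to be the paternal one and the second the maternal one, and zero recombination is modelled by requiring each child haplotype to be an exact copy of one haplotype of the corresponding parent. It lists every way of phasing an ambiguous locus {a, b} and checks that the set of consistent schemata is closed under the simultaneous switch that turns s into s'.

```python
from itertools import product

# The two ways to phase a heterozygous locus {a, b} onto (first, second) haplotype.
ORDERS = [("a", "b"), ("b", "a")]


def extend(prefixes, order):
    """Append the locus-i alleles to the two partial haplotypes of one member."""
    return tuple(prefix + (allele,) for prefix, allele in zip(prefixes, order))


def is_consistent(father, mother, child):
    """Zero-recombinant check (assumed convention: child[0] paternal, child[1] maternal)."""
    return child[0] in father and child[1] in mother


def consistent_schemata(f_pre, m_pre, c_pre):
    """All imputing schemata at locus i that keep the trio consistent."""
    found = set()
    for f_ord, m_ord, c_ord in product(ORDERS, repeat=3):
        trio = (extend(f_pre, f_ord), extend(m_pre, m_ord), extend(c_pre, c_ord))
        if is_consistent(*trio):
            found.add((f_ord, m_ord, c_ord))
    return found


def switch(schema):
    """The walk from s to s': swap the two alleles in every member of the trio."""
    return tuple(order[::-1] for order in schema)


# Compound locus I resolved: the child received prefix A from F and prefix C from M.
resolved = consistent_schemata((("A",), ("B",)), (("C",), ("D",)), (("A",), ("C",)))
# Compound locus I ambiguous as well: all three members carry prefixes {A, B}.
ambiguous = consistent_schemata((("A",), ("B",)), (("A",), ("B",)), (("A",), ("B",)))

for name, schemata in (("resolved", resolved), ("ambiguous", ambiguous)):
    closed = all(switch(s) in schemata for s in schemata)
    print(name, len(schemata), closed)
```

In both toy cases the consistent schemata come in pairs related by the switch, which is the symmetry the proof relies on.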

This lemma indicates that our algorithm works in a nuclear family. We now can prove the correctness of our HCL-linkage analysis haplotyping algorithm by induction, i.e. if there is at least one feasible solution for a general pedigree, HCL-linkages are sufficient to generate all solutions.

Suppose that the root R has multiple child mating nodes O₁,...,O_r, each representing a nuclear family, and that the algorithm works in all sub-trees, i.e. all of the feasible solutions for those sub-trees can be directly deduced from the HCL-linkages collected from the sub-trees and stored at their roots. From the lemma above, we know that if an incompatibility of type II occurs, there is no feasible solution. We therefore assume that there is no incompatibility of type II and that there are always feasible solutions for all sub-trees. Suppose that haplotype configuration ζ is consistent with all the HCL-linkages at root R (and consequently with all the HCL-linkages in P). Then ζ is a consistent haplotype configuration when restricted to any nuclear family of O₁,...,O_r, and it is also consistent with the HCL-linkages at the roots of the lower sub-trees of O₁,...,O_r. By induction, ζ is a feasible solution when restricted to any of these sub-trees. So ζ is a consistent haplotype configuration of the whole pedigree P.

References

Excoffier L, Slatkin M: Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Mol Biol Evol 1995, 12: 921–927.
Hawley ME, Kidd KK: HAPLO: a program using the EM algorithm to estimate the frequencies of multi-site haplotypes. J Hered 1995, 86(5):409–11.
Stephens M, Smith NJ, Donnelly P: A new statistical method for haplotype reconstruction for population data. Am J Hum Genet 2001, 68: 978–989.
Niu T, Qin ZS, Xu X, Liu JS: Bayesian haplotype inference for multiple linked single nucleotide polymorphisms. Am J Hum Genet 2002, 70: 157–169.
Griffiths A, Gelbart W, Lewontin R, Miller J: Modern Genetic Analysis: Integrating Genes and Genomes. New York: W.H. Freeman and Company; 2002.
Cox R, Bouzekri N, et al.: Angiotensin-1-converting enzyme (ACE) plasma concentration is influenced by multiple ACE-linked quantitative trait nucleotides. Hum Mol Genet 2002, 11: 2969–2977.
Qian D, Beckmann L: Minimum recombinant haplotyping in pedigrees. Am J Hum Genet 2002, 70(6):1434–1445.
Li J, Jiang T: Efficient rule-based haplotyping algorithms for pedigree data. Proc of RECOMB 2003, 197–206.
Li J, Jiang T: Efficient inference of haplotypes from genotypes on a pedigree. J Bioinfo Comp Biol 2003, 1(1):41–69.
Wijsman EM: A deductive method of haplotype analysis in pedigrees. Am J Hum Genet 1987, 41(3):356–373.
O'Connell JR: Zero-recombinant haplotyping: applications to fine mapping using SNPs. Genet Epidemiol 2000, 19(Suppl 1):S64–70.
Elston RC, Stewart J: A general model for the genetic analysis of pedigree data. Human Heredity 1971, 21: 523–542.
Fallin D, Schork NJ: Accuracy of haplotype frequency estimation for biallelic loci, via the expectation maximization algorithm for unphased diploid genotype data. Am J Hum Genet 2000, 67: 947–959.
Zhao H, et al.: Transmission/disequilibrium tests using multiple tightly linked markers. Am J Hum Genet 2000, 67: 936–946.
Zhang Q, Chin FYL, Shen H: Minimum Parent-Offspring Recombination Haplotype Inference in Pedigrees. Transactions on Computational Systems Biology LNBI 3680–0100 2005, 2: 100–12.
Zhang Q, Che H, Zhou Z, Chen G: Comparative study on different approaches to in silico haplotyping. In Technical report. Dept of Computer Science and Technology, University of Science and Technology of China; 2003.
The CEPH genotype database [http://www.cephb.fr/]


Acknowledgements

This work is supported by the National Science Foundation of China under the grant No.60533020.

This article has been published as part of BMC Bioinformatics Volume 7, Supplement 4, 2006: Symposium of Computations in Bioinformatics and Bioscience (SCBB06). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/7?issue=S4.

Author information

Authors and Affiliations

Department of Computer Science and Technology, University of Science and Technology of China, Hefei, Anhui, 230027, China
Qiangfeng Zhang, Yuzhong Zhao, Guoliang Chen & Yun Xu

Corresponding author

Correspondence to Guoliang Chen.

Additional information

Authors' contributions

QZ proposed the whole estimation framework and both the haplotyping and the estimation algorithms, and wrote the manuscript. YZ implemented the software and helped write the manuscript. GC and YX participated in the design of the study and wrote the manuscript. All authors read and approved the final manuscript.

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Zhang, Q., Zhao, Y., Chen, G. et al. Estimate haplotype frequencies in pedigrees. BMC Bioinformatics 7 (Suppl 4), S5 (2006). https://doi.org/10.1186/1471-2105-7-S4-S5
