Email updates

Keep up to date with the latest news and content from BMC Bioinformatics and BioMed Central.

This article is part of the supplement: Eleventh International Conference on Bioinformatics (InCoB2012): Bioinformatics

Open Access Proceedings

A linear-time algorithm for reconstructing zero-recombinant haplotype configuration on a pedigree

En-Yu Lai12, Wei-Bung Wang3, Tao Jiang3 and Kun-Pin Wu1*

Author Affiliations

1 Institute of Biomedical Informatics, National Yang Ming University, Taipei 112, Taiwan

2 Bioinformatics Program, Taiwan International Graduate Program, Academia Sinica, Taipei 115, Taiwan

3 Department of Computer Science and Engineering, University of California, Riverside, CA 92521, USA

For all author emails, please log on.

BMC Bioinformatics 2012, 13(Suppl 17):S19  doi:10.1186/1471-2105-13-S17-S19


The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1471-2105/13/S17/S19


Published:13 December 2012

© 2012 Lai et al.; licensee BioMed Central Ltd.

This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

When studying genetic diseases in which genetic variations are passed on to offspring, the ability to distinguish between paternal and maternal alleles is essential. Determining haplotypes from genotype data is called haplotype inference. Most existing computational algorithms for haplotype inference have been designed to use genotype data collected from individuals in the form of a pedigree. A haplotype is regarded as a hereditary unit and therefore input pedigrees are preferred that are free of mutational events and have a minimum number of genetic recombinational events. These ideas motivated the zero-recombinant haplotype configuration (ZRHC) problem, which strictly follows the Mendelian law of inheritance, namely that one haplotype of each child is inherited from the father and the other haplotype is inherited from the mother, both without any mutation. So far no linear-time algorithm for ZRHC has been proposed for general pedigrees, even though the number of mating loops in a human pedigree is usually very small and can be regarded as constant.

Results

Given a pedigree with n individuals, m marker loci, and k mating loops, we proposed an algorithm that can provide a general solution to the zero-recombinant haplotype configuration problem in O(kmn + k2m) time. In addition, this algorithm can be modified to detect inconsistencies within the genotype data without loss of efficiency. The proposed algorithm was subject to 12000 experiments to verify its performance using different (n, m) combinations. The value of k was uniformly distributed between zero and six throughout all experiments. The experimental results show a great linearity in terms of execution time in relation to input size when both n and m are larger than 100. For those experiments where n or m are less than 100, the proposed algorithm runs very fast, in thousandth to hundredth of a second, on a personal desktop computer.

Conclusions

We have developed the first deterministic linear-time algorithm for the zero-recombinant haplotype configuration problem. Our experimental results demonstrated the linearity of its execution time in relation to the input size. The proposed algorithm can be modified to detect inconsistency within the genotype data without loss of efficiency and is expected to be able to handle recombinant and missing data with further extension.

Keywords:
Haplotype inference; zero-recombinant haplotype configuration (ZRHC); pedigree; mating loop.

Background

A genetic disease is caused by the abnormality in an individual's genome. Genetic diseases have been studied extensively for decades by investigating the connection between diseases and genetic variations. In the human genome, chromosomes come in pairs; each gene consists of two alleles that reside in different chromosomes at the same locus. One of the two alleles comes from the father and the other comes from the mother. To study hereditary diseases in which the genetic variations are passed on to offspring, the ability to distinguish between paternal and maternal alleles is essential. Unfortunately, the haplotype structure of a human genome is not available directly from the genotyping and the unordered genotype data does not tell us which allele comes from which parent. A haplotype is a collection of alleles at multiple loci on a chromosome that tend to be inherited as a unit. The determination of haplotypes from genotype data is called haplotype phasing or haplotype inference. Algorithms for haplotype inference are indispensable and have been intensively studied.

The existing computational algorithms for haplotype inference can be classified into statistical and combinatorial and most of which were designed for genotype data collected from individuals in the form of a pedigree. A pedigree is a hierarchical structure that describes the parent-child relationship among members of a family. Individuals without parents are called founders. There may be cycles in a pedigree, which are referred to as mating loops. A mating loop arises from a couple if they have children and both of them are offspring of certain family ancestors. An example of a pedigree, coupled with genotype data, is depicted in Figure 1(a); each allele is denoted as 0 or 1 to represent its form within a gene. If two alleles of a gene are the same, the locus is homozygous; otherwise, it is heterozygous. A haplotype is regarded as a hereditary unit and therefore an input pedigree is preferred to be free of mutational events and to have minimum number of genetic recombinational events [1]. Haplotype inference under this assumption is referred to as the minimum-recombinant haplotype configuration (MRHC) problem, which requires the solving of the haplotype structure of the input pedigree with the minimum number of recombination events [1]. Several algorithms have been proposed to solve the MRHC problem [1-8]. A special case of MRHC is zero-recombinant haplotype configuration (ZRHC) problem, which strictly follows the Mendelian law of inheritance, namely that one haplotype of each child is inherited from the father and the other haplotype is inherited from the mother, without any mutation [9]. To reduce the complexity of the ZRHC, some algorithms have been applied to pedigrees without mating loops (called tree pedigrees) [10-12]. In contrast to algorithms targeting tree pedigrees, so far no linear-time algorithm for ZRHC has been proposed for general pedigrees, even though the number of mating loops in a human pedigree is usually very small and can be regarded as constant; the execution time of existing algorithms for ZRHC using general pedigrees is polynomial [4,13-17]. Regardless of whether it is a MRHC or a ZRHC problem, some algorithms have been extended to handle pedigrees with mutations or missing data [5,8,11,15]. In addition to haplotype inference from pedigree data, algorithms have been proposed for population datasets that come from unrelated individuals. Algorithms for population datasets try to decode the haplotype structure of each individual as well as the haplotype frequencies of a population [18-22]. All the above mentioned algorithms are mainly combinatorial. Readers who are interested in statistical approaches for haplotype inference can consult a recent review [23].

thumbnailFigure 1. A pedigree of 11 members. (a) A pedigree of 11 members coupled with genotype data. The paternal haplotype of an individual is listed left while its maternal haplotype is listed right, even though the haplotype information is not available from genotyping. For example, the paternal and maternal haplotypes of individual n6 are 0100 and 1110, respectively; the genotype of n6, however, is specified as {0, 1}{1, 1}{0, 1}{0, 0}. Circles represent females and boxes represent males. Children are listed below their parents with line connections. For example, the couple n7 and n8 have two children n9 and n10. There is a mating loop in the pedigree due to the common ancestor n2 of the couple n5 and n9. (b) A pedigree graph with a spanning tree. Tree edges are solid lines and non-tree edges are dotted lines. The genotype data are represented as vectors of g-constant. There is a local cycle of length 4 due to the couple n7 and n8 and their children n9 and n10. There is a global cycle of length 6 due to the mating loop. (c) There are four locus graphs for the different loci. Edges in locus forests are depicted as solid lines. Nodes with thick borders are predetermined.

In this study, we have targeted the ZRHC problem for pedigree data. If we assume we are given a pedigree with n individuals and m marker loci. Then for general pedigrees, Li and Jiang proposed an O(m3n3) time algorithm by converting the inheritance process into an equivalent linear system of O(mn) equations over Galois field GF(2) and invoking Gaussian elimination [4]. Xiao et al. improved the method to take O(mn2 + n3 log2 n log log n) time by removing redundant equations from the linear system [16]. Doan et al. proposed an O(mnα(m)) time algorithm by exploring constraints among marker loci rather than family members, where α(·) is the inverse of the Ackermann function [14]. For tree pedigrees, the execution time of the algorithm proposed by Xiao can be reduced from O(mn2 + n3 log2 n log log n) to O(mn + n3) [16]. Li and Li proposed an O(mnα(n)) time algorithm using disjoint-set data structures [11]. Liu et al. further lowered the complexity of Xiao's algorithm to linear time O(mn) [12]. Chan et al. also proposed a linear-time algorithm by maintaining a graph structure [10]. Chan's algorithm, however, only produce a particular solution. A particular solution assigns a numerical value to each system variable, while a general solution describes all possible solutions of the system by designating certain variables as free variables and the others as linear combinations of these free variables.

In this paper, we presented an O(kmn + k2m) time algorithm that provides a general solution for ZRHC for general pedigrees, where k is the number of mating loops. In human pedigrees, k is usually very small and can be regarded as constant. Our algorithm therefore turns out to be linear for most of the practical cases. The proposed algorithm was subject to 12000 experiments to verify its performance using different (n, m) combinations. The value of k was uniformly distributed between zero and six throughout all experiments. The experimental results show a great linearity of the execution time in relation to the input size when both n and m are larger than 100. For those experiments where n or m are less than 100, the proposed algorithm runs very fast, from thousandth to hundredth of a second on a personal desktop computer. We also showed that the proposed algorithm can be easily modified to detect inconsistencies among genotype data without loss of efficiency.

Methods

To apply computational techniques, we transformed the input pedigree into a pedigree graph by connecting each parent directly to its children (Figure 1(b)). A pedigree graph is an undirected graph G = (V, E), where V is a set of nodes and E a set of edges. Each node in V represents an individual in the pedigree; each pair of nodes is connected with an edge in E if and only if the two individuals have a parent-child relationship. G is defined to be undirected because the computational property of each edge is symmetric in our algorithm, even if the parent-child relationship is asymmetric. G may contain cycles. We only pay attention to two types of cycles: a cycle due to a mating loop, which is called a global cycle and a cycle due to a couple and two of their children, which is called a local cycle. Global cycles and local cycles are referred to as basic cycles. For ease of cycle processing, we construct a spanning tree T (G) on G. A basic cycle can be obtained by adding a non-tree edge into T (G). The set of non-tree edges is denoted by EX. Non-tree edges are further divided into two disjoint subsets <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M1">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M2">View MathML</a>; members in <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M1">View MathML</a> induce local cycles and members in <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M2">View MathML</a> induce global cycles. Mating loops seldom appear in human pedigrees and therefore <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M3','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M3">View MathML</a> is regarded as a small constant.

In the rest of this paper, we are assuming that G has n nodes and m loci, all alleles are bi-allelic (denoted by 0 or 1), and the input dataset is free of genotyping errors. Under this assumption, the input size of ZRHC is O(mn). The genotype data of a node ni are represented as a vector <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M4','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M4">View MathML</a> of size m. The genotype of ni at locus l, where 1 ≤ l m, is defined as follows:

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M5','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M5">View MathML</a>

Genotype data are available, thus all g-variables can be regarded as constant (Figure 1(b)). We introduce a vector <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M6','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M6">View MathML</a> of size m to describe the haplotype information of ni; the paternal allele of ni at locus l, where 1 ≤ l m, is defined as follows:

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M7','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M7">View MathML</a>

The vector <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M6','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M6">View MathML</a> is regarded as unknown even though we know that <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M8','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M8">View MathML</a> if ni is homozygous at locus l (i.e. <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M4','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M4">View MathML</a>[l] ≠ 2).

We formulated the ZRHC problem as follows.

ZRHC Given a pedigree graph G(V, E) with full g-constants, determine <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M6','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M6">View MathML</a> of each node ni in V.

The haplotype configuration of the input pedigree is identified by specifying the paternal haplotype of each family member.

A system of linear equations over GF(2)

In this section, we introduce a system of linear equations based on G and g-constants; this system was first proposed in [16] and will be reduced to determine all p-variables. Since p-variables carry binary values, all equations in the linear system are defined over GF(2) whose operations addition (+) and multiplication (·) are shown in Table 1.

Table 1. Addition (+) and multiplication (·) in GF(2)

The building block of the system: inheritance

"Inheritance" is the building block of the system. What parents pass to their children must be the same as what children receive from their parents. For a parent ni and a locus l, ni passes <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M6','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M6">View MathML</a>[l] + 1 to its children if and only if the genotype of ni at locus l is heterozygous and ni passes its maternal allele; otherwise ni passes <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M6','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M6">View MathML</a> [l] to its children. We introduce two auxiliary variables <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M9','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M9">View MathML</a> to formally state the above argument. The variable <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M10','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M10">View MathML</a> indicates if locus l of ni is heterozygous.

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M11','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M11">View MathML</a>

The variable <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M12','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M12">View MathML</a> indicates which allele of ni is passed to its child nj.

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M13','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M13">View MathML</a>

Therefore, <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M14','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M14">View MathML</a> represents the allele at locus l that ni passes to nj.

On the other hand, assume that nj receives an allele from ni. If ni is nj's father, what ni passes to nj is the paternal allele of nj. In this case, we have <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M15','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M15">View MathML</a>. If ni is nj's mother, there are two sub-cases. If locus l of nj is homozygous, what ni passes to nj must be the same as the paternal allele of nj. In this case, we have <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M15','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M15">View MathML</a>. If locus l of nj is heterozygous, what ni passes to nj is the maternal allele of nj and is different from the paternal allele of nj. In this case, we have <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M16','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M16">View MathML</a>. The variable <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M17','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M17">View MathML</a> can be used to indicate if locus l of nj is homozygous or heterozygous, the two sub-cases can therefore be combined into a single equation <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M18','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M18">View MathML</a>. Moreover, if we introduce another auxiliary variable <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M19','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M19">View MathML</a> as follows,

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M20','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M20">View MathML</a>

the inheritance relationship can be unified into the following equation:

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M21','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M21">View MathML</a>

(1)

Note that the w- and d-variables are constant by definition, and the p- and h-variables are unknowns. Equation (1) formulates the property of edge (ni, nj) in G: p-variables and w-constants are attributes of the nodes ni and nj, and h-variables and d-constants describe the inheritance relation associated with the edge (ni, nj). With the information provided by Equation (1), various constraints on h-variables can be generated by traversing different paths in G. Our algorithm was designed to first determine h-variables based on these constraints and then the solution to the ZRHC problem can be obtained by determining all p-variables based on the solved h-values and Equation (1). One point needs special care: if nj is a child of <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M22','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M22">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M23','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M23">View MathML</a> are undefined. In our algorithm, we make the h-variables and d-constants symmetrical such that <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M24','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M24">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M25','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M25">View MathML</a>.

Linear constraints on h-variables

To reduce the computational complexity of our algorithm, we try to make the number of unknowns in the coming linear system as small as possible. In the pedigree graph G, we have mn p-variables and at most 2n h-variables (since each individual has two parents and there are at most n individuals). Observe that if a node ni itself or one of its parents is homozygous at locus l, <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M26','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M26">View MathML</a> is determined by definition and Equation (1). In this case ni is referred to as predetermined at locus l and the number of unknown p-variables is reduced by one. Moreover, for an edge (ni, nj) ∈ E, where ni is a parent of <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M27','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M27">View MathML</a> is cancelled from Equation (1) if <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M28','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M28">View MathML</a> at locus l. If <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M29','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M29">View MathML</a> holds for all 1 ≤ l m, no constraints are imposed on <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M30','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M30">View MathML</a> and it becomes a free variable (or its value will finally depend on other free variables). In this case the number of h-variables to be determined is reduced by one, which is equipotent to the removal of edge (ni, nj) from G. Accordingly, w-constants can be viewed as the weight of edges in G; we only pay attention to edges with weight one (parent nodes that are heterozygous). To consider only the edges with weight one at locus l, we construct the lth locus graph Gl = (V, El), where El = {(ni, nj) | ni is a parent of nj, <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M31','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M31">View MathML</a>}. Moreover, the spanning forest T(G) ∩ Gl is denoted by T(Gl) and is referred to as the lth locus forest (Figure 1(c)).

We define constraints on h-variables by traversing paths in the locus graphs. Consider a path p = n0, n1, ..., ni in Gl. Assume that n0 and ni are predetermined and all other in-between nodes are non-predetermined. Adding up all h-variables on the path will produce the following equation by Equation (1):

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M32','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M32">View MathML</a>

(2)

Since n0 and ni are predetermined and all d-constants are known, b is a constant. The constant b is said to be the constraint of path p. Note that the constraint b does not depend on the direction that path p is read because the h-variables and d-constants are symmetric. Moreover, if the path is a cycle c = n0, n1, ..., ni, n0 in Gl, we would have the following equation:

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M33','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M33">View MathML</a>

(3)

Again, since all d-constants are known, b' is also a constant. The constant b' is said to be the constraint of cycle c. On the basis of Equations (2) and (3), we can generate constraint equations with only h-variables for cycles or for paths that connect predetermined nodes in Gl. Constraints can be classified into two categories with respect to the spanning tree T(G): cycle and path constraints derived from paths containing non-tree edges, and tree constraints derived from paths containing only tree edges.

Cycle and Path constraints

Adding a non-tree edge e into the spanning tree T (G) generates a basic cycle c. If Gl contains e, there are two cases of c in Gl.

Case 1 c is in Gl. A cycle constraint bc of cycle c can be obtained by Equation (3). The constraint is denoted interchangeably by bc or (bc, e), which is also said to be the cycle constraint of e.

Case 2 c is broken into several disjoint paths in Gl by predetermined nodes. Since these paths are disjoint, there is exactly one path p' of them containing e. Along the path p', we identify a subpath p = ni ...nj containing e such that ni and nj are predetermined and all other in-between nodes are non-predetermined. A path constraint bp of the subpath p can be obtained by Equation (2). The constraint is denoted interchangeably by bp or (ni, nj, bp, e), which is also said to be the path constraint of e. Path constraints are symmetric because (ni, nj, bp, e) = (nj, ni, bp, e).

Tree constraints

For each connected component of T (Gl), we arbitrarily pick a predetermined node ns as the seed. For the unique tree path p that connects ns and another predetermined node nk in the same connected component, a tree constraint bt of path p can be obtained by Equation (2). The constraint is denoted interchangeably by bt or (ns, nk, bt). Tree constraints are symmetric because (ns, nk, bt) = (nk, ns, bt). Note that if there exists a component that has no predetermined nodes, locus l must be heterozygous across the entire pedigree and no tree constraints will be generated.

Our algorithm in relation to the ZRHC problem

Our algorithm consists of four steps. We begin by initializing required data structures in the preprocessing step. The initialized data structures are subject to the constraint generation step to construct a system of linear constraints on h-variables. There are two issues should be addressed. First, since all constraints are derived from locus graphs that come from the same pedigree graph, there is usually redundancy in the system. Second, we actually do not need to know all h-values to solve the ZRHC problem. For a child node ni, there are two h-variables related to it and its parents. However, from Equation (1) we know that one of the two h-values is sufficient to determine <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M6','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M6">View MathML</a>. So it is easy to see that the (n - 1) h-variables in T (G) form a minimal sufficient set to solve the ZRHC problem. In the third step, constraint reduction and transformation, we therefore try to eliminate redundancy in the system and transform as many path constraints into tree constraints as possible. Finally, in the haplotype determination step, we introduce an efficient way to solve h-variables and further p-variables based on the reduced system.

Step 1: preprocessing

The data structures of our algorithm are initialized by the following procedures:

1. Transform the pedigree into a pedigree graph G = (V, E). Each node ni in V is equipped with its genotype vector <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M4','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M4">View MathML</a>. Since each individual has two parents, there are at most 2n edges in G, so we have |V| = O(n) and | E | = O(n).

2. Construct a spanning tree T (G) on G.

3. For each locus l,

(a) generate a locus graph Gl,

(b) generate a locus forest T (Gl), and

(c) identify predetermined nodes as well as their p-values, all d-constants, and all w-constants.

The operations applied in this step are graph traversal and spanning tree construction, both operations can be performed in time O(| V | + | E |) = O(n). The time complexity of this step is therefore O(mn).

Step 2: constraint generation

A system of linear equations on h-variables over GF(2) will be constructed in this step. The system consists of three sets CC, CP, and CT that contain different kinds of constraints. CC contains cycle constraints, if any, of all non-tree edges at all loci. Similarly, CP contains path constraints, if any, of all non-tree edges at all loci. Finally, CT contains tree constraints at all loci. To reduce computational complexity, repetitions of set members are forbidden in our algorithm; we do nothing if an existing member is going to be added into the same set.

There are O(mn) trials to generate a constraint for a non-tree edge since there are m locus graphs and each of which contains O(n) non-tree edges; in each trial we perform a cycle detection procedure to generate a cycle constraint or a path constraint, so we have | CC | + | CP | = O(mn). The cycle detection procedure is usually implemented by depth first graph traversal and its execution time depends on the length of the cycle. Consequently, if a non-tree edge induces a global cycle, the cycle detection procedure takes O(n) time; otherwise the procedure takes constant time because each local cycle contains only four edges. The time to generate O(mn) cycle and path constraints is O(kmn) since there are at most km trials to generate global cycle constraints. To generate tree constraints within a locus graph, we perform tree traversal on its locus forest. This procedure generates O(n) tree constraints in O(n) time. So we require O(mn) time to generate tree constraints at all loci. The time complexity to generate our constraint system is therefore O(kmn) + O(mn) = O(kmn).

Step 3: constraint reduction and transformation

Redundancy arises in the constraint system if a constraint can be represented as a linear combination of other constraints. We are especially interested in the following two types of redundancies.

Type 1 Assume there is a basic cycle c in G and it can be decomposed into two edge-disjoint paths p1 and p2 both connecting nodes ni and nj. There must be exactly a non-tree edge e in c, and without loss of generality, we assume that e belongs to path p1. If there is a cycle constraint (bc, e) of c, a path constraint (ni, nj, bp, e) of p1, and a tree constraint (ni, nj, bt) of p2, we have bc = bp + bt by Equations (2) and (3). That is, these three constraints are linearly dependent and each of them can be represented as a linear combination of the other two constraints (Figure 2(a)). A path constraint can therefore be transformed into a tree constraint by the equation bt = bp + bc, which is the basis of the reduction of our constraint system.

thumbnailFigure 2. Two types of redundancy arise from linearly dependence. (a) A cycle c is decomposed into a tree path p2 and a path p1 that contains a non-tree edge e. So we have bc = bp + bt, where bc, bp, and bt are constraints of cycle c, path p1 and path p2, respectively. The dotted line represents the non-tree edge e. (b) nl is the node closest to ni on the path p3. Assume that the constraint of tree path pi is bi, 1 ≤ i ≤ 6. We have b1 = b4 + b6, b2 = b4 + b5, and b3 = b5 + b6, which conclude that b1 = b2 + b3 due to the addition over GF(2).

Type 2 Assume there are three tree constraints (ni, nj, b1), (ni, nk, b2), and (nj, nk, b3) of paths p1, p2, and p3, respectively. By definition we know that a tree constraint is the summation of all h-variables along a unique path in T(G), so we have

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M34','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M34">View MathML</a>

Suppose that nl is the node closest to ni on the path p3. We then have three paths p4 between ni and nl, p5 between nl and nj, and p6 between nl and nk such that p1 = p4 + p5, p2 = p4 + p6, and p3 = p5 + p6. The tree constraints can therefore be rewritten as

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M35','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M35">View MathML</a>

Because all constraints are defined over GF(2), we conclude that b1 + b2 = b3; the three tree constraints are linearly dependent and each of them can be represented as a linear combination of the other two constraints (Figure 2(b)). The above argument implies the following lemma.

Lemma 1 For any three nodes ni, nj, and nk, the tree constraint of the path between nj and nk is equal to the total tree constraint of the path between ni and nj and the path between ni and nk.

Lemma 1 still holds even if ni is on the path between nj and nk (ni = nl in Figure 2(b)), which means that if a tree path is partitioned into two disjoint sub-paths, the tree constraint of this path is equal to the total constraint of the two sub-paths.

In this step, we remove the type 1 redundancy by transforming as many path constraints to tree constraints as possible, and remove the type 2 redundancy by reducing CT to an equivalent set whose cardinality is at most (n - 1).

For each non-tree edge e EX, if cycle constraint (bc, e) exists, we remove all path constraints (ni, nj, bp, e), if any, from CP and add tree constraints (ni, nj, bc + bp) into CT. Since the size of CP is O(mn), this procedure can be carried out in time O(mn), and the new CT is of size O(mn).

To further remove the redundancy in CT, we construct a constraint graph G* of G. The constraint graph G* shares the same set of nodes V as G; for each tree constraint (ni, nj, bt) ∈ CT, we introduce an edge connecting nodes ni and nj in G* with weight bt (Figure 3(a)). An example of constraint graph is depicted in Figure 3(b). The constraint graph is used to reduce the size of CT. As shown in Figure 3(b), a constraint graph may not be connected. Within each connected component in G*, we randomly choose a seed ns and try to assign each node ni a variable W[ni] to represents the tree constraint of the tree path between nodes ns and ni in the pedigree graph G. The assignment is carried out by the following steps in each connected component of G*.

thumbnailFigure 3. The concept of a constraint graph. (a) A tree constraint (ni, nj, bt) of the path that connects ni and nj in a pedigree graph G will be transformed into an edge between ni and nj with weight bt in the corresponding constraint graph G*. (b) A constraint graph. There are three edges (n3, n11), (n7, n5), and (n9, n5) in the constraint graph, which means that there are three tree constraints in the linear system. Note that the constraint graph is disconnected and contains several connected components.

1. W[ns] of the seed ns is assigned the value zero,

2. start from ns, perform a breadth-first-search traversal via tree constraints, i.e., we can traverse from node ni to node nj if (ni, nj, bt) ∈ CT or (nj, ni, bt) ∈ CT,

3. as we traverse from ni to nj through (ni, nj, bt) or (nj, ni, bt), if nj is unvisited, we assign W[ni] + bt to W[nj] based on Lemma 1; otherwise we do nothing.

Since W[ni] represents the tree constraint (ns, ni, W[ni]), it can be regarded as the summation of h-variables along the unique path on T(G) from the seed ns to node ni, which implies the following lemma:

Lemma 2 The h-value of a tree edge (ni, nj) in T(G) can be obtained by <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M36','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M36">View MathML</a> if ni and nj reside in the same connected component of G*.

Therefore, if we can assign W-values to all nodes in V and make G* connected, G* would be equipotent to a reduced CT of size (n - 1) that covers h-variables of all tree edges of T(G) and is sufficient to solve the ZRHC problem. The construction of the constraint graph takes O(|CT|) = O(mn) time.

The constraint graph G*, however, may not be connected with fully assigned W-values. We therefore introduced an extension procedure to extend G* by adding extra tree constraints, if any, into G*; we would like to reduce the number of connected component in G* as much as possible. To explore more tree constraints to be added into G*, we examine those non-tree edges e EX that do not have cycle constraints in CC. The basic idea is that if we can synthesize a new cycle whose constraint is the same as the expected cycle constraint of e, we may obtain new tree constraints by transforming known path constraints of e.

For a non-tree edge e without cycle constraint, we try to synthesize a cycle only if <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M37','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M37">View MathML</a>. We do nothing if <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M38','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M38">View MathML</a> because no extra tree constraints of e can be obtained by cycle synthesis. To see this, suppose the local cycle induced by e connects a couple na and nb and their two children nc and nd; without loss of generality, we assume e = (na, nd) (Figure 4). We can examine the possible constraints derived from this local cycle. Constraints of a single edge with predetermined endpoints are not of interest and can be ignored because the p-values of the endpoints are known; we need only pay attention to constraints whose path lengths are longer than one. In the lth locus graph, if <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M39','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M39">View MathML</a>, a local cycle exists and we have cycle constraint (bc, e) (Figure 4(a)); if <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M40','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M40">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M41','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M41">View MathML</a>, we only have the path p1 = ncnand with path constraint (nc, nd, bp, e) (Figure 4(b)); if <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M42','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M42">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M43','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M43">View MathML</a>, we only have the path p2 = ncnbnd with tree constraint (nc, nd, bt) (Figure 4(c)); if, <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M44','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M44">View MathML</a> all four nodes are predetermined and we can determine their p-values directly (Figure 4(d)). No useful constraints other than (bc, e), (nc, nd, bp, e), and (nc, nd, bt) can be derived from this local cycle. Here we already know that (bc, e) does not exist. If (nc, nd, bt) is already in CT, it is the only useful tree constraint of e and we are finished. If (nc, nd, bt) does not exist in CT, we cannot obtain (nc, nd, bt) by combining bc and bp because (bc, e) does not exist, even if the path constraint (nc, nd, bp, e) is available. If this case holds for all 1 ≤ l m, our linear system actually provides no information to obtain the tree constraint of p2; the h-variable of each edge on p2 will eventually be assigned a free variable, or its value will depend on other free variables. Therefore we do nothing if <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M45','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M45">View MathML</a>.

thumbnailFigure 4. All possible appearances of a local cycle in a locus graph. The dotted line represents the non-tree edge e. (a) The local cycle appears with cycle constraint (bc, e). (b) There is only one path containing e with path constraint (nc, nd, bp, e). (c) There is only one path with tree constraint (nc, nd, bt). (d) There are only four predetermined nodes without any constraint.

Assume that ES is the set of non-tree edges in <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M2">View MathML</a> without cycle constraint. Cycle synthesis is carried out by concatenating paths with known path constraints or tree constraints. The extension procedure is applied to ES as follows.

E1. For each e ES, we check if there is an odd number, say 2t + 1, of path constraints of e that link different connected components in G* to form a synthetic cycle (Figure 5(a)); a constraint is said to link two components A and B if one of its endpoints resides in A and the other resides in B. There is a special case whereby we can also obtain a synthetic cycle if two endpoints of a single path constraint reside in the same connected component (t = 0). If no such 2t + 1 path constraints are found, we cannot synthesize a cycle of e and do nothing; otherwise we perform the following tasks:

thumbnailFigure 5. The concept of a synthetic cycle. (a) Five path constraints b1, b2, ..., b5 link five connected components A, B, C, D, and E to form a synthetic cycle in a constraint graph. (b) The conceptual view of the synthetic cycle of (a) in a pedigree graph. The synthetic cycle is actually a round trip through the tree edges and the non-tree edge e. The trip is composed of 10 different but connected paths in the pedigree graph. In this example, e would be visited five times during the trip.

E1.1 assign the constraint <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M46','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M46">View MathML</a> to the synthetic cycle, where

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M47','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M47">View MathML</a>

(4)

in which Se is the set of the chosen 2t + 1 path constraints;

E1.2 for each path constraint <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M48','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M48">View MathML</a> in CP, generate a tree constraint <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M49','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M49">View MathML</a> and add the new constraint into G*;

E1.3 update G*;

E1.4 remove e from ES;

E2. If ES becomes empty (there has been a synthetic cycle for every e in the original ES), or no synthetic cycle is synthesized (ES stays unchanged), we stop the extension procedure; otherwise we go back to E1 to start the next iteration.

We thus try to synthesize a cycle for each non-tree edge in ES to generate new tree constraints and update G*. To update G*, if more than one connected component is combined into a new one by new tree constraints, we arbitrarily choose one of the old seeds from these connected components as a new seed, and perform a graph traversal to update W-values within the new connected component. A non-tree edge that fails to receive a synthetic cycle in a trial of cycle synthesis may benefit from a later updated G* and therefore our extension procedure is designed to operate in an iterative fashion; the procedure terminates only if G*cannot be updated anymore. In this procedure, a non-tree edge may be checked many times (in different iterations) to form a synthetic cycle. In the worst case scenario, only one cycle is synthesized in each iteration, so we require k iterations to perform k + (k - 1) + ... + 1 = O(k2) trials of cycle synthesis.

To verify the correctness of the extension procedure, we need first to explain the meaning of Equation (4). Follow a similar argument to that of Lemma 1, for two nodes nx and ny that reside in the same connected component of G*, we know that W[nx] + W[ny] is actually the tree constraint of the path from nx to ny on T(G). The synthetic cycle is conceptually a round trip through tree edges and the non-tree edge e. The value <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M50','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M50">View MathML</a> in Equation (4) is therefore the summation of h-variables along the round trip (Figure 5(b)). Now we demonstrate that <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M50','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M50">View MathML</a> is the same as the cycle constraint of e. We first show that there is exactly one h-variable of e in <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M50','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M50">View MathML</a>. According to Equation (4), we have 2t + 1 h-variables of e in <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M50','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M50">View MathML</a>. Since we perform additions over GF(2), 2t out of the 2t + 1 h-variables will be cancelled and we finally have only one h-variable of e in <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M50','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M50">View MathML</a>. To verify if <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M50','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M50">View MathML</a>is the same as the cycle constraint of e in G, we assume that the expected cycle constraint of e is bc. We generate a set <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M51','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M51">View MathML</a> by converting path constraints (ni, nj, bp, e) in Se to tree constraints (ni, nj, bc + bp). It is easy to see that the converted 2t + 1 tree constraints also link connected components in G* to form a new synthetic cycle, and the corresponding round trip only contains tree edges in T(G). T(G) has no cycle and therefore each edge of this new round trip must be visited an even number of times, which means that its h-variable will be cancelled in the new cycle constraint. So the constraint of the new synthetic cycle must be zero and we have the following equations:

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M52','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M52">View MathML</a>

Since there are 2t + 1 constraints in <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M51','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M51">View MathML</a>, we have <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M53','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M53">View MathML</a>. We then obtain <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M54','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M54">View MathML</a> and conclude that <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M55','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M55">View MathML</a>.

For each e ES, the time to determine if there are odd number of path constraints that link connected components in G* to form a cycle is O(m). This time complexity can be achieved by regarding each connected component as a single node and each path constraint of e as a single edge, and following O(m) edges to perform a depth-first traversal. Since there are O(k2) cycle syntheses throughout the extension procedure, we require O(k2m) time to find synthetic cycles. Once we synthesized a cycle for e, we require O(m) time to convert path constraints to tree constraints because there are at most m path constraints of e in CP. There are O(k) non-tree edges in ES and therefore the extension procedure takes O(km) time to perform constraint conversion. To update G*, we require O(n) time to perform breadth-first traversal on every connected component to modify W-values similar to the way we initialize G*. There are at most k synthetic cycles and therefore G* is updated O(k) times in O(kn) time. In summary, the Step 3, constraint reduction and transformation, takes O(k2m) + O(km) + O(kn) = O(k2m + kn) time.

Step 4: haplotype determination

To solve the h-values of the tree edges of T(G) by Lemma 2, we try to make G* produced by Step 3 connected. Firstly, we pay attention to the founders in the pedigree. Founders cannot be predetermined endpoints of paths with either path constraints or tree constraints and therefore founders must be isolated nodes in G*. It is also impossible to know whether an allele of a founder is paternal or maternal. We attach a founder nf to G* by assuming that it passed its paternal haplotype to an arbitrary child nc. The attachment can be done by assigning weight zero to the edge (nf, nc) of G*, which implies <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M56','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M56">View MathML</a> (nf passes its paternal haplotype). There are O(n) edges in G* and therefore the attachment of founders to G* takes O(n) time.

Secondly, we check if there is any non-tree edge that can link any two connected components of G*. A non-tree edge e = (ni, nj) can link two connected components A and B if we can find a path constraint (nk, nl, bp, e) of path p that, without loss of generality, satisfies the following two conditions:

1. nk and ni reside in A and have available W [nk] and W [ni] derived from the seed nA of A,

2. nl and nj reside in B and have available W [nl] and W [nj] derived from the seed nB of B.

If we can find such a non-tree edge e, we can decompose p into three parts: a sub-path from nk to ni, the non-tree edge e, and the sub-path from nj to nl. The constraints of these three parts are <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M57','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M57">View MathML</a>, respectively. This turns out that <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M58','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M58">View MathML</a>. The non-tree edge e therefore can be used to link components A and B with known h-value <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M59','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S19/mathml/M59">View MathML</a>. Since there are at most O(mn) path constraints to be checked, this procedure requires O(mn) time.

Finally, assume that there remain t connected components of G*. We arbitrarily introduce (t -1) edges into G* to make it connected. Our algorithm does not impose any constraint on these (t - 1) edges and therefore the weights of these edges can be safely set as free variables. We then update all W-values within the new connected G* (new W-values may contain free variables), and apply Lemma 2 to determine the h-values of all edges in T(G). With these solved h-values as well as the w-constants and d-constants, we can determine the p-values of all nodes in the locus graphs by Equation (1).

The update of G* takes O(n) time. Moreover, we require O(n) time to determine all h-values of edges in T(G). For a locus graph, the determined h-values are used to solve all p-values in O(n) time. Since there are m locus graphs, we require O(mn) time to determine the p-vectors of all nodes in G. Consequently, the three procedures of this step take O(n) + O(mn) + O(n) + O(mn) = O(mn) time.

Results and discussion

An execution example

We use the pedigree given in Figure 6(a) as an example to demonstrate how the proposed algorithm works. There are 19 individuals in the pedigree; eight of them are founders. Each individual is equipped with genotype data collected from four marker loci. There is a mating loop in the pedigree.

thumbnailFigure 6. An execution example. (a) A pedigree of 19 individuals with genotype data. (b) The corresponding pedigree graph G with a spanning tree T(G). Tree edges are solid lines and non-tree edges are dotted lines. (c) Locus graphs and forests. Nodes with thick borders are predetermined. (d) The corresponding constraint graph G*. Nodes and solid lines compose the initial constraint graph. The three dotted lines are path constraints that form a synthetic cycle of the non-tree edge B-F. (e) The final G*. All edges except B-F have weight zero; hB, F is a free variable.

In the first step, we transform the input pedigree into a pedigree graph G and construct a spanning tree T(G) on G (Figure 6(b)). There are three local cycles A-H-B-I-A, D-K-E-L-D, and P-R-Q-S-P and one global cycle B-F-C-G-D-L-O-Q-N-I-B in G. Edges A-H, E-L, Q-R, and B-F are chosen as non-tree edges within the four cycles. From the pedigree graph G, we construct the four locus graphs and forests that are depicted in Figure 6(c). The p-values of predetermined nodes, w-constants of all nodes, and d-constants of all edges within the four locus graphs are also identified.

In the second step, we generate all cycle, path, and tree constraints for each of the four locus graphs using Equations (2) and (3). For example, cycle A-H-B-I-A in the second locus graph has cycle constraint hA, H + hH, B +hB, I +hI, A = dA, H [2] + dH, B [2] +dB, I [2] +dI, A [2] = 0 + 1 + 1 + 0 = 0, and path G-C-F-B-I-N-Q in the third locus graph has path constraint of the non-tree edge B-F hG, C + hC, F + hF, B + hB, I + hI, N + hN, Q = pG[3] + dG, C[3] + dC, F [3] + dF, B[3] + dB, I [3] + dI, N [3] + dN, Q[3] + pQ[3] = 0 + 0 + 0 + 1 + 1 + 0 + 0 + 0 = 0.

At the end of this step we receive CC = {(0, eE-L), (0, eQ-R), (0, eA-H)}, CP = {(I, H, 0, eA-H), (N, G, 0, eB-F), (G, Q, 0, eB-F), (R, Q, 0, eR-Q), (N, Q, 0, eB-F)}, and CT = {(I, N, 0), (G, K, 0), (G, L, 0), (G, O, 0), (Q, S, 0)}.

In the third step, we obtain two new tree constraints (R, Q, 0) and (I, H, 0) by (0, eQ-R) + (R, Q, 0, eR-Q) and (0, eA-H) + (I, H, 0, eA-H), respectively. The set CT is therefore extended to {(I, N, 0), (G, K, 0), (G, L, 0), (G, O, 0), (Q, S, 0), (R, Q, 0), (I, H, 0)}. We construct the initial constraint graph G* based on the updated CT (Figure 6(d)). In the initial G*, we choose H, G, and Q as component seeds to determine W-values. We can further find that the three path constraints (N, G, 0, eB-F),(G, Q, 0, eB-F), and (N, Q, 0, eB-F) link three connected components to form a synthetic cycle of the non-tree edge B-F with constraint zero. So we further obtain three extra tree constraints (N, G, 0), (G, Q, 0), and (N, Q, 0) derived from the synthetic cycle and add them to G*.

In the final step, we try to make G* connected to solve all h-values. We first arbitrarily introduce eight edges A-H, B-I, C-G, D-G, E-L, J-N, M-O, and P-R to attach the eight founders to G*; all the eight edges are of weight zero to imply that founders passes their paternal haplotypes to one of their children. Now there are only two connected components in G*, one of which is an isolated node, F. we attach F to G* by set hB, F as a free variable. This final connected G* is depicted in Figure 6(e). After the final update of G*, all h-values other than hB, F are zero, and hB, F is free to be either zero or one. Given these known h-values, all p-values over the four locus graphs can be solved by Equation (1).

Time complexity and experimental result

According to the analyses at the end of each step in Section 3, the time complexity of our algorithm is O(mn) (step 1: preprocessing) + O(kmn) (step 2: constraint generation) + O(k2m + kn) (step 3: constraint reduction and transformation) + O(mn) (step 4: haplotype determination) = O(kmn + k2m). Because k is regarded as a constant, our algorithm is linear.

To verify the efficiency and the correctness of our algorithm, we conducted some experiments using the proposed method. Our algorithm was implemented in C and was evaluated on a desktop computer equipped with Intel Core i7-2600 3.4 GHz CPU and 8 GB of RAM. The desktop ran Ubuntu Release 11.10 operating system with Linux kernel 3.0.0-16-generic and GNOME 3.2.1 graphical user interface.

In the experiments, we generated test cases by setting different number of individuals (n) and markers (m). We applied the algorithm developed by Thomas et al. [24] to generate 12 tree pedigrees with different n values ranging from 30 to 400. To observe how the number of mating loops (k) affects our algorithm, each tree pedigree was preprocessed to produce four variants with zero, two, four, and six mating loops. For each pedigree, we examined 10 different m values ranging from 10 to 300. Each (n, m) combination was tested 100 times. Each time we generated new genotypes and randomly selected one pedigree from the four variants of the given n. The haplotype configurations of all the 12000 trials were correctly identified. The experimental results are listed in Table 2.

Table 2. Experimental results

Table 2(a) shows that unknown p-variables were correctly solved without assigning any free variable if the number of marker loci was not less than 30, which covers most practical cases in regular genotyping. Free variables were required only when the number of marker loci was far less than the number of individuals. In this experiment, free variables were used only when m = 10, and they were used at most five times out of 100 trials. The result is reasonable because the dimension of the solution space of a pedigree with a limited number of marker loci is probable less than the number of unknown p-variables.

Table 2(b) shows the cumulative execution time of 100 trials of each (n, m) combination. We received a fluctuation in execution time if n or m were less than 100. We conjecture that, because the algorithm executes very fast for small values of n or m, the cumulative execution time might be dramatically affected by the context switches within the operating system that ran many background services. Furthermore, we believe that when both n and m were larger than 100, the execution time of the algorithm became more significant than that of the context switches. From the table it is apparent that the execution time is linear for the larger n and m values.

Finally, Table 2(c) shows that mating loops existed evenly throughout all 12000 trials, with the number ranging from zero to six per pedigree, and the number did not affect the linearity of the execution time of our algorithm in relation to the input size of n and m.

Issue of spanning tree and seed node selection

In the first step, preprocessing, a spanning tree T(G) is constructed on the pedigree graph G. As mentioned above, T(G) is constructed for the ease of cycle processing; it is merely an auxiliary data structure used to generate linear constraints of all cycles and paths between predetermined nodes in G. We do not impose any constraint on the construction of T(G) because predetermined nodes are defined by genotype data. Once the input pedigree is given, all the cycles and paths as well as their constraints are bound, no matter which spanning tree is constructed on the pedigree graph. Different spanning trees assign different edges as the non-tree edge in a cycle, and only affect the type of a constraint; a constraint may be a path constraint with respective to one spanning tree and a tree constraint with respect to another spanning tree. Since different spanning trees are used to generate the same set of constraints, without considering their type, the construction of the spanning tree can be arbitrary. In our implementation, T(G) was constructed by depth-first traversal.

In the second step, constraint generation, a seed node is arbitrarily selected from T(G) to generate tree constraints. To see why the seed node can be selected arbitrarily, assume that there are two possible seeds ni and nj. For any other predetermined node nk, we have (nj, nk, bjk) = (ni, nj, bjk) + (ni, nk, bik) by Lemma 1, which means that a tree constraint seeded with one predetermined node is a linear combination of two tree constraints seeded with another predetermined node. Hence, tree constraints seeded with different predetermined nodes are mathematical equivalent; we can safely choose any predetermined node as seed. Similarily, the seed nodes within a constraint graph can also be selected arbitrarily based on the above argument.

Consistency checking

Although we assume that the input pedigree is free of genotyping errors, our algorithm can be easily modified to detect inconsistencies within the genotype data without loss of efficiency. No recombination is allowed in the input pedigree and therefore inconsistencies will arise if there are different assignments of an h-value, that results in incompatible linear constraints. We may designate the following two checkpoints to detect inconsistencies within our linear system:

1. The generation of constraints. The constraint of a path or a cycle may be computed more than one time across all locus graphs; all these computations should arrive at the same value. So each time we compute a constraint, we check if it is the same as the current value, if any.

2. The initialization/update of G*. There may be loops in the constraint graph G* and therefore it is possible that there are more than one path from the seed ns to a node ni. It turns out that W [ni] may be assigned more than once in the initialization or update procedures of G*. By the definition of W-variables, however, all the assignments to W[ni] are actually associated with the same path from ns to ni on T (G) and therefore should be identical. So each time we compute a W-value, we check if it agrees with the current value, if any.

Conclusions

In this study, we proposed and implemented an algorithm to solve the zero-recombinant haplotype configuration (ZRHC) problem for a general pedigree in O(kmn + k2m) time. With the aid of free variables, our method provides a general solution to describe possible haplotype structures within a pedigree rather than a particular solution that only assigns a specific numerical setting to haplotypes. To the best of our knowledge, this algorithm is the first deterministic one to provide a general solution in linear time for pedigrees having small number of mating loops. Moreover, the algorithm can be easily modified to detect inconsistency among genotype data without loss of efficiency. Our experimental results confirm its linearity. In the future, we will try to extend the proposed algorithm to handle recombination and missing data in linear time for general pedigrees.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

EYL, WBW, TJ, and KPW contributed to the algorithm design. EYL implemented the algorithms and performed the experiments. KPW and EYL analyzed the complexity of the algorithm and wrote the paper. All authors read and approved the final manuscript.

Acknowledgements

This work was supported in part by the National Science Council, Taiwan under NSC99-2320-B-010-022-MY2.

This article has been published as part of BMC Bioinformatics Volume 13 Supplement 17, 2012: Eleventh International Conference on Bioinformatics (InCoB2012): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S17.

References

  1. Qian D, Beckmann L: Minimum-recombinant haplotyping in pedigrees.

    The American Journal of Human Genetics 2002, 70(6):1434-1445. Publisher Full Text OpenURL

  2. Albers CA, Heskes T, Kappen HJ: Haplotype inference in general pedigrees using the cluster variation method.

    Genetics 2007, 177(2):1101-1116. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  3. Chin FYL, Zhang Q, Shen H: k-recombination haplotype inference in pedigrees. In Proceedings of the International Conference on Computational Science (ICCS). Springer-Verlag, Berlin; 2005:985-993. OpenURL

  4. Li J, Jiang T: Efficient rule-based haplotyping algorithms for pedigree data. In Proceedings of the 7th Annual Conference on Research in Computational Molecular Biology (RECOMB). ACM, New York; 2003:197-206. OpenURL

  5. Li J, Jiang T: An exact solution for finding minimum recombinant haplotype configurations on pedigrees with missing data by integer linear programming. In Proceedings of the 8th Annual Conference on Research in Computational Molecular Biology (RECOMB). ACM, New York; 2004:20-29. OpenURL

  6. Li J, Jiang T: Computing the minimum recombinant haplotype configuration from incomplete genotype data on a pedigree by integer linear programming.

    Journal of Computational Biology 2005, 12(6):719-739. PubMed Abstract | Publisher Full Text OpenURL

  7. Sobel E, Lange K, O'Connell JR, Weeks DE: Haplotyping algorithms. In Genetic Mapping and DNA Sequencing, Volume 81 of IMA Volumes in Mathematics and its Applications. Edited by Speed T, Waterman MS. Springer-Verlag; 1996:89-110. OpenURL

  8. Tapadar P, Ghosh S, Majumder PP: Haplotyping in pedigrees via a genetic algorithm.

    Human Heredity 2000, 50:43-56. PubMed Abstract | Publisher Full Text OpenURL

  9. O'Connell JR: Zero-recombinant haplotyping: Applications to fine mapping using SNPs.

    Genetic Epidemiology 2000, 19:64-70. PubMed Abstract | Publisher Full Text OpenURL

  10. Chan MY, Chan WT, Chin FYL, Fung SPY, Kao MY: Linear-time haplotype inference on pedigrees without recombinations and mating loops.

    SIAM J Comput 2009, 38(6):2179-2197. Publisher Full Text OpenURL

  11. Li X, Li J: Efficient haplotype inference from pedigrees with missing data using linear systems with disjoint-set data structure.

    7th Annual International conference on Computational Systems Bioinformatics 2008, 297-308. OpenURL

  12. Liu L, Jiang T: A linear-time algorithm for reconstructing zero-recombinant haplotype configuration on pedigrees without mating loops.

    Journal of Combinatorial Optimization 2008, 19:217-240. OpenURL

  13. Baruch E, Weller JI, Cohen-Zinder M, Ron M, Seroussi E: Efficient inference of haplotypes from genotypes on a large animal pedigree.

    Genetics 2006, 172(3):1757-1765. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  14. Doan DD, Evans PA, Horton JD: A near-linear time algorithm for haplotype determination on general pedigrees. [http:/ / www.biomedsearch.com/ nih/ near-linear-time-algorithm-haplotyp e/ 20937017.html] webcite

    J Comput Biol 2010, 17(10):1451-65. PubMed Abstract | Publisher Full Text OpenURL

  15. Wang WB, Jiang T: Efficient inference of haplotypes from genotypes on a pedigree with mutations and missing alleles. In CPM 2009, LNCS 5577. Edited by Kucherov G, Ukkonen E. Springer-Verlag Berlin Heidelberg; 2009:353-367. OpenURL

  16. Xiao J, Liu L, Xia L, Jiang T: Efficient algorithms for reconstructing zero-recombinant haplotypes on a pedigree based on fast elimination of redundant linear equations.

    SIAM J Comput 2009, 38:2198-2219. Publisher Full Text OpenURL

  17. Zhang K, Sun F, Zhao H: HAPLORE: a program for haplotype reconstruction in general pedigrees without recombination.

    Bioinformatics 2005, 21:90-103. PubMed Abstract | Publisher Full Text OpenURL

  18. Gusfield D: Inference of haplotypes from samples of diploid populations: complexity and algorithms.

    Journal of Computational Biology 2001, 8(3):305-323. PubMed Abstract | Publisher Full Text OpenURL

  19. Niu T, Qin ZS, Xu X, Liu JS: Bayesian haplotype inference for multiple linked single-nucleotide polymorphisms.

    The American Journal of Human Genetics 2002, 70:157-169. Publisher Full Text OpenURL

  20. Stephens M, Smith NJ, Donnelly P: A new statistical method for haplotype reconstruction from population data.

    The American Journal of Human Genetics 2001, 68(4):978-989. Publisher Full Text OpenURL

  21. Wang S, Kidd KK, Zhao H: On the use of DNA pooling to estimate haplotype frequencies.

    Genetic Epidemiology 2003, 24:74-82. PubMed Abstract | Publisher Full Text OpenURL

  22. Yang Y, Zhang J, Hoh J, Matsuda F, Xu P, Lathrop M, Ott J: Efficiency of single-nucleotide polymorphism haplotype estimation from pooled DNA.

    Proceedings of the National Academy of Science of the United States of America 2002, 100:7225-7230. OpenURL

  23. Browning SR, Browning BL: Haplotype phasing: existing methods and new developments. [http://www.nature.com/nrg/journal/v12/n10/full/nrg3054.html] webcite

    Nature Reviews Genetics 2011, 12:703-714. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  24. Thomas A, Cannings C: Simulating realistic zero loop pedigrees using a bipartite Prufer code and graphical modelling. [http:/ / www.biomedsearch.com/ nih/ Simulating-realistic-zero-loop-pedi grees/ 15567888.html] webcite

    Math Med Biol 2004, 21(4):335-45. PubMed Abstract | Publisher Full Text OpenURL