Institute of Biomedical Informatics, National Yang Ming University, Taipei 112, Taiwan

Bioinformatics Program, Taiwan International Graduate Program, Academia Sinica, Taipei 115, Taiwan

Department of Computer Science and Engineering, University of California, Riverside, CA 92521, USA

Abstract

Background

When studying genetic diseases in which genetic variations are passed on to offspring, the ability to distinguish between paternal and maternal alleles is essential. Determining haplotypes from genotype data is called haplotype inference. Most existing computational algorithms for haplotype inference have been designed to use genotype data collected from individuals in the form of a pedigree. Because a haplotype is regarded as a hereditary unit, the preferred input pedigrees are those that are free of mutational events and have a minimum number of genetic recombination events. These ideas motivated the zero-recombinant haplotype configuration (ZRHC) problem, which strictly follows the Mendelian law of inheritance: one haplotype of each child is inherited from the father and the other from the mother, both without any mutation. So far, no linear-time algorithm for ZRHC has been proposed for general pedigrees, even though the number of mating loops in a human pedigree is usually very small and can be regarded as constant.

Results

Given a pedigree with ^{2}

Conclusions

We have developed the first deterministic linear-time algorithm for the zero-recombinant haplotype configuration problem. Our experimental results demonstrated the linearity of its execution time in relation to the input size. The proposed algorithm can be modified to detect inconsistency within the genotype data without loss of efficiency and is expected to be able to handle recombinant and missing data with further extension.

Background

A

Existing computational algorithms for haplotype inference can be classified as statistical or combinatorial, and most of them were designed for genotype data collected from individuals in the form of a

**A pedigree of 11 members**. (a) A pedigree of 11 members coupled with genotype data. The paternal haplotype of an individual is listed left while its maternal haplotype is listed right, even though the haplotype information is not available from genotyping. For example, the paternal and maternal haplotypes of individual _{6 }are 0100 and 1110, respectively; the genotype of _{6}, however, is specified as {0, 1}{1, 1}{0, 1}{0, 0}. Circles represent females and boxes represent males. Children are listed below their parents with line connections. For example, the couple _{7 }and _{8 }have two children _{9 }and _{10}. There is a mating loop in the pedigree due to the common ancestor _{2 }of the couple _{5 }and _{9}. (b) A pedigree graph with a spanning tree. Tree edges are solid lines and non-tree edges are dotted lines. The genotype data are represented as vectors of _{7 }and _{8 }and their children _{9 }and _{10}. There is a global cycle of length 6 due to the mating loop. (c) There are four locus graphs for the different loci. Edges in locus forests are depicted as solid lines. Nodes with thick borders are predetermined.
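The relationship between a haplotype pair and its genotype in panel (a) can be illustrated with a minimal Python sketch. The function name `genotype_from_haplotypes` and the string encoding of haplotypes are assumptions made for illustration only; they are not taken from the original implementation.

```python
from typing import List, Tuple


def genotype_from_haplotypes(paternal: str, maternal: str) -> List[Tuple[int, int]]:
    """Return the unordered allele pair at each locus (hypothetical helper)."""
    if len(paternal) != len(maternal):
        raise ValueError("both haplotypes must cover the same loci")
    # Sorting each pair discards the parental origin, as genotyping does.
    return [tuple(sorted((int(p), int(m)))) for p, m in zip(paternal, maternal)]


# Haplotypes of the individual discussed in the caption above.
genotype = genotype_from_haplotypes("0100", "1110")
print(genotype)  # [(0, 1), (1, 1), (0, 1), (0, 0)]

# Swapping the two haplotypes gives the same genotype: the phase is lost.
swapped = genotype_from_haplotypes("1110", "0100")
```

The first and third loci are heterozygous, so the genotype alone does not reveal which allele came from which parent; this is the ambiguity that haplotype inference has to resolve.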

In this study, we have targeted the ZRHC problem for pedigree data. If we assume we are given a pedigree with ^{3}^{3}) time algorithm by converting the inheritance process into an equivalent linear system of ^{2 }+ ^{3 }log^{2 }^{2 }+ ^{3 }log^{2 }^{3})

In this paper, we presented an ^{2}

Methods

To apply computational techniques, we transformed the input pedigree into a ^{X}

In the rest of this paper, we are assuming that _{i }_{i }

Genotype data are available, thus all _{i}_{i }

The vector _{i }

We formulated the ZRHC problem as follows.

**ZRHC **_{i }

The haplotype configuration of the input pedigree is identified by specifying the paternal haplotype of each family member.
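Specifying the paternal haplotype suffices because, once the genotype is known, the maternal haplotype follows locus by locus. The sketch below illustrates this under the assumption that each genotype is stored as a pair of alleles; `maternal_from_paternal` is a hypothetical helper name, not part of the original implementation.

```python
from typing import List, Sequence, Tuple


def maternal_from_paternal(
    genotype: Sequence[Tuple[int, int]], paternal: Sequence[int]
) -> List[int]:
    """Recover the maternal haplotype from a genotype and a paternal haplotype."""
    maternal = []
    for (first, second), allele in zip(genotype, paternal):
        if allele == first:
            maternal.append(second)  # the other allele must be maternal
        elif allele == second:
            maternal.append(first)
        else:
            raise ValueError("paternal allele is not present in the genotype")
    return maternal


# Genotype {0,1}{1,1}{0,1}{0,0} with paternal haplotype 0100.
genotype = [(0, 1), (1, 1), (0, 1), (0, 0)]
maternal = maternal_from_paternal(genotype, [0, 1, 0, 0])
print(maternal)  # [1, 1, 1, 0]
```

At homozygous loci both haplotypes carry the same allele, so only heterozygous loci contribute genuine unknowns.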

A system of linear equations over GF(2)

In this section, we introduce a system of linear equations based on

Addition (+) and multiplication (·) over GF(2)

| + | 0 | 1 |   | · | 0 | 1 |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 |   | 0 | 0 | 0 |
| 1 | 1 | 0 |   | 1 | 0 | 1 |
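As a small illustration, arithmetic over GF(2) maps directly onto bitwise operations: addition is exclusive-or and multiplication is logical and. The function names below are chosen for this sketch only.

```python
def gf2_add(a: int, b: int) -> int:
    """Addition over GF(2) is exclusive-or."""
    return a ^ b


def gf2_mul(a: int, b: int) -> int:
    """Multiplication over GF(2) is logical and."""
    return a & b


# Rebuild both operation tables, indexed as table[a][b].
add_table = [[gf2_add(a, b) for b in (0, 1)] for a in (0, 1)]
mul_table = [[gf2_mul(a, b) for b in (0, 1)] for a in (0, 1)]
print(add_table)  # [[0, 1], [1, 0]]
print(mul_table)  # [[0, 0], [0, 1]]
```

Since x + x = 0 for every x in GF(2), any term that appears twice in a sum cancels, a property the redundancy arguments later in the paper rely on.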

The building block of the system: inheritance

"Inheritance" is the building block of the system. What parents pass to their children must be the same as what children receive from their parents. For a parent _{i }_{i }_{i }_{i }_{i }_{i }

The variable _{i }_{j}

Therefore, _{i }_{j}

On the other hand, assume that _{j }_{i}_{i }_{j}_{i }_{j }_{j}_{i }_{j}_{j }_{i }_{j }_{j}_{j }_{i }_{j }_{j }_{j}_{j }

the inheritance relationship can be unified into the following equation:

Note that the _{i}, n_{j}_{i }_{j}_{i}, n_{j}_{j }

Linear constraints on h-variables

To reduce the computational complexity of our algorithm, we try to make the number of unknowns in the coming linear system as small as possible. In the pedigree graph _{i }_{i }_{i}, n_{j}_{i }_{i}, n_{j}_{l }_{l}_{l }_{i}, n_{j}_{i }_{j}_{l }_{l}

We define constraints on _{0}, _{1}, ..., _{i }_{l}_{0 }and _{i }

Since _{0 }and _{i }_{0}, _{1}, ..., _{i}, n_{0 }in _{l}

Again, since all _{l}

Cycle and Path constraints

Adding a non-tree edge _{l }_{l}

**Case 1 **_{l}_{c }_{c }_{c}, e

**Case 2 **_{l }_{i }_{j }_{i }_{j }_{p }_{p }_{i}, n_{j}, b_{p}, e_{i}, n_{j}, b_{p}, e_{j}, n_{i}, b_{p}, e

Tree constraints

For each connected component of _{l}_{s }_{s }_{k }_{t }_{t }_{s}, n_{k}, b_{t}_{s}, n_{k}, b_{t}_{k}, n_{s}, b_{t}

Our algorithm in relation to the ZRHC problem

Our algorithm consists of four steps. We begin by initializing required data structures in the _{i}

Step 1: preprocessing

The data structures of our algorithm are initialized by the following procedures:

1. Transform the pedigree into a pedigree graph _{i }

2. Construct a spanning tree

3. For each locus

(a) generate a locus graph _{l}

(b) generate a locus forest _{l}

(c) identify predetermined nodes as well as their

The operations applied in this step are graph traversal and spanning tree construction; both can be performed in time
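To illustrate why such preprocessing is linear in the size of the graph, the following sketch builds a spanning forest by breadth-first search and reports the edges left outside it. It works on a generic undirected graph; the function name, the edge-list representation, and the example graph are assumptions made for illustration, not the data structures of the original implementation.

```python
from collections import deque


def spanning_forest(num_nodes, edges):
    """Split `edges` into tree edges and non-tree edges using BFS."""
    adjacency = [[] for _ in range(num_nodes)]
    for index, (u, v) in enumerate(edges):
        adjacency[u].append((v, index))
        adjacency[v].append((u, index))

    visited = [False] * num_nodes
    tree_ids = set()
    for root in range(num_nodes):  # one BFS per connected component
        if visited[root]:
            continue
        visited[root] = True
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for v, index in adjacency[u]:
                if not visited[v]:
                    visited[v] = True
                    tree_ids.add(index)  # first edge reaching v joins the tree
                    queue.append(v)

    tree = [edges[i] for i in sorted(tree_ids)]
    non_tree = [e for i, e in enumerate(edges) if i not in tree_ids]
    return tree, non_tree


# Hypothetical graph: a cycle of length 6 plus one pendant node.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (2, 6)]
tree, non_tree = spanning_forest(7, edges)
print(len(tree), non_tree)  # 6 [(3, 4)]
```

Every node and every edge is examined a constant number of times, and each non-tree edge closes exactly one cycle with the tree.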

Step 2: constraint generation

A system of linear equations on ^{C}, C^{P}^{T }^{C }^{P }^{T }

There are ^{C }^{P }

Step 3: constraint reduction and transformation

Redundancy arises in the constraint system when a constraint can be expressed as a linear combination of other constraints. We are especially interested in the following two types of redundancy.

**Type 1 **Assume there is a basic cycle _{1 }and _{2 }both connecting nodes _{i }_{j}_{1}. If there is a cycle constraint (_{c}, e_{i}, n_{j}, b_{p}, e_{1}, and a tree constraint (_{i}, n_{j}, b_{t}_{2}, we have _{c }_{p }_{t }_{t }_{p }_{c}

**Two types of redundancy arise from linearly dependence**. (a) A cycle _{2 }and a path _{1 }that contains a non-tree edge _{c }_{p }_{t}_{c}, b_{p}_{t }_{1 }and path _{2}, respectively. The dotted line represents the non-tree edge _{l }_{i }_{3}. Assume that the constraint of tree path _{i }_{i}_{1 }= _{4 }+ _{6}, _{2 }= _{4 }+ _{5}, and _{3 }= _{5 }+ _{6}, which conclude that _{1 }= _{2 }+ _{3 }due to the addition over GF(2).

**Type 2** Assume there are three tree constraints $(n_i, n_j, b_1)$, $(n_i, n_k, b_2)$, and $(n_j, n_k, b_3)$ of paths $p_1$, $p_2$, and $p_3$, respectively. By definition we know that a tree constraint is the summation of all

Suppose that _{l }_{i }_{3}. We then have three paths _{4 }between _{i }_{l}, p_{5 }between _{l }_{j}_{6 }between _{l }_{k }_{1 }= _{4 }+ _{5}, _{2 }= _{4 }+ _{6}, and _{3 }= _{5 }+ _{6}. The tree constraints can therefore be rewritten as

Because all constraints are defined over GF(2), we conclude that $b_1 + b_2 = b_3$; the three tree constraints are linearly dependent and each of them can be represented as a linear combination of the other two constraints (Figure
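Linear dependence over GF(2) can also be checked mechanically. The sketch below is a generic XOR-basis elimination, not the reduction procedure proposed in this paper: the left-hand side of each constraint is encoded as an integer bitmask (an assumed encoding, one bit per unknown), and a row is redundant when it reduces to zero against the rows kept so far.

```python
def independent_rows(rows):
    """Return indices of rows that are not GF(2) combinations of earlier rows."""
    basis = {}  # highest set bit -> reduced row owning that pivot
    kept = []
    for index, row in enumerate(rows):
        reduced = row
        while reduced:
            pivot = reduced.bit_length() - 1
            if pivot not in basis:
                basis[pivot] = reduced  # new pivot: the row is independent
                kept.append(index)
                break
            reduced ^= basis[pivot]  # cancel the pivot bit and keep reducing
        # Leaving the loop with reduced == 0 means the row was redundant.
    return kept


# Three hypothetical constraints, each the sum of two of three unknowns:
# x0 + x1, x0 + x2, and x1 + x2 (bit i stands for unknown x_i).
rows = [0b011, 0b101, 0b110]
kept = independent_rows(rows)
print(kept)  # [0, 1]
```

The third row equals the sum of the first two, so only two of the three constraints carry information, which mirrors the Type 2 situation described above.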

**Lemma 1** For any three nodes $n_i$, $n_j$, and $n_k$ of a tree, the tree constraint of the path between $n_j$ and $n_k$ is equal to the total tree constraint of the path between $n_i$ and $n_j$ and the path between $n_i$ and $n_k$.

Lemma 1 still holds even if _{i }_{j }_{k }_{i }_{l }
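Lemma 1 can be checked on a small example. The sketch below assumes that every tree edge carries a GF(2) label and that the value of a path is the sum of the labels along it; the tree, its labels, and the helper `path_value` are hypothetical.

```python
from itertools import permutations


def path_value(tree, start, goal):
    """XOR of the edge labels on the unique tree path from start to goal."""
    stack = [(start, None, 0)]
    while stack:
        node, previous, value = stack.pop()
        if node == goal:
            return value
        for neighbour, label in tree[node]:
            if neighbour != previous:  # never walk back along the same edge
                stack.append((neighbour, node, value ^ label))
    raise ValueError("nodes are not connected")


# Hypothetical labelled tree on five nodes: (u, v, label).
labelled_edges = [(0, 1, 1), (1, 2, 0), (1, 3, 1), (3, 4, 1)]
tree = {node: [] for node in range(5)}
for u, v, label in labelled_edges:
    tree[u].append((v, label))
    tree[v].append((u, label))

# Check the identity for every ordered triple of distinct nodes.
lemma_holds = all(
    path_value(tree, j, k) == path_value(tree, i, j) ^ path_value(tree, i, k)
    for i, j, k in permutations(range(5), 3)
)
print(lemma_holds)  # True
```

The identity holds because the edges shared by the two paths that start at the first node appear twice in the sum and cancel over GF(2).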

In this step, we remove the type 1 redundancy by transforming as many path constraints to tree constraints as possible, and remove the type 2 redundancy by reducing ^{T }

For each non-tree edge ^{X}_{c}, e_{i}, n_{j}, b_{p}, e^{P }_{i}, n_{j}, b_{c }_{p}^{T}^{P }^{T }

To further remove the redundancy in ^{T}_{i}, n_{j}, b_{t}^{T}_{i }_{j }_{t }^{T}_{s }_{i }_{i}_{s }_{i }

**The concept of a constraint graph**. (a) A tree constraint (_{i}, n_{j}, b_{t}_{i }_{j }_{i }_{j }_{t }_{3}, _{11}), (_{7}, _{5}), and (_{9}, _{5}) in the constraint graph, which means that there are three tree constraints in the linear system. Note that the constraint graph is disconnected and contains several connected components.

1. _{s}_{s }

2. start from _{s}_{i }_{j }_{i}, n_{j}, b_{t}^{T }_{j}, n_{i}, b_{t}^{T}

3. as we traverse from _{i }_{j }_{i}, n_{j}, b_{t}_{j}, n_{i}, b_{t}_{j }_{i}_{t }_{j}

Since _{i}_{s}, n_{i}, W_{i}_{s }_{i}

**Lemma 2 **_{i}, n_{j}) in T(G) can be obtained by _{i }and n_{j }reside in the same connected component of G

Therefore, if we can assign ^{T }^{T}
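The traversal idea can be sketched as follows, under stated assumptions. Each constraint is modelled as a triple (i, j, b) requiring that the labels of nodes i and j differ by b over GF(2); every connected component gets a seed node labelled 0, and the remaining labels are propagated by breadth-first search. The names `assign_offsets`, `offset`, and `component` are hypothetical, and b is treated as a single bit or an integer bitmask, which may differ from the representation used in the original implementation.

```python
from collections import deque


def assign_offsets(num_nodes, constraints):
    """Label nodes so that offset[i] ^ offset[j] == b for each (i, j, b)."""
    adjacency = [[] for _ in range(num_nodes)]
    for i, j, b in constraints:
        adjacency[i].append((j, b))
        adjacency[j].append((i, b))

    offset = [None] * num_nodes
    component = [None] * num_nodes
    consistent = True
    for seed in range(num_nodes):
        if offset[seed] is not None:
            continue
        offset[seed], component[seed] = 0, seed  # seed of a new component
        queue = deque([seed])
        while queue:
            i = queue.popleft()
            for j, b in adjacency[i]:
                if offset[j] is None:
                    offset[j], component[j] = offset[i] ^ b, seed
                    queue.append(j)
                elif offset[j] != offset[i] ^ b:
                    consistent = False  # two routes disagree on the label
    return offset, component, consistent


constraints = [(0, 1, 1), (1, 2, 1), (3, 4, 0)]
offset, component, consistent = assign_offsets(5, constraints)
print(offset, component, consistent)  # [0, 1, 0, 0, 0] [0, 0, 0, 3, 3] True

# The implied constraint between two nodes of one component is a single XOR.
implied_0_2 = offset[0] ^ offset[2]
```

After one linear-time pass, the constraint between any two nodes in the same component is available in constant time, and a contradictory constraint is detected as soon as it is traversed.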

The constraint graph ^{X }^{C}

For a non-tree edge _{a }_{b }_{c }_{d}_{a}, n_{d}_{c}, e_{1 }= _{c}n_{a}n_{d }_{c}, n_{d}, b_{p}, e_{2 }= _{c}n_{b}n_{d }_{c}, n_{d}, b_{t}_{c}, e_{c}, n_{d}, b_{p}, e_{c}, n_{d}, b_{t}_{c}, e_{c}, n_{d}, b_{t}^{T}_{c}, n_{d}, b_{t}^{T}_{c}, n_{d}, b_{t}_{c }_{p }_{c}, e_{c}, n_{d}, b_{p}, e_{2}; the _{2 }will eventually be assigned a free variable, or its value will depend on other free variables. Therefore we do nothing if

**All possible appearances of a local cycle in a locus graph**. The dotted line represents the non-tree edge _{c}, e_{c}, n_{d}, b_{p}, e_{c}, n_{d}, b_{t}

Assume that ^{S }^{S }

E1. For each ^{S}

**The concept of a synthetic cycle**. (a) Five path constraints _{1}, _{2}, ..., _{5 }link five connected components A, B, C, D, and E to form a synthetic cycle in a constraint graph. (b) The conceptual view of the synthetic cycle of (a) in a pedigree graph. The synthetic cycle is actually a round trip through the tree edges and the non-tree edge

E1.1 assign the constraint

in which _{e }

E1.2 for each path constraint ^{P}

E1.3 update

E1.4 remove ^{S}

E2. If ^{S }^{S}^{S }

We thus try to synthesize a cycle for each non-tree edge in ^{S }^{2}) trials of cycle synthesis.

To verify the correctness of the extension procedure, we need first to explain the meaning of Equation (4). Follow a similar argument to that of Lemma 1, for two nodes _{x }_{y }_{x}_{y}_{x }_{y }_{c}_{i}, n_{j}, b_{p}, e_{e }_{i}, n_{j}, b_{c }_{p}

Since there are 2

For each ^{S}^{2}) cycle syntheses throughout the extension procedure, we require ^{2}^{P}^{S }^{2}^{2}

Step 4: haplotype determination

To solve the _{f }_{c}_{f}, n_{c}_{f }

Secondly, we check if there is any non-tree edge that can link any two connected components of _{i}, n_{j}_{k}, n_{l}, b_{p}, e

1. _{k }_{i }_{k}_{i}_{A }

2. _{l }_{j }_{l}_{j}_{B }

If we can find such a non-tree edge _{k }_{i}_{j }_{l}

Finally, assume that there remain

The update of

Results and discussion

An execution example

We use the pedigree given in Figure

**An execution example**. (a) A pedigree of 19 individuals with genotype data. (b) The corresponding pedigree graph _{B, F }is a free variable.

In the first step, we transform the input pedigree into a pedigree graph

In the second step, we generate all cycle, path, and tree constraints for each of the four locus graphs using Equations (2) and (3). For example, cycle A-H-B-I-A in the second locus graph has cycle constraint _{A, H }_{H, B }_{B, I }_{I, A }_{A, H }_{H, B }_{B, I }_{I, A }[2] = 0 + 1 + 1 + 0 = 0, and path G-C-F-B-I-N-Q in the third locus graph has path constraint of the non-tree edge B-F _{G, C }_{C, F }_{F, B }_{B, I }_{I, N }_{N, Q }_{G}_{G, C}_{C, F }_{F, B}_{B, I }_{I, N }_{N, Q}_{Q}

At the end of this step we receive ^{C }_{E-L}_{Q-R}_{A-H}^{P }_{A-H}_{B-F)}_{B-F)}_{R-Q}_{B-F)}_{T }

In the third step, we obtain two new tree constraints (_{Q-R}_{R-Q}_{A-H}_{A-H}^{T }^{T }_{B-F)},(_{B-F)}, and (_{B-F) }link three connected components to form a synthetic cycle of the non-tree edge B-F with constraint zero. So we further obtain three extra tree constraints (

In the final step, we try to make _{B, F }_{B, F }are zero, and _{B, F }

Time complexity and experimental result

According to the analyses at the end of each step in Section 3, the time complexity of our algorithm is ^{2}^{2}

To verify the efficiency and correctness of our algorithm, we implemented it in C and evaluated it on a desktop computer equipped with an Intel Core i7-2600 3.4 GHz CPU and 8 GB of RAM. The desktop ran Ubuntu 11.10 with Linux kernel 3.0.0-16-generic and the GNOME 3.2.1 graphical user interface.

In the experiments, we generated test cases by setting different number of individuals (

Experimental results

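Rows are indexed by the number of loci (m) and columns by the number of individuals (n).

**(a) Average number of free variables**

| m \ n | 30 | 60 | 100 | 130 | 160 | 200 | 230 | 260 | 300 | 330 | 360 | 400 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 0.02 | 0.00 | 0.00 | 0.02 | 0.05 | 0.01 | 0.02 | 0.01 | 0.04 | 0.03 | 0.02 | 0.02 |
| 30 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 60 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 100 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 130 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 160 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 200 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 230 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 260 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 300 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |

**(b) Execution time (seconds) to generate solutions**

| m \ n | 30 | 60 | 100 | 130 | 160 | 200 | 230 | 260 | 300 | 330 | 360 | 400 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 0.10 | 0.40 | 0.55 | 0.50 | 0.51 | 0.61 | 0.99 | 0.52 | 1.41 | 1.77 | 1.28 | 0.41 |
| 30 | 0.11 | 0.50 | 0.90 | 0.48 | 0.66 | 0.55 | 2.28 | 1.22 | 1.06 | 2.94 | 1.16 | 2.71 |
| 60 | 0.21 | 1.66 | 1.09 | 0.68 | 2.19 | 2.77 | 1.63 | 3.26 | 3.08 | 2.76 | 2.20 | 4.38 |
| 100 | 0.49 | 1.03 | 1.79 | 1.63 | 1.93 | 2.66 | 3.21 | 3.42 | 3.63 | 4.65 | 3.42 | 3.73 |
| 130 | 1.74 | 1.55 | 2.39 | 1.24 | 1.84 | 3.57 | 3.07 | 3.49 | 4.81 | 3.38 | 5.01 | 5.26 |
| 160 | 0.57 | 1.53 | 2.17 | 1.53 | 2.99 | 3.76 | 3.71 | 4.81 | 6.44 | 4.52 | 5.99 | 6.45 |
| 200 | 1.20 | 1.82 | 2.02 | 2.10 | 5.18 | 4.31 | 4.89 | 4.49 | 5.37 | 6.16 | 6.77 | 8.87 |
| 230 | 0.70 | 2.59 | 2.34 | 2.71 | 5.52 | 3.79 | 5.28 | 6.19 | 6.63 | 7.89 | 7.87 | 9.77 |
| 260 | 1.15 | 2.29 | 2.62 | 3.72 | 5.10 | 5.99 | 5.97 | 6.45 | 7.12 | 8.34 | 10.11 | 10.77 |
| 300 | 1.67 | 2.31 | 3.33 | 4.27 | 5.27 | 6.49 | 6.35 | 7.11 | 8.70 | 9.12 | 11.40 | 13.22 |

**(c) Average number of mating loops**

| m \ n | 30 | 60 | 100 | 130 | 160 | 200 | 230 | 260 | 300 | 330 | 360 | 400 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 3.52 | 3.36 | 2.92 | 3.18 | 3.02 | 3.18 | 3.36 | 3.08 | 2.98 | 2.42 | 3.06 | 3.14 |
| 30 | 2.96 | 3.00 | 2.98 | 3.14 | 3.18 | 2.90 | 2.96 | 3.04 | 3.18 | 3.02 | 3.22 | 2.94 |
| 60 | 3.08 | 3.08 | 3.02 | 2.66 | 3.36 | 2.92 | 3.24 | 3.02 | 3.10 | 2.86 | 2.90 | 2.86 |
| 100 | 3.00 | 2.90 | 2.78 | 3.36 | 3.00 | 3.28 | 3.38 | 2.72 | 3.30 | 2.66 | 2.98 | 2.74 |
| 130 | 2.80 | 3.00 | 3.24 | 3.50 | 3.72 | 3.30 | 2.90 | 3.04 | 3.08 | 2.94 | 3.68 | 3.20 |
| 160 | 3.02 | 3.18 | 3.46 | 2.92 | 3.10 | 2.86 | 3.32 | 3.40 | 2.88 | 3.40 | 3.16 | 2.62 |
| 200 | 2.84 | 3.18 | 3.06 | 2.76 | 2.78 | 2.82 | 3.14 | 3.12 | 3.12 | 2.86 | 3.00 | 3.14 |
| 230 | 3.24 | 3.22 | 2.90 | 2.74 | 3.32 | 2.86 | 2.94 | 3.34 | 3.08 | 2.70 | 2.84 | 3.42 |
| 260 | 2.72 | 2.70 | 2.66 | 3.00 | 3.22 | 3.42 | 3.10 | 3.32 | 3.24 | 2.86 | 2.66 | 2.92 |
| 300 | 3.24 | 3.02 | 2.70 | 2.76 | 2.92 | 2.74 | 2.94 | 2.98 | 2.62 | 3.02 | 3.34 | 3.44 |

(a) Average number of free variables. (b) Execution time (seconds) to generate solutions. Each entry in the table is the cumulative execution time of 100 replicates. (c) Average number of mating loops.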

Table

Table

Finally, Table

Issue of spanning tree and seed node selection

In the first step,

In the second step, _{i }_{j}_{k}_{j}, n_{k}, b_{jk}_{i}, n_{j}, b_{jk}_{i}, n_{k}, b_{ik}

Consistency checking

Although we assume that the input pedigree is free of genotyping errors, our algorithm can be easily modified to detect inconsistencies within the genotype data without loss of efficiency. No recombination is allowed in the input pedigree and therefore inconsistencies will arise if there are different assignments of an

1.

2. _{s }_{i}_{i}_{i}_{s }_{i }

Conclusions

In this study, we proposed and implemented an algorithm to solve the zero-recombinant haplotype configuration (ZRHC) problem for a general pedigree in ^{2}

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

EYL, WBW, TJ, and KPW contributed to the algorithm design. EYL implemented the algorithms and performed the experiments. KPW and EYL analyzed the complexity of the algorithm and wrote the paper. All authors read and approved the final manuscript.

Acknowledgements

This work was supported in part by the National Science Council, Taiwan under NSC99-2320-B-010-022-MY2.

This article has been published as part of