Faculty of Computer Science University of New Brunswick, Fredericton, New Brunswick, Canada

Abstract

Background

Genetic disease studies investigate relationships between changes in chromosomes and genetic diseases. Single haplotypes provide useful information for these studies but extracting single haplotypes directly by biochemical methods is expensive. A computational method to infer haplotypes from genotype data is therefore important. We investigate the problem of computing the minimum number of recombination events for general pedigrees with two sites for all members.

Results

We show that this NP-hard problem can be parametrically reduced to the Bipartization by Edge Removal problem and therefore can be solved by an $O(2^{k} \cdot n^{2})$ exact algorithm, where $n$ is the number of members and $k$ is the number of recombination events.

Conclusions

Our work can therefore be useful for genetic disease studies to track down how changes in haplotypes such as recombinations relate to genetic disease.

Background

Human genomes contain two copies of each chromosome. Research shows that single chromosomes, called haplotypes, are useful for studying complex genetic diseases.

In the absence of recombination events, haplotypes of members in a pedigree follow the Mendelian law of inheritance, where the two haplotypes of a child are transferred from its parents, one haplotype from its father and the other from its mother. Various haplotyping algorithms exist for non-recombinant pedigree data.

**Non-recombination vs. recombination.** Recombination happens between sites 1 and 2 of parent u and the child c receives a combined haplotype from parent u. Here haplotypes of members are displayed in columns.
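The distinction in the figure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: alleles are encoded as integers, a haplotype is a pair (site 1, site 2), and the function name `recombinations_in_transmission` is hypothetical rather than taken from the paper.

```python
from typing import Tuple

Haplotype = Tuple[int, int]  # (allele at site 1, allele at site 2); integer coding is an assumption


def recombinations_in_transmission(parent: Tuple[Haplotype, Haplotype],
                                   child_hap: Haplotype) -> int:
    """Return 0 if child_hap is a plain copy of one parental haplotype,
    1 if it can only arise from a recombination between sites 1 and 2.

    Raises ValueError if this parent cannot transmit child_hap at all.
    """
    h1, h2 = parent
    if child_hap in (h1, h2):
        return 0  # Mendelian transmission without recombination
    # The two haplotypes obtainable by crossing over between the two sites.
    recombinants = {(h1[0], h2[1]), (h2[0], h1[1])}
    if child_hap in recombinants:
        return 1
    raise ValueError("haplotype is not consistent with this parent")


# Parent u is heterozygous at both sites, so a crossover is visible.
print(recombinations_in_transmission(((1, 1), (2, 2)), (1, 1)))
print(recombinations_in_transmission(((1, 1), (2, 2)), (1, 2)))
```

When the parent is homozygous at either site, both recombinant haplotypes coincide with the parental ones, so the function returns 0: a recombination in such a parent cannot be observed from two-site data.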

The haplotyping problem has been studied extensively in the last few years, both for pedigree and population data. If recombinations are allowed, the problem of inferring haplotypes for pedigrees with the minimum number of recombinations is NP-hard ^{d}n^{2}^{3}), where d is the largest number of children in a family, ^{k}^{+1}

We study the minimum haplotype configuration for general pedigrees, where each member in a pedigree has only two sites; even this restricted problem is NP-hard ^{k}^{2}), where

Concepts

A member is an individual. A set of members is called a

In diploid organisms, a cell contains two copies of each chromosome. The description data of the two copies are called a _{u}_{u}_{u}_{u}_{u}_{u}_{u}_{u}_{u}_{u}_{u}_{u}_{u}_{u}_{u}.

The problem in this paper is to find the haplotypes _{u}_{u}_{u}._{u}_{u}_{u}_{u}_{u}_{u}

**2-site-MRHC**
_{opt}
**:**

This optimization problem, called 2-site-MRHC_{opt}_{opt}

**2-site-MRHC**_{k}

There is a correspondence between an optimization version and a decision version of the MRHC problem. We can get a result for the optimization version of the problem by trying parameter
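The link between the two versions can be sketched as a simple search over the parameter. The sketch assumes the parameter is tried in increasing order and that a decision procedure is available as a callable; `decide`, `k_max`, and `minimum_parameter` are hypothetical names introduced only for this illustration.

```python
from typing import Callable, Optional


def minimum_parameter(decide: Callable[[int], bool], k_max: int) -> Optional[int]:
    """Return the smallest k in 0..k_max for which decide(k) is True.

    decide(k) stands for a solver of the decision version: "is there a
    haplotype configuration with at most k recombination events?".
    Returns None if no k up to k_max is accepted.
    """
    for k in range(k_max + 1):
        if decide(k):
            return k
    return None


# Toy oracle: pretend the pedigree needs exactly 3 recombination events.
print(minimum_parameter(lambda k: k >= 3, k_max=10))
```

Because the decision answer is monotone in the parameter (a configuration with at most k recombinations also has at most k + 1), the first accepted value is the optimum.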

Methods

We construct a pedigree graph to represent the 2-site-MRHC_{k}

Label members

Given a member

If _{u}_{u}_{u}_{u}_{u}_{u}_{u}_{u}_{u}_{u}_{u}_{u}_{u}_{u}

Insert positive edges

If _{pos}_{pos}

**Inserting positive and negative edges.** Here genotypes of members are displayed.

Insert negative edges

We also consider a trio with two parents, _{neg}_{neg}

This phenomenon can be explained as follows. If there is no recombination and

Figure

Process unlabeled members

So far, we have processed labeled members. Now we process an unlabeled member

If _{u}_{u}_{u}_{u}

Because we use unlabeled child members only to insert negative edges, and there is no way to detect haplotype shuffling in unlabeled parental members, we consider only labeled members from now on. Once labeled members are resolved, we can resolve unlabeled members accordingly.

Pedigree graph

Pedigree _{pos}_{neg}_{pos}_{neg}

**Observation 1.**

There are

Except for external parents, a member has two positive edges linking it to two parents. Therefore, the number of edges in the graph is linear in the number of child members. If a member is an unlabeled member, the two positive edges linking two parents and the child are replaced by a negative edge between the two parents. Thus the number of edges in the pedigree graph is

The 2-site-MRHC_{k}_{red}_{green}_{red}_{green}

Given a pedigree graph, any two adjacent members linked by a positive edge should be in the same partition, and any two adjacent members linked by a negative edge should be in different partitions. Any edge whose constraint is not satisfied represents a recombination event between the two adjacent members, or, in the case of a negative edge having endpoints in the same partition, between one parent and the child. Equation 1 thus counts the number of recombination events in the whole pedigree and ensures that it is at most $k$.
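The counting described above can be sketched as follows. The edge encoding (a triple with sign +1 or -1), the use of 0 and 1 for the two partitions, and the name `count_unsatisfied` are assumptions made for this illustration; the trio is a made-up example, not data from the paper.

```python
from typing import Dict, Hashable, Iterable, Tuple

SignedEdge = Tuple[Hashable, Hashable, int]  # (u, v, sign): +1 positive, -1 negative


def count_unsatisfied(edges: Iterable[SignedEdge], side: Dict[Hashable, int]) -> int:
    """Count edges whose constraint is broken by the partition `side`.

    A positive edge is broken when its endpoints are in different partitions;
    a negative edge is broken when its endpoints are in the same partition.
    Each broken edge corresponds to one recombination event.
    """
    unsatisfied = 0
    for u, v, sign in edges:
        same_side = side[u] == side[v]
        if sign > 0 and not same_side:
            unsatisfied += 1
        elif sign < 0 and same_side:
            unsatisfied += 1
    return unsatisfied


# Hypothetical trio: two positive parent-child edges and one negative edge
# between the parents. No partition satisfies all three constraints.
edges = [("father", "child", +1), ("mother", "child", +1), ("father", "mother", -1)]
side = {"father": 0, "mother": 1, "child": 0}
print(count_unsatisfied(edges, side))
```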

Signed graph

A graph _{pos}_{neg}_{pos}_{neg}_{1}, _{2}) be a partition of _{1} and _{2}. The _{1}, _{2}) is defined as:

The line index of graph G is defined as:

The corresponding decision version of finding the line index of graph

**LineIndex**_{k}

Clearly, the 2-site-MRHC_{k}_{k}_{k}

Data reduction rules

Given the signed graph

Let

**Observation 1**_{pos}_{neg}_{neg}_{k}.

This observation is true based on Equation 1.

**Observation 2**

If we label

**Observation 3**

If

**Reduction rule 1**_{k} if and only if

**Proof 1**

**Reduction rule 2 **_{k} if and only if

**Proof 2**

**Reduction rule 3**_{neg}_{k} if and only if_{neg}

**Proof 3** T

**Reduction rule 4**_{pos}_{k} if and only if_{pos}

**Proof 4**

**Reduction rule 5 **_{pos}_{k} if and only if_{pos}

**Proof 5**

**Reduction rule 6**_{neg}_{k} if and only if_{neg}

**Proof 6 **

**Reduction rule 7**

1.

(a) If _{k}

**Proof 7**

(b) If _{k}

**Proof 8 **

(c) If only _{k}

**Proof 9**_{pos}_{pos}

(d) If neither _{k} if and only if (

**Proof 10 **_{pos}_{pos}

2.

(a) If _{k}

**Proof 11**

(b) If _{k}

**Proof 12** We _{neg}_{neg}

(c) If only _{neg}_{k}

**Proof 13**_{neg}

(d) If neither _{k}_{pos}

**Proof 14 **_{pos}_{pos}_{neg}_{neg}

3. Either

(a) If _{k}

**Proof 15**_{pos}_{neg}

(b) If _{k}

**Proof 16**_{neg}

(c) If only

i. If _{k}

**Proof 17 **_{pos}_{neg}_{neg}

ii. If _{k}

**Proof 18**

(d) If neither _{k}

**Proof 19 **_{pos}_{neg}_{neg}

There is no grey vertex with degree less than three in the graph once these data reduction rules are applied. Vertices with high degrees will likely be eliminated. Therefore, our data reduction rules will be very useful for various types of pedigrees, such as pedigrees containing many members with no children, pedigrees with small families, or pedigrees with very big families.

We performed experiments to see how effectively the data reduction rules reduce the sizes of pedigree graphs. We generated 20 random and highly complex pedigrees with many cycles based on the method presented in. Each member can have many spouses, and some of its spouses can be its children or grandchildren. These pedigree structures may not be common for humans but are easily found in other species such as goats, fish, and horses. The number of members in a pedigree varies from 1000 to 10000; each member has two sites. From these pedigree structures and their genotype data, we constructed initial pedigree graphs with the numbers of vertices varying from 496 to 5021, positive edges from 350 to 3991, and negative edges from 90 to 936. The following table shows the sizes of the pedigree graphs before and after the data reduction rules are used.

Pedigree graphs before and after data reduction rules are used.

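| # of members | # of vertices in initial graph | # of positive edges in initial graph | # of negative edges in initial graph | # of vertices in reduced graph | # of positive edges in reduced graph | # of negative edges in reduced graph |
|---:|---:|---:|---:|---:|---:|---:|
| 1000 | 496 | 350 | 90 | 5 | 1 | 3 |
| 1500 | 769 | 509 | 100 | 6 | 5 | 1 |
| 2000 | 1022 | 685 | 186 | 5 | 2 | 2 |
| 2500 | 1248 | 771 | 167 | 6 | 5 | 1 |
| 3000 | 1553 | 1049 | 202 | 6 | 5 | 1 |
| 3500 | 1752 | 1230 | 268 | 18 | 11 | 5 |
| 4000 | 1991 | 1395 | 339 | 7 | 3 | 3 |
| 4500 | 2216 | 1655 | 415 | 7 | 4 | 2 |
| 5000 | 2535 | 2006 | 440 | 10 | 4 | 5 |
| 5500 | 2815 | 2371 | 556 | 11 | 9 | 1 |
| 6000 | 2037 | 1953 | 451 | 17 | 11 | 5 |
| 6500 | 3242 | 2142 | 484 | 14 | 11 | 2 |
| 7000 | 3486 | 2199 | 484 | 10 | 3 | 5 |
| 7500 | 3662 | 2668 | 661 | 16 | 9 | 5 |
| 8000 | 4972 | 3052 | 764 | 5 | 1 | 3 |
| 8500 | 4143 | 2464 | 506 | 7 | 7 | 1 |
| 9000 | 4444 | 3088 | 781 | 20 | 10 | 9 |
| 9500 | 4735 | 3365 | 867 | 14 | 6 | 6 |
| 10000 | 5021 | 3991 | 936 | 18 | 8 | 9 |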

Fixed-parameter algorithm

An NP-hard problem cannot be solved by a polynomial-time algorithm unless P=NP. However, if we can restrict some parameters of the problem to small values, the running time of an algorithm for the problem can potentially be greatly reduced.

**Definition 1**

Practically, the parameter is a nonnegative integer or a set of nonnegative integers and therefore

**Definition 2**^{O}^{(1)}

A comprehensive survey of FPT problems can be found in

Transforming to bipartization by edge removal problem

We review an important property of a signed graph given by

**Theorem 1**

**Proof 20**_{1}, _{2}) _{1}, _{2}) =

_{1}, _{2})

Based on this property, the pedigree graph is transformed into a new graph by replacing every positive edge by two consecutive negative edges and adding new intermediate vertices. We obtain a new weighted graph
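The transformation can be sketched and checked on a small example. This is an illustration under assumptions: edges are triples with sign +1 or -1, the intermediate vertices are given the hypothetical names `("mid", i)`, and the exhaustive search in `min_unsatisfied` is only a verification aid for tiny graphs, not the algorithm used in the paper.

```python
from itertools import product
from typing import Hashable, List, Tuple

SignedEdge = Tuple[Hashable, Hashable, int]  # (u, v, sign): +1 positive, -1 negative


def to_negative_only(edges: List[SignedEdge]) -> List[SignedEdge]:
    """Replace each positive edge (u, v) by two consecutive negative edges
    (u, w) and (w, v) through a new intermediate vertex w."""
    out: List[SignedEdge] = []
    for i, (u, v, sign) in enumerate(edges):
        if sign > 0:
            mid = ("mid", i)  # hypothetical label for the new vertex
            out.append((u, mid, -1))
            out.append((mid, v, -1))
        else:
            out.append((u, v, -1))
    return out


def min_unsatisfied(edges: List[SignedEdge]) -> int:
    """Brute-force minimum number of unsatisfied edges over all 2-partitions."""
    vertices = sorted({x for u, v, _ in edges for x in (u, v)}, key=repr)
    best = len(edges)
    for bits in product((0, 1), repeat=len(vertices)):
        side = dict(zip(vertices, bits))
        # An edge is unsatisfied when "same side" disagrees with "positive".
        bad = sum((side[u] == side[v]) != (sign > 0) for u, v, sign in edges)
        best = min(best, bad)
    return best


signed = [("a", "b", +1), ("b", "c", +1), ("a", "c", -1),
          ("c", "d", -1), ("a", "d", -1)]
negative_only = to_negative_only(signed)
print(min_unsatisfied(signed), min_unsatisfied(negative_only))
```

In a graph with only negative edges, a partition with no unsatisfied edge is exactly a proper 2-colouring, so the minimum computed on `negative_only` is the number of edges that must be removed to make that graph bipartite.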

This equation is to ensure that the total number of edges within _{1} and edges within _{2} is at most

To make the GBER algorithm

**Definition 3**

Bipartization by Edge Removal is a classical NP-hard problem and is in FPT.

FPT Algorithm for bipartization by edge removal

One efficient technique to tackle an FPT problem is ^{O}^{(1)}), the overall running time will be ^{O}^{(1)})

Iterative compression technique is used by Guo et al. ^{k}^{2}), where

Given a graph _{1}, …, _{m}}. Let _{i}_{1}, …, _{i}_{1} is empty. If i > 1, let _{i}_{1}, …, _{i}_{i}_{+1} = _{1}, …, _{i}_{+1}]. If _{i}_{+1} then _{i}_{+1}} is clearly an optimal edge bipartization set for _{i}_{+1}. From the edge bipartization set _{Φ} be Φ^{–1}(_{Φ} be Φ^{–1}(^{k}_{Φ} and _{Φ}
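The outer loop of this edge-by-edge scheme can be sketched as below. One important assumption: the compression step here is replaced by an exhaustive search (`compress_bruteforce`, a hypothetical stand-in) that ignores the previous solution, so the sketch shows only the iteration structure and not the mincut-based compression or its running time. All function names are illustrative.

```python
from itertools import combinations
from typing import Dict, Hashable, List, Optional, Sequence, Tuple

Edge = Tuple[Hashable, Hashable]


def is_bipartite(edges: Sequence[Edge]) -> bool:
    """2-colour the graph by depth-first search; fail on any odd cycle."""
    adj: Dict[Hashable, List[Hashable]] = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    color: Dict[Hashable, int] = {}
    for start in adj:
        if start in color:
            continue
        color[start] = 0
        stack = [start]
        while stack:
            x = stack.pop()
            for y in adj[x]:
                if y not in color:
                    color[y] = 1 - color[x]
                    stack.append(y)
                elif color[y] == color[x]:
                    return False
    return True


def compress_bruteforce(edges: Sequence[Edge], k: int) -> Optional[List[Edge]]:
    """Stand-in for the compression step: exhaustive search, NOT the mincut method."""
    for size in range(k + 1):
        for removed in combinations(range(len(edges)), size):
            kept = [e for i, e in enumerate(edges) if i not in removed]
            if is_bipartite(kept):
                return [edges[i] for i in removed]
    return None


def edge_bipartization(edges: Sequence[Edge], k: int) -> Optional[List[Edge]]:
    """Return at most k edges whose removal leaves a bipartite graph, or None."""
    current: List[Edge] = []
    solution: List[Edge] = []  # bipartization set for the empty graph
    for e in edges:
        current.append(e)
        solution = solution + [e]  # still valid for `current`, size at most k + 1
        if len(solution) > k:
            smaller = compress_bruteforce(current, k)
            if smaller is None:
                return None  # a subgraph already needs more than k removals
            solution = smaller
    return solution


triangle_plus = [(1, 2), (2, 3), (1, 3), (3, 4)]
print(edge_bipartization(triangle_plus, 1))
print(edge_bipartization(triangle_plus, 0))
```

Returning `None` as soon as one prefix graph fails is sound because every prefix is a subgraph of the full graph: if a subgraph needs more than k edge removals, so does the whole graph.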

**Theorem 2**

(

(_{Φ}=Φ^{–1}(_{Φ}=Φ^{–1}(

Consider a graph

**Compression step.** The edge bipartization set is compressed by finding a mincut.

**Theorem 3**_{k} problem is solvable in O^{k}^{2})

**Proof 21**^{k}^{2}). ^{k}^{2})

Extensions to pedigrees with more than two sites

Our method can be extended to work with pedigrees with more than two sites. In order to detect a recombination event in a member, it is necessary to have at least two heterozygous sites; one on each side of the recombination breakpoint. For example, we cannot detect a recombination between sites 1 and 2 of member u in Figure _{ij}_{13} between site 1 and site 3, and a grey vertex _{34} between site 3 and site 4 of member _{ij}_{ij}_{34} and vertex _{34} in Figure _{ij}_{ij}

**Graphs from a pedigree with multiple sites.** Additional vertices are needed in order to capture the relationships between multiple pairs of sites in adjacent members of the pedigree.

The main difference between a pedigree with two sites and a pedigree with multiple sites is that besides vertices and edges created between closest heterozygous sites and closest homozygous sites, we may need to create additional vertices and edges for pedigrees with multiple sites to capture all constraints in a pedigree. For example, Figure _{13} is created in _{13} and _{13} to capture constraint between sites 1 and 3 of ^{2}), where

Additional vertices and edges are created in a member by the need of its adjacent members. They actually represent overlapped information. For example, vertex _{13} can be represented by vertices _{12} and _{23}. Thus when we solve the pedigree graph, we have to ensure that vertices are resolved consistently. For example, if vertices _{12} and _{23} are later resolved green and vertex _{13} is resolved red, there is a parity conflict. The reason is that _{c}_{c}_{c}_{12} and _{23}. However, _{c}_{c}_{13}. Therefore, a fixed-parameter algorithm for general pedigrees with multiple sites needs to ensure information consistency. We will investigate this problem in our future work.

Conclusion

We have shown that the MRHC problem for general pedigrees with two sites can be reduced to the line index of a signed graph, and the line index of a signed graph can, in turn, be reduced to the Bipartization by Edge Removal problem. Therefore we can solve the MRHC problem for general pedigrees with two sites with an $O(2^{k} \cdot n^{2})$ fixed-parameter algorithm. Future work will extend the current method to deal with genetic data with more than two sites.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

DDD designed the algorithm and drafted the manuscript. PAE supervised the research, assisted in crafting the algorithm and polished the manuscript. Both authors read and approved the final manuscript.

Acknowledgements

This research was funded by the Natural Sciences and Engineering Research Council of Canada through Discovery Grant 204923 to P.A. Evans.

This article has been published as part of