Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong

Abstract

Background

Single nucleotide polymorphisms (SNPs) are the most common form of genetic variation in human DNA. The sequence of SNPs in each of the two copies of a given chromosome in a diploid organism is referred to as a haplotype. Haplotype information has many applications, such as the diagnosis of genetic diseases and drug design. The haplotype assembly problem is defined as follows: given a set of fragments sequenced from the two copies of a chromosome of a single individual, together with their locations in the chromosome (which can be pre-determined by aligning the fragments to a reference DNA sequence), reconstruct the two haplotypes (h_{1}, h_{2}) from the input fragments. Existing algorithms do not work well when the error rate of the fragments is high. Here we design an algorithm that gives accurate solutions even when the error rate of the fragments is high.

Results

We first give a dynamic programming algorithm that can give exact solutions to the haplotype assembly problem. The time complexity of the algorithm is ^{t }

Conclusions

We have tested our algorithm on a set of benchmark datasets. Experiments show that our algorithm can give very accurate solutions. It outperforms most of the existing programs when the error rate of the input fragments is high.

Background

The recognition of genetic variations is an important topic in bioinformatics. Single nucleotide polymorphisms (SNPs) are the most common form of genetic variation in human DNA. Humans are diploid organisms. There are two copies of each chromosome (except the sex chromosomes), one from each parent. The sequence of SNPs in a given chromosome copy is referred to as a haplotype.

Currently, computational methods for computing haplotypes often fall into two categories:

The haplotype assembly problem was first introduced by Lancia _{1}, _{2}). The haplotype assembly problem with MEC is NP-hard

Levy

Recently, He ^{k }

In this paper, we propose a heuristic algorithm for the haplotype assembly problem with MEC. It is worth mentioning that in HapCUT

Preliminaries

The input to the haplotype assembly problem is a set of fragments sequenced from the two copies of a chromosome of a single individual. Each fragment covers some SNP sites. We assume that all the fragments have been pre-aligned to a reference DNA sequence. As a result, we can organize the input fragments as an

It is accepted that there are at most two distinct nucleotides at a SNP site. We assume that a column with more than two distinct nucleotides in

**Illustration of the preprocessing on the input fragment matrix**. (a) The original fragment matrix

We say that row _{i}_{,j }is not a '-' or there are two integers _{i}_{,p }≠ - and _{i}_{,q }= -. The number of rows covering column

Two rows p and q conflict with each other if there is a column j such that M_{p,j} ≠ M_{q,j}, M_{p,j} ≠ - and M_{q,j} ≠ -, where M denotes the fragment matrix. Obviously, for error-free data, two rows from the same copy of a chromosome should not conflict with each other, and two rows which conflict with each other must come from different copies of a chromosome. The distance between two rows

where

Minimum error correction (MEC) is a commonly used model for the haplotype assembly problem. For the haplotype assembly problem with MEC, the input is a fragment matrix, and the task is to partition its rows into two groups and construct two haplotypes (h_{1}, h_{2}), one from each group, such that the total number of conflicts (errors) between the fragments and the constructed haplotypes (h_{1}, h_{2}) is minimized.
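To make the MEC objective concrete, the following minimal sketch computes the cost of one fixed bipartition of the fragments. The function name `mec_cost`, the encoding of rows as strings over '0', '1' and '-', and the `groups` argument are assumptions made for this illustration; they are not notation from the paper.

```python
def mec_cost(rows, groups):
    """Return (cost, haplotypes) for a fixed bipartition of the fragments.

    rows   : list of equal-length strings over '0', '1', '-' ('-' = no value)
    groups : list of 0/1 labels, groups[i] is the group of rows[i]
    """
    n = len(rows[0])
    cost = 0
    haplotypes = []
    for g in (0, 1):
        hap = []
        for j in range(n):
            zeros = sum(1 for r, lab in zip(rows, groups) if lab == g and r[j] == "0")
            ones = sum(1 for r, lab in zip(rows, groups) if lab == g and r[j] == "1")
            if zeros + ones == 0:
                hap.append("-")  # no fragment in this group has a value here
            else:
                # The majority value minimizes the corrections in this column.
                hap.append("0" if zeros >= ones else "1")
            cost += min(zeros, ones)
        haplotypes.append("".join(hap))
    return cost, haplotypes


rows = ["00-", "11-", "-01", "-10", "010"]
print(mec_cost(rows, [0, 1, 0, 1, 1]))  # prints (1, ['001', '010'])
```

The MEC problem asks for the bipartition that minimizes this cost; the sketch only evaluates a given one.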

Methods

In this section, we will describe the algorithms used to solve the problem. We first design a dynamic programming algorithm that gives an exact solution and runs in ^{t }

A dynamic programming algorithm

Recall that the goal of the haplotype assembly problem is to partition the rows of the input fragment matrix ^{m }^{t }

Before we give the details of the dynamic programming algorithm, we first define some basic notations that will be used later:

•_{i}

•_{j}_{i}

•_{j}_{i }_{i}_{+1}.

•_{j}_{Ri}_{∩Ri+1}: the partition on _{i }_{i}_{+1 }obtained from _{j}_{i }_{i}_{+1}.

•_{j}_{k}_{k}_{Ri}_{∩Ri+1 }= _{j}

•_{j}_{i }_{j}

•_{j}_{j}

In order to compute _{j}

Let _{j}_{i}_{+1}, _{k}_{j}_{Ri}_{∩Ri+1}. The recursion formula of the dynamic programming algorithm is illustrated as follows:

Based on _{j}_{k}_{j}_{j}

The optimal MEC cost for partitioning all the rows of _{j}_{j}

Let us look at the time complexity of the dynamic programming algorithm. To compute each _{j}_{j}^{t }_{j}_{k}_{j}_{i}_{k}_{j}_{Ri}_{∩Ri+1 }in _{k}_{k}_{j}^{t }_{j}_{i}^{t}_{k}_{i}_{k}^{t }

**Theorem 1 **^{t }
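The column-by-column idea behind such a dynamic program can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes that a row covers every column between its first and last non-gap entry, it returns only the optimal cost (no traceback of the haplotypes), and the names `mec_dp`, `covering` and `column_cost` are hypothetical.

```python
from itertools import product


def mec_dp(rows):
    """Optimal MEC cost by dynamic programming over columns (sketch).

    rows: list of equal-length strings over '0', '1', '-'.
    """
    n = len(rows[0])
    spans = []
    for r in rows:
        cols = [j for j, ch in enumerate(r) if ch != "-"]
        spans.append((cols[0], cols[-1]) if cols else (1, 0))  # (1, 0) = empty span

    def covering(j):
        return [i for i, (a, b) in enumerate(spans) if a <= j <= b]

    def column_cost(j, assign):
        # Corrections needed in column j: minority count within each group.
        cost = 0
        for g in (0, 1):
            zeros = sum(1 for i, lab in assign.items() if lab == g and rows[i][j] == "0")
            ones = sum(1 for i, lab in assign.items() if lab == g and rows[i][j] == "1")
            cost += min(zeros, ones)
        return cost

    # prev maps a partition of the rows covering the previous column
    # to the best cost of all columns processed so far.
    prev_rows, prev = [], {(): 0}
    for j in range(n):
        cur_rows = covering(j)
        shared = [i for i in cur_rows if i in prev_rows]
        # Best previous cost for each partition restricted to the shared rows.
        best = {}
        for labels, cost in prev.items():
            lab = dict(zip(prev_rows, labels))
            key = tuple(lab[i] for i in shared)
            best[key] = min(best.get(key, float("inf")), cost)
        cur = {}
        for labels in product((0, 1), repeat=len(cur_rows)):
            assign = dict(zip(cur_rows, labels))
            key = tuple(assign[i] for i in shared)
            cur[labels] = best[key] + column_cost(j, assign)
        prev_rows, prev = cur_rows, cur
    return min(prev.values())


print(mec_dp(["00-", "11-", "-01", "-10", "010"]))  # prints 1
```

The number of states kept per column is two to the power of that column's coverage, so a sketch of this kind is only practical when the coverage is small.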

Obtaining an initial solution via randomized sampling

The dynamic programming algorithm works well when

The detailed procedure for obtaining a submatrix from

1. Compute the coverage _{i }

2. For

3. If _{i }

4. Randomly choose

5. For each row _{j }

By employing this randomized sampling strategy, we can always make sure that the maximum coverage is bounded by the threshold
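One way to realize such a sampling step is sketched below, under the assumption that excess rows are removed uniformly at random, column by column from left to right. The function name `sample_rows`, the `bound` and `seed` parameters, and the span-based notion of coverage (a row covers every column between its first and last non-gap entry) are assumptions of this example.

```python
import random


def sample_rows(rows, bound, seed=None):
    """Randomly drop rows until no column is covered by more than `bound` rows.

    rows: list of equal-length strings over '0', '1', '-'.
    Returns the sorted indices of the rows that are kept
    (rows consisting only of '-' are never kept).
    """
    rng = random.Random(seed)
    n = len(rows[0])
    spans = {}
    for i, r in enumerate(rows):
        cols = [j for j, ch in enumerate(r) if ch != "-"]
        if cols:
            spans[i] = (cols[0], cols[-1])
    kept = set(spans)
    for j in range(n):
        covering = sorted(i for i in kept if spans[i][0] <= j <= spans[i][1])
        excess = len(covering) - bound
        if excess > 0:
            # Remove randomly chosen rows so that column j meets the bound.
            kept -= set(rng.sample(covering, excess))
    return sorted(kept)
```

Because rows are only ever removed, a column that already satisfies the bound cannot violate it later, so one left-to-right pass is enough in this sketch.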

Refining the initial solution with all fragments

In the newly obtained submatrix, it is possible that (1) some columns are not covered by any rows, thus leaving the haplotype values at these SNP sites undetermined in the initial solution, (2) the haplotype values at some SNP sites in the initial solution are wrongly determined due to the lack of sufficient information sampled from

The refining procedure contains several iterations. In each iteration, we take two haplotypes as its input and output a new pair of haplotypes. Initially, the two haplotypes in the initial solution are used as the input to the first iteration. The haplotypes output in an iteration are then used as the input to the subsequent iteration. In each iteration, we try to reassign the rows of

The refining procedure stops when, at the end of some iteration, the obtained haplotypes no longer change, or when a certain number of iterations have been finished. The two haplotypes output in the last iteration are the output of the refining procedure.
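The iteration described above can be sketched as follows. The sketch assumes that each row is reassigned to the haplotype it disagrees with least and that each new haplotype is the column-wise majority of its rows; the names `distance`, `consensus`, `refine` and the `max_iter` parameter are hypothetical.

```python
def distance(row, hap):
    """Number of positions where both strings have a value and differ."""
    return sum(1 for a, b in zip(row, hap) if a != "-" and b != "-" and a != b)


def consensus(rows, n):
    """Column-wise majority over rows; '-' where no row has a value."""
    hap = []
    for j in range(n):
        zeros = sum(1 for r in rows if r[j] == "0")
        ones = sum(1 for r in rows if r[j] == "1")
        hap.append("-" if zeros + ones == 0 else ("0" if zeros >= ones else "1"))
    return "".join(hap)


def refine(rows, h1, h2, max_iter=20):
    """Iteratively reassign all rows and rebuild the two haplotypes."""
    n = len(rows[0])
    for _ in range(max_iter):
        g1 = [r for r in rows if distance(r, h1) <= distance(r, h2)]
        g2 = [r for r in rows if distance(r, h1) > distance(r, h2)]
        new1, new2 = consensus(g1, n), consensus(g2, n)
        if (new1, new2) == (h1, h2):
            break  # haplotypes no longer change
        h1, h2 = new1, new2
    return h1, h2


rows = ["000-", "-000", "111-", "-111", "0100"]
print(refine(rows, "00--", "11--"))  # prints ('0000', '1111')
```

In this example the initial haplotypes are undetermined at the last two sites, and the refinement fills them in from the full set of rows.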

Voting procedure

To further reduce the effect of randomness caused by the randomized sampling process, we try to obtain several different submatrices from

In the voting procedure, the two haplotypes are computed separately. We next see how to compute one of the two haplotypes. The other case is similar. Let _{1}), one from each solution in _{1 }all correspond to the same copy of a chromosome. With _{1}, we can then compute a haplotype by majority rule. Simply speaking, at each SNP site, we count the number of 0s and 1s at the given SNP site over the haplotypes in _{1}. If we have more 0s, the resulting haplotype takes 0 at the SNP site, otherwise, it takes 1.

How to find _{1}? First, we need to clarify that the two haplotypes in each solution in _{1}, _{2}), we do not know which chromosome copy _{1 }(or _{2}) corresponds to. So, we should first find the correspondence between the haplotypes in different solutions. Let _{1 }is the smallest among all the _{1 }as our reference and try to find the correspondence between haplotypes in _{1 }and other solutions. For each _{1 }we want to find.

Assume that at the beginning of this procedure, we obtain
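A minimal sketch of the voting step is given below. It assumes that each solution is a pair of fully determined 0/1 strings and that the caller picks which solution serves as the reference; the names `vote`, `hamming` and the `reference` parameter are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def vote(solutions, reference=0):
    """Combine several (h1, h2) pairs into one pair by per-site majority.

    solutions: list of (h1, h2) pairs of equal-length strings over '0'/'1'.
    reference: index of the solution used to orient the other pairs.
    """
    ref1, ref2 = solutions[reference]
    first, second = [], []
    for a, b in solutions:
        # Swap the pair if that matches the reference orientation better.
        if hamming(a, ref1) + hamming(b, ref2) > hamming(a, ref2) + hamming(b, ref1):
            a, b = b, a
        first.append(a)
        second.append(b)

    def majority(haps):
        out = []
        for column in zip(*haps):
            # More 0s gives 0; otherwise the site takes 1.
            out.append("0" if column.count("0") > column.count("1") else "1")
        return "".join(out)

    return majority(first), majority(second)


solutions = [("0011", "1100"), ("1100", "0011"), ("0111", "1000")]
print(vote(solutions))  # prints ('0011', '1100')
```

The second solution is stored in the opposite order and is swapped before voting, which is the correspondence problem described above.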

Summarization of the algorithm

Generally speaking, given an input fragment matrix

Step 1: We first perform a preprocessing on

Step 2: We compute an initial solution by running the dynamic programming algorithm on a subset of

Step 3: Refine the initial solution with all the fragments in

Step 4: To further reduce the effect of randomness caused by the randomized sampling process, we repeat Step 2 and Step 3 several times. Each repetition ends with a solution, and from these solutions we then compute a single pair of haplotypes by the voting procedure. The resulting pair of haplotypes is the output of our algorithm.

Results

We have tested our algorithm on a set of benchmark datasets and compared its performance with that of several other algorithms. The main purpose is to evaluate how accurately our algorithm can reconstruct haplotypes from input fragments. All tests were run on a Windows XP (32-bit) desktop PC with a 3.16 GHz CPU and 4 GB of RAM.

The benchmark we use was created by Geraci in

Throughout our experiments, we measure the performance of our algorithm by the

where (h_{1}, h_{2}) is the pair of correct haplotypes that is used to generate the problem instance, and is thus known a priori,

where

Intuitively speaking, the reconstruction rate measures the ability of an algorithm to reconstruct the correct haplotypes.
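As an illustration, the sketch below computes a reconstruction rate under the assumption that it follows the definition commonly used with this kind of benchmark: one minus the smaller of the two possible total Hamming distances between the correct and the reconstructed pair, divided by twice the haplotype length. The function names are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def reconstruction_rate(true_pair, found_pair):
    """Fraction of SNP sites recovered, under the better of the two pairings."""
    (h1, h2), (g1, g2) = true_pair, found_pair
    n = len(h1)
    mismatches = min(hamming(h1, g1) + hamming(h2, g2),
                     hamming(h1, g2) + hamming(h2, g1))
    return 1.0 - mismatches / (2.0 * n)


print(reconstruction_rate(("0011", "1100"), ("0111", "1100")))  # prints 0.875
```

Taking the minimum over both pairings reflects the fact that the order of the two reconstructed haplotypes carries no meaning.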

Recall that in Step 2 of our algorithm, we try to compute an initial solution by using only a subset of the input matrix. The initial solution forms the basis for the following steps of our algorithm and is closely related to the parameter

Evaluation of how the size of boundOfCoverage affects the initial solution.

| boundOfCoverage | c = 3 | c = 5 | c = 8 | c = 10 |
| --- | --- | --- | --- | --- |
| 10 | 0.708( | 0.753( | 0.764(0.14) | 0.774(0.18) |
| 12 | 0.728( | 0.785( | 0.794(0.15) | 0.797(0.21) |
| 15 | 0.776(0.30) | 0.837(0.33) | 0.841(0.36) | 0.857(0.45) |

There are 3 different sizes of

From Table

Next, to evaluate the performance of our algorithm, we have tested it on the set of benchmark datasets. The parameters we use are as follows:

Comparisons of the algorithms when

| e | c | SpeedHap | Fast Hare | 2d-mec | HapCUT | MLF | SHR-three | DGS | Ours |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0.0 | 3 | 0.999 | 0.999 | 0.990 | **1.000** | 0.973 | 0.816 | **1.000** | 1.000 |
| | 5 | **1.000** | 0.999 | 0.997 | **1.000** | 0.992 | 0.861 | **1.000** | 1.000 |
| | 8 | **1.000** | **1.000** | **1.000** | **1.000** | 0.997 | 0.912 | **1.000** | 1.000 |
| | 10 | **1.000** | **1.000** | **1.000** | **1.000** | 0.998 | 0.944 | **1.000** | 1.000 |
| 0.1 | 3 | 0.895 | 0.919 | 0.912 | 0.929 | 0.889 | 0.696 | **0.930** | 0.973 |
| | 5 | 0.967 | 0.965 | 0.951 | 0.920 | 0.970 | 0.738 | **0.985** | 0.996 |
| | 8 | 0.989 | **0.993** | 0.983 | 0.901 | 0.985 | 0.758 | 0.989 | 0.999 |
| | 10 | 0.990 | **0.998** | 0.988 | 0.892 | 0.995 | 0.762 | 0.997 | 1.000 |
| 0.2 | 3 | 0.623 | 0.715 | 0.738 | **0.782** | 0.725 | 0.615 | 0.725 | 0.903 |
| | 5 | 0.799 | 0.797 | 0.793 | **0.838** | 0.836 | 0.655 | 0.813 | 0.963 |
| | 8 | 0.852 | 0.881 | 0.873 | 0.864 | **0.918** | 0.681 | 0.878 | 0.990 |
| | 10 | 0.865 | 0.915 | 0.894 | 0.871 | **0.938** | 0.699 | 0.917 | 0.996 |
| 0.3 | 3 | 0.480 | 0.617 | **0.623** | 0.602 | 0.618 | 0.557 | 0.611 | 0.776 |
| | 5 | 0.637 | 0.639 | 0.640 | 0.629 | **0.653** | 0.599 | 0.647 | 0.874 |
| | 8 | 0.667 | 0.661 | 0.675 | 0.673 | **0.697** | 0.632 | 0.663 | 0.950 |
| | 10 | 0.676 | 0.675 | 0.678 | 0.709 | **0.715** | 0.632 | 0.688 | 0.972 |

The columns

Comparisons of the algorithms when

| e | c | SpeedHap | Fast Hare | 2d-mec | HapCUT | MLF | SHR-three | DGS | Ours |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0.0 | 3 | 0.999 | 0.990 | 0.965 | **1.000** | 0.864 | 0.830 | 0.999 | 1.000 |
| | 5 | **1.000** | 0.999 | 0.993 | **1.000** | 0.929 | 0.829 | **1.000** | 1.000 |
| | 8 | **1.000** | **1.000** | 0.998 | **1.000** | 0.969 | 0.895 | **1.000** | 1.000 |
| | 10 | **1.000** | 0.999 | 0.999 | **1.000** | 0.981 | 0.878 | **1.000** | 1.000 |
| 0.1 | 3 | 0.819 | 0.871 | 0.837 | **0.930** | 0.752 | 0.682 | 0.926 | 0.970 |
| | 5 | 0.959 | 0.945 | 0.913 | 0.913 | 0.858 | 0.724 | **0.978** | 0.993 |
| | 8 | 0.984 | 0.985 | 0.964 | 0.896 | 0.933 | 0.742 | **0.996** | 0.999 |
| | 10 | 0.984 | 0.995 | 0.978 | 0.888 | 0.962 | 0.728 | **0.998** | 1.000 |
| 0.2 | 3 | 0.439 | 0.684 | 0.675 | **0.771** | 0.642 | 0.591 | 0.691 | 0.877 |
| | 5 | 0.729 | 0.746 | 0.728 | **0.831** | 0.728 | 0.632 | 0.769 | 0.953 |
| | 8 | 0.825 | 0.853 | 0.791 | **0.862** | 0.798 | 0.670 | 0.842 | 0.988 |
| | 10 | 0.855 | 0.877 | 0.817 | 0.867 | 0.831 | 0.668 | **0.878** | 0.994 |
| 0.3 | 3 | 0.251 | 0.590 | **0.593** | 0.565 | 0.581 | 0.548 | 0.578 | 0.725 |
| | 5 | 0.578 | 0.602 | 0.606 | 0.582 | 0.606 | 0.557 | **0.609** | 0.833 |
| | 8 | 0.629 | 0.626 | 0.623 | 0.621 | **0.634** | 0.604 | 0.628 | 0.922 |
| | 10 | 0.638 | 0.644 | 0.634 | **0.664** | 0.641 | 0.619 | 0.641 | 0.951 |

The columns

Comparisons of the algorithms when

| e | c | SpeedHap | Fast Hare | 2d-mec | HapCUT | MLF | SHR-three | DGS | Ours |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0.0 | 3 | 0.999 | 0.988 | 0.946 | **1.000** | 0.787 | 0.781 | 0.999 | 0.997 |
| | 5 | **1.000** | 0.999 | 0.976 | **1.000** | 0.854 | 0.832 | **1.000** | 0.999 |
| | 8 | **1.000** | **1.000** | 0.992 | **1.000** | 0.919 | 0.868 | **1.000** | 1.000 |
| | 10 | **1.000** | 0.999 | 0.997 | **1.000** | 0.933 | 0.898 | **1.000** | 1.000 |
| 0.1 | 3 | 0.705 | 0.829 | 0.786 | 0.927 | 0.698 | 0.668 | **0.931** | 0.951 |
| | 5 | 0.947 | 0.949 | 0.880 | 0.916 | 0.809 | 0.716 | **0.977** | 0.989 |
| | 8 | 0.985 | 0.986 | 0.948 | 0.896 | 0.863 | 0.743 | **0.987** | 0.997 |
| | 10 | 0.986 | 0.995 | 0.965 | 0.889 | 0.884 | 0.726 | **0.997** | 0.998 |
| 0.2 | 3 | 0.199 | 0.652 | 0.647 | **0.753** | 0.624 | 0.591 | 0.669 | 0.837 |
| | 5 | 0.681 | 0.712 | 0.697 | **0.825** | 0.682 | 0.617 | 0.741 | 0.927 |
| | 8 | 0.801 | 0.808 | 0.751 | **0.856** | 0.747 | 0.653 | 0.818 | 0.974 |
| | 10 | 0.813 | **0.872** | 0.778 | 0.861 | 0.765 | 0.675 | 0.861 | 0.982 |
| 0.3 | 3 | 0.095 | 0.581 | **0.583** | 0.552 | 0.570 | 0.536 | 0.573 | 0.676 |
| | 5 | 0.523 | 0.591 | **0.596** | 0.555 | 0.594 | 0.562 | 0.595 | 0.777 |
| | 8 | **0.616** | 0.615 | 0.613 | 0.597 | 0.614 | 0.611 | 0.614 | 0.876 |
| | 10 | 0.627 | 0.616 | 0.622 | **0.645** | 0.625 | 0.625 | 0.622 | 0.909 |

The columns

Taking a close look at the three tables, we can see that (1) each of the seven algorithms studied in

Discussion

In the first step of our algorithm, we perform a preprocessing on the input fragment matrix. This allows us to detect errors in the input. For example, for the benchmark datasets with

Next, we further investigate how the voting procedure in Step 4 affects the performance of our algorithm. In Step 4, we first obtain

**Illustration of the effect of the voting procedure**. The reconstruction rates for the final version of our algorithm and for the version without the voting procedure are depicted by black and gray bars, respectively. The error rate of the benchmark used in (a) is 0.2, and in (b) it is 0.3.

From Figure

To see how the size of the parameter

**Evaluation of how the size of x affects the performance of our algorithm**. The reconstruction rates for

Conclusion

In this paper, we propose a heuristic algorithm for the haplotype assembly problem. Experiments show that our algorithm is highly accurate. It outperforms most of the existing programs when the error rate of input fragments is high.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

FD participated in the design of the study, performed the experiments and drafted the manuscript. WC participated in the design of the study and helped to draft the manuscript. LW conceived the study, participated in its design and helped to draft the manuscript. All authors read and approved the final manuscript.

Declarations

The publication costs for this article were funded by the corresponding author's institution.

Acknowledgements

The authors would like to thank Filippo Geraci for kindly providing us with the set of benchmark datasets. This work is fully supported by a grant from City University of Hong Kong (Project No. 7002728).

This article has been published as part of