The Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel

International Computer Science Institute, Berkeley, California, USA

Department of Molecular Microbiology and Biotechnology, Tel-Aviv University, Tel-Aviv, Israel

Abstract

Haplotype phasing is a well-studied problem in the context of genotype data. With the recent developments in high-throughput sequencing, new algorithms are needed for haplotype phasing when the number of samples sequenced is low and when the sequencing coverage is low. High-throughput sequencing technologies enable new possibilities for the inference of haplotypes. Since each read originates from a single chromosome, all the variant sites it covers must derive from the same haplotype. Moreover, the sequencing process yields a much higher SNP density than previous methods, resulting in a higher correlation between neighboring SNPs. We offer a new approach to haplotype phasing that leverages these two properties. Our suggested algorithm, called PPHS, reconstructs a perfect phylogeny tree within short genomic windows and uses it to phase the sequenced individuals.

Introduction

The etiology of complex diseases is composed of both environmental and genetic factors. In the last few decades, there has been a tremendous effort to discover the genetic component of the etiology of a large number of common traits, that is, to characterize the heritability of these traits. In recent years, much of this effort has been focused on genome-wide association studies (GWAS).

In recent years, genotyping technology, or the extraction of the SNP information from the genome, has been advancing rapidly. Only a few years ago, genome-wide association studies were simply infeasible. On the other hand, even the most modern genotyping technologies only provide a partial picture of the genome, since the number of positions measured with these technologies is typically less than a million, while there are more than three billion positions in the genome. Additionally, there are many different types of genetic variations that are not captured well by genotyping technologies, particularly rare SNPs and short deletions and insertions. For this reason, the next generation of genetic studies of diseases will surely include the new high-throughput sequencing technologies, or next-generation sequencing (NGS) platforms.

The technical analysis of disease association studies has encountered a few computational challenges, some of which will remain when considering NGS-based studies. One of the major obstacles in these studies has been the inference of haplotypes from the genotype data.

The phasing problem is of a different nature when applied to sequencing studies. First, unlike genotyping technologies, sequence data allows us to consider both SNPs and short structural variations (e.g., short deletions). Second, in sequencing technologies the measured SNPs are closer to each other than in genotyping, resulting in much higher LD, or correlation between neighboring SNPs. Third, the short reads obtained from the sequencing platform are always read from one chromosome, and some of these reads may contain more than a single SNP, suggesting that partial haplotypes are provided. Finally, the noise produced by NGS technologies is inherently different from that of genotyping technologies; the sequence reads contain substantially more errors than genotypes, especially towards the end of the reads, and the final errors made by the algorithms are highly dependent on specific parameters such as the coverage and the sequencing error rate.

Unlike the case of genotyping, sequencing allows the possibility of phasing a single individual (genotype data requires a population). Very recently, a few methods were suggested for this problem.

Methods

The basic assumption of our algorithm is that within a short region, the history of the genetic variants (SNPs or deletions) follows the perfect phylogeny model, i.e., there are no recurrent mutations or recombinations.

Modeling the sequencing procedure

The haplotypes themselves are not given by the sequencing procedure; instead, the sequencing procedure provides a large set of short reads, sampled from random positions in the genome and from a random copy of the chromosome (out of the two copies). The sequencing itself is a noisy procedure, which depends on two parameters: the coverage and the sequencing error rate.
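This generative model can be sketched in a few lines. This is a minimal sketch, not the authors' implementation: it assumes fixed-length reads, uniformly random start positions within the window, and independent allele flips at the given error rate.

```python
import random

def simulate_reads(hap1, hap2, coverage, error_rate, read_len=3):
    """Sketch of the read-generation model: each read starts at a random
    SNP position, comes from a random chromosome copy, and each covered
    allele is flipped independently with probability error_rate.
    Haplotypes are 0/1 lists over the SNPs of a window."""
    n = len(hap1)
    # number of reads chosen so each SNP is covered ~`coverage` times
    num_reads = max(1, round(coverage * n / read_len))
    reads = []
    for _ in range(num_reads):
        start = random.randrange(n - read_len + 1)
        source = random.choice((hap1, hap2))
        read = []
        for j in range(start, start + read_len):
            allele = source[j]
            if random.random() < error_rate:
                allele = 1 - allele  # sequencing error flips the allele
            read.append(allele)
        reads.append((start, read))
    return reads
```

With `error_rate=0`, every read is an exact substring of one of the two haplotypes, which is the noiseless limit of the model above.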

Problem statement

The input for our algorithm is the sequence data, i.e., the set of reads obtained from the sequencer, where each read is assumed to be generated by randomly picking a position in the genome, randomly picking one of the copies of the chromosome at that position, and adding noise according to the sequencing error rate.

Our algorithm aims to find a perfect phylogeny tree over the set of SNPs in a given window, together with a corresponding haplotype assignment for each individual. The haplotypes assigned to each individual must be taken from the tree, and the objective is to maximize the likelihood of the reads. We explain later how the perfect phylogeny assumption can be relaxed.

Tree reconstruction within a window

Within a window, the tree reconstruction algorithm works recursively. We first search for a SNP corresponding to an edge adjacent to the root of the tree, and use it to partition the haplotypes into two subtrees.

Throughout, we assume that the alleles of each SNP are represented by the {0, 1} notation, where 0 is the more common allele. Under the perfect phylogeny model, if the edge of SNP j lies above the edge of SNP j′ in the tree, the carriers of the minor allele of j′ form a subset of the carriers of the minor allele of j, and therefore f_j > f_{j′}, where f_j denotes the minor allele frequency of SNP j.

For a SNP j, we compute the likelihood of the alleles observed in the reads of individual i at that site, given each possible underlying genotype g.

We use an expectation-maximization (EM) algorithm to estimate the minor allele frequency f_j of each SNP.

The partitioning procedure

Now that we have chosen the root edge, we partition the haplotypes into two sets H_1 and H_2, where H_1 corresponds to the subtree of the root excluding the chosen edge, and H_2 corresponds to the subtree located below that edge. For each read, we compare the likelihood of H_1 versus the likelihood of H_2, and we assign the read to the more likely set.

This approach is highly efficient (the running time of each partition iteration is linear); however, the algorithm is highly sensitive to mistakes occurring early in the tree reconstruction process, and it does not take into account the overall multivariate relations between the SNPs.


Graph separation example

**Graph separation example**. One out of three different options for the tree, in which the second subtree is the child of the first.

For each of these subtrees, we estimate the frequencies of the haplotypes it induces.

In practice, we start the EM algorithm from an informed initial distribution, where s_0 is the SNP corresponding to the parent edge of the subtree.

The EM algorithm provides us with a set of haplotype frequencies.
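As an illustration of this step, here is a minimal EM sketch for haplotype frequencies. It is a hypothetical simplification: it scores haplotype pairs by compatibility with a known genotype vector (entries 0/1/2 counting minor alleles), whereas the algorithm described here would weigh pairs by read likelihoods instead.

```python
from itertools import combinations_with_replacement

def em_haplotype_freqs(genotypes, haplotypes, iters=50):
    """EM estimation of haplotype frequencies from unphased genotypes.
    genotypes: list of 0/1/2 vectors; haplotypes: list of 0/1 tuples."""
    H = len(haplotypes)
    freqs = [1.0 / H] * H  # start from the uniform distribution
    compatible = lambda a, b, g: all(x + y == gi for x, y, gi in zip(a, b, g))
    pairs = list(combinations_with_replacement(range(H), 2))
    for _ in range(iters):
        counts = [0.0] * H
        for g in genotypes:
            # E-step: posterior over compatible pairs ∝ f_a * f_b (×2 if a ≠ b)
            post = {}
            for a, b in pairs:
                if compatible(haplotypes[a], haplotypes[b], g):
                    post[(a, b)] = freqs[a] * freqs[b] * (1 if a == b else 2)
            z = sum(post.values())
            if z == 0:
                continue  # no candidate pair explains this genotype
            for (a, b), w in post.items():
                counts[a] += w / z
                counts[b] += w / z
        # M-step: renormalize the expected haplotype counts
        total = sum(counts)
        freqs = [c / total for c in counts]
    return freqs
```

On an ambiguous double heterozygote, the EM iterations shift mass toward the pair supported by the unambiguous individuals, which is the standard behavior of this estimator.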

We now define a set of four complete graphs G_{ab}, for a, b ∈ {0, 1}. The nodes of each G_{ab} are the SNPs of the window, and the weight of an edge (j_1, j_2) ∈ G_{ab} reflects the estimated frequency of haplotypes carrying allele a at j_1 and allele b at j_2.

Let (S_1, S_2) be a partition of the SNPs; such a partition induces a cut in each of the graphs G_{ab}.

The algorithm proceeds by searching for the partition that maximizes ∑_{a,b} w_{ab}(S_1, S_2), the total weight of the edges crossing the cut. In order to find the best partition, for low values of the window size we perform an exhaustive search over all partitions.
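For small windows, the exhaustive search can be sketched as follows. The `edge_weight` function is a placeholder for the w_{ab} weights defined above, and scoring a partition by the total weight of cut-crossing edges is an assumption about the intended objective.

```python
from itertools import combinations

def best_partition(num_snps, edge_weight):
    """Exhaustive search (feasible only for small windows) for the
    bipartition (S1, S2) of the SNPs maximizing the total cut weight,
    summed over the four graphs G_ab.  edge_weight(a, b, j1, j2)
    returns the weight of edge (j1, j2) in graph G_ab."""
    snps = range(num_snps)
    best_val, best_parts = float("-inf"), None
    # enumerate S1 with 1 <= |S1| <= n // 2 to skip mirrored duplicates
    for size in range(1, num_snps // 2 + 1):
        for s1 in combinations(snps, size):
            s1_set = set(s1)
            s2 = [j for j in snps if j not in s1_set]
            cut = sum(edge_weight(a, b, j1, j2)
                      for a in (0, 1) for b in (0, 1)
                      for j1 in s1 for j2 in s2)
            if cut > best_val:
                best_val, best_parts = cut, (list(s1), s2)
    return best_parts, best_val
```

Enumerating all bipartitions costs O(2^k) for a window of k SNPs, which is why this is only done for small windows (e.g., the 5-SNP windows used in the experiments).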

Generation of haplotypes for each window

For every window, the recursive process described above returns the perfect phylogeny tree and the haplotypes it spans. Since the data may not perfectly fit the model, we add additional haplotypes, derived from the tree, to the pool of possible haplotypes.

For each individual, we choose the pair of haplotypes (h_1, h_2) from the pool which maximizes the a posteriori probability:

Pr(h_1, h_2 | reads) ∝ Pr(reads | h_1, h_2) · Pr(h_1) · Pr(h_2),

The priors Pr(h) are derived from the estimated haplotype frequencies, assigning one weight to the haplotypes of the tree and another weight to the additional haplotypes derived from it,

which is maximized by enumerating all candidate haplotype pairs in the pool.
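The a posteriori maximization can be sketched as a direct enumeration over candidate pairs. The mixture-of-copies read likelihood (each read drawn from either haplotype with probability 1/2) is an assumption consistent with the sequencing model described in the Methods.

```python
from itertools import combinations_with_replacement
from math import exp, log

def read_loglik(read, start, hap, eps):
    """Log-probability of the read assuming it originated from `hap`,
    with a per-base sequencing error rate eps."""
    return sum(log(1 - eps) if allele == hap[start + off] else log(eps)
               for off, allele in enumerate(read))

def map_pair(reads, candidates, priors, eps=0.01):
    """Choose the haplotype pair maximizing Pr(reads|h1,h2)*Pr(h1)*Pr(h2).
    Each read is a (start, alleles) tuple and is assumed to come from
    either chromosome copy with probability 1/2."""
    best_ll, best_pair = float("-inf"), None
    for i, j in combinations_with_replacement(range(len(candidates)), 2):
        h1, h2 = candidates[i], candidates[j]
        ll = log(priors[i]) + log(priors[j])  # prior of the pair
        for start, read in reads:
            mix = 0.5 * exp(read_loglik(read, start, h1, eps)) \
                + 0.5 * exp(read_loglik(read, start, h2, eps))
            ll += log(mix)
        if ll > best_ll:
            best_ll, best_pair = ll, (h1, h2)
    return best_pair
```

Because reads covering more than one SNP carry phase information, a pair that jointly explains all reads scores far better than any pair that explains only some of them.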

Stitching of windows

The framework discussed so far assumes that the haplotypes are inferred within a window of a fixed number of SNPs; consecutive windows then have to be stitched together into chromosome-wide haplotypes.

Let (h_1, h_2) be the haplotype pair inferred for an individual in the current window, and let (g_1, g_2) be the corresponding pair in the skeleton phasing produced by BEAGLE.

We calculate the Hamming distances d_1 = d(h_1, g_1) + d(h_2, g_2) and d_2 = d(h_1, g_2) + d(h_2, g_1).

In case d_1 < d_2, we keep the current orientation of the pair, and in case d_2 < d_1, we swap the two haplotypes before appending the window to the global solution.
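The orientation test can be sketched as follows; the skeleton pair stands for the BEAGLE phasing of the same individual over the same window, which is an assumption about how the comparison is anchored.

```python
def hamming(a, b):
    """Number of positions at which two haplotypes disagree."""
    return sum(x != y for x, y in zip(a, b))

def orient_window(window_pair, skeleton_pair):
    """Decide the orientation of a window's haplotype pair against the
    skeleton phasing: keep the order whose total Hamming distance to the
    skeleton is smaller, otherwise swap the two haplotypes."""
    h1, h2 = window_pair
    g1, g2 = skeleton_pair
    d_keep = hamming(h1, g1) + hamming(h2, g2)
    d_swap = hamming(h1, g2) + hamming(h2, g1)
    return (h1, h2) if d_keep <= d_swap else (h2, h1)
```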

Results

In order to evaluate the performance of our methods, we first used simulated data to generate a large number of trees, and we measured the accuracy of the tree reconstruction directly. We then considered a set of haplotypes, either randomly generated from the simulated tree or taken from real data, and we evaluated the phasing accuracy on these data. For the latter, we used an extension of the switch distance error metric. Generally, there are two types of errors: the first type is regular switch errors (for example, two haplotypes 11 and 00 phased as 10 and 01), and the second type is mismatch errors (two haplotypes 11 and 00 phased as 11 and 01). We used the sum of all switch and mismatch errors (the SM error).

All reads and haplotypes were restricted to SNP positions whose genotype satisfies g_i ∈ {(0, 0), (1, 0), (1, 1)}.
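A minimal sketch of the switch/mismatch count at heterozygous sites, following the two error types described above; the exact normalization used in the reported results is not reproduced here.

```python
def sm_errors(true_pair, inferred_pair):
    """Count switch and mismatch errors at heterozygous sites.
    A mismatch is a het site where the inferred alleles are not
    complementary (e.g., 11/00 phased as 11/01); a switch is a change
    of phase between consecutive correctly-genotyped het sites
    (e.g., 11/00 phased as 10/01)."""
    t1, t2 = true_pair
    i1, i2 = inferred_pair
    switches = mismatches = 0
    prev_phase = None
    for a, b, c, d in zip(t1, t2, i1, i2):
        if a == b:
            continue  # only heterozygous sites are informative for phase
        if c == d:
            mismatches += 1  # inferred alleles are not complementary
            continue
        phase = 0 if c == a else 1  # which true haplotype i1 follows here
        if prev_phase is not None and phase != prev_phase:
            switches += 1
        prev_phase = phase
    return switches, mismatches
```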

Evaluation under simulated data

We used the MSMS software to simulate haplotype data under the coalescent model.

Accuracy of tree reconstruction

Using MSMS, we generated a set of 100 random haplotype groups over 30 SNPs, and for each group we randomly generated a set of individuals according to the true tree T_True, whose haplotypes are h_1, ..., h_31. The output of the algorithm is another tree with a possibly different set of haplotypes, T_PPHS. We compared T_True and T_PPHS by the total frequency ∑_{i∈S} f_i of the set S of haplotypes shared by the two trees.

Tree reconstruction precision

**Tree reconstruction precision**. The tree reconstruction percentage as a function of the population size. The tests were done with 30 SNPs and a minimal haplotype frequency of 3%. On the right (Figure 2.2), the sequencing error rate was 2%, with varying coverage values, and on the left (Figure 2.1), the coverage was 10, and the error rate was set to 4%.

Evaluation of phasing accuracy

We compared the results of the PPHS algorithm to those of BEAGLE.

Phasing accuracy on simulated data

**Phasing accuracy on simulated data**. The window-based SM error metric as a function of the sequencing error rate (Figure 3.1) and as a function of the coverage (Figure 3.2). The tests were done with haplotype windows of length 5 over a population of 5 individuals. The coverage was set to 5 on the left, and the sequencing error rate was set to 3% on the right; each point is the average of 10,000 windows.

Evaluation under real data

In the above section we evaluated the performance of the algorithms under the coalescent model without recombinations. In practice, there are cases in which this model does not capture the empirical behavior of the data, and it is therefore important to test the algorithm on real datasets. To do so, we used the EUR population data from the 1000 Genomes Project.

The genotype data provided by the 1000 Genomes Project provides us with a realistic setting of the haplotype distribution in the population. Our experiments involved taking these haplotypes and simulating the sequencing process, as described in the Methods. We first measured the accuracy of the algorithms as a function of the coverage and the sequencing errors with a set of five individuals. We also compared our algorithm to haplotype assembly algorithms (HapCut).

We observed that with coverage of 4 or 6, PPHS is consistently better than BEAGLE by 40%-100% when using the window-based error function (see Figure 4).

Phasing accuracy on real data with varying coverage values

**Phasing accuracy on real data with varying coverage values**. The window-based SM error metric as a function of the sequencing error rate for coverage values of 2 (Figure 4.1), 4 (Figure 4.2), and 6 (Figure 4.3). Read length was set to 400. The test was done with a population of 5 individuals. The window size used by PPHS is 5 SNPs.

We also compared our algorithm to HapCut, as can be seen in Table 1.

The performance of HapCut versus PPHS.

| **Algorithm** | **Test 1** | **Test 2** | **Test 3** | **Test 4** |
| --- | --- | --- | --- | --- |
| HapCut - No Calls | 20.85 | 20.63 | 18.67 | 18.03 |
| HapCut - Errors | 15.92 | 8.71 | 24.80 | 15.56 |
| Beagle | 2.76 | 1.81 | 2.20 | 1.09 |
| PPHS | 2.43 | 1.20 | 1.71 | 0.24 |

All tests were done with 5 individuals; tests 1-3 had an expected coverage of 5, while test 4 had an expected coverage of 20. Tests 1 and 3 had a sequencing error rate of 5%, and tests 2 and 4 had a sequencing error rate of 1%. The read length of tests 1 and 2 was 400, while for tests 3 and 4 it was 2000 bases. The cells show the percentage of switch-mismatch errors in the data.

Conclusions

This work presents a new algorithm for phasing. The algorithm works by reconstructing the perfect phylogeny tree in every short region. Unlike previous perfect-phylogeny-based methods (e.g., HAP), which phase genotype data, our algorithm operates directly on noisy sequence reads.

The results demonstrate that the proposed algorithm works well and is robust to sequencing errors and small population sizes. In addition to the solution for the phasing problem, the algorithm provides a new method to reconstruct a perfect phylogeny under an error model. This approach, however, has a few limitations, which may be of interest for further research. The method is designed to work in windows. In order to stitch the windows, the BEAGLE results are used as a skeleton, and it is likely that more tailored methods (e.g., a variant of [Halperin, Sharan, Eskin]) could improve this step.

An additional limitation of the algorithm is its performance when handling large populations. As the results show, the performance of BEAGLE improves rapidly with the size of the population, whereas the performance of PPHS improves at a slower rate. However, given simulated data (based on the coalescent model), the algorithm's performance does improve considerably as the population size is increased. This might suggest that when using the 1000 Genomes data, the perfect phylogeny model breaks down as the number of samples increases, due to subtle population structure or simply a large number of errors within that data. If the former is true, there may be an optimization procedure that selects the length of the windows as a function of the linkage disequilibrium structure in the region. Such optimization of the parameters may result in better phasing algorithms, and particularly in a better reconstruction of the trees.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

AE and EH developed the method. AE and EH designed the experiments. AE implemented the method and performed experiments. AE and EH analyzed results and wrote the manuscript. All authors read and approved the final manuscript.

Acknowledgements

E.H. is a faculty fellow of the Edmond J. Safra Bioinformatics program at Tel-Aviv University. A.E. was supported by the Israel Science Foundation grant no. 04514831, and by the IBM Open Collaborative Research Program.

This article has been published as part of