Department of Computational Biology, Faculty of Frontier Science, The University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa, Chiba 277-8561, Japan

Abstract

Background

Haplotype information is useful for various genetic analyses, including genome-wide association studies. Determining haplotypes experimentally is difficult, and several computational approaches infer haplotypes from genomic data. Among such approaches, single individual haplotyping, or haplotype assembly, which infers the two haplotypes of an individual from aligned sequence fragments, has been attracting considerable attention. To avoid incorrect results in downstream analyses, it is important not only to assemble haplotypes as long as possible but also to provide means to extract highly reliable haplotype regions. Although there are several efficient algorithms for solving the haplotype assembly problem, there is no efficient method that allows for extracting the regions assembled with high confidence.

Results

We develop a probabilistic model, called MixSIH, for solving the haplotype assembly problem. The model has two mixture components representing the two haplotypes. Based on the optimized model, a quality score, which we call the 'minimum connectivity' (MC) score, is defined for each segment in the haplotype assembly. Because existing accuracy measures for haplotype assembly are designed to compare the efficiency of different algorithms and are not suitable for evaluating the quality of a set of partially assembled haplotype segments, we develop an accuracy measure based on pairwise consistency and evaluate the accuracy on simulated and real data. By using the MC scores, our algorithm can extract highly accurate haplotype segments. We also show evidence that an existing experimental dataset contains chimeric read fragments derived from different haplotypes, which significantly degrade the quality of the assembled haplotypes.

Conclusions

We develop a novel method for solving the haplotype assembly problem. We also define a quality score, based on our model, that indicates the accuracy of the haplotype segments. In our evaluation, MixSIH successfully extracts reliable haplotype segments. The C++ source code of MixSIH is available at

Introduction

Human somatic cells are diploid and contain two homologous copies of each chromosome, one derived from the paternal and the other from the maternal genome. The two chromosomes differ at a number of loci, and the most abundant type of variation is the single nucleotide polymorphism (SNP). Most current research does not determine the chromosomal origin of the variations and uses only genotype information for the analyses. However, haplotype information is valuable for genome-wide association studies (GWAS)

Let us consider a simple example to demonstrate the importance of haplotype information. Suppose that in a gene-coding region there are two SNP loci, each carrying an independent deleterious mutation on one of the two homologous chromosomes. If both deleterious mutations are located on the same chromosome, the other chromosome can still produce normal proteins. On the other hand, if each chromosome contains one of the two deleterious mutations, the cells cannot produce normal proteins. These two cases cannot be distinguished with genotype information alone.
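The ambiguity can be made concrete with a tiny sketch (a hypothetical two-locus example; allele 1 marks a deleterious mutation): the cis and trans configurations are indistinguishable from genotypes alone, yet only one of them retains an intact chromosome.

```python
# Hypothetical two-locus example: allele "1" marks a deleterious mutation.
# cis: both mutations on one chromosome; trans: one mutation on each.
cis = ("11", "00")    # (chromosome 1, chromosome 2)
trans = ("10", "01")

def genotypes(haplotypes):
    """Unordered allele pair at each locus, as seen by genotyping."""
    a, b = haplotypes
    return [tuple(sorted((x, y))) for x, y in zip(a, b)]

def has_intact_copy(haplotypes):
    """True if at least one chromosome carries no deleterious allele."""
    return any(all(c == "0" for c in h) for h in haplotypes)
```

Both configurations yield the genotype (heterozygous, heterozygous), while only the cis configuration keeps an intact chromosome.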

There is a group of algorithms for haplotype inference that statistically construct a set of haplotypes from population genotypes

Another group of algorithms is single individual haplotyping (SIH) or haplotype assembly. These algorithms infer the two haplotypes of an individual from sequenced DNA fragments


**An illustration of SIH**. An illustration of single individual haplotyping (SIH). The input data for SIH are the SNP fragments (B), which are extracted from the heterozygous alleles in the aligned DNA fragments (A). SIH algorithms (C) reconstruct the original haplotypes (D) from the SNP fragments.
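The A-to-B step of the figure can be sketched as follows (positions, reads, and alleles here are invented for illustration): each aligned read is projected onto the heterozygous sites, and the two alleles at each site are coded as 0/1, with '-' marking uncovered sites.

```python
# Hypothetical heterozygous sites: position -> (allele coded 0, allele coded 1).
het_sites = {3: ("A", "G"), 7: ("C", "T"), 11: ("G", "A")}

def snp_fragment(read, start):
    """Binary SNP fragment of one aligned read; '-' marks sites the read
    does not cover (or bases matching neither known allele)."""
    out = []
    for pos in sorted(het_sites):
        if start <= pos < start + len(read):
            base = read[pos - start]
            a0, a1 = het_sites[pos]
            out.append("0" if base == a0 else "1" if base == a1 else "-")
        else:
            out.append("-")
    return "".join(out)
```

For example, a read starting at position 1 that carries the 0-alleles at sites 3 and 7 yields the fragment "00-".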

SIH algorithms did not attract much attention until recently, because the read fragments of next-generation sequencing experiments were not long enough to span multiple heterozygous loci, which occur only about once per kilobase on average

Haplotype information that contains errors is likely to lead to incorrect results in downstream analyses. For example, in detecting recombination events from parent-offspring haplotypes

The algorithms for SIH are classified into two strategies: most of the previous algorithms use deterministic strategies

On the other hand, the probabilistic approaches of Kim

In this paper, we develop a novel probabilistic SIH model that is very different from the probabilistic models of Kim

**This file includes the explanation of our model, details of the parameter optimization, and some additional analyses**.


Methods

Algorithms and implementation

Notation

Throughout the paper, we denote the number of elements of any set S by |S|, and the n-fold direct product of a set A by A^{⊗n}. Let Φ = φ_{1} ... φ_{M} be a sequence of phases over the M heterozygous SNP sites, where each phase φ_{j} = (φ_{j0}, φ_{j1}) takes either (0, 1) or (1, 0). The pair of binary sequences (φ_{10} ... φ_{M0}, φ_{11} ... φ_{M1}) is referred to as the haplotypes.

Let X = {x_{i} | i = 1, ..., N} be the set of SNP fragments. Each fragment x_{i} covers a set of sites S_{i}, and x_{ij} ∈ {0, 1} denotes the allele of x_{i} at site j ∈ S_{i}. A fragment x_{i} is said to connect two sites j_{1} and j_{2} if j_{1}, j_{2} ∈ S_{i}. The set of fragments that cover site j is denoted by X^{c}_{j}.

The SIH problem takes a set of aligned SNP fragments as input and reconstructs the two underlying haplotypes.

Mixture model

We model the probabilistic distribution of the observed fragments

where Θ represents a set of parameters defined later, Φ^{(i) }∈ Δ(_{i}_{i}_{i}_{1 }. . . _{N }_{i}^{m}_{i }_{i }^{(i) }as follows.

where,

is the probability that we observe

We take ^{m}^{m}

Let _{i }_{i}_{i}|i ^{(i)}

We explain the difference between our model and the models of Kim

The minimum connectivity score

As described above, the two haplotypes

Suppose that the probabilistic model is optimized for two segments of SNP sites between which there are no connecting fragments; then the association of the haplotypes {0, 1} with the true paternal and maternal chromosomes is selected at random for each segment. Even if there are several connecting fragments, the associations in each segment are determined almost randomly if the number of connecting fragments is not sufficient or there are many conflicting fragments. Such sites often cause switch errors. We define the connectivity at site _{0 }as a log ratio of the marginal log likelihoods:

where the marginal likelihoods are computed for the original model and for the model in which the two haplotypes are swapped to the right of j_{0}, and the second equality follows from the symmetry of the model. Only the fragments connecting the two sides of j_{0} are necessary to compute the connectivity of site j_{0}. The connectivity measures the resilience of the assembly result against swapping the two haplotypes 0 and 1 in the right part φ_{j_0}, ..., φ_{M}.
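A minimal sketch of this computation, assuming a simple per-allele error model with rate ERR (the paper's actual marginal likelihood comes from the full mixture model; this simplified fragment likelihood is an illustration only):

```python
import math

ERR = 0.01  # assumed per-allele error rate (illustrative)

def frag_lik(frag, hap):
    """Mixture likelihood of one fragment given the haplotype pair
    (hap and its complement), each chosen with probability 1/2."""
    comp = "".join("1" if c == "0" else "0" for c in hap)
    def match(h):
        p = 1.0
        for j, c in enumerate(frag):
            if c != "-":
                p *= (1 - ERR) if c == h[j] else ERR
        return p
    return 0.5 * match(hap) + 0.5 * match(comp)

def connectivity(frags, hap, j0):
    """Log-likelihood ratio of the assembly against the version whose
    haplotypes are swapped from site j0 onward; only fragments that
    connect the two sides of j0 contribute."""
    swapped = hap[:j0] + "".join("1" if c == "0" else "0" for c in hap[j0:])
    c = 0.0
    for f in frags:
        covered = [j for j, ch in enumerate(f) if ch != "-"]
        if covered and covered[0] < j0 <= covered[-1]:
            c += math.log(frag_lik(f, hap)) - math.log(frag_lik(f, swapped))
    return c
```

A positive connectivity means the fragments support the current phasing across site j0; with no connecting fragments the connectivity is zero, reflecting the random association of the two segments.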

For each pair of sites (j_{1}, j_{2}) (j_{1} < j_{2}), we define the minimum connectivity (MC) score as the minimum of the connectivities C(j_{0}) over the sites between them: MC(j_{1}, j_{2}) = min_{j_1 < j_0 ≤ j_2} C(j_{0}).

We extract confidently assembled regions by selecting the pairs (_{1}, _{2}) with high MC values. From the above definition, it is obvious that if the MC value is higher than a given threshold for some pair (_{1}, _{2}), then all the pairs inside range [_{1}, _{2}] have MC values higher than the threshold. In this sense, MC(_{1}, _{2}) can be considered as defined on the range [_{1}, _{2}].
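Given precomputed per-site connectivities, the MC score of a range is just a running minimum, which immediately yields the nesting property described above. A sketch (conn is a hypothetical array of per-site connectivity values):

```python
def mc_score(conn, j1, j2):
    """Minimum connectivity over the sites in (j1, j2]: the MC score of
    the range [j1, j2]."""
    return min(conn[j] for j in range(j1 + 1, j2 + 1))
```

Because every subrange takes its minimum over a subset of the same sites, the MC score of a subrange can never be lower than that of the enclosing range.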

Variational Bayesian inference

We use the variational Bayesian expectation-maximization (VBEM) algorithm to optimize the parameters Θ ^{H}^{Ψ}(^{Θ}(Θ) such that the Kullback-Leibler divergence _{H}_{ΨΘ}(

where ^{H}^{Ψ }is a normalization constant, _{ihjν }_{jν }_{j}|λ_{j}^{H}^{Ψ}(^{Θ}(Θ) are connected through the dependencies among the hyperparameters, they cannot be found simultaneously. Therefore, we optimize _{ihjν }_{jν }

In our model, the parameters often converge to sub-optimal solutions, because switch errors present in sub-optimal configurations cannot be removed by gradual parameter changes. Therefore, we apply a heuristic procedure that re-runs the VBEM several times with twisted parameter configurations after each convergence:

1. Run VBEM and calculate the connectivities for all the sites.

2. Run VBEM again with a parameter set Λ that is twisted at a site with low connectivity.

3. Repeat until convergence.
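The restart heuristic can be illustrated with a greedy stand-in (illustrative only: the actual method twists VBEM hyperparameters at low-connectivity sites, whereas this sketch twists a binary haplotype directly and scores it with a simple mixture likelihood under an assumed error rate):

```python
import math

ERR = 0.05  # assumed per-allele error rate (illustrative)

def loglik(frags, hap):
    """Log likelihood of the fragments under a simple two-haplotype mixture."""
    comp = "".join("1" if c == "0" else "0" for c in hap)
    total = 0.0
    for f in frags:
        def match(h):
            p = 1.0
            for j, c in enumerate(f):
                if c != "-":
                    p *= (1 - ERR) if c == h[j] else ERR
            return p
        total += math.log(0.5 * match(hap) + 0.5 * match(comp))
    return total

def twist(hap, j):
    """Swap the two haplotypes from site j onward."""
    return hap[:j] + "".join("1" if c == "0" else "0" for c in hap[j:])

def twist_restart(frags, hap):
    """Greedy stand-in for the restart loop: repeatedly apply the best
    twist as long as it increases the likelihood."""
    while True:
        best = max((twist(hap, j) for j in range(1, len(hap))),
                   key=lambda h: loglik(frags, h))
        if loglik(frags, best) <= loglik(frags, hap) + 1e-9:
            return hap
        hap = best
```

A twist at the right position removes a switch error in a single step, which gradual parameter updates cannot do.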

Here, the twist of hyperparameters Λ = {_{jν}_{jν}

Inferring haplotypes

We set _{jv }

We select the phase

Possible extensions of the model

In this paper, we consider only the binary representation of heterozygous sites. We also constrain the error rate to be constant throughout the sequence. However, some of these constraints are easily removed. We can include homozygous sites and four nucleotide alleles by expanding the phase set Δ. For example, the phase set of a multi-allelic variant is represented as Δ = {(A,C),(A,G),(C,A),(C,G),(G,A),(G,C)}. We can even include small structural variations if they can be represented by additional allele symbols; the phase set of a structural variant can be represented as Δ_{1 }= {(A,-),(-,A)} for an indel and Δ_{2 }= {("AC","ACAC"),("ACAC","AC")} for a short tandem repeat. With these extensions, the accuracy of genotype calling of multi-allelic variants from sequencing data might be improved by considering haplotypes simultaneously.

Datasets and data processing

Dataset generation

Simulation data were created through a strategy similar to the one reported by Geraci. We generated fragments from the two haplotypes and then randomly flipped the binary values of the fragments from 0(1) to 1(0) with probability e. We used fragment lengths between l_{1 }= 3 and l_{2 }= 7 and
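A sketch of such a generator (parameter names are assumptions: l1 and l2 as minimum and maximum fragment length, err as the flip probability; the actual benchmark protocol may differ in detail):

```python
import random

def simulate_fragments(hap, n_frags, l1=3, l2=7, err=0.1, rng=None):
    """Sample each fragment from one of the two haplotypes chosen uniformly,
    then flip each binary value with probability err (0 -> 1, 1 -> 0)."""
    rng = rng or random.Random(0)
    comp = "".join("1" if c == "0" else "0" for c in hap)
    frags = []
    for _ in range(n_frags):
        length = rng.randint(l1, min(l2, len(hap)))
        start = rng.randrange(len(hap) - length + 1)
        src = rng.choice((hap, comp))
        body = "".join(("1" if c == "0" else "0") if rng.random() < err else c
                       for c in src[start:start + length])
        frags.append("-" * start + body + "-" * (len(hap) - start - length))
    return frags
```

With err = 0 every fragment agrees exactly with one of the two haplotypes at its covered sites.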

For the real data, we used the dataset of Duitama's work, which contains heterozygous sites on the autosomal chromosomes; the haplotypes of about 1.36 × 10^{6} of these sites were determined by a trio-based statistical phasing method

The normalized linkage disequilibrium

We compared our MixSIH software with ReFHap

For the comparison of the runtimes, we generated simulation data with

Accuracy measures

As described in the introduction, our algorithm focuses on extracting reliable haplotype regions. To examine whether we have succeeded in doing so, an accuracy measure that evaluates the quality of piecewise haplotype regions is needed. However, existing accuracy measures are designed to compare the efficiency of different algorithms and are not suitable for evaluating the quality of piecewise haplotype regions.

Let Φ^{(t) }be the true haplotypes, and Φ the inferred haplotypes. The inferred haplotypes Φ are a set of partially assembled haplotype segments Φ = (Φ_{1}, Φ_{2}, ..., Φ_{B}), where B is the number of segments and Φ_{b} denotes the b-th segment.

Many previous papers used the Hamming distance to measure the quality of assembled haplotypes

where Φ_{0 }represents a fully assembled haplotype prediction and

However, this definition is inconvenient because the minimization is applied to each segment separately, and this accuracy measure can therefore always be improved simply by breaking a segment into smaller pieces at random positions.

The switch error rate counts the relative-phase disagreements between the inferred haplotypes Φ and the true haplotypes Φ^{(t) }at neighboring heterozygous sites:
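A sketch using a common formulation of the switch error rate, counting disagreements in relative phase between adjacent heterozygous sites (the exact normalization in the literature varies):

```python
def switch_error_rate(pred, true):
    """Fraction of adjacent site pairs whose relative phase disagrees
    between the predicted and true haplotypes (each given as one of its
    two binary sequences)."""
    def rel(h):  # relative phase between neighbors: 0 = same allele, 1 = different
        return [int(h[j] != h[j + 1]) for j in range(len(h) - 1)]
    p, t = rel(pred), rel(true)
    return sum(a != b for a, b in zip(p, t)) / len(p)
```

Note that a single switch in the middle and a single switch at the end of a segment both contribute one switch, and two contiguous switches contribute two.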


**An illustration of pair consistency**. Consistency of site pairs. A. a. We assume that the two true haplotypes are the sequences of all 0s and all 1s. b. Inferred haplotypes contain switch errors, indicated by the arrows: (i) a consistent pair; (ii) an inconsistent pair; (iii) if there is an uncontrolled number of switch errors between a pair, the probabilities of being consistent or inconsistent are both 0.5. B. An example in which the switch error rate is not suitable for evaluating the quality of a segment. A reconstructed haplotype with a single switch error in the middle (top) has lower pairwise consistency than one with a single switch error at an end of the segment, but the switch error rate cannot distinguish these situations. Two contiguous switch errors, which are caused by a sequencing or genotyping error and do not disrupt the consistency between the front and back parts, are counted as two single switch errors by the switch error rate (bottom).

Here, we propose another simple accuracy measure based on the pairwise consistency of the prediction with the true haplotypes. This pairwise consistency score is inspired by the
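Pairwise consistency can be counted directly: a site pair is consistent when the predicted relative phase (same or different alleles) agrees with the true one. A minimal sketch over binary haplotypes (each represented by one of its two sequences):

```python
from itertools import combinations

def consistent_pairs(pred, true):
    """Count site pairs whose relative phase (same/different alleles)
    agrees between the predicted and true haplotypes.
    Returns (consistent pairs, total pairs)."""
    ok = total = 0
    for j1, j2 in combinations(range(len(true)), 2):
        total += 1
        ok += (pred[j1] != pred[j2]) == (true[j1] != true[j2])
    return ok, total
```

For a length-4 segment, a single switch in the middle leaves only 2 of 6 pairs consistent, whereas a switch at the end leaves 3 of 6, so this measure distinguishes the two cases that the switch error rate cannot.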

We define the total prediction space as follows. We consider a graph whose nodes are the set of all heterozygous sites, and we connect two nodes by an edge if there is a fragment spanning both sites. We collect all the connected components with at least two nodes and consider each corresponding cluster of heterozygous sites as an independent segment. The total number of pairs is the sum of the numbers of site pairs over the segments. Although it is rare, there are cases in which some segments consist of noncontiguous heterozygous sites. For example, segment sets such as {(1, 4, 5), (2, 3)} and {(1, 3), (2, 4, 5)} may occur for the consecutive heterozygous sites (1, 2, 3, 4, 5). We define
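The segment construction above can be sketched with a union-find over sites (fragments given as strings with '-' for uncovered sites):

```python
def segments(fragments):
    """Cluster heterozygous sites into segments: two sites are connected
    when some fragment covers both; keep components with >= 2 sites."""
    n = len(fragments[0])
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for f in fragments:
        covered = [j for j, c in enumerate(f) if c != "-"]
        for j in covered[1:]:
            parent[find(j)] = find(covered[0])
    comps = {}
    for j in range(n):
        comps.setdefault(find(j), []).append(j)
    return [sorted(c) for c in comps.values() if len(c) >= 2]

def total_pairs(fragments):
    """Total prediction space: sum of site-pair counts over the segments."""
    return sum(len(c) * (len(c) - 1) // 2 for c in segments(fragments))
```

Fragments with internal gaps naturally produce the noncontiguous segments mentioned above, e.g. two fragments covering sites {0, 2} and {1, 3} yield the interleaved segments (0, 2) and (1, 3).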

A more detailed discussion of other accuracy measures is given in Additional file

Potential chimeric fragments

The processed sequence data derived from fosmid pool-based next-generation sequencing might contain chimeric fragments if a pool contains DNA fragments derived from the same region of different chromosomes and reads with different chromosomal origins are merged into a single SNP fragment. By using the trio-based haplotypes, we compute the 'chimerity' of each SNP fragment

where x_{≤j }and x_{>j }denote the left and right parts of the fragment split after site j, and e_{0 }= 0.028 is the empirical sequence error rate computed by comparing the true haplotypes and all the SNP fragments. We removed potential chimeric fragments with chimerity higher than a given threshold. We then recomputed the accuracies for this filtered dataset and compared them with those for the original dataset.
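As an illustration, the following sketch assumes that the chimerity is the log ratio of the best split explanation (left part matching one true haplotype, right part matching the other) to the best single-haplotype explanation, with e_0 = 0.028 as the per-allele error rate; this is an assumed formulation, not necessarily the exact definition used above.

```python
import math

E0 = 0.028  # empirical per-allele error rate from the text

def part_lik(frag, hap, sites):
    """Likelihood of the fragment alleles at the given sites under hap."""
    p = 1.0
    for j in sites:
        p *= (1 - E0) if frag[j] == hap[j] else E0
    return p

def chimerity(frag, hap):
    """Log ratio of the best chimeric (split) explanation to the best
    single-haplotype explanation of the fragment (assumed form)."""
    comp = "".join("1" if c == "0" else "0" for c in hap)
    covered = [j for j, c in enumerate(frag) if c != "-"]
    whole = max(part_lik(frag, h, covered) for h in (hap, comp))
    best_split = max(
        part_lik(frag, h1, covered[:k]) * part_lik(frag, h2, covered[k:])
        for k in range(1, len(covered))
        for h1, h2 in ((hap, comp), (comp, hap)))
    return math.log(best_split / whole)
```

A fragment whose halves match opposite haplotypes gets a large positive chimerity, while a fragment consistent with a single haplotype gets a negative one.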

Results and discussion

Comparison of pairwise accuracies

We examined whether MixSIH can extract accurate haplotype regions by using the MC score. Figure


**Comparison of pairwise accuracies**. Precision curves based on the consistent pair counts. The

Effects of potential chimeric fragments

Inspecting the switch errors in the prediction for the real dataset, we found that there are potential chimeric fragments that have a considerable effect on the pairwise accuracies. A chimeric fragment is defined as a fragment whose left and right parts match different chromosomes very well. Such fragments can occur in fosmid pool-based next-generation sequencing data. We show the chimerity distribution in Additional file


**Effects of potential chimeric fragments**. The precisions of the algorithms for the dataset in which fragments with chimerity greater than 10 are removed. For comparison, the precisions of MixSIH for the original dataset are also shown as diamonds.

These results suggest that more careful data processing to avoid spurious chimeric fragments is necessary to obtain high-quality haplotype assembly.

Incorporation of the trio-based data

Although the trio-based statistical phasing method can determine the phases of most sites, there still exist SNP sites whose phases cannot be determined by this method. SIH can determine phases that are left undetermined by the trio-based method, so more complete haplotype data can be obtained by combining the SIH-based and trio-based data. To examine how many additional phases can be determined by such a combination, we devise a method that combines both types of information to determine the phases (see the Additional file

Spatial distribution of MC values

Figure


**Spatial distribution of MC and LD**. A. A colored density plot of the MC values and the number of fragments. The

Dependency of MC values on the fragment parameters

Figure _{1}, _{2}] (three panels), coverages


**Dependency of MC values**. Dependency of the lowest MC value with precision _{1}, _{2}], and error rate

Optimality of inferred parameters

We use a heuristic method for parameter optimization to avoid sub-optimal solutions. To test whether the optimized parameters actually reach the global optimum, we compared the log likelihood of the optimized parameters with the approximate maximal log likelihood obtained by optimizing the parameters with an initial condition in which the optimal solution falls into the set of true haplotypes; we add one to the Dirichlet parameters for the true phase probability: that is,


**Optimality of inferred parameters**. Increase of the log likelihood values at each iteration. The dotted line represents the approximate maximal log likelihood; the solid line, the changes of the optimized log likelihood for each twist operation; the broken line, the connectivity values at the positions at which the parameters are twisted.

Comparison of running times

Figure shows the running times of the tested algorithms. For the heterozygous sites on chromosome 1 (on the order of 10^{5} sites), it is roughly estimated that MixSIH would take about 15 days to finish haplotyping data in which a single connected component includes all heterozygous sites, so MixSIH is still manageable for such chromosome-wide data.


**Running times**. The running times of the tested algorithms. The

Conclusions

With advances in sequencing technologies and experimental techniques, single individual haplotyping (SIH) has become increasingly appealing for haplotype determination in recent years. In this paper, we have developed a probabilistic model for SIH (MixSIH) and defined the minimum connectivity (MC) score, which can be used for extracting accurately assembled haplotype segments. We have introduced a new accuracy measure, based on the pairwise consistency of the inferred haplotypes, which is intuitive and easy to calculate but nevertheless avoids some of the problems of existing accuracy measures. By using the MC scores, our algorithm can extract highly accurate haplotype segments. We have also found evidence that there are a small number of chimeric fragments in an existing dataset from fosmid pool-based next-generation sequencing, and that these fragments considerably reduce the quality of the assembled haplotypes. Therefore, a better data processing method is necessary to avoid creating chimeric fragments.

Our program uses only read fragment data derived from an individual. However, it is expected that more powerful analyses could be made by combining SIH algorithms with statistical haplotype phasing methods that use population genotype data. An interesting possibility would be to construct a unified probabilistic model that infers the haplotypes on the basis of both kinds of data.

Abbreviations

SIH: Single Individual Haplotyping; MC: Minimum connectivity.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

HM designed the probabilistic model, implemented the software, performed the analyses, and wrote the paper. HK contributed to developing the model, designed the experiments, and wrote the paper. Both authors read and approved the final manuscript.

Acknowledgements

The authors thank their research group colleagues for assistance in this study. This study was supported by a Grant-in-Aid for Young Scientists (21700330) and a Grant-in-Aid for Scientific Research (A) (22240031). Computations were performed using the supercomputing facilities at the Human Genome Center, University of Tokyo.

Declarations

The publication costs for this article were funded by a Grant-in-Aid for Young Scientists (21700330), and a Grant-in-Aid for Scientific Research (A) (22240031).

This article has been published as part of