Abstract
Background
The xor-genotype is a cost-effective alternative to the genotype sequence of an individual. Recent haplotype inference methods have aimed at finding solutions based on xor-genotype data. Given the xor-genotypes of a group of unrelated individuals, it is possible to infer the haplotype pairs for each individual with the aid of a small number of regular genotypes.
Results
We propose a framework for maximum parsimony inference of haplotypes based on the search for a sparse dictionary, and we present a greedy method that can effectively infer the haplotype pairs given a set of xor-genotypes augmented by a small number of regular genotypes. We test the performance of the proposed approach on synthetic data sets with different numbers of individuals and SNPs, and compare it with the state-of-the-art xor-haplotyping methods PPXH and XOR-HAPLOGEN.
Conclusions
Experimental results show good inference qualities for the proposed method under all circumstances, especially on large data sets. Results on a real database, CFTR, also demonstrate significantly better performance. The proposed algorithm is also capable of finding accurate solutions with missing data and/or typing errors.
Background
A human genome is a sequence of nucleotides that can differ from one individual to another (approximately 0.1% difference between any two individuals) for various reasons, such as insertions/deletions of fractions of the sequence or, most commonly, the substitution/mutation of single nucleotides at commonly observed sites called single nucleotide polymorphisms (SNPs) [1]. In most SNPs only two of the four possible nucleotides are observed. The information on nucleotide variations extracted from these SNP sites (loci) is encoded as a sequence called a “haplotype”. That is, for a particular SNP site one notation is used for one of the observed nucleotides (e.g., the most commonly observed nucleotide variant, the dominant/major allele) and another notation is used for the other (e.g., the least observed nucleotide variant, the recessive/minor allele). Because of their informative and hereditary nature, identifying the haplotypes of individuals has been an important subject in various medical and scientific studies, such as gene-related disease discovery and drug design [2,3] and population history research [4]. Nonetheless, current experimental techniques are not low-cost and efficient enough for directly sequencing the haplotypes of an individual; identifying them therefore relies mostly on indirect approaches, e.g., using computational methods to infer haplotypes from alternative cost-effective data called the “genotype”.
The entire human genome consists of 23 distinct chromosomes, each appearing in two copies (autosomes) except for chromosome-23 (the allosome), which consists of two copies of chromosome-X in females or one chromosome-X and one chromosome-Y in males. Each chromosome is a pair of two distinct sequences (haplotypes) inherited from the parents, i.e., one from the maternal genome and the other from the paternal genome. The genotype is sequenced by identifying the types of alleles (nucleotide variants) across the SNP locations (loci) in chromosomes. At a particular locus of a chromosome, if both haplotypes have the same allele we call this site in the genotype homozygous and denote it by the type of allele in both haplotypes, either common-type or wild-type; otherwise, if the haplotypes have different alleles (one common-type and one wild-type), we call this site heterozygous. When identifying haplotypes for a given genotype, ambiguity arises at the heterozygous sites since there is no information about which haplotype has the common-type allele and which has the wild-type allele. Clearly, genotypes are less informative than haplotypes, as they are ambiguous at heterozygous sites due to the possible permutations, and computational methods can be employed to identify which allele comes from which haplotype. Recently, more cost-effective alternative methods have been used for genotype sequencing [5], e.g., the widely used denaturing high-performance liquid chromatography (DHPLC) [6]. With certain applications of such methods one can only determine whether an individual is homozygous or heterozygous at a given SNP site, but cannot distinguish the type of allele at homozygous sites. The sequenced data is thereby less informative than regular genotypes, as it only represents the differing sites (XOR operation) between the haplotypes. This less informative form of genotype is named the xor-genotype.
One can solve the haplotype inference problem based on xor-genotypes, i.e., xor-haplotyping, with a reasonable extra computational effort.
Methods for solving the haplotype inference problem given regular genotypes fall into two categories: combinatorial methods, which usually state an explicit objective function and propose methods for optimizing it, and statistical methods, which rely on statistical modeling of the problem. Various methods have been published for the haplotype inference problem [7-13]; however, the xor-haplotyping problem has remained largely under-investigated. Two particular approaches are suitable for xor-haplotyping problems: parsimony haplotyping, which is based on the maximum parsimony principle, and perfect phylogeny haplotyping, which relies on a population genetics assumption called the infinite sites/alleles model [14], i.e., it assumes that allele sequences are long enough that a particular allele will mutate only once in the phylogenetic tree. The perfect phylogeny (PP) model utilizes the infinite sites assumption by building a tree of individuals' haplotypes in which all individuals evolve, with no recurrent mutation, from one common ancestor. An approximate solution to the xor-haplotyping problem under the PP model was introduced in [15], where xor-haplotype inference was cast as a graph realization problem [16,17]. However, the method proposed in [15] (GREAL) is not well-suited for xor-genotypes with a large number of SNPs, i.e., it is usually limited to 30 SNPs [18], and it has not been extended to missing data cases.
On the other hand, it is known that in a population of individuals certain haplotypes are frequently found in certain genomic regions [19]. This fact leads to the parsimony principle, which states that the genotypes of a population of individuals are generated by the least number of distinct haplotypes. Identifying such a smallest set of haplotypes is called the Pure (Maximum) Parsimony Problem, which is NP-hard [20]. An integer linear programming method that finds a pure parsimony solution to this problem was introduced in [21], and in [22] a branch-and-bound method was used to solve the pure parsimony problem. In [23] a method called XOR-HAPLOGEN was proposed for solving the haplotype inference problem in the case of xor-genotype data. This method can find accurate solutions for xor-genotypes with a large number of SNPs. Another parsimony method, pure parsimony xor haplotyping (PPXH), was introduced in [24] for xor-haplotype inference by representing it as a graph realization problem.
In [25] a novel framework for (regular) haplotyping was proposed by interpreting the parsimony principle as a sparse representation of the genotypes. Two approaches are presented: maximizing a sparseness condition on the haplotype frequency vector determined by the inferred haplotypes, and casting the sparsity of this frequency vector as a sparse dictionary selection problem. The latter approach is based on an efficient greedy method, SHSD, in which the haplotypes explaining the given genotypes are determined by a sparse selection from the set of compatible haplotypes. The method constructively determines the solution for each individual while selecting the haplotypes from this set, and it has a convergence guarantee.
For the xor-haplotyping problem, there is an increased ambiguity due to the XOR operation between haplotypes, i.e., the process of xor-genotyping, which determines whether the alleles in the two haplotypes differ at a particular site (heterozygous) or are the same (homozygous). However, this ambiguity can be resolved with the assistance of regular genotypes. Regular genotypes can either be used as post-processing inputs for eliminating set-equivalent solutions of a particular inference, or they can be used to refine the inference while constructing the solution.
Tractability of the maximum parsimony haplotyping problem in the xor-genotype case is still open [24]. In this paper, we propose a modified version of SHSD, called XHSD, that can efficiently find a solution to the maximum parsimony xor-haplotyping problem and resolve the ambiguity with the help of a small number of regular genotypes. For a given set of xor-genotypes, the haplotype pair of each individual is selected from the set of compatible haplotypes by a sparse dictionary selection method. The selection of dictionary columns from the set of compatible haplotypes and the sparse representation of xor-genotypes are formulated as a joint combinatorial optimization problem. The objective function of this problem maximizes a variance reduction metric over all individuals. Our algorithm is a low-complexity greedy method that terminates once the solution is fully determined. To resolve the ambiguity and to improve the inference accuracy, we employ a small number of regular genotypes as constraints on the set of compatible haplotypes to help resolve the type of homozygous alleles.
The remainder of the paper is organized as follows. In Preliminaries, we introduce the xor-haplotype inference problem. In Methods, we formulate xor-haplotype inference as a sparse dictionary selection problem and present an efficient greedy method for solving this problem. We also discuss the use of regular genotypes to resolve ambiguity. In Extensions, we discuss how the algorithm deals with long sequences and data with missing sites. In Results and discussion, we present the experimental results on synthetic and real data sets under various conditions. Finally, Conclusions closes the paper.
Preliminaries
At an SNP locus only 2 nucleotides are observed, so a single bit is sufficient to represent the nucleotide variants, with 0 encoding the major allele and 1 encoding the minor allele. The haplotype of an individual can thereby be represented by a binary vector that shows the SNP variants across the individual’s chromosome. The genotype can then be thought of as a ternary vector where a 0 (2) indicates that the site is homozygous and both haplotypes have major 0/0 (minor 1/1) alleles, and a 1 indicates that the site is heterozygous and the haplotypes have different alleles, 0/1 or 1/0. Notice that when encoding homozygous and heterozygous sites we use a notation different from the literature in order to express a genotype vector as the sum of two haplotypes: a minor-homozygous SNP is encoded with 2 and a heterozygous SNP is encoded with 1, so that a 2 in the genotype is given by (the sum of) two minor alleles, and a 1 in the genotype is given by (the sum of) one major and one minor allele.
In general, given a length-L genotype vector, k ≤ L of the loci are heterozygous and thereby ambiguous. At each of the k sites one haplotype can take one of two values (0 or 1) and the other haplotype takes the complementary value. Considering all k heterozygous sites, one haplotype can then be one of 2^{k} binary sequences, and the other haplotype is the complement (inverted values) of that sequence on those sites. Therefore, to solve a genotype with k heterozygous sites, the pair of haplotypes is drawn from a set of 2^{k} distinct binary vectors of length L.
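The counting argument above can be checked with a short sketch. It is only an illustration under the ternary encoding of this section; the function names and the example genotype are hypothetical and not taken from the paper.

```python
from itertools import product


def compatible_haplotypes(genotype):
    """List every haplotype z such that (g - z) is also a binary vector."""
    het_sites = [i for i, g in enumerate(genotype) if g == 1]
    # Homozygous sites are fixed: 0 -> allele 0, 2 -> allele 1.
    base = [1 if g == 2 else 0 for g in genotype]
    haplotypes = []
    for bits in product((0, 1), repeat=len(het_sites)):
        z = list(base)
        for site, bit in zip(het_sites, bits):
            z[site] = bit
        haplotypes.append(tuple(z))
    return haplotypes


def partner(genotype, z):
    """The second haplotype of the pair, g - z."""
    return tuple(g - a for g, a in zip(genotype, z))


g = (0, 1, 2, 1)  # hypothetical genotype with k = 2 heterozygous sites
candidates = compatible_haplotypes(g)
pairs = {frozenset((z, partner(g, z))) for z in candidates}
print(len(candidates), len(pairs))
```

For this genotype the sketch lists 2^2 = 4 compatible haplotypes, which form 2 unordered haplotype pairs.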
In the xor-haplotyping problem, on the other hand, the conflated data (the xor-genotype) is less informative than the regular genotype because the type of allele at homozygous sites is lost. The xor-genotype is itself a binary vector that can be represented by the XOR sum of two haplotypes: for a given site, 1 indicates a heterozygous SNP where the haplotypes have different alleles, and 0 indicates a homozygous SNP where the haplotypes have the same allele, without any distinction as to whether that allele is major or minor. That is, the xor-genotype tells whether a particular SNP site is homozygous, but the type of allele at the homozygous sites is not identified. Every site of an xor-genotype is ambiguous, and each site of the corresponding haplotype can take two values. Therefore, a length-L xor-genotype can be explained by a pair of haplotypes drawn from a set of 2^{L} distinct binary vectors of length L. Hence, because of the additional ambiguity at homozygous sites, the number of possible solutions for an xor-genotype is significantly (in fact, exponentially) larger than that of a regular genotype of the same size.
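To make the information loss concrete, the following sketch encodes two hypothetical haplotype pairs both ways. The helper names and the example haplotypes are illustrative assumptions; only the sum and XOR encodings come from the text.

```python
def genotype(h1, h2):
    """Regular genotype: site-wise sum of the two haplotypes (0, 1 or 2)."""
    return tuple(a + b for a, b in zip(h1, h2))


def xor_genotype(h1, h2):
    """Xor-genotype: site-wise XOR (1 = heterozygous, 0 = homozygous)."""
    return tuple(a ^ b for a, b in zip(h1, h2))


# Two hypothetical individuals whose homozygous sites carry opposite alleles.
pair_a = ((0, 1, 0, 1), (0, 0, 1, 1))
pair_b = ((1, 1, 0, 0), (1, 0, 1, 0))

print(genotype(*pair_a), xor_genotype(*pair_a))
print(genotype(*pair_b), xor_genotype(*pair_b))
```

The two regular genotypes differ at the first and last sites (0 versus 2), while the two xor-genotypes are identical, which is exactly the ambiguity on homozygous sites described above.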
Besides being NP-hard, the xor-haplotyping problem also has no unique solution. The nature of the XOR operation results in a phenomenon called the bit flip degree of freedom [15], i.e., for a particular solution set H consisting of length-L haplotypes, one can produce equivalent solution sets by inverting a certain SNP i ≤ L (or a set of SNPs
Problem definition
For each SNP
where x_{i}(ℓ) ∈ {0, 1} is the xor-genotype of the i-th individual at SNP ℓ, and
where
In regular haplotyping, a putative haplotype z ∈ {0, 1}^{L} is called compatible with a genotype g ∈ {0, 1, 2}^{L} if (g − z) ∈ {0, 1}^{L}, and such a haplotype is a possible solution that can explain that genotype. That is, the haplotype pair {z,(g − z)} is one of the possible solutions to the genotype g. Therefore, for every given genotype g_{i} it is essential to determine a set of compatible haplotypes
In xor-haplotyping, on the other hand, it is easy to see that any haplotype z ∈ {0, 1}^{L} is compatible (consistent) with any xor-genotype x, i.e., x = z ⊕ z^{′}, since there always exists a haplotype z^{′} ∈ {0, 1}^{L} such that z^{′} = x ⊕ z. Therefore, the set of compatible haplotypes
Because of this compatibility between xor-genotypes and candidate haplotypes, an SNP site can always be explained by either of the two alleles, and thus unambiguous SNPs no longer exist. Notice that, in particular, an xor-genotype with all-homozygous SNPs is still ambiguous and needs to be solved up to bit flipping. However, we know that such an xor-genotype is always explained by a pair of identical haplotypes, which correspond to the same column of Z. On the other hand, if there is at least one heterozygous SNP in the xor-genotype then its phasing haplotypes are not identical and correspond to different columns of Z.
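The compatibility property can be verified by brute force on a short sequence. The sketch below is illustrative only; `explaining_pairs` is a hypothetical helper that pairs every binary vector z with its partner z′ = x ⊕ z.

```python
from itertools import product


def explaining_pairs(x):
    """All unordered haplotype pairs {z, z'} with z XOR z' == x."""
    pairs = set()
    for z in product((0, 1), repeat=len(x)):
        z_prime = tuple(a ^ b for a, b in zip(x, z))  # always a valid haplotype
        pairs.add(frozenset((z, z_prime)))
    return pairs


mixed = explaining_pairs((1, 0, 1))           # at least one heterozygous site
all_homozygous = explaining_pairs((0, 0, 0))  # every site homozygous

print(len(mixed), len(all_homozygous))
```

With L = 3, every one of the 8 binary vectors takes part in a solution. The xor-genotype with heterozygous sites has 4 unordered pairs of distinct haplotypes, whereas the all-homozygous one is explained by 8 pairs of identical haplotypes.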
The xor-genotype of the i-th individual is expressed as
x_{i} = (Z v_{i})_{2},   (3)
where (.)_{2} represents the component-wise modulo-2 operation, and v_{i} ∈ {0, 1, 2}^{M}, 1^{T}v_{i} = 2, is the sparse vector indicating the haplotype locations as the column indices of the matrix Z of consistent haplotypes. Notice that the modulo-2 operation in (3) is equivalent to the XOR operation between the two haplotypes selected by v_{i}.
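The equivalence between the modulo-2 product and the XOR of the selected columns can be illustrated as follows. This is a minimal sketch that assumes Z holds all 2^{L} binary vectors as columns in lexicographic order; the column order and the helper names are assumptions made for the example.

```python
from itertools import product

L = 3
Z_columns = list(product((0, 1), repeat=L))  # assumed column order of Z, M = 8


def indicator(indices, size):
    """Build v with 1^T v = 2; a repeated index yields an entry equal to 2."""
    v = [0] * size
    for i in indices:
        v[i] += 1
    return v


def mod2_product(columns, v):
    """Component-wise (Z v) mod 2."""
    return tuple(
        sum(col[site] * weight for col, weight in zip(columns, v)) % 2
        for site in range(len(columns[0]))
    )


v_distinct = indicator((1, 6), len(Z_columns))  # haplotypes 001 and 110
v_repeated = indicator((5, 5), len(Z_columns))  # haplotype 101 used twice

print(mod2_product(Z_columns, v_distinct))  # (1, 1, 1), same as 001 XOR 110
print(mod2_product(Z_columns, v_repeated))  # (0, 0, 0), all-homozygous case
```

The second case shows why an entry of 2 in the indicator vector corresponds to an all-homozygous xor-genotype explained by two copies of the same column.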
Given Z, finding the indicator vector v_{i} for an individual is equivalent to inferring its haplotype pair
Methods
Xor Haplotyping by Sparse Dictionary Selection (XHSD)
If an (all-homozygous) xor-genotype is explained by only one haplotype, i.e., x_{i} = h^{s} ⊕ h^{s}, where the haplotype h^{s} is the s-th column of Z, then the indicator vector multiplies that haplotype by 2, i.e., v(s) = 2 and v(j) = 0 for
where
For each observed xor-genotype x_{i}, the phasing haplotypes are located in columns of Z indexed by
To solve the xor-haplotyping problem, we choose the sparse dictionary
For an observed xor-genotype the reconstruction accuracy can be interpreted as the Euclidean distance between the observation and its closest approximation, i.e.,
where
The individual cost function in (5) is then translated into a fitness function associated with a given dictionary
Finally, the fitness value of
For a given cardinality (sparsity) of n, the best dictionary is therefore given by
and the sparsest dictionary that is sufficient to reconstruct all observed xor-genotypes is determined by
Notice that determining both
For xor-haplotype inference, on the other hand, the problem is fundamentally different. That is, the submodularity property may not hold for the cost function in (5) due to the XOR operation, and thereby the theoretical guarantee does not hold for the greedy method either. Nonetheless, we still use a greedy heuristic similar to SHSD in [25] in order to maximize the variance reduction metric in (5) over the set of observations.
In our algorithm, Xor Haplotyping by Sparse Dictionary Selection (XHSD), we start with an empty dictionary set
To compute (10) requires solving (5) and (6) for each k. In (6) for each individual i,
Notice that in the XHSD algorithm the number of compatible haplotypes in Z increases exponentially in comparison to the regular haplotyping problem with SHSD. However, when regular genotypes are available, we can reduce Z by utilizing them in the cost function (5). The necessary modifications are discussed below in XHSD with regular genotypes. Another fundamental difference in xor-haplotyping is that xor-genotypes do not provide unambiguous genotype information with which one could initialize the dictionary with the corresponding haplotypes and improve the reconstruction accuracy. Nonetheless, with a bias weight, the modified cost function can exploit the available regular genotypes even when they are not unambiguous.
Summary of XHSD algorithm:
• Initialization.
–
– n ← 1.
–
• Iterate until all xor-genotypes are explained, i.e.,
– Perform the greedy search.
∗ For
∗ Let
Set
∗ Check if any xor-genotype is explained by the addition of the new element
– n ← n + 1.
Given the xor-genotypes of a set of individuals, this algorithm finds the haplotypes of each individual based on the maximum parsimony principle.
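The greedy loop summarized above can be sketched in a simplified form. The code below is not the XHSD implementation: as a stand-in for the variance reduction objective it minimizes the total Hamming distance between each xor-genotype and its closest XOR of two dictionary haplotypes (for binary vectors this equals the squared Euclidean distance), it uses no regular genotypes, and all function names and the example data are hypothetical.

```python
from itertools import combinations_with_replacement, product


def xor(a, b):
    return tuple(p ^ q for p, q in zip(a, b))


def hamming(a, b):
    return sum(p != q for p, q in zip(a, b))


def best_pair(x, dictionary):
    """Pair of dictionary haplotypes whose XOR is closest to x."""
    return min(
        combinations_with_replacement(dictionary, 2),
        key=lambda pair: hamming(x, xor(*pair)),
    )


def total_cost(xor_genotypes, dictionary):
    """Summed reconstruction error over all individuals."""
    return sum(hamming(x, xor(*best_pair(x, dictionary))) for x in xor_genotypes)


def greedy_xor_haplotyping(xor_genotypes):
    length = len(xor_genotypes[0])
    candidates = list(product((0, 1), repeat=length))  # Z: all 2^L haplotypes
    dictionary = []
    # Add one haplotype per iteration until every xor-genotype is explained.
    while not dictionary or total_cost(xor_genotypes, dictionary) > 0:
        remaining = [z for z in candidates if z not in dictionary]
        best = min(
            remaining,
            key=lambda z: total_cost(xor_genotypes, dictionary + [z]),
        )
        dictionary.append(best)
    return dictionary, [best_pair(x, dictionary) for x in xor_genotypes]


# Hypothetical xor-genotypes of three individuals over three SNPs.
observed = [(1, 0, 1), (0, 1, 1), (1, 1, 0)]
dictionary, phasing = greedy_xor_haplotyping(observed)
for x, (h1, h2) in zip(observed, phasing):
    print(x, "=", h1, "xor", h2)
```

The loop always terminates because the cost never increases when a haplotype is added and reaches zero at the latest when the dictionary contains all of Z. Like any greedy heuristic without a submodularity guarantee, this simplified variant may return more haplotypes than the most parsimonious solution.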
As an example, consider the following demonstration. Let x_{1}, x_{2} and x_{3} be the xor-genotypes of three individuals, each covering three SNPs, i.e.,
The set of compatible haplotypes for these individuals will consist of all length-3 binary vectors, i.e.,
After initializing Z, and starting with empty dictionary
This simple example demonstrates how the proposed greedy approach can efficiently construct sparse solutions: three xor-genotypes are explained by only three haplotypes within three iterations. Nonetheless, because of the bit flip degree of freedom, the solution set may be only one of the sets equivalent to the true solution, and this ambiguity should be resolved.
Resolving bit flip degree of freedom
In [15] it is shown that the xor perfect phylogeny problem can be solved up to bit flipping based on the characteristics of the given xor-genotypes. Let X ∈ {0, 1}^{L × N} be the xor-genotype matrix of N individuals such that
By bit flipping on a given solution H, one attempts to choose among the set-equivalent solutions
Figure 1. Ambiguity resolution for PPXH method. Informative regular genotypes
However, this post-processing method has certain limitations. Notice that, for large L, the set-equivalent solutions are highly specific to the inferred H, e.g., for a given set of xor-genotypes it is very likely that any two different inferences H_{1} and H_{2} that are not set-equivalent have very different set-equivalent solutions. Bit flipping on different inferences likely leads to different results, and thereby the bit flipping accuracy largely depends on the initial inference H, which is made without the prior knowledge on homozygous SNPs, i.e., regular genotypes. Besides, utilizing more regular genotypes in post-processing, when available, does not necessarily improve the bit flipping accuracy. Basically, deciding among the appropriate bit flippings for a particular locus requires knowledge of that homozygous SNP from a regular genotype. Intuitively, revealing a set of homozygous SNPs by employing the least number of regular genotypes, e.g., as provided by the MTI method, is necessary and sufficient for removing the bit flip degree of freedom across those SNPs. On the other hand, a larger number of regular genotypes will not be any more informative, due to possible inconsistencies in the type of homozygous allele for an SNP site across the given regular genotypes.
Furthermore, notice that flipping the bits at some loci across all the haplotypes in H does not affect the parsimony of the solution. The final solution H^{′} will have the same parsimony as H regardless of the set of loci that are flipped. From the maximum parsimony point of view, refining an xor-haplotyping solution via bit flipping does not necessarily lead to the global optimum unless the initial inference is set-equivalent to the globally optimal solution.
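The invariance under bit flipping is easy to confirm numerically. In the sketch below the solution set, the flipped locus and the helper names are hypothetical; the point is only that flipping a locus in every haplotype changes neither the xor-genotypes nor the number of distinct haplotypes.

```python
def flip(haplotype, loci):
    """Invert the alleles of one haplotype at the given loci."""
    return tuple(b ^ 1 if i in loci else b for i, b in enumerate(haplotype))


def xor(a, b):
    return tuple(p ^ q for p, q in zip(a, b))


# Hypothetical solution: one haplotype pair per individual.
solution = [
    ((0, 0, 0), (1, 0, 1)),
    ((0, 0, 0), (0, 1, 1)),
    ((1, 0, 1), (0, 1, 1)),
]
loci = {1}  # flip the second SNP in every haplotype
flipped = [(flip(a, loci), flip(b, loci)) for a, b in solution]

same_xor = [xor(a, b) for a, b in solution] == [xor(a, b) for a, b in flipped]
size_before = len({h for pair in solution for h in pair})
size_after = len({h for pair in flipped for h in pair})
print(same_xor, size_before, size_after)
```

Both solutions explain the same three xor-genotypes with three distinct haplotypes, so the xor-genotypes and the parsimony criterion alone cannot distinguish between them.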
Therefore, instead of using regular genotypes to post-process a solution, a more intuitive way is to resolve the bit-flip degree of freedom while constructing the solution. In particular, regular genotypes can be used as constraints when solving the homozygous sites of an xor-genotype. In this sense, given the xor-genotypes of a set of individuals, we determine the individuals that have the most informative regular genotypes and preprocess the data set by replacing the xor-genotypes of those individuals with their regular genotypes. The MTI algorithm [15] is useful for finding the least number of such individuals that is adequate to reveal the homozygous alleles for each of the L SNPs. In the proposed XHSD framework, we employ the MTI method to find which individuals should be replaced with regular genotypes, and after replacing them the new data set is presented to the XHSD algorithm (Figure 2).
Figure 2. Ambiguity resolution in XHSD or in XOR-HAPLOGEN (XHAP). Informative regular genotypes
In most cases the xor-genotypes in X have an empty intersection, and for each run MTI outputs 2 or 3 individuals, i.e.,
Next we explain the necessary modifications to the XHSD algorithm for utilizing the regular genotypes.
XHSD with regular genotypes
The information provided by regular genotypes is used to reveal the type of allele in homozygous sites of an individual so that we can improve the reconstruction accuracy in (5), and build the dictionary
where Z is the set of haplotypes that are compatible with the ith individual’s genotype g_{i}, and
To exploit this fact, we can introduce a weight b_{i} in the cost function
The weight parameter b_{i} could be set as proportional to the average rate of homozygous SNPs per genotype, assuming that the more homozygous sites the regular genotype contains the more informative it will be. We experimentally set b_{i}=4 as it yielded good performance with both synthetic and real databases.
Extensions
Long xor-genotypes
Note that the size of Z grows exponentially with the length L due to the compatibility between haplotypes and xor-genotypes. That is, finding the solution of a length-L xor-genotype requires performing the greedy search over Z, which consists of 2^{L} haplotypes. To mitigate the computational complexity we employ the partition-ligation method [28] as in [25], where the block partitioning is based on identifying the recombination hot spots [29] existing between the haplotype blocks [30]. After partitioning, the SNP sequences are divided into blocks such that the haplotype diversity within each block is as low as possible.
The haplotype diversity of a given block is measured by its Shannon entropy. The block partitioning by minimizing the total Shannon entropy proceeds as follows. Let
The entropy of the haplotype block
and the total entropy of Q blocks, where each block
To determine the initial and ending loci of each block
Missing data
Genotyping errors occur when the observed genotype of an individual differs from the original sequence, which can happen for various reasons [31,32]. A particular type of genotyping error is the case where some loci are not observed (missed) during sequencing or other processing steps. Although methods dealing with some types of errors have been proposed, erroneous genotypes are often produced with significant missing/error rates [33]. Therefore, it is highly important for an xor-haplotyping technique to be able to resolve such databases with missing sites. We next present a modification to XHSD for performing xor-haplotyping on individuals with missing data.
Let
where
where
Different weight functions could be employed to exploit the distribution of missing sites. Since the missing sites in our experiments are uniformly distributed across the SNPs and individuals, the function in (14) gave good performance.
The proposed method does not directly infer the missing sites, i.e., it does not impute missing genotypes [34]. However, the missing values in each xor-genotype can be recovered from the solution by simply looking at the haplotype pair inferred for each individual. Since the proposed method is robust against missing data, as presented in the next section, the inferred solution is sufficient to type the missing genotype sites. An implementation of the proposed method, with the aforementioned extensions, is provided in “Additional file 1”.
Additional file 1. Matlab implementation. This file includes the Matlab code of the proposed algorithm, and an implementation with the example database, CFTR.
Format: ZIP. Size: 64KB.
Results and discussion
We tested the performance of several xor-haplotyping methods with a number of metrics. First we measured the probability of error (P_{e}), i.e., the percentage of individuals whose inferred pair of haplotypes differs from the original pair. This measure is sensible for assessing the inference quality in the regular haplotyping problem, since the alleles at homozygous loci are known and only the heterozygous loci are ambiguous, so performance depends on the inference accuracy at heterozygous loci. In xor-haplotyping, however, there are a large number of solutions equivalent to the original one up to bit flipping, and it is thereby very likely that a solution set differs from the original phasing on at least one SNP. In particular, for a given xor-genotype, even a single SNP difference (namely a bit flip) between the pair of inferred haplotypes and the pair of haplotypes that originally gave rise to that xor-genotype is counted as a misinference. A more sensible metric, therefore, would take into account the percentage of SNPs where the inference differs from the true phasing. In that sense, the switch error rate (swr) [35] is a proper metric that counts the minimum number of switches required for heterozygous loci to change to the correct alleles of the original haplotypes. It gives a sense of how close the inference is, i.e., as a ratio of total misinferred heterozygous loci
Moreover, to assess the accuracy on homozygous sites, we employ the prediction error rate (err_{p}) [23], computed as the fraction of incorrectly predicted hidden-homozygous sites out of all hidden-homozygous sites, i.e.,
We performed xor-haplotyping on various data sets, with and without missing information on loci: synthetic data sets with different recombination rates simulated by the coalescence-based program of [36], a database consisting of the SNPs in the CFTR gene, which is associated with the cystic fibrosis (CF) disorder [37], and another database (ANRIL) containing SNPs with relatively lower linkage disequilibrium (high polymorphism). We tested xor-haplotyping methods based on different assumptions, including the parsimony graph realization model PPXH [24], the parsimony genetic search model XOR-HAPLOGEN (XHAP) [23], the graph representation model GREAL [15], and an integer programming approach, PolyIP [38]. Of these four methods, the last two proved impractical. GREAL failed to find solutions for data sets with reasonably long sequences (more than 30 SNPs), and PolyIP is often computationally inefficient even on simple problems (e.g., it takes more than 24 hours to solve a set of 50 individuals with 30 SNPs).
Synthetic data
Based on different recombination rates, three scenarios are considered in the synthetic data sets: no recombination (r = 0), and recombination with rates r = 4 and r = 40, respectively. The recombination rate is the rate at which the haplotypes of an individual exchange sequence fragments, for reasons such as crossing-over events. This is simulated by a model given in Hudson’s software [36]. For each scenario we generated 100 different data sets by random pairing of a set of simulated haplotypes of different lengths (5 ≤ L ≤ 46) for a given population size. This is repeated for different population sizes as well, N ∈ {10, 20, 30, 40, 50}.
Figure 3 displays the performance of the different methods on short data sets (L<14) when only xor-genotypes are used. The quality of inference is determined exhaustively after removing all bit flip degrees of freedom by looking for the best equivalent set of a particular inference, i.e., performing an exhaustive search to find the bit flipping that gives the result closest to the true phasing of the xor-genotypes. Such an evaluation shows the best inference performance of the different methods without the help of regular genotypes. Compared to the other methods, XHSD can potentially resolve a set of xor-genotypes with comparably low error rates. Moreover, XHSD achieves the lowest switch error rates, especially for large data sets, indicating better accuracy (i.e., similarity with the true haplotypes) for the initial inference given only the xor-genotypes.
Figure 3. Potential inference quality on short (L <14) synthetic data.
To evaluate the inference quality when regular genotype data are available, we first determined a limited number of regular genotypes by the MTI method, i.e., the smallest set of regular genotypes that have an empty intersection on the heterozygous SNPs, and then resolved the ambiguity by bit flipping on the initial inference according to these regular genotypes (Figure 4). This test evaluates how the methods deal with the bit-flip degree of freedom under very limited regular genotype data that, in theory, suffice to resolve all SNPs. Given the long xor-genotype data sets (5 ≤ L ≤ 46), block partitioning is applied in XHSD by limiting the maximum block size to W = 8 SNPs. Figure 4 suggests that XHSD has the best potential to make a highly accurate inference when regular genotypes are introduced. We also applied the proposed XHSD framework shown in Figure 2 to the same data set, where 2 xor-genotypes are replaced with regular genotypes. Note that the proposed XHSD achieves a significant decrease in P_{e} despite the small augmentation of the data by only 2 regular genotypes, compared to using them in post-processing, i.e., XHSD (bit flipping).
Figure 4. Performance on long (5 ≤ L ≤ 46) synthetic data by bit flipping via 2 regular genotypes.
It is worthy of noting that the algorithms based on segmentation may deteriorate when processing long xorgenotype sequences, especially with increasing recombination rates where the detection of haplotype blocks is complicated [39]. We used block partitioning (segmentation) in XHSD to reduce complexity when processing long xorgenotype sequences. In Figure 4 the segmentation effect is noticeable particularly in very high recombination rates, i.e., r = 40. However, in general scenario, i.e., r ≤ 4, we can say that the segmentation effect is not significant for the proposed method’s performance, and it outperforms XORHAPLOGEN in most data sets containing typical recombination rates.
For more practical results we added regular genotypes in each method with different percentages of the population and allowed the methods to remove ambiguity by their own, except for PPXH. Since PPXH cannot make use of regular genotypes directly, we applied bit flipping using the MTI solver to remove ambiguity for this method. To regularly genotype a given percentage of the population, the regular genotypes are determined by running the MTI method several times until the number of distinct regular genotypes obtained achieves the given percentage of the total number of individuals.
Figure 5 shows the performance on synthetic data from a large population of 50 individuals with zero recombination rate, where between 10% (5 individuals) and 100% (50 individuals) of the population are regularly genotyped. XHSD outperforms the other methods in almost all cases. In particular, after 20% of the population is regularly genotyped, XHSD makes immediate use of the regular genotypes and significantly improves the accuracy on both homozygous (err_{p}) and heterozygous sites (swr). We conclude that the parsimony principle of the XHSD method is well suited for inferring the heterozygous sites, and that a small percentage of regular genotypes usually suffices for predicting the homozygous sites.
Figure 5. Performance on long (5 ≤ L ≤ 46) synthetic data from 50 individuals by employing different numbers of regular genotypes.
Missing data
We investigated the capability of the various methods to deal with missing data under different circumstances. Since the methods performed similarly under zero recombination rate, we used the same data sets with no recombination to generate the database with missing entries. A SNP site of an individual is defined as “missing” with probability P_{miss}, and the data sets for different percentages of missing SNPs are generated accordingly. The PPXH method is excluded since it cannot handle missing data. In XHSD, block partitioning is applied as before with a maximum block size of W = 8 SNPs.
Figures 6 and 7 show the performance in different scenarios of partial regular genotyping under different rates of missing data. As in the previous plots, each point represents the average value of the corresponding metric over 100 realizations, i.e., 100 different data sets with SNP counts varying between 5 and 46. In most cases, XORHAPLOGEN and XHSD are insensitive to the increased number of missing sites. XORHAPLOGEN is more accurate for small groups of individuals. Nonetheless, when more individuals are available in the database (N > 30), XHSD performs better in all circumstances.
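The missing-data model above can be sketched in a few lines of Python. The marker `MISSING` and the function name `mask_missing` are illustrative assumptions, as is the choice to mask every site independently; xorgenotypes are assumed to be lists of 0/1 values.

```python
import random

MISSING = None  # hypothetical marker for an untyped site


def mask_missing(xor_genotypes, p_miss, rng):
    """Mark each SNP site of each individual as missing with probability p_miss."""
    return [
        [MISSING if rng.random() < p_miss else bit for bit in row]
        for row in xor_genotypes
    ]


rng = random.Random(0)  # fixed seed so the realization is reproducible
data = [[0, 1, 0, 0, 1], [1, 0, 1, 0, 0], [0, 0, 1, 1, 0]]
masked = mask_missing(data, 0.05, rng)
```

Averaging a metric over many such realizations, as done for the plotted points, amounts to repeating this call with different seeds.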
Figure 6. Performance under low rates of missing data, long (5 ≤ L ≤ 46) synthetic data.
Figure 7. Performance under high rates of missing data, long (5 ≤ L ≤ 46) synthetic data.
We examined how the methods depend on the missing data rate for a population with a large number of individuals. Specifically, we took the xorgenotypes of 50 individuals, replaced 30% and 50% of them with regular genotypes, and performed xorhaplotype inference under missing data rates ranging from 0.5% to 5%. As seen in Figure 8, both methods are robust against missing data. On the other hand, XHSD is less dependent on regular genotypes and can achieve better error rates than XORHAPLOGEN while employing even fewer regular genotypes. XORHAPLOGEN needs approximately 20% more regular genotypes to reach the same P_{e} level as XHSD, e.g., regular genotyping of 30% of the population in XHSD is comparable to 50% in XORHAPLOGEN.
Figure 8. Performance under different percentages of missing data.
CFTR gene database
Cystic fibrosis (CF) is an autosomal recessive disorder caused by mutations in the gene that encodes the cystic fibrosis transmembrane conductance regulator protein (CFTR). In [37], various mutations at 23 polymorphic locations on chromosome 7 are identified as the disease loci for CF. We used this database, which corresponds to 29 distinct haplotypes, to generate random xorgenotypes. By combining the haplotype pairs at random we generated the xorgenotypes for a given number of individuals N, and repeated the process for different population sizes, i.e., N ∈ {100, 200, 300, 400}. In this database, the data sets with a small number of individuals present high haplotype diversity, i.e., many of the distinct haplotypes are used only once in the generation of individuals. Therefore, the larger data sets, which have low haplotype diversity, are expected to be solved with higher accuracy by biologically oriented methods such as XORHAPLOGEN, which bases its inference on a multilocus linkage disequilibrium (LD)-based block identification model.
We tested the performance of each method on this database with and without missing sites (P_{miss} ∈ {0, 5%}). The PPXH method was excluded from the missing data analysis since it cannot deal with missing data. XHSD is applied with block partitioning and a maximum block length of W = 8 SNPs as before. Figure 9 shows that XHSD outperforms the other methods across population sizes, with significantly lower error rates. As xorgenotypes are taken from more individuals, the inference accuracy immediately improves for XHSD and XORHAPLOGEN, whereas PPXH does not benefit from the additional data.
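As a minimal sketch of this generation step, the Python snippet below pairs haplotypes drawn from a pool and XORs them site by site. Drawing both haplotypes uniformly and with replacement is an assumption, since the exact sampling scheme is not stated here, and `random_xor_genotypes` is a hypothetical name; the toy pool stands in for the 29 CFTR haplotypes.

```python
import random


def random_xor_genotypes(haplotype_pool, n_individuals, rng):
    """Pair haplotypes at random and return the pairs with their xorgenotypes.

    A xorgenotype has 1 where the two haplotypes differ (heterozygous site)
    and 0 where they agree (homozygous site).
    """
    pairs = [
        (rng.choice(haplotype_pool), rng.choice(haplotype_pool))
        for _ in range(n_individuals)
    ]
    xors = [[a ^ b for a, b in zip(h1, h2)] for h1, h2 in pairs]
    return pairs, xors


# Toy pool of three haplotypes over four SNPs (illustrative values only).
pool = [[0, 0, 1, 0], [1, 0, 1, 1], [0, 1, 0, 0]]
pairs, xors = random_xor_genotypes(pool, 100, random.Random(42))
```

The returned pairs serve as ground truth against which an inferred phasing can be scored.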
Figure 9. Performance on CFTR gene database with different population sizes with/without missing data.
Figure 10 shows the average running time of each method on this database. XHSD scales similarly as the data set grows and shows running times comparable to those of XORHAPLOGEN. Although PPXH runs significantly faster, its error rates remain high and it does not provide accurate inferences.
Figure 10. Running times on CFTR gene database with different population sizes with/without missing data.
Typing errors
Combinatorial optimization techniques are known for their sensitivity to genotyping errors [40]. We therefore tested the effect of typing errors on the proposed algorithm using the CFTR gene database. We defined a SNP site of an individual as erroneous with probability P_{err}, and typed the site as either homozygous or heterozygous with equal probability. We then ran the algorithms without providing the positions of the erroneous sites. We excluded the PPXH method due to its low performance on the CFTR database. Figure 11 illustrates the algorithms' performance under typing errors with P_{err} = 2%. XORHAPLOGEN is more robust against typing errors because of its statistical nature. Nonetheless, the proposed XHSD algorithm can deal with data containing ∼2% typing errors, with a small increase in the error rates compared to the results without typing errors.
Figure 11. Performance on CFTR gene database for different population sizes, with P_{err}=2%, with/without missing data.
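The typing-error model used in this experiment can be sketched as below. The function name `add_typing_errors` and the 0/1 encoding (0 for homozygous, 1 for heterozygous) are assumptions for illustration. Because an erroneous site is retyped with equal probability for either value, about half of the sites selected as erroneous end up with their original value.

```python
import random


def add_typing_errors(xor_genotypes, p_err, rng):
    """With probability p_err, retype a site as homozygous (0) or
    heterozygous (1) with equal probability; otherwise keep it."""
    return [
        [rng.randint(0, 1) if rng.random() < p_err else bit for bit in row]
        for row in xor_genotypes
    ]


rng = random.Random(7)
clean = [[0, 1, 0, 0, 1, 0], [1, 0, 1, 0, 0, 1]]
noisy = add_typing_errors(clean, 0.02, rng)
```

The positions that were altered are deliberately not returned, matching the setting in which the algorithms receive no knowledge of the erroneous sites.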
ANRIL database
The performance of haplotyping methods can deteriorate on databases with decreasing linkage disequilibrium (LD) rates. A SNP database with low pairwise-LD scores was investigated in the association study in [41] for susceptibility to certain types of leukemia. This database includes 16 SNPs from chromosome 9p21 associated with several diseases and a SNP locus encoding the antisense noncoding RNA in the INK4 locus (ANRIL) [42]. We used the corresponding haplotype data from the HAPMAP database (http://hapmap.ncbi.nlm.nih.gov/) collected from 90 European individuals. We generated the xorgenotypes for the individuals from their haplotype pairs and tested the algorithms on this database. Figure 12 shows that the algorithms deteriorate when inferring haplotypes over low-LD SNPs. XHSD performs very similarly to XORHAPLOGEN, and both methods outperform PPXH on this database.
Figure 12. Performance on ANRIL gene database with different population sizes with/without missing data.
Notice that the algorithms cannot reduce the error rates as the number of individuals increases. This can be explained by the very high haplotype diversity in the corresponding low-LD SNP regions. The number of distinct haplotypes explaining the given individuals presumably remains high as the number of individuals grows, whereas methods based on the maximum parsimony principle fail to incorporate this fact. They tend to find parsimonious (low-diversity) solutions for all population sizes, with a decreasing ratio (ρ) of the “total number of distinct haplotypes explaining the given set of individuals” to the “total number of given individuals” as the population size grows. It is worth noting that, in the XHSD results in Figure 12 (P_{miss} = 0), this ratio decreases as ρ = [1.3, 0.95, 0.83, 0.72, 0.66] for populations with 10, 20, 30, 40, 50 individuals, whereas the same ratio for the true phasing (ground truth data) is in fact much higher, i.e., ρ = [1.7, 1.48, 1.34, 1.27, 1.24], respectively, which causes the parsimony-based haplotyping methods to deteriorate on this database. On the other hand, in the high-LD CFTR database, the same ratio for the true phasing is very low due to low haplotype diversity, i.e., ρ = [0.29, 0.14, 0.1, 0.07] for populations with 100, 200, 300, 400 individuals, and the XHSD method achieves very similar ratios, i.e., ρ = [0.43, 0.15, 0.1, 0.07], respectively.
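The ratio ρ follows directly from its definition and can be computed for any phasing, inferred or true, as in the short Python sketch below. The function name `diversity_ratio` and the toy data are illustrative, not taken from the paper.

```python
def diversity_ratio(haplotype_pairs):
    """rho = (number of distinct haplotypes) / (number of individuals)."""
    distinct = {tuple(h) for pair in haplotype_pairs for h in pair}
    return len(distinct) / len(haplotype_pairs)


# Four individuals explained by three distinct haplotypes: rho = 3 / 4.
phasing = [
    ([0, 1], [0, 1]),
    ([0, 1], [1, 1]),
    ([1, 1], [0, 0]),
    ([0, 0], [0, 1]),
]
rho = diversity_ratio(phasing)
```

Since each individual contributes at most two distinct haplotypes, ρ never exceeds 2, so the true-phasing values reported for the ANRIL data sit close to that upper bound for the smallest population.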
Conclusions
In this paper, we have presented a new xorhaplotyping method, XHSD, based on the maximum parsimony principle, which infers the haplotype pairs for each member of a group of unrelated individuals from their xorgenotypes. A dictionary selection method is used to find the smallest set of haplotypes, selected from a candidate set, that can explain the given set of xorgenotypes. The proposed approach requires regular genotypes from only a small percentage of individuals to remove the ambiguity across all SNPs of the inferred haplotypes. The smallest subgroup of individuals with the most informative regular genotypes is efficiently determined by the minimum tree intersection algorithm. Although the inference accuracy was proportional to the percentage of individuals given by regular genotypes, XHSD shows less dependency on regular genotypes than the other methods. Experimental results have demonstrated that XHSD is a reliable method for xorhaplotyping under all tested circumstances, including cases with missing data and typing errors. Low rates of missing values (≤ 10%) in the xorgenotypes often contribute insignificantly to the error rates, and the proposed method can deal with ∼2% typing errors. Particularly for large databases, XHSD produces the most accurate solution, with significantly lower error rates than other low-complexity xorhaplotyping methods. Experiments with the CFTR gene database also showed that our approach performs effectively on real data sets with or without missing sites. Results on another database with particularly low LD rates indicate that the proposed algorithm achieves performance comparable to the best state-of-the-art algorithms. We expect that XHSD can serve as a practical tool for xorhaplotyping on large real-world instances, as large data collections become more available in the era of next-generation DNA sequencing.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
XW and GJ conceived the project. AE, GJ and XW participated in the design of the method. AE performed the computer experiments and contributed to the writing of the draft. All authors read and approved the final manuscript.
References

1. Brookes A: The essence of SNPs. Gene 1999, 234:177-186.
2. Risch N, Merikangas K: The future of genetic studies of complex human diseases. Science 1996, 273:1516-1517.
3. Kwok PY, Gu Z: Single nucleotide polymorphism libraries: why and how are we building them? Mol Med Today 1999, 5:538-543.
4. Gray IC, Campbell DA, Spurr NK: Single nucleotide polymorphisms as tools in human genetics. Human Mol Genet 2000, 9:2403-2408.
5. Peters T, Sedlmeier R: Current methods for high-throughput detection of novel DNA polymorphisms. Drug Discov Today: Technol 2006, 3(2):123-129.
6. Xiao W, Oefner PJ: Denaturing high-performance liquid chromatography: a review. Human Mutat 2001, 17:439-474.
7. Bansal V, Bafna V: HapCUT: An efficient and accurate algorithm for the haplotype assembly problem. Bioinformatics 2008, 24(16):i153-i159.
8. He D, Choi A, Pipatsrisawat K, Darwiche A, Eskin E: Optimal algorithms for haplotype assembly from whole-genome sequence data. Bioinformatics 2010, 26(12):i183-i190.
9. Li Y, Willer CJ, Ding J, Scheet P, Abecasis GR: MaCH: Using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet Epidemiol 2010, 34(8):816-834.
10. Gupta A, Manuch J, Stacho L, Zhao X: Algorithm for haplotype inference via galled-tree networks with simple galls. J Comput Biol 2012, 19(4):439-454.
11. Iliadis A, Anastassiou D, Wang X: A unified framework for haplotype inference in nuclear families. Ann Hum Genet 2012, 76(4):312-325.
12. Lai E, Wang W, Jiang T, Wu K: A linear-time algorithm for reconstructing zero-recombinant haplotype configuration on a pedigree. BMC Bioinformatics 2012, 13(Suppl 17):S19.
13. He D, Han B, Eskin E: Hapseq: An optimal algorithm for haplotype phasing with imputation using sequencing data. J Comput Biol 2013, 20(2):80-92.
14. Kimura M, Crow JF: The number of alleles that can be maintained in a finite population.
15. Barzuza T, Beckmann JS, Shamir R, Pe’er I: Computational problems in perfect phylogeny haplotyping: Typing without calling the allele.
16. Liberatore V: Matroid decomposition methods for the set maxima problem. In Proceedings of the Ninth Annual ACM-SIAM Symposium on Discrete Algorithms. Philadelphia: Society for Industrial and Applied Mathematics; 1998:400-409.
17. Gusfield D: Haplotyping as perfect phylogeny: conceptual framework and efficient solutions. In Proc. 6th Annual Conference on Research in Computational Molecular Biology (RECOMB 2002). New York: ACM; 2002:166-175.
18. Barzuza T, Beckmann J, Shamir R, Pe’er I: Computational problems in perfect phylogeny haplotyping: Xor-Genotypes and tag SNPs. In Combinatorial Pattern Matching, Volume 3109. Edited by Dogrusoz U, Muthukrishnan S, Sahinalp S. Berlin, Heidelberg, New York: Springer-Verlag; 2004:14-31.
19. Patil N, et al.: Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science 2001, 294:1719-1723.
20. Lancia G, Pinotti MC, Rizzi R: Haplotyping populations by pure parsimony: Complexity of exact and approximation algorithms. INFORMS J on Comput 2004, 16:348-359.
21. Gusfield D: Haplotype inference by pure parsimony. In Comb Pattern Matching. Berlin, Heidelberg, New York: Springer-Verlag; 2003:144-155.
22. Wang L, Xu Y: Haplotype inference by maximum parsimony. Bioinformatics 2003, 19:1773-1780.
23. Sazonova N, Sazonov E, Harner EJ: Algorithm for haplotype resolution and block partitioning for partial XOR-genotype data. J of Biomed Inform 2010, 43:51-59.
24. Bonizzoni P, Della Vedova G, Dondi R, Pirola Y, Rizzi R: Pure parsimony xor haplotyping.
25. Jajamovich GH, Wang X: Maximum-parsimony haplotype inference based on sparse representations of genotypes.
26. Nemhauser GL, Wolsey LA, Fisher ML: An analysis of approximations for maximizing submodular set functions—I. Math Program 1978, 14:265-294.
27. Krause A, Cevher V: Submodular dictionary selection for sparse representation. In Proceedings of 27th International Conference on Machine Learning (ICML 2010). Madison: Omnipress; 2010:567-574.
28. Niu T, Qin Z, Xu X, Liu J: Bayesian haplotype inference for multiple linked single-nucleotide polymorphisms. Am J Hum Genet 2002, 70:157-169.
29. Liang K, Wang X: A deterministic sequential monte carlo method for haplotype inference.
30. Hey J: What’s so hot about recombination hotspots? PLoS Biol 2004, 2:e190.
31. Bonin A, Bellemain E, Bronken Eidesen P, Pompanon F, Brochmann C, Taberlet P: How to track and assess genotyping errors in population genetics studies. Mol Ecol 2004, 13:3261-3273.
32. Pompanon F, Bonin A, Bellemain E, Taberlet P: Genotyping errors: causes, consequences and solutions. Nat Rev Genet 2005, 6:847-859.
33. Hackett CA, Broadfoot LB: Effects of genotyping errors, missing values and segregation distortion in molecular marker data on the construction of linkage maps. Heredity 2003, 90:33-38.
34. Marchini J, Howie B, Myers S, McVean G, Donnelly P: A new multipoint method for genome-wide association studies by imputation of genotypes. Nature Genet 2007, 39(7):906-913.
35. Tininini L, Bertolazzi P, Godi A, Lancia G: CollHaps: a heuristic approach to haplotype inference by parsimony.
36. Hudson RR: Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics 2002, 18:337-338.
37. Kerem B, Rommens J, Buchanan J, Markiewicz D, Cox T, Chakravarti A, Buchwald M, Tsui L: Identification of the cystic fibrosis gene: genetic analysis. Science 1989, 245:1073-1080.
38. Halldorsson BV, Bafna V, Edwards N, Yooseph S, Istrail S: A survey of computational methods for determining haplotypes. In Lecture Notes in Computer Science (2983): Computational Methods for SNPs and Haplotype Inference. Berlin, Heidelberg, New York: Springer-Verlag; 2004:26-47.
39. Stephens M, Scheet P: Accounting for decay of linkage disequilibrium in haplotype inference and missing-data imputation. Am J Human Genet 2005, 76(3):449-462.
40. Excoffier L, Laval G, Balding D: Gametic phase estimation over large genomic regions using an adaptive window approach. Human Genomics 2003, 1:7-19.
41. Iacobucci I, Sazzini M, Garagnani P, Ferrari A, Boattini A, Lonetti A, Papayannidis C, Mantovani V, Marasco E, Ottaviani E, Soverini S, Girelli D, Luiselli D, Vignetti M, Baccarani M, Martinelli G: A polymorphism in the chromosome 9p21 ANRIL locus is associated to Philadelphia positive acute lymphoblastic leukemia. Leukemia Res 2011, 35(8):1052-1059.
42. Pasmant E, Laurendeau I, Heron D, Vidaud M, Vidaud D, Bieche I: Characterization of a germline deletion, including the entire INK4/ARF locus, in a melanoma-neural system tumor family: Identification of ANRIL, an Antisense Noncoding RNA whose expression coclusters with ARF. Cancer Res 2007, 67(8):3963-3969.