Email updates

Keep up to date with the latest news and content from BMC Genetics and BioMed Central.

Open Access Methodology article

iXora: exact haplotype inferencing and trait association

Filippo Utro1, Niina Haiminen1, Donald Livingstone24, Omar E Cornejo3, Stefan Royaert2, Raymond J Schnell24, Juan Carlos Motamayor4, David N Kuhn2 and Parida Laxmi1*

Author affiliations

1 Computational Biology Center, IBM T J Watson Research, Yorktown Heights, NY, USA

2 USDA-ARS SHRS, Miami, FL, USA

3 Stanford University, Stanford, CA, USA

4 MARS, Incorporated, Miami, FL, USA

For all author emails, please log on.

Citation and License

BMC Genetics 2013, 14:48  doi:10.1186/1471-2156-14-48


The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1471-2156/14/48


Received:7 October 2012
Accepted:14 May 2013
Published:6 June 2013

© 2013 Utro et al.; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

We address the task of extracting accurate haplotypes from genotype data of individuals of large F1 populations for mapping studies. While methods for inferring parental haplotype assignments on large F1 populations exist in theory, these approaches do not work in practice at high levels of accuracy.

Results

We have designed iXora (Identifying crossovers and recombining alleles), a robust method for extracting reliable haplotypes of a mapping population, as well as parental haplotypes, that runs in linear time. Each allele in the progeny is assigned not just to a parent, but more precisely to a haplotype inherited from the parent. iXora shows an improvement of at least 15% in accuracy over similar systems in literature. Furthermore, iXora provides an easy-to-use, comprehensive environment for association studies and hypothesis checking in populations of related individuals.

Conclusions

iXora provides detailed resolution in parental inheritance, along with the capability of handling very large populations, which allows for accurate haplotype extraction and trait association. iXora is available for non-commercial use from http://researcher.ibm.com/project/3430 webcite.

Keywords:
Haplotype; Phasing; Phenotype Association; Trait Association; QTL; Randomization

Background

We address the task of extracting accurate haplotypes from genotype data of individuals of large F1 populations for mapping studies. Haplotypes are useful for inferring the underlying causal genetic basis of the traits in mapping populations as one can more efficiently evaluate the parental inheritance of the haplotype implicated in the determination of the trait [1,2]. iXora is specifically suited to plant (or animal) breeding, in which mapping populations of individuals of inbred (or non-inbred) parents are utilized. iXora uses a novel approach by effectively utilizing the large data size and exploiting the fortuitous combinatorial structure in the problem. The algorithm is outlined in the next section with a running example and the mathematical details are presented in Methods.

Given genotypes of n progeny on m loci, the general problem of constructing haplotypes from genotypes is NP-complete, under various models such as parsimony, maximum likelihood, phylogeny (see [3] for a detailed exposition). Both statistical and combinatorial frameworks have been used in the literature to solve the problem of haplotype extraction from genotype data. For instance, BEAGLE [4,5], fastPHASE [6], HAPI [7], HAPI-UR [8], MACH [9], and SHAPEIT [10,11] use a Hidden Markov Model; whereas Gusfield [12] proposes a combinatorial approach that is based on the parsimony principle. Merlin [13] uses pedigree data under a parsimony model to construct the haplotypes of F1 progeny based on their genotypes and the genotypes of the parents. A review of phasing methods, particularly applicable on human data, is presented in [14]. Based on the models of the input population, the existing methods can be categorized into the following scenarios: unrelated individuals, unrelated trios (two parents per one progeny), and related trios (two parents per several progeny, our F1 population of interest). These categories are discussed in more detail in Methods.

We compare and contrast iXora with existing phasing programs in literature, summarizing the results in Table 1. The results are described in the Section “Comparison with related methods”. The existing methods are unable to take advantage of the availability of large F1 population data without fragmenting it. Through simulation studies we show an improvement of about 15% accuracy in parental haplotype assignment over the next best method. Moreover, iXora runs in linear time and is robust enough to give the same level of accuracy even when the marker data of parents is completely absent.

Table 1. Definition and size of the classes

iXora provides a user-friendly, easy-to-use comprehensive environment for mapping studies. The analysis framework and user interface are described in Additional file 1: Figures A1 and A2. Its usage is described in the section “Using iXora” in Additional file 1, with a running example, where we also demonstrate that genomic regions can be associated with a phenotype at a much higher resolution with haplotypes than with genotypes. The iXora framework has been successfully applied to real data analysis [15], in which pod color phenotype was localized, with high resolution, to a single locus in the T. cacao genome.

Additional file 1. Additional text and figures. The file contains an example on using iXora on a simulated phasing and trait association scenario. Additionally, the file includes visualizations of the iXora framework and user interface. The file also contains technical details on the comparison with related methods.

Format: PDF Size: 2.2MB Download file

This file can be viewed with: Adobe Acrobat ReaderOpen Data

Results and discussion

In this section we outline the main results: the iXora phasing algorithm and its comparison with related methods in literature.

Outline of the core algorithm

We give an overview of the iXora phasing algorithm here, while the details are described in Methods. The different steps of the algorithm, based on a parsimony principle, are shown in Figure 1. Note that each mathematical observation, with the details in Methods, is abbreviated as Obs in the figure and in the description below. To aid in the exposition, we use a concrete example and take the reader through the different steps of the algorithm. The following input progeny genotype matrix, <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M1">View MathML</a>, is a toy example for illustrative purposes only, with just four individuals (labeled i-iv) and four bi-allelic markers:

<a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M2">View MathML</a>

thumbnailFigure 1. Outline of the iXora phasing approach. The eight steps in the iXora haplotype extraction algorithm. Eqn and Obs refer to the Equations and Observations discussed in Methods. The task is to estimate the haplotypes of the two parents, say a and b, as well as those of the four progeny.

Phase I (Preprocessing)

In Step 1, we initialize two 4 × 4 matrices Ma and Mb based on input <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M3','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M3">View MathML</a>. Here a heterozygous position is represented by “X” while a homozygous position is chosen to be 0. Note that if a column has two types of homozygous genotypes, such as TT and CC (as in first column of <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M4','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M4">View MathML</a>), then the first is represented by 0 and the second by 1.

<a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M5','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M5">View MathML</a>

In this running example, we assume that parent genotypes are missing. That is, the entire genotypes of both the parents are unavailable. So, in Step 2, we take a first stab at guessing the parents’ haplotypes, labeled Va and Vb, each with two haplotypes indexed as 0 and 1. Since the first marker (j = 1) has two types of homozygous genotypes (i.e., TT and CC), it is assumed to be heterozygous in both parents. Also since TT is encoded as 0 in Step 1, the corresponding parent haplotype (index 0) in <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M6','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M6">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M7','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M7">View MathML</a> is T, i.e., <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M8','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M8">View MathML</a>. CC is encoded as 1, thus <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M9','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M9">View MathML</a>. The second marker (j = 2) has only one homozygous genotype (i.e., TT), and is assumed to be homozygous in only one of the parents. Similarly, the third marker (j = 3) is homozygous in only one of the parents, but the fourth (j = 4) marker is assumed to be homozygous in both parents. Thus only the second and the third marker need to be resolved further and this is summarized as follows (currently unresolved values denoted by “?” and homozygous by “H”):

<a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M10','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M10">View MathML</a>

Without loss of generality, let the second marker be homozygous in parent b. The third marker is resolved relative to the second marker based on a global polarization rule (Obs 2 in Methods). This heuristic works on column 2 (j) and column 3 (<a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M11','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M11">View MathML</a>) to compute <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M12','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M12">View MathML</a> that tracks the four counts corresponding to the four haplotype pairs 00, 01, 10 and 11 labeled as c00, c01, c10 in c11 respectively (count cxy is the number of individuals where marker at j is inherited from haplotype x and the marker at j is inherited from haplotype y of the same parent). In this rather simple example, all the four counts are zero, and thus <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M13','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M13">View MathML</a> is not polarized. Then the second and third markers are homozygous in different parents. Hence the third marker is homozygous in parent a. This constitutes Step 3 and the results are encoded in the two parents as follows (here “X” denotes heterozygous, “H” denotes homozygous locus and “-” denotes homozygous in both parents):

<a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M14','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M14">View MathML</a>

and

<a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M15','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M15">View MathML</a>

With these assignments of the parent haplotypes, the progeny haplotype assignment matrices Ma and Mb are updated in Step 4 as shown below. Note that, if the genotypes of the parents are available then Steps 2-3 are redundant and Va, Vb are initialized directly using the input genotypes of the parents.

<a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M16','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M16">View MathML</a>

Phase II (Detecting crossovers)

A marker j is homogenous in matrix Mp if the entire column j in Mp is marked as H. In Step 5, the polarization rule is applied again to pairs of adjacent non-homogenous markers (j and jnxt) in each of the matrices Ma and Mb separately. These result in possible switching of marker values in the parent haplotypes Vp and the progeny at that marker in Mp. The reader will observe that no switching is made in Ma. However, in Mb, for j = 1 and jnxt = 3, the four counts are c01 = 1, c10 = 1 and c00 = c11 = 0. Since <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M17','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M17">View MathML</a>, marker jnxt is switched in Vb and Mb to yield the following:

<a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M18','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M18">View MathML</a>

Some systematic transitions (Obs 5-6 in Methods) are applied to the non-numeric elements of the Ma and Mb, in Step 6, to obtain the following:

<a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M19','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M19">View MathML</a>

Phase III (Staging output)

In this toy example, we can simply transform e0 to 0 and e1 to 1 to obtain the phasing result in Ra and Rb (Steps 7-9):

The parent haplotypes are encoded in Va, Vb respectively and the progeny haplotypes in Ra and Rb. The solution shows no recombinations in the progeny, but has two errors between the phased sequences and the observed genotypes, shown in red. We omit the evaluation of the results (precision measures) here, and direct the reader to Methods.

Comparison with related methods

Here we describe the results from a simulation study on a F1 population with 200 individuals and 300–600 markers. The parameters of the simulation were chosen to reflect real data and the details are described in the section “Using iXora” in Additional file 1.

We compare iXora with the existing phasing methods BEAGLE [5], fastPHASE [6], FMPH [12], HAPI [7], HAPI-UR [8], MACH [9], Merlin [13], and SHAPEIT2 [11]. Each existing method solves a slightly different phasing problem, such as not providing a parental haplotype assignment for the progeny, or not processing the entire population in one run, as discussed in Methods. Hence we used evaluation criteria that enable a meaningful comparison of this wide-spectrum of methods, when applied on simulated data. The evaluation criteria are described in Methods, and the results are summarized in Table 1.

Accuracy of Parent Assignments (PA)

This accuracy is measured on a marker-by-marker basis. We post-process the output of the systems that do not directly provide parental assignment. The best parental assignment is seen with BEAGLE and HAPI-UR, followed by iXora. In the two former cases, there are no unassigned markers. HAPI and SHAPEIT2 show moderate accuracy while Merlin, fastPHASE, and MACH perform poorly. Note that Merlin’s performance deteriorates with the increase in number of markers, while HAPI and iXora display similar levels of accuracy.

Handling missing data/imputation

All the methods, except HAPI, show some capability of handling missing data. Merlin has about a third of the missing data unresolved, while iXora has less than 10% unresolved. The rest of the methods resolve all the missing data. BEAGLE, HAPI-UR and iXora display levels of accuracy in the imputed data larger than 90% while the rest perform poorly. Note that these values only account for missing data in the progeny. We found that missing data in the parents were debilitating for all the trio based methods, except Merlin and iXora. These two methods were the only ones that produced some results when all the marker data of both parents were missing. Since Merlin can handle only a small number of individuals per parent, about 15% of the parent haplotypes remained unresolved. We observed that iXora is the only method robust enough to be unaffected by completely missing parent genotypes. We attribute this resilience to its ability to handle large families of individuals without splitting into smaller sets.

Accuracy of Parent Haplotype Assignments (PHA)

Note that PHA is the most important computation since this crucially contributes to the improvement in accuracy and resolution in genomic region assignment to traits (see “Using iXora” in Additional file 1 for an example). In Table 1 the corresponding column, labeled PHA, is shown in bold and is the focus of the comparison study.

The PHA accuracy is measured on a marker-by-marker basis. Only HAPI, Merlin, and iXora provide an assignment of the parental haplotypes. Note that although SHAPEIT2 utilizes trios, it did not give us any means to extract parent haplotype information from the output. Both HAPI and Merlin perform poorly, with accuracy under 85% and 70% respectively. In contrast, iXora yields an accuracy of over 95%.

Although, HAPI and Merlin give means of identifying the parent haplotypes, they suffer a severe scaling problem, and are unable to handle more than about ten progeny per family. Thus it is not obvious how these systems can be coaxed to exploiting the availability of large progeny to improve the accuracy of the parental haplotype assignments.

Conclusions

From the comparison with related methods, we conclude that while methods to the problem of inferencing parental haplotype assignments on large F1 populations exist in theory, these approaches do not work in practice at high levels of accuracy (say > 90%). Moreover, iXora is the only algorithm that is robust enough to accurately extract the parental haplotypes in the absence of any parental genotype information. In practice, when the genotypes of the parents were known, we used this capability of iXora to match the estimated parent genotypes against the true genotypes to confirm the integrity of the phasing results. iXora additionally outputs several intrinsic measures of preciseness (the triplet <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M21','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M21">View MathML</a>), and all the equally-likely phasing solutions with annotations (q/Q/*), see Methods. These added capabilities make iXora and its output particularly attractive, over existing methods, for trait association and inferencing studies.

Methods

In this section we describe the mathematical details of the iXora haplotype inference algorithm, the measures used to quantify the precision of the output, and the different downstream processing of the output. We conclude the section with the description of the measures used to compare the results from different phasing algorithms.

iXora algorithm: haplotype inferencing

The outline of the three phases in the iXora algorithm is shown in Figure 1. In this section what follows is a more precise mathematical description of the steps which is presented as a sequence of key observations: thus this describes as well as provides the rationale for each of the steps. Note that Figure 1, annotated by equation and observation numbers, is a road map for the exposition in this section to help the reader understand the description.

Notation

Let <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M22','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M22">View MathML</a> be an n × m matrix that encodes n (> 1) progeny with m (> 1) markers. Each row i represents a progeny and each column j represents a marker. The order of the markers is also captured in the matrix, i.e., <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M23','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M23">View MathML</a> if and only if marker j is located between markers j and j on the chromosome. The jth marker of the ith progeny, denoted <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M24','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M24">View MathML</a> or simply 〈ij〉, also referred to as the position 〈ij〉, is a pair of alleles: each individual allele can be accessed as <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M25','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M25">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M26','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M26">View MathML</a>. Thus if <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M27','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M27">View MathML</a> or simply written as AC, <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M28','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M28">View MathML</a> A and <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M29','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M29">View MathML</a> C. The two observed alleles at marker j (across all individuals) will be denoted as Z and Z. A marker j is polymorphic if Z ≠ Z. We make the assumption that all the markers in <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M30','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M30">View MathML</a> are polymorphic. When <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M31','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M31">View MathML</a> or <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M32','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M32">View MathML</a> the jth marker at individual i, or position 〈ij〉, is said to be homozygous. Similarly, when <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M33','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M33">View MathML</a>, position 〈ij〉 is said to be heterozygous.

We next introduce a definition and notation for conjugacy. Let conjugate of z be written as <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M34','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M34">View MathML</a>, where z is a matrix or a discrete value as defined below. Note that the conjugate of conjugate of z is z, i.e., <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M35','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M35">View MathML</a> always holds. For parents p = a,b, if p = a, then <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M36','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M36">View MathML</a> and vice-versa. Similarly, if k = 0, then <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M37','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M37">View MathML</a> and vice-versa. Thus <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M38','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M38">View MathML</a> (and <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M39','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M39">View MathML</a>). Also, <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M40','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M40">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M41','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M41">View MathML</a> for the parents p = a,b.

Solvability of a given genotype matrix

Assume that there is no more than one crossover, in an individual, between two adjacent positions. If dt is the true number of such crossovers, then 0≤dt ≤ 2(m - 1)n. Let the estimate of dt be dc ≥ 0. Consider a scenario where each position in <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M42','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M42">View MathML</a> is heterozygous. Then there exist an exponentially large number (2mn) of distinct and equally-likely haplotype configurations of the progeny, each with no crossovers. While informative markers in general are chosen for their heterozygosity in data sets, it is observed that the same marker is also homozygous in many a progeny. Thus, in practice, due to Mendelian inheritance [16] it is very unlikely to have a run of markers that are heterozygous in all the progeny. It is this random sprinkling of homozygous positions in <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M43','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M43">View MathML</a>, that makes dc estimation possible. We conducted simulations to study the relationship between dc and the extent of homozygosity in <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M44','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M44">View MathML</a> in realistic data scenarios.

Let e denote the mean number of crossovers in a progeny. We used simulations with values 2 ≤ e ≤ 15. We observed that for this wide range of crossover profiles, the required fraction of homozygous sites in <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M45','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M45">View MathML</a> to get a good estimate of dt (i.e., within 5% of the true value) was bound. The empirical observation is summarized below.

Observation 1.When at least 28% of each subsample of randomly chosen positions in <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M46','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M46">View MathML</a>is homozygous, the estimated dcis within 5% of the exact value dt, i.e., dc0.95dt.

In practice, we have encountered values of n in the range of 50 to 400 and m in the range of 30 to 600. We observe that consistently, at least 50% positions are homozygous and the solutions, obtained using methods of the following sections, displayed e ≈ 2. With decrease in cost and in increase in accessibility of sequencing and genotyping technologies, m could be orders of magnitude larger in the coming years. Note that e is not expected to increase with m. Thus the lower bound estimation scales well with m, since the observation above is independent of m and the algorithm discussed in the later sections is linear in m.

Phase I: Preprocessing

Since all the individuals have the same two parents, let the two parents be a and b. Let Va(0) encode haplotype 0 of parent a and Va(1) encode haplotype 1 of parent a. Similarly, Vb(0) and Vb(1). The two distinct allele values of parent p at marker j are represented by <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M47','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M47">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M48','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M48">View MathML</a>.

At Step 1, Ma and Mb are initialized, for each i and j and p = a,b, as follows. Let the two alleles at marker j be written as Zj and <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M49','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M49">View MathML</a>.

<a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M50','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M50">View MathML</a>

(1)

Also, Va and Vb is initialized as <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M51','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M51">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M52','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M52">View MathML</a>, for each j and p = a,b.

If marker j is homozygous in both parents, then position 〈i,j〉 is heterozygous, for all progeny i. If marker j is not homozygous in both parents, then possible progeny values are ZjZj, <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M53','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M53">View MathML</a>, and <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M54','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M54">View MathML</a>. When both parents are homozygous, it may not be apparent what the allelic values of the individual parents are, but as is seen in the running example of the Overview section, this does not affect the solution. Such loci are marked as “-” in the parents, i.e., <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M55','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M55">View MathML</a>“-” for p = a,b. Also the column j of the matrices are updated as <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M56','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M56">View MathML</a>, for each i. If exactly one parent (either a or b) is homozygous, then only ZjZj, but not <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M57','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M57">View MathML</a>, can be observed in some progeny, while the rest are <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M58','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M58">View MathML</a>. In Step 2, we identify all such markers.

Note that while it is easy to estimate if a marker is homozygous in both parents or heterozygous in both, it is not obvious to estimate the heterozygous parent when exactly one of the parents is so. In Step 3 we identify markers which are homozygous in exactly one parent (i.e., either a or b). The execution of this process is illustrated in the Overview section through the running example and is described in detail here.

Recall that a markerj is homozygous in parent p if all the entries in Mp are H. For a fixed p, for a pair of non-homozygous markers j and j four counts, <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M59','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M59">View MathML</a> are computed for all combinations of k1,k2 ∈ {0,1} as:

<a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M60','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M60">View MathML</a>

In words, c01 is the count of individuals where marker at j is inherited from haplotype 0 of the parent p and the marker at j is inherited from haplotype 1 of the same parent. Similarly, c10, c11 and c00. Let

<a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M61','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M61">View MathML</a>

Let t = c00 + c11 and t = c01 + c10. Also, let tmax = max{t,t} and tmin = min{t,t}. Then this is interpreted as tmax individuals with no crossovers while tmin individuals have a crossover between locations j and j. In practice, we use a stricter condition where the values tmax and tmin need to be well separated. We use two threshold fractions: 0 < α1,α2 ≤ 1. <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M62','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M62">View MathML</a> is said to be polarized if and only if the following hold:

<a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M63','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M63">View MathML</a>

In our implementation we use α1 = 0.3 and α2 = 0.15, which we have observed to yield accurate phasing results on real data, confirmed by the precision measures discussed later. When the polarization condition is violated, it is considered to be an error in data <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M64','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M64">View MathML</a>. In practice, we observed that in all such cases this was due to experimental errors at one of the markers. Hence, we make the following assumption. For parentp and for the pair of non-homozygous markers j and j, <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M65','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M65">View MathML</a>must be polarized. In practice, when this condition is violated, we flag the marker for closer scrutiny in the experiments. A consequence of this assumption is the following.

Observation 2.Let Ja(resp Jb) be the set of markers with heterozygous parent a (resp b). If <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M66','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M66">View MathML</a>is not polarized, then jJpand <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M67','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M67">View MathML</a>.

This observation states that it is possible to computationally obtain two non-overlapping sets of markers, where one set represents the markers that are homozygous in only parent a and the other set represents the markers that are homozygous in only parent b. Thus, when the parent genotypes are not available, this observation is utilized in Step 3 to resolve the markers that are homozygous in exactly in one parent.

Let marker j be homozygous only in parent a based on the above process, then <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M68','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M68">View MathML</a> and column j in Ma and Mb are updated (Step 4) as follows: <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M69','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M69">View MathML</a>, and <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M70','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M70">View MathML</a>. Next, column j in Mp is updated as follows:

<a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M71','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M71">View MathML</a>

(2)

Note that the above is equivalent to: For a marker j that is homozygous only in parent a: <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M72','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M72">View MathML</a> is set to H for all i and if <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M73','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M73">View MathML</a> was X, then it is updated to 1. Similarly, for a marker j that is homozygous only in parent b: <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M74','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M74">View MathML</a> is set to H for all i and if <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M75','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M75">View MathML</a> was X, then it is updated to 1.

Rationale. Note parent a is homozygous at j (but parent b is not). The reason for switching X to 1 in column j of Mb is that no matter what the value of <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M76','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M76">View MathML</a> is, or will be eventually updated to, 1 encodes for the allele Z’.

When marker j is homozygous only in parent b, a similar update as above is performed. Note that in the implementation, as a marker j is resolved (Step 3), the marker is immediately updated in Vp and Mp (Step 4).

Phase II: Detecting crossovers

Without loss of generality, for each non-homozygous j in Mp, let jnxt be the adjacent non-homozygous marker in Mp. Substituting j with jnxt in the definitions from the last section, let tmax and tmin be redefined. Then in Step 5, Vp is phased as follows: For parent p, if <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M77','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M77">View MathML</a>, then the values of <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M78','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M78">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M79','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M79">View MathML</a> are interchanged. Further, column j in Mp is updated by replacing an existing entry of 0 with 1 and an entry of 1 with 0, to reflect the updated <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M80','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M80">View MathML</a>.

Let <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M81','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M81">View MathML</a>. Then the minimal number of crossovers (or recombinations) between j and jnxt in haplotypes inherited from parent p is <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M82','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M82">View MathML</a>. Thus:

Observation 3.If dcis the total number of estimated crossovers, then

<a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M83','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M83">View MathML</a>

Note that dc could be larger than the sum on the left, since some further crossovers may be introduced in the next phase. Since only adjacent markers need to be considered for detecting crossovers, the following holds.

Observation 4.Given <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M84','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M84">View MathML</a>, the haplotype matrices Ma,Mband the parent haplotypes Va,Vbcan be constructed in <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M85','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M85">View MathML</a>time.

Missing values There are often missing values in the data, sometimes as high as 20%. When the value is missing at position 〈ij〉, we make the assignment <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M86','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M86">View MathML</a>, for p = a,b. However, if after the computation of V, marker j is homozygous at parent p, then <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M87','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M87">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M88','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M88">View MathML</a> is appropriately updated. It can be verified that this treatment does not introduce any extra (spurious) crossovers.

Selfed progeny For selfed progeny, not only are the parents the same but even the haplotypes of the parents are deemed identical. Then one of the following conditions holds for p = a,b and a numeric k ∈ {0,1}:

(1) Va=Vb and if <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M89','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M89">View MathML</a> then <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M90','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M90">View MathML</a>, for all i and j, or,

(2) Va ≠ Vb and if <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M91','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M91">View MathML</a> then <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M92','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M92">View MathML</a>, for all i and j.

Thus Mb is defined completely by Ma and thus the task is to estimate only Ma.

Monotonic state transitions Next, the matrices Mp are transformed as follows. A position 〈ij〉 is a heterozygous trio, if the two parents at j are heterozygous and so is <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M93','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M93">View MathML</a>.

Let k represent a numeric value. Define two functions Lt(·) and Rt(·) on an element of the matrix as follows:

<a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M94','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M94">View MathML</a>

Note that a marker j can never be both leftmost and rightmost since the number of markers is at least two, hence both Lt(·) and Rt(·) are well defined. Let Mij = x. Then the transition SxSy is applied to assign the value y to Mij (written as Mijy) as follows.

o <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M95','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M95">View MathML</a>

o If Lt(Mij) = Rt(Mij) = k, then Mij ← ek.

o <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M96','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M96">View MathML</a>

o If Lt(Mij) ≠ Rt(Mij) with Lt(Mij) = k1 and Rt(Mij) = k2, then <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M97','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M97">View MathML</a>

o <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M98','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M98">View MathML</a>, <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M99','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M99">View MathML</a> (note that k1 ≠ k2)

o If Lt(Mij) = Rt(Mij) = k, then Mij ← ek.

o <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M100','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M100">View MathML</a>

o Mij ← k.

o <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M101','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M101">View MathML</a>

o If <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M102','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M102">View MathML</a> is numeric, then <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M103','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M103">View MathML</a>

To estimate the running time of the algorithm, we classify the transitions as intra-transitions and inter-transitions, depending on whether M or <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M104','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M104">View MathML</a> is used respectively. Thus <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M105','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M105">View MathML</a> is the only inter-transition. The above transitions are applied to the elements of Ma and Mb till no more transition is applicable. This final form of matrix is written as F(Ma) or simply Fa (similarly, F(Mb) or simply Fb).

The permissible transitions are captured in the transition diagram in Figure 2. The three possible start states are SH, SX and Sk and the three possible final states are shown boxed. The curly arrow represents the inter-transition while the straight arrows represent the intra-transitions. Note that each Mij undergoes no more than three transitions since there are no cycles in the state transition diagram. Also, each state with multiple outgoing edges have non-overlapping conditions, leading to a unique choice of transition.

thumbnailFigure 2. Transition diagram for computing the final phasing output. The diagram shows the permissible state transitions for computing the phasing result matrices F. The states S are discussed in detail in Methods.

Each element of Fp is in the final state, i.e., for each position 〈ij〉, <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M106','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M106">View MathML</a>, p = a,b. A numeric value of -1 indicates that no information regarding the haplotypes can be deduced. It can be verified that these state transitions induce the following properties on F and V.

Observation 5.For a fixed i and p = a,b, the following hold:

(1) If <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M107','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M107">View MathML</a> is not numeric but <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M108','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M108">View MathML</a> is, then <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M109','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M109">View MathML</a> must hold.

(2) If <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M110','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M110">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M111','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M111">View MathML</a> are both not numeric, then 〈ij〉, must be a heterozygous trio. The converse however is not true.

(3) If <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M112','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M112">View MathML</a> -1, for some j, then <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M113','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M113">View MathML</a> -1, for all j.

Observation 6.

(1) Given Ma and Mb, Fa and Fb are unique.

(2) Fa and Fb can be constructed from Ma and Mb in <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M114','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M114">View MathML</a> time.

To show that Fa and Fb are unique, the iterations can be viewed as of two types: one where only inter-transitions and the other where only intra-transitions are applicable. In the very first iteration, due to the possible start states, no inter-transition can be applied. Thus using uniquely applicable intra-transitions, each <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M115','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M115">View MathML</a> and each <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M116','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M116">View MathML</a> is transformed. When no intra-transition can be applied, the uniquely applicable inter-transitions are applied. Thus the iterations alternate between intra- and inter-transitions. Hence the final forms Fa and Fb are unique. Next, since each entry can go through no more than three transitions, it is possible to obtain Fa and Fb from Ma and Mb in <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M117','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M117">View MathML</a> time using suitable list data structures.

Phase III: Staging output

An optimization problem (e.g., minimizing an appropriate error function) whose solution is associated with an output configuration, such as alignment of multiple sequences or a phylogeny topology or landscape of crossovers in chromosomes, has the added burden of proving stability in the solution. In other words, how distinct in configuration are the next closest solutions? This is usually very difficult to answer, and most methods are inadequate in addressing this. However, due to the very specific nature of our problem, we provide an agglomerate of “best” solutions, so that its stability can be gauged. Our focus is not just on a maximum likely or a most parsimonious solution, but on an “agglomerate” of all (equally-likely) feasible solutions. This is a characteristic not just of the method but, in a sense, that of the data as well.

Suppose there are L > 0 feasible distinct solutions for a given <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M118','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M118">View MathML</a>. Is it possible to generate an agglomerate in a single pair of matrices Ra and Rb that captures all the L solutions ? In the following paragraphs we describe how to construct the agglomerate (Ra and Rb) based on the encodings in Fa and Fb.

Let the conjugate of F be <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M119','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M119">View MathML</a> defined as <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M120','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M120">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M121','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M121">View MathML</a>.d <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M122','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M122">View MathML</a> that encodes all the possible solutions, is constructed from <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M123','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M123">View MathML</a>, p = a,b, as follows. We use new symbols ∗, q and Q in addition to the numeric values (0 and 1) in Rp. For each progeny i and p = a,b:

1. If <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M124','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M124">View MathML</a> for some j, then without loss of generality <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M125','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M125">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M126','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M126">View MathML</a>, for all j.

2. If <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M127','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M127">View MathML</a> then <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M128','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M128">View MathML</a>

3. For each j, if <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M129','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M129">View MathML</a> is numeric and <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M130','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M130">View MathML</a> is not, then <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M131','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M131">View MathML</a>.

4. For each j, where <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M132','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M132">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M133','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M133">View MathML</a> are both not numeric, then

<a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M134','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M134">View MathML</a>

(3)

This is illustrated in the following example.

Example 1.Let <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M135','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M135">View MathML</a>and <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M136','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M136">View MathML</a>, for some j. Further let <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M137','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M137">View MathML</a>and <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M138','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M138">View MathML</a>for some i. Then by Equation 4, <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M139','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M139">View MathML</a>. However if <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M140','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M140">View MathML</a>, then <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M141','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M141">View MathML</a>.

The non-numeric values in Rp encode multiple solutions, possibly with multiple crossovers. Hence we call the contiguous blocks of non-numeric values as dispersion intervals. The details of interpreting these intervals is discussed in the following sections.

Precision measures of iXora output

Note that dc is the number crossovers seen in the result matrices, thus an average number of crossovers for the individuals is dc/n. But these also correspond to an output configuration. Then how precise is the output of iXora for an input genotype matrix <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M142','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M142">View MathML</a>?

The matrices Ra and Rb enable the elicitation of the parameters for the various haplotype distributions across the markers as well as the precision measures. The agglomerate structure aids in defining a stability measure summarized as the metric <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M143','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M143">View MathML</a> (Equation 11). The other inadequacies, such as insufficient information and errors due to imperfect experiments or data-acquisition, are evaluated by the methodology and summarized as <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M144','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M144">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M145','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M145">View MathML</a> respectively (Equations 12-13). The former is the extent of dispersion in position of each crossover over the agglomerate, and the latter is the observed error in the R matrices. The trio define the haplotype precision in the given genotypes.

Stability measure, <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M146','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M146">View MathML</a>

Let r be a run of contiguous values in a row (progeny) of Rp. A dispersion interval is a non-numeric run, sandwiched by the chromosome boundary or numeric value. Let Ra and Rb be runs in the two haplotypes of the progeny and are aligned in the examples discussed below. It is possible that only one of Ra or Rb is a numeric run. Then it is called a asymmetric. But if both are non-numeric, then by construction (Equation 4) ra = rb and the run is symmetric. A switch is defined to occur between q and Q or between Q and the numeric value that sandwiches the run. The number of switches is written as sw(r). The value ∗ is a wild card and can be treated either as q or Q. The switch detection is succinctly described by the following two examples. In the illustrations the dispersion interval is shown in green sandwiched between numeric values (in black) and each switch is marked as a numbered (red) down-arrow.

Example 2.Consider the following run.

The number of switches is 4 as shown.

The wild card may result in different positions of the same switch corresponding to whether it was interpreted as a q or a Q. These distinct positions of the switch is termed the wild card count of a switch. If a switch position is not affected by any wild card, its count is 1.

Example 3.The following run has three wild cards.

The first is treated as a q, the second as a Q and the third can be treated as a q or Q with two possible positions for the third switch.

The wild card counts of the four switches are 1, 1, 2 and 1 respectively.

When sw(r) > 0, the location of the switches may define additional positions 〈ij〉 in R that can updated from non-numeric to numeric. These are the positions that are to the left of the leftmost switch and to the right of the rightmost switch positions. Thus in the run of Example 2, the two q’s on the left cannot take the value 1, hence they can be assigned 0, thus the length of the dispersion interval shrinks from 8 to 6.

Similarly the length of the dispersion interval of Example 3 shrinks from 11 to 9.

The transformed values are shown in bold above. The same process is applied to every dispersion interval to transform the matrices Ra and Rb. The observations from the transformations above is summarized as:

Observation 7.R is such that for each dispersion interval r, if s w(r) > 0, then (1) r begins and ends with Q and (2) the positions of the first and the last switch sandwich the dispersion interval.

In fact, the following is easily verified:

Observation 8.Let r be a dispersion interval and let s = s w(r). Then (1) the total number of crossovers in the interval, across both the haplotypes, is exactly s and each crossover in each haplotype of the configuration is at a position of the switch.(2) If r is asymmetric, then s is zero. In general, s is even and the number of crossovers in each haplotype of each distinct configuration is odd.

In practice, we observed that in all data sets all the dispersion intervals had no switches. There was exactly one instance where sw(r) was 2. Also, the following is easily verified.

Observation 9. Let s = sw(r) for a dispersion interval r. Then if s ≤ 2, there is no additional crossover introduced by r. If s > 2, the number of additional crossovers is s-2.

Let U be the set of all dispersion intervals in R. Then

<a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M153','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M153">View MathML</a>

(4)

Dispersion index,<a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M154','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M154">View MathML</a>

The dispersion index, <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M155','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M155">View MathML</a>, is a measure of the uncertainty in the position of the crossovers. This is computed as an average over all predicted crossovers. Thus

<a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M156','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M156">View MathML</a>

(5)

with

<a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M157','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M157">View MathML</a>

where cl is the wild card count of each switch l in r. When the location of each crossover is known precisely, this index is 0, while value 1 indicates maximum uncertainty in the location.

Error estimate,<a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M158','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M158">View MathML</a>

If <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M159','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M159">View MathML</a> then the position 〈ij〉 has a mismatch if <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M160','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M160">View MathML</a> Note that the mismatch at this position could be flagged as an error or one of <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M161','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M161">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M162','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M162">View MathML</a> can be modified to potentially introduce additional crossover(s). These are undetectable during the lower bound computation and do not overlap with the dispersion intervals. Let N be the number of such mismatches. Then, <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M163','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M163">View MathML</a>, the error estimate in <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M164','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M164">View MathML</a> is defined as

<a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M165','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M165">View MathML</a>

(6)

Also, in our experiments these mismatches were extremely low (less than 0.01) and when followed up turned out to be experimental errors. Hence we have followed the convention that such a mismatch be flagged as a potential error. Then an error, if any, is at 〈ij〉 in F that underwent the transitions <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M166','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M166">View MathML</a>. Note that the converse is not true. Also, an additional crossover, if any, is at 〈ij〉 in F that underwent the transitions <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M167','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M167">View MathML</a>. Again, the converse is not true.

To summarize, the trio (<a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M168','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M168">View MathML</a>) define the haplotype precision in the given genotypes <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M169','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M169">View MathML</a>. Across our experiments (30 distinct data sets) with real data [15], the mean precision trio were observed to be (0,0.002165,0.000299).

Downstream processing of iXora output

Since iXora associates the parent haplotypes (not just the inherited alleles) to each marker in a progeny, it is possible to study the distributions of the inherited parent haplotypes independent of or in association with a phenotype. The details of these and other related postprocessing available in the iXora framework are described here.

Haplotype frequency distributions

One of the important consequences of haplotype inferencing is obtaining the haplotype frequency distribution across the chromosomes. A marker j of progeny i has two alleles, one derived either from haplotype 0 or haplotype 1 of parent a and the other either from haplotype 0 or haplotype 1 of parent b. Since R is an agglomerate, it also contains the non-numeric values that encode multiple configurations. Based on the encoding in R we estimate the expected value of the frequency count and its variance. First, we enumerate the feasible solutions encoded by a non-numeric run r.

Example 4.In each of the following the length of the dispersion interval is 3 and the number of additional crossovers is zero. However, the number of configurations are different based on the structure of the switches. Two runs with zero switches in each:

A run with sw(r) = 2:

The 8 distinct configurations for Example 2 are:

Let wt(r) be the number of distinct solutions encoded by r.

Observation 10.Let s = s w(r). Let c1,c2,…csbe the wild card counts of the switch positions. Let w t(r) be the number of distinct feasible solutions due to r. Then

(1) If  < 2, w t(r) = l e n(r) + 1.

(2) If s  ≥ 2, <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M173','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M173">View MathML</a>, where

<a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M174','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M174">View MathML</a>

(7)

(1) above is easily verified and for (2), Equation 17 is explained through the following example.

Example 5.Consider the following r with s w(r) = 6.

Note that the 6 switches can be shared by the two haplotypes as 1 and 5 (the first column), or as 3 and 3 (second column) as follows.

There are six distinct configurations for the first case and since the two haplotypes can be switched it gives 2 × 6 distinct configurations. For the second there are<a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M177','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M177">View MathML</a>distinct configuration, giving a total of 12+20 distinct solutions due to r.

Expected frequency and variance

Based on R, we estimate the expected value of the frequency count of the haplotype pairs and its variance. If Rij is non-numeric, let Rij = α hold. Thus for each marker j consider the following, for x,y = 0,1,α,

<a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M178','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M178">View MathML</a>

In the absence of any other external information, each of the alternative solutions in R is equally likely. Under this assumption, it is easy to verify the following:

Observation 11.The expected count, <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M179','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M179">View MathML</a>, of each haplotype pair, k1,k2 = 0,1, is:

<a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M180','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M180">View MathML</a>

The variance <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M181','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M181">View MathML</a>of the count of the haplotype pair is approximated as <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M182','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M182">View MathML</a>.

Haplotype-phenotype association analysis

Below, we describe the iXora methodology for associating discrete traits with genomic locations using haplotypes. The same approach can be used for continuous traits, using different statistical tests and randomizations. In general, the phasing output can be used in other types of statistical tests, for example to test for associations between a pair of markers and a phenotype. In the following, let L the number of discrete phenotype values.

Combination of parents We can test the effect of each haplotype pair at a marker with a phenotype as follows. In the case of discrete phenotype with L possible values, iXora constructs for each marker a 4 × L contingency table of haplotype pair & phenotype counts. In the current iXora implementation, we use Fisher’s exact test fisher.test in R [17] to test for association between haplotype pairs & phenotype. The test outputs a p-value for each marker, denoting the significance of the association.

Individual parents We can also investigate the contribution to phenotype of each parent individually. The contingency table in this case is a 2 × L table of haplotype & phenotype counts (a separate table for each parent at each marker). Fisher’s Exact test is used to test for association between haplotypes & phenotype. The test outputs a p-value for each marker and for each parent, denoting the significance of the association.

Significance thresholds via randomization We include a method for directly estimating the background distribution of p-values in the haplotype-phenotype data by randomizing the phenotype labels. If p-values observed in the randomized data are always larger than in the real data, then the findings on real data may indeed be significant. When working with categorical traits, we take into account the size (number of individuals) of each phenotype category in the permutation test. Let S be the size of the smallest category. For the permutation test, we randomly select S individuals from each category and permute their phenotype values. The permutation test is repeated T times, running the same statistical test on the randomized data as on the real data, for each marker. The smallest p-value observed in the randomized data is generated as output and becomes the threshold for significance in the real results.

iXora output visualization

The agglomerate solution from the phasing algorithm can be directly visualized to detect distortions in the data, with or without using phenotypic information. Approaches for visualizing the phasing solution are demonstrated in the following two paragraphs, while the third paragraph describes visualization of haplotype-phenotype associations. The figures shown here as examples stem from the use case described in detail in the Section “Using iXora” in Additional file 1. Therein is detailed the phasing an trait association process for locating the genomic region and specific haplotype that determines the simulated phenotype (height).

Individual haplotypes The individual haplotypes can be directly visualized, for example as the colored haplotype blocks shown in Additional file 1: Figure A3. The visualization indicates the locations of crossovers in each individual, including uncertainty in the crossover locations. As an alternative visualization, Additional file 1: Figure A4 shows iXora output on the frequencies of individual haplotypes per parent. In this case, the individuals are divided into two groups by phenotype, and clear distortion is observed regarding frequency of haplotypes in each group.

Haplotype pairs The agglomerate structure capturing all the equally-likely solutions enables estimation of the possible dispersion of the crossover locations. iXora visualization of the expected frequency distributions of the progeny haplotype pairs is shown in Figure 3 for two phenotype groups. Clear under-representation of two distinct haplotype pairs is observed for each group. This visualization can be used to spot unexpected distortions, whose significance can be further evaluated using statistical tests.

thumbnailFigure 3. Expected haplotype distributions visualization. Expected haplotype frequencies <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M183','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M183">View MathML</a> are shown for the simulated use case detailed in Additional file 1, for the two phenotypic groups: A) tall progeny, B) short progeny. The variance <a onClick="popup('http://www.biomedcentral.com/1471-2156/14/48/mathml/M184','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/14/48/mathml/M184">View MathML</a> due to uncertainty in crossover locations is shown as shaded regions. Clear distortion is visible near marker 30 (marked by the dashed rectangle), evident from under-representation of haplotype combinations involving paternal haplotype H2 in the short progeny (green and yellow lines in B).

Phenotype association Phenotype association for each parent individually is shown in Figure 4. The p-value threshold obtained via randomizations is also shown. In this case it is evident that only one parent is associated with the phenotype, in one genomic region. An example of iXora results from statistical testing for haplotype pairs’ association with phenotype is shown in Additional file 1: Figure A5, and a comparison between genotype and haplotype association results is shown in Additional file 1: Figure A6.

thumbnailFigure 4. Results from Fisher’s exact test for phenotype-haplotype association for A) Father and B) Mother, including the p-value significance thresholds from randomizations, for the simulated use case detailed in Additional file 1. In this case only one region of the genome from the father is significantly associated with the phenotype (marked by the dashed rectangle), according to the Fisher’s exact test and the randomization thresholds. [Legend: real data (red), randomized data (blue), smallest value in randomized data (green)].

Comparison with related methods

Here we first elaborate on the three distinct categories of population models for phasing, and then give details on the comparison of iXora with related phasing methods in literature. Technical details on the settings for each compared method can be found in Additional file 1.

Unrelated individuals (no parent information) These methods treat the input genotypes as samples from a population of unrelated individuals, and do not assign the progeny to parental haplotypes. While they may be applicable to human population data, they are not suitable for our purposes. fastPHASE can be adapted to our problem setting by treating the input as n + 2 individuals from a population originating from four founder haplotype clusters. We adapt MACH similarly, though it has a much larger default value on the number of haplotypes. FMPH (Integer Programming Formulations To Solve Maximum Parsimony Haplotyping) [12] is closest in spirit to iXora, but is computationally very expensive and suitable only for small data sets (up to ≈50×30), although a hybrid approach is suitable for slightly larger data sets (up to ≈150×100). While the limitation on number of individuals can be tackled by splitting the data (as we do for Merlin, HAPI, SHAPEIT2 below), the limitation on the number of markers is debilitating, so we were unable to run FMPH on our data sets.

Unrelated trios These methods allow the definition of family relationships between parents and progeny in the input, with the limitation that each parent has only one progeny. BEAGLE and HAPI-UR (HAPlotype Inference for UnRelated samples) are two such methods. The methods phase the progeny individually in terms of sequences that were transmitted from each parent. Therefore the progeny are not necessarily assigned to a consistent set of parent haplotypes.

Related trios These methods allow defining several progeny originating from the same two parents, thus the underlying algorithms utilize the full sibling information. However, unlike iXora, none of the existing methods was able to use the entire set of progeny per two parents. In our application this size is in hundreds. HAPI and Merlin produce results only on families of about 10 progeny while SHAPEIT2 can only process sizes up to 50. We therefore randomly divided the progeny into sets of appropriate sizes and phased the sets independently. However, we observed that the phasing results for the parents between sets were often inconsistent, affecting the overall accuracy. HAPI and Merlin produce an assignment of progeny to parental haplotypes while SHAPEIT2 does not.

Comparison measures Here we describe the measures that we used to compare the different methods. Since existing phasing methods generate various types of output, we use two different measures so that all the methods are comparable on at least one measure. Our interest was not simply restricted to phasing the progeny genotypes by assigning each allele to the correct parent (PA), but also assigning them to the correct haplotype of that parent (PHA).

First, the phasing accuracy (PA) of progeny is examined, on a marker-by-marker basis, of only the heterozygous positions. We report the number of correctly assigned and the unassigned (unknown) positions as a percentage. BEAGLE, HAPI, HAPI-UR, Merlin, SHAPEIT2 and iXora can be directly compared on this measure for progeny, because they report the parental origin (maternal, paternal) of each allele. To incorporate fastPHASE and MACH also in this comparison, we post-processed their output as follows: progeny haplotypes are labeled as ‘maternal’ and ‘paternal’, using the labeling that minimizes mismatches compared to true maternal and paternal haplotypes. After the post-processing, all methods can be compared on this measure for progeny. The same accuracy evaluation is used to report imputation accuracy, by examining only the phasing for the missing input values.

Second, the accuracy of assigning the correct parental haplotype (PHA) for each progeny allele is examined, again on a marker-by-marker basis. All markers, including homozygous positions are used. For the output of programs where the input had to be split into smaller families, we consider only those subsets of progeny whose parents’ phasing are consistent with the majority parents (see Additional file 1 for details). Again, we report the number of correctly assigned and unassigned (unknown) positions, deeming a progeny position to be correct only when both alleles of the genotype are assigned to the correct parental haplotype. Only HAPI, Merlin, and iXora can be directly compared on this measure for the progeny.

All the simulated data sets are available at the iXora website http://researcher.ibm.com/project/3430 webcite.

Competing interests

The authors declare no competing interests.

Authors’ contributions

FU and NH designed and implemented the framework. FU designed and implemented the user interface. LP designed and implemented the core iXora algorithm for haplotype extraction. NH and FU designed the experiments and performed the analysis described in this paper. OEC contributed to the comparison with existing phasing methods (fastPHASE, SHAPEIT2). DL and SR were instrumental in the algorithm verification on real (Cacao) data. JCM, RJS and DNK contributed to identifying the scope of the iXora pipeline. LP, NH, and FU wrote the paper. All authors read and approved the final manuscript.

Acknowledgements

FU is partially funded by MARS, under the joint cacao project with IBM Research. OEC is partially funded by MARS, Incorporated. The authors would like to thank Dr. Seth Findley and Joseph Conrad Stack, and the anonymous reviewers for providing valuable comments on the manuscript.

References

  1. Browning BL, Browning SR: Efficient multilocus association testing for whole genome association studies using localized haplotype clustering.

    Genetic Epidemiol 2007, 5:365-375. OpenURL

  2. Schaid DJ: Evaluating associations of haplotypes with traits.

    Genet Epidemiol 2004, 27:348-364. PubMed Abstract | Publisher Full Text OpenURL

  3. Bafna V, Edwards N, Lippert R, Yooseph S, Istrail S, Halldórsson B: A survey of computational methods for determining haplotypes. In Computational Methods for SNPs and Haplotype Inference,Volume 2983 of Lecture Notes in Computer Science. Edited by Clark A, Waterman M, Istrail S, Istrail S, Waterman M, Clark A. Berlin: Heidelberg: Springer; 2004:613-614. OpenURL

  4. Browning SR, Browning BL: Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering.

    Am J Human Genet 2007, 81(5):1084-1097. Publisher Full Text OpenURL

  5. Browning BL, Browning SR: A unified approach to genotype imputation and haplotype phase inference for large data sets of trios and unrelated individuals.

    Am J Human Genet 2009, 84(2):210-223. Publisher Full Text OpenURL

  6. Scheet P, Stephens M: A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase.

    Am J Human Genet 2006, 78(4):629-644. Publisher Full Text OpenURL

  7. Williams A, Housman DE, Rinard MC, Gifford DK: Rapid haplotype inference for nuclear families.

    Genome Biol 2010, 11(10):R108. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  8. Williams AL, Glessner J, Hakonarson H, Reich D: Phasing of many thousands of genotyped samples.

    Am J Human Genet 2012, 91(2):238-251. Publisher Full Text OpenURL

  9. Li Y, Willer CJ, Ding J, Scheet P, Abecasis GR: MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes.

    Genetic Epidemiol 2010, 34(8):816-834. Publisher Full Text OpenURL

  10. Delaneau O, Marchini J, Zagury JF: A linear complexity phasing method for thousands of genomes.

    Nature Methods 2012, 9(2):179-181. OpenURL

  11. Delaneau O, Zagury JF, Marchini J: Improved whole-chromosome phasing for disease and population genetic studies.

    Nature Methods 2013, 10:5-6. PubMed Abstract | Publisher Full Text OpenURL

  12. Gusfield D: Haplotype inference by pure parsimony. In Combinatorial Pattern Matching, Volume 2676 of Lecture Notes in Computer Science. Edited by Chávez E, Baeza-Yates R, Chávez E, Crochemore M. Berlin: Heidelberg: Springer; 2003:144-155. OpenURL

  13. Abecasis GR, Cherny SS: Merlin–rapid analysis of dense genetic maps using sparse gene flow trees.

    Nature Genet 2002, 30:97-101. PubMed Abstract | Publisher Full Text OpenURL

  14. Browning SR, Browning BL: Haplotype phasing: existing methods and new developments.

    Nature Rev Genet 2011, 12(10):703-714. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  15. Motamayor JC, Mockaitis K, Schmutz J, Haiminen N, Livingstone DIII, Cornejo O, Findley SD, Zheng P, Utro F, Royaert S, Saski C, Jenkins J, Podicheti R, Zhao M, Scheffler BE, Stack JC, Feltus FA, Mustiga GM, Amores F, Phillips W, Marelli JP, May GD, Shapiro H, Ma J, Bustamante CD, Schnell RJ, Main D, Gilbert D, Parida L, Kuhn DN: The genome sequence of the most widely cultivated cacao type and its use to identify candidate genes regulating pod color.

    Genome Biology 2013, 14(6):R53.

    http://genomebiology.com/2013/14/6/R53 webcite

    PubMed Abstract | BioMed Central Full Text OpenURL

  16. Hartl DL, Clark AG: Principles of Population Genetics. Sinauer Associates: Inc.; 2006. OpenURL

  17. Development Core Team R: R: A Language and Environment for Statistical Computing. Vienna: R Foundation for Statistical Computing; 2006.

    [http://www.R-project.org webcite] [ISBN 3-900051-07-0]

    OpenURL