Computational Biology Center, IBM T J Watson Research, Yorktown Heights, NY, USA
USDA-ARS SHRS, Miami, FL, USA
Stanford University, Stanford, CA, USA
MARS, Incorporated, Miami, FL, USA
Abstract
Background
We address the task of extracting accurate haplotypes from genotype data of individuals of large F_{1} populations for mapping studies. While methods for inferring parental haplotype assignments on large F_{1} populations exist in theory, these approaches do not work in practice at high levels of accuracy.
Results
We have designed iXora (Identifying crossovers and recombining alleles), a robust method for extracting reliable haplotypes of a mapping population, as well as parental haplotypes, that runs in linear time. Each allele in the progeny is assigned not just to a parent, but more precisely to a haplotype inherited from the parent. iXora shows an improvement of at least 15% in accuracy over similar systems in the literature. Furthermore, iXora provides an easy-to-use, comprehensive environment for association studies and hypothesis checking in populations of related individuals.
Conclusions
iXora provides detailed resolution in parental inheritance, along with the capability of handling very large populations, which allows for accurate haplotype extraction and trait association. iXora is available for noncommercial use from
Background
We address the task of extracting accurate haplotypes from genotype data of individuals of large F_{1} populations for mapping studies. Haplotypes are useful for inferring the underlying causal genetic basis of the traits in mapping populations as one can more efficiently evaluate the parental inheritance of the haplotype implicated in the determination of the trait
Given genotypes of
We compare and contrast iXora with existing phasing programs in the literature, summarizing the results in Table
The first row for each method corresponds to 300 markers while the second to 600 markers. The results are averaged over multiple data sets of 200 individuals. Parental haplotype assignment (PHA) was found to be critical in the task of trait association in

| Method | Accuracy %: PA (ua) | Accuracy %: Imp. (ua) | Accuracy %: PHA | Trait assoc. | Time (sec) | Remarks |
|---|---|---|---|---|---|---|
| Unrelated individuals (no parent information) | | | | | | |
| fastPHASE | 60.07 (0.00) | 59.77 (0.00) | NA | No | 78 | |
| | 58.01 (0.00) | 56.55 (0.00) | | | 158 | |
| FMPH | | | NA | No | | Up to 30-100 markers |
| MACH | 52.89 (0.00) | 52.16 (0.00) | NA | Yes | 567 | |
| | 52.49 (0.00) | 50.91 (0.00) | | | 1144 | |
| Unrelated trios | | | | | | |
| BEAGLE | 99.90 (0.00) | 98.61 (0.00) | NA | Yes | 5 | |
| | 99.90 (0.00) | 98.28 (0.00) | | | 10 | |
| HAPI-UR | 99.69 (0.00) | 94.75 (0.00) | NA | No | 3 | |
| | 99.67 (0.00) | 94.88 (0.00) | | | 7 | |
| Related trios | | | | | | |
| HAPI | 90.75 (9.17) | 0.00 (100.0) | 83.88 | No | 0.1 | < 15 progeny/parent |
| | 90.63 (9.29) | 0.00 (100.0) | 83.87 | | 0.2 | |
| Merlin | 70.59 (29.38) | 69.60 (29.47) | 69.74 | Yes | 299 | < 15 progeny/parent |
| | 64.80 (35.18) | 63.72 (35.09) | 63.81 | | 604 | |
| SHAPEIT2 | 87.20 (0.00) | 57.61 (0.00) | NA | No | 70 | < 50 progeny/parent |
| | 90.46 (0.00) | 64.05 (0.00) | | | 148 | |
| iXora | 95.89 (4.05) | 92.11 (7.75) | 95.55 | Yes | 0.3 | |
| | 95.73 (4.21) | 91.43 (8.40) | 95.43 | | 0.8 | |
iXora provides a user-friendly, easy-to-use, comprehensive environment for mapping studies. The analysis framework and user interface are described in Additional file
Additional text and figures. The file contains an example on using iXora on a simulated phasing and trait association scenario. Additionally, the file includes visualizations of the iXora framework and user interface. The file also contains technical details on the comparison with related methods.
Results and discussion
In this section we outline the main results: the iXora phasing algorithm and its comparison with related methods in the literature.
Outline of the core algorithm
We give an overview of the iXora phasing algorithm here, while the details are described in Methods. The different steps of the algorithm, based on a parsimony principle, are shown in Figure
Outline of the iXora phasing approach. The eight steps in the iXora haplotype extraction algorithm. Eqn and Obs refer to the Equations and Observations discussed in Methods. The task is to estimate the haplotypes of the two parents, say
In Step 1, we initialize two 4 × 4 matrices
In this running example, we assume that parent genotypes are missing. That is, the
Without loss of generality, let the second marker be homozygous in parent
and
With these assignments of the parent haplotypes, the progeny haplotype assignment matrices
A marker
Some systematic transitions (Obs 5-6 in Methods) are applied to the non-numeric elements of the
In this toy example, we can simply transform
The parent haplotypes are encoded in
Comparison with related methods
Here we describe the results from a simulation study on an F_{1} population with 200 individuals and 300–600 markers. The parameters of the simulation were chosen to reflect real data and the details are described in the section “Using iXora” in Additional file
We compare iXora with the existing phasing methods BEAGLE
This accuracy is measured on a marker-by-marker basis. We post-process the output of the systems that do not directly provide parental assignment. The best parental assignment is seen with BEAGLE and HAPI-UR, followed by iXora. In the two former cases, there are no unassigned markers. HAPI and SHAPEIT2 show moderate accuracy while Merlin, fastPHASE, and MACH perform poorly. Note that Merlin’s performance deteriorates as the number of markers increases, while HAPI and iXora maintain similar levels of accuracy.
All the methods, except HAPI, show some capability of handling missing data. Merlin leaves about a third of the missing data unresolved, while iXora leaves less than 10% unresolved. The rest of the methods resolve all the missing data. BEAGLE, HAPI-UR and iXora achieve accuracy above 90% on the imputed data while the rest perform poorly. Note that these values only account for missing data in the progeny. We found that missing data in the parents were debilitating for all the trio-based methods, except Merlin and iXora. These two methods were the only ones that produced some results when
Note that PHA is the most important computation since this crucially contributes to the improvement in accuracy and resolution in genomic region assignment to traits (see “Using iXora” in Additional file
The PHA accuracy is measured on a marker-by-marker basis. Only HAPI, Merlin, and iXora provide an assignment of the parental haplotypes. Note that although SHAPEIT2 utilizes trios, it did not give us any means to extract parent haplotype information from the output. Both HAPI and Merlin perform poorly, with accuracy under 85% and 70% respectively. In contrast, iXora yields an accuracy of over 95%.
Although HAPI and Merlin provide a means of identifying the parent haplotypes, they suffer from a severe scaling problem and are unable to handle more than about ten progeny per family. Thus it is not obvious how these systems can be coaxed into exploiting the availability of large numbers of progeny to improve the accuracy of the parental haplotype assignments.
Conclusions
From the comparison with related methods, we conclude that while methods for inferring parental haplotype assignments on large F_{1} populations exist in theory, these approaches do not work in practice at high levels of accuracy (say > 90%). Moreover, iXora is the only algorithm that is robust enough to accurately extract the parental haplotypes in the absence of any parental genotype information. In practice, when the genotypes of the parents were known, we used this capability of iXora to match the estimated parent genotypes against the true genotypes to confirm the integrity of the phasing results. iXora additionally outputs several intrinsic measures of preciseness (the triplet
Methods
In this section we describe the mathematical details of the iXora haplotype inference algorithm, the measures used to quantify the precision of the output, and the downstream processing of the output. We conclude the section with a description of the measures used to compare the results from different phasing algorithms.
iXora algorithm: haplotype inferencing
The outline of the three phases in the iXora algorithm is shown in Figure
Let
We next introduce a definition and notation for conjugacy. Let
Assume that there is no more than one crossover, in an individual, between two adjacent positions. If
Let
Observation 1.
In practice, we have encountered values of
Since all the individuals have the same two parents, let the two parents be
At Step 1,
Also,
If marker
Note that while it is easy to estimate whether a marker is homozygous in both parents or heterozygous in both, it is not obvious how to determine which parent is heterozygous when exactly one of them is. In Step 3 we identify markers which are homozygous in exactly one parent (i.e., either
Recall that
In words,
Let
In our implementation we use
Observation 2.
This observation states that it is possible to computationally obtain two non-overlapping sets of markers, where one set represents the markers that are homozygous in only parent
Let marker
Note that the above is equivalent to: For a marker
When marker
Without loss of generality, for each non-homozygous
Let
Observation 3.
Note that
Observation 4.
Missing values There are often missing values in the data, sometimes as high as 20%. When the value is missing at position 〈
Selfed progeny For selfed progeny, not only are the parents the same but even the haplotypes of the parents are deemed identical. Then one of the following conditions holds for
(1)
(2)
Thus
Monotonic state transitions Next, the matrices
Let
Note that a marker
o
o If Lt(
o
o If Lt(
o
o If Lt(
o
o
o
o If
To estimate the running time of the algorithm, we classify the transitions as
The permissible transitions are captured in the transition diagram in Figure
Transition diagram for computing the final phasing output. The diagram shows the permissible state transitions for computing the phasing result matrices
Each element of
Observation 5.
(1) If
(2) If
(3) If
Observation 6.
(1) Given
(2)
To show that
An optimization problem (e.g., minimizing an appropriate error function) whose solution is associated with an output configuration, such as alignment of multiple sequences or a phylogeny topology or landscape of crossovers in chromosomes, has the added burden of proving stability in the solution. In other words, how distinct in configuration are the next closest solutions? This is usually very difficult to answer, and most methods are inadequate in addressing this. However, due to the very specific nature of our problem, we provide an agglomerate of “best” solutions, so that its stability can be gauged. Our focus is not just on
Suppose there are
Let the conjugate of
1. If
2. If
3. For each
4. For each
This is illustrated in the following example.
Example 1.
The non-numeric values in
Precision measures of iXora output
Note that
The matrices
Let
Example 2.
The number of switches is 4 as shown.
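As a rough illustration of how switches can be counted in a progeny's haplotype assignment, the following is a minimal sketch and not the iXora implementation. It assumes a hypothetical encoding in which each marker carries a label 0 or 1 (the inherited parental haplotype) or the wild card `"*"` for an unresolved position, and it assumes, in the spirit of the parsimony principle, that a run of wild cards hides a switch only when the labels flanking it differ. The function name `find_switches` and the example sequence are illustrative.

```python
WILD = "*"  # hypothetical wild-card symbol for an unresolved position


def find_switches(assignment):
    """Return (left, right) index pairs between which the label changes.

    Assumption: wild cards flanked by equal labels hide no switch, and
    wild cards flanked by different labels hide exactly one switch.
    """
    switches = []
    prev_idx = None  # index of the most recent non-wild-card label
    for idx, label in enumerate(assignment):
        if label == WILD:
            continue
        if prev_idx is not None and assignment[prev_idx] != label:
            switches.append((prev_idx, idx))
        prev_idx = idx
    return switches


# Illustrative sequence (not taken from the paper's examples).
progeny = [0, 0, WILD, 1, 1, 0, WILD, WILD, 0, 1]
switches = find_switches(progeny)
for left, right in switches:
    print(f"switch between markers {left} and {right}; "
          f"{right - left - 1} wild card(s) in between")
```

Each reported pair brackets the interval in which the crossover may lie; the wider the interval, the less precisely the switch is located.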
The wild card may result in different positions of the same switch corresponding to whether it was interpreted as a
Example 3.
The first is treated as a
The wild card counts of the four switches are 1, 1, 2 and 1 respectively.
When
Similarly the length of the dispersion interval of Example 3 shrinks from 11 to 9.
The transformed values are shown in bold above. The same process is applied to every dispersion interval to transform the matrices
Observation 7.
In fact, the following is easily verified:
Observation 8.
In practice, we observed that in all data sets all the dispersion intervals had no switches. There was exactly one instance where
Observation 9. Let
Let
The dispersion index,
with
where
If
Also, in our experiments these mismatches were extremely low (less than 0.01) and when followed up turned out to be experimental errors. Hence we have followed the convention that such a mismatch be flagged as a potential error. Then an error, if any, is at 〈
To summarize, the trio (
Downstream processing of iXora output
Since iXora associates the parent haplotypes (not just the inherited alleles) to each marker in a progeny, it is possible to study the distributions of the inherited parent haplotypes independent of or in association with a phenotype. The details of these and other related postprocessing available in the iXora framework are described here.
One of the important consequences of haplotype inferencing is obtaining the haplotype frequency distribution across the chromosomes. A marker
Example 4.
A run with
The 8 distinct configurations for Example 2 are:
Let
Observation 10.
(1) above is easily verified and for (2), Equation 17 is explained through the following example.
Example 5.
Based on
In the absence of any other external information, each of the alternative solutions in
Observation 11.
Below, we describe the iXora methodology for associating discrete traits with genomic locations using haplotypes. The same approach can be used for continuous traits, using different statistical tests and randomizations. In general, the phasing output can be used in other types of statistical tests, for example to test for associations between a pair of markers and a phenotype. In the following, let
Combination of parents We can test the effect of each haplotype pair at a marker with a phenotype as follows. In the case of discrete phenotype with
Individual parents We can also investigate the contribution to phenotype of each parent individually. The contingency table in this case is a 2 ×
Significance thresholds via randomization We include a method for directly estimating the background distribution of
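The single-parent test and the randomization threshold can be sketched as follows. This is a simplified illustration under stated assumptions, not the iXora code: the phenotype is assumed binary (so the contingency table is 2 × 2), every haplotype assignment is assumed resolved to 0 or 1, and the threshold is taken as the smallest p-value observed over shuffled phenotypes. All function names and the simulated data are hypothetical.

```python
import random
from math import comb


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(x):  # hypergeometric probability of x in the top-left cell
        return comb(row1, x) * comb(n - row1, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - (n - row1)), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def marker_pvalues(hap, phenotype):
    """hap[i][j]: haplotype (0/1) progeny i inherited at marker j from one parent."""
    pvals = []
    for j in range(len(hap[0])):
        table = [[0, 0], [0, 0]]  # rows: haplotype, columns: phenotype class
        for i, row in enumerate(hap):
            table[row[j]][phenotype[i]] += 1
        pvals.append(fisher_exact_2x2(table[0][0], table[0][1],
                                      table[1][0], table[1][1]))
    return pvals


def randomization_threshold(hap, phenotype, n_rand=100, seed=0):
    """Smallest p-value seen over n_rand random shufflings of the phenotype."""
    rng = random.Random(seed)
    shuffled = list(phenotype)
    smallest = 1.0
    for _ in range(n_rand):
        rng.shuffle(shuffled)
        smallest = min(smallest, min(marker_pvalues(hap, shuffled)))
    return smallest


# Simulated data: 40 progeny, 5 markers, marker 2 fully linked to the phenotype.
rng = random.Random(1)
phenotype = [i % 2 for i in range(40)]
hap = [[rng.randint(0, 1) for _ in range(5)] for _ in range(40)]
for i in range(40):
    hap[i][2] = phenotype[i]

pvals = marker_pvalues(hap, phenotype)
threshold = randomization_threshold(hap, phenotype)
significant = [j for j, p in enumerate(pvals) if p < threshold]
print(significant)
```

Shuffling the phenotype while keeping the haplotypes fixed preserves the linkage between neighbouring markers, so the threshold reflects the number of effectively independent tests rather than the raw marker count.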
The agglomerate solution from the phasing algorithm can be directly visualized to detect distortions in the data, with or without using phenotypic information. Approaches for visualizing the phasing solution are demonstrated in the following two paragraphs, while the third paragraph describes visualization of haplotype-phenotype associations. The figures shown here as examples stem from the use case described in detail in the Section “Using iXora” in Additional file
Individual haplotypes The individual haplotypes can be directly visualized, for example as the colored haplotype blocks shown in Additional file
Haplotype pairs The agglomerate structure capturing all the equally likely solutions enables estimation of the possible dispersion of the crossover locations. iXora visualization of the expected frequency distributions of the progeny haplotype pairs is shown in Figure
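To make the idea of an expected frequency concrete, here is a minimal sketch for a single parent rather than for haplotype pairs. It is not the iXora formula: it assumes a hypothetical encoding with labels 0, 1 and wild card `"*"`, at most one crossover inside each run of wild cards, and every crossover position within such a run being equally likely. Under those assumptions the probability of carrying haplotype 1 changes linearly across the run.

```python
WILD = "*"  # hypothetical wild-card symbol for an unresolved position


def expected_hap1(assignment):
    """Per-marker probability that one progeny carries haplotype 1."""
    n = len(assignment)
    known = [i for i, v in enumerate(assignment) if v != WILD]
    if not known:
        return [0.5] * n  # nothing resolved: both haplotypes equally likely
    probs = [0.0] * n
    for i in known:
        probs[i] = float(assignment[i])
    # Wild cards before the first or after the last known label copy it.
    for i in range(known[0]):
        probs[i] = probs[known[0]]
    for i in range(known[-1] + 1, n):
        probs[i] = probs[known[-1]]
    # Wild cards between two known labels: average over crossover positions.
    for left, right in zip(known, known[1:]):
        gaps = right - left  # number of places the crossover could fall
        for k in range(1, gaps):
            step = (probs[right] - probs[left]) * k / gaps
            probs[left + k] = probs[left] + step
    return probs


def haplotype_frequency(progeny_rows):
    """Expected frequency of haplotype 1 at each marker across progeny."""
    columns = zip(*(expected_hap1(row) for row in progeny_rows))
    return [sum(col) / len(progeny_rows) for col in columns]


rows = [
    [0, WILD, WILD, 1],  # one switch somewhere among three gaps
    [1, 1, WILD, 1],     # wild card flanked by equal labels
    [0, 0, 0, 0],
]
freq = haplotype_frequency(rows)
print([round(f, 3) for f in freq])
```

A marked departure of such a frequency from one half along a chromosome is the kind of distortion the visualization is meant to expose.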
Expected haplotype distributions visualization. Expected haplotype frequencies
Phenotype association Phenotype association for each parent individually is shown in Figure
Results from Fisher’s exact test for phenotype-haplotype association for A) Father and B) Mother, including the p-value significance thresholds from randomizations, for the simulated use case detailed in Additional file 1. In this case only one region of the genome from the father is significantly associated with the phenotype (marked by the dashed rectangle), according to the Fisher’s exact test and the randomization thresholds. [Legend: real data (red), randomized data (blue), smallest value in randomized data (green)].
Comparison with related methods
Here we first elaborate on the three distinct categories of population models for phasing, and then give details on the comparison of iXora with related phasing methods in the literature. Technical details on the settings for each compared method can be found in Additional file
Unrelated individuals (no parent information) These methods treat the input genotypes as samples from a population of unrelated individuals, and do not assign the progeny to parental haplotypes. While they may be applicable to human population data, they are not suitable for our purposes. fastPHASE can be adapted to our problem setting by treating the input as
Unrelated trios These methods allow the definition of family relationships between parents and progeny in the input, with the limitation that each parent has only one progeny. BEAGLE and HAPI-UR (HAPlotype Inference for UnRelated samples) are two such methods. The methods phase the progeny individually in terms of sequences that were transmitted from each parent. Therefore the progeny are not necessarily assigned to a consistent set of parent haplotypes.
Related trios These methods allow defining several progeny originating from the same two parents, thus the underlying algorithms utilize the full sibling information. However, unlike iXora, none of the existing methods was able to use the entire set of progeny per two parents. In our application this size is in hundreds. HAPI and Merlin produce results only on families of about 10 progeny while SHAPEIT2 can only process sizes up to 50. We therefore randomly divided the progeny into sets of appropriate sizes and phased the sets independently. However, we observed that the phasing results for the parents between sets were often inconsistent, affecting the overall accuracy. HAPI and Merlin produce an assignment of progeny to parental haplotypes while SHAPEIT2 does not.
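The random division of progeny into subsets can be sketched as below. This is only an illustration of the partitioning step; the function name, the seed, and the subset size are hypothetical, and the actual phasing of each subset by the external tools is not shown.

```python
import random


def split_progeny(progeny_ids, max_size, seed=0):
    """Shuffle the progeny and cut them into subsets of at most max_size."""
    rng = random.Random(seed)
    ids = list(progeny_ids)
    rng.shuffle(ids)
    return [ids[i:i + max_size] for i in range(0, len(ids), max_size)]


# 200 progeny split into families small enough for a tool limited to 10.
subsets = split_progeny(range(200), max_size=10)
print(len(subsets), [len(s) for s in subsets[:3]])
```

Because each subset is phased on its own, the labels given to the parental haplotypes need not agree between subsets, which is one way the inconsistencies noted above can arise.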
Comparison measures Here we describe the measures that we used to compare the different methods. Since existing phasing methods generate various types of output, we use two different measures so that all the methods are comparable on at least one measure. Our interest was not simply restricted to phasing the progeny genotypes by assigning each allele to the
First, the phasing accuracy (PA) of progeny is examined on a marker-by-marker basis, at the heterozygous positions only. We report the number of correctly assigned and the unassigned (unknown) positions as a percentage. BEAGLE, HAPI, HAPI-UR, Merlin, SHAPEIT2 and iXora can be directly compared on this measure for progeny, because they report the parental origin (maternal, paternal) of each allele. To incorporate fastPHASE and MACH also in this comparison, we post-processed their output as follows: progeny haplotypes are labeled as ‘maternal’ and ‘paternal’, using the labeling that minimizes mismatches compared to true maternal and paternal haplotypes. After the post-processing, all methods can be compared on this measure for progeny. The same accuracy evaluation is used to report imputation accuracy, by examining only the phasing for the missing input values.
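The labeling step and the PA measure can be sketched for a single progeny as follows. This is a minimal illustration, not the evaluation scripts used for the comparison: haplotypes are assumed to be lists of allele symbols, `None` is a hypothetical marker for an unassigned position, and all function names are illustrative.

```python
def count_mismatches(inferred, truth):
    """Number of assigned positions where the inferred allele is wrong."""
    return sum(1 for x, t in zip(inferred, truth) if x is not None and x != t)


def label_haplotypes(h1, h2, true_mat, true_pat):
    """Return (maternal, paternal) using the labeling with fewer mismatches."""
    direct = count_mismatches(h1, true_mat) + count_mismatches(h2, true_pat)
    swapped = count_mismatches(h2, true_mat) + count_mismatches(h1, true_pat)
    return (h1, h2) if direct <= swapped else (h2, h1)


def phasing_accuracy(mat, pat, true_mat, true_pat):
    """Percent correct and percent unassigned over heterozygous positions."""
    correct = unassigned = total = 0
    for m, p, tm, tp in zip(mat, pat, true_mat, true_pat):
        if tm == tp:  # homozygous position: nothing to phase
            continue
        total += 1
        if m is None or p is None:
            unassigned += 1
        elif (m, p) == (tm, tp):
            correct += 1
    if total == 0:
        return 0.0, 0.0
    return 100.0 * correct / total, 100.0 * unassigned / total


true_mat = list("AACGT")
true_pat = list("ATCCA")
# Output of a method that does not say which haplotype came from which parent.
h1 = ["A", "T", "C", "C", None]
h2 = ["A", "A", "C", "G", None]

mat, pat = label_haplotypes(h1, h2, true_mat, true_pat)
pa, ua = phasing_accuracy(mat, pat, true_mat, true_pat)
print(f"PA = {pa:.2f}% (unassigned {ua:.2f}%)")
```

Choosing the labeling that minimizes mismatches gives such methods the benefit of the doubt, so their reported PA is an upper bound on what they would achieve without access to the true haplotypes.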
Second, the accuracy of assigning the correct parental haplotype (PHA) for each progeny allele is examined, again on a marker-by-marker basis.
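A marker-by-marker PHA score for one parent might be computed along the following lines. This is a sketch under assumptions of our own: assignments are encoded as haplotype indices 0 or 1 with `None` for unassigned, and because the names of a parent's two haplotypes are arbitrary, the score is taken as the better of the direct and the globally swapped labeling. The function name is hypothetical.

```python
def pha_accuracy(assigned, truth):
    """Percent of progeny-marker positions given the correct parental haplotype.

    assigned[i][j] and truth[i][j] hold the haplotype index (0 or 1) that
    progeny i inherited at marker j from one parent; None means unassigned.
    """
    total = direct = swapped = 0
    for a_row, t_row in zip(assigned, truth):
        for a, t in zip(a_row, t_row):
            total += 1
            if a is None:
                continue  # unassigned positions count against the score
            if a == t:
                direct += 1
            else:
                swapped += 1
    if total == 0:
        return 0.0
    # Haplotype names are arbitrary, so accept whichever labeling fits better.
    return 100.0 * max(direct, swapped) / total


truth = [[0, 0, 1, 1], [1, 1, 1, 0]]
assigned = [[1, 1, 0, 0], [0, 0, None, 0]]
score = pha_accuracy(assigned, truth)
print(f"PHA = {score:.1f}%")
```

The swap is applied to all progeny at once, which rewards only a consistent set of parent haplotypes across the whole population.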
All the simulated data sets are available at the iXora website
Competing interests
The authors declare no competing interests.
Authors’ contributions
FU and NH designed and implemented the framework. FU designed and implemented the user interface. LP designed and implemented the core iXora algorithm for haplotype extraction. NH and FU designed the experiments and performed the analysis described in this paper. OEC contributed to the comparison with existing phasing methods (fastPHASE, SHAPEIT2). DL and SR were instrumental in the algorithm verification on real (Cacao) data. JCM, RJS and DNK contributed to identifying the scope of the iXora pipeline. LP, NH, and FU wrote the paper. All authors read and approved the final manuscript.
Acknowledgements
FU is partially funded by MARS, under the joint cacao project with IBM Research. OEC is partially funded by MARS, Incorporated. The authors would like to thank Dr. Seth Findley and Joseph Conrad Stack, and the anonymous reviewers for providing valuable comments on the manuscript.