Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, USA

Computer Science Department, Brown University, Providence, RI, USA

Tepper School of Business, Carnegie Mellon University, Pittsburgh, PA, USA

Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA, USA

Abstract

Background

Maximum parsimony phylogenetic tree reconstruction from genetic variation data is a fundamental problem in computational genetics with many practical applications in population genetics, whole genome analysis, and the search for genetic predictors of disease. Efficient methods are available for reconstruction of maximum parsimony trees from haplotype data, but such data are difficult to determine directly for autosomal DNA. Data are more commonly available in the form of genotypes, which consist of conflated combinations of pairs of haplotypes from homologous chromosomes. Currently, there are no general algorithms for the direct reconstruction of maximum parsimony phylogenies from genotype data. Phylogenetic applications for autosomal data must therefore rely on other methods for first computationally inferring haplotypes from genotypes.

Results

In this work, we develop the first practical method for computing maximum parsimony phylogenies directly from genotype data. We show that the standard practice of first inferring haplotypes from genotypes and then reconstructing a phylogeny on the haplotypes often substantially overestimates phylogeny size. As an immediate application, our method can be used to determine the minimum number of mutations required to explain a given set of observed genotypes.

Conclusion

Phylogeny reconstruction directly from unphased data is computationally feasible for moderate-sized problem instances and can lead to substantially more accurate tree size inferences than the standard practice of treating phasing and phylogeny construction as two separate analysis stages. The difference between the approaches is particularly important for downstream applications that require a lower bound on the number of mutations that the genetic region has undergone.

Background

The sequencing of the human genome has made it possible to conduct genome-wide studies on genetic variations in human populations. Most of these variation data are in the form of single nucleotide polymorphisms (SNPs), single DNA bases that have two common variants in a population, of which several million have now been identified

Unfortunately, the haplotype input data these methods assume, also known as "phased" data, are not easily available for autosomal genetic regions. Large-scale genetic studies usually instead must gather unphased, or genotype, data, in which haplotype contributions from two homologous chromosomes are conflated with one another.

To illustrate the problem, it will be helpful to arbitrarily denote the minor allele at each SNP site by 1 and the major allele by 0. In a genotype data set, we only observe the number of minor alleles present at each SNP site, which we will denote by 0 for homozygous major, 1 for heterozygous and 2 for homozygous minor. For example, see Figure. Over m SNP sites, a genotype sequence is a string of the form {0, 1, 2}^{m}, while a haplotype sequence is a string of the form {0, 1}^{m}. A pair of haplotype sequences is
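The encoding described above can be illustrated with a minimal Python sketch. The function name `conflate` and the example haplotypes are illustrative assumptions and are not taken from the authors' software. The sketch sums two binary haplotypes site by site to obtain the 0/1/2 genotype, and shows that two different haplotype pairs can yield the same genotype, which is the source of the phase ambiguity.

```python
def conflate(hap_a, hap_b):
    """Return the genotype (minor-allele count per site) of two haplotypes."""
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must cover the same SNP sites")
    # 0 = homozygous major, 1 = heterozygous, 2 = homozygous minor
    return tuple(a + b for a, b in zip(hap_a, hap_b))


# Two different (hypothetical) haplotype pairs over three SNP sites.
pair_one = ((0, 0, 1), (1, 1, 1))
pair_two = ((0, 1, 1), (1, 0, 1))

genotype_one = conflate(*pair_one)
genotype_two = conflate(*pair_two)

# Both pairs collapse to the same observed genotype.
same_genotype = genotype_one == genotype_two
```

Because the genotype records only allele counts, the information about which alleles travel together on one chromosome is lost at every heterozygous site.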

**Phasing: computationally inferring haplotypes from genotypes**. Although DNA sequences consist of four bases, single nucleotide polymorphisms (SNPs) are biallelic. Therefore, the sequence variation can be expressed using binary symbols. The observed genotype sequences consist of conflated combinations of two true haplotype sequences. Programs that computationally infer haplotypes attempt to minimize switch errors.

While mitochondrial and Y chromosome data can serve for tracking population histories on broad scales

**Phylogenetic error**. Although switch errors in phase inference can be small, in this case 1, the phylogeny size could be significantly altered. Therefore estimates such as mutation rates could be significantly affected if performed on computationally inferred haplotypes as opposed to genotypes. Moreover, in current methods it is impossible to say if the inferred phylogeny size is larger or smaller than that of the phylogeny from the true haplotypes.

A limited amount of prior work has examined the prospect of inferring maximum parsimony phylogenies directly from genotype data. Notice that in such problems, we wish to determine a pair of haplotypes for each input genotype sequence such that the maximum parsimony phylogeny size on the set of haplotypes is minimized. Gusfield showed that the problem can be efficiently solved when the genotype data are consistent with a perfect phylogeny

We note that the parsimony based approach described above is different from finding haplotypes corresponding to the given genotypes based on 'pure parsimony,' an objective that minimizes the number of distinct haplotypes needed to explain the observed genotypes as opposed to minimizing the number of mutations. The pure parsimony problem is NP-complete as well and there are integer program based approaches that solve problem instances of reasonable size

In this paper, we provide the first general, practical methods for maximum parsimony phylogeny inference from genotypes and use these methods to assess the inaccuracies introduced by phasing genotypes prior to phylogeny inference. Our algorithm relies on solving integer linear programs and allows for efficient solution of problem instances of moderate size but large imperfection. As an immediate application, our method can be used to infer the minimum number of recurrent mutations required to explain the given set of genotypes. We apply the resulting methods to a selection of real and simulated data, where we compare the true imperfection, the imperfection from haplotypes computationally inferred from genotypes, and the imperfection obtained directly from genotypes. This analysis shows that the phasing step often increases inferred phylogeny size, overestimating the true maximum parsimony. Motivated by our observations, we introduce a new

Results and Discussion

We now present the results of a series of empirical tests to assess the utility of the method on real and simulated genetic data. With both kinds of data, we begin with known haplotypes and then artificially pair them to produce genotypes. For each problem instance, we reconstruct maximum parsimony (MP) phylogenies in three ways: directly from the genotypes using the algorithm presented in this paper, from the original (true) haplotypes, and from haplotypes computationally inferred from the genotypes using fastPHASE or haplotyper. We use T_{min}, T_{true}, and T_{phase} to denote the MP phylogeny from the genotypes, true haplotypes, and inferred haplotypes (either using fastPHASE or haplotyper), respectively. We further denote the parsimony score (number of mutations) of a phylogeny T by length(T). For T_{phase} or T_{min}, we define a phylogenetic error (relative to T_{true}) as follows.

Definition 1

The phylogenetic error of T, where T is T_{min} or T_{phase}, relative to T_{true} is length(T) - length(T_{true}). The error is positive if length(T) > length(T_{true}) and negative if length(T) < length(T_{true}).

Note that it is impossible for T_{min} to have positive phylogenetic error. This is because our algorithm optimizes over all possible haplotypes consistent with the given set of genotypes and selects the one that minimizes the size of the phylogenetic tree. In contrast, T_{phase} can suffer from both types of errors and it is impossible to know if the size of the true phylogeny is larger or smaller than that of T_{phase}. The following definition of an

Definition 2

Simply stated, the imperfection is the minimum number of

Simulated Data

Due to the difficulty of obtaining phase-known autosomal data, we begin by examining simulated data. We used coalescent simulations to generate recombination-free haplotypes and genotypes for varying mutation rates and used these for a series of tests of how the accuracy of our method and the two comparative haplotype-based approaches varied with different parameter values. Each test measured the total number of errors of each method in 200 independently generated data sets. We first varied the mutation rate parameter

**Phylogenetic error as a function of mutation rate for varying dataset sizes**. Each plot shows the phylogenetic errors for inferences from our direct inference method (black circles), indirect inference using fastPHASE (light grey triangles), and indirect inference using haplotyper (dark grey squares) as a function of the mutation rate

Positive and negative errors for indirect phylogeny inference with varying mutation rate

| SNPs | n_{H} | Method | | 0.25 | 0.30 | 0.35 | 0.40 | 0.45 | 0.50 | 0.55 | 0.60 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 30 | fastPHASE positive errors | 2 | 8 | 8 | 13 | 7 | 15 | 14 | 8 | 14 |
| 5 | 30 | fastPHASE negative errors | 1 | 8 | 1 | 5 | 11 | 16 | 19 | 19 | 26 |
| 5 | 30 | haplotyper positive errors | 1 | 10 | 2 | 10 | 3 | 6 | 12 | 5 | 12 |
| 5 | 30 | haplotyper negative errors | 1 | 8 | 2 | 5 | 11 | 17 | 22 | 21 | 26 |
| 10 | 30 | fastPHASE positive errors | 7 | 12 | 27 | 38 | 28 | 38 | 38 | 27 | 42 |
| 10 | 30 | fastPHASE negative errors | 10 | 12 | 9 | 11 | 24 | 29 | 24 | 23 | 26 |
| 10 | 30 | haplotyper positive errors | 7 | 14 | 15 | 23 | 28 | 24 | 37 | 41 | 40 |
| 10 | 30 | haplotyper negative errors | 11 | 11 | 12 | 11 | 21 | 31 | 20 | 18 | 25 |
| 5 | 60 | fastPHASE positive errors | 6 | 7 | 7 | 15 | 16 | 13 | 16 | 11 | 14 |
| 5 | 60 | fastPHASE negative errors | 2 | 3 | 3 | 7 | 7 | 7 | 12 | 12 | 21 |
| 5 | 60 | haplotyper positive errors | 5 | 3 | 3 | 9 | 7 | 8 | 7 | 7 | 6 |
| 5 | 60 | haplotyper negative errors | 2 | 2 | 3 | 9 | 10 | 9 | 14 | 15 | 24 |
| 10 | 60 | fastPHASE positive errors | 24 | 25 | 25 | 29 | 28 | 32 | 43 | 54 | 41 |
| 10 | 60 | fastPHASE negative errors | 4 | 11 | 20 | 14 | 12 | 17 | 27 | 23 | 36 |
| 10 | 60 | haplotyper positive errors | 11 | 13 | 14 | 13 | 22 | 25 | 27 | 42 | 34 |
| 10 | 60 | haplotyper negative errors | 7 | 12 | 24 | 18 | 11 | 21 | 28 | 22 | 29 |

The table separates the phylogenetic errors from the experiments of Figure 2 into positive and negative errors for indirect phylogeny inference using fastPHASE and haplotyper.

We next tested variation in accuracy with the number of haplotype sequences sampled for fixed mutation rate with

**Phylogenetic error as a function of population size**. Each plot shows the phylogenetic errors for inferences from our direct inference method (black circles), indirect inference using fastPHASE (light grey triangles), and indirect inference using haplotyper (dark grey squares) as a function of the number of input haplotypes. Plots are provided for two window sizes (5 and 10 SNPs). Each data point in the plot was computed by running each algorithm over 200 randomly generated data sets. (a) Window size 5. (b) Window size 10.

Positive and negative errors for indirect phylogeny inference with varying sample sizes

| SNPs | Method | n_{H} = 30 | 40 | 50 | 60 | 70 | 80 | 90 | 100 | 110 | 120 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | fastPHASE positive errors | 10 | 12 | 15 | 14 | 11 | 15 | 25 | 15 | 10 | 10 |
| 5 | fastPHASE negative errors | 16 | 12 | 14 | 14 | 10 | 12 | 10 | 13 | 13 | 8 |
| 5 | haplotyper positive errors | 5 | 6 | 7 | 10 | 3 | 11 | 12 | 7 | 5 | 4 |
| 5 | haplotyper negative errors | 18 | 12 | 15 | 17 | 9 | 15 | 14 | 14 | 14 | 11 |
| 10 | fastPHASE positive errors | 38 | 32 | 43 | 34 | 31 | 41 | 46 | 53 | 41 | 38 |
| 10 | fastPHASE negative errors | 26 | 25 | 19 | 18 | 14 | 17 | 23 | 22 | 24 | 22 |
| 10 | haplotyper positive errors | 32 | 26 | 31 | 28 | 35 | 23 | 36 | 30 | 23 | 14 |
| 10 | haplotyper negative errors | 27 | 27 | 22 | 22 | 17 | 25 | 28 | 28 | 34 | 26 |

The table separates the phylogenetic errors from the experiments of Figure 3 into positive and negative errors for indirect phylogeny inference using fastPHASE and haplotyper.

Mitochondrial DNA

The next step in our analysis used mitochondrial DNA (mtDNA), which is naturally haploid. Although one would not normally need to phase mitochondrial DNA, we use it in our validation because it provides a source of large numbers of known haplotypes and because it provides a good model of recombination-free DNA. The lack of recombination in the mitochondrial DNA means that if the most parsimonious phylogeny on the genotypes is

Figure

**Mitochondrial DNA D-loop**. Imperfection around two high-variation segments (bp 1:800 and 16100:16350) of the D-loop of the mtDNA. Each position on the x-axis denotes the central nucleotide of the window examined. The y-axis shows the imperfection inferred by our direct method (solid grey line), the imperfection inferred by the indirect method using fastPHASE (dotted black line), and the true imperfection (dashed black line). (a) bp 1 to 800. (b) bp 16100 to 16350.

Phase-known Autosomal DNA

Only a very limited amount of true phase-known autosomal data is available. We chose to examine a set taken from the lipoprotein lipase (LPL) gene

The results are shown in Figure

**Phylogenetic errors on lipoprotein lipase (LPL)**. Imperfections for 22 blocks from LPL. Each data point has an x-coordinate corresponding to the central SNP of a given block and a y-coordinate corresponding to the imperfection of the inferred phylogeny on that block. Data are shown for our direct method (solid grey line with squares), indirect inference with fastPHASE (dotted grey line with X's), indirect inference with haplotyper (dash-dot grey line with triangles), and the true imperfection (dashed black line with diamonds).

Resource Usage

Finally, we examined the run time and space usage of our method using additional simulation tests. We examined a range of data set sizes from 30 to 120 genotypes for fixed mutation rate

**Run time and space performance as a function of input size**. Each plot measures performance for fixed mutation rate

We further assessed space usage of our method based on the maximum linear program relaxation size examined over the course of a given problem instance, averaging this value over the 200 trials. Here size is expressed as the product of the number of variables and the number of constraints. Figures

Conclusion

We have developed the first practical, general methods for finding maximum parsimony haplotypes from unphased genotype data and have used them to assess the costs introduced by computational phasing prior to phylogenetic inference. Our methods used a collection of heuristics based on the theory of Steiner trees, a variant of a flow-based ILP, and a branch-and-bound approach to solve problem instances with high imperfection that were not solvable by any prior method. While the method presented here is specific to the problem of inferring purely mutational phylogenies, similar approaches may prove productive for inference of ancestry by more general models of molecular evolution, such as ancestral recombination graphs (ARGs). Empirical tests on simulated and semi-simulated data show that direct phylogeny inference from genotypes leads to fewer errors than does the standard practice of building phylogenies from phased data. Methods for this problem have several practical applications. Most important is to estimate the minimum number of recurrent mutations required to explain a set of observed genotypes. A large such value may indicate frequent recurrent mutation or gene conversion or a selective pressure to recurrently alter a given allele. Researchers trying to establish such effects need to ensure that the size of the phylogeny is not an artifact of phase inference. The method should similarly be useful for improving estimates of local mutation rates. Other applications include improving the power of association tests by eliminating spurious effects from recurrent mutation, and providing alternative methods for detecting recombination-free autosomal regions and performing phase inference from genotype data.

Methods

We implemented two versions of the integer linear program, both of which were competitive in practice. The first is a direct integer linear program implementation and the second is a branch-and-bound algorithm that wraps a second integer linear program. We describe the direct implementation first, followed by the branch-and-bound method. Both methods were implemented in C++ using the Concert Technology of CPLEX 10.0 for integer linear program (ILP) solutions. We found the branch-and-bound method to give generally lower run times in practice than the direct ILP method. We therefore used the branch-and-bound method exclusively in generating the empirical results presented here.

Direct Integer Linear Programming Approach

This section introduces our ILP algorithm to solve the Genotype MP Phylogeny Problem. In the first subsection, we introduce preprocessing techniques that typically reduce the problem size, after which we describe the ILP.

Preprocessing

Preprocessing techniques form an integral part of any solution method based on integer programming. We now describe the major preprocessing methods used.

Let

For all sites k, let the weight w_{k} be initialized to 1. We then iteratively perform the following operation: for any pair of redundant sites i and j, set w_{i} := w_{i} + w_{j}, and remove site j.
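The weight-merging operation can be sketched in a few lines of Python. This is a minimal illustration under one stated assumption: two sites are treated as redundant when their columns are identical across all genotypes, which may be narrower than the definition used in the paper. The function name `merge_redundant_sites` is hypothetical.

```python
def merge_redundant_sites(genotypes):
    """Collapse identical columns and return (reduced rows, site weights).

    Assumption: sites are redundant when their columns are identical.
    Every site starts with weight 1; merging site j into site i adds
    the weight of j to the weight of i, so the final weight of a kept
    site equals the number of original sites sharing its column.
    """
    if not genotypes:
        return [], []
    columns = list(zip(*genotypes))
    weight_of = {}  # column pattern -> accumulated weight
    for column in columns:  # dict keeps first-seen column order
        weight_of[column] = weight_of.get(column, 0) + 1
    kept = list(weight_of)
    reduced = [tuple(row) for row in zip(*kept)]
    weights = [weight_of[column] for column in kept]
    return reduced, weights


# Hypothetical input: sites 0 and 1 carry identical columns.
example = [(0, 0, 1, 2), (1, 1, 0, 2), (2, 2, 1, 0)]
reduced_rows, site_weights = merge_redundant_sites(example)
```

In this example the first two sites merge into a single site of weight 2, so a mutation at that site counts twice toward the length of any phylogeny built on the reduced matrix.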

Definition 3

length(T) = ∑_{(u,v)∈E} ∑_{i∈D(u,v)} w_{i},

The following lemma justifies the preprocessing step:

Lemma 1

Proof 1

_{I }_{I }_{I∪{j} }_{I∪{j}}) = length(_{I}) + _{I}. _{G }_{G}. _{G }_{G\{j} }_{G\{j}}) = length(_{G}) - _{G}) - _{G}).

Due to these preprocessing steps, we assume from now on that the input genotype matrix has weights w_{i} ≥ 1 associated with its sites. For each genotype with p ≥ 1 heterozygous sites, there are 2^{p-1} pairs of haplotypes that can explain it.
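The count of 2^{p-1} explaining pairs can be checked with a short enumeration. The sketch below is illustrative only: it assumes the 0/1/2 genotype encoding introduced in the Background (1 = heterozygous), and the function name `haplotype_pairs` is hypothetical. Fixing the orientation of the first heterozygous site avoids listing each unordered pair twice.

```python
from itertools import product


def haplotype_pairs(genotype):
    """List the unordered haplotype pairs that explain a 0/1/2 genotype."""
    het = [i for i, g in enumerate(genotype) if g == 1]
    # Homozygous sites are forced: 0 -> (0, 0) and 2 -> (1, 1).
    base = [1 if g == 2 else 0 for g in genotype]
    if not het:
        only = tuple(base)
        return [(only, only)]
    first, rest = het[0], het[1:]
    pairs = []
    for bits in product((0, 1), repeat=len(rest)):
        h1, h2 = list(base), list(base)
        h1[first], h2[first] = 0, 1  # fixed orientation removes mirror images
        for site, bit in zip(rest, bits):
            h1[site], h2[site] = bit, 1 - bit
        pairs.append((tuple(h1), tuple(h2)))
    return pairs


# A hypothetical genotype with p = 3 heterozygous sites has 2^(3-1) = 4 pairs.
pairs_for_example = haplotype_pairs((1, 1, 1, 0))
```

The exponential growth in p is what makes exhaustive consideration of all phasings impractical for genotypes with many heterozygous sites.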

Now, consider the matrix _{g∈G}

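For a binary input matrix, let c_{j}(0) and c_{j}(1) denote the sets of rows that take the values 0 and 1, respectively, at site j. Each vertex of the Buneman graph is an m-tuple of blocks [c_{1}(i_{1}), c_{2}(i_{2}),..., c_{m}(i_{m})] (i_{j} = 0 or 1 for each 1 ≤ j ≤ m) such that each pair of blocks c_{j}(i_{j}) ∩ c_{k}(i_{k}) has nonempty intersection. There is an edge between two vertices in the Buneman graph if and only if they differ in exactly one block. Each vertex [c_{1}(i_{1}),..., c_{m}(i_{m})] translates to haplotype (i_{1},..., i_{m}). Buneman graphs are very useful due to the following theorem: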

Theorem 1

Using Theorem 1, we first construct the Buneman graph on

Lemma 2

The Buneman graph is simply a method to reduce the size of the underlying graph from an m-cube with 2^{m} vertices to a (typically significantly) smaller subgraph. Putting together these methods, we can summarize our preprocessing steps as follows:

1. Create a weighted genotype matrix

2. Create a set

3. Construct the underlying graph _{u,v }= _{i }where

We apply some additional heuristic preprocessing steps that have proven very effective in practice. One of these steps identifies a subset of haplotypes that must occur in any optimal solution and then removes from the input any genotypes that can be produced from pairs of these obligatory haplotypes. As any optimal output can produce these genotypes, their absence will not change the final output. We can also eliminate certain possible haplotypes because they would imply high-weight edges and therefore cannot occur in any low-cost solution.
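To make the Buneman graph step concrete, the following Python sketch builds the graph for a tiny binary matrix using the standard definition: a vertex is a choice of one block per site such that all chosen blocks pairwise intersect, and edges join vertices that differ in exactly one site. It enumerates all 2^{m} candidate tuples, so it is only suitable for very small m and is not the construction used in the authors' implementation. The function name `buneman_graph` and the example matrix are assumptions made for illustration.

```python
from itertools import combinations, product


def buneman_graph(rows):
    """Return (vertices, edges) of the Buneman graph of a binary matrix."""
    m = len(rows[0])
    # block[j][v]: indices of the rows whose value at site j equals v
    block = [
        [{r for r, row in enumerate(rows) if row[j] == v} for v in (0, 1)]
        for j in range(m)
    ]
    vertices = [
        v
        for v in product((0, 1), repeat=m)
        # j == k is included so that every chosen block must be nonempty
        if all(block[j][v[j]] & block[k][v[k]]
               for j in range(m) for k in range(j, m))
    ]
    edges = [
        (u, v)
        for u, v in combinations(vertices, 2)
        if sum(a != b for a, b in zip(u, v)) == 1
    ]
    return vertices, edges


# Three haplotypes compatible with a perfect phylogeny: the graph is a tree.
tree_vertices, tree_edges = buneman_graph([(0, 0), (0, 1), (1, 0)])

# Adding the fourth gamete creates a conflict and the graph becomes a cycle.
cycle_vertices, cycle_edges = buneman_graph([(0, 0), (0, 1), (1, 0), (1, 1)])
```

In the first example the vertex (1, 1) is excluded because no input row carries a 1 at both sites, so the search space shrinks from the full square to a three-vertex path.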

Once all preprocessing steps have been applied, we have a weighted Buneman graph

ILP Formulation

We now develop an ILP formulation for the problem based on multicommodity flows

The high-level idea of the method is to send flow from a designated root to each haplotype that is used to explain an input genotype. Each of these haplotypes acts as a sink for one unit of flow. The program must select a subset of edges that accommodate all flow while minimizing the cost of the edges selected. This flow formulation guarantees that every haplotype is connected to the root and the minimization prevents formation of cycles. The formulation thus forces the output to be a tree. For the sake of simplicity, we assume that the all-zeros haplotype is present in all the solutions. We can treat this as the

Let _{k }be an indicator variable denoting the presence or absence of haplotype _{k }= 1, _{i,j }that denote the presence of both haplotypes _{i }and _{j}. All the present haplotypes act as a sink for one unit of flow from the root vertex. On the other hand, all non-present haplotype vertices and Steiner vertices satisfy perfect flow conservation. To enforce this, we use two types of binary variables _{i,j }for each edge (_{i,j }are binary variables that denotes if edge (

We now have the following integer linear program:

_{i,j}_{i,j}_{i,j}

_{ij }≤ _{i} ∀

_{ij }≤ _{j} ∀

_{(i,j)∈R(g)}_{ij }≥ 1 ∀input

In constraints (2) and (3), variable _{ij }indicates the presence of the haplotype pair (_{i}, _{j}). Constraint (4) guarantees that each genotype is explained by at least one pair of haplotypes. Constraint (5) imposes inflow/outflow constraints on haplotypes as well as enforcing the condition that there is positive flow to a haplotype _{k }only if _{k }is selected. Constraint (6) imposes flow conservation at all non-present haplotype vertices as well as Steiner vertices and constraint (7) imposes the condition that flow can only be sent along edges present in the solution. Note that all integer variables of the above linear program are binary. Finally, we observe that the solution of the ILP is the size of the most parsimonious phylogeny on

Branch and Bound Algorithm

We developed an alternative method for the problem that uses a simpler integer linear program embedded in a branch-and-bound routine. The high-level idea behind the method is to first guess the set of haplotypes that would phase the given input genotypes and then construct a most parsimonious phylogeny on the haplotypes. Note that all the pre-processing techniques outlined in the previous sub-section still apply for this method.

We use

function genBB(genotypes

1. for all row vectors

(a) if ∃_{1}, _{2 }∈

2. if (|

3. if (hapMP(

4. let

5. for all

(a)

(b) _{1}, _{2}}

(c)

(d) if

6. return

The _{1}, _{2}. Integer

In the above method, the height of the branch-and-bound tree is at most ^{k }where ^{m}. Although the running-time of the final branch-and-bound method is super-exponential, we find that its run time is competitive with and often superior to the ILP described in the previous section.
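The branch-and-bound idea can be illustrated with a small Python sketch. It is a simplified stand-in, not the authors' algorithm: the inner maximum parsimony computation (`hap_mp` here) is done by exhaustively trying every set of Steiner vertices of the hypercube rather than by an ILP, so it only works for a handful of sites. All function names are hypothetical. The pruning step relies on the fact that adding haplotypes can never decrease the size of the most parsimonious tree, so the tree size on a partial phasing is a valid lower bound.

```python
from itertools import combinations, product


def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))


def mst_cost(nodes):
    """Minimum spanning tree cost under Hamming distance (Prim's algorithm)."""
    nodes = list(nodes)
    if len(nodes) < 2:
        return 0
    in_tree, rest, cost = {nodes[0]}, set(nodes[1:]), 0
    while rest:
        dist, nxt = min((min(hamming(v, u) for u in in_tree), v) for v in rest)
        cost += dist
        in_tree.add(nxt)
        rest.remove(nxt)
    return cost


def hap_mp(haps, m):
    """Exact MP tree size on haplotypes; exhaustive over Steiner vertex sets."""
    haps = set(haps)
    others = [v for v in product((0, 1), repeat=m) if v not in haps]
    best = mst_cost(haps)
    for r in range(1, len(others) + 1):
        for extra in combinations(others, r):
            best = min(best, mst_cost(haps | set(extra)))
    return best


def phasings(genotype):
    """Unordered haplotype pairs explaining a 0/1/2 genotype (1 = heterozygous)."""
    het = [i for i, g in enumerate(genotype) if g == 1]
    base = [1 if g == 2 else 0 for g in genotype]
    for bits in product((0, 1), repeat=max(len(het) - 1, 0)):
        h1, h2 = list(base), list(base)
        for site, bit in zip(het, (0,) + bits):
            h1[site], h2[site] = bit, 1 - bit
        yield tuple(h1), tuple(h2)


def genotype_mp(genotypes):
    """Smallest MP tree size over all phasings, with lower-bound pruning."""
    m = len(genotypes[0])
    best = [float("inf")]

    def branch(index, chosen):
        bound = hap_mp(chosen, m) if chosen else 0
        if bound >= best[0]:  # cannot improve on the incumbent: prune
            return
        if index == len(genotypes):
            best[0] = bound
            return
        for h1, h2 in phasings(genotypes[index]):
            branch(index + 1, chosen | {h1, h2})

    branch(0, frozenset())
    return best[0]


# Hypothetical instance: the heterozygote (1, 1) can be phased two ways.
direct_size = genotype_mp([(0, 0), (2, 2), (1, 1)])
worse_phasing_size = hap_mp({(0, 0), (1, 1), (0, 1), (1, 0)}, 2)
```

In this instance phasing the heterozygote as 00/11 reuses haplotypes already present and gives a tree of size 2, while phasing it as 01/10 forces a tree of size 3; the second branch is pruned as soon as its bound reaches the incumbent.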

Data Generation and Analysis

In order to generate simulated data, coalescent trees were created using Hudson's ms program _{h}. The ms program can also use this tree to produce haplotype sequences, but does so under the infinite-sites model (without any recurrent mutations). We therefore instead used the seq-gen program of Rambaut and Grassly _{h }haplotypes using the ms coalescent tree. We varied the number of SNPs _{0}_{0 }is the effective population size. We relate the simulation parameter _{0 }= 10, 000 (a reasonable estimate for humans ^{-6 }for

seq-gen was used under the GTR model, a generic time reversible Markov model. Mutation rates between

Each data point was generated from 200 independently generated simulated data sets, with the reported error rates summed over the 200 replicates. In our first set of simulations, designed to test the effect of mutation rate on accuracy, we varied

Mitochondrial data were extracted from a set of 63 complete mitochondrial DNA sequences of 16,569 bases each, produced by Fraumene et al.

Autosomal DNA was extracted from a lipoprotein lipase (LPL) data set due to Nickerson et al.

Availability and requirements

**Project name: Direct Imperfect Phylogeny Reconstruction from Genotypes**

**Project home page: **

The implementation of the algorithm that was used in our empirical study is accessible through a web form at the project web page. Instructions are provided at the site. Requirements below are for use of this web server. Source code in C++ will be provided upon request, but requires that the user have access to ILOG CPLEX 10 and a CPLEX-supported operating system and compiler.

**Operating system(s): **Linux Redhat, Windows, Mac OS X

**Other requirements: **Web browser: software has been tested on Mozilla 1.6, Firefox 2.0.0.4, Internet Explorer 6.0, Internet Explorer Mac 5.2, and Safari 2.0.4.

**Any restrictions to non-academics: **Web-based access to the analysis tools is freely available.

Authors' contributions

S.S. developed the software used in the paper and performed all the experiments. All authors participated in designing the algorithms and in suggesting methods for validation. All authors read and approved the final manuscript.

Acknowledgements

The authors would like to thank Dannie Durand for reading and suggesting improvements to the manuscript. This work was supported in part by U.S. National Science Foundation award #0346981.