Center for Computational Biology and Bioinformatics and Department of Electrical Engineering, Columbia University, New York, NY, USA

Abstract

Background

In genome-wide association studies, thousands of individuals are genotyped at hundreds of thousands of single nucleotide polymorphisms (SNPs). Statistical power can be increased when haplotypes, rather than three-valued genotypes, are used in the analysis, so the problem of haplotype phase inference (phasing) is particularly relevant. Several phasing algorithms based on different models have been developed for data from unrelated individuals, and some of them have been extended to father-mother-child "trio" data.

Results

We introduce a technique for phasing trio datasets using a tree-based deterministic sampling scheme. We compared our method with the publicly available algorithms PHASE v2.1, BEAGLE v3.0.2 and 2SNP v1.7 on datasets with varying numbers of markers and trios. We found that the computational complexity of PHASE makes it prohibitive for routine use; 2SNP, on the other hand, though the fastest method for small datasets, was significantly inaccurate. Our method outperforms BEAGLE in both speed and accuracy for small to intermediate numbers of trios, for all marker counts examined. Our method is implemented in the "Tree-Based Deterministic Sampling" (TDS) package, available for download at

Conclusions

Using a Tree-Based Deterministic Sampling technique, we present an intuitive and conceptually simple phasing algorithm for trio data. The trade-off between speed and accuracy achieved by our algorithm makes it a strong candidate for routine use on trio datasets.

Background

Large genetic association studies involving thousands of individuals are becoming increasingly available, providing opportunities for biological and medical discoveries using sophisticated computational and statistical analysis

Rather than examining SNPs independent of each other, simultaneously considering the values of multiple SNPs within haplotypes (combinations of alleles at multiple loci in individual chromosomes) can improve the power of detecting associations with disease and is helpful in several applications, such as evolutionary genetics

An algorithm (PHASE

Phasing methods should scale well with the number of SNPs as well as the number of individuals. For computational efficiency, it is also important that when new data are added to an already phased dataset, the previous data do not have to be reprocessed in order to phase the new data. A deterministic sequential Monte Carlo (DSMC)-based phasing algorithm

In this paper, we propose a related new TDS algorithm for haplotype phasing of trio data, in which trios are processed sequentially. All possible solutions for each haplotype are examined. Our algorithm uses the idea that within haplotype blocks there is limited haplotype diversity, and thus attempts to phase each new trio using haplotypes that have already been encountered in previously seen trios. The TDS framework allows us to perform this search effectively in the space of all possible solution combinations. The procedure can be seen as an efficient tree search in which, at each step, only "the most probable" solution streams are kept. Each stream contains one and only one solution for each trio already encountered. We show that our algorithm achieves an excellent trade-off between speed and accuracy, making it ideal for routine use.

Results

The structure of this section is as follows: First we describe the datasets and figures of merit used to evaluate the method. Then we present the results from comparing our method to PHASE v2.1, BEAGLE v3.0.2 and 2SNP v1.7.

Datasets

We used a set of simulated datasets produced with the "COSI" software as provided in

We also used the "COSI" software to create our own realistic simulated datasets to assess the performance of our method on large datasets. We created 20 datasets, each consisting of 4000 haplotypes with 20 Mb of marker data, using the "best-fit" parameters obtained from fitting a coalescent model to the real data. Samples were taken from a European population, and each simulated dataset has a recombination rate sampled from a distribution matching the deCODE map, with recombination clustered into hotspots. For each simulated dataset, we initially selected only those markers with minor allele frequency greater than 0.05. Markers were then randomly selected to obtain a density of about 1 SNP per 3 kb. For each dataset two sample sizes were created: 100 and 1000 trios. In each trio, each parent was randomly assigned a pair of haplotypes from the population, so that no two individuals shared a haplotype, and one of the haplotypes of each parent was selected to be transmitted to the child.
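The trio construction step described above can be sketched in Python as follows. This is a minimal illustration, not the procedure actually used to build the datasets: the function name `make_trios`, the dictionary layout and the fixed random seed are hypothetical, and the sketch assumes that each parent receives two distinct haplotypes from the pool and transmits one of them, chosen uniformly at random.

```python
import random

def make_trios(haplotypes, n_trios, seed=0):
    """Build trios from a pool of haplotype strings over {'0', '1'}.

    Assumption: each parent draws two pool haplotypes without replacement
    and transmits one of them, picked uniformly at random, to the child.
    """
    if len(haplotypes) < 4 * n_trios:
        raise ValueError("need four pool haplotypes per trio")
    rng = random.Random(seed)
    # Sampling indices without replacement: no pool haplotype is used twice.
    picked = rng.sample(range(len(haplotypes)), 4 * n_trios)
    trios = []
    for t in range(n_trios):
        f1, f2, m1, m2 = (haplotypes[i] for i in picked[4 * t:4 * t + 4])
        father_transmitted = rng.choice((f1, f2))
        mother_transmitted = rng.choice((m1, m2))
        trios.append({
            "father": (f1, f2),
            "mother": (m1, m2),
            "child": (father_transmitted, mother_transmitted),
        })
    return trios
```

With a pool of 4000 haplotypes this sketch can form at most 1000 trios, which matches the larger of the two sample sizes mentioned above.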

Definitions of criteria

Transmission Error Rate: The proportion of non-missing parental genotypes with ambiguous phase that were incorrectly phased.

Incorrect Trios (IT): The number of trios for which phasing was not completely correct.

Computational Time: The average time to complete phasing. Our algorithm was implemented in Java for portability, memory efficiency and speed. For each method we recorded the average computational time in each dataset on a 3.66 GHz Xeon Intel PC with 8 GB of RAM.

Memory: The memory required by the software to complete haplotype inference.

Transmission Error Rate and Incorrect Trios

The performance of the methods on the simulated data sets is shown in Tables

Average Transmission Error Rate For Phasing Trios

| **Average Transmission Error Rate (%)** | **ST1** | **ST2** | **ST3** |
|---|---|---|---|
| PHASE | 0.0013 | 0.0013 | 0.0145 |
| BEAGLE (R = 1) | 0.0235 | 0.0318 | 0.0426 |
| BEAGLE (R = 4) | 0.0150 | 0.0148 | 0.0344 |
| TDS | 0.0039 | 0.0065 | 0.0320 |
| 2SNP | 0.4377 | 0.4868 | 0.4861 |

Average number of Incorrect Trios per dataset

| **Incorrect Trios** | **ST1** | **ST2** | **ST3** |
|---|---|---|---|
| PHASE | 0.3 | 0.4 | 2.45 |
| BEAGLE (R = 1) | 3.75 | 5.8 | 6.4 |
| BEAGLE (R = 4) | 1.95 | 2.9 | 5.45 |
| TDS | 0.95 | 1.6 | 5.4 |
| 2SNP | 25.9 | 28.6 | 28 |

We then set 1% of the genotypes to missing and re-evaluated the performance of the algorithms on these datasets; the results are shown in Tables

Average Transmission Error Rate For Phasing Trios with 1% Missing Rate

| **Average Transmission Error Rate (%)** | **ST1** | **ST2** | **ST3** |
|---|---|---|---|
| PHASE | 0.0031 | 0.0023 | 0.0161 |
| BEAGLE (R = 1) | 0.0213 | 0.0248 | 0.0354 |
| BEAGLE (R = 4) | 0.0093 | 0.0133 | 0.0278 |
| TDS | 0.0094 | 0.0116 | 0.0348 |
| 2SNP | 0.3038 | 0.3486 | 0.3169 |

Average number of Incorrect Trios per dataset with 1% Missing Rate

| **Incorrect Trios** | **ST1** | **ST2** | **ST3** |
|---|---|---|---|
| PHASE | 0.6 | 0.475 | 2.653 |
| BEAGLE (R = 1) | 3.6054 | 5.25 | 6.4661 |
| BEAGLE (R = 4) | 1.7464 | 3.1321 | 4.8893 |
| TDS | 1.7521 | 2.7018 | 5.7768 |
| 2SNP | 26.05 | 28.55 | 28.2 |

We assessed the accuracy of our method as the dataset size increases by varying the number of trios and markers, and evaluated the performance by means of the Transmission Error Rate as shown in Table

Average Transmission error rate for 100 and 1000 Trios as a function of the number of markers

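| **Method** | **Trios** | **200 markers** | **400 markers** | **1000 markers** | **6000 markers** |
|---|---|---|---|---|---|
| TDS | 100 | 0.00063 | 0.00075 | 0.0015 | 0.0023 |
| TDS | 1000 | 0.00042 | 0.0008 | 0.0015 | 0.0023 |
| Beagle | 100 | 0.0013 | 0.0013 | 0.0021 | 0.0024 |
| Beagle | 1000 | 0.00011 | 0.00033 | 0.0005 | 0.0007 |
| 2SNP | 100 | 0.1094 | 0.2855 | 0.3916 | 0.4315 |
| 2SNP | 1000 | 0.1733 | 0.2524 | 0.3836 | 0.4117 |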

Timing Results

The computational times for datasets ST1, ST2 and ST3 are displayed in Table

Timing Results

| **Time (s)** | **ST1** | **ST2** | **ST3** |
|---|---|---|---|
| PHASE | 8452 | 4932 | 5464 |
| BEAGLE (R = 1) | 2.59 | 2.73 | 2.95 |
| BEAGLE (R = 4) | 2.80 | 3.18 | 3.27 |
| TDS | 1.99 | 2.48 | 2.61 |
| 2SNP | 0.63 | 0.6 | 0.59 |

Timing Results with 1% Missing Rate

| **Time (s)** | **ST1** | **ST2** | **ST3** |
|---|---|---|---|
| PHASE | 8613 | 5220 | 5831 |
| BEAGLE (R = 1) | 2.6744 | 2.9873 | 3.2409 |
| BEAGLE (R = 4) | 2.9233 | 3.2858 | 3.4429 |
| TDS | 2.0643 | 2.5815 | 2.7484 |
| 2SNP | 0.67 | 0.63 | 0.6 |

In Table

Average Timing Results in seconds for 100 and 1000 Trios as a function of the number of markers

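| **Method** | **Trios** | **200 markers** | **400 markers** | **1000 markers** | **6000 markers** |
|---|---|---|---|---|---|
| TDS | 100 | 2.8 | 5 | 14.4 | 113.6 |
| TDS | 1000 | 31.8 | 63.3 | 156.2 | 1257.4 |
| Beagle | 100 | 3.7 | 5.6 | 15.2 | 118.4 |
| Beagle | 1000 | 12.7 | 31.6 | 291.8 | 1952.4 |
| 2SNP | 100 | 3 | 8.9 | 28.7 | 180.7 |
| 2SNP | 1000 | 33.4 | 116.2 | 399.8 | 3008.2 |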

Memory Requirements

All methods completed the experiments within the preallocated 1.5 GB of RAM.

Discussion

An important feature of our algorithm is the partition of the whole genotype sequence into smaller blocks that exhibit limited haplotype diversity. We currently identify these haplotype blocks based on the genotype sequences (see Haplotype Block Partitioning section). However, the accuracy of our algorithm could improve significantly if the boundaries of the haplotype blocks were estimated more accurately. To achieve that, either the haplotype blocks should already be known from outside sources, or a set of phased haplotypes from the region of interest should already be available. In real applications, studies are very often performed in populations that have already been studied in the HapMap project. For these populations accurately phased samples are available, which can serve as a basis for an accurate definition of the haplotype blocks. Our methodology offers a unique framework that can easily incorporate prior knowledge in the form of haplotypes or trio genotypes from the same population as that from which the target samples were drawn. Haplotypes (such as those available from the HapMap) are introduced in the form of a prior for the counts in the TDS algorithm. Unphased trio genotypes can be phased along with the target samples, with their results discarded at the end. The presence of the extra information will improve the phasing accuracy on the target samples.

A related problem to haplotype inference is imputation of missing SNPs. Several algorithms have been specifically developed to address this problem

In datasets with missing SNPs such as the ones used in Tables

Average Allelic Imputation Error Rate For Simulated datasets

| **Average Allelic Imputation Error Rate (%)** | **ST1** | **ST2** | **ST3** |
|---|---|---|---|
| PHASE | 0.0063 | 0.0145 | 0.0133 |
| BEAGLE (R = 1) | 0.0124 | 0.0255 | 0.0249 |
| BEAGLE (R = 4) | 0.0101 | 0.0224 | 0.0223 |
| TDS | 0.0124 | 0.0271 | 0.0266 |
| 2SNP | 0.0741 | 0.0855 | 0.0983 |

Average Allelic Imputation Error Rate and Timing Results for HapMap datasets

| **Method** | **Allelic Imputation Error Rate** | **Time (s)** |
|---|---|---|
| PHASE | 0.0051 | 5360 |
| BEAGLE (R = 1) | 0.0129 | 3.156 |
| BEAGLE (R = 4) | 0.0112 | 3.339 |
| TDS | 0.0134 | 2.53 |
| 2SNP | 0.0831 | 0.685 |

Conclusions

We have introduced a new algorithm for inferring haplotype phase in nuclear families using a Tree-Based Deterministic Sampling scheme. PHASE, the most accurate algorithm for haplotype inference in trio families, is prohibitively slow for routine use, and 2SNP, the fastest algorithm for datasets of up to 100 trios, is inaccurate. We have demonstrated that TDS is faster and more accurate than BEAGLE in almost all scenarios considered for small and intermediate numbers of trios, and for all marker counts. From a user's perspective, our implementation is friendlier since it is parameter-free, as all parameters are optimized inside the algorithm. This makes it ideal for routine tasks, even for non-specialized users. Furthermore, our TDS implementation provides a comprehensive, solid and straightforward framework to build upon for more complex phasing and imputation scenarios.

Methods

Brief Description

We first give an intuitive description of our algorithm, highlighting its major concepts without going into detailed mathematical formalization. Suppose that we denote the major allele at a particular SNP locus in a haplotype as "0" and the minor allele as "1". Similarly, in a genotype we denote by "0" that the individual is homozygous for the major allele at that SNP and by "1" that the individual is homozygous for the minor allele. We denote the heterozygous case by "2". For example, the haplotype pair "10110" and "00100" would produce the genotype "20120".
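The encoding can be made concrete with a short Python sketch. The function name `genotype_from_haplotypes` is hypothetical and the sketch assumes haplotypes are given as equal-length strings over "0" and "1", with no missing alleles.

```python
def genotype_from_haplotypes(hap_a, hap_b):
    """Combine two haplotype strings into a three-valued genotype string.

    "0": homozygous for the major allele, "1": homozygous for the minor
    allele, "2": heterozygous.
    """
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must have the same number of SNPs")
    codes = []
    for a, b in zip(hap_a, hap_b):
        # Equal alleles keep their own code; different alleles give "2".
        codes.append(a if a == b else "2")
    return "".join(codes)


print(genotype_from_haplotypes("10110", "00100"))  # prints 20120
```

The mapping is many-to-one: several haplotype pairs can produce the same genotype, which is why phase has to be inferred.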

In nuclear families, each parent transmits a chromosome to the child. In most cases we can determine which parent transmitted which allele to the offspring based on the genotypes of the parents and the offspring. The only case where we cannot infer that information is when both parents and the offspring are all heterozygous at that SNP (i.e., at that SNP all three genotypes are "2"). In that case, either parent can have transmitted the major or the minor allele, so we have two possibilities for the origin of each allele. This means that if a genotype of a trio has $L$ such ambiguous SNPs, there are $2^{L}$ possible haplotype solutions for that trio.
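As a hedged illustration of this case analysis, the following Python sketch enumerates, SNP by SNP, the allele assignments that are consistent with the three genotypes of a trio and combines them into candidate haplotype solutions. The names `site_options` and `trio_solutions`, and the tuple layout (transmitted and untransmitted haplotype of the first parent, then of the second parent), are assumptions made for illustration; missing genotypes are not handled.

```python
from itertools import product

# (transmitted, untransmitted) allele pairs compatible with a genotype code.
PARENT_OPTIONS = {
    "0": [("0", "0")],
    "1": [("1", "1")],
    "2": [("0", "1"), ("1", "0")],
}

def site_options(father, mother, child):
    """Allele assignments (ft, fu, mt, mu) consistent with one SNP of a trio."""
    options = []
    for (ft, fu), (mt, mu) in product(PARENT_OPTIONS[father], PARENT_OPTIONS[mother]):
        # The child's genotype follows from the two transmitted alleles.
        child_code = ft if ft == mt else "2"
        if child_code == child:
            options.append((ft, fu, mt, mu))
    return options

def trio_solutions(father, mother, child):
    """All haplotype 4-tuples consistent with the genotype strings of a trio."""
    per_site = [site_options(f, m, c) for f, m, c in zip(father, mother, child)]
    solutions = []
    for combo in product(*per_site):
        # Transpose the per-SNP allele tuples into four haplotype strings.
        solutions.append(tuple("".join(alleles) for alleles in zip(*combo)))
    return solutions


print(len(trio_solutions("22", "22", "22")))  # prints 4
print(trio_solutions("20", "21", "22"))
```

Only SNPs where all three genotypes are "2" contribute more than one option, so the number of solutions doubles with each such SNP. A Mendelian inconsistency at any SNP leaves no option for that SNP, and the sketch then returns an empty list.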

**Example of TDS**. We process three trios sequentially. In each trio the first two genotypes are those of the parents and the third is that of the child. The possible solutions of each trio are listed next to it and numbered 1 and 2. In each possible solution the first two haplotypes are the transmitted and the untransmitted haplotype of the first parent, and the remaining two are those of the second parent. At each step we keep only K = 2 streams, called "surviving streams". 1) The first trio has two possible solutions. 2) a) The second trio has two possible solutions, so there are four possible combinations of a solution from the first trio with a solution from the second. The indices below the solutions show which solution of each trio the stream was created from. For example, stream $s_{1-2}$, as illustrated, was created from the first solution of the first trio and the second solution of the second trio. Each stream is assigned a weight as described in the Methods section. b) We keep only the K = 2 streams with the highest weights (the surviving streams), which at this point we consider the most probable. 3) The third trio has two possible solutions. a) Each of them is appended to the end of each of the two streams that we have kept. The streams are labelled as before, with stream $s_{2-1-1}$ obtained by appending solution 1 of the third trio to stream $s_{2-1}$. b) Again we keep only the two streams with the highest weights, $s_{2-1-1}$ and $s_{2-1-2}$.

Our algorithm processes nuclear families sequentially (Figure

Suppose we had _{l}
_{n}

To further explain this procedure, suppose that after processing a trio we have ^{ext }
^{ext }

The idea for weighting the different streams is based on the concept that within a haplotype block we expect limited diversity and find only a subset of all the possible haplotypes. This means that most haplotypes should be encountered more than once. In terms of our procedure, we would like to phase each new trio based on haplotypes that we have already encountered in that stream. Since the weight we assign to each node should capture this feature, it is a function of the weight that this node had prior to attaching one of the possible solutions of the current trio and of a factor that represents the extent to which the currently appended solution includes haplotypes that have already been seen (see Eq. (4) in Methods section).
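The extend-and-prune step described above can be sketched in Python as follows. This is an illustrative sketch only, not the weight update of Eq. (4): the factor used here is a Pólya-urn (Dirichlet-multinomial style) predictive score with a hypothetical pseudo-count `alpha`, chosen simply because it grows when the appended haplotypes have already been seen in the stream. The function names and the representation of a stream as a (weight, haplotype counts, path) tuple are likewise assumptions.

```python
from collections import Counter

def extension_factor(counts, solution, alpha, n_types):
    """Score a candidate solution; higher when its haplotypes were seen before."""
    counts = Counter(counts)  # work on a copy of the stream's counts
    total = sum(counts.values())
    factor = 1.0
    for hap in solution:
        factor *= (counts[hap] + alpha) / (total + alpha * n_types)
        counts[hap] += 1
        total += 1
    return factor, counts

def extend_and_prune(streams, solutions, k, alpha=1.0):
    """Append every solution of the new trio to every stream, keep the best k."""
    n_types = 2 ** len(solutions[0][0])  # number of possible haplotypes
    candidates = []
    for weight, counts, path in streams:
        for index, solution in enumerate(solutions, start=1):
            factor, new_counts = extension_factor(counts, solution, alpha, n_types)
            candidates.append((weight * factor, new_counts, path + [index]))
    candidates.sort(key=lambda stream: stream[0], reverse=True)
    return candidates[:k]


# Toy data: two trios with two candidate solutions each, K = 2.
streams = [(1.0, Counter(), [])]
streams = extend_and_prune(streams, [("00", "11", "01", "10"), ("00", "11", "00", "11")], k=2)
streams = extend_and_prune(streams, [("00", "11", "00", "11"), ("01", "10", "01", "10")], k=2)
print(streams[0][2])  # prints [2, 1]
```

In the toy run the best surviving stream combines the two solutions that reuse the haplotypes "00" and "11", which is the behaviour the weighting is meant to encourage. Because only K streams survive each step, the work per trio is proportional to K times the number of candidate solutions of that trio.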

Definitions and Model Selection

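Let us assume that we have a set of trios, where $G_t$ denotes the genotypes of the $t$-th trio, i.e., $G_t = \{g_{t,f}, g_{t,m}, g_{t,c}\}$, with $g_{t,f}$, $g_{t,m}$ and $g_{t,c}$ the genotypes of the father, the mother and the child, respectively. Let $\mathbf{G}_t = \{G_1, \ldots, G_t\}$ denote the genotypes of the first $t$ trios. Let $H_t = \{h_{t,1}, h_{t,2}, h_{t,3}, h_{t,4}\}$, where $\{h_{t,1}, h_{t,2}\}$ are the two haplotypes of the first parent and $\{h_{t,3}, h_{t,4}\}$ are the two haplotypes of the second parent, and similarly define $\mathbf{H}_t = \{H_1, \ldots, H_t\}$.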
_{1}, ..., _{M }
_{1}, ..., _{y}

Let us consider the following dynamic model

• Initial state model _{θ }
_{0})

• State transition model _{θ }
_{t}
_{
t-1}) for

• Measurement model _{θ }
_{t}
_{t}

where _{θ}

In the next subsection, for the convenience of the reader, we present the form that the system update equations would have if the system parameters were known. We then make the connection to the real scenario, where the system parameters are not known.

TDS ESTIMATOR with known system parameters θ

We assume that by the time we have processed genotype _{t-1 }
_{θ }
_{
t-1}|_{
t-1}). When we process the individual _{t }
_{t}

Given the set of solution streams and the associated weights we approximate the distribution _{θ }
_{
t-1}|_{
t-1}) as follows:

where

and

From the previous relationships, if we knew the system parameters ^{ext }
^{th }trio, we would be able to approximate the distribution of _{θ }
_{t}
_{t}

where [

and

TDS Estimator with unknown system parameters θ

However, the system parameters are not known. Suppose now that their posterior distribution given _{
t
}and _{
t
}only depends on a set of sufficient statistics _{t }
_{t }
_{t}
_{t}
_{t }
_{
t-1}, _{t}
_{t}

Similarly to (1) we have:

Conditional on the haplotype of the ^{th }trio the genotype of that trio is unique and is independent of all the previous observations _{t-1 }
_{t-1 }
_{θ }
_{t}
_{t},G_{
t-1}) and consequently the integral ∫ _{t}
_{t},θ_{t}
_{
t-1}, _{
t-1}, _{t }

The recursion now lies only in computing the integral in (2).

In order to calculate the integral in the previous equation we will define the prior distribution for the parameters

Prior and Posterior Distribution for θ

Assuming random mating in the population it is clear that the number of each unique haplotype in

With mean

Next we will show that the posterior distribution for _{θ }
_{t}
_{t}

where we denote _{m}
^{th }trio and _{m }
_{t,i}
_{m }
_{t,i }

TDS-Estimator

We have that _{
t-1}) = _{1}(_{M}

and therefore we can calculate the integral in (2) as follows:

where

Having calculated the integral, we can go back to the recursion and assuming that we have approximated _{
t-1}|_{
t-1}), we can approximate _{t}
_{t}

The weight update formula is given by

Haplotype Block Partitioning

Again, we use the idea that haplotypes exhibit block structure, so that within each block the haplotypes exhibit limited diversity compared to the whole haplotype vectors. To define these blocks we use a Dynamic Programming (DP) algorithm similar to the one used in

Let us define ^{th }SNP, where total block entropy is the sum of the entropies of all the blocks. If _{i:j }
_{i:j }

More specifically, if there are $n$ distinct genotypes $\{g_1, g_2, \ldots, g_n\}$ in the block spanning SNPs $i$ to $j$, each of them with counts $\{a_1, a_2, \ldots, a_n\}$, then

for

When the DP algorithm was applied to the ST1, ST2 and ST3 datasets with the maximum allowed block size W set to 12, we obtained an average of 6 markers per block, with the smallest block being a single marker and the largest equal to W. On average, we had 22 distinct haplotypes per block, with their number ranging from 1 to 30.
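The quantity on which the partitioning operates, the entropy of a candidate block computed from the counts of its distinct genotypes, can be sketched in Python as follows. This is a minimal sketch under an assumption: it uses the standard Shannon entropy of the empirical genotype distribution, with relative frequencies obtained by dividing each count by the total, which may differ from the exact block entropy used by the DP algorithm. The function name `block_entropy` and the half-open column range are hypothetical.

```python
import math
from collections import Counter

def block_entropy(genotypes, start, end):
    """Shannon entropy (in bits) of the genotype patterns in columns start..end-1.

    `genotypes` is a list of equal-length genotype strings, one per individual.
    """
    # Count how often each distinct genotype pattern occurs within the block.
    counts = Counter(g[start:end] for g in genotypes)
    total = sum(counts.values())
    return -sum((a / total) * math.log2(a / total) for a in counts.values())


print(block_entropy(["00", "00", "11", "11"], 0, 2))  # prints 1.0
```

A block in which every individual carries the same pattern has zero entropy, and the value grows as the patterns become more diverse, which is why low-entropy blocks correspond to regions of limited diversity.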

Our algorithm is based on genotypes as opposed to haplotypes that were used in

Average Transmission Error Rate for Equal Block Partitioning TDS (Equal TDS)

| **Average Transmission Error Rate (%)** | **ST1** | **ST2** | **ST3** |
|---|---|---|---|
| TDS | 0.0039 | 0.0065 | 0.0320 |
| Equal TDS | 0.0113 | 0.0085 | 0.0360 |

Average number of Incorrect Trios per dataset for Equal Block Partitioning TDS (Equal TDS)

| **Incorrect Trios** | **ST1** | **ST2** | **ST3** |
|---|---|---|---|
| TDS | 0.95 | 1.6 | 5.4 |
| Equal TDS | 1.6 | 1.7 | 5.6 |

Partition-Ligation

In the partition phase the dataset is divided into small segments of consecutive loci using the haplotype block partitioning method described above. Once the blocks are phased, they are ligated together using the following method (an extension of the original method described in

The result of phasing for each block is a set of haplotype solutions, paired with their associated weights. Two neighbouring blocks are ligated by creating merged solutions from all combinations of the block solutions, each associated with the product of the individual weights, called the

Furthermore, the order in which the individual blocks are ligated is not predetermined. At each step we first ligate the blocks whose ligation would have the minimum entropy. This procedure allows us to ligate the most homogeneous blocks first, so that we have more certainty in the solutions we produce as the ligation procedure advances.
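The merging of neighbouring block solutions can be sketched in Python as follows. The sketch covers only the combination step stated above (all pairs of block solutions, each carrying the product of the two weights) for a single trio. The function name `ligate`, the representation of a block solution as a (weight, haplotype 4-tuple) pair and the `keep` cut-off are assumptions for illustration; how the merged candidates are subsequently re-weighted is not shown.

```python
from itertools import product

def ligate(left, right, keep):
    """Merge the solutions of two adjacent blocks for one trio.

    `left` and `right` are lists of (weight, haplotypes), where haplotypes is a
    4-tuple of strings in the same parent and transmission order in both blocks.
    """
    merged = []
    for (w_left, haps_left), (w_right, haps_right) in product(left, right):
        # Concatenate corresponding haplotypes across the block boundary.
        joined = tuple(a + b for a, b in zip(haps_left, haps_right))
        merged.append((w_left * w_right, joined))
    merged.sort(key=lambda item: item[0], reverse=True)
    return merged[:keep]


left = [(0.6, ("00", "11", "01", "10")), (0.4, ("01", "10", "00", "11"))]
right = [(0.9, ("1", "0", "1", "0")), (0.1, ("0", "1", "0", "1"))]
for weight, haps in ligate(left, right, keep=2):
    print(haps)
```

Because the transmitted and untransmitted haplotypes of each parent occupy fixed positions in a trio solution, the pieces from the two blocks can be joined position by position without ambiguity about which piece belongs to which chromosome.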

Summary of the proposed algorithm

In the partition phase the dataset is divided into small segments of consecutive loci using the haplotype block partitioning.

**Routine 1**:

• Enumerate the set of all possible haplotype vectors,

• Initialization: Find all possible haplotype assignments for each trio and rearrange the trios in ascending order according to the number of distinct haplotype solutions each one of them has. Use the first

• Update: For

∘ Find the ^{ext }
^{th }trio.

∘ For ^{ext}

■ Enumerate all possible stream extensions _{j }
_{
j,1}, _{
j,2}, _{
j,3}, _{
j,4}}

■ ∀_{j }

∘ Select and preserve ^{ext}

∘ ∀k, update the sufficient statistics

**TDS ALGORITHM**

• Partition the genotype dataset

• For

• Until all blocks are ligated

∘ Find the blocks that if ligated would produce the minimum entropy

∘ Ligate the blocks, following the procedure described in the Partition-Ligation section

Authors' contributions

XW and DA conceived of the study. AI, JW, DA and XW participated in the design of the study. AI performed the computer experiments and wrote the first draft of the manuscript. All authors read and approved the final manuscript.

Acknowledgements

We would like to thank the referees for their helpful suggestions, which led to important improvements of the manuscript.