Department of Mathematics and Computer Science Education, Taipei Municipal University of Education, Taipei 10048, Taiwan

Department of Statistics and Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina 27695, USA

Institute of Statistical Science, Academia Sinica, Taipei 11529, Taiwan

Department of Public Health and Institute of Epidemiology and Preventive Medicine, National Taiwan University, Taipei 10055, Taiwan

Bioinformatics and Biostatistics Core, NTU Center for Genomic Medicine, National Taiwan University, Taipei 10055, Taiwan

Research Center for Gene, Environment, and Human Health, College of Public Health, National Taiwan University, Taipei 10055, Taiwan

Abstract

Background

With the completion of the international HapMap project, many studies have been conducted to investigate the association between complex diseases and haplotype variants. Such haplotype-based association studies, however, often face two difficulties: the large number of haplotype configurations in the chromosome region under study, and the ambiguity in haplotype phase when only genotype data are observed. The latter can be handled with an EM algorithm that incorporates family data, whereas the former can be more problematic, especially when rare haplotypes are involved. Here, based on family data, we propose to cluster long haplotypes of linked SNPs according to their evolutionary relationship, so that the number of haplotypes is reduced and the power of statistical tests of association is increased.

Results

In this paper we employ family genotype data and combine a clustering scheme with a likelihood ratio statistic to test the association between quantitative phenotypes and haplotype variants. Haplotypes are first grouped based on their evolutionary closeness to establish a set containing core haplotypes. Then, we construct for each family the transmission and non-transmission phase in terms of these core haplotypes, taking into account simultaneously the phase ambiguity as weights. The likelihood ratio test (LRT) is next conducted with these weighted and clustered haplotypes to test for association with disease. This combination of evolution-guided haplotype clustering and weighted assignment in LRT is able, via its core-coding system, to incorporate into analysis both haplotype phase ambiguity and transmission uncertainty. Simulation studies show that this proposed procedure is more informative and powerful than three family-based association tests, FAMHAP, FBAT, and an LRT with a group consisting exclusively of rare haplotypes.

Conclusions

The proposed procedure takes into account the uncertainty in phase determination and in transmission, utilizes the evolutionary information contained in haplotypes, reduces the dimension of the haplotype space and the degrees of freedom of the tests, and, in simulation studies, is generally more powerful than the compared family-based tests. This evolution-guided clustering procedure is particularly useful for long haplotypes containing linked SNPs, and is applicable to other haplotype-based association tests. The procedure is implemented in R and is freely available for download.

Background

High-density sets of SNPs, especially haplotypes, have been used widely in genetic research to explore possible association with complex diseases. Haplotypes are considered to be biological units that contain more information about transmission, and thus may be better biomarkers for examining a disease susceptibility region. However, haplotype phase is often unknown when only genotype data are observed. This linkage phase ambiguity often leads to large degrees of freedom in statistical tests, and may result in the estimation of many haplotypes with rare frequencies. Collecting family genotype data may help determine haplotype phase if information from other family members can be incorporated and cross-referenced. Additionally, the use of family data can avoid spurious association arising from population admixture. Nevertheless, the statistical analysis of family data may not be straightforward. For instance, the nonparametric transmission disequilibrium test (TDT) and other similar tests utilize the transmitted and non-transmitted alleles (or haplotypes) to detect association. To handle uncertainty both in transmission and in phase, most procedures adopt an expectation-maximization (EM) algorithm in the computation of score statistics or likelihood functions, such as FBAT

As for addressing the problems that result when a large number of different haplotypes are involved in analysis and when certain haplotypes occur with very low frequency, several approaches have been adopted. Some studies have deleted haplotypes with small estimated frequencies

Under the assumption of random mating and with the use of transmitted and non-transmitted haplotypes from parents, we construct for family data a core set of haplotypes based on estimates of haplotype frequencies. In the following sections, we start with the notation used in Becker and Knapp

Methods

Notation

Following the same notation used in Becker and Knapp _{i }
_{i }
_{i }
_{i}

where

It is worth noting that, under the null hypothesis of no association, _{i }

Step 1 Clustering haplotypes

With the frequency approximating the "age" of haplotypes, a cladistic clustering approach is conducted based on their evolutionary relation. Similar to the haplotype clustering method for case-control studies in Tzeng

The selected core haplotypes are the leading

The superscript is the number of generations in the evolutionary tree, and the subscript stands for the order of haplotype frequency from large to small. Similarly, the set of haplotypes with one step mutation from ^{(0)} is denoted as ^{(1)}, and those ^{(m)}. For each set ^{(m)}, the corresponding haplotype frequencies are denoted as ∏^{(m)}, where ^{(m)} and ^{(m-1)}, an allocation matrix **B**^{(m)} is defined to represent the probability that a certain haplotype in ^{(m)} is a direct descendant of a haplotype in ^{(m-1)}

where (∏^{(m)})^{t} is the transpose of ∏^{(m)}. Detailed explanation is provided in Additional file

**Derivation of haplotype frequency estimates and haplotype explanation set for each family based on genotype data**.

Note that the above derivations do not require the information of haplotype phase for each family member. In other words, any software which provides estimates of haplotype frequencies can be applied at this stage. However, we prefer FAMHAP because it also computes for each family the compatible transmitted and non-transmitted haplotypes, along with the weights, which will be utilized in the following steps of recoding and testing.
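As a rough illustration of Step 1, the sketch below picks a core set by cumulative frequency and builds one allocation matrix. The 0.9 cutoff follows the rule of thumb mentioned in the Discussion. The function names, the toy haplotypes, and in particular the rule that a rare haplotype is allocated to its one-mutation neighbours in proportion to their frequencies are assumptions made for illustration; the exact definition is the one given in the Additional file.

```python
def select_core(freqs, threshold=0.9):
    """Return haplotypes, most frequent first, until the cumulative
    frequency reaches `threshold` (0.9 is the rule of thumb in the text)."""
    ranked = sorted(freqs.items(), key=lambda kv: kv[1], reverse=True)
    core, total = [], 0.0
    for hap, f in ranked:
        if total >= threshold:
            break
        core.append(hap)
        total += f
    return core


def hamming(a, b):
    """Number of SNP positions at which two haplotype strings differ."""
    return sum(x != y for x, y in zip(a, b))


def allocation_matrix(children, parents, freqs):
    """ASSUMED rule: row i holds the probability that children[i] descends
    from each parent, proportional to the parent's frequency among the
    parents that are exactly one mutation away."""
    rows = []
    for c in children:
        w = [freqs[p] if hamming(c, p) == 1 else 0.0 for p in parents]
        s = sum(w)
        rows.append([x / s if s > 0 else 0.0 for x in w])
    return rows


# Hypothetical estimated haplotype frequencies (4 SNPs, alleles coded 0/1).
freqs = {"0000": 0.50, "0001": 0.30, "0011": 0.12, "0010": 0.05, "1000": 0.03}
core = select_core(freqs)
# Level-1 haplotypes: outside the core, one mutation away from some core member.
level1 = [h for h in freqs
          if h not in core and any(hamming(h, c) == 1 for c in core)]
B1 = allocation_matrix(level1, core, freqs)
```

In this toy example the haplotype `0010` has two one-step ancestors in the core, so its row splits its mass between them, while `1000` has a single candidate ancestor and is allocated to it entirely.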

Step 2 Recoding transmission and non-transmission haplotypes for analysis

After determining the core haplotypes with updated frequencies, we begin to **B**^{(m)} defined earlier. Let matrix Γ^{m} be the product of **B**^{(m)}, **B**^{(m-1)}, ... and **B**^{(1)}; thus the row dimension of Γ^{m} indicates the number of rare haplotypes in ^{(m)}, and the column dimension of Γ^{m} stands for the number of core haplotypes in ^{(0)}. For instance, the ^{m} is

where ^{(m) }is to be clustered with the ancestor haplotypes in the core. Therefore, in the following the original frequency

The transmission status, along with its probability, haplotype phase ambiguity, and evolution uncertainty, can now be re-arranged, before statistical analysis, using equation (2). For instance, under the alternative hypothesis of association, the frequency of the ^{(m) }for the transmission group

and ^{(m)}. For the non-transmission group
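The recoding in Step 2 can be sketched as follows: the matrix Γ^{m} is formed as the product of the allocation matrices, and each family's weighted haplotype explanations are then redistributed over the core haplotypes. This is a minimal sketch; the helper names, the toy matrices, and the data layout (a list of haplotype and weight pairs per family) are hypothetical and chosen only to make the bookkeeping concrete.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def gamma(allocations):
    """allocations = [B1, B2, ..., Bm]; returns Bm x ... x B2 x B1,
    with one row per level-m haplotype and one column per core haplotype."""
    g = allocations[0]
    for b in allocations[1:]:
        g = matmul(b, g)
    return g


def recode(explanations, core, rare_to_core):
    """Spread weighted haplotype explanations over the core haplotypes.
    explanations: (haplotype, weight) pairs for one family, where the weight
    reflects phase ambiguity; rare_to_core maps a rare haplotype to its row
    of the corresponding Gamma matrix."""
    counts = {c: 0.0 for c in core}
    for hap, w in explanations:
        if hap in counts:
            counts[hap] += w
        else:
            for c, share in zip(core, rare_to_core[hap]):
                counts[c] += w * share
    return counts


# Toy example: two core haplotypes, two level-1 and one level-2 haplotype.
B1 = [[0.8, 0.2], [0.0, 1.0]]
B2 = [[0.5, 0.5]]
G2 = gamma([B1, B2])

core = ["A", "B"]
weighted = recode([("A", 0.7), ("r", 0.3)], core, {"r": G2[0]})
```

Here the rare haplotype `r` contributes its weight of 0.3 to the two core haplotypes in the proportions given by its row of Γ, so the total weight of the family's explanations is preserved.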

Step 3 Likelihood ratio test with clustered haplotypes (LRT-C)

We now derive the likelihood ratio test statistic with all parameters rewritten in terms of the core haplotype frequencies. This amounts to rewriting Becker and Knapp's

where

The _{j.Tr }
_{u.Tr }
_{k.NTr }
_{v.NTr }
_{l'}
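To make the form of a likelihood ratio test on core haplotypes concrete, the following simplified sketch compares weighted counts of transmitted and non-transmitted core haplotypes under a multinomial model. It is not the likelihood of Becker and Knapp used in the paper: treating the two groups as independent multinomial samples, and the function name and example counts, are assumptions made purely for illustration.

```python
import math


def lrt_homogeneity(transmitted, nontransmitted):
    """Likelihood ratio statistic for H0: both groups share the same
    core-haplotype frequencies. Inputs are (possibly fractional) weighted
    counts over the same ordered list of core haplotypes."""
    n_t, n_n = sum(transmitted), sum(nontransmitted)
    stat = 0.0
    for t, n in zip(transmitted, nontransmitted):
        pooled = (t + n) / (n_t + n_n)  # frequency estimate under H0
        for obs, total in ((t, n_t), (n, n_n)):
            if obs > 0:
                stat += 2.0 * obs * math.log(obs / (total * pooled))
    df = len(transmitted) - 1  # number of core haplotypes minus one
    return stat, df


# Hypothetical weighted counts for three core haplotypes.
stat, df = lrt_homogeneity([60, 30, 10], [40, 40, 20])
# With df = 2 the chi-square tail probability has the closed form exp(-x/2).
p_value = math.exp(-stat / 2.0) if df == 2 else None
```

The point of the sketch is the degrees of freedom: they depend on the number of core haplotypes rather than on the number of all observed haplotypes, which is how clustering reduces the dimension of the test. For other values of `df`, a chi-square survival function such as `scipy.stats.chi2.sf` would be needed.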

Results

To evaluate the performance of the likelihood ratio test with clustered haplotypes in family studies, we conduct simulations to first examine the reconstruction and identification of core haplotypes, and next to evaluate the impact of this clustering scheme on the likelihood ratio test. The results are compared with three family-based association methods, FAMHAP

Sampling scheme for simulations

The SNP haplotype sequences were first simulated with the coalescent-based whole-genome simulator GENOME ^{-6}/bp, and with 1,000 base pairs per fragment. The recombination rate between ten consecutive fragments was assumed to be 10^{-4}. Default settings were used for other parameters such as the mutation and migration rates of 10^{-6} and 2.5 × 10^{-4}, respectively. This resulted in 100 sequences with 972 SNPs. After deleting SNPs with minor allele frequency (MAF) less than 5%, 536 SNPs were left. Haploview

**LD plot**. LD plot of the simulated region.

**The complete plot of LD for all tag SNPs**.

**Minor allele frequencies**. Minor allele frequencies of the 13 SNPs consisting of the haplotype region.

**Haplotype frequencies**. Frequencies of the 15 haplotypes considered in simulation studies.
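The MAF filter used in the sampling scheme (dropping SNPs with minor allele frequency below 5%) can be sketched as below. The 0/1 allele coding, the function names, and the toy data are assumptions; the simulated sequences themselves came from GENOME and are not reproduced here.

```python
def minor_allele_frequency(column):
    """column: iterable of 0/1 alleles at one SNP across sampled sequences."""
    column = list(column)
    p = sum(column) / len(column)
    return min(p, 1.0 - p)


def filter_by_maf(haplotypes, min_maf=0.05):
    """haplotypes: list of equal-length 0/1 lists (one per sequence).
    Returns the indices of SNPs whose MAF is at least `min_maf`."""
    n_snps = len(haplotypes[0])
    return [j for j in range(n_snps)
            if minor_allele_frequency(h[j] for h in haplotypes) >= min_maf]


# Toy data: 40 sequences, 3 SNPs with allele-1 counts of 1, 20 and 36.
haplotypes = [[1 if i == 0 else 0, i % 2, 0 if i < 4 else 1] for i in range(40)]
kept = filter_by_maf(haplotypes)
```

In the toy data the first SNP has a MAF of 0.025 and is dropped, while the other two are retained.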

In the following simulations the number of families _{i }
_{1}
_{0 }

Penetrance values for simulation settings

| Additive: 10^{4} × f_{0} | Additive: 10^{4} × f_{1} | Additive: 10^{4} × f_{2} | Dominant: 10^{4} × f_{0} | Dominant: 10^{4} × f_{1} | Dominant: 10^{4} × f_{2} | Recessive: 10^{4} × f_{0} | Recessive: 10^{4} × f_{1} | Recessive: 10^{4} × f_{2} |
|---|---|---|---|---|---|---|---|---|
| 71 | 214 | 357 | 72 | 217 | 217 | 98 | 98 | 294 |
| 50 | 150 | 250 | 53 | 160 | 160 | 89 | 89 | 267 |
| 33 | 100 | 167 | 40 | 120 | 120 | 67 | 67 | 200 |
| 77 | 192 | 308 | 78 | 195 | 195 | 99 | 99 | 246 |
| 57 | 143 | 229 | 60 | 151 | 151 | 91 | 91 | 229 |
| 40 | 100 | 160 | 47 | 118 | 118 | 73 | 73 | 182 |
| 83 | 167 | 250 | 84 | 168 | 168 | 99 | 99 | 198 |
| 67 | 133 | 200 | 70 | 139 | 139 | 94 | 94 | 188 |
| 50 | 100 | 150 | 57 | 114 | 114 | 80 | 80 | 160 |

Numbers listed are scaled penetrance values (10^{4} × f_{0}, 10^{4} × f_{1}, 10^{4} × f_{2}) corresponding to different allele frequencies (
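The link between penetrance values and disease prevalence can be illustrated with the standard Hardy-Weinberg identity K = f_{0}(1-p)^2 + 2 f_{1} p(1-p) + f_{2} p^2, which fits the random-mating assumption stated earlier. In the sketch below the penetrances are taken from the first row of the table, whereas the allele frequency p = 0.1 is an assumed value chosen for illustration, and the function name is hypothetical.

```python
def prevalence(p, f0, f1, f2):
    """Population prevalence under Hardy-Weinberg proportions, where f0, f1,
    f2 are the penetrances for 0, 1 and 2 copies of the risk allele and p is
    the risk-allele frequency."""
    q = 1.0 - p
    return f0 * q * q + f1 * 2.0 * p * q + f2 * p * p


# First table row, rescaled from 10^4 x f to f; p = 0.1 is an ASSUMED value.
p = 0.1
first_row = {
    "additive": (0.0071, 0.0214, 0.0357),
    "dominant": (0.0072, 0.0217, 0.0217),
    "recessive": (0.0098, 0.0098, 0.0294),
}
K = {model: prevalence(p, *f) for model, f in first_row.items()}
```

Under this assumed allele frequency the three models give nearly the same prevalence (close to 0.01), which suggests that the penetrances within a row were chosen so that the models are comparable at a common prevalence.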

The prevalence

Identification of core haplotypes and tests of association

Once the family genotype data were simulated, they were used to estimate haplotype compositions, corresponding frequencies, and also the set of core haplotypes. Figure

**Average number of identified haplotypes under the additive model**. The number of haplotypes identified in the simulations under various numbers of families (

**Average of percentage of identified core haplotypes**. The average percentage of identified core haplotypes among the 10 true core haplotypes under different allele frequencies

After the core haplotypes are determined, the next step is to construct the likelihoods under the null and the alternative hypotheses, respectively, with all haplotype frequencies replaced by the revised core frequencies. This new modified likelihood ratio test with clustered haplotypes, LRT-C, is compared with a nonparametric score test in FBAT, an LRT with all rare haplotypes grouped into a single class (LRT-G), and an original test in FAMHAP. The resulting powers for

Numbers are the power of four family-based association tests at the 5% significance level with

| Additive: LRT-C | Additive: LRT-G | Additive: FAMHAP | Additive: FBAT | Dominant: LRT-C | Dominant: LRT-G | Dominant: FAMHAP | Dominant: FBAT | Recessive: LRT-C | Recessive: LRT-G | Recessive: FAMHAP | Recessive: FBAT |
|---|---|---|---|---|---|---|---|---|---|---|---|
| **0.956** | 0.914 | 0.931 | 0.916 | **0.898** | 0.835 | 0.860 | 0.864 | 0.058 | 0.059 | 0.062 | **0.087** |
| **0.997** | 0.991 | 0.990 | 0.987 | **0.926** | 0.888 | 0.863 | 0.888 | **0.382** | 0.334 | 0.341 | 0.156 |
| **0.944** | 0.900 | 0.896 | 0.887 | **0.409** | 0.382 | 0.353 | 0.338 | **0.941** | 0.920 | 0.909 | 0.117 |
| **0.815** | 0.734 | 0.786 | 0.743 | **0.737** | 0.643 | 0.683 | 0.652 | 0.067 | 0.063 | 0.071 | **0.083** |
| **0.952** | 0.917 | 0.912 | 0.891 | **0.804** | 0.748 | 0.730 | 0.753 | **0.256** | 0.209 | 0.212 | 0.089 |
| **0.852** | 0.794 | 0.798 | 0.773 | **0.322** | 0.295 | 0.277 | 0.267 | **0.805** | 0.753 | 0.726 | 0.094 |
| **0.535** | 0.438 | 0.473 | 0.422 | **0.446** | 0.381 | 0.400 | 0.368 | 0.049 | 0.052 | **0.054** | **0.054** |
| **0.769** | 0.696 | 0.676 | 0.645 | **0.519** | 0.439 | 0.452 | 0.426 | **0.153** | 0.138 | 0.141 | 0.060 |
| **0.698** | 0.646 | 0.607 | 0.575 | **0.201** | 0.193 | 0.192 | 0.165 | **0.490** | 0.438 | 0.444 | 0.068 |

Type I errors of the four family-based association tests at the 5% significance level

| LRT-C | LRT-G | FAMHAP | FBAT |
|---|---|---|---|
| 0.044 | 0.045 | 0.051 | 0.038 |
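Power and type I error figures such as those above are typically estimated as the proportion of simulated replicates in which a test rejects at the chosen level. The sketch below shows that calculation together with its Monte Carlo standard error. The number of replicates (1,000) and the p-values are hypothetical, since the replicate count is not stated in this passage.

```python
import math


def rejection_rate(p_values, alpha=0.05):
    """Proportion of replicates with p < alpha, plus its binomial
    (Monte Carlo) standard error."""
    n = len(p_values)
    hits = sum(1 for p in p_values if p < alpha)
    rate = hits / n
    se = math.sqrt(rate * (1.0 - rate) / n)
    return rate, se


# Hypothetical null simulation: 44 rejections out of 1,000 replicates.
p_values = [0.01] * 44 + [0.50] * 956
rate, se = rejection_rate(p_values)
```

The standard error gives a sense of how far an estimated type I error can drift from the nominal 5% through simulation noise alone, which is useful when comparing the four tests.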

The above Table _{2 }. ^{2 }+ _{1 }.

**Derivation of P(A|D)**.

**Plots of P(A|D) versus p under three genetic models**. The values of
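A quantity of the kind plotted here can be sketched under the assumption that P(A|D) denotes the frequency of the risk allele A among affected individuals and that genotypes follow Hardy-Weinberg proportions, in which case P(A|D) = (f_{2} p^2 + f_{1} p(1-p)) / K, with K the prevalence. The derivation used by the authors is in the Additional file; the function name and the allele frequency p = 0.1 below are assumptions for illustration.

```python
def case_allele_frequency(p, f0, f1, f2):
    """Frequency of the risk allele among affected individuals, assuming
    Hardy-Weinberg proportions; f0, f1, f2 are penetrances for 0, 1 and 2
    copies of the risk allele and p is its population frequency."""
    q = 1.0 - p
    k = f0 * q * q + 2.0 * f1 * p * q + f2 * p * p  # prevalence
    return (f2 * p * p + f1 * p * q) / k


# Penetrances from the first row of the penetrance table; p = 0.1 is ASSUMED.
p = 0.1
additive = case_allele_frequency(p, 0.0071, 0.0214, 0.0357)
recessive = case_allele_frequency(p, 0.0098, 0.0098, 0.0294)
```

With these inputs the enrichment of the risk allele among cases is much smaller under the recessive model than under the additive model, which is one way to see why a rare recessive allele is hard to detect with any of the tests.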

Performance evaluation under population admixture

Another issue of note concerns the effect of population stratification on the power of LRT-C. To investigate this effect, we performed further simulation studies with data generated from two populations via GENOME. Similar to the procedures described above, we extracted tag SNPs with Haploview and selected a block with 13 tag SNPs to construct genotype data for

Numbers are the power of four family-based association tests for population stratification data at the 5% significance level with

| Frequencies (two populations) | Additive: LRT-C | Additive: LRT-G | Additive: FAMHAP | Additive: FBAT | Dominant: LRT-C | Dominant: LRT-G | Dominant: FAMHAP | Dominant: FBAT | Recessive: LRT-C | Recessive: LRT-G | Recessive: FAMHAP | Recessive: FBAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (0.09, 0.08) | **0.918** | 0.903 | 0.868 | 0.793 | **0.860** | 0.855 | 0.771 | 0.716 | 0.057 | 0.066 | **0.091** | 0.061 |
| (0.43, 0.11) | **0.908** | 0.850 | 0.880 | 0.838 | **0.644** | 0.573 | 0.638 | 0.580 | **0.348** | 0.309 | 0.343 | 0.036 |
| (0.51, 0.55) | **0.847** | 0.838 | 0.819 | 0.710 | 0.273 | **0.279** | 0.268 | 0.188 | 0.877 | **0.884** | 0.857 | 0.065 |
| (0.09, 0.08) | **0.739** | 0.727 | 0.678 | 0.519 | **0.651** | 0.640 | 0.564 | 0.464 | 0.046 | 0.055 | **0.076** | 0.054 |
| (0.43, 0.11) | **0.778** | 0.702 | 0.752 | 0.682 | 0.442 | 0.380 | **0.470** | 0.352 | 0.217 | 0.198 | **0.232** | 0.046 |
| (0.51, 0.55) | 0.747 | **0.748** | 0.728 | 0.595 | 0.195 | 0.208 | **0.224** | 0.133 | **0.724** | 0.705 | 0.686 | 0.069 |
| (0.09, 0.08) | **0.446** | **0.446** | 0.400 | 0.261 | **0.361** | 0.346 | 0.337 | 0.233 | 0.040 | **0.044** | 0.078 | 0.046 |
| (0.43, 0.11) | 0.448 | 0.404 | **0.459** | 0.354 | 0.234 | 0.210 | **0.267** | 0.171 | **0.109** | 0.101 | 0.143 | 0.028 |
| (0.51, 0.55) | **0.544** | 0.527 | 0.507 | 0.355 | **0.142** | 0.137 | 0.141 | 0.091 | **0.370** | 0.362 | 0.359 | 0.041 |

The two proportions in parentheses (in the first column) indicate the two frequencies under each population, and

Discussion

In this paper, we have constructed a family-based association test using clustered haplotypes. The four key steps are: (1) to determine the core set on the basis of haplotype frequencies, (2) to perform the clustering procedure based on a haplotype cladogram, (3) to represent the rare haplotype frequencies in terms of the revised core frequencies, and (4) to incorporate the phase ambiguity, transmission uncertainty, and core-representation variability via likelihood weights. Our simulations show that both haplotype reconstruction and core identification perform well with more than 91% accuracy for cases where number of families

One issue that merits discussion concerns the number of haplotypes in the core set. As a rule of thumb, we selected the several leading haplotypes with 0.9 cumulative frequency. This choice is somewhat arbitrary. In fact, there is a trade-off between the increase in information (represented via frequency) and the reduction in dimensionality. A possible alternative, depending on the sample size and number of dimensions under consideration, would be to use Shannon's information with a penalty function. This criterion works by finding

There are several potential applications for the association test presented here. In many association tests, the chi-square approximation can be poor due to the existence of many haplotypes and/or rare haplotypes. Our clustering procedure may improve the performance of such statistical methods. Although we only demonstrate its impact on the likelihood ratio test, we believe other tests would benefit from this clustering procedure as well. For instance, after the haplotype phase and transmission status are identified and recoded via the core for each family, the tests in TRANSMIT or FBAT or other kinds of haplotype inference

Conclusions

For family genotype data, we consider an evolution-guided clustering tool that clusters rare haplotypes in order to achieve dimensional reduction, and a parametric likelihood ratio test that accounts for the uncertainty associated with transmission phase. This procedure is able to preserve biological information and to improve statistical testing power. Simulation studies of long haplotypes with SNPs in LD show that the proposed likelihood ratio test with clustered haplotypes (LRT-C) outperforms FAMHAP, FBAT, and a naïve LRT-G.

Authors' contributions

MHL implemented the algorithm, performed the simulations, and drafted the manuscript. JYT suggested the evolution idea in family-based association study and revised the manuscript. SYH interpreted the concept of likelihood ratio tests and revised the manuscript. CKH conceived the research, supervised the study and finalized the manuscript. All authors read and approved the final manuscript.

Acknowledgements

MHL was supported in part by NSC 99-2118-M-113-001. JYT was supported by NIH grants R01MH84022 and P01CA142538. SYH and CKH were supported in part by NSC 97-2314-B-002-040-MY3.