| Generalizing Terwilliger's likelihood approach: a new score statistic to test for genetic association1Department of Medical Statistics and Bioinformatics, Leiden University Medical Center, PO Box 9600, 2300 RC Leiden, The Netherlands 2Department of Thrombosis and Hemostasis, Leiden University Medical Center, Leiden, The Netherlands 3Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, Washington, USA
BMC Genetics 2007, 8:63doi:10.1186/1471-2156-8-63 The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1471-2156/8/63
©
2007 el Galta et al; licensee BioMed Central Ltd. AbstractBackground:In this paper, we propose a one degree of freedom test for association between a candidate gene and a binary trait. This method is a generalization of Terwilliger's likelihood ratio statistic and is especially powerful for the situation of one associated haplotype. As an alternative to the likelihood ratio statistic, we derive a score statistic, which has a tractable expression. For haplotype analysis, we assume that phase is known. Results:By means of a simulation study, we compare the performance of the score statistic to Pearson's chi-square statistic and the likelihood ratio statistic proposed by Terwilliger. We illustrate the method on three candidate genes studied in the Leiden Thrombophilia Study. Conclusion:We conclude that the statistic follows a chi square distribution under the null hypothesis and that the score statistic is more powerful than Terwilliger's likelihood ratio statistic when the associated haplotype has frequency between 0.1 and 0.4 and has a small impact on the studied disorder. With regard to Pearson's chi-square statistic, the score statistic has more power when the associated haplotype has frequency above 0.2 and the number of variants is above five. BackgroundHaplotype analysis is a popular strategy to analyze multiple single nucleotide polymorphisms (SNPs) typed within one candidate gene [1,2]. Most of the genetic variation of a gene is described by 4 to 8 relatively common haplotypes (frequencies > 0.05). To compare the distribution of observed haplotype frequencies between cases and controls, the global test statistic of Schaid et al. [3] can be used when haplotype ambiguity exists. When no uncertainty in phase exists, for example in regions with little recombinations, the simple Pearson's chi square statistic is used to test for association. Both statistics test for any difference in haplotype frequencies between cases and controls and the degrees of freedom depend on the number of observed haplotypes. However, researchers often aim to identify one ancestral haplotype [4]. For this situation one may consider a test which summarizes the difference between the null hypothesis and the alternative hypothesis in one parameter, yielding a more powerful one degree of freedom test. This paper aims to derive such a method. For multi-allelic markers, Terwilliger [5] proposed a model with one parameter (λ) for the excess frequency of the associated allele in the cases. To test for association he proposed a likelihood ratio test, i.e. comparing the likelihood under the alternative to the likelihood under no association, for testing. Since it is unknown which allele is associated, the likelihood is a weighted sum of conditional likelihoods of the data given that the allele is associated over the observed marker alleles. As for weights Terwilliger [5] proposed to use the allele frequencies in the population from which the cases and controls were sampled. Thus the most common allele is assigned the highest weight in the analysis. However, for haplotype analysis it is usually not the most common haplotype in the population which is associated to the disease. Therefore in this paper, we propose a model which assigns constant weight to each haplotype. We derive a score statistic from the likelihood of this model. The log likelihood function of Terwilliger's model also appears to have some unusual statistical features. For example, the allele frequencies that are used as weights are also unknown parameters in the conditional likelihood functions. Maximizing the log likelihood function appears not always straightforward. Further its distribution under the null hypothesis is unknown. Terwilliger proposed to approximate the distribution of the likelihood ratio (TLR) statistic under the null hypothesis with the 50 : 50 mixture of chi square distributions of null and one degrees of freedom. This distribution appeared to yield conservative p-values [6]. Another property of the likelihood function is that the score function, the first derivative of the log likelihood function with respect to λ evaluated at λ = 0 is a constant zero for any observed data for the weights that Terwilliger proposed. Hence the corresponding score statistic does not exist. Score statistics are often preferred over likelihood ratio statistics, because they have often a tractable expression which provides insight in the data and are robust against model deviations [7]. When constant weights instead of allele frequencies as weights are used the score statistic exists. Under the null hypothesis, the statistic follows a normal distribution. In this paper we derive a method for the extreme situation of phase unambiguity, which means that we study genetic regions with almost no recombination [8]. This assumption is reasonable for many genes, for example fibrinogen alpha, beta and gamma [9] and CRP [10]. When some uncertainty about phase exists, the p-values should be adapted. Our program [11] is able to compute valid p-values for the situation of phase ambiguity by permutation of case and control status. The program calls the publicly available software TAGSNPS [8] to derive the haplotypes in each permutation step (see also Becker et al. [12] for similar methods). The methods derived in this paper are meant to detect a common haplotype. In general genetic association studies have power to detect common variants. Thus these methods should detect associations under the common disease-common variant hypothesis. However, it is also possible that a rare causal mutation is sometimes present on the identified haplotype. For example Hylckama Vlieg et al. [13] detected a common haplotype that was overrepresented in patients with deep venous thrombosis compared to healthy controls. Many homozygous carriers of this haplotype carried the factor V Leiden mutation, the most important genetic risk factor for deep venous thrombosis, which had been discovered before by other methods [14]. Thus if homozygous carriers of the associated haplotype were sequenced the factor V Leiden mutation would have been detected. We illustrate the method by an association analysis of three candidate genes, fibrinogen alpha (FGA), beta (FGB) and gamma (FGA)in the Leiden Thrombophilia Study (LETS) [9]. In these data, no phase ambiguity was present and the pair of haplotypes could be derived for each individual. The number of common haplotypes varied from 3 to 5. The frequency of the associated haplotype varied from 0.2 to 0.3. In this paper, we consider the score statistic as an alternative to the classical chi-square and the original TLR statistic of [5] in the context of haplotypes. We also include the likelihood ratio statistic corresponding to the log likelihood using equal weights. We carried out a simulation study to examine the performance of the four statistics under the null hypothesis and to compare the power of the four statistics under various alternatives. Finally we illustrate the proposed statistics by analysis of the haplotype distributions of the FGG, FGB and FGA genes. ResultsSimulation studyBy means of a simulation study we compared the performance of the new score statistic Table 1. Type I error rates and power of the statistics X2, For all m, the type I error rates of the score statistic To study the power of the statistics, we generated 10,000 samples of 100 case chromosomes and 100 control chromosomes from the multinomial distributions with probabilities p1(1 - λ) + λ and pj(1 - λ) for j = 2⋯m for cases and pj j = 1⋯m for controls, respectively. First, we considered the model used by Terwilliger [5]. The most common haplotype frequency p1 in controls was again set to 0.5 and this haplotype was more frequent in cases. The parameter λ was fixed to 0.5 which corresponds to a haplotype frequency of 0.75 in the cases. The number of haplotypes m was again set to 4, 5, 8, 10, 15 and 20. The results are shown in the right columns of Table 1. For m < 8, the power of Pearson chi square statistic was good. The Second, we studied the power of the Pearson's chi square,
For p1 = 0.5, the score statistic Especially for common variants with frequency p1 of 0.3 or 0.2 and a small impact on the disease (λ ≤ 0.1), the score statistic performed better than TLR statistic. For m = 5 and m = 8 and p1 = 0.3, the gain in power of the score statistic compared to TLR statistic was about 4% and 8% for λ of 0.05 and 0.1 respectively. For m = 5 and p1 = 0.2, both statistics performed similar. For m = 8 and p1 = 0.2 the gain in power of the score statistic was large, namely 7% and 12% for λ of 0.05 and 0.1 respectively. Data exampleWe applied the three statistics to a study on haplotype association of fibrinogen alpha (FGA), beta (FGB) and gamma (FGG) with risk of deep venous thrombosis in LETS [15,16]. Fifteen haplotype tagging SNPs were typed in 474 cases and 474 controls [9]. Within the three genes, the SNPs were in high linkage disequilibrium ( Table 2. Descriptives and results of genetic association on LETS For all genes, haplotype H2 appeared to be more frequent in the cases than in the controls. For FGG, FGA and FGB the allelic odds ratios of presence of H2 versus the rest was 1.34, 1.29 and 1.28 respectively. Note that these odds ratios were rather similar while the p-values of the corresponding chi square statistics were different namely, 0.005, 0.034 and 0.090 respectively. The difference in p-values was caused by the difference in degrees of freedom of the chi square statistics and the frequencies of other haplotypes. From the results of the standard chi-square statistics we concluded that only FGG was significantly associated to thrombophilia. The p-values of the TLR were respectively 0.003, 0.012 and 0.072. These p-values were in line with the estimates of λ, namely 0.08, 0.07 and 0.05 respectively. Since FGA and FGB both had 5 variants, the frequency of the associated haplotype H2 was 0.3 and 0.2 respectively and the λ's were rather small, the score statistic should have more power than TLR for these genes. On the other hand TLR may perform better on the FGG gene, because of the larger difference in frequency between cases and controls and because the frequencies in the pooled set are around 1/m. For this situation, the score statistic The p-values of the score statistic Although for each subject we were able to derive the pair of haplotypes, we also estimated the p-values of the statistics by 1000 permutations using TAGSNPS to estimate the haplotype counts in each permutation step. The results were similar. DiscussionIn this report we have derived a new score statistic to test for association between a candidate gene and a binary trait. Score statistics are easy to compute, locally most powerful, and robust to small model deviations under the alternative [7]. From our simulations it appears that under the null hypothesis the statistic follows a chi-square distribution. For candidate genes with a small impact on the disease (λ small), five to eight observed variants, and a frequency of the associated variant in the range of 0.2 to 0.4, the new statistic performs better than the Terwilliger's LR statistic. For larger frequencies (around 0.5) or for a large number of variants (m > 8), Terwilliger's LR statistic appears to perform better, because the weight corresponding to the associated haplotype is larger. Compared to Pearson's chi-square statistic, the score statistic has more power when the associated haplotype has frequency larger than 0.2 and the number of observed variants m is larger than 5. Especially when the frequency of the associated haplotype is small (< 0.2), Pearson chi-squared statistic performs similar or better than the score statistic. We used a mixture of two chi-square distributions to approximate the distribution of Terwilliger's LR statistic under the null hypothesis. It is known that this distribution appears to give conservative p-values [6]. Permutations of the case-control status may result in better p-values. However, permutations are not attractive when no simple formulae for the statistic exists. When the sampled population is not in Hardy Weinberg equilibrium, the considered statistics are not valid since they assume that the haplotypes of a subject are independent. For Pearson's chi-square statistic on haplotypes, a multi-allelic version of Armitage's trend test on pairs of haplotypes is available [17,18]. Under Hardy Weinberg equilibrium these statistics are asymptotically equivalent. To derive a one degree of freedom score statistic for pairs of haplotypes, a logistic regression model can be used with the number of associated haplotypes as covariate. Then the total likelihood is the average of the likelihoods over these logistic regression models of possible associated haplotypes. Here more research is needed. In this paper we used a smoothing approach to deal with the fact that the associated haplotype is unknown. As alternative, one may consider taking the maximum over each haplotype. A disadvantage of the latter approach is that permutations are needed to derive a p-value [19]. In this paper we assumed that the haplotypes could be derived from the typed tagging SNPs. When uncertainty in phase exists the haplotype frequencies have to be estimated using for example an EM algorithm [20] and the uncertainty has to be taken into account when the p-values are derived. This can easily be done by using permutations. Our program [11] is able to compute valid p-values in case of ambiguous phase by calling TAGSNPS [8] to derive the haplotype counts in each permutation step. It may be more efficient to incorporate phase unambiguity in the likelihood function. The gain in efficiency will probably be small. Analyzing genes with little information on phase yields many relatively rare haplotypes and interpretation of the results is often hard. Using permutations to deal with haplotype uncertainty is in the line with recent developments in weighted regression analysis [21] and in imputation of haplotypes [22]. The new score statistic has no power when the frequency of the associated haplotype in the pooled set of cases and controls is around ConclusionWe conclude that by choosing alternative weights, in particular constant weights, in the likelihood of Terwilliger, a set of new powerful and robust statistical tests was derived. For genetic association studies aiming to identify common associated variants, the new statistic should be used in addition to the standard Pearson's chi square statistic. By using both statistics more insight in the data can be obtained. A program [11] which computes the statistics and corresponding p-values is freely available. MethodsLet m be the number of haplotypes describing most of the genetic variation in a gene. Assume that the haplotype frequencies are in Hardy-Weinberg equilibrium proportions. Let p = (p1,⋯, pm) be the vector of haplotype frequencies in controls. Assume that only one haplotype denoted with index i is over-represented in the cases, then the haplotype frequencies in the cases can be modelled as qi = pi + λ(1 - pi) and qj = pj - λpj for j ∈ (1,⋯,i - 1, i + 1,⋯,m). Let x = (x1,⋯,xm) and y = (y1,⋯,ym) be vectors of haplotype counts in the cases and the controls, respectively, and let n1 and n2 be the total number of case chromosomes and of control chromosomes, respectively, and n = n1 + n2. Then the conditional likelihood Li given that haplotype i carries the mutation is given by and the likelihood proposed by Terwilliger is equal to with Lj given in formula (1). It is easy to see that likelihood function (2) can be generalized to the following likelihood function: with Lj given in formula (1) and w = (w1,⋯,wm) a vector of known positive weights restricted by For known allele frequencies pj, the distribution of the Uw under H0 can be approximated by the normal distribution with zero mean and variance Often the haplotype frequencies are unknown and have to be estimated from the data. Under the null hypothesis we can estimate the frequencies from the combined sample of cases and controls by under H0, Note that the variance where Under the alternative hypothesis of the presence of one positively associated haplotype i, the expectation of with qi = pi + λ(1 - pi). Note that AbbreviationsLR: Likelihood ratio with equal weights TLR: Terwilliger's likelihood ratio FGA: fibrinogen alpha FGB: fibrinogen beta FGG: fibrinogen gamma SNP: single nucleotide polymorphism LETS: the Leiden thrombophilia study Authors' contributionsREG participated in method development, wrote a C code, carried out the simulation study and data analysis, participated in interpreting results, and in drafting of the manuscript. SUdeW and MCHdeV provided the data and participated in interpreting the results. QH designed the web interface and extended the C code to account for phase uncertainty. LH participated in method development and in interpreting results. JJH-D participated in method development and in interpreting results, and participate in drafting of the manuscript. All authors read and approved the final version. AcknowledgementsThis study was supported by Program Grants from the Netherlands Organization for Scientific Research (NWO 912-03-014 (REG, QH) and NWO 912-02-036 (SUdeW, MCHdeV)). References
Have something to say? Post a comment on this article! |



on Google Scholar







author email
corresponding author email
Figure 1.




















