Abstract
Background
The genetic association analysis using haplotypes as basic genetic units is anticipated to be a powerful strategy towards the discovery of genes predisposing human complex diseases. In particular, the increasing availability of highresolution genetic markers such as the singlenucleotide polymorphisms (SNPs) has made haplotypebased association analysis an attractive alternative to single marker analysis.
Results
We consider haplotype association analysis under the populationbased casecontrol study design. A multinomial logistic model is proposed for haplotype analysis with unphased genotype data, which can be decomposed into a prospective logistic model for disease risk as well as a model for the haplotypepair distribution in the control population. Environmental factors can be readily incorporated and hence the haplotypeenvironment interaction can be assessed in the proposed model. The maximum likelihood estimation with unphased genotype data can be conveniently implemented in the proposed model by applying the EM algorithm to a prospective multinomial logistic regression model and ignoring the casecontrol design. We apply the proposed method to the hypertriglyceridemia study and identifies 3 haplotypes in the apolipoprotein A5 gene that are associated with increased risk for hypertriglyceridemia. A haplotypeage interaction effect is also identified. Simulation studies show that the proposed estimator has satisfactory finitesample performances.
Conclusion
Our results suggest that the proposed method can serve as a useful alternative to existing methods and a reliable tool for the casecontrol haplotypebased association analysis.
Background
Genetic association analysis aims to detect genedisease association through linkagedisequilibrium of a disease susceptibility gene with adjacent genetic markers. Historically, association analysis was limited to single markers. It is anticipated that greater power may be gained by utilizing linkagedisequilibrium information from multiple markers simultaneously. This anticipation, together with recent advances of the availability of highresolution genetic markers, in particular the singlenucleotide polymorphisms (SNPs), has motivated the use of haplotypes, which are specific combinations of closely linked genetic markers on a chromosome, as the basic genetic units for association analysis. In addition, the biological advantage for haplotypebased analysis is that it can directly identify unique chromosomal segments that contain disease susceptibility genes by assessing the haplotypespecific risk for disease. Schaid [1] provided a detailed and excellent review for haplotypebased association analysis.
The populationbased casecontrol study design has been popular for genetic association analysis due to its costefficiency in collecting the data. If the haplotype pair for each individual is directly observable, traditional logistic regression analysis can be applied to assess the haplotypedisease association, possibly adjusting for environmental/demographical factors, and to evaluate the haplotypeenvironment interactions. According to Prentice and Pyke [2], with casecontrol data, maximum likelihood estimation of the association (oddsratio) parameters in logistic regression model can be simply carried out by fitting a prospective logistic model and ignoring the casecontrol design. Note that, in traditional logistic regression analysis, no modeling assumptions are made on the distribution of the covariates (haplotype and/or environmental factors), that is, the covariate distribution is treated fully nonparametrically.
Usually, the haplotype information is not directly observed and is subject to ambiguity because we can only observe the "unphased" genotype data where the "phase information", i.e., the arrangement of alleles on each of the two chromosomes, is unavailable. There has been rich literature of haplotype inference in general populations, see for example the EM algorithms by Excoffier and Slatkin [3] and Li et al. [4], and the Bayesian methods in Stephens et al. [5] and Niu et al. [6]. To recover the phase information and to ensure the identifiability of the association parameters, in general we need to impose some modeling assumptions on the distribution of haplotype pairs; see Epstein and Satten [7] for related issues when no environmental covariates are considered. One common assumption for such a model is HardyWeinberg equilibrium (HWE) in the general population. When environmental covariates are considered, further assumptions regarding relationship between haplotype pairs and environmental factors may be required. One convenient and generally reasonable assumption for this is the haplotypeenvironment independence [8,9], which assumes that a subject's haplotypepair features are independent of his/her environmental exposures.
In this work, to assess the haplotypeenvironment associations with disease phenotype, we will first propose a novel modeling setup that is based on a multinomial logistic regression model, where the various combinations of disease and haplotypepair categories are treated as multinomial outcomes, and the environmental/demographical factors are used as covariates. We show that the proposed multinomial logistic model can be decomposed into a prospective logistic disease model relating the haplotype and environmental factors with disease, as well as a parametric model for the haplotypepair distribution, conditional on the environmental covariates, among the control population. Compared to some existing methods such as the one proposed by Spinka et al. [9], our proposal differs from theirs in the way to model the haplotypepair distribution: In our proposal the model for the haplotypepair distribution is specified only for the control population, while in the method of Spinka et al. such a model is specified for the whole population. Epstein and Satten [7] uses the same modeling strategy as ours, but their proposal cannot allow for environmental covariates. Our proposal is equivalent to the method of Epstein and Satten in the absence of environmental covariates and extends their method to incorporate environmental covariates.
The major advantage of the proposed approach lies in its computational convenience; it can be simply implemented by an iterative reweighted fit of a multinomial logistic regression model. Note that, when the model for the haplotypepair distribution is specified for the whole population as in the method of Spinka et al. [9], the true intercept parameter in the prospective disease model, which quantifies the baseline population disease risk, appears as a separate parameter and needs to be estimated. Since usually there is little information on this parameter in a casecontrol sample, Spinka et al. [9] suggested using external information on the population disease prevalence or the raredisease approximation to avoid estimating the true intercept parameter. In contrast, in our proposal the true intercept parameter does not appear as a separate parameter and hence needs not be estimated even when the disease is common and the population disease prevalence is unknown. Hence the proposed method has wide applicability. Simulation results reveal that the finite sample properties of the proposed estimates are satisfactory.
Methods
Model
Let D denote the disease phenotype with D = 1 indicating disease and D = 0 indicating no disease, G the unphased multilocus genotype for a series tightly linked SNPs, and X the demographical/environmental covariates. Suppose that there are J possible haplotypes denoted by {ħ_{0},..., ħ_{J1}} with ħ_{0 }indicating the "reference" haplotype (usually the most frequent haplotype). Let (H^{1}, H^{2}) be the haplotype pair (diplotype) for a subject and label all possible haplotype pairs by H = 0, 1, ..., M  1 so that "0" represents the reference haplotype pair (ħ_{0}, ħ_{0}) and M is the number of all possible haplotype pairs in the sample.
The multinomial logistic model is a useful tool for regression analysis with multinomial responses [10,11]. Considering the various combinations of (D, H) as multinomial responses, we propose to base the haplotype association analysis on the following multinomial logistic model:
with β_{0 }= β_{0X }= 0 and m(0, X; γ) = 0 for all X, where m(H, X; γ) is a known but arbitrary function of (H, X). Throughout the paper, a prime denotes matrix transposition.
To see the meanings of the parameters involved, rewrite (1)(3) as
β_{0 }= β_{0X }= 0 and m(0, X; γ) = 0 for all X. We can see that the parameters α and β_{X }quantify respectively the baseline disease risk and the effect of environmental exposures on the disease risk for subjects with the reference haplotype pair. The function m(H, X; γ) specifies the distribution of the haplotype pairs among the control population given the covariates X. The parameters β_{h }and β_{hX }(h = 1,..., M  1) measure, in the retrospective oddsratio scale, respectively the main effect of the haplotype pair h (relative to the reference pair) and the interaction between the haplotype pair h and the covariates X. Note that although β_{h }and β_{hX }are specified through the retrospective model Pr(HX, D), they are equivalent to the corresponding parameters specified in the prospective disease model Pr(DX, H). In fact, according to (1)(3), the joint probability of (D, H) given X is obtained as
where
Λ_{ih }(X; θ) = i (α + β_{h }+ X' β_{X }+ X' β_{hX}) + m(h, X; γ),
θ = (α, β_{X}, β_{h}, β_{hX}, γ; h = 0,..., M  1) with β_{0 }= β_{0X }= 0 and m(0, X; γ) = 0 for all X. The prospective disease model Pr(DH, X) is then given by
Therefore, all the association (odds ratio) parameters regarding the associations between haplotypeenvironment factors and disease phenotype in the proposed model are equivalent to those specified in a prospective disease risk model. Note that the models (4)–(6) are equivalent to the prospective disease risk model (8) together with the model (5) for the haplotypepair distribution in the controls.
The proposed modeling setup can allow flexible modeling of the effects of haplotype pairs. Take β_{h }= β_{hap }where Z_{h }is a vector of coded values for the haplotype pair h according to a certain genetic law, and β_{hap }is the vector of associated regression coefficients quantifying the marginal haplotype effects relative to the reference haplotype. For example, to evaluate the effect of a specific haplotype ħ_{*}, we can set Z_{h }= I (h^{1 }= ħ _{*}) I (h^{2 }= ħ_{*}) for a recessive genetic law, Z_{h }= I (h^{1 }= ħ_{*})+ I (h^{2 }= ħ_{*})  I (h^{l }= ħ_{*}) I (h^{2 }= ħ_{*}) for a dominant law, and Z_{h }= I (h^{l }= ħ_{*}) + I (h^{2 }= ħ_{*}) for a multiplicative law, where I (A) is the indicator function which equals one if the event A occurs and equals zero otherwise. More examples for the specification of the genetic effects can be found in Epstein and Satten [7] and Zhao et al. [12]. Similarly, by taking = X_{*}β_{int }we can evaluate the interactions β_{int }between haplotypes and a chosen environmental covariate X_{*}. Although the above examples focus on effects associated with one single haplotype, using a collection of Z_{h }variables for multiple causal haplotypes we can fit extensive models with multiple causal haplotypes; see Results, Application to the hypertriglyceridemia study for an illustration.
Various models for the haplotypepair distribution in the control population can also be incorporated into the proposed model with suitable specifications of the function m(H, X; γ). Note that a saturated model for the haplotypepair distribution is not identifiable from the unphased genotype data [7], so certain modeling constraints must be imposed to ensure identifiability. One convenient specification is to assume that, in the control population, the haplotypepair distribution is independent of the covariates X and satisfies the Hardy Weinberg equilibrium (HWE), namely.
Pr(H^{1 }= h_{j}, H^{2 }= ħ_{k}X, D = 0) = Pr(H^{1 }= h_{j}, H^{2 }= ħ_{k}D = 0) = π_{j}π_{k}, j, k = 0, ..., J  1,
where π_{j }is the marginal frequency of haplotype ħ_{j }in the control population. Such a distribution corresponds to the specification m(H, X; γ) = γ, where W_{H }is a Jvector with the jth component given by I (H^{1}= ħ_{j}) + I (H^{2 }= ħ_{j}), and the jth component of γ is related to π by γ_{j }= log (π_{j}/π_{0}), j = 0,..., J  1. A more general assumption is to allow the HWE holds only within each of the strata defined by some categorical covariate S, and the corresponding specification of m(H, X; γ) can be expressed as m(H, X; γ) = m(H, S; γ) = γ_{S}, where W_{H }is denned as above and γ_{S }is a stratumspecific parameter with the strata determined by S. For example, when population stratification exists, then the strata S can be defined by the subpopulations or ethnicity groups to account for the violation of HWE caused by population stratification. Another way to relax the HWE assumption is to introduce the fixation parameter [13] into the model m(H, X) for Pr(HX, D = 0); see Satten and Epstein [14] for details.
Maximum likelihood estimation
Let N_{1 }and N_{0 }be respectively the number of cases and controls in the sample, and N = N_{1 }+ N_{0 }the total sample size. For the uth subject in the casecontrol sample, u = 1,..., N, suppose that the observed data are (D_{u}, G_{u}, X_{u}), including the disease status D_{u}, the unphased genotype G_{u}, and the environmental covariates X_{u}. The haplotype data H_{u }cannot be uniquely determined from the unphased genotype data G_{u }if the uth subject is heterozygous at more than one SNP.
Let S(G) be the set of labels of haplotype pairs that are consistent with unphased genotype G. Using (7), the probability of (D, G) given X can be expressed as
In the following discussions we will assume that models for β_{H}, β_{HX }and m(H, X; γ) are specified in such a way that all the parameters involved are identifiable from prospective studies. In Appendix 1 we provide identifiability conditions of β_{H }and β_{HX }for a given model m(H, X; γ).
The loglikelihood for the observed data in a casecontrol sample is given by
where Pr(D, HX; θ) is given by (7), p(X) denotes the marginal density function of X, and We leave p(X) fully unspecified. Using the profile likelihood approach to profile out the nuisance parameter p(X), the retrospective loglikelihood (10) can be translated into a prospective loglikelihood.
The profile likelihood
Let α^{* }= α + log(ν_{1}/ν_{0}) with ν_{i }= N_{i}/{NPr(D = i)}, i = 0,1. Define θ^{* }as θ with α replaced by α^{*}. Under the multinomial logistic model Pr(D, HX; θ) given in (7), the retrospective loglikelihood for the unphased genotype data in a casecontrol sample can be shown to be equivalent to the prospective loglikelihood
where Pr^{*}(D, HX; θ^{*}) is defined as Pr(D, HX; θ) with α replaced by α^{*}. The derivation is relegated to Appendix 2.
This result is parallel to the classic one given by Prentice and Pyke [2] when haplotype data are directly observed. Note that in the prospective likelihood L_{P }the original intercept α is absorbed into α^{* }and does not appear as a separate parameter. Hence α is not identifiable and estimable, unless supplementary information on the population disease prevalence Pr(D = 1) is utilized to separate α from α^{*}. Further, following the derivation in Prentice and Pyke [2] or Scott and Wild [15], we can show that the estimated covariance matrix for all the parameters except α^{* }can be obtained from the corresponding submatrix of the observed information matrix based on L_{P}. Therefore, under the proposed modeling setup the maximum likelihood inference on all the parameters except α can be based on the prospective loglikelihood L_{P}, with the casecontrol sampling design being ignored. Another consequence is that, using the result of Scott and Wild [15], using external information on Pr(D = 1) in our proposal only affects the inference of the intercept parameter and does not affect the efficiency of estimates for all the other parameters.
EM algorithm
The maximum likelihood estimation based on L_{P }can be simply implemented by the EM algorithm. Write ∂L_{P}/∂θ^{* }= ∑_{u }Ψ_{u }(θ^{*}), and let
which is the uth subject's "completedata" score function based on Pr^{*}(D, HX; θ^{*}); see Appendix 3 for its explicit expression. It can be seen that
where the expectation in the last equation is with respect to the conditional probability Pr^{*}(H_{u }= hD_{u}, G_{u}, X_{u}; θ^{*}) ≡ ρ_{hu}(θ^{*}) that is defined as
Note that with the proposed model Pr(H = hD, G, X; θ) does not depend on α, hence can be readily evaluated even if α cannot be reliably estimated, which is usually the case in casecontrol studies. This property is not shared by other modeling strategies such as those in Spinka et al. [9] and Zhao [12], unless the raredisease assumption is made.
Equation (11) suggests that the maximum likelihood estimate of θ^{* }can be obtained by the following EM algorithm. Given the estimate of θ^{* }from the rth iteration, denoted by ^{*(r)}, we first evaluate the posterior haplotypepair distribution ρ_{hu}(θ^{*}) at θ^{* }= ^{*(r)}. Then we obtain the updated estimate ^{*(r+1) }by solving θ^{* }from
where ρ_{hu }= ρ_{hu }(θ^{*(r)}). The process is repeated until some convergence criterion is met. Thus, with the proposed EM algorithm, maximum likelihood estimation with unphased genotype data under the proposed model can be readily implemented by iteratively fitting a weighted multinomial logistic regression model, with the iterativeupdated weights ρ_{hu}(θ^{*}) given by (12). It can be seen that the EM algorithm above can accommodate not only the unphased genotype but also the missing genotype data.
When solving (13), we apply the NewtonRaphson algorithm with the negative derivative of ∑_{u }∑_{h }ρ_{hu }(θ^{*}) with respective to θ^{* }approximated by the simple positivedefinite matrix
where V^{*}(·X) denotes the variance with respect to Pr^{*}(D, HX; θ^{*}); see Appendix 3 for the derivation. Note that ^{C }is also an approximation for the "completedata" information matrix based on Pr^{*}(D, HX; θ^{*}).
Let ^{* }be the final estimate given by the EM algorithm, provided convergence is achieved. Then ^{* }is the maximum likelihood estimate of θ^{*}, which is asymptotically normal. The covariance matrix for the maximum likelihood estimates of all the parameters, except the intercept parameter α^{*}, can be estimated by the corresponding submatrix of the inverse of the observed information matrix based on L_{P}. Following Louis [16], the observed information matrix based on L_{P }can be obtained by
The EM algorithm described above is found to be sensitive to the initial estimate of γ, the parameter associated with the haplotypepair distribution in the control population, in that the algorithm may fail to converge when using inappropriate initial estimates of γ. To obtain an adequate initial estimate of γ, we can adopt an approach similar to the proposal in Zhao et al. [12] to obtain the initial estimate for γ based on the control data only. For example, when the haplotype distribution in the controls follows HWE and is independent of environmental covariates, we can obtain initial estimate for γ, or equivalently the haplotype frequencies π in the controls, by solving π_{j}, j = 0,..., J  1 from
where
and π^{(1) }denotes the solution of π in the previous iteration. Our numerical experiments reveal that, starting with assigning equal frequency to each haplotype, two iterations in (14) are sufficient to yield an adequate initial estimate of γ.
To ensure stable estimation during the EM algorithm, for a haplotype with very small estimated frequency (e.g. <e^{10}), we will fix its frequency to be 0 and drop the corresponding parameter from γ.
We have developed a generalpurposed SAS Macro for implementing the proposed method, which is available upon request from the corresponding author.
Results
Application to the hypertriglyceridemia study
The proposed methodology is applied to the hypertriglyceridemia study conducted at National Taiwan University Hospital (Kao el al. [17], Tzeng et al. [18]), where 303 healthy controls and 290 cases, defined as serum triglyceride level > 400 mg/dl, were recruited. One primary objective of this study is to assess the association between the haplotypes in apolipoprotein A5 gene (APOA5) and hypertriglyceridemia in humans, adjusting for the environmental covariates Age, Sex, and BMI, and to explore the potential haplotypeenvironment interactions.
Based on the control genotype data for 5 polymorphic sites in APOA5 (IVS3+476, c.457, c.553, c.1177, c.1250), 6 common haplotypes are derived by the EM algorithm, including GGGCT (66.8%), AGGCC (15.3%), GGTCT (4.2%), GAGTT (9.5%), AGGCT (1.2%), GGGTT (0.9%), which are labeled as ħ_{j}, j = 0,..., 5, and the most frequent haplotype ħ_{0 }= GGGCT is chosen as the reference haplotype. We then fit a multinomial logistlic model, where the disease risk model (8) is specified by β_{h }= β_{hap }with Z_{h }= {I(h^{1 }= ħ_{j}) + I (h^{2 }= ħ_{j}), j = 1,..., 5}, and the environmental covariates X including Age, Sex, and BMI. The haplotypeenvironment interaction terms include only the interactions of the haplotype GGTCT with Age and BMI, since these are the only promising interactions according to preliminary analysis. The model (5) for the haplotypepair distribution in the controls is specified by m(h, X; γ) = γ_{0h }+ γ_{1h }S, where γ_{0h }= γ_{0 }and γ_{1h }= γ_{1 }with Z_{h }defined as above, and S = I(BMI > 23.2), the indicator of BMI being larger than its mean in the controls. By this model we allow the dependence between BMI and haplotypes; the dependence between other environmental factors (Age, Sex) and haplotypes is less significant in light of preliminary analysis, and hence is not considered in the final analysis.
The analysis results are displayed in Table 1. Relative to the common haplotype GGGCT, the three hapiotypes AGGCC, GGTCT, and GAGTT are associated with higher risk of hypertriglyceridemia; the former two were also identified elsewhere by haplotypespecific analysis (Pennacchio et al. [19], Kao et al. [17]), and here by joint analysis of multiple haplotype effects we further identify GAGTT as a potential risk haplotype. The significant interaction term suggests that the effect associated with the haplotype GGTCT is modified by age: older carriers of the GGTCT haplotype have decreased risk for hypertriglyceridemia than younger carriers. The estimates of γ_{1 }(data not shown) reveal that BMI and the haplotypes GGTCT and GGGCT are dependent in the control population.
Table 1. Analysis results of the hypertriglyceridemia data
Simulation studies
The first simulation study is to examine the finite sample performance of the proposed method. In each replication, data on the environmental variable X are simulated from a standard normal distribution. Given X, the haplotypepair distribution in the control population is assumed to be Pr(H^{1 }= ħ_{j}, H^{2 }= ħ_{k}X, D = 0) = Pr(ħ_{j}S, D = 0) Pr(ħ_{k}S, D = 0) with S = I(X > 0), namely the distribution of H follows HWE within each of the strata defined by whether X > 0 or not. This specification corresponds to a situation where the hapiotypes and environment covariates are dependent and hence the geneenvironment independence assumption does not hold in the control population; see Chatterjee and Carroll [8] for some examples where gene and environmental factors may be dependent. Following Satten and Epstein [14], we assume the haplotypes contain 5 tightlylinked SNPs, and the associated haplotype frequencies {Pr(ħ_{j}S,D = 0), j = 0,..., J1} in stratum S are listed in Table 2 for S = 0, 1. By convention, we choose the most common haplotype ħ_{0 }= "10011" as the reference haplotype. Further, we choose the haplotype ħ_{4 }= "01100" as a putative susceptibility haplotype. Data on the haplotype pair and disease phenotype are then generated according to the multinomial logistic model (7), with β_{H }and β_{HX }specified respectively as β_{H }= β_{hap}Z_{H }and β_{HX }= β_{int}XZ_{H}, where we take Z_{H }to be either I(H^{1 }= ħ_{4}) I(H^{2 }= ħ_{4}), I(H^{1 }= ħ_{4}) + I(H^{2 }= ħ_{4})  I(H^{1 }= ħ_{4})I(H^{2 }= ħ_{4}), or I(H^{1 }= ħ_{4}) + I(H^{2 }= ħ_{4}) when a recessive, dominant, or multiplicative genetic law is assumed, and β_{hap }and β_{int }are respectively the marginal effect of the risk haplotype ħ_{4 }and the interaction between this haplotype and X. A casecontrol sample with 415 controls and 796 cases is then selected from a larger set of data. When analyzing the data, we ignore the phase information for haplotype data and use only the unphased genotype data. Further, we allow the genotype data to be missing independently and randomly for SNPs 1–5 with probabilities 2.9, 5.6, 5.4, 4.5, and 2.3%, respectively.
Table 2. Haplotypes and frequencies used in simulation studies
In each setting considered, we set α = 3, β_{X }= 0.3 and β_{hap }= 0.1. The haplotypeenvironment interaction parameter β_{int}, which is the main focus here, is set to 0, 0.15, or 0.3. The results based on 500 replications are displayed in Table 3. The point and standard error estimates for the association parameters are essentially unbiased, and the coverage of the 95% confidence intervals is close to the nominal value. In particular, the size of the Wald test for testing H_{0 }: β_{int }= 0 essentially attains its nominal value, and the associated power for detecting haplotypeenvironment interaction is satisfactory when β_{int }= 0.3. The point and standard error estimates for the haplotype frequencies also match the true values well (results not shown). We also conduct analysis where the effects of rare haplotypes (with frequency 1%–5%) are estimated (data not shown). The point estimate essentially remains unbiased, while the standarderror estimate underestimates the true value. As commented by a referee, here a permutationbased procedure may be required for proper inference.
Table 3. Summary statistics for the first simulation studies
In the second simulation study, we compare the proposed method with some existing alternatives, including the methods by Zhao et al. [12] and by Spinka et al. [9]. To facilitate the comparison among these methods that are based on different assumptions, we conduct the simulation under a setting where the disease is rare, and the haplotypepair distribution in the whole population follows HWE and is independent of the environmental covariate. All the methods considered can apply well under this setting. Specifically, in each simulation we first simulate the environmental covariate X from a standard normal distribution, and then simulate the haplotype pairs in the whole population according to HWE with the marginal haplotype frequencies given in the third column (S = 0) of Table 2. The disease phenotype is then generated by the logistic regression model
logit {Pr(DH, X)} = α + β_{H}H + β_{X}X + β_{HX}HX, (15)
where β_{H }= β_{hap}Z_{H}, β_{HX }= β_{int}Z_{H}X with Z_{H }= I(H^{1 }= ħ_{4}) I(H^{2 }= ħ_{4}), and α = 3, β_{X }= 0.3, β_{hap }= 0.1, β_{int }= 0 or 0.3. The phase information is ignored and only the unphased genotype information is used when analyzing the simulated data. We also allow the genotype data for SNPs 1–5 to be missing independently and randomly with probabilities as in the first simulation study.
When applying the methods of Zhao et al. [12] and Spinka et al. [9], we employ the same models used to generate the data, hence the model specification is fully correct. The method of Spinka et al. is implemented using either a gridsearch method or the known value of Pr(D = 1) to estimate α. When applying our proposal, we simply assume the control haplotypepair distribution is independent of X and satisfies HWE, which is in fact a moderately wrong specification in the current simulation setting.
Table 4 exhibits the simulation results based on 500 replications. Although applied with a moderate model misspecification, the estimates from our proposal have small bias, and are more efficient than estimates from other approaches. The method of Spinka et al. [9] is less efficient than our proposal, even if the former further incorporates supplementary information on population disease prevalence. It is worth noting that, although the method of Spinka et al. and our proposal are both based on maximum likelihood estimation, they are in fact based on different modeling frameworks, hence they may results in parameter estimates with different efficiency. Both our proposal and the method of Spinka et al. are much more efficient than the method of Zhao et al. [12], consistent with the finding of Satten and Epstein [14] that fully prospective analysis such as the method of Zhao et al. may lose considerable efficiency for haplotype association analysis in casecontrol studies.
Table 4. Results of comparison of various methods, including Zhao et. al. [12] (Zhao), Spinka et al. [9] using gridsearch (Spinka, grid) or supplementary information on Pr(D = 1) (Spinka, suppl.), and the proposed multinomial logistic regression method (Proposed)
To examine the sensitivity of the proposed method to the model assumptions, we conduct a simulation study to assess the bias of the estimates when the haplotypepair distribution is wrongly assumed to follow HWE in the proposed method. Specifically, here we generate the haplotype data in the controls from the model
where the fixation index parameter f = 0.1 or 0.2, and π_{j}, j = 0,...,16, are haplotype frequencies given in the third column (S = 0) of Table 2; namely, the control haplotypepair distribution does not follow HWE. However, we wrongly assume that HWE holds for this distribution in our analysis model. The environmental covariate X is generated from standard normal distribution, and the disease phenotype is generated according to (15). From the results shown in Table 5, we observe remarkable biases for the estimates of the main haplotype effect β_{hap }when the genetic law is recessive or dominant, though the estimates for the interaction parameter β_{int }is rather robust. This observation is consistent with that made by Satten and Epstein [14]: the methods based on the retrospective likelihood, such as our proposal, is generally less robust to the model assumptions such as HWE.
Table 5. Summary statistics for the third simulation studies
Discussion
In the absence of environmental covariates, Epstein and Satten [7] (see also Satten and Epstein [14]) constructed their likelihood using a fully retrospective parameterization based only on (5) and (6). Therefore, in the absence of environmental covariates, the retrospective likelihood in Epstein and Satten [7] is exactly equivalent to that considered in our proposal. Our proposal, though based on a retrospective likelihood, can be implemented as if it were a prospective likelihood by introducing a multinomial logistic model and applying the profile livelihood approach given in Methods section. This is a preferred feature since a prospective analysis is more straightforward than a retrospective one. Moreover, our proposal provides an important extension of the Epstein and Satten's method to further incorporate environmental covariates.
Recently, based on the work of Chatterjee and Carroll [8], Spinka et al. [9] developed a profile likelihood approach to genetic association analysis with missing genetic information. In particular, for haplotype association analysis with unphased genotype data, their approach is based on a prospective logistic disease model Pr(DH, X) together with a parametric model for Pr(HX), the haplotypepair distribution in the whole population given the environmental covariates. The implementation of the method of Spinka et al. generally requires estimating the true intercept parameter α in the prospective disease model (8). Since in a casecontrol sample there would be little information on this parameter due to the nature of biased sampling, the information matrix would he nearly singular, causing computational problems. Spinka et al. [9] suggested using either a gridsearch method or supplementary information on the population disease risk Pr(D = 1) to estimate α. They also suggested a raredisease approximation thereby estimation of α is not needed. Note that an earlier proposal by Stram et al. [20] is equivalent to the method of Spinka et al. when there are no environmental covariates and the information on population disease prevalence is used.
Although both the two methods have a prospective logistic model Pr(DH, X) for the disease risk, our proposal specifies a model for Pr(HX, D) = 0), the distribution of haplotype pairs in the control population given the covariates, while the method of Spinka et al. specifies the wholepopulation counterpart Pr(HX).
In practice, although the assumption of HWE in the whole population may usually imply the same assumption hold in the controls, on the contrary, HWE in the controls may not necessarily imply HWE in the whole population when the disease is not rare. Owing to the nature of biased sampling, in a casecontrol sample it would be more plausible to check distributional assumptions for the control population than for the whole population, since in casecontrol studies estimating the control distribution Pr(HX, D = 0) is more straightforward than estimating the population distribution Pr(HX). Consequently, the proposed modeling strategy seems more suited to the casecontrol design.
When the disease is rare, Spinka et al. [9] suggested a raredisease approximation when the disease prevalence is unknown. Their approximated likelihood has the same form as our proposal. Note that the validity of our likelihood does not rely on any raredisease assumption; it serves as an exact likelihood in its own right under the modeling setup we consider. This implies that, for a common disease with unknown disease prevalence, our proposal can still work well when a suitable model can be specified for the haplotypepair distribution in the controls, while the proposal of Spinka et al. with the raredisease approximation can result in severe bias in this case. To illustrate this, we conduct a simulation with the same setting as in the second simulation study described in the previous section, except that a is now set to 0.5, corresponding to a commondisease situation. In the case when (β_{X}, β_{hap}, β_{int}) = (0.3, 0.1, 0.3), the estimates of Spinka et al. with raredisease approximation have substantial biases (0.005, 0.76, 0.168) for these parameters.
Zhao et al. [12] proposed an estimating equation approach to haplotype association analysis, which is based on the score of a prospective likelihood; a similar prospectivelikelihood approach for haplotypeenvironment interaction is also proposed by Lake et al [21]. These approaches are very easy to implement, and are particularly suitable for the case where a larger number of SNPs are involved. Also, they are found to be quite robust to misspecification of the haplotypepair distribution. The main drawback for such approaches is that they may be remarkably inefficient as compared to the retrospective likelihoodbased methods such as our proposal and the method of Spinka et al. [9]; see Satten and Epstein [14] for a comprehensive efficiency comparison.
Conclusion
To assess the association of haplotype and environmental factors with disease, we have proposed a new modeling setup that can be integrated into a multinomial logistic model, and can also be decomposed into a prospective logistic model relating the haplotype and environmental factors with disease, as well as a parametric model for the haplotypepair distribution in the control population given the environmental covariates. The new proposal amounts to a natural extension of the method of Epstein and Satten [7] to further incorporate environmental covariates. The modeling strategy we adopt is very suited to the casecontrol design in the sense that, in contrast to the procedure proposed by Spinka et al. [9], the maximum likelihood estimation for the proposed model does not require any information on the population disease risk, which is usually lacking in a casecontrol sample. In fact, the maximum likelihood estimation for the proposed model with casecontrol data can be readily performed by applying a typical EM algorithm to a prospective multinomial logistic regression model. A SAS Macro implementing the proposed method is available from the corresponding author.
Note that the proposed method does not rely on specific modeling assumptions such as raredisease, geneenvironmental independence, and HardyWeinberg equilibrium assumptions, hence is applicable in very general settings, as long as the models involved are identifiable and appropriately specified. In addition, simulation results show that our proposal can achieve satisfactory efficiency. Accordingly, our proposal may serve as a useful tool for assessing the haplotypeenvironment associations with disease in populationbased casecontrol studies. One limitation is that, unlike the prospective analysis, the proposed method, which is based on a retrospective likelihood, is sensitive to the model assumptions; namely, model misspecification may lead to substantial bias, as commented by Epstein and Satten [7]. Therefore, the model specification is crucial in the proposed method to warrant valid analysis.
Some additional work is needed to strengthen the utility of the proposed method. First, if the main interest is in the association parameters and the haplotypepair distribution is regarded as nuisance, then it would be desirable to improve the proposed method so that it can still yield valid estimates for the association parameters while allowing the model for the haplotypepair distribution to be slightly misspecified.
Another important issue for haplotypebased association analysis involves the variable selection. When dealing with a larger number of haplotypes, efficient and effective methodologies for variable selection is crucial for finding the haplotypes contributing to liability for complex diseases [24,1]. Promising strategies include the stepwise selection [22], Lasso [23,24] and the false discovery rate (FDR) procedure [25,24]. How to incorporate appropriate variable selection procedures in the proposed multinomial logistic model with unphased genotype data deserves further investigation.
Authors' contributions
YHC contributed to the development of the statistical methodology, the conducting of the data analysis and the simulations, and the writing of the manuscript. JTK contributed to the design and management of the hypertriglyceridemia study, and helped explain the results of data analysis.
Appendix
Appendix 1 the identifiability conditions
Given a model for m(H, X; γ) (the haplotypepair distribution in the control population given covariates) and a fixed value of γ, identifiability conditions of the models for β_{H }(main effects of haplotype pairs) and β_{HX }(haplotypeenvironment interactions) can be obtained by arguments similar to those in Epstein and Satten [7]. Let β be the collection of the parameters involved in β_{H }and β_{HX}. According to (9), for any covariate value x observed in the case sample, the value of Pr(D = 1, G = gX = x;θ) remains unchanged for a change in β if there exists some vector φ such that φ'R_{g}(x) = c for every genotype g, where c is a constant and
Let R(x) be the matrix with the gth row given by R_{g}(x)'. Therefore, the models β_{H }and β_{HX }are identifiable if for any covariate value x we have: (1) R(x)'R(x) has full rank so that R(x)φ ≠ 0 for any φ ≠ 0, and (2) R(x)'1 = 0 where 1 denote a vector of ones.
Appendix 2 the derivation of the profile likelihood
Since we treat the distribution P of X nonparametrically, it is sufficient to assume that P is discrete and has support points {x_{1},...,x_{K}}, the unique values of X that are observed in the casecontrol sample. Let G = 0,..., Q  1 denote labels for the observed unphased genotypes in the sample with G = 0 denoting a reference genotype and Q the number of distinct genotypes. Let δ_{k }= Pr(x_{k}), k = 1,..., K, n_{igk }the number of subjects with D = i, G = g and X = x_{k}, and P_{igk}(θ) = Pr(D = i, G = gX = x_{k}; θ), i = 0, 1, g = 0,..., Q  1, and k = 1,..., K. The loglikelihood for the casecontrol sample is given by
The maximum likelihood estimate of δ for fixed θ, subject to ∑_{k }δ_{k }= 1, satisfies
where n_{++k }= ∑_{ig }n_{igk}, and
Substituting the right side of (16) for δ_{k }in L, the profile loglikelihood is obtained as, aside from additive constants,
where
It is easy to see that
and
From (9) we thus have
where
Λ_{ih}(X; θ^{*}) = i(α^{* }+ β_{h }+ X'β_{X }+ X'β_{hX}) + m(h, X; γ),
and α^{* }= α + log(ν_{1}/ν_{0}).
Appendix 3 expressions of completedata score and information matrix
Simple algebra leads to
where E^{*}(·X) denotes expectation with respect to Pr^{*}(D, HX; θ^{*}).
The negative derivative of (θ^{*}) is given by
Note that the bracket term has mean zero. Hence the completedata information matrix can be approximated by
where V^{*}(·X) denotes the variance with respect to Pr^{*}(D, HX; θ^{*}). The negative derivative, of ∑_{u }∑_{h }ρ_{hu}(θ^{*}) can thus be approximated by ∑_{u }∑_{h }ρ_{hu}v^{* }(X_{u}; θ^{*}) = ∑_{u}v^{* }(X_{u}; θ^{*}), where
Acknowledgements
This work was supported in part by a Research Grant from the Genomic Research Center, Academia Sinica (94B0012).
References

Schaid DJ: Evaluating associations of haplotypes with traits.
Genet Epidemiol 2004, 27:348364. PubMed Abstract  Publisher Full Text

Prentice RL, Pyke R: Logistic disease incidence models and casecontrol studies.
Biometrika 1979, 66:403411. Publisher Full Text

Excoffier L, Slatkin M: Maximumlikelihood estimation of molecular haplotype frequencies in a diploid populaton.
Mol Biol Evol 1995, 12:921927. PubMed Abstract  Publisher Full Text

Li SS, Khalid N, Carlson C, Zhao LP: Estimating haplotype frequencies and standard errors for multiple single nucleotide polymorphisms.
Biostatistics 2003, 4:513522. PubMed Abstract  Publisher Full Text

Stephens M, Smith NJ, Donnelly P: A new statistical method for haplotype reconstruction from population data.
Am J Hum Genet 2001, 68:978989. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Niu T, Qin Z, Xu X, Lin J: Bayesian haplotype inference for multiple linked singlenucleotide polymorphisims.
Am J Hum Genet 2002, 70:157169. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Epstein MP, Satten GA: Inference on haplotype effects in casecontrol studies using unphased genotype data.
Am J Hum Genet 2003, 73:13161329. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Chatterjee N, Carroll RJ: Semiparametric maximum likelihood estimation in casecontrol studies of geneenvironment interactions.
Biometrika 2005, 92:399418. Publisher Full Text

Spinka C, Carroll RJ, Chatterjee N: Analysis of casecontrol studies of genetic and environmental factors with missing genetic information and haplotypephase ambiguity.
Genet Epidemiol 2005, 29:108127. PubMed Abstract  Publisher Full Text

Agresti A: Categorical data analysis. New York, John Wiley & Sons; 2002.

Hosmer DW, Lemeshow S: Applied logistic regression. 2nd edition. New York, John Wiley & Sons; 2000.

Zhao LP, Li SS, Khalid N: A method for the assessment of disease associations with singlenucleotide polymorphism haplotypes and environmental variables in casecontrol studies.
Am J Hum Genet 2003, 72:12311250. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Hartl DL: A primer of population genetics. Sunderland Sinauer Associates; 1988.

Satten GA, Epstein MP: Comparison of prospective and retrospective methods for haplotype inference in casecontrol studies.
Genet Epidemiol 2004, 27:192201. PubMed Abstract  Publisher Full Text

Scott AJ, Wild CJ: Fitting regression models to casecontrol data by maximum likelihood.
Biometrika 1997, 84:5771. Publisher Full Text

Louis TA: Finding the observed information matrix when using the EM algorithm.

Kao JT, Wen HC, Chien KL, Hsu HC, Lin SW: A novel genetic variant in the apolipoprotein A5 gene is associated with hypertriglyceridemia.
Hum Mol Genet 2003, 12:25332539. PubMed Abstract  Publisher Full Text

Tzeng JY, Wang CH, Kao JT, Hsiao CK: Regressionbased association analyis with clustered haplotypes through use of genotypes.
Am J Hum Genet 2006, 78:231242. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Pennacchio LA, Oliver M, Hubacek JA, Cohen JC, Cox DR, Fruchart JC, Krauss RM, Rubin EM: An apolipoprotein influencing triglycerides in humans and mice revealed by comparative sequencing.
Science 2001, 294:169173. PubMed Abstract  Publisher Full Text

Stram DO, Pearce L, Henderson BE, Thomas DC: Modeling and EM estimation of haplotypespecific relative risks from genotype data for a casecontrol study of unrelated individuals.
Hum Hered 2003, 55:179190. PubMed Abstract  Publisher Full Text

Lake SL, Lyon H, Tantisira K, Silverman EK, Weiss ST, Laird NM, Schaid DJ: Estimation and tests of haplotypeenvironment interaction when linkage phase in ambiguous.
Hum Hered 2003, 55:5665. PubMed Abstract  Publisher Full Text

Cordell HJ, Clayton DG: A unified stepwise regression procedure for evaluating the relative effects of polymorphisms within a gene using case/control or family data: application to HLA in type 1 diabetes.

Tibshirani R: Regression shrinkage and selection via the Lasso.

Devlin B, Roeder K, Wasserman L: Analysis of multilocus models of association.
Genet Epidemiol 2003, 25:3647. PubMed Abstract  Publisher Full Text

Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing.