This paper describes a likelihood approach to model the relation between failure time and haplotypes in studies with unrelated individuals where haplotype phase is unknown, while dealing with the problem of unstable estimates due to rare haplotypes by considering a penalized log-likelihood.
The Cox model presented here incorporates the uncertainty related to the unknown phase of multiple heterozygous individuals as weights. Estimation is performed with an EM algorithm. In the E-step the weights are estimated, and in the M-step the parameter estimates are estimated by maximizing the expectation of the joint log-likelihood, and the baseline hazard function and haplotype frequencies are calculated. These steps are iterated until the parameter estimates converge. Two penalty functions are considered, namely the ridge penalty and a difference penalty, which is based on the assumption that similar haplotypes show similar effects.
Simulations were conducted to investigate properties of the method, and the association between IL10 haplotypes and risk of target vessel revascularization was investigated in 2653 patients from the GENDER study.
Results from simulations and real data show that the penalized log-likelihood approach produces valid results, indicating that this method is of interest when studying the association between rare haplotypes and failure time in studies of unrelated individuals.
In recent years there has been a great interest in associating haplotypes with complex disease phenotypes, and many statistical models have been described. These models are complicated by individuals that are heterozygous on two or more of these SNPs, because their haplotypes cannot be determined with certainty. Consider for instance two SNPs with alleles A or a, and B or b. Individuals that are heterozygous for both SNPs, have genotypes Aa and Bb, and they inherited either haplotype AB from one parent and ab from the other, or they inherited haplotypes Ab and aB. Hence, it is unknown whether these individuals have haplotype pair AB/ab or haplotype pair Ab/aB. This uncertainty complicates statistical inference and if the number of biallelic single nucleotide polymorphisms (SNPs) is large, or when allele frequencies of the SNPs are close to 50%, many individuals will be multiple heterozygous.
Most of the current methods focus on continuous or dichotomous outcome data [1-9], while only few can be applied in cohort studies [10,11]. Another concern is related to the presence of rare haplotypes, which is a very common problem in genetic association studies. In the present paper we adopt the suggestion of Tanck et al.  to use a weighted penalized likelihood method to estimate the association between a phenotype and the set of haplotypes, which may include rare haplotypes. We consider a model for the relation between a failure time T measured in N unrelated individuals and the haplotypes of these individuals formed by m SNPs measured in a single gene. Previously, Lin  has described a similar method for haplotype analysis in cohort studies, but this method did not include a penalty function for dealing with the unstable estimates of rare haplotypes.
In the sequel we will first describe the kind of data that we analyze, then we will described our statistical model, and the algorithm to estimate the parameters of our model. We will use simulated data to illustrate some characteristics of the estimators, and finally we will analyze real data from the GENDER study on cardiovascular disease .
Data and model
We consider a sample of i = 1,..., N unrelated individuals with failure- or censored-time ti. The indicator di is used to indicate whether ti is an event-time (di = 1), or a censored-time (di = 0). Let gi be a vector of m biallelic SNPs measured in individual i. With m biallelic SNPs there are j = 1,.., Nhap = 2m different haplotypes possible with population frequencies p1,..., pj,... pNhap.
Suppose all haplotypes were observed in all patients, then these could be represented with the vector xi of length Nhap, where xij equals 0, 1, or 2, depending on the number of haplotypes of type j observed in patient i. (Notice that ∑jxij = 2, meaning that only Nhap - 1 contrasts are identifiable.) The conditional hazard function for failure at ti given xi can then be specified as
where h0(t) is an unspecified baseline hazard function, and β a vector of regression parameters. The survival function S(ti|xi) equals
where H0(t) is the unspecified baseline cumulative hazard function.
In our case, haplotypes were not observed in all patients, and therefore we model the survival function conditional on the observed genotypes as a weighted mixture over all zi possible haplotype pairs given genotypes gi:
where Sq(ti|xiq) is the survival function specified as in equation (2), and xiq is the qth haplotype pair that is possible given genotypes gi, and wiq is the probability that individual i has haplotype pair q given genotypes gi: wiq = Pr(Xi = xiq|gi).
In most circumstances the population haplotype frequencies p1,..., pNhap are unknown, and must be estimated from the data at hand. But suppose these are known, then under the assumption of Hardy-Weinberg equilibrium the probability that an individual is carrying haplotype pair q = (h, r) is calculated as ph * pr (h, r = 1,..., Nhap). Consequently, when observing genotype vector gi the probability that individual i has haplotype pair q = (h, r) equals:
where the summation takes place over all haplotype pairs that are compatible with genotypes gi and dhri is an indicator function, which = 1 when the haplotype pair (h, r) is compatible with gi and 0 otherwise.
Given the weights wiq the full likelihood of the data given the model in equation (3) equals
and the regression parameters β, and the baseline cumulative hazard function H0(t) may be estimated by maximizing (5). If H0(t) is a parametric function, maximization of (5) is uncomplicated. If H0(t) is a nonparametric function, we follow a similar argument as Breslow . If events occurred at times τ1,..., τj,..., τk, we shift all censored observations in the interval [τj-1, τj) to τj-1, and assume (piecewise) constant hazard in the intervals. In that case the logarithm of the likelihood (5) reduces to
where λj is the jump in the cumulative baseline hazard function at time τj, and I(τi > τj) is an indicator function.
Estimation equations for H0(t) and β can be derived by equating to zero the first-order derivatives of the log-likelihood in (6). For λj we find
and for βℓ (ℓ = 1,..., Nhap):
Since in practice the population haplotype frequencies p must be estimated together with β and H0(t), and because depends on H0(t) and β, direct maximization of (5) or (6) is complicated. Instead we propose to use an EM algorithm. For that we consider the covariate vector xi as missing, and we maximize the expectation of the joint log-likelihood of L((ti, di), xi|gi) over the posterior probabilities of xi given the observed genotypes gi, and given the observed failure/censoring times (ti, di) and given current estimates of the parameters β, H0(t), and wiq:
The EM algorithm consists then of iterating two steps. In the M-step of iteration a + 1, (β, H0(t), wiq = pjpl/Σprps) are estimated by maximizing (10) given evaluated using (β(a), H0(t)(a), ), and in the E-step, is re-estimated given using (11) with (β(a+1), H0(t)(a+1), ). Estimation equations of β and H0(t) are the same as in (8) and (7), but replaced with , and these are solved iteratively. The haplotype relative frequencies pj are estimated as
where I(j, q) is an indicator-function denoting whether haplotype j is part of haplotype-combination q. Standard errors of β can be derived from the information-matrix of the log-likelihood in equation (6):
Since we are mainly interested in the uncertainty of β, we only used that part of the hessian that pertains to β. Notice that the first term of (14) equals the hessian-matrix that is used in the M-step of the EM algorithm.
Mutations in general tend to be rare, and so are the haplotypes in which they are encompassed. Furthermore, when ten loci are considered there are 210 = 1024 different haplotypes possible, many of which will have low frequency in samples up to thousands of individuals. If haplotypes have low frequency their associated hazard ratio estimate will be unstable. We used a penalized log-likelihood method to obtain more stable parameter estimates. Basically, we optimized the penalized log-likelihood ℓp, defined as , where Pen(β) is the penalty function.
As penalty functions we considered the well-known ridge-penalty function , and a difference-penalty function (Pen(β) = ∑a∑baab(βa - βb)2 (a > b)), where aab is a fixed and known value representing the similarity of haplotypes a and b. We quantified the similarity between haplotypes (aab) by counting the number of shared alleles which – with m loci – varies between zero and m - 1.
The penalty parameter λ was found by optimizing the cross-validated log-likelihood (CVL) as described by Verweij and van Houwelingen : CVL(λ) = lnLBreslow(βλ) - c(λ), where lnLBreslow(βλ) is the log-likelihood evaluated with the penalized log-likelihood estimate βλ. The factor c(λ) is an approximation of the effective dimension of the model, which in our case depends on the log hazard ratios β, the baseline hazard function H0(t), and the haplotype frequencies p. For convenience sake, we approximated the effective dimension in the same manner as Verweij and van Houwelingen  as
where Ui(βλ) is the contribution of individual i to the first-order derivative of the unpenalized log-likelihood, and Hλ (βλ) is the matrix of second-order derivatives of the penalized log-likelihood evaluated at βλ, H0(t), and p. Notice that the last term of (15) is equal to the third term of (14).
Although the penalized likelihood estimates of β are somewhat biased, and it is therefore somewhat unclear how to interpret standard errors, we nevertheless assessed the stability of the penalized estimates by a parametric bootstrap procedure. We took 200 bootstrap samples, estimated λ in each sample by optimizing CVL, and derived standard errors from the distribution of the associated penalized estimates of β, p, and H0(t). The number of bootstrap samples used was based on results from simulations, in which the number of bootstrap samples varied from 10 to 1000. These data showed that SE estimates were relatively stable after 100 to 200 bootstrap samples.
The EM algorithm presented in this paper was programmed in MATLAB® R 7.0 (The MathWorks, Natick, MA, USA) as well as in a set of R-functions and is freely available upon request from the corresponding author.
To illustrate some characteristics of our approach, simulations were carried out. In each replicate, a data set of 200 (simulation 1 and 2) or 2000 (simulation 3) individuals was created in whom 3 loci were measured. We simulated the 8 haplotypes (x1,..., x8) to have frequencies of: p000 = 0.62, p001 = 0.05, p010 = 0.02, p011 = 0.005, p100 = 0.02, p101 = 0.003, p110 = 0.002 and p111 = 0.28. Given the haplotypes drawn for a specific individual i, the survival time S was drawn from the exponential distribution with log(intensity) equal to ∑jβjxij. A censoring time C was independently drawn from an independent log-normal distribution such that in about 25% of all individuals C <S, in which case the survival time was censored at C. In each replicate, the haplotype effects were estimated using three models: 1) unpenalized (similar to  and ), 2) ridge penalized and 3) difference penalized. The statistical properties were evaluated using three different measures, namely the mean bias of the parameter estimates, the mean SE and the coverage probability, which is defined as the probability that the 95% confidence interval of the parameter estimate contains the true theoretical value of the parameter estimate.
Furthermore, for each haplotype the percentage of replicates which identified the haplotype as being significantly associated with the outcome (i.e., power or Type I error rate) was calculated. The significance level used to calculate the power and the Type I error rate was set to α = 0.05. In addition, the effect of omitting rare haplotypes (011, 101 and 110) or fixing their effect to the close haplotype 111 on the effect estimates was investigated in the unpenalized models only. For simulation 1 and 2, 500 replicates were carried out, whereas 100 replicates were carried out in simulation 3.
In the first simulation, replicates contained 200 individuals and the haplotypes with a rare allele at locus 2 (010, 011, 110 and 111) were simulated to have a relative risk of 2 compared to the most frequent haplotype 000, corresponding with a regression parameter of β = 0.69. This simulation mimics a setting in which the observed effects are due to a single SNP. The observed genotype counts of a random replicate are given in Table 1. In total, 75 (37.5%) individuals were heterozygous on multiple loci of which 63 were heterozygous at all three loci. In 80% of the replicates, all eight haplotypes were present. The mean bias, mean SE, coverage probability and percentage estimates with a p < 0.05 for this simulation based on the 80% 'complete' replicates are shown in Table 2. For the less frequent haplotypes (011, 110 and 101), the introduction of the ridge or difference penalty leads to a substantial reduction in the mean SE as compared to the non-penalized model. On the other hand, the mean SE of the more frequent haplotypes increases somewhat in the penalized models. The effect on the power to detect the effects of the haplotypes 010, 011, 110 and 111 as well as on the type I error probabilities of the haplotypes 001, 100 and 101 is less univocal. In the second simulation, replicates also contained 200 individuals and the haplotypes 001 and 101 were simulated to have a relative risk of 3 (β = 1.10) compared to the reference haplotype 000. This simulation mimics a setting in which a particular allele combination on loci 2 and 3 (i.e. ×01) is related to an increased risk. In 85% of the replicates, all eight haplotypes were available. The mean bias, mean SE, coverage probability and percentage estimates with a p < 0.05 for this simulation based on the 85% 'complete' replicates are shown in Table 3. Similar to simulation 1, inclusion of a penalty leads to a large reduction in mean SE of the less frequent haplotypes. The effect of haplotype 101 is estimated more precisely, but the power to find a significant effect for this haplotype is reduced from 25% in the non-penalized model to a value similar to the type I error probability of the haplotypes without an effect in both penalized models. The effect of introducing a penalty on the mean bias, mean SE and power to detect an significant effect of the more frequent haplotype 001 are only minor.
Table 1. Numbers of individuals with the various genotypes on three loci in a simulation of 200 individuals
Table 2. Realized βa, mean bias, mean standard error (SE), coverage probability and percentage of the effects that had a p-value < 0.05 in the simulation where haplotypes with a rare allele at locus 2 had a modeled parameter estimate of 0.69. Each replicate contained 200 individuals
Table 3. Realized βa, mean bias, mean standard error (SE), coverage probability and percentage of the effects that had a p-value < 0.05 in the simulation where haplotypes 001 and 101 had a modeled parameter estimate of 1.10. Each replicate contained 200 individuals.
For both simulations, removal of the rare haplotypes 011, 110 and 101 from the model leads to a small reduction in the bias of the remaining haplotypes (e.g. from 0.048 to 0.038 and from 0.019 to 0.005 for haplotypes 010 and 111, respectively). Depending on the modeled effects of the omitted haplotypes, the bias in the estimate of haplotype 111 decreases (simulation 1: from 0.019 to 0.006) or increases (simulation 2: from 0.005 to 0.011) a little.
In a third simulation, replicates contained 2000 individuals and the haplotypes 001 and 101 were simulated to have a relative risk of 3 (β = 1.10) compared to the reference haplotype 000. This simulation mimics a setting similar to simulation 2, but due to the larger number of individuals, all 8 haplotypes will be present in all individual replicates. The mean bias, mean SE, coverage probability and percentage estimates with a p < 0.05 for this simulation are shown in Table 4.
Table 4. Realized βa, mean bias, mean standard error (SE), coverage probability and percentage of the effects that had a p-value < 0.05 in the simulation where haplotypes 001 and 101 had a modeled parameter estimate of 1.10. Each replicate contained 2000 individuals.
In the GENDER study  3146 patients with cardiovascular disease who were treated with percutaneous transluminal coronary angioplasty (PTCA) with or without stents were followed for at least twelve months for the occurrence of clinical restenosis and revascularization of the vessel which was originally treated with PTCA. Inflammatory processes are involved in such target vessel revascularization (TVR), and the level of inflammatory response is controlled by several genes, among which possibly the IL10 gene. We determined in 2653 patients the variants of four SNPs in this gene (IL10 -592G > T, IL10 -2849G > A, IL10 -1082G > A, and IL10 4251A > G), and evaluated their association with TVR risk. Overall, there were 252 TVR, and TVR risk was 9% at nine months, and 10.5% at twelve months. Rare allele frequencies were 28%, 49%, 27%, and 24%, of the four SNPs, respectively. All four markers were in linkage disequilibrium (P < 0.001), and Hardy-Weinberg equilibrium was not rejected for any of the markers (P > 0.0125; significance level (Bonferroni) corrected for multiple testing).
Univariately, in a Cox model assuming co-dominant effects, IL10 -2849G > A, IL10 -1082G > A, and IL10 4251A > G were significantly associated with TVR risk with hazard ratios (HR) 1.21 (95% CI: 1.00–1.46), 1.20 (1.01–1.42), 1.20 (0.99–1.45), respectively. HR of IL10 -592C > A was 0.87 (0.70–1.07). In a Cox model with all SNPs, and all two-way interactions, we found significant interactions between SNPs IL10 -2849G > A, and IL10 -1082G > A (P = 0.003), and IL10 -1082G > A, and IL10 4251A > G (P = 0.001). Higher-order interactions were not significant. The prognostic index of this model (Xβ) varied between -2.6 and +1.6, and had 23 different values corresponding to the 23 different genotype combinations that were observed.
With 4 biallelic SNPs, 16 different haplotypes are possible, but seven had zero frequency in the current sample. Of the remaining nine haplotypes, there were four major haplotypes with substantial frequencies 27% (0000), 24% (0001), 21% (0100), and 27% (1110), while the relative frequencies of the remaining five haplotypes varied between 0.46% and 0.02%. The estimated haplotype frequencies and log hazard ratios at the optimal CVL (ridge penalty) are given in Figure 1. The hazard ratios of the four major haplotypes were not different from each other, but of the five rare haplotypes two had decreased (0010, and 0101), and three had increased (0110, 1000, and 1100) hazard ratios, although all had very wide confidence intervals, and none were statistically significantly associated with TVR risk.
Figure 1. Frequencies and log TVR hazard ratios with 95% confidence intervals of the IL10 haplotypes.
To validate our model we calculated the survival curves of all 23 subgroups with different genotype combinations according to equation (3) and compared these with the Kaplan-Meier curves. These curves are given in Figure 2 for the subgroup of 233 patients who were homozygous for the most common allele of IL10 -2849G > A and IL10 4251A > G, and heterozygous for IL10 -1082G > A and IL10 -592C > A (Figure 2). These two curves are comparable, indicating that our haplotype method produces valid results.
Figure 2. Estimated IL10 survival curves based on genotypes and based on haplotypes.
In the present study we present a method to model the relation between failure time and (rare) haplotypes in unrelated individuals. The simulations presented in this study show that haplotype effects of especially the rare haplotypes are closer to the true estimates when a penalty is introduced into the model.
Furthermore, the simulations show that the penalized log-likelihood approach that is used to deal with the unstable estimates of rare haplotypes can indeed shrink the estimates and their 95% confidence intervals to 'acceptable' values. Simulations also show that the cross-validated standard errors of the more common haplotypes can be increased compared to their unpenalized standard errors due to uncertainty with respect to the penalty parameter λ. The power to detect a true haplotype effect is, in general, reduced in the penalized models compared to the non-penalized model, the reduction being more pronounced for the less frequent haplotypes. This reduced power is due to the shrinkage of the estimates. With respect to the type I error probabilities, the effect of introducing a penalty depends on the penalty applied and the true haplotype effects. For the ridge penalized models, the type I error probabilities are similar to those observed in the non-penalized models. The non-penalized estimates of the haplotypes without a modeled effect are already close to the true value of zero and the nature of the ridge penalty further shrinks the effects towards zero. For the difference penalty, the type I error probability appears to be increased for some haplotypes. This deviation is related to the extent that the assumption (similar haplotypes, similar effects) is met. In the first simulation (Table 2), the mean bias of haplotypes 001 and 100 are increased in the direction of a β > 0. In this scenario, haplotypes 010, 011, 110 and 111 all had a modeled effect and the difference penalty results in estimates for the haplotypes 001 and 100 towards the effects of these haplotypes. In the second simulation, the majority of the haplotypes had no modeled effect and the effects of the rare haplotypes could be directed towards β = 0. In replicates containing 2000 individuals (Table 4), the reduction in SE is still present, but the gain is relatively small compared to simulation with only 200 individuals per replicate. This is conform expectations, since rare haplotypes are 'less rare' in larger samples, thus enabling a more precise estimation of their effect even without a penalty. Based on the characteristics of the models displayed in the simulations and the real data, the penalized log-likelihood method mostly serves the purpose of estimating the effects of rare haplotypes more accurate.
The method described in this paper is a flexible method allowing for adjustment for (environmental) covariates as well as haplotype-environment interactions. Although we focus on haplotypes consisting of a certain number of biallelic SNPs, the method is also capable to handle loci with more than two alleles. Furthermore, the method can be easily extended to deal with missing genotype data, since this will simply increase the number of possible haplotype pairs that are compatible with the observed genotype. The wiq in our method are calculated under the assumption of Hardy-Weinberg. Although we did not check robustness of the method to violations of this assumption, Lin  has shown that his method, which is similar to our unpenalized method, is robust to violations of the Hardy-Weinberg assumption.
As an alternative estimation method we considered the partial likelihood. Unfortunately, estimation of β and H0(t) are not separated, and therefore there is no reason to prefer this partial likelihood approach over the EM algorithm outlined in the present manuscript. Compared to the method described by Tregouet et al , the present EM method assumes piecewise constant hazard, which seems less restrictive than the assumption behind the their method using partial likelihood.
We use a penalty function to increase precision of estimates of rare haplotypes. Other strategies for managing unstable estimates of rare haplotypes include excluding the rare haplotypes from the variable list, pooling the rare haplotypes into one category, or pooling the rare haplotypes with common haplotypes that are very similar. The first approach implicitly groups the rare haplotypes with the reference category and the second and third approach lead to pooled categories that are sometimes hard to interpret. Nevertheless, these last two methods seem to increase power . However, the three strategies mentioned above do not result in (individual) effect estimates of rare haplotypes, whereas the penalized models do.
The method presented in this paper can be applied to estimate haplotype effects in cohort studies when haplotype phase is unknown. The joint estimation of haplotype effects and haplotype frequencies together with the penalty function provides a good way of estimating effects of rare haplotypes, which is a common problem in these studies.
SNP(s): single nucleotide polymorphism(s); SE: standard error; PTCA: percutaneous transluminal coronary angioplasty; TVR: target vessel revascularization; IL10: interleukin 10.
OWS participated in programming the method, performed the statistical analysis of the data and drafted the manuscript. AHZ was responsible for the theoretical background of the method, participated in the programming of the method and the analysis of the data and helped to draft the manuscript. JWJ was responsible for the collection of the data and critically evaluated the manuscript. MWTT participated in the development of the method, performed the simulations and helped to draft the manuscript. All authors read and approved the final manuscript.
M.W.T. Tanck was financially supported by the Netherlands Heart Foundation grant no 2000.125.
Stram DO, Leigh PC, Bretsky P, Freedman M, Hirschhorn JN, Altshuler D, Kolonel LN, Henderson BE, Thomas DC: Modeling and E-M estimation of haplotype-specific relative risks from genotype data for a case-control study of unrelated individuals.
Tanck MW, Klerkx AH, Jukema JW, Knijff PD, Kastelein JJ, Zwinderman AH: Estimation of multilocus haplotype effects using weighted penalised log-likelihood: analysis of five sequence variations at the cholesteryl ester transfer protein gene locus.
Agema WR, Monraats PS, Zwinderman AH, de Winter RJ, Tio RA, Doevendans PA, Waltenberger J, de Maat MP, Frants RR, Atsma DE, van der Laarse A, van der Wall EE, Jukema JW: Current PTCA practice and clinical outcomes in The Netherlands: the real world in the pre-drug-eluting stent era.
Scandinavian Journal of Statistics 2004, 31:43-50. Publisher Full Text