Aitkin recently proposed an integrated Bayesian/likelihood approach that he claims is general and simple. We have applied this method, which does not rely on informative prior probabilities or large-sample results, to investigate the evidence of association between disease and the 16 variants in the KDR gene provided by Genetic Analysis Workshop 17. Based on the likelihood of logistic regression models and considering noninformative uniform prior probabilities on the coefficients of the explanatory variables, we used a random walk Metropolis algorithm to simulate the distributions of deviance and deviance difference. The distribution of probability values and the distribution of the proportions of positive deviance differences showed different locations, but the direction of the shift depended on the genetic factor. For the variant with the highest minor allele frequency and for any rare variant, standard logistic regression showed a higher power than the novel approach. For the two variants with the strongest effects on Q1 under a type I error rate of 1%, the integrated approach showed a higher power than standard logistic regression. The advantages and limitations of the integrated Bayesian/likelihood approach should be investigated using additional regions and considering alternative regression models and collapsing methods.
Association studies are substantially contributing to our knowledge of the genetic basis of human disease, but for most complex diseases the proportion of explained heritability remains at best modest . Large variants, epistatic and parent-of-origin effects, additional layers of genetic variation (microRNA, DNA methylation, histone modification), and rare variants probably account for the missing heritability [2,3]. In this paper we explore the properties of a novel method of inference to quantify the evidence of association between a response variable (an indicator of disease status, the disease being common) and several independent variables, including the carriage of a rare variant.
Likelihood plays a central role in Bayesian, frequentist, and likelihood theory as a measure of strength of evidence. Based on the posterior distribution of the likelihood first described by Dempster, Aitkin has recently proposed an integrated Bayesian/likelihood approach that he claims is general and simple [4,5]. Bayes factors are usually applied to compare different statistical models and parameter values. Under the novel approach, likelihood ratios between models and parameter values are interpreted by using the full posterior distribution of the likelihood. A particular advantage of the method is that it does not rely on informative prior probabilities or large-sample results, and therefore it may be particularly suitable to identify rare susceptibility variants. We have applied this novel approach to investigate the evidence of association between disease and the 16 variants in the KDR gene provided by Genetic Analysis Workshop 17 (GAW17).
Analyses were performed with knowledge of the underlying simulating model. Details on data simulation are provided by Almasy et al. . In short, some KDR variants influence a quantitative risk factor Q1 with larger effects among smokers. Q1 correlates with a quantitative risk factor Q2 and with a normally distributed latent liability trait that increases with age and is higher in smokers. The dependent variable investigated in the present contribution (y = disease affection status) is a function of Q1, Q2, the liability trait, and additional risk factors.
We consider a baseline logistic regression model (model 1), where y depends on age (continuous variable), smoking status (dichotomous), and ethnicity (five categories). Model 2 includes age, smoking status, ethnicity, and a single genetic factor. Table 1 shows the list of investigated genetic factors. We consider individual variants 1 to 16, modeled by minor allele counts (additive model) and a collapsed genotype, and the presence of any rare variant, “rare” being defined as having a minor allele frequency less than 1%.
Table 1. Characteristics of the investigated variants in the KDR gene and results for the first replicate
Under the two models, yi ~ Bernoulli (πi) with:
where xi contains the elements of the corresponding design matrices and β contains the model parameters (age, smoking status, ethnicity, and, for model 2, the single genetic factor). We consider noninformative, improper uniform prior probabilities for β. As a consequence, the joint posterior distributions of β are proportional to the likelihoods from the two logistic regression models, that is,
with fixed N = 697 persons in the present exercise. We use a random walk Metropolis algorithm to make 10,000 draws from the posterior distribution under model 1 and 10,000 independent draws under model 2. The deviance is defined as minus twice the logarithm of the likelihood. Random draws from posterior distributions are used to simulate the distributions of deviance and deviance difference.
Power calculations rely on the probability values and on the proportions of positive deviance differences for the 200 replicates in the GAW17 data. To compare standard logistic regression with the novel approach, we use Wilcoxon rank sum tests on the null location shift and two-sample tests for the equality of proportions. Three different thresholds, 0.1, 0.05, and 0.01, are considered for probability values and proportions of positive deviance differences. All calculations and graphics are implemented with the free software environment R.
Table 1 shows some characteristics of the investigated variants. Effect sizes are shown for nonsmokers; they were 50% higher in smokers. Out of 16 variants, 10 variants influenced Q1, with variants 7 and 13 having the strongest effects (1.08 and 0.94, respectively). Minor allele frequencies (MAFs) varied from 0.1% to 17%. Out of 697 individuals, 17 (2.4%) carried at least one rare variant.
Detailed results for the first replicate from standard logistic regression and from the novel approach are also presented in Table 1. The probability value for a genetic effect based on a standard logistic regression model that also included age, smoking status, and ethnicity was 0.02 for variant 16 and 0.06 for “any rare variant.” The five probability values less than 0.10 were found for variants without a direct effect on Q1, with the exception of variant 5. The two variants with the strongest effects on Q1 (variants 7 and 13) showed the highest probability values (0.71 for both).
Figure 1A shows the posterior cumulative distributions of deviance for four logistic regression models. The lower the deviance (curves to the left), the higher the likelihood and the better the model. For each model, the lowest value of the deviance corresponds to the frequentist maximum likelihood. For example, the maximum likelihood was similar under model 1 (black curve) and when variant 8 was included (blue curve). If we consider the complete deviance distribution, we observe that the model that included any rare variant (gray curve) was better than model 1 and that a model with variant 16 (red curve) clearly dominated over the other three models.
Figure 1. Posterior cumulative distributions of (A) deviance and (B) deviance difference under four logistic regression models.A baseline model that includes age, smoking status, and ethnicity was used as a reference to calculate deviance differences.
Figure 1B shows deviance differences between three models that include a genetic factor and the reference model 1. The median deviance difference for the model with variant 16 (red curve) was −9.77 and the central 95% credible interval of the deviance difference was (−20.28, −1.95). The deviance difference was positive in 125 out of 10,000 draws. This proportion (125/10,000 ≈ 0.01) measures how strongly the data support a genetic effect of variant 16. Medians, 95% credible intervals, and proportions of positive deviance differences using model 1 as a reference are shown in Table 1 for the first replicate. In agreement with probability values from standard logistic regression, the strongest evidence of association based on the proportion of positive deviance differences was found for variants 16, 3, 5, and 15. In contrast to a probability value of 0.06, the proportion of positive deviance differences for “any rare variant” was 0.27. The proportion of positive deviance differences for variants 7 and 13, with the largest effect on Q1, was 0.42.
Box plots of probability values and proportions of positive deviance differences in replicates 1 to 200 are shown in Figure 2. In general, probability values and proportions of positive deviance differences showed skewed distributions. The smallest median probability value and the smallest median proportion of positive deviance differences were observed for the two variants with the strongest effects on Q1 (variants 7 and 13).
Figure 2. Box plots of probability values from standard logistic regression and proportions of positive deviance differences for the 200 replicates
Table 2 summarizes the probability values and proportions of positive deviance differences for the 200 replicates in the GAW17 data. Effects on Q1 and MAFs are also shown for convenience. For example, for variant 7 with an effect of 1.08 and MAF = 0.07%, the median probability value was 0.03 and 95% of the probability values belonged to the interval (0.01, 0.80). For this variant, the power of standard logistic regression was 70.5% (type I error rate, 10%), 65% (error rate, 5%) and 11.5% (error rate, 1%). The statistical power was identical for variant 13. Under standard logistic regression and a type I error rate of 10%, the power was 49% for “any rare variant” and it was 34.5% for variant 8, which had the highest MAF.
Table 2. Probability values, proportions of positive deviance differences, and power comparisons based on the 200 replicates
Based on the integrated Bayesian/likelihood approach, the smallest median proportion of positive deviance differences (0.01) and the highest power (69% for a type I error rate of 10%) were also found for variants 7 and 13. Wilcoxon rank sum tests indicated that the distribution of probability values and the distribution of the proportions of positive deviance differences showed different locations, but the direction of the shift was variable (see Figure 2). For variant 8 and for “any rare variant,” standard logistic regression showed a higher power than the novel approach (probability values from Wilcoxon rank and equality of proportion tests < 0.0001). For variants 7 and 13 and a type I error rate of 1%, the integrated approach showed a higher power than standard logistic regression.
Discussion and conclusions
Genome-wide association studies usually rely on the common disease/common variant hypothesis: Polymorphisms are genotyped and their association with disease is investigated. The identified, most often indirect associations are assumed to reflect a shared inheritance of the genotyped markers and linked causal variants . The first aim of these studies is to detect association regions, mostly by relying on significance testing and summarizing results by probability values. When more precise hypotheses and more refined studies have been set up (e.g., focusing on a specific chromosomal region and incorporating genotypes from public data repositories), the subsequent goal is to quantify and model associations and to identify causal variants underlying the association signal.
The investigated genotypes are usually ranked according to plausibility of causal association. Ranks are most often based on probability values, although probability values have important disadvantages compared to Bayes factors . Probability values are often misinterpreted as the probability of no association given the observed data, but they actually measure the probability of the data given no association. The ranking of markers based on probability values is difficult to interpret, because probability values depend on factors that influence power, such as allele frequency and sample size. In addition, Bayes factors may discriminate between causal variants and markers better than probability values. However, a limitation of Bayes factors is the need to define proper prior probabilities in order to compute an integrated likelihood. Aitkin recently proposed a Bayesian/likelihood approach that allows incorporation of a priori external information if desired, but it also allows use of improper noninformative prior probabilities. The relative simplicity and the independence on large-sample results are important advantages of the method. Limitations include computation time and the necessity to carefully check for convergence in Markov chain Monte Carlo methods. The novel approach could also be applied to investigate the role of additional layers of genetic variation (gene methylation and expression), once its potential over standard methods has been investigated.
The marked differences between the data in Table 1, which shows results for the first replicate, and Table 2, which summarizes results for all 200 replications, reflect the importance of external validation of initial association signals. For example, in the first replicate, the highest probability value and the highest proportion of positive deviance differences were observed for the two variants with the strongest effects on Q1 (variants 7 and 13). Table 2 shows that the average statistical power was highest for these two variants.
Our analyses have some limitations. We have considered one single gene and used a simplistic method to collapse rare variants; details on more sophisticated collapsing methods can be found in the overview by Dering et al. . The baseline model was also fixed, and gene-smoking interactions were not investigated. The distribution of deviance differences was smooth, but the deviance distribution was discrete. Our reference method was standard logistic regression; a comparison with Bayesian methods relying on Bayes factors would also be of interest. Our goal with this analytical exercise was to gain experience with a novel technique that may offer some advantage to identify rare susceptibility variants.
The authors declare that there are no competing interests.
JLB conceived of the study, performed statistical analyses and drafted the manuscript. CF contributed to the study design and helped to draft the manuscript. LB extracted the necessary data for the subsequent statistical analysis. All authors read, commented on and approved the final version of the manuscript.
This work was supported by Deutsche Forschungsgemeinschaft (DFG) grant SFB/TRR77 (Project Z2). The Genetic Analysis Workshop is supported by National Institutes of Health grant R01 GM031575.
This article has been published as part of BMC Proceedings Volume 5 Supplement 9, 2011: Genetic Analysis Workshop 17. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/5?issue=S9.
Statistics and Computing 1997, 7:247-252. Publisher Full Text