Skip to main content

Model selection based on logistic regression in a highly correlated candidate gene region

Abstract

Our aim is to develop methods for identifying a (causal) variant or variants from a dense panel of single-nucleotide polymorphisms (SNPs) that are genotyped on the evidence of previous studies. Because a large number of SNPs are in close proximity to each other, the magnitude of linkage disequilibrium (LD) plays an important role. Namely, highly correlated SNPs may hamper standard methods such as multivariate logistic regression due to multicolinearity between the covariates. Sequences of models with high dimension naturally raise questions about model selection strategies. We investigate three variable selection methods based on logistic regression. The penalties on stepwise selection were imposed using the Akaike's Information Criterion (AIC), and using the lasso penalty. Finally, a Bayesian variable-selection logistic regression model was implemented. The methods are illustrated using the simulated dense SNPs including the causal DR/C locus on chromosome 6. We also evaluate model selection in terms of average prediction error across nine replicates. We conclude that for the Genetic Analysis Workshop 15 (GAW15) data, the newly developed Bayesian selection method performs well.

Background

When a large number of potentially causative sites have been determined, the next question is how to distinguish the sites that have a causal role from the ones that show disease association because of linkage disequilibrium (LD). Stepwise logistic regression has been suggested to identify the relative importance of variants at different sites [1]. In order to deal with high correlation between the single-nucleotide polymorphisms (SNPs), the lasso penalty was applied for model selection, which shrinks some coefficients to zero for sufficiently large penalty [2]. Subsequently, we contrast results with an explicit variable selection implemented within fully Bayesian framework.

Methods

Model selection using penalties

Let y i , i = 1,..., m, be the binary response variable and let x ij , j = 1,..., p, be the predictor variables. Further define the monotone logit transformation η i = p i /(1 - p i ), where p i is the probability of observing y i = 1. For the logistic regression model η i = β 0 + ∑ j = 1 p β j x i j MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacqWF3oaAdaWgaaWcbaGaemyAaKgabeaakiabg2da9iab=j7aInaaBaaaleaacqaIWaamaeqaaOGaey4kaSYaaabmaeaacqWFYoGydaWgaaWcbaGaemOAaOgabeaakiabdIha4naaBaaaleaacqWGPbqAcqWGQbGAaeqaaaqaaiabdQgaQjabg2da9iabigdaXaqaaiabdchaWbqdcqGHris5aaaa@42D7@ , where β0 denotes the offset, the binomial log-likelihood l is

l ( β ) = ∑ i = 1 m { y i log p ( x i ; β ) + ( 1 − y i ) log ( 1 − p ( x i ; β ) ) } . MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGSbaBcqGGOaakiiGacqWFYoGycqGGPaqkcqGH9aqpdaaeWaqaamaacmaabaGaemyEaK3aaSbaaSqaaiabdMgaPbqabaGccyGGSbaBcqGGVbWBcqGGNbWzcqWGWbaCcqGGOaakcqWG4baEdaWgaaWcbaGaemyAaKgabeaakiabcUda7iab=j7aIjabcMcaPiabgUcaRiabcIcaOiabigdaXiabgkHiTiabdMha5naaBaaaleaacqWGPbqAaeqaaOGaeiykaKIagiiBaWMaei4Ba8Maei4zaCMaeiikaGIaeGymaeJaeyOeI0IaemiCaaNaeiikaGIaemiEaG3aaSbaaSqaaiabdMgaPbqabaGccqGG7aWocqWFYoGycqGGPaqkcqGGPaqkaiaawUhacaGL9baacqGGUaGlaSqaaiabdMgaPjabg2da9iabigdaXaqaaiabd2gaTbqdcqGHris5aaaa@640A@
(1.1)

Here β = {β0, β j } and the vector x i includes the constant term 1.

In case of a large number of predictors, it is often desirable to determine a smaller subset with the strongest effects. Our first strategy is to consider stepwise selection with Akaike's Information Criterion (AIC) [2], as defined by AIC = -2l(β) + 2k. Here, k is the number of parameters included in the model, and AIC penalizes for the addition of parameters.

As an alternative we impose the so-called lasso penalty [2–4]. The above log-likelihood [Eq. (1.1)] can be modified as follows:

l λ s ( β ) = l ( β ) − λ s ‖ β ‖ s , MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGSbaBdaahaaWcbeqaaGGaciab=T7aSnaaBaaameaacqWGZbWCaeqaaaaakiabcIcaOiab=j7aIjabcMcaPiabg2da9iabdYgaSjabcIcaOiab=j7aIjabcMcaPiabgkHiTiab=T7aSnaaBaaaleaacqWGZbWCaeqaaOWaauWaaeaacqWFYoGyaiaawMa7caGLkWoadaqhaaWcbaGaem4CamhabaaaaOGaeiilaWcaaa@4628@
(1.2)

where s = 1 and ||β||1 = ∑ j |β j |, the L1 norm of the parameter vector β. This lasso-type penalization can be useful for variable selection, because it shrinks some coefficients to zero. Note that only the β j values are subject to penalization, not the offset β0. An optimal λ can be determined by AIC and 10-fold cross-validation (CV).

From a Bayesian point of view, Eq. (1.2) can be seen as the posterior mode for combining a flat prior β0, and independently normally distributed β j values. Therefore, lasso parameter λ1 can be seen as the inverse of the variance of the prior. From this perspective, it becomes clear why Markov Chain Monte Carlo (MCMC) techniques can be applied for better handling of model uncertainty.

Bayesian model selection

We implemented a fully Bayesian approach to variable selection for the logistic regression model, with hierarchical specification on the regression parameter vector and logit link on the class probabilities. A flat prior was assumed for the intercept term β0, and the β j values were independently modeled as N ( 0 , 1 ζ ν 2 I k ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGobGtcqGGOaakcqaIWaamcqGGSaaldaWcaaqaaiabigdaXaqaaGGaciab=z7a6baacqWF9oGBdaahaaqabeaacqaIYaGmaaGaemysaK0aaSbaaSqaaiabdUgaRbqabaGccqGGPaqkaaa@398C@ -distributed random variables, where ν is a known scale and 1/ζ a rescaling factor, such that ζ has a Gamma(a, b) distribution with a and b positive real numbers. We assumed a uniform prior on SNP location choices (i.e., from a variable selection point of view, all SNPs have equal prior probability). An auxiliary variable approach was implemented to generate the logistic model via mixture modeling within an ordinary normal regression model [5, 6]. A hybrid MCMC sampler was applied, based on random choices between three steps to add (birth), remove (death), or move a SNP to the model. For more details we refer to Mertens [7], Green [8], and Holmes and Held [6]. We performed a simulation of 100,000 iterations and discarded the first 50,000 as burn-in. The classification performance from the model was investigated based on the marginal mean posterior class probabilities. Sensitivities and specificities were presented along with the receiver operating characteristic (ROC) curve. The Bayesian logistic regression variable selection model was implemented in MATLAB.

Evaluation of the model selection regarding prediction performance

Because it is difficult to defend a model that predicts poorly, we also examined prediction performance. The first simulated data set, Replicate 1, was used as a training set to determine the optimal set of variables, and Replicates 2 to 10 were used for validation by calculating average prediction error. This was defined to be the average of the classification errors from the predicted values using selected SNPs for each of nine data sets. The computations were performed with the programming language R modifying various existing packages.

Materials

We used ten replicates of simulated data, a dense map of 17,820 SNPs on chromosome 6, modeled after the rheumatoid arthritis (RA) data. We selected a high LD region of approximately 5 Mb including the trait loci (DR/C locus) with a large trait effect; we knew the "answers". Setting a threshold of p-value < 0.001 using the Cochran-Armitage trend test, we obtained 73 SNPs. Further, we chose 200 cases and 200 controls by the RA status.

Results

Whereas the disease SNP 3437 was selected by all methods (Table 1), the Bayesian logistic variable selection regression model was the most parsimonious one, as it identified only two SNPs. In Figure 1, the number of times that each particular SNP was selected into the model across all models simulated was expressed as a percentage of the total number of models considered. The two SNPs were selected from nearly 98% of all models. To summarize classification performance from all models visited, we calculated the mean posterior class probability, and found sensitivity and specificity of 0.82, with the global misspecification error of 0.13. Figure 2 shows the ROC curve for the marginal mean posterior class probability and the area under the curve (AUC) equals to 0.933.

Figure 1
figure 1

Relative importance of the SNPs across all models simulated. Two SNPs including the causal SNP 3437 were selected nearly 98% of all models.

Figure 2
figure 2

Classification performance of all models simulated. ROC curve for the marginal mean posterior class probability.

Table 1 Selection of SNPs

Even though validation set size (nine replicates) was very small, we investigated prediction performance to evaluate the models (Table 2). Based on the average prediction error over nine replicates, the Bayesian selection seemed to outperform other methods.

Table 2 Evaluation of model selection

Discussion

When we dropped the causal SNP and analyzed data again, by stepwise selection with the AIC and by the lasso penalty method, the same remaining SNPs were selected. The corresponding average prediction error still remained at the same level. Meanwhile, the Bayesian method selected two SNPs, 3436 (with probability 87%) and 3439 (with probability 97%), between which the causal SNP 3437 was located. Because the penalty methods selected several (possibly correlated) SNPs, we applied ridge penalty [9] and random forests [10] to stabilize the system. We found that prediction performance generally improved slightly with ridge penalty regularization (Table 2). We also compared the above findings with situations in which only the causal SNP was included in the model. In all situations, prediction performance based on nine replicates remained almost at the same level for each selection method.

Additionally, we analyzed another candidate region of 5 Mb around the D locus with a small effect and moderate LD, where the causal SNP was not included. Using these data sets, which can be considered as the opposite of those used in the main analyses, none of the methods performed well in terms of prediction performance.

Our Bayesian method can (theoretically) deal with a great number of SNPs, provided that time and facilities are available. The computational efficiency is mainly achieved by integrating out regression coefficients within the ratio of marginal likelihoods. However, the usefulness for genome-wide scan remains to be evaluated.

Conclusion

All methods identified the causal SNP together with other variants. In terms of parsimony of the model and prediction performance, Bayesian method outperformed other methods. When high correlations between the SNPs are characteristics in some candidate region such as in the specific data presented in this paper, and the focus of investigation is to find a causal gene, we conclude that a Bayesian method might perform well to disentangle the structure. Figures 1 and 2 summarize our evaluation results succinctly. The plot of mean posterior class probability indicates relative importance of the selected SNPs: i.e., in average how many times these SNPs were included across all the models simulated. Additionally, the ROC curve indicates how well these models were classified.

References

  1. Cordell HJ, Clayton DG: A unified stepwise regression procedure for evaluating the relative effects of polymorphisms within a gene using case/control or family data: application to HLA in type I diabetes. Am J Hum Genet. 2002, 70: 124-141. 10.1086/338007.

    Article  PubMed Central  PubMed  CAS  Google Scholar 

  2. Hastie T, Tibshirani R, Friedman J: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2001, New York: Springer

    Book  Google Scholar 

  3. Tibshirani R: Regression shrinkage and selection via the lasso. J Royal Stat Soc B. 1996, 58: 267-288.

    Google Scholar 

  4. Park MY, Hastie T: L1 regularization path algorithm for generalized linear models. J R Statistic Soc B. 2007, 69 (part 4): 659-677. 10.1111/j.1467-9868.2007.00607.x.

    Article  Google Scholar 

  5. Andrews DF, Mallows CL: Scale mixtures of normal distributions. J Roy Stat Soc B. 1974, 36: 99-102.

    Google Scholar 

  6. Holmes CC, Held L: Bayesian auxiliary variable models for binary and multinomial regression. Bayesian Analysis. 2006, 1: 145-168. 10.1214/06-BA105.

    Article  Google Scholar 

  7. Mertens BJA: Logistic regression modeling of proteomic mass spectra in a case-control study on diagnosis for colon cancer. Bayesian Statistics 8. Edited by: Bernardo JM, Bayarri MJ, Berger JO, Dawid AP, Heekerman D, Smith AFM, West M. 2007, New York: Oxford University Press, 639-644.

    Google Scholar 

  8. Green P: Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika. 1995, 82: 711-732. 10.1093/biomet/82.4.711.

    Article  Google Scholar 

  9. le Cessie S, van Houwelingen JC: Ridge estimators in logistic regression. Appl Stat. 1992, 41: 191-201. 10.2307/2347628.

    Article  Google Scholar 

  10. Breiman L: Random forests. Mach Learn. 2001, 45: 5-32. 10.1023/A:1010933404324.

    Article  Google Scholar 

Download references

Acknowledgements

H.W. Uh was funded by the GENOMEUTWIN project which is supported by the European Union Contract No. QLG2-CT-2002-01254.

This article has been published as part of BMC Proceedings Volume 1 Supplement 1, 2007: Genetic Analysis Workshop 15: Gene Expression Analysis and Approaches to Detecting Multiple Functional Loci. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/1?issue=S1.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Hae-Won Uh.

Additional information

Competing interests

The author(s) declare that they have no competing interests.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Uh, HW., Mertens, B.J., Jan van der Wijk, H. et al. Model selection based on logistic regression in a highly correlated candidate gene region. BMC Proc 1 (Suppl 1), S114 (2007). https://doi.org/10.1186/1753-6561-1-S1-S114

Download citation

  • Published:

  • DOI: https://doi.org/10.1186/1753-6561-1-S1-S114

Keywords