Open Access Research article

Application of two machine learning algorithms to genetic association studies in the presence of covariates

Bareng AS Nonyane and Andrea S Foulkes

Division of Biostatistics and Epidemiology, School of Public Health and Health Sciences, University of Massachusetts Amherst, MA, USA


BMC Genetics 2008, 9:71 doi:10.1186/1471-2156-9-71

Published: 14 November 2008

Abstract

Background

Population-based investigations aimed at uncovering genotype-trait associations often involve high-dimensional genetic polymorphism data as well as information on multiple environmental and clinical parameters. Machine learning (ML) algorithms offer a straightforward analytic approach for selecting subsets of these inputs that are most predictive of a pre-defined trait. However, the performance of these algorithms in the presence of covariates is not well characterized.

Methods and Results

In this manuscript, we investigate two approaches: Random Forests (RFs) and Multivariate Adaptive Regression Splines (MARS). Through multiple simulation studies, we evaluate their performance under several underlying models. An application to a cohort of HIV-1 infected individuals receiving anti-retroviral therapies is also provided.

Conclusion

Consistent with more traditional regression modeling theory, our findings highlight the importance of considering the nature of underlying gene-covariate-trait relationships before applying ML algorithms, particularly when there is potential confounding or effect mediation.
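
To make the confounding concern concrete, the following is a minimal sketch and not the authors' simulation design. It assumes a hypothetical binary covariate that shifts both the allele frequency and the trait mean, with no direct effect of genotype on the trait, and it uses a plain Pearson correlation in place of RF or MARS. All names and parameter values (`simulate`, `pearson`, the allele frequencies, the effect size) are illustrative assumptions.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate(n=4000, seed=1):
    """Simulate (covariate, genotype, trait) rows under pure confounding."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        c = rng.randint(0, 1)               # hypothetical binary covariate
        maf = 0.5 if c else 0.1             # allele frequency depends on the covariate
        # Genotype coded 0, 1, 2: number of minor alleles out of two draws.
        g = (rng.random() < maf) + (rng.random() < maf)
        y = 2.0 * c + rng.gauss(0.0, 1.0)   # trait depends on c only, never on g
        rows.append((c, g, y))
    return rows


rows = simulate()

# Association when the covariate is ignored.
marginal = pearson([g for _, g, _ in rows], [y for _, _, y in rows])

# Association within each level of the covariate.
within = {
    c: pearson(
        [g for cc, g, _ in rows if cc == c],
        [y for cc, _, y in rows if cc == c],
    )
    for c in (0, 1)
}

print(f"marginal r = {marginal:.2f}")
for c, r in within.items():
    print(f"r within covariate level {c} = {r:.2f}")
```

Under these assumptions the genotype appears associated with the trait when the covariate is ignored, while the association is close to zero within each covariate level. A method that ranks inputs by predictive value alone could therefore flag such a genotype unless the covariate is accounted for.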

