This article is part of the supplement: Genetic Analysis Workshop 17: Unraveling Human Exome Data

Open Access Proceedings

Performance of random forests and logic regression methods using mini-exome sequence data

Yoonhee Kim1, Qing Li1, Cheryl D Cropp1, Heejong Sung1, Juanliang Cai1, Claire L Simpson1, Brian Perry1, Abhijit Dasgupta2, James D Malley3, Alexander F Wilson1 and Joan E Bailey-Wilson1*

  • * Corresponding author: Joan E Bailey-Wilson jebw@mail.nih.gov

  • † Equal contributors

Author Affiliations

1 Inherited Disease Research Branch, National Human Genome Research Institute, National Institutes of Health, Baltimore, MD 21224, USA

2 Clinical Sciences Section, National Institute of Arthritis and Musculoskeletal and Skin Disease, National Institutes of Health, Bethesda, MD 20892, USA

3 Center for Information Technology, National Institutes of Health, Bethesda, MD 20892, USA

BMC Proceedings 2011, 5(Suppl 9):S104  doi:10.1186/1753-6561-5-S9-S104

Published: 29 November 2011

Abstract

Machine learning approaches are an attractive option for analyzing large-scale data to detect genetic variants that contribute to variation of a quantitative trait, without requiring specific distributional assumptions. We evaluate two machine learning methods, random forests and logic regression, and compare them to standard simple univariate linear regression, using the Genetic Analysis Workshop 17 mini-exome data. We also apply these methods after collapsing multiple rare variants within genes and within gene pathways. Linear regression and the random forest method performed better when rare variants were collapsed based on genes or gene pathways than when each variant was analyzed separately. Logic regression performed better when rare variants were collapsed based on genes rather than on pathways.
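As an illustration of what collapsing rare variants within genes can look like, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the collapsing scheme, so this assumes a simple per-gene burden score (the sum of minor-allele counts over variants whose minor allele frequency falls below a threshold). The function names, the 0/1/2 genotype coding, and the threshold value are all hypothetical choices made for the example.

```python
from collections import defaultdict


def minor_allele_frequency(genotypes):
    """Return the minor allele frequency for one variant.

    Assumes genotypes are minor-allele counts (0, 1, or 2) per individual.
    """
    return sum(genotypes) / (2 * len(genotypes))


def collapse_rare_variants(genotype_matrix, variant_to_gene, maf_threshold=0.01):
    """Sum minor-allele counts over rare variants within each gene.

    genotype_matrix: one row per individual, one 0/1/2 count per variant.
    variant_to_gene: gene label for each variant column (hypothetical mapping).
    maf_threshold: assumed cutoff below which a variant counts as rare.
    Returns a dict mapping gene -> per-individual burden scores. Genes with
    no rare variants are omitted from the result.
    """
    n_individuals = len(genotype_matrix)
    burden = defaultdict(lambda: [0] * n_individuals)
    for j, gene in enumerate(variant_to_gene):
        column = [row[j] for row in genotype_matrix]
        if minor_allele_frequency(column) < maf_threshold:
            for i, count in enumerate(column):
                burden[gene][i] += count
    return dict(burden)


# Toy data: 4 individuals, 3 variants (two in GENE_A, one in GENE_B).
genotypes = [
    [0, 1, 1],
    [1, 0, 2],
    [0, 0, 1],
    [0, 0, 0],
]
variant_to_gene = ["GENE_A", "GENE_A", "GENE_B"]

# A deliberately high threshold so the toy variants in GENE_A qualify as rare.
collapsed = collapse_rare_variants(genotypes, variant_to_gene, maf_threshold=0.2)
print(collapsed)
```

The resulting per-gene scores could then be used as predictors in linear regression, random forests, or logic regression in place of the individual variants. Collapsing by pathway would follow the same pattern, with a variant-to-pathway mapping substituted for the variant-to-gene mapping.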