
This article is part of the supplement: The 2007 International Conference on Bioinformatics & Computational Biology (BIOCOMP'07)

Open Access Research

Supervised learning-based tagSNP selection for genome-wide disease classifications

Qingzhong Liu1,2, Jack Yang3, Zhongxue Chen4, Mary Qu Yang5,6, Andrew H Sung1,2* and Xudong Huang3*

Author affiliations

1 Department of Computer Science, New Mexico Institute of Mining and Technology, Socorro, NM 87801, USA

2 Institute for Complex Additive Systems Analysis, New Mexico Institute of Mining and Technology, Socorro, NM 87801, USA

3 Department of Radiology, Brigham and Women's Hospital and Harvard Medical School, Boston, MA 02120, USA

4 The Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA

5 National Human Genome Research Institute, National Institutes of Health (NIH), U.S. Department of Health and Human Services, USA

6 Oak Ridge Institute for Science and Education, Oak Ridge National Laboratory, U.S. Department of Energy, USA


Citation and License

BMC Genomics 2008, 9(Suppl 1):S6  doi:10.1186/1471-2164-9-S1-S6

Published: 20 March 2008

Abstract

Background

Comprehensive evaluation of common genetic variation through genome-wide association of single nucleotide polymorphisms (SNPs) with complex human diseases is an active area of human genome research. A fundamental question in a SNP-disease association study is how to find an optimal subset of SNPs with predictive power for disease status. To find that subset while reducing the time and cost of a study, one can potentially reconcile the information redundancy that arises from associations between SNP markers.

Results

We have developed a feature selection method named Supervised Recursive Feature Addition (SRFA). The method combines supervised learning with statistical measures computed on the candidate features (SNPs) in order to reconcile redundant information and thereby improve classification performance in association studies. We have also proposed a Support Vector-based Recursive Feature Addition (SVRFA) scheme for SNP-disease association analysis.
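To make the idea of recursive feature addition concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the classifier or the statistical measure, so this sketch assumes a nearest-centroid classifier scored on training accuracy and the mean absolute Pearson correlation as the redundancy measure. The names `centroid_accuracy`, `recursive_feature_addition`, and `n_candidates` are hypothetical.

```python
import numpy as np


def centroid_accuracy(X, y):
    """Training accuracy of a nearest-centroid classifier.

    Assumption: this is only a stand-in for whichever supervised
    learner is actually used; a real study would cross-validate.
    """
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    # Squared Euclidean distance from every sample to every class centroid.
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    pred = classes[dists.argmin(axis=1)]
    return float((pred == y).mean())


def recursive_feature_addition(X, y, n_select, n_candidates=3):
    """Greedily add features (SNP columns) one at a time.

    At each step the remaining features are ranked by the accuracy
    obtained when each is added to the current subset. Among the top
    `n_candidates`, the one least correlated with the already selected
    features is kept (assumed redundancy measure: mean |Pearson r|).
    """
    n_features = X.shape[1]
    # Absolute pairwise correlations; constant columns yield NaN, set to 0.
    corr = np.nan_to_num(np.abs(np.corrcoef(X, rowvar=False)))
    selected = []
    while len(selected) < min(n_select, n_features):
        remaining = [j for j in range(n_features) if j not in selected]
        scores = [(centroid_accuracy(X[:, selected + [j]], y), j)
                  for j in remaining]
        # Stable sort: ties in accuracy keep their original column order.
        scores.sort(key=lambda t: -t[0])
        top = [j for _, j in scores[:n_candidates]]
        if selected:
            best = min(top, key=lambda j: corr[j, selected].mean())
        else:
            best = top[0]
        selected.append(best)
    return selected
```

Calling `recursive_feature_addition(X, y, n_select=2)` on a genotype matrix `X` (samples by SNPs, coded 0/1/2) and a label vector `y` returns the chosen column indices in the order they were added. When two candidates give the same accuracy, the correlation step favours the one that is less redundant with the SNPs already chosen, which is the kind of redundancy handling the paragraph above describes.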

Conclusions

We have proposed using SRFA with different statistical learning classifiers, and SVRFA, for both SNP selection and disease classification, and have applied them to two complex disease data sets. In general, our approaches outperform the well-known feature selection method Support Vector Machine Recursive Feature Elimination, as well as logic regression-based SNP selection, for disease classification in genetic association studies. Our study further indicates that both genetic and environmental variables should be taken into account in disease prediction and classification for the most complex human diseases that have gene-environment interactions.