Email updates

Keep up to date with the latest news and content from BMC Bioinformatics and BioMed Central.

This article is part of the supplement: Selected articles from the Eighth Asia-Pacific Bioinformatics Conference (APBC 2010)

Open Access Research

Identifying main effects and epistatic interactions from large-scale SNP data via adaptive group Lasso

Can Yang1*, Xiang Wan1, Qiang Yang2, Hong Xue3 and Weichuan Yu1*

Author Affiliations

1 Laboratory for Bioinformatics and Computational Biology, Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong, PR China

2 Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong, PR China

3 Department of Biochemistry, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong, PR China

For all author emails, please log on.

BMC Bioinformatics 2010, 11(Suppl 1):S18  doi:10.1186/1471-2105-11-S1-S18

Published: 18 January 2010

Abstract

Background

Single nucleotide polymorphism (SNP) based association studies aim at identifying SNPs associated with phenotypes, for example, complex diseases. The associated SNPs may influence the disease risk individually (main effects) or behave jointly (epistatic interactions). For the analysis of high throughput data, the main difficulty is that the number of SNPs far exceeds the number of samples. This difficulty is amplified when identifying interactions.

Results

In this paper, we propose an Adaptive Group Lasso (AGL) model for large-scale association studies. Our model enables us to analyze SNPs and their interactions simultaneously. We achieve this by introducing a sparsity constraint in our model based on the fact that only a small fraction of SNPs is disease-associated. In order to reduce the number of false positive findings, we develop an adaptive reweighting scheme to enhance sparsity. In addition, our method treats SNPs and their interactions as factors, and identifies them in a grouped manner. Thus, it is flexible to analyze various disease models, especially for interaction detection. However, due to the intensive computation when millions of interaction terms needs to be searched in the model fitting, our method needs to combined with some filtering methods when applied to genome-wide data for detecting interactions.

Conclusion

By using a wide range of simulated datasets and a real dataset from WTCCC, we demonstrate the advantages of our method.