This article is part of the supplement: Genetic Analysis Workshop 17: Unraveling Human Exome Data

Open Access Proceedings

Large-scale risk prediction applied to Genetic Analysis Workshop 17 mini-exome sequence data

Gengxin Li1, John Ferguson1, Wei Zheng2, Joon Sang Lee1, Xianghua Zhang1,3, Lun Li1,4, Jia Kang1, Xiting Yan2 and Hongyu Zhao1*

Author Affiliations

1 Department of Epidemiology and Public Health, Yale University, 60 College Street, New Haven, CT 06520, USA

2 Keck Laboratory, Yale University, 300 George Street, New Haven, CT 06511, USA

3 Department of Electronic Science and Technology, University of Science and Technology of China, Hefei, China

4 Bioinformatics and Molecular Imaging Key Laboratory, Huazhong University of Science and Technology, Wuhan, China

BMC Proceedings 2011, 5(Suppl 9):S46  doi:10.1186/1753-6561-5-S9-S46

Published: 29 November 2011

Abstract

We consider the application of Efron’s empirical Bayes classification method to risk prediction in a genome-wide association study using the Genetic Analysis Workshop 17 (GAW17) data. A major advantage of this method is that the effect size distribution for the set of possible features is estimated empirically, and all subsequent parameter estimation and risk prediction are guided by this distribution. Here, we generalize Efron’s method to accommodate some of the peculiarities of the GAW17 data. In particular, we introduce two extensions of Efron’s model: a weighted empirical Bayes model and a joint covariance model, both of which allow the model to properly incorporate the annotation information of single-nucleotide polymorphisms (SNPs). In the course of our analysis, we examine several aspects of the possible simulation model, including the identity of the most important genes, the differing effects of synonymous and nonsynonymous SNPs, and the relative roles of covariates and genes in conferring disease risk. Finally, we compare the three methods (Efron’s original model and the two extensions) to each other and to other classifiers (random forest and neural network).
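
To illustrate the idea that an empirically estimated effect size distribution guides the later estimates, the following is a minimal sketch and not the authors' implementation. It assumes each feature is summarized by a z-score with unit variance and swaps in a simple normal prior on the true effects in place of the nonparametric estimate used in Efron's method. The function names `shrink_effects` and `risk_score` and the simulated z-scores are hypothetical and serve only as illustration.

```python
import random


def shrink_effects(z_scores):
    """Parametric empirical Bayes shrinkage.

    Assumed model (a simplification): z_j | mu_j ~ N(mu_j, 1), mu_j ~ N(0, A).
    """
    m = len(z_scores)
    # Marginally z_j ~ N(0, 1 + A), so mean(z^2) - 1 is a moment estimate of A.
    prior_var = max(0.0, sum(z * z for z in z_scores) / m - 1.0)
    # Posterior mean of mu_j given z_j is (A / (1 + A)) * z_j.
    shrink = prior_var / (1.0 + prior_var)
    return [shrink * z for z in z_scores], prior_var


def risk_score(features, effects):
    """Linear risk score: shrunken effects times standardized feature values."""
    return sum(e * x for e, x in zip(effects, features))


random.seed(0)
# Hypothetical data: 950 null features and 50 features with true effect 3.
z = [random.gauss(0.0, 1.0) for _ in range(950)]
z += [random.gauss(3.0, 1.0) for _ in range(50)]
effects, prior_var = shrink_effects(z)
```

Because the shrinkage factor is estimated from all features at once, a data set dominated by null features pulls every estimated effect toward zero, which is the sense in which the estimated distribution guides the individual estimates. An individual's score would then be `risk_score(x, effects)` for a standardized feature vector `x`, compared against a threshold chosen by the analyst.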