This article is part of the supplement: Selected articles from the 9th Annual Biotechnology and Bioinformatics Symposium (BIOT 2012)
Breast cancer prediction using genome wide single nucleotide polymorphism data
1 Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada
2 Alberta Innovates Centre for Machine Learning, University of Alberta, Edmonton, Alberta, Canada
3 Department of Oncology, University of Alberta, Edmonton, Canada
4 Department of Laboratory Medicine and Pathology, University of Alberta, Edmonton, Alberta, Canada
5 PolyomX Program, Cross Cancer Institute, Alberta Health Services, Edmonton, Alberta, Canada
BMC Bioinformatics 2013, 14(Suppl 13):S3 doi:10.1186/1471-2105-14-S13-S3
Published: 1 October 2013
This paper introduces a genome-wide predictive study and applies it to learn a model that predicts, from a subject's SNP profile, whether she will develop breast cancer.
We first genotyped 696 female subjects (348 breast cancer cases and 348 apparently healthy controls), predominantly of Caucasian origin from Alberta, Canada, using Affymetrix Human SNP 6.0 arrays. We then applied the EIGENSTRAT population stratification correction method to remove 73 subjects not belonging to the Caucasian population. Next, we filtered out any SNP that had any missing calls, whose genotype frequencies deviated from Hardy-Weinberg equilibrium, or whose minor allele frequency was less than 5%. Finally, we applied a combination of the MeanDiff feature selection method and the k-nearest neighbours (KNN) learning method to this filtered dataset to produce a breast cancer prediction model. The leave-one-out cross-validation (LOOCV) accuracy of this classifier is 59.55%. Random permutation tests show that this result is significantly better than the baseline accuracy of 51.52%. Sensitivity analysis shows that the classifier is fairly robust to the number of MeanDiff-selected SNPs. External validation on the CGEMS breast cancer dataset, the only other publicly available breast cancer dataset, shows that this combination of MeanDiff and KNN leads to a LOOCV accuracy of 60.25%, which is significantly better than its baseline of 50.06%. We then considered a dozen different combinations of feature selection and learning methods, but found that none of these combinations produces a better predictive model than ours. We also considered various biological feature selection methods, such as selecting SNPs reported in recent genome-wide association studies to be associated with breast cancer, selecting SNPs in genes associated with KEGG cancer pathways, or selecting SNPs associated with breast cancer in the F-SNP database, but again found that none of the resulting models achieved accuracy better than baseline.
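The pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions that the abstract does not state: genotypes are coded as minor-allele counts (0, 1, 2) with `None` for a missing call, MeanDiff is taken to rank SNPs by the absolute difference between the mean genotype of cases and of controls, KNN uses Manhattan distance with a majority vote, and feature selection is repeated inside every leave-one-out fold. The Hardy-Weinberg filter and the EIGENSTRAT step are omitted for brevity, and all function names are hypothetical.

```python
from collections import Counter


def maf(column):
    """Minor allele frequency of one SNP coded as 0/1/2 allele counts."""
    freq = sum(column) / (2 * len(column))
    return min(freq, 1 - freq)


def quality_filter(X, min_maf=0.05):
    """Indices of SNPs with no missing calls (None) and MAF >= min_maf."""
    keep = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        if None not in col and maf(col) >= min_maf:
            keep.append(j)
    return keep


def mean_diff_select(X, y, snps, m):
    """Assumed MeanDiff: rank SNPs by |mean(cases) - mean(controls)|, keep top m."""
    cases = [row for row, label in zip(X, y) if label == 1]
    controls = [row for row, label in zip(X, y) if label == 0]

    def score(j):
        return abs(sum(r[j] for r in cases) / len(cases)
                   - sum(r[j] for r in controls) / len(controls))

    return sorted(snps, key=score, reverse=True)[:m]


def knn_predict(X, y, query, snps, k):
    """Majority vote among the k nearest training subjects (Manhattan distance)."""
    dists = sorted(
        (sum(abs(row[j] - query[j]) for j in snps), label)
        for row, label in zip(X, y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


def loocv_accuracy(X, y, m, k):
    """Leave-one-out accuracy; SNP selection is redone on each training fold."""
    snps_ok = quality_filter(X)  # label-free filter, so safe to apply once
    correct = 0
    for i in range(len(X)):
        X_train, y_train = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        selected = mean_diff_select(X_train, y_train, snps_ok, m)
        correct += knn_predict(X_train, y_train, X[i], selected, k) == y[i]
    return correct / len(X)
```

Here `X` is a list of per-subject genotype lists and `y` a list of labels (1 for case, 0 for control); `m` and `k` are the number of selected SNPs and the number of neighbours. Re-running the selection within each fold is a design choice in this sketch: it keeps the held-out subject from influencing which SNPs are chosen, which would otherwise inflate the accuracy estimate.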
We anticipate producing more accurate breast cancer prediction models by recruiting more study subjects, providing more accurate labelling of phenotypes (to accommodate the heterogeneity of breast cancer), measuring other genomic alterations such as point mutations and copy number variations, and incorporating non-genetic information about subjects such as environmental and lifestyle factors.