Open Access Methodology article

Iterative feature removal yields highly discriminative pathways

Stephen O’Hara1, Kun Wang1,6, Richard A Slayden2, Alan R Schenkel2, Greg Huber3, Corey S O’Hern4, Mark D Shattuck5 and Michael Kirby1*

Author Affiliations

1 Department of Mathematics, Colorado State University, Fort Collins, CO, USA

2 Department of Microbiology, Immunology, and Pathology, Colorado State University, Fort Collins, CO, USA

3 Kavli Institute for Theoretical Physics, University of California, Santa Barbara, CA, USA

4 Department of Mechanical Engineering & Materials Science, Department of Applied Physics, and Department of Physics, Yale University, New Haven, CT, USA

5 Physics Department, The City College of New York, New York, NY, USA

6 Department of Mechanical Engineering & Materials Science, Yale University, New Haven, CT, USA

BMC Genomics 2013, 14:832  doi:10.1186/1471-2164-14-832

Published: 25 November 2013

Abstract

Background

We introduce Iterative Feature Removal (IFR) as an unbiased approach for selecting features with diagnostic capacity from large data sets. The algorithm is based on recently developed tools in machine learning that are driven by sparse feature selection goals. When applied to genomic data, our method is designed to identify genes that can provide deeper insight into complex interactions while remaining directly connected to diagnostic utility. We contrast this approach with the search for a minimal best set of discriminative genes, which can provide only an incomplete picture of the biological complexity.

Results

Microarray data sets typically contain far more features (genes) than samples. For this type of data, we demonstrate that there are many equivalently predictive subsets of genes. We iteratively train a classifier using features identified via a sparse support vector machine. At each iteration, we remove all the features that were previously selected. We found that we could iterate many times before a sustained drop in accuracy occurred, with each iteration removing approximately 30 genes from consideration. The classification accuracy on test data remained essentially flat even as hundreds of top genes were removed.
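The iteration described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: an L1-penalized logistic regression fitted by proximal gradient descent stands in for the sparse support vector machine, the sparse model itself is scored on the test data, and the function names (`fit_sparse_linear`, `iterative_feature_removal`), the penalty `lam`, and the synthetic data are all hypothetical.

```python
import numpy as np

def fit_sparse_linear(X, y, lam=0.05, steps=300):
    """L1-penalized logistic regression via proximal gradient descent.
    Assumption: a stand-in for the sparse SVM; labels y are in {-1, +1}."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    # Step size 1/L, with L an upper bound on the gradient's Lipschitz constant.
    lr = 4.0 * n / (np.linalg.norm(X, 2) ** 2 + n)
    for _ in range(steps):
        margins = np.clip(y * (X @ w + b), -50, 50)
        g = -y / (1.0 + np.exp(margins))      # d(loss)/d(score), per sample
        w = w - lr * (X.T @ g) / n
        b = b - lr * g.mean()
        # Soft-thresholding sets small weights exactly to zero (sparsity).
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)
    return w, b

def iterative_feature_removal(X_tr, y_tr, X_te, y_te, max_iter=10, lam=0.05):
    """Repeatedly select a sparse feature set, score it, then remove it."""
    remaining = np.arange(X_tr.shape[1])      # indices of genes still available
    history = []                              # (selected indices, test accuracy)
    for _ in range(max_iter):
        if remaining.size == 0:
            break
        w, b = fit_sparse_linear(X_tr[:, remaining], y_tr, lam)
        keep = w != 0
        if not keep.any():
            break
        pred = np.where(X_te[:, remaining] @ w + b >= 0, 1, -1)
        history.append((remaining[keep], float((pred == y_te).mean())))
        remaining = remaining[~keep]          # discard everything just selected
    return history

# Synthetic data (hypothetical): 60 "genes", the first 30 carry redundant signal.
rng = np.random.default_rng(0)
y = np.where(rng.random(80) < 0.5, -1, 1)
X = rng.normal(size=(80, 60))
X[:, :30] += 0.8 * y[:, None]
history = iterative_feature_removal(X[:40], y[:40], X[40:], y[40:])
for i, (genes, acc) in enumerate(history):
    print(f"iteration {i}: {genes.size} features selected, test accuracy {acc:.2f}")
```

Plotting the recorded accuracy against the iteration number gives the kind of curve the abstract refers to. The sketch stops after a fixed number of iterations, whereas the abstract describes iterating until a sustained drop in accuracy.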

Our method identifies sets of genes that are highly predictive, even when composed of genes that individually are not. Through automated and manual analysis of the selected genes, we demonstrate that the selected features expose relevant pathways that other approaches would have missed.
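A toy example may help show how a set of features can be predictive when its members individually are not. The sketch below uses synthetic data, not data from the study: two hypothetical features share a large common noise term, so each is a weak predictor on its own, while a linear combination cancels the noise. The least-squares classifier and the names `univariate_accuracy` and `joint_accuracy` are assumptions chosen for brevity.

```python
import numpy as np

def univariate_accuracy(f, y):
    """Accuracy of thresholding one feature at its mean, best orientation.
    Picking the orientation on the evaluated data slightly favors this rule."""
    pred = np.where(f >= f.mean(), 1, -1)
    acc = float((pred == y).mean())
    return max(acc, 1.0 - acc)

def joint_accuracy(X_tr, y_tr, X_te, y_te):
    """Accuracy of a least-squares linear classifier using all features."""
    A_tr = np.column_stack([X_tr, np.ones(len(X_tr))])   # add intercept column
    w, *_ = np.linalg.lstsq(A_tr, y_tr.astype(float), rcond=None)
    A_te = np.column_stack([X_te, np.ones(len(X_te))])
    pred = np.where(A_te @ w >= 0, 1, -1)
    return float((pred == y_te).mean())

# Synthetic data (hypothetical): shared noise s dominates both features.
rng = np.random.default_rng(0)
n = 2000
y = np.where(rng.random(n) < 0.5, -1, 1)
s = rng.normal(scale=3.0, size=n)                  # large shared noise
x1 = s + 0.5 * y + rng.normal(scale=0.1, size=n)   # weak signal buried in s
x2 = s                                             # no signal, only the noise
X = np.column_stack([x1, x2])

half = n // 2
acc_x1 = univariate_accuracy(X[half:, 0], y[half:])
acc_x2 = univariate_accuracy(X[half:, 1], y[half:])
acc_joint = joint_accuracy(X[:half], y[:half], X[half:], y[half:])
print(f"x1 alone: {acc_x1:.2f}, x2 alone: {acc_x2:.2f}, both: {acc_joint:.2f}")
```

A univariate ranking would place both features near the bottom, while the multivariate classifier separates the classes almost perfectly because the difference of the two features removes the shared noise.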

Conclusions

Our results challenge the paradigm of using feature selection techniques to design parsimonious classifiers from microarray and similar high-dimensional, small-sample-size data sets. The fact that there are many subsets of genes that work equally well to classify the data provides a strong counter-result to the notion that there is a small number of “top genes” that should be used to build classifiers. In our results, the best classifiers were formed using genes with limited univariate power, thus illustrating that deeper mining of features using multivariate techniques is important.

Keywords:
Feature selection; Microarray; Discrimination; Classification; Pathways; Sparse SVM; Influenza