This article is part of the supplement: The 2010 International Conference on Bioinformatics and Computational Biology (BIOCOMP 2010): Genomics

Open Access Research article

Maximizing biomarker discovery by minimizing gene signatures

Chang Chang1, Junwei Wang1, Chen Zhao1, Jennifer Fostel2, Weida Tong3, Pierre R Bushel4, Youping Deng5, Lajos Pusztai6, W Fraser Symmans6 and Tieliu Shi1*

Author affiliations

1 The Center for Bioinformatics and the Institute of Biomedical Sciences, School of Life Science, East China Normal University, 500 Dongchuan Road, Shanghai 200241, China

2 SRA Global Health Sector/NIEHS, Research Triangle Park, NC, 27709, USA

3 National Center for Toxicological Research, US Food and Drug Administration, 3900 NCTR Road, Jefferson, AR 72079, USA

4 Biostatistics Branch, National Institute of Environmental Health Sciences, P.O. Box 12233, Research Triangle Park, NC 27709, USA

5 Rush University Cancer Center, Department of Internal Medicine, Rush University Medical Center, Chicago, IL 60612, USA

6 Department of Breast Medical Oncology and Department of Pathology, The University of Texas M. D. Anderson Cancer Center, PO Box 301439, Houston, TX 77230, USA


Citation and License

BMC Genomics 2011, 12(Suppl 5):S6 doi:10.1186/1471-2164-12-S5-S6

Published: 23 December 2011



Gene signatures are potentially of considerable value in clinical diagnosis. However, gene signatures defined with different methods can differ considerably, even when they are applied to the same disease and the same endpoint. Previous studies have shown that correctly selecting subsets of genes from microarray data is key to the accurate classification of disease phenotypes, and a number of methods have been proposed for this purpose. However, these methods refine the subsets by considering each feature individually, and they do not confirm the association between the genes identified in each gene signature and the disease phenotype. We propose a new method, termed Minimize Feature's Size (MFS), that is based on multiple-level similarity analyses and on the association between genes and disease. We applied it to breast cancer endpoints by comparing classifier models generated in the second phase of the MicroArray Quality Control project (MAQC-II), with the aim of developing effective meta-analysis strategies that transform the MAQC-II signatures into a robust and reliable set of biomarkers for clinical applications.


We analyzed the similarity of the multiple gene signatures within an endpoint and between the two breast cancer endpoints at both the probe and gene levels. The results indicate that disease-related genes can be preferentially selected as components of gene signatures, and that the gene signatures for the two endpoints could be interchangeable. Minimized signatures were built at the probe level with MFS for each endpoint. With this approach, we generated a much smaller gene signature with predictive power similar to that of the MAQC-II gene signatures.
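As an illustration of what comparing signatures at the probe and gene levels can look like, the following is a minimal sketch only. It assumes the Jaccard index as the similarity measure, which this abstract does not specify, and it is not the MFS procedure itself. The function names, signature names, probe IDs, and the probe-to-gene mapping are all hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def to_gene_level(probes, probe_to_gene):
    """Map probe IDs to gene symbols, dropping unannotated probes."""
    return {probe_to_gene[p] for p in probes if p in probe_to_gene}


def pairwise_similarity(signatures, probe_to_gene):
    """Return {(name_a, name_b): (probe_level, gene_level)} for every pair."""
    result = {}
    for (name_a, probes_a), (name_b, probes_b) in combinations(
        sorted(signatures.items()), 2
    ):
        probe_level = jaccard(probes_a, probes_b)
        gene_level = jaccard(
            to_gene_level(probes_a, probe_to_gene),
            to_gene_level(probes_b, probe_to_gene),
        )
        result[(name_a, name_b)] = (probe_level, gene_level)
    return result


# Hypothetical toy data: placeholder probe IDs and an illustrative mapping.
probe_to_gene = {"p1": "ESR1", "p2": "ESR1", "p3": "GATA3", "p4": "FOXA1"}
signatures = {"sig_A": ["p1", "p3"], "sig_B": ["p2", "p3", "p4"]}

similarity = pairwise_similarity(signatures, probe_to_gene)
```

In this toy case the two signatures share one probe out of four (probe-level similarity 0.25) but two genes out of three at the gene level, because different probes can map to the same gene. This is one reason overlap may be examined at both levels.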


Our results indicate that gene signatures of both large and small sizes could perform equally well in clinical applications. In addition, consistency and biological significance can be detected among different gene signatures, reflecting the endpoints under study. New classifiers built with MFS show improved performance in both internal and external validation, suggesting that the MFS method generally reduces redundancy among features within gene signatures and improves model performance. Consequently, our strategy will benefit microarray-based clinical applications.