Open Access Methodology article

Joint analysis of two microarray gene-expression data sets to select lung adenocarcinoma marker genes

Hongying Jiang1, Youping Deng2, Huann-Sheng Chen3, Lin Tao3, Qiuying Sha3, Jun Chen2, Chung-Jui Tsai1 and Shuanglin Zhang3*

  • * Corresponding author: Shuanglin Zhang

  • † Equal contributors

Author Affiliations

1 Plant Biotechnology Research Center, School of Forest Resources & Environmental Science, Michigan Technological University, 1400 Townsend Dr., Houghton, MI 49931, USA

2 Division of Biology, Kansas State University, Manhattan, KS 66506, USA

3 Department of Mathematical Sciences, Michigan Technological University, 1400 Townsend Dr., Houghton, MI 49931, USA

BMC Bioinformatics 2004, 5:81  doi:10.1186/1471-2105-5-81

Published: 24 June 2004



Because many microarray experiments are costly and poorly reproducible, individual studies often include only a limited number of patient samples, and different studies of the same disease identify very few marker genes in common. Merging data sets from multiple studies is therefore of great interest, though challenging, because the larger sample size may increase the power of statistical inference. In this study, we combined two lung cancer studies that used microarray GeneChip®. We employed two gene shaving methods to identify genes whose expression patterns distinguish diseased from normal samples, and a two-step survival test to identify genes whose expression patterns indicate patient survival.


In addition to common data transformation and normalization procedures, we applied a distribution transformation method to integrate the two data sets. Gene shaving (GS) methods based on Random Forests (RF) and Fisher's Linear Discrimination (FLD) were then applied separately to the joint data set for cancer gene selection. The two methods discovered 13 and 10 marker genes, respectively (5 in common), with expression patterns differentiating diseased from normal samples. Of these marker genes, 8 and 7, respectively, have been reported as cancer-related in other published studies. Furthermore, classifiers built from these marker genes on one data set predicted the other data set with more than 98% accuracy. Using the univariate Cox proportional hazards regression model, the expression patterns of 36 genes were found to be significantly correlated with patient survival (p < 0.05). Twenty-six of these 36 genes have been reported in the literature as survival-related, including 7 known tumor-suppressor genes and 9 oncogenes. Additional principal component regression analysis further reduced the gene list from 36 to 16.
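The abstract does not spell out the distribution transformation, so the following is only an illustrative sketch of one common way to put two data sets on a shared scale: empirical quantile mapping, in which each value is replaced by the reference value at the same quantile. The function name `quantile_map` and the toy intensities are hypothetical and are not taken from the study.

```python
from bisect import bisect_right


def quantile_map(values, reference):
    """Map each value onto the reference distribution via empirical quantiles.

    Illustrative stand-in only; not necessarily the transformation
    used by the authors.
    """
    src_sorted = sorted(values)
    ref_sorted = sorted(reference)
    n_src, n_ref = len(src_sorted), len(ref_sorted)
    mapped = []
    for v in values:
        # Empirical CDF of v within its own data set, in (0, 1].
        p = bisect_right(src_sorted, v) / n_src
        # Index of the reference value at (approximately) the same quantile.
        idx = min(n_ref - 1, max(0, round(p * n_ref) - 1))
        mapped.append(ref_sorted[idx])
    return mapped


# Hypothetical expression values for the same genes in two studies
# that were measured on different scales.
study_a = [2.0, 3.5, 5.0, 8.0]
study_b = [20.0, 80.0, 35.0, 50.0]

study_b_on_a_scale = quantile_map(study_b, study_a)
print(study_b_on_a_scale)
```

The mapping preserves the rank order of values within the transformed data set while forcing its empirical distribution to match the reference, which is the property that makes pooled analysis of the two sets plausible.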


This study provides a method for integrating microarray data sets of different origins, and new methods for selecting a minimal set of marker genes to aid in cancer diagnosis. After careful data integration, a classification method developed from one data set can be applied to the other with high prediction accuracy.
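To make the idea of training on one data set and predicting the other concrete, here is a minimal sketch of a two-gene Fisher linear discriminant fitted on one toy study and evaluated on a second. It assumes the two studies are already on a common scale. All sample values, labels, and function names are hypothetical, and the sketch omits the gene shaving step entirely.

```python
def mean_vec(rows):
    """Per-gene mean of a list of two-gene samples."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]


def scatter(rows, m):
    """2x2 within-class scatter matrix around the class mean m."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - m[0], r[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def fit_fld(normal, tumor):
    """Fisher direction w = Sw^-1 (m1 - m0) with a midpoint threshold."""
    m0, m1 = mean_vec(normal), mean_vec(tumor)
    s0, s1 = scatter(normal, m0), scatter(tumor, m1)
    a, b = s0[0][0] + s1[0][0], s0[0][1] + s1[0][1]
    c, d = s0[1][0] + s1[1][0], s0[1][1] + s1[1][1]
    det = a * d - b * c  # assumed non-zero for this toy data
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    # Closed-form inverse of the pooled 2x2 scatter matrix applied to diff.
    w = [(d * diff[0] - b * diff[1]) / det,
         (-c * diff[0] + a * diff[1]) / det]
    proj0 = w[0] * m0[0] + w[1] * m0[1]
    proj1 = w[0] * m1[0] + w[1] * m1[1]
    return w, 0.5 * (proj0 + proj1)


def predict(model, sample):
    """Classify a sample by which side of the threshold it projects to."""
    w, threshold = model
    score = w[0] * sample[0] + w[1] * sample[1]
    return "tumor" if score > threshold else "normal"


# Hypothetical study A (training): expression of two marker genes.
normal_a = [(1.0, 5.0), (1.5, 4.0), (2.0, 5.5)]
tumor_a = [(4.0, 2.0), (5.0, 1.0), (4.5, 2.5)]

# Hypothetical study B (held out), already on the same scale as study A.
study_b = [((1.2, 4.5), "normal"), ((4.8, 1.5), "tumor"),
           ((2.2, 5.0), "normal"), ((3.9, 2.2), "tumor")]

model = fit_fld(normal_a, tumor_a)
correct = sum(predict(model, x) == label for x, label in study_b)
accuracy = correct / len(study_b)
print(f"cross-study accuracy: {accuracy:.2f}")
```

The toy classes are deliberately well separated, so the accuracy here says nothing about real data; the point is only the workflow of fitting on one study and scoring the other.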