Open Access Research article

Gene expression profiling of breast cancer survivability by pooled cDNA microarray analysis using logistic regression, artificial neural networks and decision trees

Hsiu-Ling Chou1, Chung-Tay Yao2, Sui-Lun Su3, Chia-Yi Lee3, Kuang-Yu Hu4, Harn-Jing Terng5, Yun-Wen Shih3, Yu-Tien Chang3, Yu-Fen Lu3, Chi-Wen Chang6, Mark L Wahlqvist7, Thomas Wetter8 and Chi-Ming Chu3*

Author Affiliations

1 Department of Nursing, Far Eastern Memorial Hospital & Oriental Institute of Technology, New Taipei, Taiwan

2 Department of Surgery, Cathay General Hospital, Taipei, Taiwan

3 Section of Biomedical informatics, School of Public Health, National Defense Medical Center, Taipei, Taiwan

4 Department of Bioinformatics, Chung Hua University, Hsinchu, Taiwan

5 Advpharma, Inc., Taipei, Taiwan

6 School of Nursing, College of Medicine, Chang-Gung University, Taoyuan, Taiwan

7 National Health Research Institute, Chunan, Taiwan

8 Department of Medical Informatics, University of Heidelberg, Heidelberg, Germany


BMC Bioinformatics 2013, 14:100  doi:10.1186/1471-2105-14-100

Published: 19 March 2013



Microarray technology can acquire information about thousands of genes simultaneously. We analyzed published breast cancer microarray databases to predict five-year recurrence and compared the performance of three data mining algorithms, artificial neural networks (ANN), decision trees (DT) and logistic regression (LR), and two composite models, DT-ANN and DT-LR. Four breast cancer microarray datasets from the Gene Expression Omnibus were pooled to predict five-year breast cancer relapse. After data compilation, the pooled data comprised 757 subjects, 5 clinical variables and 13,452 genetic variables. The bootstrap method, Mann–Whitney U test and 20-fold cross-validation were used to identify candidate genes with the 100 most significant p-values. The predictive power of the DT, LR and ANN models was assessed using accuracy and the area under the ROC curve. The associated genes were evaluated using Cox regression.
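The gene-screening step and the ROC-based evaluation share one statistic: the Mann–Whitney U value for a gene, divided by the product of the two group sizes, equals the area under the ROC curve for that gene used alone as a classifier. The following is a minimal sketch of that relationship in plain Python, not the authors' pipeline. The gene names, the toy expression values and the helper names (`average_ranks`, `mann_whitney_auc`, `top_genes`) are hypothetical, and ranking by distance of the AUC from 0.5 is used here only as a simple stand-in for ranking by p-value; the bootstrap and 20-fold cross-validation steps are omitted.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def mann_whitney_auc(values, labels):
    """Return (U, AUC) for the samples labelled 1 versus those labelled 0."""
    ranks = average_ranks(values)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u, u / (n_pos * n_neg)


def top_genes(expression, labels, k):
    """Rank genes by |AUC - 0.5| (a proxy for significance) and keep k."""
    scored = []
    for gene, values in expression.items():
        _, auc = mann_whitney_auc(values, labels)
        scored.append((abs(auc - 0.5), gene))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [gene for _, gene in scored[:k]]


# Hypothetical toy data: 4 subjects, 1 = relapse within five years.
relapse = [0, 0, 1, 1]
expression = {
    "GENE_A": [0.1, 0.2, 0.9, 1.1],  # higher in relapse
    "GENE_B": [0.5, 0.4, 0.6, 0.3],  # uninformative
    "GENE_C": [1.0, 0.9, 0.2, 0.1],  # lower in relapse
}
print(top_genes(expression, relapse, k=2))
```

Genes that separate the two groups in either direction score highly, so both over-expressed and under-expressed candidates are retained. In practice a library routine that also returns p-values and handles tie corrections would normally replace these helpers.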


The DT models exhibited the lowest predictive power and the poorest extrapolation when applied to the test samples. The ANN models displayed the best predictive power and the best extrapolation. The 21 most-associated genes, as determined by integrating the results of each model, were analyzed using Cox regression and showed a 3.53-fold (95% CI: 2.24-5.58) increased risk of five-year breast cancer recurrence…
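The reported hazard ratio and its interval can be checked for internal consistency. The sketch below assumes the interval is a Wald interval, symmetric on the log scale, which the abstract does not state; under that assumption the standard error of the log hazard ratio follows from the interval width, and the geometric mean of the two bounds should reproduce the point estimate. The function name `se_from_ci` is hypothetical.

```python
import math
from statistics import NormalDist


def se_from_ci(lower, upper, level=0.95):
    """Standard error of log(HR), assuming a Wald interval on the log scale."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    return (math.log(upper) - math.log(lower)) / (2 * z)


# Values reported in the abstract.
hr, lower, upper = 3.53, 2.24, 5.58

se = se_from_ci(lower, upper)
# Under the symmetry assumption, the interval is centred on log(HR).
centre = math.exp((math.log(lower) + math.log(upper)) / 2)
# Wald z statistic for the null hypothesis HR = 1.
z_stat = math.log(hr) / se

print(f"SE of log(HR): {se:.3f}")
print(f"Centre of interval: {centre:.2f}")
print(f"Wald z: {z_stat:.2f}")
```

Under this assumption the standard error is about 0.23 and the interval is centred near 3.54, which agrees with the reported 3.53 after rounding. The implied z statistic is roughly 5.4.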


The 21 selected genes can predict breast cancer recurrence. Among these genes, CCNB1, PLK1 and TOP2A are in the cell cycle G2/M DNA damage checkpoint pathway. Oncologists can offer this genetic information to patients to help them understand the gene expression profiles associated with breast cancer recurrence.

Keywords: Breast cancer; Microarray; Artificial neural network; Logistic regression; Decision tree