An improved survivability prognosis of breast cancer by using sampling and feature selection technique to solve imbalanced patient classification data

Wang, Kung-Jeng; Makond, Bunjira; Wang, Kung-Min

doi:10.1186/1472-6947-13-124

Research article
Open access
Published: 09 November 2013

Kung-Jeng Wang, Bunjira Makond & Kung-Min Wang

BMC Medical Informatics and Decision Making volume 13, Article number: 124 (2013)

Abstract

Background

Breast cancer is one of the most critical cancers and is a major cause of cancer death among women. Knowing the survivability of patients is essential to ease decision making about medical treatment and financial preparation. Breast cancer data sets have recently become imbalanced (i.e., the number of survival patients exceeds the number of non-survival patients), whereas standard classifiers are not applicable to imbalanced data sets. Methods to improve the survivability prognosis of breast cancer therefore need to be studied.

Methods

Two well-known five-year prognosis models/classifiers [i.e., logistic regression (LR) and decision tree (DT)] are constructed by combining synthetic minority over-sampling technique (SMOTE), cost-sensitive classifier technique (CSC), under-sampling, bagging, and boosting. A feature selection method is used to select relevant variables, while a pruning technique is applied to obtain low information-burden models. These methods are applied to data obtained from the Surveillance, Epidemiology, and End Results database. Improvements in the survivability prognosis of breast cancer are investigated based on the experimental results.
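To make the two sampling ideas named above concrete, the following is a minimal sketch in plain Python, not the implementation used in this study. It assumes numeric feature vectors and Euclidean distance; the function names `smote` and `undersample` and the toy records are hypothetical and chosen only for illustration.

```python
import math
import random

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

def undersample(majority, n_keep, seed=0):
    """Randomly keep n_keep majority samples (without replacement)."""
    return random.Random(seed).sample(majority, n_keep)

# Toy data (hypothetical): 8 "survived" records vs 3 "not survived".
majority = [[float(i), float(i % 3)] for i in range(8)]
minority = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]

# Over-sampling grows the minority class to match the majority class.
balanced_by_smote = minority + smote(minority, n_new=5, k=2)
# Under-sampling shrinks the majority class to match the minority class.
balanced_by_undersampling = undersample(majority, n_keep=len(minority))
```

Over-sampling keeps every original record and adds interpolated ones, whereas under-sampling discards majority records; either way the classifier is then trained on classes of equal size.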

Results

Experimental results confirm that the DT and LR models combined with SMOTE, CSC, and under-sampling generate higher predictive performance consecutively than the original ones. Most of the time, DT and LR models combined with SMOTE and CSC carry a lower information burden (i.e., use fewer features) when a feature selection method and a pruning technique are applied.

Conclusions

LR is found to have better statistical power than DT in predicting five-year survivability. CSC is superior to SMOTE, under-sampling, bagging, and boosting in improving the prognostic performance of DT and LR.

Background

The need to monitor the survivability of breast cancer patients is threefold. First, breast cancer is one of the most critical cancers [1] and is a major cause of cancer death among women. DeSantis et al. [2] reported that in 2011, around 230,480 American women were diagnosed with invasive breast cancer and 39,520 breast cancer patients died. Second, the survivability of breast cancer patients has a significant impact on healthcare expenses and planning for both the government and private sectors. Third, the survivability of most common cancers (e.g., breast, prostate, lung, and colorectal) has changed over time, increasing continuously over the long term [3] because of recent advances in cancer diagnosis and treatment, which reduce mortality and increase survival time. Although many previous studies have been conducted, constant monitoring is still necessary. Thus, predicting the survivability of breast cancer patients without bias is a critical task for the healthcare system.

Recently, artificial-intelligence-based data-mining techniques have been comprehensively used to predict the survivability of breast cancer patients. Lundin et al. [4] used the artificial neural network (ANN) to predict breast cancer survival in Turku, Finland, from 1945 to 1984. Soria et al. [5] compared three classifiers (naive Bayes algorithm, C4.5 DT, and multilayer perceptron function) to evaluate the most suitable technique for predicting the survivability of breast cancer patients from the Nottingham Tenovus Primary Breast Carcinoma Series. Khan et al. [6] used fuzzy DTs to predict breast cancer survivability. Chang and Liou [7] investigated the application of ANN, DT, logistic regression (LR), and genetic algorithm in the prognosis models of breast cancer acquired from patients at the University of Wisconsin.

Surveillance, Epidemiology, and End Results (SEER) data have been recognized and applied for breast cancer prognosis. Delen et al. [8] used the SEER database from 1973 to 2000, and studied breast cancer survivability using C5 decision tree, LR, and ANN. The five-year survival was 46% and their DT-based model was the best predictor with 93.62% accuracy. In comparison, the accuracies of ANN and LR were 91.21% and 89.20%, respectively. Bellaachia and Guven [9] used SEER data from 1973 to 2002 to predict breast cancer survivability and to compare naive Bayes network (BN), back-propagated ANN, and C4.5 DT; the real survivability was 76.80%. Their resulting decision trees (C4.5) had the best classification with 86.70% accuracy, followed by ANN and BN with 86.50% and 84.50%, respectively. Endo et al. [10] used the SEER data set from 1992 to 1997. They proposed several models (i.e., LR, J48 DT, DT with naive BN, ANN, naive BN, BN, and ID3 DT) to predict the five-year survivability of breast cancer patients; the survivability was 81.50%. Their study showed that LR has the highest accuracy (85.80 ± 2%). Liu et al. [11] used DT-based predictive models for breast cancer survivability, concluding that the survival rate of patients was 86.52%. They employed the under-sampling technique and bagging algorithm to deal with the imbalance problem, thus improving the predictive performance. These studies are comparatively summarized in Table 1.

Table 1 Breast cancer survival prognosis research using SEER data
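The survival rates quoted above also show why accuracy alone can be misleading on imbalanced data. The sketch below is illustrative only: it computes what a trivial classifier that always predicts the more common class would score at each reported survival rate. The function name `majority_baseline` is hypothetical, and the two recall values are the standard per-class recall definitions rather than metrics taken from the cited studies.

```python
def majority_baseline(survival_rate):
    """Metrics of a classifier that always predicts the more common class.

    `survival_rate` is the fraction of patients labelled "survived".
    Returns (accuracy, recall_survived, recall_not_survived).
    """
    predicts_survived = survival_rate >= 0.5
    accuracy = survival_rate if predicts_survived else 1.0 - survival_rate
    recall_survived = 1.0 if predicts_survived else 0.0
    recall_not_survived = 0.0 if predicts_survived else 1.0
    return accuracy, recall_survived, recall_not_survived

# Survival rates quoted in the text above for the four SEER-based studies.
reported = {
    "Delen et al. [8]": 0.46,
    "Bellaachia and Guven [9]": 0.7680,
    "Endo et al. [10]": 0.8150,
    "Liu et al. [11]": 0.8652,
}

for study, rate in reported.items():
    acc, rec_s, rec_n = majority_baseline(rate)
    print(f"{study}: accuracy={acc:.4f}, "
          f"recall(survived)={rec_s:.0f}, recall(not survived)={rec_n:.0f}")
```

At a survival rate of 86.52%, this trivial rule reaches 86.52% accuracy without identifying a single non-survivor, so a reported accuracy is best read against the class distribution of the data set it was measured on.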
