This article is part of the supplement: Proceedings of the 2011 International Conference on Bioinformatics and Computational Biology (BIOCOMP'11)
TSG: a new algorithm for binary and multi-class cancer classification and informative genes selection
- Equal contributors
1 Department of Statistics, Kansas State University, Manhattan, KS 66506, USA; this work was done while Haiyan Wang was on sabbatical leave at Hunan Provincial Key Laboratory of Crop Germplasm Innovation and Utilization, Changsha 410128, China
2 Hunan Provincial Key Laboratory of Crop Germplasm Innovation and Utilization, Changsha 410128, China
3 College of Information Science and Technology, Hunan Agricultural University, Changsha 410128, China
4 College of Bio-safety Science and Technology, Hunan Agricultural University, Changsha 410128, China
5 USDA-ARS and Department of Entomology, Kansas State University, Manhattan, KS 66506, USA
BMC Medical Genomics 2013, 6(Suppl 1):S3 doi:10.1186/1755-8794-6-S1-S3
Published: 23 January 2013
One of the challenges in classifying cancer tissue samples from gene expression data is to establish an effective method that selects a parsimonious set of informative genes. Top Scoring Pair (TSP), k-Top Scoring Pairs (k-TSP), Support Vector Machines (SVM), and Prediction Analysis of Microarrays (PAM) are four popular classifiers with comparable performance on multiple cancer datasets. SVM and PAM tend to use a large number of genes, while TSP and k-TSP always use an even number of genes. In addition, k-TSP selects its distinct gene pairs by simply combining the top-ranking pairs, without considering that the gene set with the best discriminative power may not be a combination of those pairs. The k-TSP algorithm also requires the user to specify an upper bound on the number of gene pairs. Here we introduce a computational algorithm that addresses these problems. The algorithm is named the Chisquare-statistic-based Top Scoring Genes (Chi-TSG) classifier, abbreviated as TSG.
The TSG classifier starts with the top two genes and sequentially adds genes to the candidate gene set to perform informative gene selection. The algorithm automatically reports the total number of informative genes selected with cross-validation. We provide the algorithm for both binary and multi-class cancer classification. The algorithm was applied to 9 binary and 10 multi-class gene expression datasets involving human cancers. The TSG classifier outperforms the TSP family classifiers by a large margin on most of the 19 datasets. In addition to improved accuracy, our classifier shares all the advantages of the TSP family classifiers: it is easy to interpret, it is invariant to monotone transformations, it often selects a small number of informative genes, which allows follow-up studies, and it is resistant to sampling variation because its operations are performed within each sample.
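The selection procedure described above can be illustrated with a minimal sketch. This abstract does not spell out the scoring, so everything specific here is an assumption made for illustration: a gene pair is scored by the Pearson chi-square statistic of the table of the ordering indicator I(x_i < x_j) against the class label, a candidate gene is scored by the sum of its pairwise chi-square values with the genes already selected (a hypothetical set score), and a fixed `max_genes` cap stands in for the cross-validation step. The function names `pair_chi2` and `greedy_gene_set` are hypothetical and are not taken from the authors' software.

```python
import itertools
import numpy as np


def pair_chi2(x_i, x_j, y):
    """Pearson chi-square of the table I(x_i < x_j) versus class label.

    Assumed scoring for illustration; works for two or more classes.
    """
    classes = np.unique(y)
    below = x_i < x_j
    # Rows: ordering indicator (False, True); columns: classes.
    table = np.array(
        [[np.sum((below == b) & (y == c)) for c in classes] for b in (False, True)],
        dtype=float,
    )
    expected = table.sum(axis=1, keepdims=True) * table.sum(axis=0, keepdims=True) / table.sum()
    mask = expected > 0  # skip cells with zero expected count
    return float(np.sum((table[mask] - expected[mask]) ** 2 / expected[mask]))


def greedy_gene_set(X, y, max_genes=5):
    """Greedy forward selection on a samples-by-genes matrix X.

    Starts from the top-scoring pair, then adds one gene at a time.
    max_genes is a simplification; the paper chooses the size by cross-validation.
    """
    n_genes = X.shape[1]
    # Step 1: the pair of genes with the largest chi-square score.
    best_pair = max(
        itertools.combinations(range(n_genes), 2),
        key=lambda p: pair_chi2(X[:, p[0]], X[:, p[1]], y),
    )
    selected = list(best_pair)

    # Step 2: sequentially add the gene with the largest hypothetical set score.
    while len(selected) < min(max_genes, n_genes):
        remaining = [g for g in range(n_genes) if g not in selected]

        def set_gain(g):
            return sum(pair_chi2(X[:, g], X[:, s], y) for s in selected)

        selected.append(max(remaining, key=set_gain))
    return selected


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y = np.array([0] * 10 + [1] * 10)
    X = rng.normal(size=(20, 6))
    X[:, 0] = np.where(y == 0, -5.0, 5.0)  # gene 0 flips order relative to gene 1
    X[:, 1] = 0.0
    print(greedy_gene_set(X, y, max_genes=3))
```

Because the score depends only on the ordering of expression values within each sample, any monotone transformation of a sample's values leaves it unchanged, which matches the invariance property noted above.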
Redefining the scores for gene sets and the classification rules of the TSP family classifiers to incorporate sample size information can lead to better selection of informative genes and higher classification accuracy. The resulting TSG classifier offers a useful tool for cancer classification based on numerical molecular data.