Methodology article (Open Access)

Gene selection for cancer identification: a decision tree model empowered by particle swarm optimization algorithm

Kun-Huang Chen1*, Kung-Jeng Wang1, Min-Lung Tsai2, Kung-Min Wang3, Angelia Melani Adrian1, Wei-Chung Cheng4,5, Tzu-Sen Yang6,7, Nai-Chia Teng8, Kuo-Pin Tan9 and Ku-Shang Chang2

Author Affiliations

1 Department of Industrial Management, National Taiwan University of Science and Technology, Taipei 106, Taiwan, R.O.C

2 Department of Food Science, Yuanpei University, No. 306, Yuanpei Street, Hsinchu 300, Taiwan, R.O.C

3 Department of Surgery, Shin-Kong Wu Ho-Su Memorial Hospital, Taipei, Taiwan, R.O.C

4 Pediatric Neurosurgery, Department of Surgery, Cheng Hsin General Hospital, Taipei 11220, Taiwan, R.O.C

5 Genomic Research Center, National Yang-Ming University, Taipei 11221, Taiwan, R.O.C

6 School of Dental Technology, Taipei Medical University, Taipei 110, Taiwan, R.O.C

7 Taiwan Research Center for Biomedical Implants and Microsurgery Devices, Taipei Medical University Taipei 110, Taiwan, R.O.C

8 School of Dentistry, College of Oral Medicine, Taipei Medical University, Taipei, Taiwan, R.O.C

9 MBA, School of Management, National Taiwan University of Science and Technology, Taipei 106, Taiwan, R.O.C

BMC Bioinformatics 2014, 15:49  doi:10.1186/1471-2105-15-49

The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1471-2105/15/49


Received: 7 August 2013
Accepted: 7 February 2014
Published: 20 February 2014

© 2014 Chen et al.; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.

Abstract

Background

In microarray data analysis, selecting a small number of informative genes that may contribute to the occurrence of cancers, out of the thousands of genes measured, is an important issue. Many researchers use various computational intelligence methods to analyze gene expression data.

Results

To achieve efficient gene selection from thousands of candidate genes that can contribute to identifying cancers, this study develops a novel method that combines particle swarm optimization with a decision tree as the classifier. This study also compares the performance of the proposed method with that of other well-known benchmark classification methods (support vector machine, self-organizing map, back propagation neural network, C4.5 decision tree, Naive Bayes, CART decision tree, and artificial immune recognition system) through experiments on 11 gene expression cancer datasets.

Conclusion

Based on statistical analysis, our proposed method outperforms other popular classifiers for all test datasets, and is comparable to SVM for certain specific datasets. Further, housekeeping genes with various expression patterns and tissue-specific genes are identified. These genes provide high discrimination power for cancer classification.

Keywords:
Gene expression; Cancer; Particle swarm optimization; Decision tree classifier

Background

Researchers have tried to analyze thousands of genes simultaneously by microarray technology to obtain important information about specific cellular functions of gene(s) which can be used in cancer diagnosis and prognosis [1]. Gene selection from gene expression data is challenging because of the small sample size, high dimensionality and high noise of such data. A method is needed for choosing an important subset of genes that yields high classification accuracy. Such a method would not only enable doctors to identify a small subset of biologically relevant genes for cancers, but would also save computational costs [2].

Gene selection methods can be divided into three classes: the wrapper, the filter, and the embedded approaches. Wrappers utilize a learning machine and search for the best features in the space of all feature subsets. Despite their simplicity and often having the best performance results, wrappers depend heavily on the inductive principle of the learning model and may suffer from excessive computational complexity because the learning machine has to be retrained for each feature subset considered [3]. The wrapper method is usually superior to the filter method since it accounts for the intercorrelation of individual genes in a multivariate manner, and can automatically determine the optimal number of feature genes for a particular classifier. The filter approach usually employs statistical methods, such as statistical tests, Wilcoxon’s rank test and mutual information, to capture the intrinsic characteristics of genes in discriminating the targeted phenotype class and to select feature genes directly [4]. This approach is easily implemented, but ignores the complex interactions among genes. Finally, the embedded method is a catch-all group of techniques that perform feature selection as part of the model construction process. It is similar to the wrapper method, but multiple algorithms can be combined in the embedded method to perform feature subset selection [5,6]. Genetic algorithms (GAs) [7] are generally used as the search engine for the feature subset in the embedded method, while other methods, such as the estimation of distribution algorithm (EDA) with SVM [8-13], K nearest neighbors/genetic algorithms (KNN/GA) [14], genetic algorithms-support vector machine (GA-SVM) [15] and so forth, are also used to select feature subsets.

Particle Swarm Optimization (PSO), developed by Kennedy and Eberhart [16], is a population-based meta-heuristic for stochastic optimization, inspired by the social behavior of flocks of birds or schools of fish [17]. PSO has been widely applied in many fields to solve various optimization problems, including gene selection [1,2,18-20]. In PSO, each potential solution to the problem is called a particle. A swarm of particles with randomly initialized positions moves toward the optimal position along a search path that is iteratively updated on the basis of the best particle position and velocity. Among the classifiers that can be paired with such a search algorithm, C4.5 is a decision tree-based classifier listed among the top 10 most influential data-mining algorithms [21]. Decision trees are easy to interpret and understand.

This paper presents a PSO-based algorithm to address the problem of gene selection. The proposed approach, called PSODT, integrates the PSO search algorithm with the C4.5 decision tree classifier. Combining PSO with the C4.5 classifier has rarely been investigated by previous researchers. The performance of our proposed method is evaluated on 11 microarray datasets, which consist of 1 dataset from cancer patients in the M2DB in Taiwan [22] and 10 from the Gene Expression Model Selector [23]. In addition, the performance of our proposed method is compared with that of other well-known classifier algorithms, namely self-organizing map (SOM), C4.5, back propagation neural network (BPNN), SVM, Naive Bayes (NB), CART decision tree, and artificial immune recognition system (AIRS). Statistical tests are employed to assess the differences among all the algorithms in terms of classification accuracy.

Gene selection and classification

A DNA microarray (also commonly known as a DNA chip or biochip) is a collection of microscopic DNA spots attached to a solid surface that allows researchers to measure the expression levels of thousands of genes simultaneously in a single experiment. Classifier approaches are applied to DNA microarray data to compare gene expression levels in tissues under different conditions, for instance wild type versus mutant, or healthy versus diseased [24]. Jiang et al. [25] devised a random forest (RF)-based method to classify real pre-miRNAs using a hybrid feature set. Batuwita and Palade [26] developed a classifier named microPred for distinguishing human pre-miRNA hairpins from both pseudo hairpins and other ncRNAs. Wang et al. [27] presented a hybrid method combining GA and SVM to identify an optimal feature subset, and claimed that their results were superior to those obtained by microPred and miPred. Further, Nanni et al. [28] recently used a support vector machine (SVM) as the classifier for microarray gene classification. Their method combines different feature reduction approaches to improve classification performance in terms of accuracy and area under the receiver operating characteristic (ROC) curve. Park et al. [29] presented a method for inferring combinatorial Boolean rules of gene sets for cancer classification and cancer transcriptome. Their study identified a small group of gene sets that synergistically contribute to the classification of samples into their corresponding phenotypic groups (such as normal and cancer) and reduced the search space of the possible Boolean rules.

Because classifying high-dimensional data incurs high computational cost and memory usage, an appropriate gene selection procedure is required to improve classification performance. As addressed by Tan et al. [30], given the quantity and complexity of gene expression data, it is impractical to compute and compare the n × m gene expression matrix manually. Instead, machine learning and other artificial intelligence techniques have the potential to characterize gene expression data promptly [8,31,32].

Previous studies

Some studies have proposed PSO algorithms for gene selection problems. For instance, Alba et al. [1] presented a modified PSO (geometric PSO) for high-dimensional microarray data. Both augmented SVM and GA were proposed for comparison on six public cancer datasets. Li et al. [2] devised a method combining PSO with a GA and adopted SVM as the classifier for gene selection. Their proposed approach used three benchmark gene expression datasets for validation: leukemia, colon cancer, and breast cancer. Mohamad et al. [19] presented an improved binary PSO combined with an SVM classifier to select a near-optimal subset of informative genes relevant to cancer classification.

Zhao et al. [33] recently presented a novel hybrid framework (NHF) for gene selection and cancer classification of high-dimensional microarray data by combining information gain (IG), F-score, GA, PSO, and SVM. Their method was compared with PSO-based, GA-based, ant colony optimization-based, and simulated annealing (SA)-based methods on five benchmark datasets: leukemia, lung carcinoma, colon, breast, and brain cancers. Chen et al. [18] used PSO + 1NN for feature selection and tested their algorithm against 8 benchmark datasets from the UC Irvine Machine Learning Repository as well as on a real case of obstructive sleep apnea. These studies all indicate that PSO is a promising approach to the gene selection problem.

Methods

We integrated the PSO algorithm with the C4.5 classifier to address the gene selection problem (refer to Appendices 1 and 2 at [34]). Candidate sets of important genes were proposed by the PSO algorithm, and C4.5 was then employed as the fitness function of the PSO algorithm to verify the efficiency of the selected genes.

Solution/particle representation and initialization

A particle represents a potential solution (i.e., a gene subset) in an n-dimensional space. Each particle was coded as a binary string of length n, where n is the total number of genes available for selection. The bits 0 and 1 correspond to a non-selected and a selected gene, respectively. For instance, the particle ‘11000’ covers five genes, of which only the first and the second are selected. Dimension d of particle i was updated according to equation M1 of the online article (http://www.biomedcentral.com/1471-2105/15/49/mathml/M1).

We used a random function to initialize the particle population of PSO. Seeding PSO with a good initial population can lead to a better result. This study examined two random generators for initializing solutions: the first uses the Visual C# random seed function, and the second draws from a uniform distribution over the range 0 to 1, denoted U(0,1). The result (as shown in Table 1) reveals that U(0,1) outperforms the Visual C# random seed generator. In this study, bit values 0 and 1 are each assigned with a probability of 0.5: if U(0,1) > 0.5, the bit is set as given in equation M2 of the online article (http://www.biomedcentral.com/1471-2105/15/49/mathml/M2); otherwise, it is set as given in equation M3 (http://www.biomedcentral.com/1471-2105/15/49/mathml/M3).
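The binary encoding and random initialization described in this section can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, and the sketch assumes that a U(0,1) draw above 0.5 sets the bit to 1 (gene selected) and any other draw sets it to 0, which is one reading of the rule stated in the text.

```python
import random


def init_swarm(n_particles, n_genes, rng):
    """Create n_particles binary particles, each with one bit per gene.

    Assumption: a bit is 1 (gene selected) when a U(0,1) draw exceeds 0.5,
    and 0 (gene not selected) otherwise.
    """
    return [
        [1 if rng.random() > 0.5 else 0 for _ in range(n_genes)]
        for _ in range(n_particles)
    ]


def selected_genes(particle):
    """Return the zero-based indices of the genes a particle selects."""
    return [index for index, bit in enumerate(particle) if bit == 1]


if __name__ == "__main__":
    swarm = init_swarm(n_particles=3, n_genes=5, rng=random.Random(0))
    for particle in swarm:
        print("".join(map(str, particle)), selected_genes(particle))
    # The particle '11000' from the text selects the first and second genes.
    print(selected_genes([1, 1, 0, 0, 0]))
```

Passing an explicit `random.Random` instance keeps the initialization reproducible, which is convenient when comparing random generators as in Table 1.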

Table 1. Random seed comparison

Fitness function and PSO procedure

The PSO fitness function is based on the classification accuracy measured by the C4.5 classifier. Figure 1 shows the procedure of applying PSODT on gene selection.
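As a rough sketch of how such a wrapper-style fitness function might be organized, the code below maps a particle's bit mask to the accuracy of a classifier trained on the selected genes only. The helper names are hypothetical. The scorer shown is a leave-one-out 1-nearest-neighbour rule used purely as a stand-in so that the example runs with the standard library; it is not C4.5, which is the classifier used in the paper. Scoring an empty gene subset as 0 is also an assumption.

```python
def select_columns(rows, mask):
    """Keep only the columns (genes) whose mask bit is 1."""
    keep = [j for j, bit in enumerate(mask) if bit == 1]
    return [[row[j] for j in keep] for row in rows]


def make_fitness(rows, labels, score_classifier):
    """Build a fitness function mapping a bit mask to an accuracy in [0, 1].

    score_classifier(reduced_rows, labels) is a hypothetical callback that
    trains and evaluates a classifier (C4.5 in the paper) and returns accuracy.
    """
    n_genes = len(rows[0])

    def fitness(mask):
        if len(mask) != n_genes:
            raise ValueError("mask length must equal the number of genes")
        if not any(mask):
            return 0.0  # assumption: an empty gene subset scores zero
        return score_classifier(select_columns(rows, mask), labels)

    return fitness


def loo_1nn_accuracy(rows, labels):
    """Stand-in scorer (NOT C4.5): leave-one-out 1-nearest-neighbour accuracy."""
    correct = 0
    for i, query in enumerate(rows):
        best_label, best_dist = None, float("inf")
        for k, other in enumerate(rows):
            if k == i:
                continue
            dist = sum((a - b) ** 2 for a, b in zip(query, other))
            if dist < best_dist:
                best_dist, best_label = dist, labels[k]
        correct += int(best_label == labels[i])
    return correct / len(rows)


if __name__ == "__main__":
    # Toy data: gene 0 separates the classes, gene 1 does not.
    rows = [[0.1, 5.0], [0.2, 1.0], [0.9, 5.1], [1.0, 1.1]]
    labels = ["a", "a", "b", "b"]
    fitness = make_fitness(rows, labels, loo_1nn_accuracy)
    for mask in ([1, 0], [0, 1], [1, 1]):
        print(mask, fitness(mask))
```

Keeping the classifier behind a callback means the same fitness wrapper could be reused with any classifier implementation that reports an accuracy.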

Figure 1. The proposed PSODT for gene selection.

Results and discussion

Experimental setting

This study used 10 microarray cancer datasets (with diverse sizes, features, and classes) and conducted numerical experiments to evaluate the performance of our proposed method. The 10 datasets were obtained from GEMS [23]: 11_Tumors, 14_Tumors, 9_Tumors, Brain_Tumor1, Brain_Tumor2, Leukemia2, Lung_Cancer, SRBCT, Prostate_Tumor, and DLBCL. The cancer types covered by the GEMS datasets ranked among the top 10 in terms of cancer incidence and deaths in the USA in 2012. Table 2 summarizes the characteristics of those microarray datasets. In addition, five sets of cDNA clones were selected and used individually for this purpose (refer to [34]).

Table 2. Microarray datasets employed in the study

The PSO parameters were chosen on the basis of a survey of related research articles on the use of PSO (refer to [35-37]). Moreover, we conducted many trials to confirm that this parameter setting gave the best objective value. The parameters used for PSODT were as follows. The number of particles in the population was set to one-tenth of the number of genes (features) (refer to the field ‘particle size’ in Table 2). The parameters c1 and c2 were both set at 2, whereas the lower bound (vmin) and upper bound (vmax) were set at -4 and 4, respectively. The inertia weight (w) was set at 0.4. The random factors r1 and r2 lie within the [0, 1] interval. The process was repeated until either the fitness of the given particle reached 1.0 or the number of iterations reached the default value of T = 100.
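To make the role of these parameters concrete, here is a minimal binary PSO loop using the values stated above (c1 = c2 = 2, w = 0.4, velocity limits of -4 and 4, at most T = 100 iterations, early stop at fitness 1.0). It is a sketch under assumptions, not the authors' code: the bit update uses the standard sigmoid rule of binary PSO, which may differ from the update equation in the article, and `toy_fitness` is a hypothetical stand-in for the classifier-based fitness.

```python
import math
import random


def binary_pso(fitness, n_genes, n_particles, rng,
               w=0.4, c1=2.0, c2=2.0, v_min=-4.0, v_max=4.0, max_iter=100):
    """Search for a bit mask that maximizes fitness(mask), a value in [0, 1]."""
    positions = [[1 if rng.random() > 0.5 else 0 for _ in range(n_genes)]
                 for _ in range(n_particles)]
    velocities = [[0.0] * n_genes for _ in range(n_particles)]
    pbest = [p[:] for p in positions]            # best mask seen by each particle
    pbest_fit = [fitness(p) for p in positions]
    g = max(range(n_particles), key=pbest_fit.__getitem__)
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]  # best mask seen by the swarm

    for _ in range(max_iter):
        if gbest_fit >= 1.0:                     # stopping rule from the text
            break
        for i in range(n_particles):
            for d in range(n_genes):
                r1, r2 = rng.random(), rng.random()
                v = (w * velocities[i][d]
                     + c1 * r1 * (pbest[i][d] - positions[i][d])
                     + c2 * r2 * (gbest[d] - positions[i][d]))
                v = max(v_min, min(v_max, v))    # clamp to [v_min, v_max]
                velocities[i][d] = v
                # Assumed update: standard sigmoid rule of binary PSO.
                prob_one = 1.0 / (1.0 + math.exp(-v))
                positions[i][d] = 1 if rng.random() < prob_one else 0
            fit = fitness(positions[i])
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = positions[i][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = positions[i][:], fit
    return gbest, gbest_fit


if __name__ == "__main__":
    target = [1, 0, 1, 1, 0, 0, 1, 0]            # hypothetical "ideal" subset

    def toy_fitness(mask):
        """Fraction of bits that agree with the hypothetical target mask."""
        return sum(int(a == b) for a, b in zip(mask, target)) / len(target)

    best, best_fit = binary_pso(toy_fitness, n_genes=8, n_particles=6,
                                rng=random.Random(1))
    print(best, best_fit)
```

In the study the swarm size would be one-tenth of the number of genes; the toy run above uses a small fixed swarm only to keep the example quick.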

Cross-validation

To ensure an impartial comparison of the classification results and to avoid results that depend on a single random split, this study adopted a five-fold cross-validation strategy. Cross-validation is a statistical method for evaluating and comparing learning algorithms by dividing the data into two segments: one part is used to learn or train a model and the other is used to validate the model. Stone [38] and Geisser [39] employed cross-validation as a means of choosing proper model parameters, as opposed to using cross-validation purely for estimating model performance [40-42]. K-fold cross-validation is used here to evaluate the algorithms. In this study we set K = 5, and the details are as follows: in each iteration, the algorithms use K − 1 folds of data to learn one or more models, and subsequently the learned models are asked to predict the data in the remaining validation fold. The performance of the algorithm on each fold is tracked by its accuracy. Upon completion, K accuracy values are available for validation.
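The fold bookkeeping described above can be sketched in a few lines. The function names are hypothetical, and `train_and_score` stands for any routine that trains on the given training indices and returns the accuracy on the held-out indices; the dummy scorer in the usage example only reports the fold sizes in place of a real accuracy.

```python
import random


def k_fold_indices(n_samples, k, rng):
    """Shuffle sample indices and deal them into k folds of near-equal size."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(n_samples, k, train_and_score, rng):
    """Return one accuracy per fold.

    In each round, k - 1 folds form the training set and the remaining fold
    is held out for validation.
    """
    folds = k_fold_indices(n_samples, k, rng)
    accuracies = []
    for held_out in folds:
        train = [i for fold in folds if fold is not held_out for i in fold]
        accuracies.append(train_and_score(train, held_out))
    return accuracies


if __name__ == "__main__":
    def dummy_scorer(train_idx, test_idx):
        """Placeholder: prints fold sizes instead of training a classifier."""
        print(len(train_idx), "training samples,", len(test_idx), "validation samples")
        return 1.0

    scores = cross_validate(13, 5, dummy_scorer, random.Random(0))
    print("mean accuracy:", sum(scores) / len(scores))
```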

An illustration of the resulting cancer classifier structure

Figure 2 shows a sample decision tree for classifying three female cancers (i.e., ovary, cervix uteri and uterus). The selected genes led to a classification tree with four terminal nodes (or clusters of cancer). For instance, 218934_s_at, 206166_s_at and 212341_at are identified as splitters. 218934_s_at is strongly associated with the three cancers, and the first branch of the tree is based on it: a high score (i.e., 218934_s_at > 2.7133) implies the occurrence of uterus cancer (Node 1). When 218934_s_at ≤ 2.7133 (Node 2), 206166_s_at > 2.5063 implies the occurrence of cervix uteri cancer (Node 3). When 206166_s_at ≤ 2.5063 (Node 4), 212341_at ≤ 10.026 implies the occurrence of ovary cancer (Node 5), whereas 212341_at > 10.026 again implies the occurrence of cervix uteri cancer (Node 6).
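The branching logic of this partial tree can be written directly as nested rules, which illustrates why such a classifier is easy to read. The sketch below uses the probe identifiers and thresholds quoted in the text; the function name and the dictionary input format are assumptions made for illustration.

```python
def classify_female_cancer(expression):
    """Apply the partial decision tree of Figure 2.

    expression maps probe identifiers to expression values.
    """
    if expression["218934_s_at"] > 2.7133:
        return "uterus"            # Node 1
    # Node 2: 218934_s_at <= 2.7133
    if expression["206166_s_at"] > 2.5063:
        return "cervix uteri"      # Node 3
    # Node 4: 206166_s_at <= 2.5063
    if expression["212341_at"] <= 10.026:
        return "ovary"             # Node 5
    return "cervix uteri"          # Node 6


if __name__ == "__main__":
    # Hypothetical expression values chosen to reach Node 5.
    sample = {"218934_s_at": 2.1, "206166_s_at": 1.9, "212341_at": 9.5}
    print(classify_female_cancer(sample))
```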

Figure 2. An illustration of partial decision tree.

Benchmark results with other classification algorithms

To confirm the effectiveness of our proposed PSODT, this study compares its accuracy with that of seven other popular classification algorithms (i.e., SVM, SOM, BPNN, C4.5, NB, CART, and AIRS). Table 3 shows the accuracy of our proposed method compared with the other algorithms. Five-fold cross-validation was applied to the datasets, and the averages and standard deviations were obtained. Our proposed method was superior to the others, except that it was comparable to SVM for two datasets, 9_Tumors and SRBCT. In terms of stability (convergence), the standard deviation of PSODT was less than 1%. Figure 3 shows the 95% confidence interval of the mean classification accuracy (with respect to the 10 datasets), which indicates that PSODT outperformed the other algorithms.

This study used two-way ANOVA to determine whether the eight algorithms were significantly different in terms of average classification accuracy. The data fulfilled the ANOVA assumptions of normality, homoscedasticity and independence. In the ANOVA, the classification algorithms were defined as the “factor”, whereas the datasets were defined as the “block”. Table 4 lists the ANOVA results for average classification accuracy. The results showed significant differences in classification accuracy among the 8 algorithms. Further, to determine whether each pair of algorithms differed from each other, Fisher’s test was used, as shown in Table 5. The p-values demonstrate that the mean classification accuracy of our proposed method differs from that of the other algorithms, except that it is comparable with SVM. Table 6 shows the computational time for each algorithm. Although the time consumed by the proposed tree-based algorithm is significantly larger than that of the others, it is within a reasonable range even for the large-sized datasets.

Table 3. Classification accuracy for 10 microarray datasets (%)

Figure 3. 95% confidence interval of the mean for classification accuracy.

Table 4. ANOVA for average classification accuracy

Table 5. p-value of multiple comparison for average classification accuracy

Table 6. CPU time (in sec.)

In summary, the SVM classification method, which is generally considered one of the most powerful machine learning classifiers, is based on statistical learning theory [43]. However, SVM is a black-box system which, like ANN, does not provide insight into the reasons for a classification. SOM is a type of ANN algorithm based on unsupervised learning. BPNN is a common type of ANN that is capable of recognizing complex patterns in data. However, all of the abovementioned classifiers are black-box systems and nonlinear models. The NB classifier assumes that each feature contributes independently to the probability, regardless of the presence or absence of the other features. With CART, there may be no good binary split on an attribute that has a good multi-way split [44], which may lead to inferior trees. AIRS has many parameters, and finding the optimal combination of them is not easy. In contrast, C4.5 is a classifier that creates a decision tree based on rules, and is simple to understand and interpret. This study integrates the nonlinear search capability of PSO with the linearly separable advantage of DT.

Model justification by a clinical dataset

This study investigated a set of clinical practice data comprising 13 actual cancer cases from the M2 data bank in Taiwan [22]. The raw intensity data of cancer (CEL files) generated using the Affymetrix HG-U133A and HG-U133 Plus 2.0 platforms were retrieved from ArrayExpress and the Gene Expression Omnibus (GEO). Arrays performed with samples other than human clinical specimens, such as cell lines, primary cells, and transformed cells, were excluded.

All raw data of microarray (5,335 samples) were pre-processed using three different algorithms: Affymetrix Microarray Suite 5 (MAS5), robust multi-chip average (RMA), and GC-robust multi-chip average (GCRMA) as implemented in the Bioconductor packages. RMA and GCRMA processed data on a multi-array basis. All of the arrays of the same platform were uniformly pre-processed to reduce variance. The cancer microarray consisted of 13 cancer types, namely, bladder, blood, bone marrow, brain, breast, cervix uterus, colon, kidney, liver, lung, lymph node, ovary, and prostate. The information of each cancer is shown in Table 7.

Table 7. The arrays of cancers

Table 8 presents the classification accuracy of PSODT for each run and the number of genes selected. The accuracies of PSODT and SVM were 97.26% and 72.46%, respectively. The test results on the 13 cancer microarrays for all benchmark algorithms are shown in Table 9. The results indicated that PSODT outperformed SVM and the other benchmark methods.

Table 8. Classification accuracies for each run using PSODT

Table 9. Classification accuracy for 13 sets of cancer microarray (%)

To perform a five-fold cross-validation, we selected five independent sets of cDNA clones (refer to supplementary Tables 1 to 5 of Appendix 3 at [34]). A total of 453 cDNA clones were selected at least once. Among the lists of cDNA clones, a number were selected multiple times. That certain genes were selected multiple times (with Frequency ≥ 4) indicates that their expression levels provide high discrimination power among tumors of different anatomical origin. Therefore, these genes are likely to be tissue-specific genes. Alternatively, such expression differences may result from organ- or tissue-specific malignant transformation.
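Counting how often each gene appears across the five selected sets is a simple tally. The sketch below shows one way to compute the selection frequency and keep the genes with Frequency ≥ 4; the function names are hypothetical and the gene identifiers in the usage example are placeholders, not results from the study.

```python
from collections import Counter


def selection_frequency(selected_sets):
    """Count in how many of the selected sets each gene appears."""
    counts = Counter()
    for genes in selected_sets:
        counts.update(set(genes))  # a gene counts at most once per set
    return counts


def frequently_selected(selected_sets, min_frequency=4):
    """Return the genes selected in at least min_frequency of the sets."""
    counts = selection_frequency(selected_sets)
    return sorted(gene for gene, count in counts.items() if count >= min_frequency)


if __name__ == "__main__":
    # Placeholder identifiers for five cross-validation runs.
    runs = [
        ["g1", "g2", "g3"],
        ["g1", "g2"],
        ["g1", "g3"],
        ["g1", "g2", "g4"],
        ["g1", "g2"],
    ]
    print(len(selection_frequency(runs)), "genes selected at least once")
    print(frequently_selected(runs))
```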

Conclusions

We proposed a novel method to identify tissue-specific genes as well as housekeeping genes with altered expression patterns that provide high discrimination power for cancer classification. These genes may play an important role in the diagnosis and/or pathogenesis of various types of tumors. Eleven cancer datasets were used to test the performance of the proposed method, and five-fold cross-validation was used to justify its performance. Our proposed approach achieved a higher accuracy than all the other methods.

The proposed method integrates the nonlinear search capability of PSO with the linearly separable advantage of DT and applies them to microarray cancer datasets for gene selection. We have identified representative cancer genes (453 genes) from numerous microarray data (65,000 genes), which can reduce costs. In addition, we compared our proposed method with seven well-known algorithms using a variety of datasets (diverse sizes and numbers of classes and features). Our proposed method outperformed all the other benchmark methods and is comparable to SVM for certain specific datasets.

Further studies are suggested as follows. First, PSO may yield better solutions if its parameter settings are optimized; therefore, self-adapting parameters for particle size, number of iterations, and constant weight factors are worth developing. Second, adding hybrid search algorithms to the PSO algorithm may improve its performance; for example, swarms with mixed particles may further enhance its effectiveness. Third, improving the execution time for large-sized datasets could be treated as a research subject in the future.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

KHC, KJW and AMA designed all the experiments and wrote most of this paper. KSC discussed and refined the paper. MLT, WCC, KPT and KMW interpreted the results. TSY and NCT guided the whole project. All authors read and approved the final manuscript.

Acknowledgements

The authors gratefully acknowledge the comments and suggestions of the editor and the anonymous referees. This work is partially supported by the National Science Council of the Republic of China.

References

  1. Alba E, et al.: Gene selection in cancer classification using PSO/SVM and GA/SVM hybrid algorithms. IEEE C Evol Computat 2007, 9:284-290.

  2. Li S, Wu X, Tan M: Gene selection using hybrid particle swarm optimization and genetic algorithm. Soft Comput 2008, 12:1039-1048.

  3. Ahmad A, Dey L: A feature selection technique for classificatory analysis. Pattern Recogn Lett 2005, 26:43-56.

  4. Su Y, Murali TM, et al.: RankGene: identification of diagnostic genes based on expression data. Bioinformatics 2003, 19:1578-1579.

  5. Kohavi R, John GH: Wrappers for feature subset selection. Artif Intell 1997, 97:273-324.

  6. Li X, Rao S, Wang Y, Gong B: Gene mining: a novel and powerful ensemble decision approach to hunting for disease genes using microarray expression profiling. Nucleic Acids Res 2004, 32:2685-2694.

  7. Zhao XM, Cheung YM, Huang DS: A novel approach to extracting features from motif content and protein composition for protein sequence classification. Neural Netw 2005, 18:1019-1028.

  8. Brown MP, et al.: Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc Natl Acad Sci U S A 2000, 97:262-267.

  9. Evers L, Messow CM: Sparse kernel methods for high-dimensional survival data. Bioinformatics 2008, 24:1632-1638.

  10. Hua S, Sun Z: A novel method of protein secondary structure prediction with high segment overlap measure: support vector machine approach. J Mol Biol 2001, 308:397-407.

  11. Oh JH, Gao J: A kernel-based approach for detecting outliers of high-dimensional biological data. BMC Bioinforma 2009, 10:S7.

  12. Saeys Y, et al.: Feature selection for splice site prediction: a new method using EDA-based feature ranking. BMC Bioinforma 2004, 5:64.

  13. Zhu Y, Shen X, Pan W: Network-based support vector machine for classification of microarray samples. BMC Bioinforma 2009, 10:S21.

  14. Li L, Darden TA, et al.: Gene assessment and sample classification for gene expression data using a genetic algorithm/k-nearest neighbor method. Comb Chem High T Scr 2001, 4:727-739.

  15. Li L, Jiang W, et al.: A robust hybrid between genetic algorithm and support vector machine for extracting an optimal feature gene subset. Genomics 2005, 85:16-23.

  16. Kennedy J, Eberhart R: Particle swarm optimization. IEEE Int Conf Neural Networks - Conf Proc 1995, 4:1942-1948.

  17. Robinson J, Rahmat-Samii Y: Particle swarm optimization in electromagnetics. IEEE Trans Antennas Propag 2004, 52:397-407.

  18. Chen LF, et al.: Particle swarm optimization for feature selection with application in obstructive sleep apnea diagnosis. Neural Comput Appl 2011, 21(8):2087-2096.

  19. Mohamad MS, et al.: Particle swarm optimization for gene selection in classifying cancer classes. Proceedings of the 14th International Symposium on Artificial Life and Robotics; 2009:762-765.

  20. Shen Q, Shi WM, Kong W: Hybrid particle swarm optimization and tabu search approach for selecting genes for tumor classification using gene expression data. Comput Biol Chem 2008, 32:52-59.

  21. Wu X, et al.: Top 10 algorithms in data mining. Knowl Inf Syst 2008, 14:1-37.

  22. Cheng WC, et al.: Microarray meta-analysis database (M2DB): a uniformly pre-processed, quality controlled, and manually curated human clinical microarray database. BMC Bioinforma 2010, 11:421.

  23. GEMS Dataset. 2012. http://www.gems-system.org/

  24. Dudoit S, Fridlyand J, Speed TP: Comparison of discrimination methods for the classification of tumors using gene expression data. J Am Stat Assoc 2002, 97:77-86.

  25. Jiang P, et al.: MiPred: classification of real and pseudo microRNA precursors using random forest prediction model with combined features. Nucleic Acids Res 2007, 35:W339-W344.

  26. Batuwita R, Palade V: MicroPred: effective classification of pre-miRNAs for human miRNA gene prediction. Bioinformatics 2009, 25:989-995.

  27. Wang Y, et al.: Predicting human microRNA precursors based on an optimized feature subset generated by GA-SVM. Genomics 2011, 98:73-78.

  28. Nanni L, Brahnam S, Lumini A: Combining multiple approaches for gene microarray classification. Bioinformatics 2012, 28:1151-1157.

  29. Park I, Lee KH, Lee D: Inference of combinatorial Boolean rules of synergistic gene sets from cancer microarray datasets. Bioinformatics 2010, 26:1506-1512.

  30. Tan PN, Steinbach M, Kumar V: Introduction to Data Mining. 1st edition. Addison Wesley, Boston, MA, USA; 2005.

  31. Brazma A, Vilo J: Gene expression data analysis. FEBS Lett 2000, 480:2-16.

  32. Golub TR, et al.: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999, 286:531-537.

  33. Zhao W, et al.: A novel framework for gene selection. Int J Adv Comput Technol 2011, 3:184-191.

  34. TOM laboratory. 2013. http://tom.im.ntust.edu.tw/

  35. Kennedy J, Eberhart RC, Shi Y: Swarm Intelligence. San Francisco, CA, USA: Morgan Kaufmann; 2001.

  36. Shi Y, Eberhart RC: A Modified Particle Swarm Optimizer. Anchorage Alaska: IEEE International Conference on Evolutionary Computation; 1998:69-73.

  37. Tan S: Neighbor-weighted k-nearest neighbor for unbalanced text corpus. Expert Syst Appl 2005, 28:667-671.

  38. Stone M: Cross-validatory choice and assessment of statistical predictions. J Royal Stat Soc 1974, 36:111-147.

  39. Geisser S: The predictive sample reuse method with applications. J Am Stat Assoc 1975, 70:320-328.

  40. Larson S: The shrinkage of the coefficient of multiple correlation. J Educat Psychol 1931, 22:45-55.

  41. Mosteller F, Tukey JW: Data analysis, including statistics. Handbook of Social Psychology. Reading, MA: Addison-Wesley; 1968.

  42. Mosteller F, Wallace DL: Inference in an authorship problem. J Am Stat Assoc 1963, 58:275-309.

  43. Cortes C, Vapnik V: Support-vector networks. Mach Learn 1995, 20:273-297.

  44. Kononenko I: A counter example to the stronger version of the binary tree hypothesis. ECML-95 workshop on Statistics, machine learning, and knowledge discovery in databases; 1995:31-36.