Abstract
Background
Modeling high-dimensional data involving thousands of variables is particularly important for gene expression profiling experiments; nevertheless, it remains a challenging task. One of the challenges is to implement an effective method for selecting a small set of relevant genes buried in high-dimensional irrelevant noise. RELIEF is a popular and widely used approach for feature selection owing to its low computational cost and high accuracy. However, RELIEF-based methods suffer from instability, especially in the presence of noisy and/or high-dimensional outliers.
Results
We propose an innovative feature weighting algorithm, called LHR, to select informative genes from highly noisy data. LHR is based on RELIEF for feature weighting using classical margin maximization. The key idea of LHR is to estimate the feature weights through local approximation rather than the global measurement typically used in existing methods. The weights obtained by our method are very robust against degradation by noisy features, even those with vast dimensions. To demonstrate the performance of our method, extensive classification experiments were carried out on both synthetic and real microarray benchmark datasets by combining the proposed technique with standard classifiers, including the support vector machine (SVM), k-nearest neighbor (KNN), hyperplane k-nearest neighbor (HKNN), linear discriminant analysis (LDA) and naive Bayes (NB).
Conclusion
Experiments on both synthetic and real-world datasets demonstrate the superior performance of the proposed feature selection method combined with supervised learning in three aspects: 1) high classification accuracy, 2) excellent robustness to noise and 3) good stability with respect to various classification algorithms.
Keywords:
Feature weighting; Local hyperplane; Classification; RELIEF; KNN
Background
Feature weighting is an important step in the preprocessing of data, especially in gene selection for cancer classification. The growing abundance of genome-wide sequence data made possible by high-throughput technologies has sparked widespread interest in linking sequence information to biological phenotypes. However, expression data usually consist of vast numbers of genes (≥10,000) but small sample sizes. Therefore, feature selection is necessary for solving such problems. Reducing the dimensionality of the feature space and selecting the most informative genes for effective classification with new or existing classifiers are commonly adopted techniques in empirical studies.
In general, feature weights are obtained by assigning a continuous relevance value to each feature via a learning algorithm, focusing on the context or domain knowledge. The feature weighting procedure is particularly useful for instance-based learning models, in which a distance metric is typically constructed using all features. Moreover, feature weighting can reduce the risk of overfitting by removing noisy features, thereby improving predictive accuracy. Existing feature selection methods broadly fall into two categories: wrapper and filter methods. Wrapper methods use the predictive accuracy of predetermined classification algorithms (called base classifiers), such as the support vector machine (SVM), as the criterion for determining the goodness of a subset of features [1,2]. Filter methods select features according to discriminant criteria based on the characteristics of the data, independent of any classification algorithm [3-5]. Commonly used discriminant criteria include entropy measurements [6], Fisher ratio measurements [7], mutual information measurements [8-10], and RELIEF-based measurements [11,12].
As a result of emerging needs in the biomedical and bioinformatics fields, researchers are particularly interested in algorithms that can process data containing features with large (or even huge) dimensions, for example, microarray data in cancer research. Therefore, filter methods are widely used owing to their efficient computation. Of the existing filter methods for feature weighting, the RELIEF algorithm [13] is considered to be one of the most successful owing to its simplicity and effectiveness. The main idea behind RELIEF is to iteratively update feature weights using a distance margin to estimate the difference between neighboring patterns. The algorithm has been further generalized, under the name RELIEFF, to average over multiple nearest neighbors, instead of just one, when computing sample margins [13]. Sun et al. showed that RELIEFF achieves a significant improvement in performance over the original RELIEF, and systematically proved that RELIEF is indeed an online algorithm for a convex optimization problem [11]. By maximizing the averaged margin of the nearest patterns in the feature-scaled space, RELIEF can estimate the feature weights in a straightforward and efficient manner. Based on this theoretical framework, an outlier removal scheme, I-RELIEF, can be applied, since margin averaging is sensitive to large variations [11].
To accomplish sparse feature weighting, the author incorporated an l_{1} penalty into the I-RELIEF optimization [12].
In this paper, we propose a new feature weighting scheme within the RELIEF framework. The main contribution of the proposed algorithm is that the feature weights are estimated from local patterns approximated by a locally linear hyperplane; we therefore call the proposed algorithm LH-RELIEF, or LHR for short. The proposed feature weighting scheme is shown to achieve good performance when combined with standard classification models, such as the support vector machine (SVM), naive Bayes (NB) [14], k-nearest neighbors (KNN), linear discriminant analysis (LDA) [15] and local hyperplane k-nearest neighbor (HKNN) [16]. The superior classification accuracy and excellent robustness to data heavily contaminated by noise make the proposed method promising for use in bioinformatics, where data are severely degraded by background artefacts owing to sampling bias or a high degree of redundancy, such as in the simultaneous parallel sequencing of large/huge numbers of genes.
The advantages of our method are as follows: (1) The gene selection process considers the discriminative power of multiple similar genes conditional on their linear combinations, allowing joint interactions between genes to be fully incorporated to reflect the importance of similar genes; (2) LHR assigns weights to genes and thus allows the selection of important genes that can accurately classify samples; (3) Using the genes selected by LHR, classic classifiers including NB, LDA, SVM, HKNN and KNN achieved accuracy comparable or even superior to that reported in the literature. This confirms that incorporating interactions among similar genes into feature weight estimation under local linear assumptions not only conveys information about the underlying biomolecular reaction mechanisms, but also provides high gene selection accuracy.
Results and discussion
To evaluate the performance of the proposed LHR, we conducted extensive experiments on different datasets. First, we performed experiments on synthetic data from the well-known Fermat’s spiral problem [17]. We then tested the method on nine medium-to-large benchmark microarray datasets, all of which were used to investigate the relationship between cancers and gene expression.
Evaluation methods
In this study, we tested the performance of the proposed LHR by combining it with standard classifiers, including NB, KNN, SVM, and HKNN [16]. We applied leave-one-out cross-validation (LOOCV) or 10-fold cross-validation (CV) to evaluate classification accuracy. LOOCV provides an unbiased estimate of the generalization error for stable classifiers such as KNN. Using LOOCV, each sample in the dataset was predicted by the model built from the remaining samples, and the accuracy of each prediction was included in the final measurement. Using the 10-fold CV scheme, the dataset was randomly divided into ten equal subsets; in each turn, nine subsets were used to construct the model while the remaining subset was used for prediction, and the average accuracy over the 10 iterations was recorded as the final measurement. For classifiers with tuning parameters (such as the SVM), the optimal parameters were first estimated with 5-fold CV on the training data and then used in the modeling. To simplify the comparison, some of the accuracy results were taken from the literature.
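As a concrete illustration of the evaluation protocol, the sketch below implements LOOCV as described above. This is not the authors' code; the 1-nearest-neighbor base learner is a stand-in so the example is self-contained.

```python
import numpy as np

def loocv_accuracy(X, y, fit_predict):
    """Leave-one-out CV: predict each sample with a model
    trained on all remaining samples."""
    n = len(y)
    correct = 0
    for i in range(n):
        mask = np.arange(n) != i              # hold out sample i
        pred = fit_predict(X[mask], y[mask], X[i:i + 1])
        correct += int(pred[0] == y[i])
    return correct / n

def one_nn(X_train, y_train, X_test):
    """Illustrative base learner: plain 1-nearest-neighbor."""
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    return y_train[d2.argmin(axis=1)]
```

Any classifier exposing the same `fit_predict(X_train, y_train, X_test)` signature can be plugged into the loop.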
Parameter settings
LHR takes two parameters: the number of nearest neighbors (k) and the regularization constant (λ). The choice of k depends on the sample size: for small samples, k should be small, such as 3 or 5, whereas for large samples, k should be set to a larger value, such as 10 or 20. Performance generally improves as k increases; however, beyond a certain threshold, larger values of k may not lead to any further improvement [18]. A rule of thumb is to set k to a small odd number such as 7. λ stabilizes the matrix inversion against singularity and is generally a tiny constant; in our experiments, we set λ=10^{-3}.
Synthetic experiments on Fermat’s spiral problem
In the first experiment, we tested the performance of the proposed method on the well-known Fermat’s spiral problem. The test dataset consists of two classes with 200 samples each, and the labels of the spiral are completely determined by its first two features. The shape of the Fermat’s spiral distribution is shown in Figure 1(a). Heuristically, the label of a sample can easily be inferred from its local neighbors; classification based on local information thus gives a more accurate result than prediction based on global measurement, since the latter is sensitive to noise degradation. To test the stability and robustness of LHR, irrelevant features following the standard normal distribution were added to the spiral for classification testing. The dimensions of the irrelevant features were set to {0,1000,2000,3000,4000,5000,6000,7000,8000,9000,10000}. To compare the ability to recover informative features, the I-RELIEF and LOGO algorithms were also used because of their intrinsic closeness to LHR. The three feature weighting schemes were first applied to rank the importance of the features, and only the top five ranked features were retained to test the robustness of the feature selection schemes under noise contamination. Performance comparisons were conducted on the truncated dataset using five classic classifiers: SVM, LDA, NB, KNN, and HKNN. For each experiment, both 10-fold CV and LOOCV were used to evaluate classification accuracy. To eliminate statistical variations, we repeated the experiments ten times on each dataset and recorded the average classification errors. The detailed numerical results are given in Tables 1 and 2 for 10-fold CV and LOOCV, respectively. To visualize the results, box plots of the distributions for the 10-fold CV and LOOCV experiments are given in Figure 1(b) and (c), respectively. Each plot represents the classification accuracy for a single dataset.
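The exact spiral generator is not given in the paper; a construction consistent with the description above (two 200-sample arms determined by the first two features, padded with standard-normal irrelevant dimensions) might look like the following. The parametrization and mirrored second arm are assumptions.

```python
import numpy as np

def make_noisy_spiral(n_per_class=200, noise_dims=1000, seed=0):
    """Two-class Fermat spiral: labels depend only on the first two
    features; the remaining noise_dims features are irrelevant
    standard-normal noise, as in the robustness experiment."""
    rng = np.random.default_rng(seed)
    theta = np.linspace(0.1, 4 * np.pi, n_per_class)
    r = np.sqrt(theta)                 # Fermat spiral: r^2 = theta
    arm = np.c_[r * np.cos(theta), r * np.sin(theta)]
    X2 = np.vstack([arm, -arm])        # second class: mirrored arm
    y = np.r_[np.zeros(n_per_class), np.ones(n_per_class)]
    noise = rng.standard_normal((2 * n_per_class, noise_dims))
    return np.hstack([X2, noise]), y
```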
Figure 1(b) shows the 10-fold CV accuracy for each of the five classifiers against the dimension of the noisy features, and Figure 1(c) shows the corresponding LOOCV accuracy. Dark colors denote the accuracy achieved using I-RELIEF and LOGO, while a light color is used for LHR. In most cases, the performance of LHR coupled with the various classifiers is superior to that of both I-RELIEF and LOGO, and thus the corresponding box plot lies above those for I-RELIEF and LOGO.
Figure 1. Experiments on the Fermat’s spiral problem. (a) The spiral consists of two classes, each with 200 samples, labeled by different colors. (b, c) Box plots of the results of LHR, I-RELIEF and LOGO combined with five classifiers on (a), degraded by noise features whose dimension ranges from 0 to 10,000; 10-fold CV (b) and LOOCV (c) are used to evaluate the feature selection methods. The result for each classifier is marked with a red circle, and the averaged values are connected to highlight the differences in performance.
Table 1. Ten-fold CV experiments on the robustness of feature weighting on the spiral with irrelevant noisy features
Table 2. LOOCV experiments on the robustness of feature weighting on the spiral with irrelevant noisy features
The line graph of the average performance confirms that the proposed method is more robust to noise than I-RELIEF and LOGO. In both CV experiments, we observed that the performance of the three methods was very similar in cases where the dimension of the irrelevant features was small. For example, with irrelevant features of zero dimension, i.e., no noisy features, the classification results of the five classifiers were very similar: the average accuracy was 75.2% (LHR) versus 75.4% under 10-fold CV, and 72.3% (LHR) versus 72.0% under LOOCV. However, as the dimension of the irrelevant features increases, the performance of both I-RELIEF and LOGO is severely degraded by the noisy features. In comparison, the performance of LHR is very stable and superior to that of the other combinations, and in both experiments its overall accuracy is better than that of I-RELIEF and LOGO. We also observed that the accuracies of LOGO, when combined with the five classifiers, showed small variance. This nice property implies that LOGO can derive features that are less dependent on the classification model, and thus less redundant, than those of LHR and I-RELIEF.
Empirical large/huge microarray datasets
In the second experiment, we tested the performance of the proposed algorithm on nine binary microarray datasets. The benchmark datasets, which have been widely used to test a variety of algorithms, are all related to human cancers, including central nervous system, colorectal, diffuse large B-cell lymphoma, leukemia, lung, and prostate tumors. Characteristics of the datasets are summarized in Table 3.
We note that most of the test datasets have small sample sizes (less than 100). This poses a difficulty in evaluating the performance of classifiers using standard k-fold CV schemes, so the LOOCV method was used instead to estimate the accuracy of the classifiers: each sample in the dataset was predicted by a classifier constructed using the remaining samples. To assess the generality of the selected informative genes, classic classifiers including LDA, KNN, NB, HKNN and SVM were tested on the selected genes. The experimental results are summarized in Table 4. Note that some of the results were taken directly from the literature.
Table 4. Classification accuracies (%) on 9 real data sets
For each individual dataset, LHR outperformed or achieved performance comparable to the best result reported in the literature. For the CNS data, LHR-SVM, LHR-LDA and LHR-HKNN achieved superior performance with almost 100% accuracy, much higher than the second-best performance, by k-TSP [19]. For the colon data, although the accuracy of the LHR-based classifiers is worse than that of BMSF-SVM, IVGA-SVM and LOGO, the accuracies of all five classifiers are similar, implying that the selected genes are very robust to the choice of classifier. Similar results are observed on the DLBCL, prostate2 and prostate3 datasets. For the GCM, leukemia, lung and prostate1 datasets, the LHR-based classifiers ranked either first or second, and the selected genes showed similar performance across the five classifiers on the leukemia, lung and prostate1 datasets. For the prostate2 data, BMSF-SVM achieved remarkably good accuracy, although the results using the other three classifiers with BMSF feature selection are less impressive; LOGO also performed well, yet its average is suboptimal to that of LHR. In comparison, the performance using LHR feature selection is fairly stable. For the prostate3 data, LOGO-based classifiers performed very well, while the LHR-based ones were slightly less accurate than the top performers. In terms of the ability to select informative genes, the proposed algorithm achieved performance comparable to LOGO, reaching a classification accuracy of 97.39%, slightly less than LOGO’s 97.61%.
When considering the average accuracy of each algorithm across all cancer datasets, the top-ranked methods are LOGO-HKNN, BMSF-SVM, LHR-KNN/LOGO-KNN, LHR-SVM and LHR-HKNN. The proposed scheme has a slightly lower average accuracy than BMSF-SVM and LOGO-HKNN, but a higher accuracy than the others. The mean ± standard deviation of the averaged accuracy is 96.65±0.725 for LHR, 97.61±1.5 for LOGO and 94.88±2.191 for BMSF. This shows that the proposed LHR is highly competitive with LOGO and BMSF in overall accuracy, while its markedly smaller standard deviation confirms its excellent stability with respect to the choice of classification method.
Comparison with standard feature selection methods
For comparison with other feature selection models, eleven standard techniques were tested alongside the proposed LHR. The selected techniques include the t-statistic (t-stat), twoing rule (TR), information gain (IG), Gini index (Gini), max minority (MaxM), sum minority (SumM), sum of variances (SumV), one-dimensional support vector machine (OSVM), minimum redundancy maximum relevance (mRMR) [27] and I-RELIEF [28]. The code for the first eight schemes is available through RankGene at http://genomics10.bu.edu/yangsu/rankgene. The code for mRMR is available at http://penglab.janelia.org/proj/mRMR/, where two implementations of mRMR, namely MID and MIQ, are provided. The I-RELIEF package is available at http://plaza.ufl.edu/sunyijun/ [28].
It has been suggested in [25,27] that accurate discretization can improve the performance of mRMR. The author also reported consistent results when the expression values are transformed into 2 or 3 states using μ±kσ with k ranging from 0.5 to 2, where μ and σ are the gene-specific mean and standard deviation, respectively (http://penglab.janelia.org/proj/mRMR/FAQ_mrmr.htm). In our experiments, we followed the transformation rule suggested in [25] to simplify the comparison: expression values greater than μ+σ were set to 1; values between μ−σ and μ+σ were set to 0; and values less than μ−σ were set to −1.
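The three-state transformation rule above can be written compactly; the function below is an illustrative implementation of the gene-wise μ±σ thresholding, not the mRMR package's own preprocessing code.

```python
import numpy as np

def discretize(X):
    """Three-state discretization per gene (column): values above
    mu + sigma -> 1, below mu - sigma -> -1, otherwise 0, where
    mu and sigma are the gene-wise mean and standard deviation."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    out = np.zeros_like(X, dtype=int)
    out[X > mu + sigma] = 1
    out[X < mu - sigma] = -1
    return out
```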
In each experiment, a feature selection scheme was first used to select the informative genes, followed by classification tests on the truncated dataset. For a fair comparison, we set the number of informative genes for each competing feature selection scheme to be the same as that determined by LHR, which usually finds a relatively small number of genes (fewer than 30). This allowed us to examine whether the limited number of informative genes generated by LHR had more discriminative power than those generated by the other methods.
The LOOCV accuracy for each of the five classification algorithms (LDA, NB, SVM, KNN, and HKNN) is reported in Table 5. The number of genes selected by LHR is listed in the second column, and the same number is used to create the truncated data for the other feature selection schemes. In most cases, the variables selected by LHR achieved optimal or near-optimal LOOCV accuracy when coupled with the five classifiers. To investigate the extent of the information conveyed by the selected genes, we created box plots of the LOOCV accuracy of the five classification algorithms (LDA, SVM, KNN, NB, and HKNN) on each of the tested datasets in Figure 2. A remarkable characteristic of the proposed LHR is its low dependence on the classifier: as shown in Figure 2, the corresponding box plots have a narrower bandwidth than those of the other methods. This property implies that the genes selected by LHR are highly informative, and thus the discriminative performance is robust to the choice of classifier.
Table 5. Performance comparison of the LHR with 12 standard feature selection schemes (FSSs)
Figure 2. Performance comparison of LHR with 12 standard feature selection schemes (FSSs). Nine benchmark microarray datasets (a-i), whose names are given below the middle of each x-axis, were used to test the performance of the FSSs. For each tested dataset, the LOOCV results of the five classifiers coupled with the 13 FSSs are box plotted. The proposed LHR outperformed or achieved performance comparable to the other methods. Moreover, the LHR results show small variation in LOOCV error when tested with different classifiers, implying a high degree of robustness.
Computation complexity
The LHR algorithm involves solving a quadratic minimization problem (Eq. (4)) for each sample, and therefore has a much higher computational cost than linear methods such as I-RELIEF and LOGO. Because the matrix H^{T}WH in Eq. (4) is positive definite and small, the minimization problem of Eq. (4) can be solved in polynomial time (O(n^{3}) for the n NNs of a sample). Thus, the complexity of each iteration is approximately O(n^{3}·N), higher than that of I-RELIEF.
Conclusions
In this paper, we proposed a new feature weighting scheme to overcome the common drawbacks of the RELIEF family. The nearest miss and hit subsets are approximated by constructing a local hyperplane, and feature weight updating is then achieved by measuring the margin between the sample and its hyperplane in a general RELIEF framework. The main contribution of the new variation is that the margin is more robust to noise and outliers than those of earlier works; therefore, the feature weights can characterize the local structure more accurately. Experimental results on both synthetic and real-world microarray datasets validated our findings when combining the proposed method with five classic classifiers. The proposed weighting scheme is superior in terms of classification error on most test datasets. Extensive experiments demonstrated that the proposed scheme has three remarkable characteristics: 1) high classification accuracy, 2) excellent robustness to noise and 3) good stability with respect to various classification algorithms.
Methods
RELIEF
The RELIEF algorithm has been successfully applied to feature weighting owing to its simplicity and effectiveness [12,13]. The main idea of RELIEF is the iterative adjustment of feature weights according to their ability to discriminate among neighboring patterns. Mathematically, suppose that X={x_{1},x_{2},⋯,x_{N}}_{d×N} is a randomly selected sample matrix of binary-class data, where each sample x has d dimensions, x={x_{1},x_{2},⋯,x_{d}}. For each sample, one finds its two nearest neighbors, one from the same class (called the nearest hit, NH) and the other from the different class (called the nearest miss, NM). Then, the weight w_{f} of the fth (f=1,2,⋯,d) feature is updated by the heuristic estimation:

w_{f} ← w_{f} + |x_{f} − NM_{f}| − |x_{f} − NH_{f}|,   (1)
where NM_{f} and NH_{f} denote the fth coordinate of the vectors NM and NH, respectively. Since no exhaustive or iterative search is needed for RELIEF updates, this scheme is very efficient in processing data with huge dimensions. Thus, it is particularly promising for large-scale problems such as the analysis of microarray data [3,12,27]. To overcome the drawbacks of RELIEF, such as sensitivity to outliers and inaccurate updates, the author generalized the update scheme to compute the maximum expected margin E[ρ(w)] by scaling the features [11,12]:

E[ρ(w)] = Σ_{n=1}^{N} ( Σ_{x_i∈NM(x_{n})} P(x_{i}=NM(x_{n})|w)‖x_{n}−x_{i}‖_{w} − Σ_{x_i∈NH(x_{n})} P(x_{i}=NH(x_{n})|w)‖x_{n}−x_{i}‖_{w} ),   (2)

where NM(x_{i})={x_{n}:1≤n≤N, y_{i}≠y_{n}} and NH(x_{i})={x_{n}:1≤n≤N, y_{i}=y_{n}} are the index sets of the nearest misses and nearest hits of sample x_{i}, and N is the sample size. P(x_{n}=NM(x_{i})|w) (or P(x_{n}=NH(x_{i})|w)) is the probability of a sample x_{n} belonging to NM(x_{i}) (or NH(x_{i})) in the feature space scaled by the weights w. Though these probability distributions are initially unknown, they can be estimated through kernel density estimation [29]. The authors called this method I-RELIEF and showed that it achieves a significant performance improvement over the traditional models: classification on the feature-scaled dataset achieved higher accuracy than standard techniques such as the SVM [1,2,30] and the NN model [31], and the feature weighting is also robust to noisy features. To obtain a sparse and economical feature weighting, Sun incorporated the l_{1} penalty into the I-RELIEF optimization and named the algorithm LOGO (fit locally and think globally) [12]. Extensive experiments have demonstrated that LOGO can accurately grasp the intrinsic structure of the data and matches nicely with classic classification models.
However, the expectation in Eq. (2) is obtained by averaging over the nearest neighbors. Feature weight estimation may therefore be less accurate if the samples contain many outliers or if most of the features are irrelevant: in both cases, the distance between a sample and its nearest neighbor is large, so the averaging operation introduces a large bias into the margin estimation. Although the influence of abnormal samples can be reduced by introducing kernel distribution estimation [11,12], this in turn introduces additional free parameters. Moreover, probability estimation via kernel approximation is sensitive to the sample size [28]. This limits empirical applications such as the analysis of microarray data, which are notorious for having far fewer sample observations than sample features [32]. In this paper, we propose using a local hyperplane to approximate the sets of nearest hits and misses, and then estimating the feature weights by maximizing the expected margin defined by the hyperplane. The advantage of this approximation is that the hyperplane is more robust to noisy feature degradation than averaging over all the neighbors [11-13].
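The classic RELIEF update of Eq. (1) can be sketched in a few lines. This is an illustrative implementation (L1 distances, one pass over all samples), not the authors' code:

```python
import numpy as np

def relief_weights(X, y):
    """Classic RELIEF for binary classes: for each sample, reward
    features that differ at the nearest miss and penalize those
    that differ at the nearest hit (Eq. (1))."""
    n, d = X.shape
    w = np.zeros(d)
    for i in range(n):
        dist = np.abs(X - X[i]).sum(axis=1)   # L1 distances to x_i
        dist[i] = np.inf                      # exclude the sample itself
        same = (y == y[i])
        nh = np.where(same, dist, np.inf).argmin()   # nearest hit
        nm = np.where(~same, dist, np.inf).argmin()  # nearest miss
        w += np.abs(X[i] - X[nm]) - np.abs(X[i] - X[nh])
    return w
```

On data where a feature separates the classes, its weight grows positive, while weights of irrelevant features drift toward zero or below.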
Local hyperplane conditional on feature weight
Processing high-dimensional data by mapping the data of interest into an embedded nonlinear manifold within the higher-dimensional space has attracted wide interest in machine learning. The local hyperplane approximation shares similar merits with local linear embedding methods [12,26,33]. It assumes that the samples’ structure is locally linear, so that each sample lies on a local linear hyperplane spanned by its nearest neighbors. Mathematically, let us assume that the feature weights are known in advance. Then sample x can be represented by a local hyperplane of class c, conditional on the feature weights w, as:

LH_{c}(x) = { WHα : α ∈ ℝ^{n} },   (3)
where H is an I×n matrix comprising the n NNs of sample x: H={h_{1},h_{2},⋯,h_{n}}, with h_{i} being the ith nearest neighbor (called a prototype) of class c, and W is a diagonal matrix whose diagonal element w_{i} is the weight of the ith feature. The parameters α=(α_{1},…,α_{n})^{T} are the weights of the prototypes {h_{i}, i=1,2,…,n} and can be viewed as the spanning coefficients of the subspace LH_{c}(x), so that a point on the hyperplane can be written as WHα=α_{1}Wh_{1}+α_{2}Wh_{2}+…+α_{n}Wh_{n}. The projection of x onto the hyperplane is computed by minimizing the distance between sample x and the hyperplane, both of which depend on the feature weights. Therefore, the value of α can be estimated as:

α = argmin_{α} (x − Hα)^{T}W(x − Hα) + λ‖α‖_{2}^{2},   (4)

whose closed-form solution is α = (H^{T}WH + λI)^{-1}H^{T}Wx.
The regularization parameter λ is used to emphasize the “smoothing” effect of the optimal solution, which degenerates to a unit vector in certain extreme cases.
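The ridge-regularized projection described above has a closed form; the sketch below assumes the n nearest neighbors are stored as the columns of H and uses the weighted quadratic objective of Eq. (4). It is an illustration of the technique, not the authors' implementation.

```python
import numpy as np

def hyperplane_coefficients(x, H, w, lam=1e-3):
    """Spanning coefficients alpha of the local hyperplane (Eq. (4)):
    minimize the w-weighted distance between x and H @ alpha, with a
    ridge term lam*||alpha||^2 keeping H^T W H safely invertible."""
    A = H.T @ (w[:, None] * H) + lam * np.eye(H.shape[1])
    b = H.T @ (w * x)
    return np.linalg.solve(A, b)

def hyperplane_distance(x, H, w, lam=1e-3):
    """Weighted distance from x to its projection on the hyperplane."""
    alpha = hyperplane_coefficients(x, H, w, lam)
    r = x - H @ alpha
    return np.sqrt(r @ (w * r))
```

A point lying in the span of the neighbors has (near-)zero distance; a point orthogonal to the span keeps its full weighted norm.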
We propose using a hyperplane to represent the nearest miss set NM(x) and nearest hit set NH(x) of a given sample x. The advantage of this representation is a robust characterization of the local sample patterns: the distances between the sample and its NH (or NM) set can be estimated from the local hyperplane rather than by averaging across all samples within the set. We therefore redefine the margin of a sample x as ρ(x)=d(x,LH_{NM}(x))−d(x,LH_{NH}(x)), where d(·,·) denotes the distance in the weighted feature space. The feature weights are then estimated by maximizing the total margin:

max_{w} Σ_{n=1}^{N} w^{T}z_{n},  subject to ‖w‖_{2}=1, w≥0,   (5)
where the vector z_{n} is defined element-wise as z_{n}=|x_{n}−H_{NM}(x_{n})α_{n}|−|x_{n}−H_{NH}(x_{n})β_{n}|, with H_{NM}(x_{n}) and H_{NH}(x_{n}) the nearest neighbors from the nearest miss and nearest hit sets of sample x_{n}, and α_{n} and β_{n} the coefficients spanning the hyperplanes LH_{NM}(x_{n}) and LH_{NH}(x_{n}). w is a vector whose ith element w(i) is the weight of the ith feature, for i=1,2,…,I. To solve the minimization problem of Eq. (5), the parameters α_{n} and β_{n}, which depend on the nearest neighbors, must be estimated. The main difficulty is that the nearest neighbors of a given sample are unknown before learning: in the presence of many thousands of irrelevant features, the nearest neighbors defined in the original feature space can be completely different from those in the induced, weighted feature space. To address these difficulties, we use an iterative algorithm, similar to the Expectation-Maximization algorithm and I-RELIEF [11], to estimate the feature weights. The detailed numerical solution is provided in Additional file 1: S.1. The pseudocode for LH-RELIEF is summarized in Additional file 2: S.2.
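Putting the pieces together, the alternating scheme can be outlined as follows. This is a hypothetical sketch only: the weighted-L1 neighbor search and the weight refresh rule (non-negative part followed by l2 normalization, borrowed from I-RELIEF) are assumptions, and the authors' exact numerical solution is given in Additional file 1.

```python
import numpy as np

def _project(x, H, w, lam):
    """Projection of x onto the span of the neighbor columns of H
    under the weighted metric (cf. Eq. (4))."""
    A = H.T @ (w[:, None] * H) + lam * np.eye(H.shape[1])
    alpha = np.linalg.solve(A, H.T @ (w * x))
    return H @ alpha

def lhr_sketch(X, y, k=3, lam=1e-3, n_iter=10):
    """Illustrative LHR-style iteration: alternate between (1) finding
    each sample's k nearest hits/misses in the currently weighted space
    and projecting onto the two local hyperplanes, and (2) refreshing
    the weights from the accumulated margin vector z (Eq. (5)).
    k must not exceed the class size minus one."""
    n, d = X.shape
    w = np.ones(d) / np.sqrt(d)
    for _ in range(n_iter):
        z = np.zeros(d)
        for i in range(n):
            dist = (np.abs(X - X[i]) * w).sum(axis=1)
            dist[i] = np.inf                       # exclude self
            hits = np.argsort(np.where(y == y[i], dist, np.inf))[:k]
            miss = np.argsort(np.where(y != y[i], dist, np.inf))[:k]
            p_nm = _project(X[i], X[miss].T, w, lam)
            p_nh = _project(X[i], X[hits].T, w, lam)
            z += np.abs(X[i] - p_nm) - np.abs(X[i] - p_nh)
        w = np.maximum(z, 0.0)                     # non-negative part
        w /= max(np.linalg.norm(w), 1e-12)         # unit l2 norm
    return w
```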
Additional file 1. S.1. Numerical solution for LHR.
Additional file 2. S.2. Pseudocode for LHR.
Availability of supporting data
The Matlab code used to test on the Fermat’s spiral and the cancer microarray datasets is available at http://sunflower.kuicr.kyoto-u.ac.jp/~ruan/LHR/.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
HM designed the LHR algorithm, participated in the numerical experiments and drafted the manuscript. PY participated in the numerical experiments. MN participated in the design of the study and TA participated in the study design and helped to draft the manuscript. All authors read and approved the final manuscript.
Acknowledgements
The authors would like to thank Dr. Y. Sun for the source code of I-RELIEF and LOGO, and Dr. H.Y. Zhang for her valuable comments on the BMSF method. This work was partially supported by the ICR-KU International Short-term Exchange Program for Young Researchers in design and analysis of computational experiments. HM was supported by the BGI-SCUT Innovation Fund Project (SW20130803), the National Nature Science Foundation of China (61372141) and the Fundamental Research Fund for the Central Universities (2013ZM0079).
References

Duan KBB, Rajapakse JC, Wang H, Azuaje F: Multiple SVM-RFE for gene selection in cancer classification with expression data.
IEEE Trans Nanobiosci 2005, 4(3):228-234.

Guyon I, Weston J, Barnhill S, Vapnik V: Gene selection for cancer classification using support vector machines.
Mach Learn 2002, 46:389-422.

Ding C, Peng H: Minimum redundancy feature selection from microarray gene expression data.
J Bioinform Comput Biol 2005, 3(2):185-205.

Huang CJ, Yang DX, Chuang YT: Application of wrapper approach and composite classifier to the stock trend prediction.
Expert Syst Appl 2008, 34(4):2870-2878.

Koller D, Sahami M: Toward optimal feature selection. In Proceedings of the Thirteenth International Conference on Machine Learning. Edited by Saitta L. Morgan Kaufmann Press; 1996:284-292.

Jain AK, Duin RPW, Mao J: Statistical pattern recognition: a review.
IEEE Trans Pattern Anal Mach Intell 2000, 22(1):4-37.

Kwak N, Choi CH: Input feature selection by mutual information based on Parzen window.
IEEE Trans Pattern Anal Mach Intell 2002, 24:1667-1671.

Brown G: Some thoughts at the interface of ensemble methods and feature selection. In Multiple Classifier Systems. Edited by Neamat EG, Josef K, Fabio R. Springer Press; 2010:314-314.

Brown G: An information theoretic perspective on multiple classifier systems.
In Multiple Classifier Systems. Edited by Jón B, Josef K, Fabio R. Springer Press; 2009:344-353.

Sun Y: Iterative RELIEF for feature weighting: algorithms, theories, and applications.
IEEE Trans Pattern Anal Mach Intell 2007, 29(6):1035-1051.

Sun Y, Todorovic S, Goodison S: Local-learning-based feature selection for high-dimensional data analysis.
IEEE Trans Pattern Anal Mach Intell 2010, 32(9):1610-1626.

Kononenko I: Estimating attributes: analysis and extensions of RELIEF. In European Conference on Machine Learning. Edited by Francesco B, Luc DR. Berlin Heidelberg: Springer Press; 1994:171-182.

Li T, Zhang C, Ogihara M: A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression.
Bioinformatics 2004, 20(15):2429-2437.

Wu MC, Zhang L, Wang Z, Christiani DC, Lin X: Sparse linear discriminant analysis for simultaneous testing for the significance of a gene set/pathway and gene selection.
Bioinformatics 2009, 25(9):1145-1151.

Vincent P, Bengio Y: K-local hyperplane and convex distance nearest neighbor algorithms. In Advances in Neural Information Processing Systems. Edited by Thomas G, Sue B, Zoubin G. MIT Press; 2001:985-992.

Sun Y, Wu D: A RELIEF based feature extraction algorithm. In SDM. Edited by Apte C, Park H, Wang K, Zaki JM. SIAM Press; 2008:188-195.

Hall P, Park BU, Samworth RJ: Choice of neighbor order in nearest-neighbor classification.
Ann Stat 2008, 36(5):2135-2152.

Tan AC, Naiman DQ, Xu L, Winslow RL, Geman D: Simple decision rules for classifying human cancers from gene expression profiles.
Bioinformatics 2005, 21(20):3896-3904.

Geman D, Christian A, Naiman DQ, Winslow RL: Classifying gene expression profiles from pairwise mRNA comparisons.
Stat Appl Genet Mol Biol 2004, 3(1):Article 19.

Chopra P, Lee J, Kang J, Lee S: Improving cancer classification accuracy using gene pairs.
PLoS One 2010, 5(12):e14305.

Dagliyan O, Uney YF, Kavakli IH, Turkay M: Optimization based tumor classification from microarray gene expression data.
PLoS One 2011, 6(2):e14579.

Zheng CH, Chong YW, Wang HQ: Gene selection using independent variable group analysis for tumor classification.
Neural Comput Appl 2011, 20(2):161-170.

Zhang JG, Deng HW: Gene selection for classification of microarray data based on the Bayes error.
BMC Bioinformatics 2007, 8(1):370-378.

Zhang H, Wang H, Dai Z, Chen MS, Yuan Z: Improving accuracy for cancer classification with a new algorithm for genes selection.
BMC Bioinformatics 2012, 13(1):120.

Roweis ST, Saul LK: Nonlinear dimensionality reduction by locally linear embedding.
Science 2000, 290(5500):2323-2326.

Peng YH: A novel ensemble machine learning for robust microarray data classification.
Comput Biol Med 2006, 36:553-573.

Girolami M, He C: Probability density estimation from optimally condensed data samples.
IEEE Trans Pattern Anal Mach Intell 2003, 25:1253-1264.

Atkeson CG, Moore AW, Schaal S: Locally weighted learning.
Artif Intell Rev 1997, 11:11-73.

Statnikov A, Wang L, Aliferis CF: A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification.
BMC Bioinformatics 2008, 9:319-328.

Shakhnarovich G, Darrell T, Indyk P: Nearest-neighbor methods in learning and vision. MIT Press; 2006.

Fraley C, Raftery AE: Model-based clustering, discriminant analysis, and density estimation.
J Am Stat Assoc 2002, 97(458):611-631.

Pan Y, Ge SS, Al Mamun A: Weighted locally linear embedding for dimension reduction.
Pattern Recognit 2009, 42(5):798-811.