Selecting subsets of newly extracted features from PCA and PLS in microarray data analysis

Li, Guo-Zheng; Bu, Hua-Long; Yang, Mary Qu; Zeng, Xue-Qiang; Yang, Jack Y

doi:10.1186/1471-2164-9-S2-S24

Volume 9 Supplement 2

IEEE 7th International Conference on Bioinformatics and Bioengineering at Harvard Medical School

Research
Open access
Published: 16 September 2008


BMC Genomics volume 9, Article number: S24 (2008)


Abstract

Background

Dimension reduction is a critical issue in the analysis of microarray data, because the high dimensionality of gene expression microarray data sets hurts the generalization performance of classifiers. Dimension reduction methods fall into two types: feature selection and feature extraction. Principal component analysis (PCA) and partial least squares (PLS) are two frequently used feature extraction methods. In previous work, the top several components of PCA or PLS were selected for modeling according to the descending order of eigenvalues. In this paper, we show that not all of the top features are useful, and that features should instead be selected from all the components by feature selection methods.

Results

We demonstrate a framework for selecting feature subsets from all the newly extracted components, leading to reduced classification error rates on gene expression microarray data. We consider an unsupervised method (PCA) and a supervised method (PLS) for extracting new components, genetic algorithms for feature selection, and support vector machines and k nearest neighbor for classification. Experimental results show that the proposed framework is effective at selecting feature subsets and reducing classification error rates.

Conclusion

The top features newly extracted by PCA or PLS are not the only important ones. Feature selection should therefore be performed to select subsets of the new features in order to improve the generalization performance of classifiers.

Background

Tumor classification is performed on microarray data collected by DNA microarray experiments from tissue and cell samples [1–3]. The wealth of this kind of data at different stages of the cell cycle helps to explore gene interactions and to discover gene functions. Moreover, obtaining genome-wide expression data from tumor tissues gives insight into the gene expression variation of various tumor types, thus providing clues for the tumor classification of individual samples. The output of a microarray experiment is summarized as an n × p data matrix, where n is the number of tissue or cell samples and p is the number of genes. Here p is typically much larger than n, which hurts the generalization performance of most classification methods. To overcome this problem, dimension reduction methods are applied to reduce the dimensionality from p to q with q ≪ p.

Dimension reduction usually consists of two types of methods, feature selection and feature extraction [4]. Feature selection chooses a subset of the original features according to classification performance; the optimal subset should contain relevant but non-redundant features. Feature selection can help to improve the generalization performance and speed of classifiers. There has been a great deal of work in machine learning and related areas addressing this issue [5–9]. In most practical cases, however, the relevant features are not known beforehand, and finding out which features to use is hard. Moreover, feature selection loses the information carried by relations among features, whereas feature extraction is good at handling interactions among features.

Feature extraction projects the whole data set into a low-dimensional space and constructs the new dimensions (components) by analyzing the statistical relationships hidden in the data. Principal component analysis (PCA) is one of the most frequently used methods for feature extraction from microarray data. It is unsupervised, since it does not need the label information of the data set. Partial least squares (PLS) is one of the most widely used supervised feature extraction methods for the analysis of gene expression microarray data [10, 11]; it represents the data in a low-dimensional space through a linear transformation. Although feature extraction methods produce independent features, a large number of features is usually extracted to represent the original data, and the extracted features also contain noise or irrelevant information. Choosing an appropriate set of features is therefore critical. Some researchers consider that the first several components of PLS contain more information than the others, but it is hard to decide how many tail components are trivial for discrimination. Some authors proposed fixing the number of components at three to five [12]; others proposed determining the size of the space by cross-validated classification performance [13]. Each approach has its own weakness: fixing an arbitrary dimensional size is not applicable to all data sets, and the cross-validation method is hampered by its high computational cost. An efficient and effective model selection method for PLS is needed. Furthermore, we argue that not all of the initial components are important for classification, so subsets should be selected for classification.
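To make the two extraction steps concrete, the following is a minimal sketch, assuming NumPy is available. It computes all PCA components through a singular value decomposition and extracts PLS components with a NIPALS-style PLS1 loop for a single numeric class label. The function names `pca_scores` and `pls1_scores` and the toy data are illustrative assumptions, not the implementation used in the paper.

```python
import numpy as np


def pca_scores(X):
    """Return all principal component scores and their eigenvalues."""
    Xc = X - X.mean(axis=0)                       # center each gene (column)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    eigenvalues = s ** 2 / (X.shape[0] - 1)       # variances, in descending order
    return Xc @ Vt.T, eigenvalues


def pls1_scores(X, y, n_components):
    """Return PLS score vectors for a single numeric response y (NIPALS-style PLS1)."""
    Xk = X - X.mean(axis=0)
    yk = y - y.mean()
    scores = []
    for _ in range(n_components):
        w = Xk.T @ yk                             # direction of maximal covariance with y
        norm = np.linalg.norm(w)
        if norm < 1e-12:                          # nothing left to explain
            break
        w = w / norm
        t = Xk @ w                                # new component (score vector)
        tt = t @ t
        Xk = Xk - np.outer(t, Xk.T @ t / tt)      # deflate X
        yk = yk - t * (t @ yk) / tt               # deflate y
        scores.append(t)
    if not scores:
        return np.empty((X.shape[0], 0))
    return np.column_stack(scores)


# Toy data (assumed): n = 10 samples, p = 50 genes, binary labels coded 0/1.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 50))
y = np.array([0.0, 1.0] * 5)
Z_pca, eig = pca_scores(X)                        # unsupervised components
Z_pls = pls1_scores(X, y, n_components=3)         # supervised components
```

PCA never looks at `y`, while each PLS direction is driven by the covariance between the genes and the label, which is the unsupervised versus supervised distinction drawn above.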

Here, we propose feature selection after feature extraction for tumor classification problems and demonstrate its importance. We perform a systematic study of PCA [14] and PLS [15] as feature extraction methods, each combined with a feature selection method (a genetic algorithm, GA) to obtain a more robust and efficient low-dimensional space. The data constructed from the original data are then classified with a support vector machine (SVM) and k nearest neighbor (k NN). Through this study of gene microarray data, we examine whether feature selection selects proper components for PCA and PLS dimension reduction, and whether only the top components are nontrivial for classification.

Results and discussion

Results by using SVM

In order to demonstrate the importance of feature selection in dimension reduction, we performed the following series of experiments using a support vector machine (SVM) as the classifier:

1. SVM is the baseline method: all the genes, without any selection or extraction, are input into the SVM for classification.

2. PCASVM uses PCA for feature extraction; all the newly extracted components are input into the SVM.

3. PLSSVM uses PLS for feature extraction; all the newly extracted components are input into the SVM.

4. PPSVM uses PCA+PLS for feature extraction; all the newly extracted components are input into the SVM.

5. GAPCASVM uses PCA to extract new components from the original gene set and GA to select a feature subset from the newly extracted components; the selected subset is input into the SVM.

6. GAPLSSVM uses PLS to extract new components from the original gene set and GA to select a feature subset from the newly extracted components; the selected subset is input into the SVM.

7. GAPPSVM uses PCA+PLS to extract new components from the original gene set and GA to select a feature subset from the newly extracted components; the selected subset is input into the SVM.
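The GA step shared by methods 5 to 7 can be sketched as a search over 0/1 masks, one bit per extracted component. The sketch below is a generic genetic algorithm with tournament selection, one-point crossover, bit-flip mutation and elitism. The population size, number of generations, crossover and mutation rates, and the `fitness` callable (for example, the validation accuracy of a classifier trained on the masked components) are all assumptions for illustration, since this section does not specify the GA settings.

```python
import random


def ga_select(n_features, fitness, pop_size=20, generations=30,
              crossover_rate=0.8, mutation_rate=0.05, seed=0):
    """Return a 0/1 mask over n_features components that scores well on `fitness`."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(((fitness(m), m) for m in pop),
                        key=lambda s: s[0], reverse=True)

        def tournament():
            a, b = rng.sample(scored, 2)              # pick two, keep the fitter one
            return a[1] if a[0] >= b[0] else b[1]

        next_pop = [scored[0][1][:]]                  # elitism: carry over the best mask
        while len(next_pop) < pop_size:
            p1, p2 = tournament(), tournament()
            child = p1[:]
            if n_features > 1 and rng.random() < crossover_rate:
                cut = rng.randrange(1, n_features)    # one-point crossover
                child = p1[:cut] + p2[cut:]
            # bit-flip mutation: each bit is toggled with probability mutation_rate
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            next_pop.append(child)
        pop = next_pop
    return max(pop, key=fitness)


# Toy fitness (assumed): reward agreement with a known "useful components" pattern.
target = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]
toy_fitness = lambda mask: sum(a == b for a, b in zip(mask, target))
best_mask = ga_select(len(target), toy_fitness)
```

Because the mask ranges over all extracted components, tail components compete on equal terms with the top ones, which is the point of the GA variants in the list above.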

Since SVM has parameters, we try to reduce their effect on our comparison by using four pairs of parameters: (C = 10, σ = 0.01), (C = 10, σ = 10), (C = 1000, σ = 0.01), and (C = 1000, σ = 10). Different data sets, including the extracted and selected data sets, need different optimal parameters for different methods. We do not choose the optimal parameters, because 1) they are out of reach in practice, since we regard the search for optimal parameters as an NP-hard problem; 2) our aim is not to exhibit the top performance of one particular method on one single data set, but to show the effect of our proposed framework.
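The four parameter pairs form a 2 × 2 grid, which can be enumerated as below. The sketch also includes a Gaussian (RBF) kernel to illustrate the role of σ. The kernel form exp(−‖x − z‖² / (2σ²)) is an assumption, because this section does not state which kernel or parameterization was used.

```python
import math
from itertools import product

# The four (C, sigma) pairs used in the comparison, in the order given in the text.
SVM_PARAMS = list(product([10, 1000], [0.01, 10]))


def rbf_kernel(x, z, sigma):
    """Gaussian kernel with width sigma (assumed parameterization)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-squared_distance / (2 * sigma ** 2))


for C, sigma in SVM_PARAMS:
    similarity = rbf_kernel([1.0, 0.0], [0.0, 0.0], sigma)
    print(f"C={C}, sigma={sigma}: k(x, z)={similarity:.4f}")
```

Under this assumed form, a small σ makes the kernel very local and a large σ makes it nearly flat over the same distance, which is one reason the two σ values can behave very differently on the same data set.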

Prediction performance

The average error rates and the corresponding standard deviations are shown in Table 1, where the standard deviations are computed over 50 repetitions of the experiments. From Table 1, we can find that:

Table 1 Statistical classification error rates (and their corresponding standard deviation) by using SVM with different parameters on four microarray data sets (%)


• On average, the classification methods with feature extraction or feature selection, such as PLSSVM, GAPLSSVM, PCASVM, GAPCASVM and GAPPSVM, obtain better results than SVM without any dimension reduction. The only exception is on the LUNG data set, where the results of PPSVM are worse than those of SVM when SVM uses C = 10, σ = 0.01.

• On average, the classification methods with feature selection (GAPLSSVM, GAPCASVM and GAPPSVM) obtain better results than the corresponding feature extraction methods without feature selection (PLSSVM, PCASVM and PPSVM). They are worse only in a few cases; for example, on the COLON data set with C = 10, σ = 10, the results of GAPCASVM are slightly worse than those of PCASVM.

• On average, the results of GAPLSSVM are better than those of PCASVM and GAPCASVM, and even than the corresponding results of PPSVM and GAPPSVM. Only on the CNS data set, one of the four data sets, does GAPCASVM obtain the best results among all the methods.

• The results of PPSVM and GAPPSVM, which combine PCA and PLS for feature extraction, are not the best; they are roughly equal to those of PCASVM and GAPCASVM.
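The averages and standard deviations of the kind reported in Table 1 can be obtained by repeating an experiment and summarizing the resulting error rates. The following is a minimal sketch using only the Python standard library. The `run_once` callable, which is assumed to run one full experiment for a given random seed and return an error rate in percent, is hypothetical, and the use of the sample standard deviation is also an assumption.

```python
import random
import statistics


def repeat_experiment(run_once, n_repeats=50, seed=0):
    """Call run_once(seed) n_repeats times and collect the error rates (%)."""
    rng = random.Random(seed)
    return [run_once(rng.randrange(2 ** 31)) for _ in range(n_repeats)]


def summarize(error_rates):
    """Return (mean, sample standard deviation) of a list of error rates."""
    return statistics.mean(error_rates), statistics.stdev(error_rates)


# Toy stand-in for a real experiment (assumed): error rates scattered around 20%.
def toy_run(seed):
    return 20.0 + random.Random(seed).uniform(-5.0, 5.0)


mean_error, std_error = summarize(repeat_experiment(toy_run))
```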

Number of selected features

We also show the percentage of features selected by each method, with the corresponding standard deviations, in Table 2, where the standard deviations are computed over 50 repetitions of the experiments. The values for PCASVM are the ratios of the number of top principal components to the number of extracted components; those for PLSSVM and PPSVM have the same meaning. The values for GAPCASVM are the ratios of the number of selected components used in SVM to the number of extracted components; those for GAPLSSVM and GAPPSVM have the same meaning. From Table 2, we can see that when the top components are used, about 60–80% of the components are fed into the learning machines, whereas when feature selection is used to select useful components, about 30% of the components are selected on average. The LUNG data set is the exception: there, the components selected by the different methods amount to 70%–80% of the extracted components.

Table 2 Average percentage of features (and their corresponding standard deviation) used by SVM with different parameters on four microarray data sets (%)


Distribution of selected features

Fig. 1 compares the distributions of the components selected by GA for GAPCASVM and GAPLSSVM, and Fig. 2 shows the distribution for GAPPSVM. The difference between the two figures is that in Fig. 1 PCA and PLS are used for feature extraction individually, while in Fig. 2 PCA is combined with PLS for feature extraction.

From Fig. 1 and Fig. 2, we can find that:

• When only PLS is used for feature extraction, the top components are selected slightly more often than the others, but the other components are also important.

• When only PCA is used, the top components are selected less often than the others, and the tail components are more important than the others.

• When both PCA and PLS are used for feature extraction, components from the two methods are nearly equally represented among the selected components, and the top components of PLS are selected slightly more often than the others.
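A distribution of this kind can be tabulated by counting, for each component position, how often it is selected across repeated GA runs. The following is a minimal sketch, assuming each run yields a 0/1 mask ordered from the top component to the tail; the function name and the example masks are illustrative only.

```python
from collections import Counter


def selection_frequency(masks):
    """Return, for each component index, the fraction of masks that select it."""
    counts = Counter()
    for mask in masks:
        counts.update(i for i, bit in enumerate(mask) if bit)
    n_runs = len(masks)
    return [counts[i] / n_runs for i in range(len(masks[0]))]


# Two toy runs over three components (assumed data).
frequencies = selection_frequency([[1, 0, 1], [1, 1, 0]])
```

In this toy example the first component is selected in both runs and the other two in one run each, so the frequencies are 1.0, 0.5 and 0.5.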

Results by using k NN

To show the importance of feature selection, we also performed the following series of experiments with the k NN learning machine, in order to reduce the bias caused by the choice of learning machine.

1. KNN is the baseline method: all the genes, without any selection or extraction, are input into k NN for classification.

2. PCAKNN uses PCA for feature extraction; all the newly extracted components are input into k NN.

3. PLSKNN uses PLS for feature extraction; all the newly extracted components are input into k NN.

4. PPKNN uses PCA+PLS for feature extraction; all the newly extracted components are input into k NN.

5. GAPCAKNN uses PCA to extract new components from the original gene set and GA to select a feature subset from the newly extracted components; the selected subset is input into k NN.

6. GAPLSKNN uses PLS to extract new components from the original gene set and GA to select a feature subset from the newly extracted components; the selected subset is input into k NN.

7. GAPPKNN uses PCA+PLS to extract new components from the original gene set and GA to select a feature subset from the newly extracted components; the selected subset is input into k NN.

Since k NN has a parameter, we try to reduce its effect on our comparison by using three values of k: k = 1, k = 4 and k = 7.

Different data sets need different optimal parameters for different methods. We do not choose the optimal parameters, because our aim is not to exhibit the top performance of one particular method on one single data set, but to show the effect of our proposed framework.
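For reference, a k NN classifier of the kind used here can be sketched in a few lines of standard-library Python. The Euclidean distance and the tie-breaking rule (on a tied vote, which can occur for even k such as k = 4, the label of the nearest tied neighbor wins) are assumptions, since this section does not specify them.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k):
    """Predict the label of x by majority vote among its k nearest training samples."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    neighbors = [train_y[i] for i in order[:k]]   # labels, nearest first
    votes = Counter(neighbors)
    top = max(votes.values())
    # On a tied vote, return the tied label that appears nearest to x.
    return next(label for label in neighbors if votes[label] == top)


def error_rate(train_X, train_y, test_X, test_y, k):
    """Return the percentage of test samples that are misclassified."""
    wrong = sum(knn_predict(train_X, train_y, x, k) != y
                for x, y in zip(test_X, test_y))
    return 100.0 * wrong / len(test_y)


# Toy data (assumed): two well-separated classes in a 2-component space.
train_X = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
train_y = ["a", "a", "b", "b"]
for k in (1, 4, 7):
    print(k, knn_predict(train_X, train_y, [0.2, 0.1], k))
```

In this sketch, when k exceeds the number of training samples, all samples vote.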