
Dimension reduction with redundant gene elimination for tumor classification

Abstract

Background

Analysis of gene expression data for tumor classification is an important application of bioinformatics methods. However, gene expression data from DNA microarray experiments are hard to analyse with commonly used classifiers, because such data sets contain only a few observations but thousands of measured genes. Dimension reduction is often used to handle this high-dimensional problem, but it is hindered by the large number of redundant features in the microarray data set.

Results

Dimension reduction is performed by combining feature extraction with redundant gene elimination for tumor classification. A novel metric of redundancy based on DIScriminative Contribution (DISC) is proposed, which estimates feature similarity by explicitly building a linear classifier on each gene. Compared with the standard linear correlation metric, DISC takes the label information into account and directly estimates the redundancy of the discriminative ability of two given features. Based on the DISC metric, a novel algorithm named REDISC (Redundancy Elimination based on Discriminative Contribution) is proposed, which eliminates redundant genes before feature extraction and improves the performance of dimension reduction. Experimental results on two microarray data sets show that the REDISC algorithm effectively and reliably improves the generalization performance of dimension reduction and hence of the subsequent classifier.

Conclusion

Dimension reduction that performs redundant gene elimination before feature extraction is better for tumor classification than feature extraction alone, and redundant gene elimination in a supervised way is superior to commonly used unsupervised methods such as the linear correlation coefficient.

Background

DNA microarray experiments are used to collect information from tissue and cell samples regarding gene expression differences for tumor diagnosis [1, 2]. The output of a microarray experiment is summarized as an n × p data matrix, where n is the number of tissue or cell samples and p is the number of genes (features). Here, p is always much larger than n, which hurts the generalization performance of most classification methods. To overcome this problem, we either select a small subset of interesting genes (gene selection, feature selection) or construct K new components summarizing the original data as well as possible, with K < p (feature extraction).

Gene selection has been studied extensively in the last few years. The most commonly used gene selection procedures are based on a score which is calculated for each gene individually, and the genes with the best scores are selected. Gene selection procedures output a list of relevant genes which can be experimentally analyzed by biologists. This approach is often called univariate gene selection, whose advantages are simplicity and interpretability. However, interactions and correlations between genes are ignored during gene selection, although they are of great interest in systems biology. Furthermore, gene selection often fails to pick all relevant genes, because the scores assigned to correlated genes are too similar and none of them is strongly preferred over another.

Feature extraction is an alternative to gene selection for overcoming the curse of dimensionality. Unlike gene selection, feature extraction projects the whole data set into a low-dimensional space and constructs new dimensions (components) by analyzing the statistical relationships hidden in the data. Although feature extraction is often criticized for its lack of interpretability, the new components often give good information or hints about the data's intrinsic structure. Researchers have developed different feature extraction methods for applications in bioinformatics and computational biology [3-5], which are generally divided into two groups, unsupervised and supervised. Among the various methods, Principal Component Analysis (PCA), an unsupervised method, and Partial Least Squares (PLS), a supervised method, are widely used [5].

Gene selection and feature extraction algorithms have complementary advantages and disadvantages. Feature extraction algorithms thrive on correlation among features but fail to remove irrelevant and redundant features from a set of complex features. Feature selection algorithms fail when all the features are correlated but do well with informative features. It is therefore interesting to combine gene selection and feature extraction into a general model. In practice, the simplest way is to apply a preliminary gene selection procedure before feature extraction.

For the analysis of microarray data, which is characterized by a huge number of genes and few samples, it is believed that many redundant genes exist in the full gene set [6]. Preserving the most discriminative genes while removing irrelevant and redundant genes remains an open issue. In this paper, we propose a novel metric of redundancy which can effectively eliminate redundant genes before feature extraction. By measuring the discriminative ability of each gene and the pair-wise complementarity, the new method removes redundant genes that contribute little discriminative ability. We also compare our method with commonly used redundant gene reduction methods based on linear correlation. Experiments on two real microarray data sets demonstrate the outstanding performance of our method.

Some notation used in this work is clarified here. Expression levels of p genes in n microarray samples are collected in an n × p data matrix X = (x_ij), 1 ≤ i ≤ n, 1 ≤ j ≤ p, where the entry x_ij is the expression level of the j-th gene in the i-th microarray sample. As we only consider binary classification problems, the labels of the n microarray samples are collected in the vector y. When the i-th sample belongs to class one, the element y_i is 1; otherwise it is -1. The matrix S_X denotes the p × p covariance matrix of the gene expressions.

Besides, ||·|| denotes the length of a vector, X^T denotes the transpose of X, and X^{-1} denotes the inverse of X. The matrix X and the vector y used in the following are assumed to be centered to zero mean by each column.

Results and discussion

Results

According to the framework proposed in this paper, dimension reduction is performed by combining redundant gene elimination with feature extraction, and a classifier is then applied to the extracted feature subsets. The newly proposed algorithm REDISC (Redundancy Elimination based on DIScriminative Contribution) is compared with the commonly used algorithm RELIC (Redundancy Elimination based on LInear Correlation) for redundant gene elimination on two microarray data sets, Colon and Leukemia, where the threshold δ in REDISC and RELIC is varied from 0.1 to 0.9. Feature extraction is performed by principal component analysis (PCA) and partial least squares (PLS). The classifier is a linear support vector machine (SVM) with C = 1.

Statistics of the number of genes remaining after REDISC and RELIC are shown in Figure 1; detailed results are also listed in Tables 1-4.

Table 1 Statistical results by performing PLS after REDISC and RELIC with different parameters on the Colon data set
Table 2 Statistical results by performing PCA after REDISC and RELIC with different parameters on the Colon data set
Table 3 Statistical results by performing PLS after REDISC and RELIC with different parameters on the Leukemia data set
Table 4 Statistical results by performing PCA after REDISC and RELIC with different parameters on the Leukemia data set
Figure 1

The number of selected genes by performing REDISC and RELIC with different parameters.

Comparative results of BACC obtained by SVM on the new feature sets produced by PCA or PLS after performing REDISC and RELIC are illustrated in Figure 2 and Figure 3. Detailed results of Sensitivity, Specificity, BACC, Precision, PPV, NPV and Correction on Colon and Leukemia are shown in Tables 1-4, where the results are averaged over ten runs.

Figure 2

Comparative results of BACC scores by using different algorithms on the Colon data set.

Figure 3

Comparative results of BACC scores by using different algorithms on the Leukemia data set.

The results in Figures 1-3 and Tables 1-4 show that:

1. Both REDISC and RELIC dramatically reduce the number of genes from the original data. With the same value of δ, REDISC obtains more compact subsets than RELIC does.

2. With δ = 0.1, RELIC always obtains better results than REDISC, but as δ increases, the results of REDISC become better than those of RELIC. On average, REDISC obtains better results than RELIC does.

3. When the results of REDISC and RELIC reach their highest points, REDISC uses fewer features than RELIC.

4. The effect of REDISC is positive for both PCA and PLS, while RELIC degrades performance in some cases, e.g. the BACC of PCA on the Leukemia data set.

5. REDISC and RELIC with different threshold values produce different results; no single value is optimal for all the data sets.

Discussion

The experimental results support our assumption that redundant features hurt the performance of feature extraction and classification. Further observations on the above results are listed below:

1. The results confirm that there are many redundant genes in microarray data and that it is necessary to perform redundant gene elimination. In general, a data set contains four types of features: strongly relevant features (I), weakly relevant but non-redundant features (II), weakly relevant and redundant features (III), and irrelevant features (IV). Types I and II are the essential features in the data set, and types III and IV should be removed [7]. Previous work showed that III and IV should be removed for classifiers, and in this paper we show that they should also be removed for feature extraction methods like PCA and PLS.

2. REDISC obtains better results with fewer features than RELIC, which shows that REDISC has a higher ability to select relevant features and eliminate redundant ones. Proper redundant feature elimination helps improve the performance of feature extraction and classification. Simply reducing redundant genes by linear correlation is not always beneficial, because without considering the label information in the data set, linear correlation does not estimate redundancy properly. REDISC takes label information into account for redundant gene elimination, which may be viewed as a supervised approach. Since the final step is classification, supervised redundant gene elimination is better than an unsupervised one like RELIC.

3. The results show that the performance of dimension reduction is improved when redundant genes are properly eliminated. The improvement for PLS is much more dramatic than that for PCA. A possible reason is that redundant genes obstruct the performance of supervised methods more obviously, since supervised methods often build more precise models than unsupervised ones.

Conclusion

Dimension reduction is widely used in bioinformatics and related fields to overcome the curse of dimensionality. But the large number of redundant genes in microarray data often hinders the application of dimension reduction. Preliminary redundant gene elimination before feature extraction is an interesting issue for dimension reduction which has often been neglected.

In this paper, a novel metric of redundancy based on Discriminative Contribution (DISC) is proposed, which directly estimates the similarity between two features by explicitly building a linear classifier on each gene. The REDISC algorithm (Redundancy Elimination based on Discriminative Contribution) is also proposed. REDISC is compared with RELIC (Redundancy Elimination based on Linear Correlation), a commonly used algorithm, on two real microarray data sets. Experimental results demonstrate the necessity of preliminary redundant gene elimination before feature extraction for tumor classification and the superiority of REDISC over RELIC. This work is an attempt to propose a general framework for dimension reduction for tumor classification by combining redundant gene elimination and feature extraction. More investigation of the fusion of feature selection with feature extraction is needed in the future.

Methods

A framework of dimension reduction

In this paper, we propose a novel framework for dimension reduction which combines redundant feature elimination with feature extraction to improve classification performance. The framework is illustrated in Figure 4: dimension reduction, consisting of redundant gene elimination and feature extraction, is performed on the microarray data before classification. The redundant gene elimination algorithms applied before feature extraction in this paper actually remove irrelevant features and redundant features at the same time. We do not treat irrelevant gene elimination separately because irrelevant genes are few in the gene data sets and are not the focus of this paper.

Figure 4

The novel framework of dimension reduction.

Redundant gene elimination is the critical part of the framework; we propose a novel algorithm based on discriminative ability to improve on the commonly used linear correlation approach, which is described in detail in the following subsections. Feature extraction is performed by two methods, one supervised, partial least squares, and one unsupervised, principal component analysis, which are briefly introduced in the following subsections. As the classifier, a support vector machine is used.

Redundant gene elimination

As redundant features contribute nothing to classification, we consider eliminating them before feature extraction, which has the following benefits:

1. Eliminating redundant features improves classification accuracy. In general, original microarray data sets have many irrelevant and redundant genes, which hurt the performance of feature extraction. In practice, biologists often expect noise to be reduced, at least to some extent, during the feature extraction stage. If some redundant genes are removed beforehand, the performance of feature extraction may be improved.

2. Preliminary feature selection facilitates the application of feature extraction. Compared with modeling the original data directly, the computation and memory consumed by feature extraction on preliminarily gene-selected data are much lower. The memory consumption is especially critical: most feature extraction methods are impractical for high-dimensional data because they require loading all data into RAM at once. However, any additional gene selection procedure brings some extra computation, so the computational complexity of the preliminary feature selection must not be too high.

3. Preliminary feature selection improves the interpretability of the components. The meanings of the components produced by feature extraction are always difficult to interpret. Biologists often analyze the relation between extracted components and original features through the coefficients, but this is obscured by the large number of genes. Reducing the number of original features is obviously helpful when the components need to be related to the original genes manually.

The previous metrics

Discriminative ability (predictive ability) is a general notion which can be measured in various ways and used to select significant features for classification. Many effective metrics have been proposed, such as the t-statistic, information gain, χ2 statistic, odds ratio, etc. [8, 9]. Filter feature selection methods sort features by their discriminative ability scores, and the top-ranked features are retained as essential for classification.

However, the t-statistic and most other discriminative ability measures are based on individual features and do not consider the redundancy between two features. Given two features with the same rank scores, they may be redundant to each other when they are completely correlated, or they may be complementary to each other when they are nearly independent.

For the task of feature selection, we want to eliminate the redundant features and retain only the complementary ones. But many redundant features remain in the top-ranked feature set produced by filter methods. These redundant features increase the dimensionality and contribute little to the final classification. In order to eliminate redundant features, metrics are needed that estimate the redundancy directly.

Notions of feature redundancy are normally expressed in terms of feature correlation. It is widely accepted that two features are redundant to each other if their values are completely correlated. In fact, it may not be so straightforward to determine feature redundancy when a feature is correlated with a set of features. A widely used approach is to approximate the redundancy of a feature set by considering pair-wise feature redundancy.

The linear correlation metric

For linear cases, the most well known pair-wise redundancy metric is the linear correlation coefficient. Given a pair of features (x, y), the definition of the linear correlation coefficient Cor(x, y) is:

Cor(x, y) = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}
(1)

where \bar{x} and \bar{y} are the means of x and y respectively. The value of Cor(x, y) lies between -1 and 1. If x and y are completely correlated, Cor(x, y) takes the value of 1 or -1; if x and y are independent, Cor(x, y) is zero. It is a symmetrical metric.
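As an illustration, Equation (1) can be computed with a few lines of NumPy; the snippet below is a sketch of ours, not code from the paper, and the function name is our own.

    import numpy as np

    def linear_correlation(x, y):
        # Pair-wise linear correlation coefficient Cor(x, y) of Equation (1).
        x = np.asarray(x, dtype=float)
        y = np.asarray(y, dtype=float)
        xc, yc = x - x.mean(), y - y.mean()              # center both features
        denom = np.sqrt((xc ** 2).sum() * (yc ** 2).sum())
        return float((xc * yc).sum() / denom)            # value lies in [-1, 1]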

The linear correlation coefficient has the advantage of efficiency and simplicity, but it is not suitable for redundant feature elimination when classification is the final target, since it does not use any label information. For example, two highly correlated features whose values differ only slightly, but whose differences happen to be critical for discrimination, may be judged a redundant pair; removing either of them will decrease classification accuracy. Guyon et al. have also pointed out that high correlation (or anti-correlation) of variables does not imply an absence of variable complementarity [8]. The problem of the linear correlation coefficient is that it measures the similarity of the numerical values of two features, not the similarity of their discriminative ability.

The ideal feature set should have both great discriminative ability and little feature redundancy, and redundancy cannot be assessed by estimating the properties of each feature separately. A more elaborate measure of redundancy is required that estimates the differences in discriminative ability between two features.

The proposed novel metric

In order to measure the similarity of the discriminative ability of two features, discriminative ability needs to be defined more precisely. That is to say, we want to know which examples can be correctly classified by a given feature and which cannot. With such a definition, it is possible to compare the discriminative ability of two features by the examples they correctly classify.

In the field of text classification, Training Accuracy on Single Feature (TASF) has been shown to be an effective metric of discriminative ability [9]: a classifier is built for each feature, and the corresponding training accuracy is used as the discriminative score.

Various classifiers can be used to calculate TASF; for simplicity, we consider a linear learner here. Given a feature z, the classification function is:

\hat{y} = \mathrm{sgn}\left( (\bar{z}^1 - \bar{z}^2)\left( z - \frac{n^1 \bar{z}^1 + n^2 \bar{z}^2}{n^1 + n^2} \right) \right)
(2)

where \bar{z}^1 and n^1 are the feature mean and the sample size of class one, and \bar{z}^2 and n^2 are the feature mean and the sample size of class two. This is a weighted centroid-based classifier, which assigns an example to the class whose weighted distance to its centroid is smaller. The computational complexity of this classifier is O(n).

Feeding the whole training set back to each single-feature classifier, we can estimate its training accuracy, which is used to represent the discriminative ability of the corresponding feature. The higher the training accuracy, the greater the discriminative ability. Since only one feature is used to build the classifier, only part of the training examples can be correctly separated in most cases. The value of TASF thus ranges from 0 to 1, and a feature is considered irrelevant if its TASF value is no greater than 0.5.
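The following is a minimal sketch of the single-feature centroid classifier of Equation (2) together with its training accuracy (TASF). The function name and the convention of labels in {-1, +1} are our assumptions; the paper gives no reference implementation.

    import numpy as np

    def tasf(z, y):
        # Training Accuracy on Single Feature for one gene z, labels y in {-1, +1}.
        z = np.asarray(z, dtype=float)
        y = np.asarray(y)
        z1, n1 = z[y == 1].mean(), np.sum(y == 1)      # centroid and size of class one
        z2, n2 = z[y == -1].mean(), np.sum(y == -1)    # centroid and size of class two
        threshold = (n1 * z1 + n2 * z2) / (n1 + n2)    # weighted midpoint of the two centroids
        y_hat = np.sign((z1 - z2) * (z - threshold))   # prediction of Equation (2)
        return float(np.mean(y_hat == y))              # training accuracy in [0, 1]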

Based on TASF, we propose a novel metric of feature redundancy. Given two features z1 and z2, two classifiers C1 and C2 can be constructed. Feeding the whole training set to the classifiers, each of C1 and C2 correctly classifies a subset of the samples. The differences between the correctly classified examples are used to estimate the similarity of discriminative abilities. We record the classification results as in Table 5, where a + b + c + d equals the size of the training set n. The values of (a + b)/n and (a + c)/n are the training accuracies of C1 and C2 respectively. The count a + d measures the similarity of the features, and the count b + c measures their dissimilarity. When b + c = 0, the two features z1 and z2 have exactly the same discriminative ability.

Table 5 Statistical relative classification results of two classifiers

The feature elimination problem then becomes whether the contribution of an additional feature to a given feature is significant. The additional feature is considered redundant if its contribution is tiny. We therefore propose a novel metric of Redundancy based on DIScriminative Contribution (DISC). The DISC score of z1 and z2, which estimates z2's redundancy with respect to z1, is defined as follows,

\mathrm{DISC}(z_1, z_2) = 1 - \frac{c}{c + d} = \frac{d}{c + d}
(3)

The pair-wise DISC metric is asymmetrical, and its computational complexity is O(n).

It is clear that c + d is the number of examples which cannot be correctly classified by C1, and c is the number of those which are correctly classified by C2. So the proportion c/(c + d) is the discriminative contribution of C2 to C1, and d/(c + d) is the DISC metric of redundancy, which varies from 0 to 1. When the DISC score is 1, C2's discriminative ability is covered by C1's, and z2 is completely redundant to z1. When the DISC value is 0, all training examples can be correctly classified by the union of C1 and C2, and we consider z2 complementary to z1.
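Given the per-example correctness indicators of the two single-feature classifiers C1 and C2, the DISC score of Equation (3) follows directly. The sketch below is ours; in particular, the handling of the degenerate case c + d = 0 (C1 already classifies every example correctly) is an assumption not covered by the paper.

    import numpy as np

    def disc(correct1, correct2):
        # DISC(z1, z2): redundancy of z2 with respect to z1 (Equation 3).
        # correct1[i] / correct2[i] are True when C1 / C2 classifies example i correctly.
        correct1 = np.asarray(correct1, dtype=bool)
        correct2 = np.asarray(correct2, dtype=bool)
        c = int(np.sum(~correct1 & correct2))   # misclassified by C1 but correctly classified by C2
        d = int(np.sum(~correct1 & ~correct2))  # misclassified by both classifiers
        if c + d == 0:                          # C1 already classifies everything correctly;
            return 1.0                          # assumption: z2 then adds nothing and is treated as redundant
        return d / (c + d)                      # 1 -> fully redundant, 0 -> fully complementary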

DISC is linear in two respects: the classifier built on each feature is linear, and the cross discriminative abilities are counted in a linear way. Microarray problems meet this assumption, since most microarray data sets pose binary classification problems in which each gene plays an equal role in classification.

The proposed redundant gene elimination algorithms

The REDISC algorithm

Based on the DISC redundancy metric, we propose the REDISC algorithm (Redundancy Elimination based on Discriminative Contribution), which eliminates redundant features by their pair-wise DISC scores. REDISC is illustrated in Figure 5. Its basic idea is as follows: first, REDISC filters out trivial features, which have no discriminative ability on their own, using a TASF score threshold of 0.5. Then the features are ordered by their TASF scores. As we usually want to retain the more discriminative of two redundant features, REDISC tries to preserve the features with the top TASF scores. REDISC uses two nested iterations to eliminate redundant features whose discriminative ability is covered by a higher-ranked feature. The computational complexity of REDISC is O(np^2).

Figure 5

The REDISC algorithm.
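The paper's pseudo-code for REDISC is given in Figure 5. The sketch below is our own reading of the verbal description above: filter genes by TASF > 0.5, rank them by TASF, and discard any lower-ranked gene whose DISC redundancy with respect to an already retained gene reaches the threshold δ. The exact form of the comparison against δ is an assumption.

    import numpy as np

    def redisc(X, y, delta, tasf, disc_pair):
        # Sketch of REDISC; returns the indices of retained genes.
        #   tasf(z, y)           -> training accuracy of the single-feature classifier on gene z
        #   disc_pair(z1, z2, y) -> DISC(z1, z2), redundancy of z2 with respect to z1
        n, p = X.shape
        scores = np.array([tasf(X[:, j], y) for j in range(p)])
        candidates = [j for j in np.argsort(-scores) if scores[j] > 0.5]   # drop trivial genes
        retained = []
        for j in candidates:                    # from most to least discriminative
            covered = any(disc_pair(X[:, k], X[:, j], y) >= delta for k in retained)
            if not covered:
                retained.append(j)
        return retained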

The RELIC algorithm

In order to compare our method with commonly used redundant feature elimination methods, we present the RELIC algorithm (Redundancy Elimination based on Linear Correlation) [10], which filters out redundant features by their pair-wise linear correlation. A threshold is needed to control how many features should be eliminated. RELIC is given in Figure 6; its computational complexity is also O(np^2).

Figure 6

The RELIC algorithm.
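Analogously, a sketch of RELIC as we understand it (the paper's pseudo-code is in Figure 6): genes are scanned in order, and a gene is dropped when its absolute linear correlation with an already retained gene reaches the threshold δ. The scanning order is our assumption.

    import numpy as np

    def relic(X, delta):
        # Sketch of RELIC; drop genes whose |linear correlation| with a kept gene reaches delta.
        n, p = X.shape
        retained = []
        for j in range(p):
            redundant = any(abs(np.corrcoef(X[:, k], X[:, j])[0, 1]) >= delta for k in retained)
            if not redundant:
                retained.append(j)
        return retained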

Feature extraction techniques

Principal component analysis

Principal component analysis (PCA) is a well-known feature extraction method [11]. The basic idea of PCA is to reduce the dimensionality of a data set while retaining as much as possible of the variation in the original predictor variables. This is achieved by transforming the p original variables X = [x_1, x_2, ..., x_p] to a new set of K predictor variables, T = [t_1, t_2, ..., t_K], which are linear combinations of the original variables. In mathematical terms, PCA sequentially maximizes the variance of a linear combination of the original predictor variables,

u_K = \arg\max_{u^T u = 1} \mathrm{Var}(Xu)
(4)

subject to the constraint u_i^T S_X u_j = 0, 1 ≤ i < j. The orthogonality constraint ensures that the linear combinations are uncorrelated, i.e. Cov(Xu_i, Xu_j) = 0, i ≠ j. These linear combinations

t_i = X u_i (5)

are known as the principal components (PCs).

The maximum number of components K is determined by the number of nonzero eigenvalues, which is the rank of S_X, and K ≤ min(n, p). In practice, however, the maximum value of K is not needed. Some tail components, which have tiny eigenvalues and represent little of the variance of the original data, are usually discarded. The value of K is often determined by cross-validation or by the proportion of explained variance [11]. The computational cost of PCA, determined by the number of original predictor variables p and the number of samples n, is of the order min(np^2 + p^3, pn^2 + n^3). In other words, the cost is O(pn^2 + n^3) when p > n.
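For illustration, the principal components can be obtained in practice with scikit-learn's PCA; the sketch below uses placeholder random data with the Colon dimensions and is not the implementation used in the paper.

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.standard_normal((62, 2000))     # placeholder expression matrix with the Colon dimensions
    K = 5                                   # number of components to keep

    pca = PCA(n_components=K)
    T = pca.fit_transform(X)                # n x K matrix of principal components t_i = X u_i
    explained = pca.explained_variance_ratio_.sum()   # proportion of variance retained by K components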

Partial Least Squares

Partial Least Squares (PLS) was first developed as an algorithm for performing matrix decompositions and was then introduced as a multivariate regression tool in the context of chemometrics [12, 13]. In recent years, PLS has also been found to be an effective feature extraction technique for tumor discrimination [14, 15].

The underlying assumption of PLS is that the observed data are generated by a system or process driven by a small number of latent (not directly observed or measured) features. PLS therefore aims at finding uncorrelated linear transformations (latent components) of the original predictor features which have high covariance with the response features. Based on these latent components, PLS predicts the response features y (the task of regression) and reconstructs the original matrix X (the task of data modeling) at the same time.

The objective of constructing components in PLS is to maximize the covariance between the response variable y and the original predictor variables X,

w_K = \arg\max_{w^T w = 1} \mathrm{Cov}(Xw, y)
(6)

subject to the constraint w_i^T S_X w_j = 0, 1 ≤ i < j. The central task of PLS is to obtain the vectors of optimal weights w_i (i = 1, ..., K) to form a small number of components; in contrast, PCA is an "unsupervised" method that utilizes the X data only.

Like PCA, PLS reduces the complexity of microarray data analysis by constructing a small number of gene components, which can be used to replace the large number of original gene expression measures. Moreover, obtained by maximizing the covariance between the components and the response variable, the PLS components are generally more predictive of the response variable than the principal components.

PLS is computationally efficient, with a cost of only O(npK), i.e. the number of calculations required by PLS is linear in n and p. It is thus much faster than PCA, since K is always less than n.
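Analogously, the PLS components can be obtained with scikit-learn's PLSRegression; again, this is an illustrative sketch with placeholder data, not the implementation used in the paper.

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression

    rng = np.random.default_rng(0)
    X = rng.standard_normal((62, 2000))                # placeholder expression matrix
    y = np.where(rng.standard_normal(62) > 0, 1, -1)   # placeholder labels in {-1, +1}
    K = 5

    pls = PLSRegression(n_components=K)
    pls.fit(X, y)                                      # components chosen to maximize Cov(Xw, y)
    T = pls.transform(X)                               # n x K matrix of PLS components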

Feature extraction methods extract components to represent the original data, which are linear or non-linear transformations of the original genes. Although the new subspace is effective for data analysis, no original gene is excluded during the process, which often hampers the interpretation of the components. Eliminating redundant genes before dimension reduction is an alternative way to address this problem.

Classifier – Support Vector Machines

Support vector machines (SVM), proposed by Vapnik and his co-workers in the 1990s, have developed quickly during the last decade [16] and have been successfully applied to biological data mining [17], drug discovery [18, 19], etc. Denoting the training set as S = {(x_i, y_i)}_{i=1}^{\ell} ⊂ R^n × {-1, 1}, the SVM discriminant hyperplane can be written as

y = sgn(w·x + b)

where w is a weight vector and b is a bias. According to the generalization bound in statistical learning theory [20], we need to minimize the following objective function for a 2-norm soft-margin version of SVM:

\min_{w, b} \;\; \langle w \cdot w \rangle + C \sum_{i=1}^{\ell} \xi_i^2 \quad \text{subject to} \quad y_i (\langle w \cdot x_i \rangle + b) \ge 1 - \xi_i, \quad i = 1, \ldots, \ell,
(7)

in which the slack variables ξ_i are introduced when the problem is otherwise infeasible. The constant C > 0 is a penalty parameter; a larger C corresponds to assigning a larger penalty to errors.
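The linear SVM with C = 1 used as the downstream classifier can be reproduced, for example, with scikit-learn. Note that scikit-learn's SVC implements the standard 1-norm soft margin rather than the 2-norm soft margin of Equation (7), so this is only an approximation of the setting described; the variable names are placeholders.

    import numpy as np
    from sklearn.svm import SVC

    # placeholder extracted components and labels; in the paper these come from PCA or PLS
    rng = np.random.default_rng(0)
    T_train = rng.standard_normal((50, 5))
    y_train = np.array([1] * 25 + [-1] * 25)
    T_test = rng.standard_normal((12, 5))

    clf = SVC(kernel="linear", C=1.0)       # linear SVM with penalty parameter C = 1
    clf.fit(T_train, y_train)
    y_pred = clf.predict(T_test)            # predicted labels in {-1, +1}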

Data Sets

The two microarray data sets used in our study are listed in Table 6. They are briefly described below, and the corresponding C4.5-format versions are available at [21]. We do not use the original split provided by their authors; we merge each data set before using it.

Table 6 Experimental data sets

Colon: Affymetrix oligonucleotide arrays were used to monitor the expression of over 6,500 human genes in 40 tumor and 22 normal colon tissue samples. The expression of the 2,000 genes with the highest minimal intensity across the 62 tissues was used in the analysis [2].

Leukemia: The acute leukemia data set was published in [1] and consists of 72 bone marrow samples, 47 ALL and 25 AML. The gene expression intensities were obtained from Affymetrix high-density oligonucleotide microarrays containing probes for 7,129 genes.

Experimental settings

We use a stratified 10-fold cross-validation procedure, where each data set is first merged and then split into ten subsets of equal size. Each subset is used as a test set once, and the remaining subsets are combined and used as the training set. Within each cross-validation fold, the gene expression data are standardized: the expressions of the training set are transformed to zero mean and unit standard deviation across samples, and the test set is transformed according to the means and standard deviations of the corresponding training set. We use 10-fold cross-validation because the 10 × 10 cross-validation measurement is more reliable than the randomized re-sampling test strategy and leave-one-out cross-validation, owing to the correlations between the test and training sets; detailed discussions can be found in [22].

A linear Support Vector Machine (SVM) with C = 1 is used as the classifier, which is trained on the training set to predict the labels of the test samples. Figure 7 contains pseudo-code describing the complete 10 × 10 cross-validation measurement procedure.

Figure 7

Experimental procedure for comparing different algorithms.
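The complete pseudo-code is given in Figure 7; below is a condensed sketch, under our own assumptions about the details, of one way to implement the evaluation loop described above: stratified 10-fold cross-validation, repeated ten times, with standardization fitted on each training fold only. The dimension reduction steps are indicated by a comment and omitted for brevity.

    import numpy as np
    from sklearn.model_selection import StratifiedKFold
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    def cross_validate(X, y, n_repeats=10):
        # Stratified 10-fold cross-validation, repeated n_repeats times (10 x 10 in the paper).
        accuracies = []
        for repeat in range(n_repeats):
            skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=repeat)
            for train_idx, test_idx in skf.split(X, y):
                scaler = StandardScaler().fit(X[train_idx])   # standardization fitted on the training fold only
                X_train = scaler.transform(X[train_idx])
                X_test = scaler.transform(X[test_idx])
                # redundant gene elimination (REDISC or RELIC) and feature extraction (PCA or PLS)
                # would be fitted on X_train here and then applied to X_test; omitted for brevity
                clf = SVC(kernel="linear", C=1.0).fit(X_train, y[train_idx])
                accuracies.append(clf.score(X_test, y[test_idx]))
        return float(np.mean(accuracies))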

In order to precisely characterize the performance of different learning methods, we define several performance measures below (see [23]). Here TP, TN, FP, and FN stand for the number of true positive, true negative, false positive, and false negative samples, respectively.

Sensitivity is defined as TP/(TP + FN) and is also known as Recall.

Specificity is defined as TN/(TN + FP).

BACC (Balanced Accuracy) is defined as (1/2)(TP/(TP + FN) + TN/(TN + FP)), the average of sensitivity and specificity.

Precision is defined as TP/(TP + FP).

PPV (Positive Predictive Value) is defined as TP/(TP + FP).

NPV (Negative Predictive Value) is defined as TN/(TN + FN).

Correction is defined as (TP + TN)/(TP + TN + FP + FN) and measures the overall percentage of samples correctly classified.
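For completeness, the measures above can be computed directly from the confusion-matrix counts; the helper function below is our own, written purely for illustration.

    def performance_measures(TP, TN, FP, FN):
        # Performance measures used in Tables 1-4, computed from confusion-matrix counts.
        sensitivity = TP / (TP + FN)                  # also called Recall
        specificity = TN / (TN + FP)
        bacc = (sensitivity + specificity) / 2        # Balanced Accuracy
        precision = TP / (TP + FP)                    # identical to PPV as defined above
        npv = TN / (TN + FN)
        correction = (TP + TN) / (TP + TN + FP + FN)  # overall fraction of correctly classified samples
        return {"Sensitivity": sensitivity, "Specificity": specificity, "BACC": bacc,
                "Precision": precision, "PPV": precision, "NPV": npv, "Correction": correction}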

References

  1. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES: Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science. 1999, 286 (5439): 531-537.


  2. Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, Levine AJ: Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences of the United States of America. 1999, 96 (12): 6745-6750. 10.1073/pnas.96.12.6745.


  3. Antoniadis A, Lambert-Lacroix S, Leblanc F: Effective dimension reduction methods for tumor classification using gene expression data. Bioinformatics. 2003, 19 (5): 563-570. 10.1093/bioinformatics/btg062.


  4. Nguyen DV, Rocke DM: On partial least squares dimension reduction for microarray-based classification: a simulation study. Computational Statistics & Data Analysis. 2004, 46 (3): 407-425. 10.1016/j.csda.2003.08.001.


  5. Dai JJ, Lieu L, Rocke D: Dimension reduction for classification with gene expression data. Statistical Applications in Genetics and Molecular Biology. 2006, 5: Article 6-10.2202/1544-6115.1147.


  6. Yu L, Liu H: Redundancy Based Feature Selection for Microarray Data. Proc. 10th ACM SIGKDD Conf. Knowledge Discovery and Data Mining. 2004, 22-25.


  7. Yu L, Liu H: Efficient Feature Selection Via Analysis of Relevance and Redundancy. Journal of Machine Learning Research. 2004, 5 (Oct): 1205-1224.


  8. Guyon I, Elisseefi A: An Introduction to Variable and Feature Selection. Journal of Machine Learning Research. 2003, 3 (7–8): 1157-1182. 10.1162/153244303322753616.


  9. Forman G: An Extensive Empirical Study of Feature Selection Metrics for Text Classification. Journal of Machine Learning Research. 2003, 3: 1289-1305. 10.1162/153244303322753670.


  10. Hall MA, Holmes G: Benchmarking attribute selection techniques for discrete class data mining. IEEE Transactions on Knowledge and Data Engineering. 2003, 15 (6): 1437-1447. 10.1109/TKDE.2003.1245283.


  11. Jolliffe IT: Principal Component Analysis. 2002, Springer Series in Statistics, Springer, second


  12. Wold S, Ruhe A, Wold H, Dunn W: Collinearity problem in linear regression. The partial least squares (PLS) approach to generalized inverses. SIAM Journal of Scientific and Statistical Computations. 1984, 5 (3): 735-743. 10.1137/0905052.


  13. Boulesteix AL, Strimmer K: Partial Least Squares: A Versatile Tool for the Analysis of High-Dimensional Genomic Data. Briefings in Bioinformatics. 2006


  14. Nguyen DV, Rocke DM: Multi-class cancer classification via partial least squares with gene expression profiles. Bioinformatics. 2002, 18 (9): 1216-1226. 10.1093/bioinformatics/18.9.1216.


  15. Nguyen DV, Rocke DM: Tumor classification by partial least squares using microarray gene expression data. Bioinformatics. 2002, 18: 39-50. 10.1093/bioinformatics/18.1.39.


  16. Cristianini N, Shawe-Taylor J: An Introduction to Support Vector Machines. 2000, Cambridge: Cambridge University Press


  17. Guyon I, Weston J, Barnhill S, Vapnik V: Gene Selection for Cancer Classification Using Support Vector Machines. Machine Learning. 2002, 46: 389-422. 10.1023/A:1012487302797.


  18. Xue Y, Li ZR, Yap CW, Sun LZ, Chen X, Chen YZ: Effect of Molecular Descriptor Feature Selection in Support Vector Machine Classification of Pharmacokinetic and Toxicological Properties of Chemical Agents. Journal of Chemical Information & Computer Science. 2004, 44 (5): 1630-1638. 10.1021/ci049869h.


  19. Bhavani S, Nagargadde A, Thawani A, Sridhar V, Chandra N: Substructure-Based Support Vector Machine Classifiers for Prediction of Adverse Effects in Diverse Classes of Drugs. Journal of Chemical Information and Modeling. 2006, 46 (6): 2478-2486. 10.1021/ci060128l.


  20. Vapnik V: Statistical Learning Theory. 1998, New York: Wiley


  21. Li J, Liu H: Kent Ridge Bio-medical Data Set Repository. 2002, [http://www.cs.shu.edu.cn/gzli/data/mirror-kentridge.html]


  22. Dietterich TG: Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. Neural Computation. 1998, 10: 1895-1923. 10.1162/089976698300017197.


  23. Levner I: Feature Selection and Nearest Centroid Classification for Protein Mass Spectrometry. BMC Bioinformatics. 2005, 6: 68-10.1186/1471-2105-6-68.



Acknowledgements

This work was supported in part by the Natural Science Foundation of China under grant no. 20503015, the STCSM "Innovation Action Plan" Project of China under grant no. 07DZ19726, Shanghai Leading Academic Discipline Project under no. J50103, Systems Biology Research Foundation of Shanghai University and Scientific Research Fund of Jiangxi Provincial Education Departments under grant no. 2007-57.

This article has been published as part of BMC Bioinformatics Volume 9 Supplement 6, 2008: Symposium of Computations in Bioinformatics and Bioscience (SCBB07). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/9?issue=S6.

Author information

Corresponding author

Correspondence to Guo-Zheng Li.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

Guo-Zheng Li and Xue-Qiang Zeng proposed the idea, designed the experiments and wrote the paper; Xue-Qiang Zeng performed experiments; Geng-Feng Wu helped in writing the paper; Mary Qu Yang helped design the experiments; Jack Y. Yang conceived and guided the project.

Rights and permissions

Open Access: This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


About this article

Cite this article

Zeng, XQ., Li, GZ., Yang, J.Y. et al. Dimension reduction with redundant gene elimination for tumor classification. BMC Bioinformatics 9 (Suppl 6), S8 (2008). https://doi.org/10.1186/1471-2105-9-S6-S8
