Abstract
Background
Analysis of gene expression data for tumor classification is an important application of bioinformatics methods. However, gene expression data from DNA microarray experiments are hard to analyze with commonly used classifiers, because such data sets contain only a few observations but thousands of measured genes. Dimension reduction is often used to handle this high-dimensional problem, but its effectiveness is obscured by the large number of redundant features in microarray data sets.
Results
Dimension reduction is performed by combining feature extraction with redundant gene elimination for tumor classification. A novel metric of redundancy based on DIScriminative Contribution (DISC) is proposed, which estimates feature similarity by explicitly building a linear classifier on each gene. Compared with the standard linear correlation metric, DISC takes label information into account and directly estimates how much the discriminative abilities of two given features overlap. Based on the DISC metric, a novel algorithm named REDISC (Redundancy Elimination based on Discriminative Contribution) is proposed, which eliminates redundant genes before feature extraction and thus improves dimension reduction. Experimental results on two microarray data sets show that REDISC reliably improves the generalization performance of dimension reduction and hence of the subsequent classifier.
Conclusion
For tumor classification, dimension reduction that performs redundant gene elimination before feature extraction is better than feature extraction alone, and eliminating redundant genes in a supervised way is superior to commonly used unsupervised methods such as linear correlation coefficients.
Background
DNA microarray experiments are used to collect information from tissue and cell samples regarding gene expression differences for tumor diagnosis [1,2]. The output of a microarray experiment is summarized as an n × p data matrix, where n is the number of tissue or cell samples and p is the number of genes (features). Here, p is always much larger than n, which hurts the generalization performance of most classification methods. To overcome this problem, we either select a small subset of interesting genes (gene selection, also called feature selection) or construct K new components that summarize the original data as well as possible, with K < p (feature extraction).
Gene selection has been studied extensively in recent years. The most commonly used gene selection procedures compute a score for each gene individually and select the genes with the best scores. They output a list of relevant genes that can be experimentally analyzed by biologists. This approach is often called univariate gene selection; its advantages are simplicity and interpretability. However, interactions and correlations between genes are ignored, although they are of great interest in systems biology. Furthermore, univariate selection often fails to separate correlated relevant genes, because the scores it assigns to them are too similar and none of them is strongly preferred over another.
Feature extraction is an alternative to gene selection for overcoming the curse of dimensionality. Unlike gene selection, feature extraction projects the whole data set into a low-dimensional space, constructing new dimensions (components) from the statistical relationships hidden in the data. Although feature extraction is often criticized for its lack of interpretability, the new components often give good information or hints about the data's intrinsic structure. Researchers have developed various feature extraction methods for applications in bioinformatics and computational biology [3-5], which generally fall into two groups, unsupervised and supervised. Among them, Principal Component Analysis (PCA), an unsupervised method, and Partial Least Squares (PLS), a supervised method, are widely used [5].
Gene selection and feature extraction have complementary advantages and disadvantages. Feature extraction algorithms thrive on correlation among features but fail to remove irrelevant and redundant features from a complex feature set, while feature selection algorithms fail when all the features are correlated but do well with informative features. Combining gene selection and feature extraction into a general model is therefore an interesting direction. In practice, the simplest way is to apply a preliminary gene selection procedure before feature extraction.
Microarray data analysis is special in that it involves a huge number of genes but few examples, and it is believed that many redundant genes exist among the full gene set [6]. Preserving the most discriminative genes while removing irrelevant and redundant ones remains an open issue. In this paper, we propose a novel metric of redundancy which can effectively eliminate redundant genes before feature extraction. By measuring the discriminative ability of each gene and the pairwise complementarity between genes, the new method removes redundant genes that contribute little discriminative ability. We also compare our method with commonly used redundant gene reduction methods based on linear correlation. Experiments on real microarray data sets demonstrate the outstanding performance of our method.
Some notation used in this work is clarified here. Expression levels of p genes in n microarray samples are collected in an n × p data matrix X = (x_{ij}), 1 ≤ i ≤ n, 1 ≤ j ≤ p, where the entry x_{ij} is the expression level of the jth gene in the ith microarray sample. As we only consider binary classification problems, the labels of the n microarray samples are collected in the vector y: the element y_i is 1 when the ith sample belongs to class one, and -1 otherwise. The matrix S_X denotes the p × p covariance matrix of the gene expressions.
Besides, ‖·‖ denotes the length (Euclidean norm) of a vector, X^T the transpose of X, and X^{-1} the inverse of X. The matrices X and y used in the following are assumed to be centered to zero mean in each column.
Results and discussion
Results
According to the framework proposed in this paper, dimension reduction is performed by combining redundant gene elimination with feature extraction; a classifier is then applied to the extracted feature subsets. The proposed REDISC algorithm (Redundancy Elimination based on DIScriminative Contribution) is compared with the commonly used RELIC algorithm (Redundancy Elimination based on LInear Correlation) for redundant gene elimination on two microarray data sets, Colon and Leukemia, where the threshold δ in REDISC and RELIC is varied from 0.1 to 0.9. Feature extraction is performed by principal component analysis (PCA) and partial least squares (PLS). The classifier is a linear support vector machine (SVM) with C = 1.
Statistical results on the number of genes remaining after REDISC and RELIC are shown in Figure 1; detailed results are also listed in Tables 1-4.
Table 1. Statistical results by performing PLS after REDISC and RELIC with different parameters on the Colon data set
Table 2. Statistical results by performing PCA after REDISC and RELIC with different parameters on the Colon data set
Table 3. Statistical results by performing PLS after REDISC and RELIC with different parameters on the Leukemia data set
Table 4. Statistical results by performing PCA after REDISC and RELIC with different parameters on the Leukemia data set
Figure 1. The number of selected genes by performing REDISC and RELIC with different parameters.
Comparative BACC results obtained by the SVM on the new feature sets produced by PCA or PLS after REDISC and RELIC are illustrated in Figures 2 and 3. Detailed results for Sensitivity, Specificity, BACC, PPV (Precision), NPV and Correction on Colon and Leukemia are shown in Tables 1-4, where the results are averaged over ten runs.
Figure 2. Comparative results of BACC scores by using different algorithms on the Colon data set.
Figure 3. Comparative results of BACC scores by using different algorithms on the Leukemia data set.
The results in Figures 1-3 and Tables 1-4 show that:
1. Both REDISC and RELIC dramatically reduce the number of genes from the original data. With the same value of δ, REDISC obtains more compact subsets than RELIC does.
2. With δ = 0.1, RELIC always obtains better results than REDISC, but as δ increases, the results of REDISC become better than those of RELIC. On average, REDISC obtains better results than RELIC does.
3. When the results of REDISC and RELIC reach their highest points, REDISC uses fewer features than RELIC.
4. The effect of REDISC is positive for both PCA and PLS, while RELIC degrades performance in some cases, e.g. the BACC of PCA on the Leukemia data set.
5. REDISC and RELIC with different threshold values produce different results, and no single value is optimal for all the data sets.
Discussion
The experimental results support our assumption that redundant features hurt the performance of feature extraction and classification. Further observations on the above results are listed below:
1. The results confirm that many redundant genes exist in microarray data and that redundant gene elimination is necessary. In general, a data set contains four types of features: (I) strongly relevant features, (II) weakly relevant but non-redundant features, (III) weakly relevant and redundant features, and (IV) irrelevant features. Types I and II are the essential features, while III and IV should be removed [7]. Previous work showed that III and IV should be removed for classifiers; in this paper, we show they should also be removed for feature extraction methods like PCA and PLS.
2. REDISC obtains better results with fewer features than RELIC, which shows that REDISC is better at selecting relevant features and eliminating redundant ones. Proper redundant feature elimination helps improve the performance of feature extraction and classification. Simply reducing redundant genes by linear correlation is not always beneficial, because linear correlation ignores the label information in the data set and therefore does not estimate redundancy properly. REDISC takes label information into account for redundant gene elimination, which may be viewed as a supervised approach. Since the final step is classification, supervised redundant gene elimination is better than an unsupervised method like RELIC.
3. The performance of dimension reduction is improved when redundant genes are properly eliminated. The improvement for PLS is much more dramatic than that for PCA. A possible reason is that redundant genes hinder supervised methods more severely, since supervised methods often build more precise models than unsupervised ones.
Conclusion
Dimension reduction is widely used in bioinformatics and related fields to overcome the curse of dimensionality, but the large number of redundant genes in microarray data often obscures its application. Preliminary redundant gene elimination before feature extraction is an interesting issue for dimension reduction, yet it has often been neglected.
In this paper, a novel metric of redundancy based on Discriminative Contribution (DISC) is proposed, which directly estimates the similarity between two features by explicitly building a linear classifier on each gene. The REDISC algorithm (Redundancy Elimination based on Discriminative Contribution) is also proposed and compared with a commonly used algorithm, RELIC (Redundancy Elimination based on Linear Correlation), on two real microarray data sets. Experimental results demonstrate the necessity of preliminary redundant gene elimination before feature extraction for tumor classification and the superiority of REDISC over RELIC. This work is an attempt at a general framework for dimension reduction in tumor classification that combines redundant gene elimination and feature extraction. The efficiency of fusing feature selection with feature extraction needs more investigation in the future.
Methods
A framework of dimension reduction
In this paper, we propose a novel framework for dimension reduction that combines redundant feature elimination with feature extraction to improve classification performance. The framework is illustrated in Figure 4: dimension reduction is applied to the microarray data before classification, and it consists of redundant gene elimination followed by feature extraction. The redundant gene elimination algorithms in this paper actually remove irrelevant and redundant features at the same time. We do not treat irrelevant gene elimination separately, because irrelevant genes are few in the gene data set and are not the focus of this paper.
Figure 4. The novel framework of dimension reduction.
Redundant gene elimination is the critical part of the framework; we propose a novel algorithm based on discriminative ability to improve on the commonly used linear correlation approach, described in detail in the following subsections. Feature extraction is performed by two methods, one supervised (partial least squares) and one unsupervised (principal component analysis), which are briefly introduced below. As the classifier, a support vector machine is used.
Redundant gene elimination
As redundant features contribute nothing to classification, we consider eliminating them before feature extraction, which has the following benefits:
1. Eliminating redundant features improves classification accuracy. Original microarray data sets generally contain many irrelevant and redundant genes, which hurt the performance of feature extraction. In practice, biologists often expect noise to be reduced, at least to some extent, during feature extraction; if some redundant genes are removed beforehand, the performance of feature extraction may be improved.
2. Preliminary feature selection facilitates the application of feature extraction. Compared with modeling on the original data directly, the computational and memory cost of feature extraction on preliminarily selected data is much lower. Memory is the main concern: most feature extraction methods are impractical for high-dimensional data because they require loading all the data into RAM at once. However, any additional gene selection procedure brings extra computation, so the computational complexity of the preliminary feature selection must not be too high.
3. Preliminary feature selection improves the interpretability of the components. The meaning of the components produced by feature extraction is always difficult to interpret. Biologists often analyze the relation between extracted components and original features through the coefficients, but this is obscured by the large number of genes. Reducing the number of original features is obviously helpful when the components need to be related to original genes manually.
The previous metrics
Discriminative ability (predictive ability) is a general notion which can be measured in various ways and used to select significant features for classification. Many effective metrics have been proposed, such as the t-statistic, information gain, the χ² statistic, and the odds ratio [8,9]. Filter feature selection methods sort features by their discriminative ability scores, and the top-ranked features are retained as essential for classification.
However, the t-statistic and most other discriminative ability measures are computed on individual features and do not consider the redundancy between two features. Given two features with the same rank score, they may be redundant to each other when they are completely correlated, or complementary to each other when they are nearly independent.
For feature selection, we want to eliminate the redundant features and retain only the complementary ones. But the top-ranked feature set produced by filter methods contains many redundant features, which increase the dimensionality while contributing little to the final classification. In order to eliminate redundant features, a metric is needed that estimates the redundancy directly.
Notions of feature redundancy are normally phrased in terms of feature correlation. It is widely accepted that two features are redundant to each other if their values are completely correlated. In fact, it may not be so straightforward to determine feature redundancy when a feature is correlated with a set of features; the widely used approach is to approximate the redundancy of a feature set by the pairwise feature redundancies.
The linear correlation metric
For linear cases, the best-known pairwise redundancy metric is the linear correlation coefficient. Given a pair of features (x, y), the linear correlation coefficient Cor(x, y) is defined as

Cor(x, y) = Σ_i (x_i − x̄)(y_i − ȳ) / sqrt( Σ_i (x_i − x̄)² · Σ_i (y_i − ȳ)² ),

where x̄ and ȳ are the means of x and y respectively. The value of Cor(x, y) lies between -1 and 1. If x and y are completely correlated, Cor(x, y) takes the value 1 or -1; if they are independent, Cor(x, y) is zero. It is a symmetric metric.
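As a concrete illustration, the coefficient can be computed directly; this is a minimal sketch, and the helper name `linear_correlation` is ours, not from the paper:

```python
import math

def linear_correlation(x, y):
    """Pearson linear correlation coefficient Cor(x, y) between two features."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = math.sqrt(sum((xi - mx) ** 2 for xi in x)
                    * sum((yi - my) ** 2 for yi in y))
    return num / den

# Completely correlated features give |Cor| = 1
print(linear_correlation([1, 2, 3, 4], [2, 4, 6, 8]))   # 1.0
print(linear_correlation([1, 2, 3, 4], [8, 6, 4, 2]))   # -1.0
```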
The linear correlation coefficient has the advantage of efficiency and simplicity, but it is not suitable for redundant feature elimination when classification is the final target, since it uses no label information. For example, two highly correlated features whose values differ only slightly, but whose small differences happen to carry critical discriminative information, would be treated as a redundant pair; removing either of them would decrease classification accuracy. Guyon et al. also pointed out that high correlation (or anti-correlation) of variables does not imply an absence of variable complementarity [8]. The problem with the linear correlation coefficient is that it measures the similarity of the numerical values of two features, not the similarity of their discriminative abilities.
The ideal feature set should have both great discriminative ability and little feature redundancy, where redundancy could not be obtained by estimating their properties separately. A more elaborate measure of redundancy is required to estimate the differences of the discriminative ability between two features.
The proposed novel metric
In order to measure the similarity of the discriminative abilities of two features, discriminative ability must be defined more precisely. That is, we want to know which examples can be correctly classified by a given feature and which cannot. With such a definition, it becomes possible to compare the discriminative abilities of two features through the sets of examples they classify correctly.
In the field of text classification, Training Accuracy on Single Feature (TASF) has been shown to be an effective metric of discriminative ability [9]: a classifier is built on each single feature, and its training accuracy is used as the discriminative score.
Various classifiers can be used to calculate TASF; for simplicity, we consider a linear learner here. Given a feature z, the classification function is

f(z) = 1 if n¹ |z − z̄¹| < n² |z − z̄²|, and −1 otherwise,

where z̄¹ and n¹ are the feature mean and the sample size of class one, and z̄² and n² are the feature mean and the sample size of class two. This is a weighted centroid-based classifier, which predicts the class label whose weighted distance to its centroid is smaller. The computational complexity of this classifier is O(n).
Feeding the whole training set back, we estimate the training accuracy of the classifier built on each feature and use it to represent the discriminative ability of that feature: the higher the training accuracy, the greater the discriminative ability. Since only one feature is used to build the classifier, in most cases only part of the training examples can be correctly separated. The value of TASF ranges from 0 to 1, and a feature is considered irrelevant if its TASF value is no greater than 0.5.
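The TASF idea can be sketched as follows. This is a simplified, hypothetical implementation: it uses a plain nearest-centroid rule on a single feature, whereas the paper's classifier additionally weights distances by class size:

```python
def single_feature_classifier(feature, labels):
    """Fit a centroid-based classifier on one feature (labels in {+1, -1}).

    Returns a function z -> predicted label. This sketch assigns the class
    whose centroid is nearer; the paper's version weights the distances by
    class size, which is omitted here for clarity.
    """
    pos = [z for z, y in zip(feature, labels) if y == 1]
    neg = [z for z, y in zip(feature, labels) if y == -1]
    mu1, mu2 = sum(pos) / len(pos), sum(neg) / len(neg)
    return lambda z: 1 if abs(z - mu1) < abs(z - mu2) else -1

def tasf(feature, labels):
    """Training Accuracy on Single Feature: resubstitution accuracy."""
    clf = single_feature_classifier(feature, labels)
    return sum(clf(z) == y for z, y in zip(feature, labels)) / len(labels)

x = [0.1, 0.2, 0.3, 0.9, 1.0, 1.1]
y = [1, 1, 1, -1, -1, -1]
print(tasf(x, y))  # 1.0: this feature separates the classes perfectly
```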
Based on TASF, we propose a novel metric of feature redundancy. Given two features z_1 and z_2, two classifiers C_1 and C_2 can be constructed. Feeding the whole training set to the classifiers, each of C_1 and C_2 correctly classifies a subset of the samples. The differences between the correctly classified examples are used to estimate the similarity of the discriminative abilities. We record the classification results as in Table 5, where a + b + c + d equals the size n of the training set. The values (a + b)/n and (a + c)/n are the training accuracies of C_1 and C_2 respectively. The score a + d measures the similarity of the features, and the score b + c measures their dissimilarity. When b + c = 0, the two features z_1 and z_2 have exactly the same discriminative ability.
Table 5. Statistical relative classification results of two classifiers
Our feature elimination problem thus becomes deciding whether the contribution of an additional feature to a given feature is significant; the additional feature is considered redundant if its contribution is tiny. We therefore propose a novel metric of Redundancy based on DIScriminative Contribution (DISC). DISC(z_1, z_2), which estimates z_2's redundancy to z_1, is defined as

DISC(z_1, z_2) = d / (c + d).

The pairwise DISC metric is asymmetric, and its computational complexity is O(n).
Here c + d is the number of examples that cannot be correctly classified by C_1, and c is the number of those that are recovered through the collaboration of C_1 and C_2. Thus c/(c + d) is the discriminative contribution of C_2 to C_1, and d/(c + d) is the DISC metric of redundancy, which varies from 0 to 1. When the DISC score is 1, C_2's discriminative ability is covered by C_1's, and z_2 is completely redundant to z_1. When the DISC value is 0, all training examples can be correctly classified by the union of C_1 and C_2, and we consider z_2 complementary to z_1.
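Given the training predictions of two single-feature classifiers, the DISC score follows directly from the counts c and d above. A minimal sketch; the handling of the c + d = 0 case (C_1 already perfect, so z_2 adds nothing) is our assumption:

```python
def disc(preds1, preds2, labels):
    """DISC(z1, z2): redundancy of z2's discriminative ability given z1.

    preds1/preds2 are training-set predictions of the single-feature
    classifiers C1 and C2; returns d/(c + d) using the counts of Table 5.
    """
    # c: C1 wrong but C2 right; d: both wrong
    c = sum(p1 != y and p2 == y for p1, p2, y in zip(preds1, preds2, labels))
    d = sum(p1 != y and p2 != y for p1, p2, y in zip(preds1, preds2, labels))
    if c + d == 0:           # C1 classifies everything correctly already;
        return 1.0           # assumed convention: z2 is fully redundant
    return d / (c + d)

y      = [1, 1, -1, -1]
preds1 = [1, -1, -1, 1]   # C1 misclassifies samples 2 and 4
preds2 = [1, 1, -1, 1]    # C2 recovers sample 2 but also misses sample 4
print(disc(preds1, preds2, y))  # 0.5: half of C1's errors remain uncovered
```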
DISC is linear in two respects: the classifier is linear, and the cross discriminative abilities are counted in a linear way. Microarray problems fit this assumption, since most microarray data sets pose binary classification problems in which each gene plays an equal role in classification.
The proposed redundant gene elimination algorithms
The REDISC algorithm
Based on the DISC redundancy metric, we propose the REDISC algorithm (Redundancy Elimination based on Discriminative Contribution), which eliminates redundant features by their pairwise DISC scores. REDISC is illustrated in Figure 5. Its basic idea is as follows: first, REDISC filters out trivial features that have no discriminative ability on their own, using the TASF score threshold of 0.5; the remaining features are then ordered by their TASF scores. As we usually want to retain the more discriminative of two redundant features, REDISC tries to preserve the features with top TASF ranks, using two nested iterations to eliminate redundant features whose discriminative ability is covered by any higher-ranked feature. The computational complexity of REDISC is O(np^2).
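The algorithm itself is given in Figure 5; the following is only a schematic re-implementation of the idea under stated assumptions: the scoring functions are injected as parameters, and a candidate gene is dropped once its DISC redundancy to any kept, higher-ranked gene reaches the threshold δ (the exact comparison convention in the paper may differ):

```python
def redisc(features, labels, delta, tasf, disc_fn):
    """REDISC sketch: keep non-redundant features ranked by TASF.

    `features` maps a gene name to its expression vector; `tasf` and
    `disc_fn` are scoring functions in the spirit of the paper, passed in
    by the caller. Worst-case cost is O(n p^2) pairwise comparisons.
    """
    # 1. discard irrelevant genes (TASF <= 0.5) and rank the rest
    scored = [(name, tasf(vec, labels)) for name, vec in features.items()]
    ranked = sorted((s for s in scored if s[1] > 0.5),
                    key=lambda s: s[1], reverse=True)
    # 2. nested elimination: drop a gene redundant to any kept gene
    kept = []
    for name, _ in ranked:
        if all(disc_fn(features[k], features[name], labels) < delta
               for k in kept):
            kept.append(name)
    return kept

# Toy demo with stand-in scorers (not the paper's data):
feats = {"g1": [0, 0, 1, 1], "g2": [0, 0, 1, 1], "g3": [1, 0, 1, 0]}
y = [1, 1, -1, -1]
toy_tasf = lambda v, t: sum((1 if x < 0.5 else -1) == c
                            for x, c in zip(v, t)) / len(t)
toy_disc = lambda a, b, t: 1.0 if a == b else 0.0  # identical => redundant
print(redisc(feats, y, 0.5, toy_tasf, toy_disc))  # ['g1']
```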
Figure 5. The REDISC algorithm.
The RELIC algorithm
To compare our method with commonly used redundant feature elimination methods, we present the RELIC algorithm (Redundancy Elimination based on Linear Correlation) [10], which filters out redundant features by pairwise linear correlation. A threshold controls how many features are eliminated. RELIC is given in Figure 6; its computational complexity is also O(np^2).
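A schematic version of RELIC, assuming genes are scanned in their given column order (the published algorithm may rank them by a relevance score first):

```python
import numpy as np

def relic(X, delta):
    """Return indices of retained genes; X is the n x p expression matrix.

    A gene is eliminated when its absolute linear correlation with an
    already retained gene reaches the threshold delta. No label
    information is used, in contrast to REDISC.
    """
    kept = []
    for j in range(X.shape[1]):
        if all(abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) < delta
               for k in kept):
            kept.append(j)
    return kept

X = np.array([[1.0, 2.0, 5.0],
              [2.0, 4.0, 1.0],
              [3.0, 6.0, 4.0],
              [4.0, 8.0, 2.0]])
print(relic(X, delta=0.9))  # [0, 2]: column 1 duplicates column 0
```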
Figure 6. The RELIC algorithm.
Feature extraction techniques
Principal component analysis
Principal component analysis (PCA) is a well-known feature extraction method [11]. The basic idea of PCA is to reduce the dimensionality of a data set while retaining as much as possible of the variation in the original predictor variables. This is achieved by transforming the p original variables X = [x_1, x_2, ..., x_p] into a new set of K predictor variables T = [t_1, t_2, ..., t_K], which are linear combinations of the original variables. In mathematical terms, PCA sequentially maximizes the variance of a linear combination of the original predictor variables,

u_i = argmax Var(Xu) subject to u^T u = 1,

and subject to the constraint u_i^T S_X u_j = 0, ∀ 1 ≤ i < j. The orthogonality constraint ensures that the linear combinations are uncorrelated, i.e. Cov(Xu_i, Xu_j) = 0, i ≠ j. These linear combinations

t_i = Xu_i

are known as the principal components (PCs).
The maximum number of components K is determined by the number of nonzero eigenvalues, i.e. the rank of S_X, and K ≤ min(n, p). In practice, the maximum value of K is unnecessary: the tail components, which have tiny eigenvalues and represent little of the variance of the original data, are usually discarded. The value of K is often determined by cross-validation or by the proportion of explained variance [11]. The computational cost of PCA, determined by the number of original predictor variables p and the number of samples n, is of the order min(np² + p³, pn² + n³); in other words, the cost is O(pn² + n³) when p > n.
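A minimal PCA sketch via the SVD, which avoids forming the p × p covariance matrix when p > n (function and variable names are ours):

```python
import numpy as np

def pca_components(X, K):
    """Extract the first K principal component scores of X (samples x genes).

    Columns of X are centered; scores are Xc @ U, where U holds the top-K
    right singular vectors of the centered matrix (the eigenvectors of the
    covariance matrix), obtained here via SVD.
    """
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:K].T       # n x K matrix of component scores

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 50))  # 10 samples, 50 "genes"
T = pca_components(X, K=3)
print(T.shape)                  # (10, 3)
# The components are uncorrelated: off-diagonal covariances vanish
C = np.cov(T, rowvar=False)
print(np.allclose(C, np.diag(np.diag(C))))  # True
```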
Partial Least Squares
Partial Least Squares (PLS) was first developed as an algorithm for performing matrix decompositions and was then introduced as a multivariate regression tool in the context of chemometrics [12,13]. In recent years, PLS has also been found to be an effective feature extraction technique for tumor discrimination [14,15].
The underlying assumption of PLS is that the observed data are generated by a system or process driven by a small number of latent (not directly observed or measured) features. PLS therefore aims at finding uncorrelated linear transformations (latent components) of the original predictor features that have high covariance with the response features. Based on these latent components, PLS simultaneously predicts the response features y (the regression task) and reconstructs the original matrix X (the data modeling task).
The objective in constructing components in PLS is to maximize the covariance between the response variable y and linear combinations of the original predictor variables X,

w_i = argmax Cov(Xw, y) subject to w^T w = 1,

and subject to the constraint w_i^T S_X w_j = 0, ∀ 1 ≤ i < j. The central task of PLS is to obtain the vectors of optimal weights w_i (i = 1, ..., K) that form a small number of components, whereas PCA is an "unsupervised" method that uses the X data only.
Like PCA, PLS reduces the complexity of microarray data analysis by constructing a small number of gene components, which can be used to replace the large number of original gene expression measures. Moreover, obtained by maximizing the covariance between the components and the response variable, the PLS components are generally more predictive of the response variable than the principal components.
PLS is computationally efficient, with a cost of only O(npK); i.e., the number of calculations required by PLS is a linear function of n and p. It is thus much faster than PCA, since K is always less than n.
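A NIPALS-style sketch of PLS component extraction for a single response vector; each iteration uses only matrix-vector products, consistent with the O(npK) cost (function and variable names are ours, and this is one of several equivalent PLS formulations):

```python
import numpy as np

def pls_components(X, y, K):
    """Extract K PLS component scores (NIPALS-style, single response y).

    Each weight vector points in the direction of maximal covariance with
    y on the current (deflated) matrix; X and y are deflated after each
    component so that successive scores are orthogonal.
    """
    Xk = X - X.mean(axis=0)
    yk = y - y.mean()
    T = np.empty((X.shape[0], K))
    for i in range(K):
        w = Xk.T @ yk                    # direction of maximal covariance
        w /= np.linalg.norm(w)
        t = Xk @ w                       # i-th latent component score
        T[:, i] = t
        p = Xk.T @ t / (t @ t)           # loading vector for deflation
        Xk = Xk - np.outer(t, p)         # remove explained variation
        yk = yk - t * (yk @ t) / (t @ t)
    return T

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 100))
y = X[:, 0] + 0.1 * rng.normal(size=20)
T = pls_components(X, y, K=2)
print(T.shape)  # (20, 2)
```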
Feature extraction methods extract components, linear or nonlinear transformations of the original genes, to represent the original data. Although the new subspace is effective for data analysis, no original gene is excluded in the process, which often obstructs the interpretation of the components. Eliminating redundant genes before dimension reduction is an alternative way to address this problem.
Classifier – Support Vector Machines
Support vector machines (SVM), proposed by Vapnik and his coworkers in the 1990s, have developed quickly during the last decade [16] and have been successfully applied to biological data mining [17], drug discovery [18,19], etc. Denoting the training sample as S = {(x, y)} ⊆ {ℝ^n × {-1, 1}}^ℓ, the SVM discriminant hyperplane can be written as

f(x) = ⟨w, x⟩ + b,

where w is a weight vector and b is a bias. According to the generalization bound in statistical learning theory [20], for the 2-norm soft margin version of SVM we minimize the objective function

(1/2)‖w‖² + C Σ_i ξ_i²  subject to  y_i(⟨w, x_i⟩ + b) ≥ 1 − ξ_i, i = 1, ..., ℓ,

in which the slack variables ξ_i are introduced when the problem is otherwise infeasible. The constant C > 0 is a penalty parameter; a larger C assigns a larger penalty to errors.
Data Sets
The two microarray data sets used in our study are listed in Table 6. They are briefly described below, and the corresponding C4.5-format versions are available at [21]. We do not use the original training/test split by the authors; we merge each data set before using it.
Table 6. Experimental data sets
Colon: Affymetrix oligonucleotide arrays were used to monitor the expression of over 6,500 human genes in samples of 40 tumor and 22 normal colon tissues. The 2,000 genes with the highest minimal intensity across the 62 tissues were used in the analysis [2].
Leukemia: The acute leukemia data set was published by [1] and consists of 72 bone marrow samples, 47 ALL and 25 AML. The gene expression intensities were obtained from Affymetrix high-density oligonucleotide microarrays containing probes for 7,129 genes.
Experimental settings
We use a stratified 10-fold cross-validation procedure: each data set is first merged and then split into ten subsets of equal size. Each subset is used as a test set once, and the remaining subsets are combined as the training set. Within each cross-validation fold, the gene expression data are standardized: the training set expressions are transformed to zero mean and unit standard deviation across samples, and the test set is transformed with the means and standard deviations of the corresponding training set. We use 10 × 10 cross-validation because it is more reliable than the randomized resampling test strategy and leave-one-out cross-validation, owing to the correlations between test and training sets; detailed discussion can be found in [22].
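The per-fold standardization step can be sketched as follows; the guard against zero-variance genes is our addition:

```python
import numpy as np

def standardize_fold(train, test):
    """Standardize per gene using training statistics only.

    Each column (gene) of the training fold is scaled to zero mean and
    unit standard deviation; the test fold reuses the training mean and
    std, so no test information leaks into preprocessing.
    """
    mu = train.mean(axis=0)
    sd = train.std(axis=0)
    sd[sd == 0] = 1.0               # guard for constant genes (our addition)
    return (train - mu) / sd, (test - mu) / sd

train = np.array([[1.0, 10.0], [3.0, 30.0]])
test = np.array([[2.0, 20.0]])
tr, te = standardize_fold(train, test)
print(tr.mean(axis=0))  # [0. 0.]
print(te)               # [[0. 0.]]  (test scaled with training statistics)
```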
The linear Support Vector Machine (SVM) with C = 1 is used as the classifier; it is trained on the training set to predict the labels of the test samples. Figure 7 gives pseudocode for the complete 10 × 10 cross-validation procedure.
Figure 7. Experimental procedure for comparing different algorithms.
In order to precisely characterize the performance of different learning methods, we define several performance measures below (see [23]). Here TP, TN, FP, and FN, stand for the number of true positive, true negative, false positive, and false negative samples, respectively.
Sensitivity is defined as TP/(TP + FN) and is also known as Recall.
Specificity is defined as TN/(TN + FP).
BACC (Balanced Accuracy) is defined as (Sensitivity + Specificity)/2, the average of sensitivity and specificity.
PPV (Positive Predictive Value) is defined as TP/(TP + FP) and is also known as Precision.
NPV (Negative Predictive Value) is defined as TN/(TN + FN).
Correction is defined as (TP + TN)/(TP + TN + FP + FN) and measures the overall percentage of samples correctly classified.
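The measures above can be computed directly from the confusion-matrix counts; this small helper (names are ours) illustrates them on made-up counts:

```python
def performance_measures(tp, tn, fp, fn):
    """Compute the evaluation measures from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)          # a.k.a. Recall
    specificity = tn / (tn + fp)
    bacc = (sensitivity + specificity) / 2
    ppv = tp / (tp + fp)                  # a.k.a. Precision
    npv = tn / (tn + fn)
    correction = (tp + tn) / (tp + tn + fp + fn)
    return {"Sensitivity": sensitivity, "Specificity": specificity,
            "BACC": bacc, "PPV": ppv, "NPV": npv, "Correction": correction}

# Illustrative counts, not results from the paper:
m = performance_measures(tp=40, tn=15, fp=5, fn=2)
print(round(m["BACC"], 3))  # 0.851
```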
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
GuoZheng Li and XueQiang Zeng proposed the idea, designed the experiments and wrote the paper; XueQiang Zeng performed experiments; GengFeng Wu helped in writing the paper; Mary Qu Yang helped design the experiments; Jack Y. Yang conceived and guided the project.
Acknowledgements
This work was supported in part by the Natural Science Foundation of China under grant no. 20503015, the STCSM "Innovation Action Plan" Project of China under grant no. 07DZ19726, Shanghai Leading Academic Discipline Project under no. J50103, Systems Biology Research Foundation of Shanghai University and Scientific Research Fund of Jiangxi Provincial Education Departments under grant no. 200757.
This article has been published as part of BMC Bioinformatics Volume 9 Supplement 6, 2008: Symposium of Computations in Bioinformatics and Bioscience (SCBB07). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/9?issue=S6.
References
1. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES: Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science 1999, 286(5439):531-537.
2. Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, Levine AJ: Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences of the United States of America 1999, 96:6745-6750.
3. Antoniadis A, Lambert-Lacroix S, Leblanc F: Effective dimension reduction methods for tumor classification using gene expression data. Bioinformatics 2003, 19(5):563-570.
4. Nguyen DV, Rocke DM: On partial least squares dimension reduction for microarray-based classification: a simulation study. Computational Statistics & Data Analysis 2004, 46(3):407-425.
5. Dai JJ, Lieu L, Rocke D: Dimension reduction for classification with gene expression data. Statistical Applications in Genetics and Molecular Biology 2006, 5:Article 6.
6. Yu L, Liu H: Redundancy Based Feature Selection for Microarray Data. Proc. 10th ACM SIGKDD Conf. Knowledge Discovery and Data Mining 2004, 22-25.
7. Yu L, Liu H: Efficient Feature Selection via Analysis of Relevance and Redundancy. Journal of Machine Learning Research 2004, 5:1205-1224.
8. Guyon I, Elisseeff A: An Introduction to Variable and Feature Selection. Journal of Machine Learning Research 2003, 3:1157-1182.
9. Forman G: An Extensive Empirical Study of Feature Selection Metrics for Text Classification. Journal of Machine Learning Research 2003, 3:1289-1305.
10. Hall MA, Holmes G: Benchmarking attribute selection techniques for discrete class data mining. IEEE Transactions on Knowledge and Data Engineering 2003, 15(6):1437-1447.
11. Jolliffe IT: Principal Component Analysis. Second edition. Springer Series in Statistics, Springer; 2002.
12. Wold S, Ruhe A, Wold H, Dunn W: The collinearity problem in linear regression. The partial least squares (PLS) approach to generalized inverses. SIAM Journal on Scientific and Statistical Computing 1984, 5(3):735-743.
13. Boulesteix AL, Strimmer K: Partial Least Squares: A Versatile Tool for the Analysis of High-Dimensional Genomic Data. Briefings in Bioinformatics 2006.
14. Nguyen DV, Rocke DM: Multi-class cancer classification via partial least squares with gene expression profiles. Bioinformatics 2002, 18(9):1216-1226.
15. Nguyen DV, Rocke DM: Tumor classification by partial least squares using microarray gene expression data. Bioinformatics 2002, 18:39-50.
16. Cristianini N, Shawe-Taylor J: An Introduction to Support Vector Machines. Cambridge: Cambridge University Press; 2000.
17. Guyon I, Weston J, Barnhill S, Vapnik V: Gene Selection for Cancer Classification Using Support Vector Machines. Machine Learning 2002, 46:389-422.
18. Xue Y, Li ZR, Yap CW, Sun LZ, Chen X, Chen YZ: Effect of Molecular Descriptor Feature Selection in Support Vector Machine Classification of Pharmacokinetic and Toxicological Properties of Chemical Agents. Journal of Chemical Information and Computer Sciences 2004, 44(5):1630-1638.
19. Bhavani S, Nagargadde A, Thawani A, Sridhar V, Chandra N: Substructure-Based Support Vector Machine Classifiers for Prediction of Adverse Effects in Diverse Classes of Drugs. Journal of Chemical Information and Modeling 2006, 46(6):2478-2486.
20. Vapnik V: Statistical Learning Theory. New York: Wiley; 1998.
21. Li J, Liu H: Kent Ridge Biomedical Data Set Repository. 2002. [http://www.cs.shu.edu.cn/gzli/data/mirrorkentridge.html]
22. Dietterich TG: Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. Neural Computation 1998, 10:1895-1923.
23. Levner I: Feature Selection and Nearest Centroid Classification for Protein Mass Spectrometry. BMC Bioinformatics 2005, 6:68.