Email updates

Keep up to date with the latest news and content from BMC Bioinformatics and BioMed Central.

Open Access Highly Accessed Methodology article

Recursive SVM feature selection and sample classification for mass-spectrometry and microarray data

Xuegong Zhang1*, Xin Lu23, Qian Shi3, Xiu-qin Xu4, Hon-chiu E Leung4, Lyndsay N Harris3, James D Iglehart3, Alexander Miron3, Jun S Liu5 and Wing H Wong6

Author Affiliations

1 Bioinformatics Div, TNLIST and Dept of Automation. Tsinghua University, Beijing, 100084, China

2 Dept of Biostatistics, Harvard School of Public Health, 655 Huntington Ave., Boston, MA 02115, USA

3 Department of Cancer Biology, Dana-Farber Cancer Institute, Boston, MA, 02115, USA

4 Medical Proteomics and Bioanalysis Section, Genome Institute of Singapore, Singapore

5 Dept of Statistics, Harvard University, 1 Oxford St., Cambridge, MA 02138, USA

6 Department of Statistics, Stanford University, Stanford, CA 94305, USA

For all author emails, please log on.

BMC Bioinformatics 2006, 7:197  doi:10.1186/1471-2105-7-197

The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1471-2105/7/197


Received:23 January 2006
Accepted:10 April 2006
Published:10 April 2006

© 2006 Zhang et al; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

Like microarray-based investigations, high-throughput proteomics techniques require machine learning algorithms to identify biomarkers that are informative for biological classification problems. Feature selection and classification algorithms need to be robust to noise and outliers in the data.

Results

We developed a recursive support vector machine (R-SVM) algorithm to select important genes/biomarkers for the classification of noisy data. We compared its performance to a similar, state-of-the-art method (SVM recursive feature elimination or SVM-RFE), paying special attention to the ability of recovering the true informative genes/biomarkers and the robustness to outliers in the data. Simulation experiments show that a 5 %-~20 % improvement over SVM-RFE can be achieved regard to these properties. The SVM-based methods are also compared with a conventional univariate method and their respective strengths and weaknesses are discussed. R-SVM was applied to two sets of SELDI-TOF-MS proteomics data, one from a human breast cancer study and the other from a study on rat liver cirrhosis. Important biomarkers found by the algorithm were validated by follow-up biological experiments.

Conclusion

The proposed R-SVM method is suitable for analyzing noisy high-throughput proteomics and microarray data and it outperforms SVM-RFE in the robustness to noise and in the ability to recover informative features. The multivariate SVM-based method outperforms the univariate method in the classification performance, but univariate methods can reveal more of the differentially expressed features especially when there are correlations between the features.

Background

Accurate classification of patients with complex diseases such as cancer is crucial for successful treatment of the diseases. High-throughput proteomics techniques based on mass spectrometry (MS) have made it possible to investigate proteins over a wide range of molecular weights in a style similar to gene expression studies with DNA microarrays. The advancement of these techniques has generated many analytic challenges, among which a central task is to discover 'signature' protein profiles specific to each pathologic state (e.g. normal vs. cancer) or differential profiles between experimental conditions (e.g drug responses) from high-dimensional data [1]. The technique of Surface Enhanced

Laser Desorption/Ionization Time-of-Flight Mass Spectrometry (SELDI-TOF-MS) [2] has been used in many recent disease studies [3-5]. Although it is a convenient method for screening a cohort of samples and finding promising protein markers from serum or plasma samples, it suffers from a relatively low sensitivity and specificity and a high noise level [6].

Like in many other biological studies, a key difficulty in such high-throughput studies is the noisy nature of the data, which can be caused by the intrinsic complexity of the biological problems, as well as experimental and technical imperfections. Another difficulty arises from the high dimensionality of the data. Similar to the situation in microarray studies, typically one proteomics investigation only involves several tens of samples but the measured points on the mass spectrum can be in the thousands or more. Even after pre-processing steps such as peak and/or biomarker detection, the dimensionality is usually still larger than or comparable to the sample size. This makes many standard pattern classification algorithms fail. For those that do work theoretically, there is a high risk of overfitting due to the small sample size. Thus, there is an algorithmic need for feature selection in addition to the biological need of discovering a manageable set of key molecular factors (genes or biomarkers) behind the disease. As observed in [7] and other investigations, for machine learning methods such as support-vector machines (SVMs) that can work at high-dimensionality, dimension reduction could still improve the performance dramatically. However, it should be noted that when validating the performance of a classification algorithm with feature-selection steps, the feature selection procedure should also be validated simultaneously to avoid bias in the assessment. Also, due to the small sample size, the cross-validation prediction of the algorithm's performance tends to have a high variance. Thus, we should pay more attention to properties related to generalization ability rather than the prediction performance per se. We suspect that failing to do so may be a reason why good results published from one investigation may not be reproduced by other investigations.

Guyon et. al. [7] proposed a SVM-RFE (support vector machine recursive feature elimination) algorithm to recursively classify the samples with SVM and select genes according to their weights in the SVM classifiers. We proposed a method R-SVM with a similar recursive strategy but used a different criterion to evaluate and select the most important genes [8], and a correct scheme to estimate the prediction performance. In this paper, we describe the R-SVM method with a voting scheme for feature selection and compare its performance with SVM-RFE on simulation data and two SELDI-TOF-MS datasets, one on rat liver cirrhosis and another on human breast cancer. We found that cross-validation prediction performances of R-SVM and SVM-RFE were nearly the same, but R-SVM was more robust to noise and outliers in discovering informative genes and therefore has better accuracy on independent test data.

R-SVM and SVM-RFE represent typical machine-learning-based multivariate approaches for the task. Conventional univariate methods are also frequently used for feature selection and classification. We compared the SVM-based methods with the weighted-voting (WV) method [9] on simulation studies, and discussed their respective strengths and weaknesses. Although our real applications were conducted on MS data, the method can be applied broadly to microarray data and other high-throughput genomics and proteomics data.

Results and discussion

Simulated data sets

We generated three types of simulated data to investigate the characteristics of the feature selection and classification methods. The basic simulation model for the first two types of data is as follows: Each sample contains simulated expression values of 1000 genes. Among all the genes, 300 are "informative" ones, each following independently the Gaussian distribution N(0.25, 1) for class 1 and N(-0.25, 1) for class 2. The rest 700 "uninformative" genes follow independently N(0,1) for both classes. For each simulation experiment, we generated a training set of 100 samples (50 for each class) and an independent test set of 1000 samples (500 for each class). The two types of simulated data were generated by adding noises to this general model in different ways. The first type (Data-G) mimics the situation where some of the gene expressions in some samples are outliers. For the informative genes, we randomly chose 5% of expression values in all samples as outliers by making them to follow N(0.25, 100) for the class-1 sample and N(-0.25, 100) for the class-2 sample. The second type of simulated datasets (Data-S) is constructed to contain 5% "outlier samples," which were made by randomly picking 5% of the samples and increasing the standard deviation of every gene in these samples by 10 fold. We did 100 simulations for each type of the data (Data-G and Data-S).

The above two simulation models are over-simplified in many aspects. To mimic more realistic situations, we generated the third type of simulated data based on a real human breast cancer microarray dataset obtained with Affymetrix U133 Plus 2.0 microarrays. The dataset originally contains 78 estrogen receptor positive (ER+) cases and 54 estrogen receptor negative (ER-) cases. The data were normalized by the GCRMA algorithm [10,11], and the gene (probe-set) expression levels were log2-transformed. According to our previous experiments as well as published work (e.g., [12,13]), the ER status is one of the most predominant partitioning factors for molecular classification of breast cancer. We therefore took the differentially expressed genes between the two classes as "truly informative" genes. We extracted the genes whose average differences between the two classes are greater than 0.3, which gave us 16,722 genes (if we use t-test tofind the differentially expressed genes, there will be 24,414 genes come out with FDR-adjusted p-value cutoff 0.05). From these genes, we randomly selected 300 to be the "informative genes" in the simulation dataset. Then, we randomly took another 700 genes and "force" them to be uninformative by permuting their sample labels. By combining the "informative genes" and "uninformative genes", we obtained a simulated dataset in which there are 300 informative genes and 700 uninformative genes. The correlations among the informative genes and among the uninformative genes are the same as those in the original dataset. From this dataset, we randomly took 45 ER+ and 30 ER- samples as the training set, and used the remaining samples as independent test set. We generated 100 sets of data by this strategy. We call this type of datasets Data-R in our experiments.

Real SELDI proteomics data sets

We applied the two SVM-based methods to two real SELDI-TOF-MS proteomics datasets. The first dataset is from a rat model used to discover serum biomarkers of liver cirrhosis [14]. It contains serum protein profiles of 26 normal rats and 69 thioacetamide-induced liver cirrhosis rats. The biomarker function of Ciphergen ProteinChip software 3.0 detected 93 biomarkers from the raw data. They were normalized according to their mean values, small values were truncated to 1 and all biomarkers were log transformed. The second proteomics dataset came from a human breast cancer study. We obtained plasma samples from 75 breast cancer patients and 61 healthy women [15]. The plasma samples were pH-fractionated and analyzed by SELDI-TOF-MS. The Ciphergen ProteinChip software 3.1 detected 98 biomarkers, which were preprocessed in the same way as for the rat liver cirrhosis data.

Comparison of R-SVM with SVM-RFE on simulated datasets

We first compared the performance of R-SVM to SVM-RFE on Data-G and Data-S. For each of these datasets, we applied R-SVM and SVM-RFE to perform gene selection, built SVM models on training data with selected genes, and tested the models on the independent test data. Experiments were done 100 times for each data type. The following aspects of performances are inspected at each level of gene selection: number of SVs (support vectors, see Method) used in the SVM model, test error, the percentage of true informative features recovered in the selection, and for Data-S the number of outlier samples used as SVs. Tables 1 and 2 show the relative improvements of R-SVM over SVM-RFE with regard to these factors averaged on the 100 experiments for each data type, as well as the p-values (by paired t-test) of the differences. The cross-validation (CV) errors on training sets are similar between the two methods so the comparison is not shown in the tables (Slight improvement of R-SVM over SVM-RFE can be observed on average errors but the improvement is not significant).

Table 1. Comparison of R-SVM and SVM-RFE on Data-G (with gene outliers)

Table 2. Comparison of R-SVM and SVM-RFE on Data-S (with sample outliers)

It can be seen on both Data-G and Data-S that at most of the selection levels, especially at those lower levels (with fewer selected genes), the number of SVs used by R-SVM is 5 %~8 % fewer than that of SVM-RFE, indicating the better generalization ability of R-SVM. One can also see that at the same selection level, R-SVM recovers significantly more of the informative genes than SVM-RFE, and the improvement is about 3 %~7 % at lower selection levels. These two factors can explain the observation that independent test errors of R-SVM were significantly lower than that of SVM-RFE (by 5–10%) at lower selection levels. On the Data-S with outlier samples, R-SVM also shows a better ability to avoid taking the outlier samples as SVs (R-SVM reduces up to 94% of the outlier SVs than SVM-RFE).

As simulation models for Data-G and Data-S are too simplistic, we compared on Data-R and the results are shown in Table 3. It can be seen that the improvement of R-SVM over SVM-RFE with regard to both the number of SVs used and the number of informative genes recovered are more significant. In terms of testing errors on the independent data, SVM-RFE outperformed R-SVM initially at high selection levels. When fewer genes were selected, R-SVM gradually out-raced SVM-RFE, and the improvement of R-SVM over SVM-RFE became more obvious as the number of selected features decreased. This tendency was also observed on Data-G and Data-S. It is likely due to the fact that R-SVM selected more informative features and that it is more important to include truly informative ones when fewer features are used in a classifier.

Table 3. Comparison of R-SVM and SVM-RFE on Data-R

Application on the rat liver cirrhosis data

We applied R-SVM and SVM-RFE to the two real SELDI-TOF-MS datasets. Table 4 shows the results on the rat liver cirrhosis data. Since there is no independent test set and there is no standard answer about the true informative features, we only list the cross-validation errors and the number of SVs in Table 4. It can be seen that at some selection levels R-SVM achieved smaller CV errors, while at others SVM-RFE gave smaller CV errors, but the differences are not significant. However, at most levels, R-SVM uses fewer SVs than SVM-RFE. (Note that in Tables 4 and 6, the number of SVs is the average among the cross-validation experiments, different from that in Tables 1 and 2.)

Table 4. The CV results on the rat cirrhosis data

Table 5. The top 6 R-SVM-selected biomarkers with their t-test and ROC statistics

Table 6. The CV results on the human breast cancer dataset

The top 6 biomarkers selected by R-SVM were reported for further biological investigation. They are listed in Table 5 along with their t-statistics and ROC statistics. We see that the top 6 markers are all significantly correlated with the sample classification on their own, but not necessarily be ranked at the top according to the univariate criterion. Among the top 6 biomarkers, the 3,495 Da protein was down-regulated in the liver cirrhosis rats. On-chip purification and tryptic digestion was conducted. Combined data from PMF (peptide mass fingerprint) and MALDI-TOF/TOF MS/MS spectra suggested that this 3495 Da protein shares homology to a histidine-rich glycoprotein [14]. It has been reported that the mRNA of this gene was found to be specifically expressed only in liver [16]. We speculate that down-regulation of histidine-rich glycoprotein in cirrhotic liver may be a manifestation of loss of normal liver function, including secretary pathways upon treatment with thioacetamide [14].

Application on the human breast cancer data

The results of R-SVM and SVM-RFE on the breast cancer dataset are listed in Table 6. Error rates and SV numbers on this dataset were higher than in the rat liver cirrhosis study, due to the complexity of the problem under study and individual variations among human individuals. Still, we found that R-SVM tends to use fewer SVs at most selection levels and, hence, may have a better generalizability than SVM-RFE.

The minimal CV error rate (17.6%) is achieved at the 8-feature level by R-SVM. These 8 markers and their t-statistics and ROC statistics are listed in Table 7. Six of the top 8 R-SVM selected markers are called significant by t-statistics (p-value<0.01; 5 of them with p-value<0.0001). The top marker, Marker-5, has an AUC (area under the ROC curve) of only 0.775, but the 8 markers jointly classify 82.4% of all cases correctly. The AUC of the SVM model built on the R-SVM-selected with 8 markers is 0.928, much larger than that of the best single marker, Marker-5. Using direct on-chip sequencing provided by Ciphergen, one important peptide (Peptide A) was identified and was selected for further study. Follow-up biological study showed that this peptide may be an important indicator of the disease status of breast cancer patients [15].

Table 7. T-statistics and ROC statistics of the 8 R-SVM-selected markers on the breast cancer data

We also applied the Random Forest (RF) method [17] to this human breast cancer dataset. With the DecreaseGiniDistance criterion [17], we selected the 8 most important markers from this data, of which 6 are also among the 8 markers selected by R-SVM. The out-of-bag error reported by the RF method was 25.7%, which is higher than the minimal CV error of 17.6% reached by R-SVM. However, these two error rates are not directly comparable. The out-of-bag error reported by RF is based on a resampling approach, which makes the effective sample size of its training set smaller than the full size and, therefore, causes the estimated prediction error to be biased upwards. When the sample size is small, the bias of resampling error without any correction can be large. On the other hand, although cross validation error by the CV2 scheme (see Method) is unbiased, selecting the minimal error among multiple levels of feature eliminations may lead to a slightly over-optimistic estimate. Taking these two factors together, we conclude that the two tested methods performed similarly on this dataset.

Comparison with the univariate method

Many researchers believe that it is beneficial to use multivariate methods to analyze microarray and proteomics data as genes usually work in collaborative ways rather than as independent factors. However, univariate methods are still useful for identifying differentially expression features and continue to play important roles in many applications. Therefore it is worthwhile to compare the performances of the SVM-based methods with more conventional univariate approaches, of which we take the weighted-voting (WV) method as a representative in this work.

Tables 8 and 9 show the comparison of R-SVM and WV on Data-G and Data-S. It can be seen that R-SVM outperformed WV on Data-G in both test accuracy and the recovery of informative genes at all selection levels, and the improvement is very significant. However, R-SVM was significantly inferior to WV in both aspects on Data-S, and the difference was more obvious when fewer genes are selected. These observations suggest that R-SVM is more robust than WV to outlier values spreading randomly in the dataset, but suffers more when some samples are entirely corrupted. Indeed, although R-SVM uses the class means to represent the samples for feature selection, the standard SVM uses only boundary samples (SVs) in building the classifiers, thus a few outlier samples can make big effect on degrading its performance. (On this specific simulation model, the effect of the outlier samples can be easily cancelled by taking a normalization step on the samples.)

Table 8. The comparison of SVM vs. WV on Data-G

Table 9. The comparison of R-SVM vs. WV on Data-S

Table 10 gives the comparison of R-SVM and WV on Data-R. It is interesting to note that, unlike on Data-G or Data-S, R-SVM is superior to WV in terms of test accuracy on independent test sets, but is inferior to WV in terms of the number of recovered informative genes. The differences become smaller when fewer genes were used, but they are still significant. This phenomenon has not been observed in the other types of simulated data, and is likely caused by the fact that some genes in Data-R are correlated as in real microarray and MS data, whereas the genes in Data-G and Data-S are independently generated. As a multivariate method, SVM uses the genes not according to their individual differences between the classes, but rather according to the collaborative information of multiple genes. Thus, some correlated genes will not be selected due to the redundancy of information, even if they are all differentially expressed. This is not a disadvantage for SVM when the major goal is better classification. However, if the goal includes discovering all genes that are informative, new strategies will be needed for improving SVM-like methods on this aspect.

Table 10. Comparison of R-SVM vs. WV on Data-R

We should note that all our simulation designs (Data-G, Data-S, and Data-R) are favorable to univariate methods since the true underlying classification information is in the differential expression of individual genes. The models haven't consider any collaborative effects (not even additive effects), thus they bias in favor of the single-variable approach a priori. We expect that in more complicated situations when combinatorial effects play major roles, SVM-based methods will perform even better.

Conclusion

High-throughput genomics and proteomics data open a new route to the classification of complex diseases. Machine learning methods for feature selection and classification have been playing active roles in analyzing such data. We compared two similar methods, SVM-RFE and R-SVM, both adopting recursive procedures to select features according to their importance in SVM classifiers. The major difference between the two methods is the criteria used for evaluating the contribution of genes. Although the two methods did not differ significantly in their cross-validation performances, it appeared that R-SVM is more robust to severe noise and outliers and can recover more informative genes. The successful application of R-SVM on the rat liver cirrhosis SELDI data and human breast cancer SELDI data show that the proposed strategy can help to identify biologically important markers.

We compared R-SVM with a representative univariate method, the weighted-voting method on simulated data. Although the comparison is limited in scope (especially, no combinatorial effects have been simulated in the models), some interesting insights regarding their respective strengths and weaknesses are observed. The SVM-based method performs better in terms of the classification accuracy, but univariate methods can reveal more of the differentially expressed individual features. A more systematic comparative study of the univariate and multivariate methods can be helpful for better understanding the nature of the methods and problems.

Availability and requirements

AR code package of the R-SVM method and a Linux-based executable package are freely available.

Project name: R-SVM (R and Linux versions)

Project home page: http://www.hsph.harvard.edu/bioinfocore/R-SVM.html webcite

Operating systems: R version: Windows XP, Linux; Linux version: Linux.

Programming languages: R, C/C++

Other requirements: R for the R version; SVMTorch (provided) for the Linux version

License: free

Methods

Assessing the performance of feature selection

Since an independent test set is not available in many investigations, cross-validation (e.g., leave-one-out cross-validation or LOOCV) is often used to assess the accuracy of classifiers. It should be noted that feature selection results may vary with even a single-case difference in the training set when the sample size is small. In some literature, feature selection steps were external to the cross validation procedures, i.e., the feature selection was done with all the samples and the cross-validation was only done for the classification procedure. We call this kind of cross validation CV1, with examples including [7,9,18-21]. As pointed out by [22-24], CV1 may severely bias the evaluation in favor of the studied method due to "information leak" in the feature selection step. For example, in a pilot study, we generated 100 samples with 1000 features for each sample, all coming invariably from the Gaussian distribution N(0,1). We randomly assigned the samples into two classes ("fake-classes"). Since the data set is totally non-informative, the faithful CV error should be around 50% no matter what method is used. But by CV1 scheme we could achieve a CV error as low as 0.025 after recursive feature selection, which shows that the bias caused by the improper cross-validation scheme can be surprisingly large. A more proper approach is to include the feature selection procedure in the cross validation, i.e., to leave the test sample(s) out from the training set before undergoing any feature selection. In this way, not only the classification algorithm, but also the feature selection method is validated. We call this scheme CV2 and use it in all of our investigations throughout. For the above "fake-class" data, the error rate evaluated by CV2 was always around 50% regardless of the specific method used for feature selection and classification.

The relative importance of features in SVM classifiers

As a powerful and popular multivariate machine-learning method, SVMs have been widely used in biological classification problems. The key idea of the SVM [25-27] is to maximize the margin separating the two classes while minimizing the total classification errors. The theory and algorithm of SVM has been described in many papers and books (e.g., [26-29]), and several sets of SVM codes are publicly available such as the SVMTorch [30] we use. The SVM can be extended to its nonlinear forms by using proper kernels to replace the inner-products. However, information is usually far from sufficient for reliably estimating nonlinear relations for microarray or MS data. We therefore use the linear SVM here and take it as a reasonable first-order approximation to the "truth" with our limited data. The decision function of a linear SVM is:

where x is the gene/protein expression vector of a sample, xi is that of sample i in the training set (i = 1,2,...,n), yi ∈{+1,-1} is its corresponding class label, is the vector of weights of the features, and b is a scalar offset. The ai 's and b are estimated from the training set. Only those samples closest to the separating boundary (called support vectors or SVs) have non-zero ai 's, and therefore the function f(x) is a linear combination of only the SVs. For a new sample x, the sign of the decision function f(x) predicts the class it belongs to, and the absolute value of f(x) represents how far the sample is from the boundary. Theoretical investigations show that the proportion of SVs in the training set reflects an upper bound of the expected generalization error (error rate of predictions on future samples) [28].

Our goal is to select a subset of features with the maximum discriminatory power between the two classes. Since the feature dimension is large and the sample size is small, there are usually many combinations of features that can give zero error on the training data. Therefore, the "minimal error" criterion cannot work. Intuitively, it is desirable to find a set of features that give the maximal separation between the two classes of samples. Taking the possible unbalanced sample size into consideration, we define the following measure:

where n1 and n2 are the numbers of samples in class 1 and 2. The larger S is, the better separated the two classes are. Considering (1) and denoting the means of feature j in the two classes as and , we get:

where d is the total number of features, and wj is the jth component of the weight vector w . S is equivalent to the cosine of the angle between the normal vector of the separation plane (hyperplane) f(x) = 0 and the vector m+-m- connecting the two class-means. It is the sum of terms defined on the single features, so we can define the contribution of feature j in S as

We call sj the contribution factor of feature j. It is not only decided by the weight wj in the classifier function, but also decided by the data (the class-means).

An alternative way for measuring the importance of the features is using the square of weight as derived in SVM-RFE by sensitivity analysis [7]. This can be viewed as using the weighted sums of support vectors in each class as the representatives, i.e.,

The separation between the two representatives can be written as

which can be decomposed into the sum the contributions of every features:

We note that, in the case where distributions of the two classes are both high-dimensional Gaussian with identical covariance matrix ∑, the optimal w is Fisher's linear discriminant function of the form w = (m+ - m-)T-1. Thus, (4) corresponds to the individual component of (m+ - m-)T-1(m+ -m-), whereas (7) corresponds to the individual component of (m+ - m-)T-2(m+ - m-). It thus appears that (7) may have put too much emphasis on the covariance matrix implicitly estimated from the observed high dimensional data. In contrast, our scheme of using class-means as representatives ties in well with the classical linear discriminant analysis. Using class-means as representatives of the two classes makes the method less sensitive to noise and possible outliers, comparing to using only the few support vectors [31]. A more robust representative for a class of samples is the class-medians (or medoids in higher dimensions), as studied in [32]. However, in datasets we have tested, we did not observe significant differences between using medians and means. This is probably because the distributions of genes are reasonably symmetric so that the medians are close to the means.

Recursive classification and feature selection

To select a subset of features that contribute the most in the classification, we rank all the features according to sj defined in (4) and choose the top ones from the list. We use this strategy recursively in the following procedures:

Step 0. Define a decreasing series of feature numbers d0> d1> d2 > ... > dk to be selected in the series of selection steps. Set i = 0 and d0 = d(i.e., start with all features).

Step 1. At step i, build the SVM decision function with current di features.

Step 2. Rank the features according to their contribution factors sj in the trained SVM and select the top di+1 features (eliminate the bottom di di+1 features).

Step 3. Set i = i+1. Repeat from Step 1 until i = k.

This is an implementation of the backward feature elimination scheme described in pattern recognition textbooks (e.g., [33]) with criteria defined on SVM models at each feature-selection level. It should be noted that this scheme is suboptimal as it does not exhaustively search in the space of all possible combinations. Our choices of the number of iterations and the number of features to be selected in each iteration are very ad hoc. Although different settings of these parameters may affect the results, we have observed that, for most cases when the two classes can be reasonably separated with the expression data, the classification performances achieved with different settings were very close to each other, and the majority of features ranked at the top positions were also very stable.

We follow the CV2 scheme to estimate the error rate at each level. In cross-validation experiments, different training subsets generate different lists of features (although many or most of them overlap in usual experiments). A frequency-based selection method is adopted to decide the lists of features to be reported [34]. That is, after the recursive feature selection steps on each subset, we count at each of the di levels the frequency of the features being selected among all rounds of cross-validation experiments. The top di most frequently selected features are reported as the final di features (called the top features).

In most situations, CV2 errors usually follow a U-shaped curve along the selection steps (feature numbers). Finding the minimal number of features that can give the minimal CV2 error rate is often desirable for real applications. Another realistic consideration is the limited ability of follow-up biological investigations on the selected features. As a compromise, we decide the final number of features to be reported in an experiment by considering both the error rates and the limitation of follow-up biological investigations. For example, in our proteomics applications, we chose to report the number of features that is less than 10 and gives the minimum CV2 error rate for less than 10 features. The entire workflow is depicted in Figure 1; we call this whole scheme R-SVM (recursive SVM).

thumbnailFigure 1. Workflow of the R-SVM algorithm.

The SVM-RFE method

We noted that the original SVM-RFE [7] ranked the genes only once using all samples, and used the top ranked genes in the succeeding cross-validation for the classifier. This is a typical CV 1 scheme which will generate a biased estimation of errors. In order to compare SVM-RFE with the proposed R-SVM algorithm fairly, we wrote our own version of SVM-RFE following the same workflow of R-SVM with the correct cross-validation scheme and the frequency-based selection method, and using the criterion instead of sj to sort genes.

The weighted-voting method

The basic idea of the weighted voting method by Golub et al [9] is closely related to the two-sample t-statistics. First, one defines the "correlation" between the expression values of a gene g to the classes

cg = [μ1(g) - μ2(g)]/[σ1(g) + σ2(g)], (8)

where [μ1(g), σ1(g)] and [μ2(g), σ2(g)] denote the means and standard deviations of the expression levels of the gene for the samples in class 1 and class 2, respectively. The larger the absolute value |cg| is, the more important the gene is for predicting the class memberships. The genes are ranked by their |cg|'s and the top ones are selected. The class predictor is trained on the set of training samples with the selected genes. For predicting the membership of a new sample x with expression xg of gene g, the vote of gene g is vg = cg (xg - bg) where bg = [μ1(g) + μ2(g)]/2. A positive vote means a vote for class 1 and a negative vote is for class 2. The sign of the total vote is used as the final predictor.

Authors' contributions

XZ and WHW initiated this project. XZ invented the basic strategy of R-SVM with the help of WHW and they did the study on the two schemes of cross-validation. XZ developed the initial version of the algorithm. XL added the voting scheme, wrote the codes implementing the current algorithm, and did the simulation experiments and most of the computation on the proteomics data. XL and JSL derived the theoretical relationship between R-SVM and SVM-RFE. JSL and WHW guided XZ and XL on analyzing the experiments, comparing the methods, and improving the algorithms. QS, LNH, JDI and AM produced the breast cancer proteomics data and worked on the biological analysis of the results. XX and HEL produced the rat liver cirrhosis data and did the biological analysis and follow-up validation of the results. XZ, XL and JSL executed the writing with input from all authors.

Acknowledgements

The authors thank Dr. Andrea Richardson for helpful discussion on the breast cancer experiment, and Lih-yin Lim for technical assistance in protein profiling in the rat liver cirrhosis study. Benoit Valin of Tsinghua University re-wrote the Linux version of the R-SVM code (as provided on the website) after this work was done. This work is supported in part by NSF grant DMS-09000166, NIH grant 1R01HG02341, R01HG02518, NSFC grants (60575014, 10228102/A010201) and the National Basic Research Program (2004CB518605, 2004BA711A21) of China, and NCI SPORE in Breast Cancer at Harvard University and the Women's Cancer Program at Dana Farber Cancer Institute. We thank Dr. Cristian Castillo-Davis at Harvard University for helping us to improve the writing, and the referees for numerous constructive comments.

References

  1. Yasui Y, Pepe M, Thompson ML, Adam BL, Wright GL, Qu YS, Potter JD, Winget M, Thornquist M, Feng ZD: A data-analytic strategy for protein biomarker discovery: profiling of high-dimensional proteomic data for cancer detection.

    Biostatistics 2003, 4(3):449-463. Publisher Full Text OpenURL

  2. Fung ET, Enderwick C: ProteinChip clinical proteomics: computational challenges and solutions.

    Biotechniques 2002, (Suppl):34-38.

    40–41

    PubMed Abstract OpenURL

  3. Petricoin EF III, Ardekani AM, Hitt BA, Levine PJ, Fusaro VA, Steinberg SM, Mills GB, Simone C, Fishman DA, Kohn EC, Loitta LA: Use of proteomic patterns in serum to identify ovarian cancer.

    The Lancet 2002, 359:572-577. Publisher Full Text OpenURL

  4. Petricoin EF III, Zoon KC, Kohn EC, Barrett JC, Liotta LA: Clinical proteomics: Translating benchside promise into bedside reality.

    Nature Reviews Drug Discovery 2002, 1(9):683-695. PubMed Abstract | Publisher Full Text OpenURL

  5. Rai AJ, Chan DW: Cancer proteomics – Serum diagnostics for tumor marker discovery.

    Annals of the New York Academy of Sciences 2004, 1022:286-294. PubMed Abstract | Publisher Full Text OpenURL

  6. Diamandis EP: Analysis of serum proteomic patterns for early cancer diagnosis: drawing attention to potential problems.

    Journal of the National Cancer Institute 2004, 96(5):353-356. PubMed Abstract | Publisher Full Text OpenURL

  7. Guyon I, Weston J, Barnhill S, Vapnik V: Gene selection for cancer classification using support vector machines.

    Machine Learning 2002, 46(1):389-422. Publisher Full Text OpenURL

  8. Zhang X, Wong WH: Recursive sample classification and gene selection based on SVM: method and software description. [http://www.hsph.harvard.edu/bioinfocore/r-svm.pdf] webcite

    Technical Report, Department of Biostatistics, Harvard School of Public Health 2001. OpenURL

  9. Golub TR, Slonim DK, Tamayo P, Huard C, Gassenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring.

    Science 1999, 286:531-537. PubMed Abstract | Publisher Full Text OpenURL

  10. Wu Z, Irizarry RA: Preprocessing of oligonucleotide array data.

    Nat Biotechnol 2004, 22:656. PubMed Abstract | Publisher Full Text OpenURL

  11. Barash Y, Dehan E, Krupsky M, Franklin W, Geraci M, Friedman N, Kaminski N: Comparative analysis of algorithms for signal quantitation from oligonucleotide microarrays.

    Bioinformatics 2004, 20:839. PubMed Abstract | Publisher Full Text OpenURL

  12. Sorlie T, Perou CM, Tibshirani R, et al.: Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications.

    Proc Natl Acad Sci USA 2001, 98:10869-10874. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  13. Perou CM, Sorlie T, Eisen MB, et al.: Molecular portraits of human breast tumors.

    Nature 2000, 406:747-752. PubMed Abstract | Publisher Full Text OpenURL

  14. Xu XQ, Leow CK, Lu X, Zhang X, Liu JS, Wong WH, Asperger A, Deininger S, Eastwood , Leung HC: Molecular classification of liver cirrhosis in a rat model by proteomics and bioinformatics.

    Proteomics 2004, 4(10):3235-45. PubMed Abstract | Publisher Full Text OpenURL

  15. Shi Q, Harris LN, Lu X, Petkovska A, Li X, Hwang J, McElroy NP, Gentleman R, Iglehart JD, Miron A: Declining plasma fibrinogen alpha fragment identifies HER2-positive breast cancer patients and reverts to normal levels post-surgery.

    Clin Cancer Research 2005.

    submitted

    OpenURL

  16. Hulett MD, Parish CR: Murine histidine-rich glycoprotein: Cloning, characterization and cellular origin.

    Immunology and Cell Biology 2000, 78(3):280-287. Publisher Full Text OpenURL

  17. Breiman L: Random Forest.

    Machine Learning 2001, 45:5-32. Publisher Full Text OpenURL

  18. Mukherjee S, Tamayo P, Slonim D, Verri A, Golub T, Mesirov JP, Poggio T: Support vector machine classification of microarray data. [ftp://publications.ai.mit.edu] webcite

    MIT AIMemo 1998.

    No. 1677, CBCL-182

    OpenURL

  19. Gruvberger S, Ringner M, Chen Y, Panavally S, Saal LH, Borg A, Ferno M, Peterson C, Meltzer PS: Estrogen receptor status in breast cancer is associated with remarkably distinct gene expression patterns.

    Cancer Res 2001, 61:5979-5984. PubMed Abstract | Publisher Full Text OpenURL

  20. Zhang H, Yu C, Singer B, Xiong M: Recursive partitioning for tumor classification with gene expression microarray data.

    PNAS 2001, 98:6730-6735. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  21. van't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AAM, Mao M, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT, Schrelber GJ, Kerkhoven RM, Roberts C, Linsley PS, Bernards R, Friend SH: Gene expression profiling predicts clinical outcome of breast cancer.

    Nature 2002, 415:530-536. PubMed Abstract | Publisher Full Text OpenURL

  22. Ambroise C, McLachlan GJ: Selection bias in gene extraction on the basis of microarray gene-expression data.

    PNAS 2002, 99:6562-6566. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  23. Ben-Dor A, Bruhn L, Firedman N, Nachman I, Schummer M, Yakhini Z: Tissue classification with gene expression profiles.

    RECOMB 2000, 54-64. OpenURL

  24. Furlanello C, Serafini M, Merler S, Jurman G: Entropy-based gene ranking without selection bias for the predictive classification of microarray data.

    BMC Bioinformatics 2003, 4:54-73. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  25. Furey TS, Cristianini N, Duffy N, Bedarski DW, Schummer M, Haussler D: Support vector machine classification and validation of cancer tissue samples using microarray expression data.

    Bioinformatics 2000, 16(10):906-914. PubMed Abstract | Publisher Full Text OpenURL

  26. Cortes C, Vapnik V: Support-vector networks.

    Machine Learning 1995, 20:273-297. OpenURL

  27. Vapnik VN: The Nature of Statistical Learning Theory. Springer-Verlag, New York; 1995. OpenURL

  28. Vapnik VN: Statistical Learning Theory. Wiley, New York; 1998. OpenURL

  29. Vapnik VN: An overview of statistical learning theory.

    IEEE Trans Neural Networks 1999, 10:988-999. Publisher Full Text OpenURL

  30. Collobert R, Bengio S: SVMTorch: support vector machines for large-scale regression problems.

    Journal of Machine Learning Research 2001, 1:143-160. Publisher Full Text OpenURL

  31. Zhang X: Using class-center vectors to build support vector machines.

    Neural Networks for Signal Processing IX 1999, 3-11. OpenURL

  32. Kou Z, Xu J, Zhang X, Ji L: An improved support vector machine using class-median vectors.

    Proc of 8th Intl Conf on Neural Information Processing 2001, 2:883-887. OpenURL

  33. Duda RO, Hart RE: Pattern Classification and Scene Analysis. New York: John Wiley & Sons; 1973. OpenURL

  34. Li L, Darden T, Weinberg C, Levine A, Pederson L: Gene assessment and sample classification for gene expression data using a genetic algorithm/k-nearest neighbor method.

    Combinational Chemistry and High Throughput Screening 2001, 4(8):727-739. OpenURL