Abstract
Background
Although mass spectrometry based proteomics shows exciting promise for diagnosing complex diseases, it remains an important research field rather than an applicable clinical routine because of limitations in diagnostic accuracy and data reproducibility. Relatively little work has addressed high-performance proteomic pattern classification compared with the effort devoted to enhancing data reproducibility.
Methods
In this study, we present a novel machine learning approach to achieve clinical level disease diagnosis for mass spectral data. We propose multiresolution independent component analysis, a novel feature selection algorithm that tackles the large dimensionality of mass spectra by following our local and global feature selection framework. We also develop high-performance classifiers by embedding multiresolution independent component analysis in linear discriminant analysis and support vector machines.
Results
Our multiresolution independent component based support vector machines not only achieve clinical level classification accuracy, but also overcome the weakness of traditional peak-selection based biomarker discovery. In addition to rigorous theoretical analysis, we demonstrate our method’s superiority by comparing it with nine state-of-the-art classification and regression algorithms on six heterogeneous mass spectral profiles.
Conclusions
Our work not only suggests an alternative direction from machine learning to accelerate mass spectral proteomic technologies into a clinical routine by treating an input profile as a ‘profile-biomarker’, but also has positive impacts on large scale ‘omics’ data mining. Related source codes and data sets can be found at: https://sites.google.com/site/heyaumbioinformatics/home/proteomics
Background
With recent surges in proteomics, mass spectral proteomic pattern diagnostics has become a highly promising way of diagnosing, predicting, and monitoring cancers and other advanced diseases because of its cost-effectiveness and efficiency [1]. Recent studies not only demonstrate that proteomic profiling can detect anonymous protein peaks differentially expressed between cancer patients and healthy subjects, but also show that the absence or presence of disease can be discovered through proteomic pattern classification. However, this novel technology remains an important research field rather than a clinical routine because of unresolved problems in data reproducibility and classification. The data reproducibility issue is that no two independent studies have been found to produce the same proteomic patterns. The data classification issue is that the classification accuracy obtained from mass spectral data falls short of a clinical level (e.g., 99.5%) in most studies. Although impressive sensitivities and specificities were reported in some case studies, there is no guarantee that their classification methods extend to other mass spectral data with the same level of performance.
Many methods and protocols have been proposed and are being developed to enhance mass spectral data reproducibility from biological and technological aspects. They include employing peptide profiling in place of proteomic profiling to obtain extremely high resolution data, improving experimental designs to avoid mixing biological and technological variables, and developing more robust preprocessing algorithms [2-5]. However, mass spectral data reproducibility enhancement seems to face a built-in challenge from the technology itself [6]: almost any small change in part of the proteome is amplified into a large difference in the mass spectra, regardless of whether the change comes from biological factors or experimental conditions. This sensitive signal amplification mechanism limits the potential of these reproducibility enhancement techniques and makes reproducible and consistent diagnosis difficult to achieve.
On the other hand, far fewer studies have been devoted to improving mass spectral proteomic pattern classification than to enhancing data reproducibility. To attain high disease diagnostic accuracy, many studies focus on identifying biomarkers from mass spectral profiles, which are generally a small set of protein expression peaks at selected m/z (mass/charge) ratios, through different machine learning approaches (e.g., peak selection) [7,8]. These studies are important and interesting. However, they bear the following limitations. (1) The biomarker selection process is generally a case study oriented to an individual data set. There is no guarantee that it generalizes to other profiles. (2) The biomarkers obtained from these studies are by nature not reproducible because of the irreproducibility of their source data. In other words, the identified mass spectral biomarkers may lose their reusability and predictability, even if they achieve exceptional sensitivity and specificity in classification. It is highly likely that a totally different set of biomarkers would be identified if the same type of mass spectra were generated from another set of cancer patients and healthy individuals under the same experimental conditions. (3) The sensitivity and specificity levels from the biomarkers’ classification are still inadequate to qualify this young technology as a robust clinical routine.
How could we accelerate mass spectral proteomics to become a clinical routine in complex disease diagnosis while the studies on data reproducibility enhancement are still underway? We address this challenge from a machine-learning viewpoint by developing a high-performance mass spectral pattern recognition algorithm in this study. Although data reproducibility plays a very important role in mass spectral proteomics, whether this exciting technology can fully realize its potential may depend, to a large degree, on the levels of sensitivity and specificity attained in mass spectral pattern classification.
If there exists a novel pattern recognition algorithm able to attain a 99.5% level accuracy in mass spectra classification for an input proteomic profile, then the profile can be viewed as a profile biomarker in disease diagnosis. This is because the high-accuracy diagnostic results would be reproducible for all input profiles by taking advantage of the novel classification technique. In such a situation, data reproducibility may no longer be a major obstacle to reproducible biomarker discovery because the profile biomarker is able to “reproduce itself” by attaining clinical level diagnosis.
The high or even huge dimensionality of mass spectral data presents a challenge for high-performance proteomic pattern classification, especially for most traditional classification algorithms, which were developed under the assumption that input data have a small or medium dimensionality. A mass spectral profile can be represented as an n × p matrix after preprocessing, where a row represents the ion intensities of a set of observations (samples) at a mass/charge ratio (m/z), which is similar to a gene in microarray data, and a column represents the ion intensities of a single sample across a set of m/z ratios. Unlike traditional data (e.g., financial data), the number of variables in a mass spectral profile is much greater than the number of observations, i.e., n >> p. In addition, only a small portion of the thousands of testing points (m/z ratios) contribute meaningfully to data variations or demonstrate biological relevance in disease detection. Furthermore, mass spectral data by nature are not noise-free due to the nonlinearity in proteomic profiling. Preprocessing techniques are unable to remove some built-in systematic noise completely. The information redundancy, noise, and high data dimensionalities in mass spectral data not only make some traditional classification methods (e.g., Fisher discriminant analysis) lose discriminative power, but also present an urgent challenge in computational proteomics.
Local features and global features
Many feature selection methods are employed to decrease dimensionalities, remove noise, and extract meaningful features before mass spectra classification. These methods can be categorized as input-space feature selection and subspace feature selection. Input-space feature selection reduces the dimensionality of data by selecting a subset of features to conduct hypothesis testing or create a model under some selection criteria in the same space as the input data (e.g., t-test). On the other hand, subspace feature selection, also called transform-based feature selection, reduces data dimensionality by transforming data into a low-dimensional subspace induced by a linear or nonlinear transformation. The subspace feature selection methods are probably the most used data reduction techniques in proteomics because of their popularity and efficiency. They include principal component analysis (PCA) [9], independent component analysis (ICA) [10,11], nonnegative matrix factorization (NMF) [12], and their different extensions [13,14]. We mainly focus on the subspace feature selection methods in this study.
These algorithms, however, are generally good at selecting global features rather than local features. The global and local features consist of high frequency and low frequency features (signals) respectively, where frequency here refers to how often a pattern occurs across the testing points. For example, a testing point (an m/z ratio) with several exceptionally high peaks on cancer samples, which are seldom found at most testing points, can be viewed as a local feature. On the other hand, a testing point whose expression value plot curve is similar to those of other testing points is a global feature. Because signals of different frequencies capture different data behaviour, the global and local features interpret the global and local behaviour of data, and contribute to the global and local characteristics of data respectively. Since there is no robust screening mechanism available to distinguish the two types of features in most subspace feature selection methods, the global features may demonstrate ‘obvious’ advantages over the local features in the feature selection. That is, the low frequency signals are less likely than the high frequency signals to contribute to the inferred low-dimensional data, which usually are linear combinations of all input variables. For example, the positive and negative weights in the linear combination used to calculate each principal component in PCA are likely to partially cancel each other, and the weights representing contributions from local features are more likely to be cancelled out because of their low frequencies. As such, unlike the global features, the local features are hard to extract for most subspace feature-selection algorithms. Finally, the low-dimensional data inferred from transform-based feature selection may miss some local data characteristics described by the local features. In other words, the global features dominate the feature selection and these algorithms demonstrate a global feature selection mechanism.
Although difficult to extract, the local features are probably the key to attaining high-performance mass spectral pattern classification because they capture subtle data behaviour, especially because many mass spectral samples share very similar global characteristics but different local characteristics. For example, it is easy to distinguish a 10-year-old, five-foot girl, Jean, from a 25-year-old, six-foot male, Mike, because they have different global features. However, it is not easy to distinguish Mike from his twin brother Peter because they share almost the same global characteristics: height, weight, hair color, etc. Nevertheless, careful observers can still tell them apart because Peter has a mole near his mouth but Mike does not, i.e., the mole here works as the local feature that enables the distinction. As another example, some benign tumor samples may display very similar global characteristics to, but quite different local characteristics from, malignant tumor samples. To attain a high-accuracy diagnosis, it is necessary to capture the local data characteristics to distinguish samples that share similar global characteristics from each other. This may be particularly important in mass spectral proteomics because some subtype samples may demonstrate very similar ‘global patterns’ under the same profiling technology.
Reasons for the global feature selection mechanism
A major reason for the global feature selection mechanism displayed in these algorithms is that there is no screening technique available to separate the two types of features in feature selection. In other words, PCA, ICA, NMF, and their variants are all single-resolution feature selection methods, where all features are analyzed indistinguishably in a single resolution regardless of their frequencies. Such indistinguishable treatment makes the most frequent data entries likely to dominate feature selection, while the less frequent data entries may lose their opportunities. In other words, the global features are more likely to be selected than the local features, which prevents effective capture of local data characteristics. As such, the low-dimensional data inferred from these methods (e.g., the projection of the data onto the three principal components in PCA) may demonstrate only the global data characteristics. Consequently, mass spectral samples with similar global characteristics but different local characteristics will not be recognized in the following classification. Moreover, the global feature selection mechanism may bring redundant global features into the following classification because almost only the features interpreting global characteristics are involved in training the corresponding learning machine (e.g., SVM). The redundant global features will unavoidably decrease the generalization of the learning machine and increase the risk of misclassification or overfitting. Finally, learning machines integrated with global feature selection algorithms will display instability in classification, i.e., they may perform well on some data but fail badly on others due to different contributions of the global features to the classification.
To avoid the global feature selection mechanism, it is desirable to distinguish (e.g., sort) features according to their frequencies by building a screening technique that separates the two types of features in the feature selection. In this study, we conduct multiresolution data analysis via a discrete wavelet transform (DWT) [15] to separate features according to their frequencies. The DWT hierarchically organizes data in a multiresolution way by low-pass and high-pass filters. The low (high)-pass filters only pass low (high)-frequency signals but attenuate signals with frequencies higher (lower) than a cutoff frequency. As such, the DWT coefficients at the coarse level capture the global features of the input data and the coefficients at the fine levels capture the local features of the data, i.e., the low frequency and high frequency signals are represented by the coefficients in the coarse and fine resolutions respectively. After such a multiresolution feature separation, we can overcome the global feature selection mechanism by selectively extracting local features and filtering redundant global features.
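The low-pass and high-pass split described above can be illustrated with a minimal sketch. It assumes a one-level Haar wavelet, which is swapped in here only because it is short enough to write by hand; it is not the ‘db8’ wavelet used in this study, and the function names `haar_step` and `haar_inverse_step` are hypothetical.

```python
import math

def haar_step(signal):
    """One level of the Haar DWT on an even-length sequence.

    Returns (approx, detail): the low-pass and high-pass halves.
    """
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail

def haar_inverse_step(approx, detail):
    """Invert haar_step by rebuilding each neighbouring pair."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) / s)
        out.append((a - d) / s)
    return out

# A flat trend with one sharp spike at index 5 (illustrative values only).
signal = [1.0, 1.0, 1.0, 1.0, 1.0, 9.0, 1.0, 1.0]
approx, detail = haar_step(signal)
restored = haar_inverse_step(approx, detail)
```

In this toy signal the detail (high-pass) coefficients are zero everywhere except at the pair containing the spike, while the approximation (low-pass) coefficients carry the overall trend. The inverse step recovers the input exactly, which is why coefficients can be edited level by level and then transformed back.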
In this study, we present a novel multiresolution independent component analysis (MICA) algorithm for effective feature selection for mass spectral data. Unlike the traditional feature selection methods, it suppresses redundant global features and extracts local features to capture gross and subtle data characteristics via multiresolution data analysis. Then, we propose multiresolution independent component analysis based support vector machines (MICA-SVM) to achieve high-performance proteomic pattern classification. In addition to rigorous machine learning analysis, we demonstrate the proposed classifier’s superiority by comparing it with nine state-of-the-art peers on six heterogeneous profiles generated from different profiling technologies and processed by different preprocessing algorithms. The exceptional classification performance (~99.5% average classification ratios) and excellent stability suggest that this algorithm has great potential to help bring mass spectral proteomics into clinical routine, even if data reproducibility is not guaranteed.
Methods
Multiresolution independent component analysis (MICA) is built from the discrete wavelet transform (DWT), principal component analysis (PCA), data reconstruction based on the first loading vector, metadata approximation induced by the inverse discrete wavelet transform (IDWT), and subspace spanning based on independent component analysis (ICA). The DWT decomposes input data in a multiresolution form by using a wavelet and a scaling function. Mathematically, it is equivalent to multiplying the input data by a set of orthogonal matrices block by block. The coefficients at the coarse and fine levels represent the input data’s global and local features respectively. ICA, in turn, seeks to represent input data as a linear combination of a set of statistically independent components by minimizing their mutual information. Theoretically, it is equivalent to inverting the central limit theorem (CLT) by searching for maximally non-normal projections of the original data distribution. More detailed information about DWT, PCA, and ICA can be found in [15,11].
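The remark about inverting the central limit theorem can be made concrete with a small numerical sketch. It assumes excess kurtosis as a simple proxy for non-normality (the text itself speaks of mutual information, so this proxy is an illustrative substitution), and the helper name `excess_kurtosis` and the uniform sources are hypothetical choices.

```python
import random

def excess_kurtosis(values):
    """Sample excess kurtosis: 0 for a Gaussian, about -1.2 for a uniform source."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 * m2) - 3.0

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 20000
s1 = [rng.uniform(-1.0, 1.0) for _ in range(n)]  # independent source 1
s2 = [rng.uniform(-1.0, 1.0) for _ in range(n)]  # independent source 2

# An equal-weight mixture of the two independent sources.
mixed = [(a + b) / 2 ** 0.5 for a, b in zip(s1, s2)]

k_source = excess_kurtosis(s1)
k_mixed = excess_kurtosis(mixed)
```

The mixture has an excess kurtosis closer to zero than either source, i.e., mixing independent signals pushes the distribution toward a Gaussian. An ICA procedure runs this logic in reverse: among all linear projections of the mixed data, it looks for the ones that are as far from Gaussian as possible and treats them as the independent components.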
Multiresolution independent component analysis (MICA)
MICA seeks the low-dimensional metasample (prototype) for each high-dimensional mass spectral sample in the subspace generated by the statistically independent components of a metaprofile of the input data. As a same-dimensional approximation of the original high-dimensional data, the metaprofile keeps the most important global features, drops the redundant global features, and extracts almost all local features of the original data. The metaprofile is computed by conducting an inverse DWT on the updated coefficient matrices, where the coarse level coefficients are selectively suppressed by the first loading vector reconstruction to filter the redundant global features, and the fine level coefficients are kept to extract the local features. It is worth pointing out that the independent components in MICA are calculated by conducting independent component analysis on the metaprofile. Unlike the independent components in classic ICA, which are mainly retrieved from the global features, the independent components calculated by MICA are statistically independent signals that contain contributions from almost all local features and the most important global features. As such, the MICA components are more representative in revealing the latent data structure than the classic ICA components. Moreover, MICA brings an automatic denoising mechanism via its redundant global feature suppression. Since the coarse level coefficients (e.g., the first level coefficients) in the DWT generally contain “contributions” from noise, suppressing the coarse level coefficients not only filters unnecessary global features, but also removes the noise automatically. The automatic denoising prevents noise from entering feature selection and the following classifier training, which contributes to robust mass spectral pattern classification. The MICA algorithm can be described in the following steps.
Algorithm 1. Multiresolution independent component analysis (MICA)
1. Wavelet transforms. Given a protein expression profile X with p samples across n m/z ratios, MICA conducts an L-level columnwise DWT on the input data to obtain wavelet coefficients, which consist of L detail coefficient matrices D_{1}, D_{2}, …, D_{L} and an approximation coefficient matrix A_{L}.
2. Redundant global feature suppressing and local feature extraction. A level threshold τ is selected to suppress redundant global features and maintain local features.
a). If j ≤ τ,
1). conduct principal component analysis for each detail coefficient matrix D_{j} to obtain its principal component (PC) matrix and the corresponding score matrix;
2). reconstruct and update the detail coefficient matrix D_{j} by using the first loading vector u_{1} in the PC matrix together with an n_{j} × 1 vector whose entries are all ‘1’s.
b). If j > τ, keep all detail coefficient matrices intact.
3. Inverse discrete wavelet transforms. Conduct the corresponding inverse discrete wavelet transforms using the updated coefficient matrices to get the metaprofile X^{*} of X.
4. Independent component analysis. Conduct the classic independent component analysis for X^{*} to obtain the independent components and the mixing matrix: X^{*} = AZ, where A is the mixing matrix and Z holds the independent components.
5. Subspace decomposition. The metaprofile X^{*} is the approximation of X obtained by removing the redundant global features and retaining almost all local features, with features selected according to their frequencies. Each sample can be decomposed in the subspace spanned by the independent components, each of which serves as a basis vector of the subspace. In other words, each sample can be represented as a linear combination of the independent components, where the metasample a_{i} is the i^{th} row of the mixing matrix and records the coordinate values of the sample x_{i} in the subspace. As a low-dimensional vector, the metasample a_{i} retains almost all local features and the most important global features of the original high-dimensional sample x_{i}. Thus, each metasample in the subspace can be viewed as a data-locality preserved prototype of its corresponding high-dimensional mass spectral sample.
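The suppression in step 2 keeps only the variation along the first principal component of a coefficient matrix. The sketch below illustrates that idea under stated assumptions: the update is taken to be "column means plus the rank-one term built from the first loading vector", which is an assumed form rather than the exact update formula of Algorithm 1, and a plain power iteration stands in for a full PCA. The function name `first_pc_reconstruction` and the toy matrices are hypothetical.

```python
import math

def first_pc_reconstruction(rows, iters=200):
    """Approximate a matrix by its column means plus its first principal component.

    rows: list of equal-length lists (observations x variables).
    Assumed form of the update: mean + score_on_u1 * u1 for every row.
    """
    m, n = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / m for j in range(n)]
    centered = [[r[j] - means[j] for j in range(n)] for r in rows]

    # Start the power iteration from the largest centered row; it is very
    # unlikely (though not impossible) to be orthogonal to the first loading vector.
    start = max(centered, key=lambda c: sum(x * x for x in c))
    norm = math.sqrt(sum(x * x for x in start))
    if norm == 0.0:  # no variance at all: the matrix equals its column means
        return [list(means) for _ in rows]
    u = [x / norm for x in start]

    for _ in range(iters):
        scores = [sum(c[j] * u[j] for j in range(n)) for c in centered]
        v = [sum(centered[i][j] * scores[i] for i in range(m)) for j in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        u = [x / norm for x in v]  # u converges to the first loading vector

    scores = [sum(c[j] * u[j] for j in range(n)) for c in centered]
    return [[means[j] + scores[i] * u[j] for j in range(n)] for i in range(m)]

def sum_of_squares(rows):
    return sum(x * x for r in rows for x in r)

# Rows that already lie on one line are reproduced exactly.
exact = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
exact_rec = first_pc_reconstruction(exact)

# Zero-mean toy data: the reconstruction keeps less total variation.
noisy = [[2.0, 0.1], [-2.0, -0.1], [0.0, 0.5], [0.0, -0.5]]
approx = first_pc_reconstruction(noisy)
```

For the `noisy` example the reconstruction retains the dominant direction and discards the remaining variation, which mirrors how the updated coefficient matrices carry a lower total variance into the inverse transform.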
The redundant global feature suppressing and local feature extraction in MICA decrease the total data variance of the resulting metaprofile by keeping only the data variance on the first PC of each coefficient matrix at or before the level threshold τ. As a same-dimensional but lower-variance approximation of the original data that keeps the most important global data characteristics and captures local data characteristics, the metaprofile X^{*} makes the following independent component analysis more sensitive in catching subtle data behavior than applying ICA directly to the original data. Figure 1 visualizes three control and three cancer samples of the colorectal (CRC) data [7]. Each sample is a 16331×1 vector, and their low-dimensional metasamples are obtained from MICA at the thresholds τ=2,4,6 with a Daubechies family wavelet ‘db8’. We indicate the control and cancer samples and their corresponding metasamples by red and blue lines respectively. It is clear that there is no way to tell the two types of samples apart from the plot of the original data (subfig 1 at the NW corner). However, their metasamples at the three thresholds demonstrate clear separations between the controls and cancers (subfig 2,3,4 at the NE, SW, and SE corners). The extracted local features and selected important global features make the two types of samples display two distinct prototypes in the low-dimensional subspace. As the level threshold increases, the two groups of prototypes tend to separate cancer and control samples more clearly. Interestingly, the two types of metasamples demonstrate a “self-clustering” mechanism in that the metasamples belonging to the same type show very close spatial proximities.
The clear sample separation information conveyed by the self-clustering mechanism of the metasamples is almost impossible to obtain from the original high-dimensional data directly, and the key discriminative features captured by our proposed MICA method facilitate the subsequent classification step and contribute to high-accuracy disease diagnosis. It is also worth pointing out that similar results can be obtained for the other mass spectral data.
Figure 1. Metasamples computed from MICA. Metasamples computed from MICA for six original samples (three controls and three cancers) in the colorectal data at the three level thresholds: τ=2,4,6 with the wavelet ‘db8’. The low-dimensional metasamples separate the two types of samples clearly in visualization (x-axis represents the dimensionality of the subspace spanned by the independent components).
MICA-based support vector machines
The MICA-based support vector machine (MICA-SVM) applies the classic support vector machine (SVM) [16] to the metasamples calculated from MICA to perform classification in a low-dimensional space. Unlike the traditional SVM, which builds a maximum margin hyperplane in the original high-dimensional space where n ~ 10^{3} – 10^{4}, MICA-SVM separates biological samples by constructing the maximum margin hyperplane in the subspace spanned by the independent components, using the metasamples. If we assume the number of support vectors N_{s} is much less than the number of training points l, then the time complexity of MICA-SVM is much lower than that of the classic SVM, provided the same number of training points and support vectors. We first briefly describe the MICA-SVM algorithm for binary classification.
Given a training dataset and its sample class type information, a meta-dataset is computed by MICA. Then, a maximum margin hyperplane is constructed to separate the ‘+1’ (‘cancer’) and ‘-1’ (‘control’) types of metasamples. It is equivalent to solving a quadratic programming problem, referred to as Eq. (1).
Eq. (1) can be solved through its Lagrangian dual, which is also a quadratic programming problem, expressed in dual variables rather than the primal variables W and b.
The normal of the maximum-margin hyperplane and the intercept term b are calculated from the dual solution. The decision function is used to determine the class type of a testing sample x', where the metasamples of the training samples and of x' are computed from MICA, and the SVM kernel function maps the metasamples into a same-dimensional or high-dimensional feature space. In this work, we mainly focus on the linear kernel for its efficiency in proteomic pattern classification. In fact, through rigorous theoretical analysis we have found that an SVM classifier under a standard Gaussian (‘rbf’) kernel inevitably encounters overfitting for mass spectral proteomic data. The details can be found in Additional file 1.
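For orientation, the textbook hard-margin SVM on metasamples can be written as below. This is a sketch of the standard formulation under assumptions, not necessarily the exact form of Eq. (1) in this work (which may, for instance, include slack variables). The symbols y_i (class labels in {+1, -1}), α_i (dual variables), k (kernel), and a' (metasample of the testing sample x') are notational assumptions; w denotes the hyperplane normal and l the number of training points.

```latex
% Primal problem (assumed hard-margin form)
\min_{\mathbf{w},\, b}\ \tfrac{1}{2}\lVert \mathbf{w} \rVert^{2}
\quad \text{s.t.} \quad
y_i\left(\mathbf{w}^{T}\mathbf{a}_i + b\right) \ge 1, \qquad i = 1, \dots, l

% Lagrangian dual
\max_{\boldsymbol{\alpha}}\ \sum_{i=1}^{l} \alpha_i
 - \tfrac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l}
   \alpha_i \alpha_j y_i y_j\, k(\mathbf{a}_i, \mathbf{a}_j)
\quad \text{s.t.} \quad
\sum_{i=1}^{l} \alpha_i y_i = 0, \qquad \alpha_i \ge 0

% Hyperplane normal, intercept (for any support vector a_s), and decision function
\mathbf{w} = \sum_{i=1}^{l} \alpha_i y_i \mathbf{a}_i, \qquad
b = y_s - \mathbf{w}^{T}\mathbf{a}_s, \qquad
f(x') = \operatorname{sign}\!\left( \sum_{i=1}^{l} \alpha_i y_i\, k(\mathbf{a}_i, \mathbf{a}') + b \right)
```

With the linear kernel, k(a_i, a') is simply the inner product of the two metasamples, so the decision function only involves vectors in the low-dimensional subspace rather than the original spectra. The expressions for w and b in the last line hold for the linear kernel.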
Additional file 1. Overfitting analysis. A rigorous analysis of SVM overfitting under a standard Gaussian kernel for mass spectral proteomic data.
Results
To demonstrate the superiority of our algorithm, we include five publicly available large-scale mass spectral profiles in our experiments: colorectal (CRC) [7], hepatocellular carcinoma (HCC) [8], ovarian-qaqc, prostate [17], and cirrhotic [8]. They are heterogeneous data generated from different profiling technologies and preprocessed by different algorithms. The HCC and cirrhotic datasets are two binary-class datasets separated from a three-class profile consisting of 78 HCC, 72 control, and 51 cirrhotic samples [8].
To address the data heterogeneity, we employed different preprocessing methods for these profiles. We conducted baseline correction, smoothing, normalization, and peak alignment for the ovarian-qaqc data. The baseline for each profile was estimated within multiple shifted windows of width 200 m/z, and spline approximation was applied to predict the varying baseline. The mass spectra were further smoothed using the ‘lowess’ method, and normalized by standardizing the area under the curve (AUC) to the group median. Moreover, the spectrograms were aligned to two reference peaks: (3883.766, 7766.166). In contrast, we only conducted baseline correction, normalization, and smoothing for the HCC, prostate, and cirrhotic data, where ‘least-square polynomial’ smoothing was used instead of ‘lowess’ smoothing. We did not conduct our own preprocessing for the colorectal data because it was already preprocessed [7]. Table 1 shows detailed information about the five data sets.
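The AUC normalization step can be sketched as follows. This is a minimal illustration that assumes trapezoidal integration over the m/z axis and a single group containing all spectra; the function names `trapezoid_auc` and `normalize_to_median_auc` and the toy intensities are hypothetical, and the sketch is not the preprocessing code used for the experiments.

```python
from statistics import median

def trapezoid_auc(mz, intensity):
    """Area under one spectrum by the trapezoidal rule over the m/z axis."""
    return sum((mz[i + 1] - mz[i]) * (intensity[i] + intensity[i + 1]) / 2.0
               for i in range(len(mz) - 1))

def normalize_to_median_auc(mz, spectra):
    """Rescale each spectrum so that its AUC equals the median AUC of the group.

    Assumes every spectrum has a strictly positive AUC.
    """
    aucs = [trapezoid_auc(mz, s) for s in spectra]
    target = median(aucs)
    return [[v * target / a for v in s] for s, a in zip(spectra, aucs)]

# Three toy spectra with the same shape but different overall intensity.
mz = [1000.0, 1001.0, 1002.0, 1003.0]
spectra = [[1.0, 2.0, 2.0, 1.0],
           [2.0, 4.0, 4.0, 2.0],
           [4.0, 8.0, 8.0, 4.0]]
normalized = normalize_to_median_auc(mz, spectra)
```

After rescaling, the three toy spectra coincide, which shows how this normalization removes differences in overall signal strength while leaving the shape of each spectrum unchanged.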
Table 1. Five heterogeneous mass spectral profiles
Cross validations and comparison peers
We compared our algorithm with six state-of-the-art peers in terms of average classification rates, sensitivities, and specificities under k-fold (k=10) cross validation and 100 trials of 50% holdout cross validation (HOCV). The classification accuracy in the i^{th} classification is the ratio of the correctly classified testing samples to the total testing samples. The sensitivity and specificity are defined as the ratios tp/(tp+fn) and tn/(tn+fp) respectively, where tp (tn) is the number of positive (negative) targets correctly classified, and fp (fn) is the number of negative (positive) targets incorrectly classified. In the 100 trials of 50% holdout cross validation, all samples in each data set are pooled together and randomly divided in half to get training and testing data. Such a partition is repeated 100 times to get 100 pairs of training and testing data sets. In the k-fold cross validation, an input dataset is partitioned into k disjoint portions of equal or approximately equal size. In each of the k rounds of classification, one portion is used for testing and the other k-1 portions are used for training. These cross validations decrease potential biases in algorithm performance evaluation compared with prespecifying the training and testing data.
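The evaluation protocol above can be sketched in a few lines. The sketch assumes labels coded as +1 (positive) and -1 (negative) and uses hypothetical helper names (`confusion_counts`, `rates`, `holdout_splits`, `kfold_splits`); it only generates index splits and computes the three measures, and is not the evaluation code behind the reported tables.

```python
import random

def confusion_counts(y_true, y_pred, positive=1):
    """Count tp, tn, fp, fn for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, tn, fp, fn

def rates(y_true, y_pred, positive=1):
    """Accuracy, sensitivity tp/(tp+fn), and specificity tn/(tn+fp).

    Assumes both classes occur in y_true, so no denominator is zero.
    """
    tp, tn, fp, fn = confusion_counts(y_true, y_pred, positive)
    return (tp + tn) / (tp + tn + fp + fn), tp / (tp + fn), tn / (tn + fp)

def holdout_splits(n_samples, trials=100, seed=0):
    """Yield (train, test) index lists for repeated 50% holdout."""
    rng = random.Random(seed)
    half = n_samples // 2
    for _ in range(trials):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        yield idx[:half], idx[half:]

def kfold_splits(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k disjoint, near-equal folds."""
    rng = random.Random(seed)
    idx = list(range(n_samples))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]

# Toy predictions: one false negative and one false positive among eight samples.
y_true = [1, 1, 1, -1, -1, -1, -1, 1]
y_pred = [1, 1, -1, -1, -1, 1, -1, 1]
acc, sens, spec = rates(y_true, y_pred)
```

In practice a classifier would be trained on each `train` index list and scored on the matching `test` list, and the three measures would then be averaged over the 100 holdout trials or the k folds.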
The six comparison algorithms can be categorized into two types. The first type consists of the standard support vector machines (SVM) and linear discriminant analysis (LDA), both of which are state-of-the-art classification methods. The second type consists of four methods embedding subspace feature selection in SVM and LDA: support vector machines with principal component analysis (PCA) / independent component analysis (ICA) / nonnegative matrix factorization (NMF), and linear discriminant analysis (LDA) with principal component analysis. We refer to them as PCA-SVM, ICA-SVM, NMF-SVM, and PCA-LDA respectively. The implementation details of these algorithms can be found in [14].
Experimental results
We employ the ‘db8’ wavelet in MICA to conduct a 12-level discrete wavelet transform for each dataset and select the level threshold τ=2 uniformly for all profiles. Although it is not an optimal level threshold for all data, it guarantees automatic denoising and “fair” algorithm comparisons. Moreover, the metasamples obtained from MICA at τ=2 can clearly distinguish the two types of samples. Although other level thresholds are possible, a threshold that is too ‘coarse’ (e.g., τ=1) or too ‘fine’ (e.g., τ=10) may miss some important global or local features and affect the following classification.
Table 2 and Table 3 show the average performance of MICA-SVM and its six peers in terms of classification rates, sensitivities, and specificities, together with their standard deviations, under the two types of cross validation respectively. The NMF-SVM and LDA algorithms are excluded from Table 3 because of their relatively low performance. The best performance is highlighted for each data set. It is clear that the MICA-SVM algorithm leads the other algorithms by a wide margin. For example, its average prediction ratios exceed 99.0% for all data under the 100 trials of 50% HOCV. It is interesting to see that our results are superior to those of the peak-selection based biomarker discovery methods. For instance, the peak-selection method employed by Alexandrov et al. [7] achieved an SVM classification rate of 97.3% (sensitivity: 98.4% and specificity: 95.8%) on the colorectal (CRC) data under a double cross validation (a leave-one-out CV and a 5-fold CV). Similarly, another peak-selection biomarker discovery method induced by nonnegative principal component analysis (NPCA) attained 98.21% (sensitivity: 95.83%, specificity: 100%) under an SVM classifier with leave-one-out cross validation (LOOCV) on the same data set [14].
Table 2. Performance of seven algorithms under the 100 trials of 50% HOCV
Table 3. Performance of five classifiers under the 10-fold CV
However, our algorithm achieved an average 99.05% classification rate (sensitivity: 98.84%, specificity: 99.28%) under 100 trials of 50% HOCV, where much less prior knowledge is available for classification than under the LOOCV and 5-fold cross validation. In addition, under the 10-fold cross validation, the proposed algorithm achieves 99.33% and 99.52% prediction ratios on the HCC and ovarianqaqc data respectively. More impressively, it attains 100% classification ratios on the colorectal, prostate, and cirrhotic data. Unlike the other methods, which display instabilities in classification, our algorithm demonstrates strong stability in attaining high-accuracy pattern detection on all five profiles. This observation is also supported by the lower standard deviations of MICASVM's three classification measures compared with those of the other methods.
We also found almost no statistically significant differences between SVM and its subspace feature selection based extensions (e.g., PCASVM), which achieve the same or slightly lower performance than the standard SVM. The reason seems to be rooted in the global feature selection mechanisms of the PCA, ICA, and NMF methods. As we pointed out before, since some mass spectral samples may display very similar global characteristics but different local characteristics, an SVM classifier integrated with a global feature selection method may inevitably have difficulty distinguishing these samples. Although extracted by different transformation methods, the global features seem to contribute at nearly the same level, statistically, to proteomic data classification. Moreover, the redundant global features introduced by the global feature selection mechanism may enter the SVM learning, which would limit the generalization of all the SVM-related classifiers and cause instability in classification. This point can also be observed in their relatively high standard deviations of the classification rates, sensitivities, and specificities. For example, on the cirrhotic profile the standard deviations of the three measures from the PCASVM classifier are 3.76%, 8.39%, and 3.86% respectively, which are much higher than those from the MICASVM classifier (0.85%, 1.65%, and 0.95%). Similar observations hold for the other data sets.
However, it is interesting that MICA’s local feature capturing and redundant global feature suppressing mechanisms appear to contribute to the MICASVM classifier’s exceptional performance and good algorithm stability on the five heterogeneous data sets. Figure 2 compares the distribution of the MICASVM classifier’s classification rates with those of the ICASVM, PCASVM, and SVM classifiers under the 100 trials of 50% HOCV. It clearly demonstrates that MICASVM has statistically significant advantages over the other three classifiers on all five data sets. Moreover, Figure 3 shows MICASVM’s advantages over its four peers (PCALDA, PCASVM, ICASVM, and SVM) in terms of the average classification rates, sensitivities, specificities, and positive prediction ratios under the 10-fold CV. Consistent with the results under the 100 trials of 50% HOCV, the four peers again show nearly the same level of performance on the four classification measures.
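The evaluation protocol above (repeated 50% hold-out cross validation, summarized by the mean and standard deviation of the classification rate, sensitivity, and specificity) can be sketched as follows. This is a minimal illustration in plain Python, not the implementation used in this study: the function names `rates` and `holdout_cv`, the `fit_predict` callback, and the unstratified random split are all assumptions made for the example.

```python
import random
from statistics import mean, stdev


def rates(y_true, y_pred, positive=1):
    """Return (classification rate, sensitivity, specificity)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = (tp + tn) / len(pairs)
    # A split without positives (or negatives) leaves the measure undefined.
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return accuracy, sensitivity, specificity


def holdout_cv(samples, labels, fit_predict, trials=100, seed=0):
    """Repeat a random 50% train / 50% test split `trials` times.

    fit_predict(train_x, train_y, test_x) is a hypothetical callback that
    trains any classifier and returns predicted labels for test_x.
    Returns [(mean, std)] for classification rate, sensitivity, specificity.
    """
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    results = []
    for _ in range(trials):
        rng.shuffle(indices)
        half = len(indices) // 2
        train, test = indices[:half], indices[half:]
        predicted = fit_predict(
            [samples[i] for i in train],
            [labels[i] for i in train],
            [samples[i] for i in test],
        )
        results.append(rates([labels[i] for i in test], predicted))
    # One (mean, standard deviation) pair per measure, across all trials.
    return [(mean(column), stdev(column)) for column in zip(*results)]
```

Any classifier can be plugged in through `fit_predict`; the sketch only fixes how the splits are drawn and how the three measures are averaged.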
Figure 2. Comparison of four SVM algorithms’ classification rate distributions under the 100 trials of 50% HOCV. The distributions of the classification rates for the MICASVM, ICASVM, PCASVM and SVM algorithms on the five mass spectral datasets.
Figure 3. Performance comparison of five algorithms under the 10-fold CV. Comparisons of the classification performance of five algorithms under 10-fold CV on the five mass spectral profiles: ‘C’ (colorectal), ‘H’ (hcc), ‘O’ (ovarianqaqc), ‘P’ (prostate), and ‘C1’ (cirrhotic). The MICASVM algorithm shows a stable lead in performance over the others.
Multiclass classification
The MICA-based support vector machines can also be extended to handle multiclass classification, which has not been seriously addressed in mass spectral proteomics. Multiclass classification can be more practical in cancer diagnosis because detecting different pathologic states of cancers is essential in early cancer discovery. We ‘merge’ the HCC and cirrhotic data into a three-class profile to seek high-accuracy discrimination between healthy individuals (controls), patients with hepatocellular carcinoma (HCC), and patients with cirrhosis, where cirrhosis can be viewed to some degree as an early HCC stage because chronic hepatitis C causes HCC via the stage of cirrhosis.
We employ the ‘one-against-one’ method in our MICA-based multiclass SVM classification because of its demonstrated advantage over the ‘one-against-all’ and ‘directed acyclic SVM’ methods [18]. The ‘one-against-one’ method builds k(k-1)/2 binary SVM classifiers for a data set with k classes {1,2,…,k}. Each classifier is trained on data from two classes, i.e., the training samples are from the ith and jth classes, where i,j=1,2,…,k and i≠j. We describe our MICA-based ‘one-against-one’ SVM as follows.
Given a training data set consisting of samples across m testing points from the ith and jth classes and their corresponding labels, a corresponding low-dimensional metasample data set is computed by MICA. Then, maximizing the margin between the two types of data is equivalent to the following problem:
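In the standard soft-margin form used for one-against-one classifiers in [18], and written here over the metasamples, this problem can be sketched as follows. The weight vector $w^{ij}$, offset $b^{ij}$, slack variables $\xi^{ij}_{t}$, penalty parameter $C$, and feature map $\phi$ are notation assumed for this sketch rather than symbols confirmed by the text; $y_{t}$ denotes the class label of the t-th training sample.

```latex
\begin{aligned}
\min_{w^{ij},\, b^{ij},\, \xi^{ij}} \quad
  & \tfrac{1}{2}\,(w^{ij})^{T} w^{ij} + C \sum_{t} \xi^{ij}_{t} \\
\text{subject to} \quad
  & (w^{ij})^{T}\phi(a_{t}) + b^{ij} \ge 1 - \xi^{ij}_{t},
    \quad \text{if } y_{t} = i, \\
  & (w^{ij})^{T}\phi(a_{t}) + b^{ij} \le -1 + \xi^{ij}_{t},
    \quad \text{if } y_{t} = j, \\
  & \xi^{ij}_{t} \ge 0 .
\end{aligned}
```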
where a_{t} is the metasample calculated for the training sample x_{t}. After building all k(k-1)/2 classifiers, we first determine whether a testing sample x' is from the ith or the jth class by a local decision function, where a' is the metasample of x'. Then, we use the ‘Max-wins’ voting approach to infer its final class type: if the local decision function says x' is in the ith class, then the ith class wins one vote; otherwise, the jth class wins one vote. Finally, x' is assigned to the class with the most votes.
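The ‘Max-wins’ voting step described above can be sketched in a few lines of Python. This is an illustration only: `pairwise_decision` is a hypothetical stand-in for the trained binary classifier of each class pair, and breaking ties in favour of the class listed first is an assumption, since the text does not specify a tie rule.

```python
from collections import Counter
from itertools import combinations


def max_wins(classes, pairwise_decision, sample):
    """Classify `sample` by 'Max-wins' voting over all k(k-1)/2 class pairs.

    pairwise_decision(i, j, sample) is a hypothetical stand-in for the
    binary classifier trained on classes i and j; it returns i or j.
    """
    votes = Counter()
    for i, j in combinations(classes, 2):
        votes[pairwise_decision(i, j, sample)] += 1
    # Assumption: ties go to the class that appears first in `classes`.
    return max(classes, key=lambda c: votes[c])


# Toy usage with made-up one-dimensional "metasamples": each pairwise
# decision picks the class whose (hypothetical) centre is closer.
centres = {"control": 0.0, "cirrhosis": 1.0, "HCC": 2.0}


def nearest_centre(i, j, a):
    return i if abs(a - centres[i]) <= abs(a - centres[j]) else j


label = max_wins(list(centres), nearest_centre, 1.8)
```

For three classes the loop runs over the three pairs (control, cirrhosis), (control, HCC), and (cirrhosis, HCC), matching k(k-1)/2 with k=3.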
We also implemented the ‘one-against-one’ method in SVM, PCASVM, and ICASVM multiclass classification for a fair comparison. It was interesting to find that the four classifiers PCALDA, SVM, PCASVM, and ICASVM had equivalent performance under the two types of cross validation for this three-class data. As before, the LDA and NMFSVM algorithms performed worse than these four algorithms. However, the MICASVM algorithm achieved average classification ratios of 97.37% and 98.52% under the 100 trials of 50% HOCV and the 10-fold CV respectively, which were much higher than the corresponding average classification ratios of 83.79% and 86.61% attained by the four peers under the same cross validations.
Figure 4 compares the classification performance of our proposed algorithm with those of the PCASVM, ICASVM, and SVM algorithms under the 100 trials of 50% HOCV by visualizing the distributions of their classification rates, sensitivities, and specificities. The similar or even identical distributions of the three random variables suggest that there are no statistically significant differences between the three comparison classifiers. However, the distributions of the three random variables for the MICASVM algorithm imply that it differs significantly from the comparison algorithms by attaining high-accuracy pattern prediction. It also appears that integrating a ‘one-against-one’ SVM with the global feature selection algorithms (e.g., PCA, ICA) may not enhance multiclass data classification either, whereas integrating the ‘one-against-one’ SVM with MICA yields a statistically significant improvement in multiclass classification because of its effective local feature capturing. These results are consistent with those of the previous binary classification.
Figure 4. Multiclass classification performance. The distributions of the classification rates, sensitivities and specificities of the MICASVM, ICASVM, PCASVM and SVM algorithms on a threeclass data set. The distributions of the three random variables: classification rates, sensitivities and specificities of the MICASVM algorithm are significantly different from those of the other three algorithms for its exceptional classification performance.
MICA-based linear discriminant analysis
Although linear discriminant analysis (LDA) had the worst performance among all seven algorithms in our investigation, it is interesting to generalize MICA to LDA classification by designing a MICALDA classifier, both to further verify the effectiveness of MICA in enhancing proteomic pattern detection and to take advantage of LDA’s built-in multiclass handling mechanism. Similar to the MICASVM algorithm, the multiresolution independent component analysis based linear discriminant analysis (MICALDA) applies the classic LDA to the metasamples obtained from MICA to classify samples. Table 4 shows the MICALDA algorithm’s performance on the six profiles. For consistency with the previous experiments, we still use the ‘db8’ wavelet and set the level threshold τ=2 in MICA. Interestingly, this algorithm’s performance is second only to that of the MICASVM algorithm. It achieves a 96.84% average classification rate with 98.69% sensitivity and 96.21% specificity on the three-class profile under the 100 trials of 50% HOCV. Furthermore, it outperforms the other comparison algorithms on the colorectal, cirrhotic, and HCC data.
Table 4. MICALDA performance on six mass spectral data sets
Three partial least squares (PLS) based regression methods
We also compare our algorithm with three PLS-based regression methods. Originally developed as a dimension reduction algorithm in the field of chemometrics, PLS has recently drawn increasing attention in machine learning and statistical inference. The three methods are PLS-based regression, the PLS-based linear logistic regression proposed by Nguyen and Rocke [19], and the PLS-based ridge penalized logistic regression proposed by Fort and Lambert-Lacroix [20]. In our context, all three algorithms treat classification as a regression problem with discrete outputs, few observations, and many predictor variables. We refer to them as PLSREG, NRLLD, and RPLSLLD respectively. Since the NRLLD and RPLSLLD algorithms require feature selection before classification, we conduct a two-sample t-test with a pooled variance estimate to select the 2000 most differentially expressed features from each data set for these two methods, where the three-class data set is treated as a binary data set with 72 controls and 129 diseased samples (78 hepatocellular carcinoma + 51 cirrhosis samples). The number of PLS components is uniformly set to 10 for all three methods. Table 5 shows the average classification rates and standard deviations of MICASVM and the three algorithms under the two types of cross validation. Our proposed MICASVM algorithm still holds a clear performance advantage over the three peers.
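The feature pre-selection step for NRLLD and RPLSLLD (a two-sample t-test with a pooled variance estimate, keeping the most differentially expressed features) can be sketched as below. This is a minimal plain-Python illustration, not the code used in this study: the function names `pooled_t` and `top_features` are hypothetical, features are ranked by the absolute t-statistic, and every feature is assumed to have non-zero pooled variance.

```python
from statistics import mean, variance


def pooled_t(group_a, group_b):
    """Two-sample t-statistic with a pooled variance estimate."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    standard_error = (pooled_var * (1 / n_a + 1 / n_b)) ** 0.5
    return (mean(group_a) - mean(group_b)) / standard_error


def top_features(controls, diseased, n_keep):
    """Indices of the n_keep features with the largest |t|.

    controls and diseased are lists of samples; each sample is a list of
    feature intensities, all of the same length.
    """
    scores = []
    for f in range(len(controls[0])):
        t = pooled_t([s[f] for s in controls], [s[f] for s in diseased])
        scores.append((abs(t), f))
    # Stable sort: features with equal |t| keep their original order.
    scores.sort(key=lambda pair: -pair[0])
    return [f for _, f in scores[:n_keep]]
```

With `n_keep=2000` this mirrors the selection size stated above; the toy call `top_features(controls, diseased, 1)` would return the single most discriminative feature index.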
Table 5. Performance of MICASVM, PLSREG, NRLLD, and RPLSLLD
Algorithmic stability analysis
Instabilities of current classification methodologies are widely observed in mass spectral proteomics. In fact, almost all of these classification methods were proposed through analyzing an individual dataset [13,5,7,8]. They may work efficiently on that individual data set but lack stability when applied to other heterogeneous data generated by different profiling technologies or processed by different preprocessing methods. Such instabilities not only present difficulties in reproducible biomarker discovery, but also hamper exploring the clinical potential of this technology. Although algorithmic stability analysis is essential in computational proteomics, there has not been even an ad hoc investigation of this topic. To evaluate the algorithmic stability of mass spectral proteomic data classification algorithms, we present an algorithmic stability analysis by introducing two scale-free measures: the algorithm stability index and the relative stability. The algorithm stability index measures the stability of an algorithm across a number of datasets. A high algorithm stability index value indicates better stability of an algorithm. The relative stability, in turn, measures the stabilities of a set of classification algorithms with respect to a specific algorithm, which is the MICASVM algorithm in this study. A small relative stability indicates that an algorithm’s performance is relatively close to that of the MICASVM algorithm.
Given a classification algorithm running on M heterogeneous profiles under a cross validation, the algorithm stability index δ_{a} and the relative stability δ_{r} are defined as, where μ_{i}, s_{i} are the average classification rate and the corresponding standard deviation of the algorithm on the i^{th} profile respectively, and the parameter is the average classification ratio of the MICASVM algorithm on the i^{th} profile.
The two left panels in Figure 5 show the algorithm stability index and relative stability values of all eight algorithms on the six profiles under the 100 trials of 50% HOCV. The PCASVM, ICASVM, and SVM algorithms have almost the same level of stability, as indicated by their close δ_{a} values. The two smallest δ_{a} values indicate that the NMFSVM and LDA algorithms are the least stable. The δ_{a} values of the MICASVM and MICALDA algorithms are the largest and 2^{nd} largest among the eight algorithm stability index values. The relative stability value of the MICALDA algorithm suggests that it achieves the closest performance to that of the MICASVM algorithm. The two right panels in Figure 5 show similar observations for the two measures on the six algorithms (the two least stable algorithms, NMFSVM and LDA, are excluded) under the 10-fold CV. The MICASVM algorithm still maintains the highest stability when more prior knowledge is available in classification. Although the relative stabilities of the PCASVM, ICASVM, SVM, PCALDA, and MICALDA algorithms have the same ‘ordering’ as under the 50% HOCV, all five algorithms have smaller relative stability values because more prior knowledge is available in classification under the 10-fold CV.
Figure 5. Algorithmic stability analysis. The algorithm stability index and relative stability values under the 100 trials of 50% HOCV and 10fold CV. The MICASVM algorithm has the largest stability among all eight algorithms, and MICALDA has the closest performance to that of the MICASVM algorithm.
Optimal level threshold selection
A remaining question is how to determine the optimal level threshold in MICA so that the subsequent SVM classifier achieves its best performance. It is reasonable to believe that an optimal level threshold will help capture the important local and global features of the original data in the metasamples. We employ the log-condition number of the mixing matrix A, α = log(λ_{max}/λ_{min}), to estimate the status of global and local feature capturing, where λ_{max} and λ_{min} are the maximum and minimum singular values of the mixing matrix. A larger log-condition number indicates better global and local feature capturing. A level threshold is counted as ‘optimal’ if the log-condition number of the mixing matrix is the largest. If the log-condition numbers from two level thresholds are numerically the same, the lower level threshold (which is required to be > 1) is counted as the optimal one. For instance, the largest and 2^{nd} largest α values are achieved at τ=1 and τ=7 respectively on the ovarianqaqc data. However, our algorithm achieved its best average classification performance at τ=7, where the average classification rate, sensitivity, and specificity are 99.74%, 99.73%, and 99.76% respectively (the average classification rate is 95.28% at τ=1).
Figure 6 shows the MICASVM average classification rates and the corresponding α values under the 100 trials of 50% HOCV on the colorectal, cirrhotic, and prostate data, for level thresholds from 1 to 11 in MICA. The average classification rates decrease, in some cases significantly, for level thresholds τ≥6, where the corresponding log-condition numbers show some degree of ‘stability’. However, the level threshold corresponding to the maximum log-condition number seems to indicate optimal or near-optimal classification performance in our experiment. Furthermore, we found that the MICASVM algorithm’s performance may decrease with level thresholds that are too coarse (e.g., τ=1) or too fine (e.g., τ≥8). Since the optimal level threshold selection method may increase the computing complexity of classification because of the maximum log-condition number computation, in practice we suggest the empirical level threshold 2≤τ≤L/3 for its robust performance and automatic denoising property. In addition, we discuss possibly optimal wavelet selection for MICASVM under different cross validations in additional file 2.
Figure 6. Optimal level threshold selections. Average classification rates and corresponding log-condition numbers at 11 level thresholds on the colorectal, cirrhotic and prostate data under the 100 trials of 50% HOCV.
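The selection rule above can be sketched in Python as follows. This is an illustration under stated assumptions, not the study’s implementation: the function names are hypothetical, the natural logarithm is assumed because the base is not specified, thresholds τ≤1 are excluded outright (our reading of the “> 1” requirement), and “numerically the same” is modelled with a small tolerance `tol`.

```python
import math


def log_condition_number(sv_max, sv_min):
    """alpha = log(sv_max / sv_min) for the extreme singular values of the
    mixing matrix (natural logarithm assumed)."""
    return math.log(sv_max / sv_min)


def select_level_threshold(alphas, tol=1e-9):
    """Pick the level threshold with the largest log-condition number.

    alphas maps a level threshold tau to its log-condition number alpha.
    Thresholds tau <= 1 are skipped, and alphas that are equal within
    `tol` resolve to the lower tau.
    """
    candidates = {tau: a for tau, a in alphas.items() if tau > 1}
    best = max(candidates.values())
    return min(tau for tau, a in candidates.items() if best - a <= tol)


# Made-up alpha values for illustration only (not results from the paper):
# tau=1 has the largest alpha but is excluded; tau=7 and tau=8 tie.
example = {1: 9.0, 2: 4.0, 7: 8.5, 8: 8.5}
chosen = select_level_threshold(example)
```

In this toy input the rule returns τ=7, the lower of the two tied thresholds once τ=1 is excluded.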
Additional file 2. Wavelet selection for MICASVM
Discussion
In this study, we present multiresolution independent component analysis (MICA), a multiresolution feature selection algorithm for effective feature selection on mass spectral data, propose a high-performance classification algorithm for heterogeneous proteomic profiles, and demonstrate its superiority by comparing it with nine peers. Our approach seeks reproducible high-accuracy diagnosis by treating an input profile as a whole biomarker from a machine-learning viewpoint. It shows great potential to move mass spectral proteomics technology into clinical routine, even if data reproducibility is not guaranteed. It is worth noting that independent component analysis is a necessary step for achieving good classification performance. We found that a similar multiresolution principal component analysis based SVM algorithm cannot reach performance comparable to that of our algorithm because of the loss of statistical independence in the feature selection. Although our methodology can achieve clinical-level disease diagnosis for mass spectra even if data reproducibility is not guaranteed, we do not intend to de-emphasize the importance of enhancing the reproducibility of mass spectral proteomic profiles, because of its potential in identifying reproducible biomarkers. In fact, a previous study [21] pointed out that data reproducibility may affect data analysis and introduce biases. For example, hierarchical clustering may give different results for mass spectra acquired on day one and for the same data a month later. However, given the proposed algorithm’s generality on heterogeneous data, it is also reasonable to expect exceptional performance on mass spectral data with robust reproducibility.
Conclusions
Our study suggests a new direction for accelerating mass spectral proteomic technologies into clinical routine. The novel concepts of global and local feature selection, multiresolution data analysis based redundant global feature suppression, and effective local feature extraction proposed in this study will also have positive impacts on large scale ‘omics’ data mining. The exceptional discriminative power demonstrated by MICA-based classifiers in multiclass proteomic data classification also contributes to early stage cancer diagnosis. It is interesting to find that the MICA-based methods can also be applied to achieve exceptional gene expression pattern classification and meaningful biomarker discovery [22]. In future work, in addition to further polishing our algorithm by comparing it with other state-of-the-art methodologies or data analysis tools [23], we are interested in investigating multiresolution independent component analysis based unsupervised or semi-supervised learning algorithms for proteomic pattern discovery by integrating the multiresolution feature selection with state-of-the-art clustering or semi-supervised learning algorithms. We also plan to generalize the corresponding methods to related topics such as gene subnetwork identification [24] and biomedical text classification.
Competing interests
The author declares that there is no competing interest.
Authors' contributions
HEY did all the work for this paper.
Acknowledgements
The author wants to thank the anonymous reviewers for their valuable comments in improving this manuscript.
This article has been published as part of BMC Systems Biology Volume 5 Supplement 2, 2011: 22nd International Conference on Genome Informatics: Systems Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/17520509/5?issue=S2.
References

1. de Godoy L, Olsen J, Cox J, Nielsen M, Hubner N, et al.: Comprehensive mass-spectrometry-based proteome quantification of haploid versus diploid yeast. Nature 2008, 455:1251-1255.
2. Dost B, Bandeira N, Li X, Shen Z, Briggs S, Bafna V: Shared Peptides in Mass Spectrometry Based Protein Quantification.
3. Cruz-Marcelo A, Guerra R, Vannucci M, Li Y, Lau C, Man T: Comparison of algorithms for preprocessing of SELDI-TOF mass spectrometry data. Bioinformatics 2008, 24(19):2129-2136.
4. Villanueva J, Lawlor K, Toledo-Crow R, Tempst P: Automated serum peptide profiling. Nat. Protoc. 2006, 1:880-891.
5. Shen C, Sheng Q, Dai J, Li Y, Tang H: On the estimation of false positives in peptide identifications using decoy search strategy.
6. Coombes KR, Morris JS, Hu J, Edmonson SR, Baggerly KA: Serum proteomics profiling – a young technology begins to mature. Nat. Biotechnol. 2005, 23:291-292.
7. Alexandrov T, Decker J, Mertens B, Deelder A, Tollenaar R, et al.: Biomarker discovery in MALDI-TOF serum protein profiles using discrete wavelet transformation. Bioinformatics 2009, 25(5):643-649.
8. Ressom HW, Varghese RS, Drake SK, Hortin GL, Abdel-Hamid M, Loffredo CA, Goldman R: Peak selection from MALDI-TOF mass spectra using ant colony optimization. Bioinformatics 2007, 23(5):619-626.
9. Jolliffe I: Principal component analysis. Springer, New York; 2002.
10. Mantini D, Petrucci F, Boccio P, Pieragostino D, Nicola M, et al.: Independent component analysis for the extraction of reliable protein signal profiles from maldi-tof mass spectra. Bioinformatics 2008, 24(1):63-70.
11. Hyvärinen A: Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks 1999, 10(3):626-634.
12. Brunet J, Tamayo P, Golub T, Mesirov J: Molecular pattern discovery using matrix factorization. Proc. Natl Acad. Sci. USA 2004, 101(12):4164-4169.
13. Kim H, Park H: Sparse nonnegative matrix factorizations via alternating nonnegativity-constrained least squares for microarray data analysis. Bioinformatics 2007, 23(12):1495-1502.
14. Han H: Nonnegative Principal Component Analysis for Mass Spectral Serum Profiles and Biomarker Discovery. BMC Bioinformatics 2010, 11:S1.
15. Mallat S: A wavelet tour of signal processing. Acad. Press, CA; 1999.
16. Vapnik V: Statistical Learning Theory. John Wiley, New York; 1998.
17. NCI Proteomics: http://home.ccr.cancer.gov/ncifdaproteomics
18. Hsu C, Lin C: A Comparison of Methods for Multiclass Support Vector Machines.
19. Nguyen D, Rocke D: Tumor classification by partial least squares using microarray gene expression data. Bioinformatics 2002, 18:39-50.
20. Fort G, Lambert-Lacroix S: Classification using partial least squares with penalized logistic regression. Bioinformatics 2005, 21(7):1104-1111.
21. Ransohoff DF: Bias as a threat to the validity of cancer molecular-marker research. Nat Rev Cancer 2005, 5(2):142-149.
22. Han H, Li X: Multiresolution Independent Component Analysis for High-Performance Tumor Classification and Biomarker Discovery. BMC Bioinformatics 2011, 12:S1.
23. Abeel T, Helleputte T, Van de Peer Y, Dupont P, Saeys Y: Robust biomarker identification for cancer diagnosis with ensemble feature selection methods. Bioinformatics 2009, 26:392-398.
24. Kim Y, Kim TK, Kim Y, Yoo J, You S, Lee I, Carlson G, Hood L, Choi S, Hwang D: Principal network analysis: Identification of subnetworks representing major dynamics using gene expression data. Bioinformatics 2011, 27(3):391-8.