Department of Bioengineering, Binghamton University, Binghamton, NY 13902, USA
Department of Mathematical Sciences, Binghamton University, Binghamton, NY 13902, USA
Department of Biochemistry, Rush University Medical Center, Chicago, IL 60612, USA
Department of Radiation Oncology, Massachusetts General Hospital and Harvard Medical School, Boston, MA 02114, USA
Department of Internal Medicine, Rush University Cancer Center, Rush University Medical Center, Chicago, IL 60612, USA
Abstract
Background
The primary objectives of this paper are: 1.) to apply Statistical Learning Theory (SLT), specifically Partial Least Squares (PLS) and Kernelized PLS (KPLS), to the universal "feature-rich/case-poor" (also known as "large
Results
Our results for PLS and KPLS showed that these techniques, as part of our overall feature reduction process, performed well on noisy microarray data. The best performance was an Area Under the Receiver Operating Characteristic (ROC) Curve (AUC) of 0.794 for classification of recurrence before or after 36 months, and a strong 0.869 AUC for classification of recurrence before or after 60 months. Kaplan-Meier curves for the classification groups were clearly separated, with
Conclusions
SLT techniques such as PLS and KPLS can effectively address difficult problems in analyzing biomedical data such as microarrays. The combinations with established biostatistical techniques demonstrated in this paper allow these methods to move from academic research into clinical practice.
Introduction
One of the most popular and challenging topics in bioinformatics research is gene selection from microarray data because it involves both statistical processing as well as biological interpretation. The statistical problems are daunting because of the large number of represented genes relative to the small number of samples. This provides a prime opportunity to overfit the data during the model building process. Biology is a significant component because identifying significant genes representative of a given clinical endpoint is a critical step toward understanding the biological process. Several consequences arise as a result of the statistical overfitting problem. Very large Receiver Operating Characteristic (ROC) Area Under the Curve (AUC) values can be achieved on both training and validation data sets, but the results provided by these trained Complex Adaptive Systems (CAS) frequently fail to generalize to data sets other than training and validation sets. Furthermore, these CAS system designs do not necessarily operate on similar data sets with larger representative samples. Different CAS solutions may produce different gene sets from the same set of microarray data. Consequently, any CAS should first attempt to achieve some sort of generalization ability. Secondly, because of the overfitting problem described above, each proposed feature (or gene) reduction CAS generally is based on a unique theoretical analysis, which means that how these separate CAS are connected is not well understood. Consequently, this difficulty results in the same problem stated above: different algorithms will generate different prognostic gene sets using the same microarray data. This means that developing an underlying theory for feature selection would help to understand these algorithms as well as classify which of these are the "most" useful for gene selection. Song
Background of lung cancer
Lung cancer is the leading cause of death in cancer patients worldwide. The American Cancer Society predicts that 156,940 people will fall victim to the disease in 2011, accounting for 27% of all cancer deaths
Data set description and modifications
The experiments were designed using the gene expression profiles of 442 lung adenocarcinomas compiled by Shedden
It is important to note that in Dobbin
Gene expression profiles of all samples were quantified using the Affymetrix Human Genome U133A GeneChip. The resulting CEL files generated at each of the four institutions were quantile normalized using the NCI_U133A_61L array as a reference. Final expression values were calculated using the DChip software (build version February 2006) with the default settings. Each sample is characterized by 22,283 probes/genes (also referred to as features in this paper) as well as a host of clinical covariates including age, gender, and T/N cancer stage. A few minor discrepancies were found in the probe data obtained from the caArray website. First, probe 207140_at contained expression values of "NA" for all patients in the study. To mitigate this problem, the data corresponding to this probe were removed prior to our analysis. Second, patients Moff 18351, Moff 2362A and Moff 3009D did not have expression values for the 222086_s_at probe. In lieu of removing this probe entirely, these patients were assigned an expression value equal to the mean (18.37114) of that probe's expression values across all other patients. The CEL files, DChip normalized expression values and clinical information for all patients involved in this study are available through the caArray website
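The probe-level fixes described above (dropping an all-"NA" probe and mean-imputing the isolated missing values) can be sketched on a toy matrix; the values here are illustrative, not the actual caArray data:

```python
import numpy as np

# Toy expression matrix (probes x patients) with missing values, mirroring
# the fixes described above: drop all-NA probes, mean-impute isolated NAs.
X = np.array([[1.0, 2.0, 3.0, 4.0],
              [np.nan, np.nan, np.nan, np.nan],   # probe missing for all patients
              [5.0, np.nan, 7.0, 6.0]])           # probe missing for one patient

keep = ~np.all(np.isnan(X), axis=1)               # remove all-NA probes entirely
X = X[keep]
row_means = np.nanmean(X, axis=1)                 # per-probe mean over observed values
nan_r, nan_c = np.where(np.isnan(X))
X[nan_r, nan_c] = row_means[nan_r]                # impute remaining NAs with probe mean
```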
Experimental design for 3 and 5 Years
To address the clinical issue of determining risk of recurrence delineated above, two classification experiments were designed. The first experiment classified NSCLC patients as "high risk" if cancer was likely to recur within 3 years of surgery and "low risk" otherwise. The 3-year cutoff was chosen because the majority of patients who do relapse will do so within the first 3 years
The first experiment contained 295 patients obtained from the Shedden
The second experiment was composed of 257 patients, which were also obtained from the Shedden
Methods
Overview of Feature Reduction/Classification Process
Microarray data sets have a significant feature-rich/case-poor problem, which can lead to overfitting (i.e., models that produce excellent results on the training data may exist, none of which are valid or perform well on the test data) unless the number of features is significantly reduced prior to the generation of any classification or prediction model. The objective of this three-step process is to identify the significant features that are most useful in producing an accurate classification or prediction model. The process of feature reduction/classification is depicted in Figure
Feature (probe) reduction process
Feature (probe) reduction process. Description of feature reduction process
Coarse Feature Reduction
The automated CFR employs a simple two-sample
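Per the Conclusions, CFR combines a basic t-test with variance pruning. A minimal sketch of the t-test portion follows; the significance cutoff and synthetic data are illustrative, not the authors' actual settings:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 40))            # 100 probes x 40 samples
y = np.array([0] * 20 + [1] * 20)         # two outcome classes
X[:5, y == 1] += 2.0                      # make the first 5 probes informative

# Two-sample t-test per probe; keep only probes clearing a significance cutoff.
t_stat, p_val = stats.ttest_ind(X[:, y == 0], X[:, y == 1], axis=1)
keep = np.where(p_val < 0.01)[0]
```

On this toy data, the five informative probes survive the cutoff while nearly all noise probes are discarded.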
Partial Least Squares
This section contains a brief, heuristic overview of Partial Least Squares (PLS). PLS is an extension of least squares regression (LSR). In LSR, the response variable
where:
•
•
•
For this microarray data set, we began with 271 features after CFR and reduced this set to a minimum of 1 latent variable and a maximum of 5 latent variables (see Results section). Therefore, the principal advantage of PLS for a problem of this type is its ability to handle a very large number of features: a fundamental problem of a feature-rich/case-poor data set. PLS then performs a least-squares fit (LSF) onto these latent variables, where this LSF is a linear combination that is highly correlated with the desired
• PLS algorithms are very resistant to overfitting, when compared to LSR, and are fast and reasonably easy to implement.
• For most problems with few data points and high dimensionality where PLS excels, a least squares solution may not be possible due to the singularity problem.
• PLS regression maps the original data into a lower-dimensional space using a
• What makes PLS especially interesting for biomedical and data mining applications is its extension using kernels, which leads to kernelized PLS (KPLS), similar to the treatment in SVM.
• PLS may be considered an improved form of principal component analysis (PCA).
 The first key difference from PCA is that PLS computes an orthogonal factorization of the input vector
 The second key difference from PCA is that the least squares model for KPLS is based on approximation of the input and response data, not the original data.
 PLS and PCA use different mathematical models to compute the final regression coefficients. Specifically, the difference between PCA and PLS is that a new set of basis vectors (similar to the eigenvectors of
An algorithm of PLS paradigm follows:
1. Let:
2. For
(a) Compute direction of maximum variance
(b) Project
(c) Normalize
(d) Deflate
(e) Deflate
(f) Normalize
3. Finally, compute the regression coefficients using latent variables:
where:
•
•
•
Deflation
Deflation. The geometric interpretation of the 'deflation' step in the PLS algorithm. Deflation effectively removes one dimension by projecting the data onto a subspace that is one dimension lower than that of the current data matrix, and orthogonal to the vector
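The per-component loop sketched in the algorithm above (compute the weight direction, project, normalize, deflate) can be written out for a single response variable. This is a standard NIPALS-style PLS1 formulation with centering included; details may differ from the authors' exact implementation:

```python
import numpy as np

def pls1(X, y, n_components):
    """Minimal NIPALS-style PLS1 regression (illustrative sketch)."""
    X = X - X.mean(axis=0)                # center, as is conventional for PLS
    y = y - y.mean()
    W, P, q = [], [], []
    for _ in range(n_components):
        w = X.T @ y                       # direction of maximum covariance with y
        w /= np.linalg.norm(w)            # normalize the weight vector
        t = X @ w                         # latent score: project data onto w
        tt = t @ t
        p_load = X.T @ t / tt             # X loadings
        q_load = (y @ t) / tt             # y loading
        X = X - np.outer(t, p_load)       # deflate X: remove explained variation
        y = y - t * q_load                # deflate y
        W.append(w); P.append(p_load); q.append(q_load)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    return W @ np.linalg.solve(P.T @ W, q)    # coefficients in feature space

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 200))            # feature-rich / case-poor, as in microarrays
y = X[:, 0] - 2.0 * X[:, 1] + 0.01 * rng.normal(size=30)
B = pls1(X, y, n_components=3)
y_hat = (X - X.mean(axis=0)) @ B
```

Note that a least-squares fit would fail outright here (200 features, 30 samples), while PLS handles the singular design without difficulty.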
Kernelized Partial Least Squares
Nonlinear relationships between variables may be found by embedding this data into a kernel induced feature space. See
The mapping will "recode" the data set as:
This mapping of the data set is from nonlinear input space to a linear feature space. That is, although the environment data representation in the input
Adding this kernel-induced capability to the PLS approach means that a real-time, nonlinear optimal training method now exists which can be used to perform computer-aided diagnosis. A second advantage of this approach is that a kernel function
Computationally, kernel mappings have the following important properties: (1) they provide access to an exceptionally high-dimensional (even infinite-dimensional) and, consequently, very flexible feature space, at a correspondingly low time and space computational cost; (2) they solve a convex optimization problem without becoming "trapped" in local minima; and, more importantly, (3) the approach decouples the design of the algorithm from the specification of the feature space. Therefore, both learning algorithms and specific kernel designs are easier to analyze.
The algorithm used to develop the KPLS model, is given below. Details can be found in
1. Let
2. For
(a)
(b)
(c)
(d)
(e)
3. Finally, compute the regression coefficients using:
•
•
•
4. The regression equation then becomes:
Note
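A sketch of the dual (kernel) form of the algorithm above, in the style of Rosipal and Trejo, follows. Kernel centering is omitted for brevity, and the RBF kernel and toy data are illustrative rather than the paper's actual configuration:

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kpls_fit(K, y, n_components):
    """Dual-form (kernel) PLS, Rosipal & Trejo style; illustrative sketch."""
    n = K.shape[0]
    Kd = K.copy()
    Yd = y.astype(float).reshape(-1, 1).copy()
    T, U = [], []
    for _ in range(n_components):
        u = Yd[:, 0].copy()               # score direction from the response
        t = Kd @ u                        # kernel-space latent score
        t /= np.linalg.norm(t)
        c = Yd.T @ t                      # response loading
        u = (Yd @ c).ravel()
        u /= np.linalg.norm(u)
        T.append(t); U.append(u)
        P = np.eye(n) - np.outer(t, t)    # deflate kernel matrix and response
        Kd = P @ Kd @ P
        Yd = Yd - np.outer(t, t @ Yd)
    T, U = np.array(T).T, np.array(U).T
    # Dual regression coefficients: training predictions are K @ alpha.
    return U @ np.linalg.solve(T.T @ K @ U, T.T @ y.reshape(-1, 1))

rng = np.random.default_rng(3)
X = rng.uniform(-2, 2, size=(60, 2))
y = X[:, 0] ** 2 + X[:, 1]                # a nonlinear target
yc = y - y.mean()
K = rbf_kernel(X, X, sigma=1.0)
alpha = kpls_fit(K, yc, n_components=5)
y_hat = (K @ alpha).ravel()
```

The nonlinear relationship, inaccessible to linear PLS in the 2-dimensional input space, is recovered through the kernel-induced feature space.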
Evolutionary Programming derived KPLS machines
The particular KPLS kernel types and kernel parameters were derived using an evolutionary process based on the work of Fogel
This process iteratively generates, evaluates, and selects candidates to produce a near-optimal solution without using gradient information, and is therefore well suited to the task of simultaneously generating both the KPLS model architecture (kernel) and parameters. Figure
Evolutionary Programming
Evolutionary Programming. Shown is the process of the Evolutionary Programming optimization technique utilized to find the optimal kernel parameters. The process creates an initial population of candidate solutions (chromosomes) which undergo a stochastic search for the optimal parameters through the subprocesses of mutation and tournament selection of the 'most-fit' genes.
1.
2.
where
The second step of this mutation process comprises updating each configurable parameter for all elements of the evolving population. If we let the vector γ_i denote these elements for each individual member of the population, this update is accomplished as follows:
Here
3.
For more details on the EP process, refer to our previous work
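A minimal sketch of such an EP loop follows, with log-normal self-adaptation of the per-parameter mutation step sizes. Truncation-style selection stands in here for the tournament selection used in the paper, and the sphere fitness is purely illustrative:

```python
import numpy as np

def ep_optimize(fitness, dim, pop_size=20, n_gen=100, seed=0):
    """Minimal Evolutionary Programming loop (illustrative sketch):
    Gaussian mutation with self-adaptive step sizes, then selection."""
    rng = np.random.default_rng(seed)
    pop = rng.uniform(-5, 5, size=(pop_size, dim))    # candidate parameter vectors
    sig = np.ones((pop_size, dim))                    # per-parameter step sizes
    tau = 1.0 / np.sqrt(2.0 * np.sqrt(dim))           # self-adaptation learning rate
    for _ in range(n_gen):
        # Mutate: update step sizes log-normally, then perturb parameters.
        child_sig = sig * np.exp(tau * rng.normal(size=sig.shape))
        child = pop + child_sig * rng.normal(size=pop.shape)
        # Select the fittest among parents + offspring (truncation selection).
        all_pop = np.vstack([pop, child])
        all_sig = np.vstack([sig, child_sig])
        fit = np.array([fitness(x) for x in all_pop])
        best_idx = np.argsort(fit)[::-1][:pop_size]
        pop, sig = all_pop[best_idx], all_sig[best_idx]
    return pop[np.argmax([fitness(x) for x in pop])]

# Maximize a simple concave fitness (peak at the origin); in the paper the
# fitness would instead be the cross-validated performance of a KPLS model.
best = ep_optimize(lambda x: -np.sum(x ** 2), dim=3)
```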
Support Vector Machine and its capacity to reach the global optimum
The KPLS results were validated by using another kernel-based Statistical Learning Theory model called a Kernelized Support Vector Machine (KSVM). SVMs were developed by Vapnik
The discussion below provides the theoretical explanation for why SVMs can always be trained to a global minimum, and thereby should provide better diagnostic accuracy when compared with neural networks trained by back-propagation.
Assume there exist
where,
The first term on the right hand side,
Empirical risk is a measure of the error rate for the training set for a fixed, finite number of observations. This value is fixed for a particular choice of
In contrast to a neural network, in an SVM design and implementation, not only is the empirical risk minimized, but the VC confidence interval is also minimized by using the principles of structural risk minimization (SRM). Therefore, SVM implementations simultaneously minimize the empirical risk as well as the risk associated with the VC confidence interval, as defined in the above expression. The bound in (9) also shows that as
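The practical consequence of this convexity can be illustrated with scikit-learn's SVC, used here as a stand-in implementation (not the one used in the paper), on a synthetic problem: refitting on the same data always recovers the identical global solution, with no dependence on initialization.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # synthetic two-class problem

# The SVM training objective is convex: fitting twice on the same data
# yields the same global optimum, unlike back-propagation training, whose
# result depends on the random initial weights.
clf1 = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
clf2 = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
```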
Measures of similarity for classification provided by various kernels
Understanding similarity as applied to KPLS and KSVM often provides additional insight into proper kernel selection. Therefore, we now consider kernel functions and their application to KPLS and KSVMs. KPLS and KSVM solutions in nonlinear, nonseparable learning environments utilize kernel-based learning methods. Consequently, it is important to understand the practical implications of using these kernels. Kernel-based learning methods are those which use a kernel as a nonlinear similarity measure to perform comparisons. That is, these kernel mappings are used to construct a decision surface that is nonlinear in the input space but has a linear image in the feature space. To be a valid mapping, these inner-product kernels must be symmetric and also satisfy Mercer's theorem
A kernel function should yield a higher output from input vectors which are very similar than from input vectors which are less similar. An ideal kernel would provide an exact mapping from the input space to a feature space which was a precise, separable model of the two input classes; however, such a model is usually unobtainable, particularly for complex, realworld problems, and those problems in which the input vector provided contains only a subset of the information content needed to make the classes completely separable. As such, a number of statisticallybased kernel functions have been developed, each providing a mapping into a generic feature space that provides a reasonable approximation to the true feature space for a wide variety of problem domains. The kernel function that best represents the true similarity between the input vectors will yield the best results, and kernel functions that poorly discriminate between similar and dissimilar input vectors will yield poor results. As such, intelligent kernel selection requires at least a basic understanding of the source data and the ways different kernels will interpret that data.
Some of the more popular kernel functions are the (linear) dot product (11), the polynomial kernel (12), the Gaussian Radial Basis Function (GRBF) (13), and the Exponential Radial Basis Function (ERBF) (14), which will be discussed below.
The dot and polynomial kernels are given by,
respectively; both use the dot product (and therefore the angle between the vectors) to express similarity; however, the input vectors to the polynomial kernel must be normalized (
Dot product kernel
Dot product kernel. The outputs of the dot product kernel as functions of the angle between vectors. Four functions are depicted in solid blue, long-dashed red, short-dashed green, and dotted purple curves, corresponding to the cases where the product of the norms of the
Polynomial Kernel
Polynomial Kernel. The outputs of the polynomial kernel as functions of the cosine of the angle between vectors. Four functions are depicted in dashed-double-dotted blue, dashed-single-dotted red, dotted green, and dashed purple curves, corresponding to the cases where the polynomial degree is 1, 2, 3, and 4, respectively.
The Gaussian and Exponential RBF kernels are given by:
respectively.
The Gaussian and Exponential RBF kernels use the Euclidean distance between the two input vectors as a measure of similarity instead of the angle between them (see Figures
Gaussian RBF kernel
Gaussian RBF kernel. The outputs of the Gaussian radial basis function kernel as functions of the Euclidean distance between vectors. Four functions are depicted in dashed-double-dotted blue, long-dashed red, dashed green, and dotted purple curves, corresponding to the cases where sigma is 0.5, 0.7, 1.0, and 2.0, respectively.
Exponential RBF kernel
Exponential RBF kernel. The outputs of the exponential radial basis function kernel as functions of the Euclidean distance between vectors. Four functions are depicted in dashed-double-dotted blue, long-dashed red, dashed green, and dotted purple curves, corresponding to the cases where sigma is 0.5, 0.707, 1.0, and 2.0, respectively.
Since u 
It is clear from Figures
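The four kernels can be written directly in the forms given above. The polynomial kernel shown here uses the common (u·v + 1)^d variant, which may differ from the exact form in the paper's equations; the sample vectors are illustrative:

```python
import numpy as np

def dot_kernel(u, v):
    return u @ v                                       # linear similarity

def poly_kernel(u, v, d=2):
    return (u @ v + 1.0) ** d                          # one common polynomial form

def grbf_kernel(u, v, sigma=1.0):
    return np.exp(-np.sum((u - v) ** 2) / (2.0 * sigma ** 2))   # Gaussian RBF

def erbf_kernel(u, v, sigma=1.0):
    return np.exp(-np.linalg.norm(u - v) / (2.0 * sigma ** 2))  # Exponential RBF

u = np.array([1.0, 0.0])
v_near = np.array([0.9, 0.1])     # similar to u
v_far = np.array([-1.0, 0.5])     # dissimilar to u
```

Each kernel scores the similar pair (u, v_near) above the dissimilar pair (u, v_far), the defining behavior of a useful similarity measure.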
Using PLS, KPLS, and SVM in clinical research
While the methods covered in this paper offer statistically significant improvements in diagnostic and prognostic biomedical applications, there has been great difficulty in moving advances such as these into clinical research. The statistics used to evaluate the performance of these techniques are not readily converted into direct clinical information that may help in patient care or pharmaceutical research. To address this, we have devised a framework to combine these techniques with well-accepted and understood traditional biostatistical methods: the Cox Proportional Hazard model and the Kaplan-Meier (KM) curve. These two techniques each help address the question of how important a particular parameter is in evaluating risk/survival. The following subsections give a basic overview of how Cox and KM can be combined with our techniques. For simplicity, such a combination is only described with PLS, though it could just as easily be done with KPLS or SVM.
PLS and Kaplan-Meier curves
Developed in the 1950s, the KM curve is the gold standard in survival analysis
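The KM estimator itself is straightforward to sketch. This toy implementation handles right-censoring but omits the refinements (confidence bands, group comparisons via the log-rank test) that a clinical analysis would need:

```python
import numpy as np

def kaplan_meier(times, events):
    """Kaplan-Meier survival estimator (illustrative sketch).
    times: follow-up time per patient; events: 1 = event observed, 0 = censored.
    Returns a list of (event_time, survival_probability) pairs."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    order = np.argsort(times)
    times, events = times[order], events[order]
    surv, s = [], 1.0
    for t in np.unique(times[events == 1]):
        at_risk = np.sum(times >= t)                      # patients still at risk
        d = np.sum((times == t) & (events == 1))          # events at time t
        s *= 1.0 - d / at_risk                            # product-limit update
        surv.append((t, s))
    return surv

# Six hypothetical patients: follow-up times and event indicators (0 = censored).
curve = kaplan_meier([5, 6, 6, 2, 4, 4], [1, 1, 0, 1, 0, 1])
```

In the framework described here, the curve would be computed separately for the PLS-predicted "high risk" and "low risk" groups and the separation assessed.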
PLS and Cox Hazard Ratios
Another common survival analysis technique is the Cox Proportional Hazard model
Results
The goal of the experiments discussed herein was to derive models from the microarray data to classify each sample as belonging to either the class of recurrent or nonrecurrent patients. Nonrecurrent samples are those belonging to patients whose cancer, after treatment, did not recur before the given cutoff period. Patients whose cancer did recur before the cutoff period are considered to belong to the recurrent class. Two separate experiments were performed, with cutoff periods of 36 and 60 months respectively.
As mentioned in the Methods section, the data were preprocessed using CFR, followed by FFS, and finally classification model building and evaluation.
Coarse Feature Reduction
For the 36 month classification experiment, CFR was used to reduce the original number of features (probes) from 22,282 to 2,675 using a hard cutoff
Fine Feature Selection/Classification
Fine Feature Selection using Partial Least Squares
In this section, we use the AUC value as the fitness metric to evaluate the relative worth of the classification model. Higher AUC values are indicative of better classifiers, with an AUC value of 1.0 indicating a perfect classifier, which is arguably impossible for any nontrivial classification task.
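AUC can be computed without constructing the ROC curve explicitly, via its equivalence to the Mann-Whitney statistic; the example scores below are synthetic:

```python
import numpy as np

def auc(scores, labels):
    """AUC via the Mann-Whitney identity: the probability that a randomly
    chosen positive case is scored above a randomly chosen negative case
    (ties count half)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos, neg = scores[labels == 1], scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))
```

A perfect classifier scores 1.0, a reversed one 0.0, and an uninformative one 0.5, matching the fitness interpretation used throughout this section.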
The FFS process utilizes the weight vector of the first latent variable generated by the Linear PLS (LPLS) algorithm to ascertain feature importance. The most important features (those with the largest corresponding weight vector components) are ranked highest and features with lower corresponding components are discarded. This step, called Fine Feature Selection, provides a ranking of importance, which means the magnitude of each feature's respective component is directly correlated with its predictive power in the model.
The FFS process builds this "importance metric" by iterating the analysis of the weight vectors of randomly assigned training folds 10,000 times, employing three sensitivity settings, where these sensitivities score the top 20, 30, and 150 most influential performers for each of their respective 10,000 runs, based on each feature's weight in the LPLS weight vector. For example, if 'Age' has the largest component and 'Sex' has the second largest under the top-30 sensitivity setting, the score for 'Age' would be 30 and that for 'Sex' would be 29. For each run, the data are split randomly into training and validation folds. These data are normalized, then analyzed using Linear PLS; the weight vector is extracted and sorted, and 'winning' features have their scores updated by position.
In each of the three settings, a number,
where
In our study, we selected 361 and 102 features using this FFS process for the 36- and 60-month experiments respectively, from the 594 and 212 features that were selected by CFR.
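The fold-resampling and position-scoring scheme described above can be sketched as follows. The run count and single sensitivity setting are scaled down from the paper's 10,000 runs and top-20/30/150 settings, and X^T y stands in for the first LPLS weight vector (to which it is proportional up to normalization):

```python
import numpy as np

def ffs_scores(X, y, n_runs=200, top_k=10, seed=0):
    """Score features by how highly they rank in the first PLS weight vector
    over random training folds (scaled-down illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    scores = np.zeros(p)
    for _ in range(n_runs):
        idx = rng.permutation(n)[: int(0.7 * n)]     # random training fold
        Xt = X[idx] - X[idx].mean(axis=0)
        yt = y[idx] - y[idx].mean()
        w = np.abs(Xt.T @ yt)                        # |first LPLS weight vector|
        ranked = np.argsort(w)[::-1][:top_k]
        scores[ranked] += np.arange(top_k, 0, -1)    # best gets top_k, k-th gets 1
    return scores

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 50))
y = X[:, 0] + X[:, 1] - X[:, 2] + 0.1 * rng.normal(size=60)
scores = ffs_scores(X, y)
top3 = set(np.argsort(scores)[::-1][:3].tolist())
```

Features that carry real signal accumulate large scores across folds, while features that rank highly only by chance do not, which is the noise-filtering behavior claimed for FFS.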
Comparisons using PLS classification
As noted, we compared the performance of four separate models based on different data: LPLS and KPLS with a polynomial kernel (KPLS-Poly), applied to the Coarse Feature Reduced (CFR) data and to the Fine Feature Selected (FFS) data respectively (the FFS data is actually processed by both CFR and FFS).
• We set out to determine which model produced the most accurate prediction of recurrence.
• We also sought to determine whether the data were linear or nonlinear, as indicated by which class of model yielded better results: LPLS or KPLS with nonlinear kernels.
• Finally, we sought to determine the effectiveness of our
We found that both the 36-month and 60-month data sets were inherently linear in nature, meaning that LPLS gave better AUC values on validation folds. These results can be seen in Table
Model Comparison. The comparison of optimal performance values and number of latent variables for three independent models on the 36- and 60-month data.

(CFR data) Model          Top Validation AUC Value (36 mo/60 mo)    Number of Latent Variables (36 mo/60 mo)
LPLS                      0.791/0.831                               3/2
KPLS-Poly (Degree = 1)    0.784/0.830                               3/1
SVM                       0.78/-                                    -
The best performance was seen with LPLS, outperforming the nonlinear SVM and KPLS techniques in AUC. The number of latent variables required for the PLS-based techniques was no more than three for both data sets.
In addition to these findings, the number of latent variables required to reach optimal performance by LPLS and KPLS-Poly, when applied to the FFS-processed data, was roughly the same (see Figures
36-mo. Training AUC Values vs. Latent Variables
36-mo. Training AUC Values vs. Latent Variables. Training AUC values plotted against the number of latent variables for the 36-month experiment.
60-mo. Training AUC Values vs. Latent Variables
60-mo. Training AUC Values vs. Latent Variables. Training AUC values plotted against the number of latent variables for the 60-month experiment.
The analysis of the efficacy of our
CFR and FFS Comparison. The comparison of model performance on data from the Fine Feature Selection process and from the Coarse Feature Reduction.

Model                     Top Validation AUC Value, CFR data (36 mo/60 mo)    Top Validation AUC Value, FFS data (36 mo/60 mo)
LPLS                      0.791/0.831                                         0.794/0.869
KPLS-Poly (Degree = 1)    0.784/0.830                                         0.780/0.711
The FFS process enhanced performance only for LPLS, while KPLS-Poly suffered on both the 36- and 60-month data.
SVM Verification of KPLS polynomial results
The 36-month KPLS-Poly AUC of 0.784 was unexpected when compared with the LPLS AUC of 0.791, because these classification problems are generally nonlinear. We therefore validated this result with an independent analysis using SVMs with several kernels on the exact same data set and cross-validation process. Specifically, the data were normalized and formatted for use with LibSVM
Kaplan-Meier and Cox
KM curves for both PLS using 36 and 60 months can be seen in Figures
PLS at 36 Month Threshold
PLS at 36 Month Threshold. Kaplan-Meier curve of PLS-predicted groups using the 36-month threshold
PLS at 60 Month Threshold
PLS at 60 Month Threshold. Kaplan-Meier curve of PLS-predicted groups using the 60-month threshold
Conclusions
Our microarray analysis and information extraction method comprised three basic components drawing from Statistical Learning Theory: 1.) Coarse Feature Reduction, 2.) Fine Feature Selection and 3.) Classification.
In Coarse Feature Reduction, the original 22,282 probes were reduced to 594 for the 3-year cutoff (97.5% reduction) and to 212 for the 5-year cutoff (99.04% reduction) using basic t-test and variance pruning techniques. Fine Feature Selection further reduced the number of features to 361 for the 36-month and 102 for the 60-month data sets (a further reduction of 39.2% and 51.9%, respectively). The FFS process has been demonstrated to reduce the noise in the data by filtering out noisy features from the data set produced by the CFR process. By implementing the FFS process in our analysis, we were able to enhance the performance of our classifier.
After the FFS process, a classification comparison was made on the refined data. The optimal classification performance of LPLS was observed at 3 latent variables and 2 latent variables for the 36- and 60-month experiments, respectively. Similar results were obtained, with a reduction to 3 and 1 latent variables, when using LPLS on data refined only by CFR. The Area Under the Curve (AUC) measure of performance varied from 0.791 to 0.869, depending upon the particular LPLS, KPLS, or SVM model used (see Tables
This research also provided a secondary and clinically important result, which is that the improved SLT methods/paradigms can be integrated into the widely accepted and well understood traditional biostatistical Cox Proportional Hazard model and the KM methods. For example, using the SLT paradigms as preprocessors for KM, the resultant probability vs. survival time categories have a very significant difference (
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
WHL directed the overall effort and WHL and XQ developed the new and refined existing (as appropriate) statistical learning theory and complex adaptive systems approaches employed in this paper. WSF, DEM, CTP and JFPR were involved in the experimental design used to ascertain the efficacy of these SLT algorithms in assessing the treatment of nonsmall cell lung cancer. WHL, WSF, DEM, XQ, CTP and JFPR were involved in the results analysis and provided many useful scientific insights. YD coordinated and directed the whole project. YD, JYY and JAB provided the data sets and provided clinical insight/analysis for these data sets. All coauthors participated in the final analysis and reviews of the results of these experiments and agreed on the content of the paper.
Acknowledgements
The research of YD was supported in part by the grant of NIH/NCMHD/RIMI P20MD002725. The research of XQ was supported in part by Binghamton University Harpur College Dean's New Faculty Startup Fund. The research of CTP and JFPR was supported in part by a grant to the State University of New York at Binghamton from the Howard Hughes Medical Institute through the Precollege and Undergraduate Science Education Program.
This article has been published as part of