Department of Mathematical Sciences, University of Massachusetts, Lowell, MA, USA

Division of Medical Oncology, Department of Medicine, University of Colorado Denver School of Medicine, Anschutz Medical Campus, Aurora, CO, USA

Abstract

Background

Molecular classification of tumors can be achieved by global gene expression profiling. Most machine learning classification algorithms furnish global error rates for the entire population. A few algorithms provide an estimate of the probability of malignancy for each queried patient, but the accuracy of these estimates is unknown.

Results

We devise a new learning method that implements: (i) feature selection using the k-TSP algorithm and (ii) classifier construction by local minimax kernel learning. We test our method on three publicly available gene expression datasets and achieve a significantly lower error rate for a substantial, identifiable subset of patients. Our final classifiers are simple to interpret, and they can make predictions on an individual basis with an individualized confidence level.

Conclusions

Patients whom the classifiers confidently predict to have cancer can receive immediate and appropriate treatment, while patients confidently predicted to be healthy are spared unnecessary treatment. We believe that our method can be a useful tool for translating gene expression signatures into clinical practice for personalized medicine.

Background

As developing gene expression signatures from microarray data becomes a routine strategy for predicting clinical outcomes or classifying molecular tumor subtypes, computational methods that are capable of extracting

Motivated by these computational challenges, we devise a novel statistical framework for personalized prediction with accurate and simple decision rules from microarray gene expression data

A local minimax learning algorithm may be applied directly to (i) raw predictor data or (ii) predictor data of reduced dimensionality after feature selection via virtually any machine learning algorithm. Although we believe local minimax learning will compete favorably with many machine learning algorithms, case (i) requires solving difficult optimization problems (which we hope future computational research will resolve) to implement the techniques of optimal fusion and optimal local kernel shape determination derived in

Our framework consists of two major steps: 1) feature selection and 2) prediction and error estimation. We used the

Methods

Feature selection method

We used the k-TSP (k top scoring pairs) algorithm for feature selection. For each pair of genes (i, j), k-TSP computes the score Δ_{ij} = |P(X_{i} < X_{j} | class 0) − P(X_{i} < X_{j} | class 1)|, the between-class difference in the probability that gene i is expressed below gene j, and selects the k disjoint pairs with the largest scores.

We note that feature selection uses the true classes of the patients. This affects the local minimax probability estimate, as discussed below, but the width of the confidence interval remains valid.
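The pair-scoring and pair-selection steps of k-TSP can be sketched as follows. This is a simplified illustration under our reading of the algorithm, not the authors' implementation; the function names and the toy expression matrix in the usage example are invented.

```python
import numpy as np

def tsp_scores(X, y):
    """Score every gene pair (i, j) by the k-TSP criterion:
    |P(X_i < X_j | class 0) - P(X_i < X_j | class 1)|.
    X: samples-by-genes expression matrix; y: 0/1 class labels."""
    X0, X1 = X[y == 0], X[y == 1]
    # Fraction of samples in each class where gene i is expressed below gene j
    p0 = (X0[:, :, None] < X0[:, None, :]).mean(axis=0)
    p1 = (X1[:, :, None] < X1[:, None, :]).mean(axis=0)
    return np.abs(p0 - p1)

def top_k_disjoint_pairs(scores, k):
    """Greedily pick the k highest-scoring pairs with no gene reused."""
    order = np.dstack(np.unravel_index(np.argsort(scores, axis=None)[::-1],
                                       scores.shape))[0]
    used, pairs = set(), []
    for i, j in order:
        if i < j and i not in used and j not in used:
            pairs.append((int(i), int(j)))
            used.update((i, j))
            if len(pairs) == k:
                break
    return pairs
```

For example, on a toy matrix where gene 0 is below gene 1 in every class-0 sample and above it in every class-1 sample, the pair (0, 1) receives the maximal score of 1 and is selected first.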

Prediction and error estimation

Suppose we have training pairs (x_{j}, y_{j}) and a query point x_{0}. Let distances in feature space be measured in units of the distance from x_{0} to the furthest training predictor x_{j}.

Since taking any linear combination of the kernels above is equivalent to assuming the model lies in the span of such kernels, we consider estimators of the predicted value at the query, F(x_{0}), that are affine in the observations y_{1}, ..., y_{N}, of the form

First assume the patient training vectors are ordered by their distance to the query x_{0}. Let B_{i} be the ball centered at x_{0} whose radius equals the distance from x_{0} to the i-th ordered training vector,

where _{1 }+ ... + _{i}_{β }

where the **σ*** = ^{2}).

In fact our estimator is identical to that furnished by first least-squares fitting the data in the above RKHS, penalized by a constant times the square of the reproducing kernel norm of the fitted function, and then evaluating the fitted function at the query x_{0}.
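Such a penalized least-squares (Tikhonov) fit with a Gaussian kernel can be sketched as follows. This is a generic kernel ridge regression illustration, not the paper's exact estimator; the penalty constant `lam` and the data in the usage example are invented.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Gaussian (RBF) kernel matrix between the row vectors of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def kernel_ridge_predict(X, y, x0, sigma, lam):
    """Penalized least squares in the RKHS of the Gaussian kernel:
    minimize sum_i (y_i - f(x_i))^2 + lam * ||f||_K^2,
    then evaluate the fitted f at the query x0."""
    K = gaussian_kernel(X, X, sigma)
    # Closed-form Tikhonov solution: alpha = (K + lam I)^{-1} y
    alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)
    k0 = gaussian_kernel(x0[None, :], X, sigma)[0]
    return float(k0 @ alpha)
```

As the penalty `lam` shrinks toward zero, the fit interpolates the training data, so the prediction at a training point approaches its observed label.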

A complete outline of reproducing kernel Hilbert space (RKHS) theory and the derivation from three stated local minimax results (for which readers may supply their own proofs) is given in the appendix. References to the full proofs of these three results are given in

Now the preceding results are valid provided the feature selection process does not depend on the patient classes y_{j}. When the classes are used in feature selection, as in our procedure, the probability estimate may be affected, but the width of the confidence interval remains valid.

Results

Microarray gene expression datasets

To demonstrate the utility of the local minimax algorithm, we tested it on three publicly available microarray gene expression data sets.

Leave-one-out cross-validation

For all the experiments, we employed leave-one-out cross-validation to assess classification performance. In brief, for each data set of size N, we left out one sample and performed feature selection and classifier construction on the remaining N − 1 samples. The classifier constructed from the N − 1 training samples was then used to predict the left-out sample. This procedure was repeated N times.
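The procedure can be sketched as follows. The `one_nn` base learner below is only a stand-in for the paper's k-TSP-plus-kernel pipeline, and the toy data in the usage example are invented; the key point is that the whole pipeline (feature selection included) runs inside each fold.

```python
import numpy as np

def leave_one_out(X, y, fit_and_predict):
    """Leave-one-out cross-validation: for each of the N samples, train on
    the other N - 1 (feature selection included inside fit_and_predict,
    to avoid selection bias) and predict the held-out sample.
    Returns the N held-out predictions."""
    N = len(y)
    preds = np.empty(N)
    for i in range(N):
        keep = np.arange(N) != i
        preds[i] = fit_and_predict(X[keep], y[keep], X[i])
    return preds

def one_nn(Xtr, ytr, xq):
    """Illustrative base learner: 1-nearest-neighbor prediction."""
    return ytr[np.argmin(((Xtr - xq) ** 2).sum(axis=1))]
```

On well-separated toy data, `leave_one_out(X, y, one_nn)` reproduces every label correctly, one held-out sample at a time.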

Predictability threshold and accuracy

When applying one of our classification algorithms to a given patient's predictor vector (gene expression profile), we obtain not only an estimate F of the probability that the patient has cancer (whence we classify the patient into class 0, normal, if F is less than 0.5, and otherwise into class 1, cancer), but also a 90% one-sided confidence interval for that patient's predicted probability. Since we are estimating a probability, −∞ and +∞ may be replaced in the confidence-interval expressions (3) by 0 and 1, respectively.

One use of such an algorithm is to divide patients into those for whom we may predict confidently and those for whom a lack of sufficient confidence or predictability may warrant further (possibly invasive) testing. This dichotomy is achieved using what we call a predictability threshold p, as follows: we fix a probability value p, for example p = 0.35. We then designate as confidently predictable at level p all patients whose confidence interval lies entirely inside the interval [0, p] or entirely inside the interval [1 − p, 1]. All other patients are considered non-confidently predictable. It follows that if a patient is confidently predictable (CP) with threshold p, then either [0, p] or [1 − p, 1] contains the true probability with 90% or greater confidence.
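This rule reduces to a simple interval check. The sketch below is a minimal illustration (the function name and outcome labels are ours), with the interval first clipped to [0, 1] as described above; the usage examples reuse the confidence intervals reported later for GCM patients 12, 209, and 244.

```python
def confidently_predictable(ci_lo, ci_hi, p=0.35):
    """A patient is confidently predictable (CP) at threshold p when the
    90% confidence interval for the predicted cancer probability lies
    entirely in [0, p] (confident normal) or in [1 - p, 1] (confident
    cancer). The interval is clipped to [0, 1] first, since it bounds a
    probability."""
    lo, hi = max(ci_lo, 0.0), min(ci_hi, 1.0)
    if hi <= p:
        return "confident-normal"
    if lo >= 1.0 - p:
        return "confident-cancer"
    return "not-confident"
```

For instance, a CI of [0.831, 1.0] is confidently cancer, [0, 0.226] is confidently normal, and [0.462, 1.0] straddles the undecided region and is not confidently predictable.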

Table 1. Kernel method: sigma = 0.5 and 0.7, threshold p = 0.35, adjustment e = 0.00 and 0.05

| **Data set** | **Sigma** | **Adjustment (e)** | **CP (%)** | **Error in CP (%)** | **Total Error (%)** |
| --- | --- | --- | --- | --- | --- |
| Leukemia | 0.5 | 0 | 36 (50.0%) | 0 (0%) | 3 (4.17%) |
| Leukemia | 0.5 | 0.05 | 25 (34.7%) | 0 (0%) | 3 (4.17%) |
| Leukemia | 0.7 | 0 | 43 (59.7%) | 0 (0%) | 3 (4.17%) |
| Leukemia | 0.7 | 0.05 | 36 (50.0%) | 0 (0%) | 3 (4.17%) |
| Prostate | 0.5 | 0 | 32 (36.4%) | 4 (12.5%) | 21 (23.9%) |
| Prostate | 0.5 | 0.05 | 30 (34.1%) | 3 (10.0%) | 21 (23.9%) |
| Prostate | 0.7 | 0 | 34 (38.6%) | 4 (11.8%) | 22 (25.0%) |
| Prostate | 0.7 | 0.05 | 25 (28.4%) | 2 (8.00%) | 22 (25.0%) |
| GCM | 0.5 | 0 | 154 (55.0%) | 6 (3.90%) | 39 (13.9%) |
| GCM | 0.5 | 0.05 | 134 (47.9%) | 3 (2.24%) | 39 (13.9%) |
| GCM | 0.7 | 0 | 160 (57.1%) | 6 (3.75%) | 42 (15.0%) |
| GCM | 0.7 | 0.05 | 134 (47.9%) | 5 (3.73%) | 42 (15.0%) |

Sigma (σ) denotes the bandwidth of the Gaussian kernel in units of distance in predictor feature space from the queried patient's predictor to the furthest predictor among the rest of the patients. We report results for the two sigma values (0.5 and 0.7) with threshold p = 0.35 on these data sets in Table 1.

From Table
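Expressed this way, the bandwidth adapts to the local scale of each query. The sketch below illustrates this bandwidth convention; it is our own minimal rendering (function name and data invented), not the paper's code.

```python
import numpy as np

def local_kernel_weights(X, x0, sigma_rel):
    """Gaussian kernel weights around a query x0, with the bandwidth
    expressed as a fraction sigma_rel (e.g. 0.5 or 0.7) of the distance
    from x0 to the furthest training predictor."""
    d = np.sqrt(((X - x0) ** 2).sum(axis=1))
    bandwidth = sigma_rel * d.max()          # sigma in "furthest-point" units
    return np.exp(-(d ** 2) / (2 * bandwidth ** 2))
```

With sigma_rel = 0.5, a point at the maximum distance receives weight exp(−2) ≈ 0.135, while nearer patients dominate the local fit.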

Varying sigma (σ)

In Figure

The relationships between sigma, the percentage of confidently predictable (CP) patients, and the percentage of errors among CP patients in the three microarray data sets.

Confident predictability with 3-nearest neighbor predictor

We demonstrate improved classification only for the kernel classifier using squared-error loss; finite-sample local accuracy bounds are not available for other machine learning algorithms. However, for nearest-neighbor algorithms we may use asymptotic properties to define confident predictability and compare with our finite-sample methods. The 3-nearest neighbor algorithm is a simple local learning method whose predictions represent the baseline predictive power of asymptotic local learning approaches, so the gain of our proposed methods can be assessed by comparing against it. For this purpose, we implemented a 3-nearest neighbor predictor in this study. Employing the same leave-one-out cross-validation procedure as previously described, we compared the 3-nearest neighbor predictor against the kernel predictors with sigma 0.5 and 0.7 on the three microarray data sets. Assuming the true probability of cancer for any patient is either at most 0.30 or at least 0.70, confident predictability for p = 0.35 (as the sample size approaches infinity) can be defined as "all 3 nearest neighbors belong to the same class". Table 2 summarizes the comparison.
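The asymptotic 3-nearest-neighbor rule described above ("CP when all three neighbors agree") can be sketched as follows; this is our own minimal illustration, with invented function names and toy data in the usage example.

```python
from collections import Counter
import numpy as np

def three_nn_predict(Xtr, ytr, xq):
    """3-nearest-neighbor prediction with the asymptotic notion of
    confident predictability used for comparison: the query is CP
    only when all three nearest neighbors share the same class."""
    d = ((Xtr - xq) ** 2).sum(axis=1)
    nn = ytr[np.argsort(d)[:3]]               # labels of the 3 nearest
    label = Counter(nn).most_common(1)[0][0]  # majority vote
    confident = len(set(nn)) == 1             # unanimity => CP
    return label, confident
```

A query whose three neighbors all carry the same label is predicted confidently; a 2-to-1 split still yields a majority prediction but is flagged as not confidently predictable.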

Table 2. Leave-one-out comparison of the 3-nearest neighbor predictor with the kernel predictors (sigma = 0.5 and 0.7, p = 0.35, e = 0)

| **Data Set** | **3-NN CP (%)** | **3-NN Error in CP (%)** | **3-NN Total Error (%)** | **Kernel σ = 0.5 CP (%)** | **Kernel σ = 0.5 Error in CP (%)** | **Kernel σ = 0.5 Total Error (%)** | **Kernel σ = 0.7 CP (%)** | **Kernel σ = 0.7 Error in CP (%)** | **Kernel σ = 0.7 Total Error (%)** |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Leukemia | 71 (98.6%) | 3 (4.23%) | 3 (4.17%) | 36 (50.0%) | 0 (0%) | 3 (4.17%) | 43 (59.7%) | 0 (0%) | 3 (4.17%) |
| Prostate | 54 (61.4%) | 11 (20.4%) | 21 (23.9%) | 32 (36.4%) | 4 (12.5%) | 21 (23.9%) | 34 (38.6%) | 4 (11.8%) | 22 (25.0%) |
| GCM | 204 (72.9%) | 15 (7.35%) | 38 (13.6%) | 154 (55.0%) | 6 (3.90%) | 39 (13.9%) | 160 (57.1%) | 6 (3.75%) | 42 (15.0%) |

As indicated in Table

Application to individual patients

One of the novel contributions of the current method is its ability to provide both a prediction and a confidence level for individual patients. We selected four patients from the GCM data set to illustrate these unique features of personalized prediction. For these patients, the kernel method for 10-TSP with sigma = 0.5 was used; we applied the confident-predictability (CP) criterion with threshold p = 0.35 and took the adjustment parameter e = 0.

Patient 12

Based on the gene expression profile of this patient's tissue, it is correctly predicted to be cancerous with an estimated probability of 0.962. The square root of the mean-squared-error bound (RMSE) is 0.081 (90% CI, 0.831 to 1.0). This patient is in the confidently predictable group. Given a prediction with this accuracy, a physician can initiate treatment without further invasive diagnostic tests, which might cause the disease to spread more rapidly; giving the right treatment at the right time may stabilize the tumor.

Patient 209

From the gene expression profile, the kernel algorithm correctly predicted this normal patient's tissue as noncancerous, with an estimated probability of cancer of 0. The RMSE was 0.209 and the confidence interval was [0, 0.226]. This is considered a confident prediction. In the clinical setting, this patient can be assured by the physician that no further diagnostic tests are required in the near future.

Patient 244

The predicted probability of cancer was 0.579 with an RMSE of 0.072. The 90% CI was [0.462, 1.0]. Although the RMSE was quite small, the estimated probability of cancer was too close to 0.5 to make a firm decision. The physician may advise this patient to undergo additional, different noninvasive diagnostic tests. This non-confidently predictable patient did not have cancer but would have been classified as cancerous had the decision been based on the probability estimate alone.

Patient 253

The predicted probability of cancer was 0.837 with an RMSE of 0.159. Although the probability estimate was quite high, the RMSE was more than double that of patient 244, and the confidence interval was [0.579, 1.0]. The physician may advise this patient to undergo additional, different noninvasive diagnostic tests. This non-confidently predictable patient did not have cancer but would certainly have been diagnosed as having cancer had the decision been based on the probability estimate alone.

We also plotted the probability estimate and upper and lower confidence curves as a function of sigma for the four patients in Figure

Estimated predicted probability and 90% confidence intervals (90% CI) for the four patients in GCM data set

Discussion

We have developed a local minimax kernel learning algorithm that is capable of making individualized prediction in several microarray cancer gene expression data sets. This method incorporates two learning algorithms: the unique features of the

Cancer is heterogeneous in nature: every patient's tumor harbors different genetic alterations, even among patients with the same cancer type

Machine learning and statistical learning approaches have been widely used to classify and stratify cancer patients based on gene expression data

Physicians requiring a greater level of confidence may classify using a smaller predictability threshold; similarly, doctors seeking to predict confidently for a larger number of patients may choose a slightly higher threshold. These values may be investigated and adjusted over time to best suit the analytical context in which the physician is working. Thus, by providing the physician not only with the prediction but also with a confidence level and error bounds for that prediction, we enable a larger number of more finely grained care protocols. This gives the physician many more options when deciding how to sequence the various treatment options for different types of patients given different observations.

It must be emphasized that we only demonstrate improved classification for the kernel classifier using squared error loss. The analogous finite sample local accuracy bounds are not available for other methods (with the exception of linear regression

Open Problems

In this study we have applied finite-sample local minimax bounds only for Tikhonov kernel learning (i.e., using squared-error loss) and obtained improved accuracy. An important open problem is to obtain local accuracy bounds for the support vector machine (which uses hinge loss and fits linear combinations of the kernel with prescribed bandwidth σ while penalizing by adding a constant times the square of the reproducing kernel norm of the linear combination) and to examine the improvement via confident predictability for that machine. One approach to this problem is to optimally map the SVM discriminant function to a probability at the query x_{0}. Also open is the local minimax bound with squared-error loss for logistic regression, or more generally for a probability-of-cancer function that is a ridge function of the predictors. Here one may consider the same affine estimators as above and obtain a bound on the mean squared error at x_{0}.

Conclusions

In summary, we have devised a new learning method that implements: (i) feature selection using the k-TSP algorithm and (ii) classifier construction by local minimax kernel learning. We tested our method on three publicly available gene expression datasets and achieved significantly lower error rate for a substantial identifiable subset of patients. Our final classifiers are simple to interpret and they can make prediction on an individual basis with an individualized confidence level. We believe that our method can be a useful tool to translate the gene expression signatures into clinical practice for personalized medicine.

Appendix

We will outline the derivation of the bounds and algorithms for the case **σ***, £ etc.

Consider the pre-Hilbert space of models spanned by the kernel sections K_{x} = K(x, ·) at the points x_{i}, consisting of linear combinations f = Σ_{i} c_{i} K(x_{i}, ·), with inner product determined by ⟨K_{x'}, K_{x''}⟩ = K(x', x'').

Let x_{0} = 0. The following two theorems are proven in

Theorem I (Minimax Query-based Vector Machine)

Let y_{j} = f(x_{j}) + ε_{j}, with the noise covariance matrix **N** bounded in the semidefinite order (i.e., **σ** − **N** is positive semidefinite) by a positive definite matrix **σ** (in this paper **σ** = 0.25 **I**). Consider the matrix **K*** = ((K(x_{i}, x_{j}))), augmented by the query point x_{0}, which we are taking as 0; the results obtained are the same for any query point x_{0}, by translating x_{0} to the origin and subtracting x_{0} from each predictor x_{j}.

Then the mean squared error of the estimate of f(x_{0}) is bounded by £ if **N** is bounded in the semidefinite order by **σ**.

Proof: see theorem VI in

Theorem II (Vector Machine with Context)

Assume hypotheses and notation of Theorem I except _{α }_{0 }+ _{1 }+ ... + _{N}

when

where we have

For such _{V }

Proof: see theorem VII of

Confidence analysis

As _{β }_{β}_{β }^{2 }- ^{2 }= σ^{2}).

Now maximize the right hand side inside the brackets as a function of

The inequality

Authors' contributions

LKJ and ACT proposed the research, designed the study and supervised the project. LKJ, AK and KR derived the mathematical theorems and proofs. LKJ, ACT, FZ and DB implemented the algorithms, performed the analysis and interpreted the results. LKJ, ACT and FZ wrote the manuscript. All authors have read and approved the manuscript.

Acknowledgements

The authors would like to acknowledge the constructive comments from the reviewers and associate editor to improve the presentation of this manuscript. Part of this work was supported by NIH/NCRR Colorado CTSI Grant Number UL1 RR025780 CO-Pilot Grant Award (ACT). Its contents are the authors' sole responsibility and do not necessarily represent official NIH views.

Pre-publication history

The pre-publication history for this paper can be accessed here: