Abstract
Background
To estimate a classifier’s error in predicting future observations, bootstrap methods have been proposed as reduced-variation alternatives to traditional cross-validation (CV) methods based on sampling without replacement. Monte Carlo (MC) simulation studies aimed at estimating the true misclassification error conditional on the training set are commonly used to compare CV methods. We conducted an MC simulation study to compare a new method of bootstrap CV (BCV) to k-fold CV for estimating classification error.
Findings
For the low-dimensional conditions simulated, the modest positive bias of k-fold CV contrasted sharply with the substantial negative bias of the new BCV method. This behavior was corroborated using a real-world data set of prognostic gene-expression profiles in breast cancer patients. Our simulation results demonstrate some extreme characteristics of variance and bias that can occur due to a fault in the design of CV exercises aimed at estimating the true conditional error of a classifier, and that appear not to have been fully appreciated in previous studies. Although CV is a sound practice for estimating a classifier’s generalization error, using CV to estimate the fixed misclassification error of a trained classifier conditional on the training set is problematic. While MC simulation of this estimation exercise can correctly represent the average bias of a classifier, it will overstate the between-run variance of the bias.
Conclusions
We recommend k-fold CV over the new BCV method for estimating a classifier’s generalization error. The extreme negative bias of BCV is too high a price to pay for its reduced variance.
Keywords:
Cross-validation; Bootstrap cross-validation; Classification error estimation; Mean squared error
Background
Class prediction involves the use of statistical learning techniques to develop algorithms for classifying unknown samples through supervised learning on samples of known class. In assessing the performance of a classification algorithm, the goal is to estimate its ability to generalize, i.e., to predict the outcomes of samples not included in the data set used to train the classifier. The performance may be assessed on the basis of a number of different indices. For problems having a dichotomous outcome variable (e.g., positive or negative), the sensitivity, specificity, positive predictive value and negative predictive value are indices that may be of interest in addition to the overall prediction accuracy [1]. In this paper, attention is focused on the overall prediction accuracy, or equivalently, on its counterpart, the prediction error.
Cross-validation (CV) is a widely used method for performance assessment in class prediction [2-4]. With k-fold CV, a data set of n samples is randomly divided into k subsets, each having (approximately) n/k samples. Each of these k subsets serves in turn as a test set. For each of these k test sets of size n/k, a classifier is trained on the remaining (k − 1) × (n/k) observations (the training set). The trained classifier is then used to classify the n/k samples in the test set, and the prediction error (perhaps along with other indices) is calculated. The combined value of the prediction error over the k test sets, which is based on the prediction of all n samples one time each, is the cross-validated estimate of that error. Generally, several replicates of k-fold cross-validation are performed based on different random permutations of the n samples in order to account for the random resampling variance, and the average and standard deviation of these replicates are used to assess the performance of the classifier [5,6]. When k = n, the exercise is called leave-one-out cross-validation (LOOCV); there is only one unique way to do LOOCV and, hence, it cannot be replicated. A common choice of k is 10, and 10 to 30 replicates of 10-fold CV have been shown to be sufficient to achieve stable values of the prediction error [7].
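To make the mechanics concrete, the sketch below shows one repetition of k-fold CV in R. This is our own illustrative code, not from any of the cited studies; it assumes the MASS package for the QDA classifier used later in this paper, and the function and variable names are hypothetical.

```r
# One repetition of k-fold cross-validation (illustrative sketch).
# x: n x p matrix of features; y: factor of class labels.
# Assumes both classes appear in every training fold.
kfold_cv_error <- function(x, y, k = 10) {
  n <- nrow(x)
  folds <- sample(rep(1:k, length.out = n))  # random assignment of samples to k folds
  wrong <- 0
  for (j in 1:k) {
    test <- which(folds == j)
    fit  <- MASS::qda(x[-test, , drop = FALSE], y[-test])  # train on the other k-1 folds
    pred <- predict(fit, x[test, , drop = FALSE])$class    # classify the held-out fold
    wrong <- wrong + sum(pred != y[test])
  }
  wrong / n  # each of the n samples is predicted exactly once
}
```

Calling this function repeatedly with fresh random fold assignments and averaging the returned errors gives the replicated estimate described above.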
Before 10-fold CV became popular, efforts were directed toward reducing the variability of LOOCV, recognizing that it gave nearly unbiased estimates of the prediction error [8]. The .632 and .632+ bootstrap methods are well-known alternatives to LOOCV [9]. Recently, Fu, Carroll and Wang [10] introduced a new bootstrap version of LOOCV (bootstrap cross-validation or BCV), which they compared to LOOCV and to the .632 bootstrap method (BT632) on problems with low-dimensional predictor spaces. Like Efron and Tibshirani [9], Fu et al. [10] used a mean squared error (MSE) represented by the mean squared bias (MSB) over N Monte Carlo simulations (discussed in the Methods section) as the primary criterion for evaluating estimators of the true conditional error, i.e., the true misclassification error of the trained classifier conditional on the training set [9]. These and similar investigations into estimating the true conditional error via cross-validation (e.g., see [7,11]) have been interpreted as assessing a classifier’s error in predicting future observations, i.e., its generalization error [8,9]. It is argued in this paper that while cross-validation is a sound, generally accepted method for evaluating a classifier’s generalization error, it may be problematic to use cross-validation to assess this generalizability in terms of estimating a true conditional error defined as a single fixed quantity for a given set of data. With that approach, the variance of cross-validation will tend to be overstated, even though its bias can still be appropriately characterized, as will be shown via Monte Carlo simulation and explained more fully in the Discussion.
While Efron and Tibshirani used the traditional absolute scale to calculate the MSB and its square root (the root mean square or RMS in their notation), Fu et al. focused on what they termed the mean squared ‘relative’ error (MSRE), stating that calculations on the absolute scale gave similar results. Here, the mean squared error and associated quantities calculated on the absolute scale are used.
The purpose of this paper is to report the results of a more extensive comparison of BCV to conventional CV, done via a simulation study like that of Fu et al. [10] based on k-fold CV in addition to LOOCV, and to use those results to fuel a discussion of several issues related to cross-validation. Finally, the performance of BCV and k-fold CV is demonstrated on a real-world data set by classifying patients with breast cancer according to prognosis based on their gene-expression profiles.
Methods
Mean squared error
In order to facilitate the definition of terms, suppose for the moment that k-fold cross-validation (kCV) will be used to assess a classifier’s true conditional error rate based on the results of N Monte Carlo simulations. Let k < n, where n is the sample size, and assume that kCV is repeated R times. Then the MSE for the i-th simulation is given by

$$\mathrm{MSE}_i = \frac{1}{R}\sum_{r=1}^{R}\left(\hat{e}_{ir} - e_i\right)^2, \qquad (1)$$

where e_{i} denotes the true conditional error for the i-th simulation, ê_{ir} denotes its estimate from the r-th repetition of kCV, and the average estimate over the R repetitions is

$$\bar{e}_{Ri} = \frac{1}{R}\sum_{r=1}^{R}\hat{e}_{ir}. \qquad (2)$$

The MSE in (1) can be decomposed into average variance and average (squared) bias components,

$$\mathrm{MSE}_i = \underbrace{\frac{1}{R}\sum_{r=1}^{R}\left(\hat{e}_{ir} - \bar{e}_{Ri}\right)^2}_{\mathrm{VAR}_i} + \underbrace{\left(\bar{e}_{Ri} - e_i\right)^2}_{\mathrm{BIAS}_i^2}. \qquad (3)$$

In (3), VAR_{i} is the variance of the R repeated estimates and BIAS_{i} = ē_{Ri} − e_{i} is the bias of their average relative to the true conditional error.
With BCV, like kCV, it is possible to calculate the MSE in (1) for each value of the true conditional error (k < n). Each of the R bootstrap samples is drawn first, and then each of the n observations in the with-replacement sample is left out one at a time to get an estimate of the prediction error. With BT632, however, it is not possible to calculate the MSE in (1) because only one estimate of the true conditional error can be calculated from the R bootstrap samples in each of the N simulation runs. That is, with BT632 each of the n samples is left out one at a time and then R bootstrap samples are drawn from the remaining n − 1 samples. These R bootstrap samples give an estimate of the prediction error for the left-out observation, and the average of these estimates over the n samples is the BT632 estimate. (Efron and Tibshirani [9] presented an efficient algorithm for computing BT632 that uses only R total bootstrap samples instead of R × n samples; the expected number of bootstrap samples used to estimate the prediction error for each left-out observation is (1 − 0.632) × R.) Hence, the decomposition in (3) can be achieved with both kCV and BCV, but not with BT632.
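The following sketch illustrates the BCV scheme just described: each bootstrap sample is drawn first, and LOOCV is then performed within it. Again, this is our own hedged reconstruction rather than the implementation of Fu et al. [10], and the safeguard on distinct observations described in Methods is reduced to a comment.

```r
# Bootstrap cross-validation (BCV): LOOCV inside each of B bootstrap samples.
# In practice a minimum number of distinct observations per class should be
# enforced in each bootstrap sample (see Methods) so that qda() can be fit.
bcv_error <- function(x, y, B = 50) {
  n <- nrow(x)
  errs <- numeric(B)
  for (b in 1:B) {
    idx <- sample(n, replace = TRUE)            # bootstrap sample of size n
    xb <- x[idx, , drop = FALSE]; yb <- y[idx]
    wrong <- 0
    for (i in 1:n) {                            # leave each observation out in turn
      fit <- MASS::qda(xb[-i, , drop = FALSE], yb[-i])
      wrong <- wrong + (predict(fit, xb[i, , drop = FALSE])$class != yb[i])
    }
    errs[b] <- wrong / n
  }
  mean(errs)  # average the LOOCV errors over the B bootstrap samples
}
```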
Of necessity, because of the construction of BT632 and associated estimators, Efron and Tibshirani [9] used only the MSB (second term in (3)) to evaluate the performance of cross-validation methods, where ē_{Ri} for BT632 has a different connotation than for kCV and BCV, but is still an average calculated from R (or fewer) bootstrap samples per observation. Similarly, Molinaro et al. [7] employed the MSB in their investigation. Although it was not explicitly shown, in both papers the MSB was further decomposed as

$$\mathrm{MSB} = \frac{1}{N}\sum_{i=1}^{N}\left(\bar{e}_{Ri} - e_i\right)^2 = \frac{1}{N}\sum_{i=1}^{N}\left(\mathrm{BIAS}_i - \overline{\mathrm{BIAS}}\right)^2 + \overline{\mathrm{BIAS}}^2, \qquad (4)$$

where BIAS_{i} = ē_{Ri} − e_{i} and $\overline{\mathrm{BIAS}} = \frac{1}{N}\sum_{i=1}^{N}\mathrm{BIAS}_i$, i.e., into the variance of the individual biases plus the square of the average bias.
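As a numerical illustration of decompositions (3) and (4), the sketch below (hypothetical names; it uses the population form with divisor N, whereas the SD(BIAS) defined later in (11) uses N − 1) computes the per-simulation components from an N × R matrix of error estimates and checks that the pieces add up.

```r
# est: N x R matrix of estimates (row i holds the R repetitions of simulation i)
# e_true: vector of length N containing the true conditional errors e_i
decompose_mse <- function(est, e_true) {
  e_bar <- rowMeans(est)                 # average estimate over R repetitions
  VAR   <- rowMeans((est - e_bar)^2)     # first term of (3)
  BIAS  <- e_bar - e_true                # bias of the average estimate
  MSE   <- rowMeans((est - e_true)^2)    # (1); equals VAR + BIAS^2 row by row
  stopifnot(isTRUE(all.equal(MSE, VAR + BIAS^2)))
  MSB <- mean(BIAS^2)                    # second term of (3), averaged over i
  # decomposition (4): variance of the biases plus the squared average bias
  stopifnot(isTRUE(all.equal(MSB, mean((BIAS - mean(BIAS))^2) + mean(BIAS)^2)))
  list(MSE = MSE, VAR = VAR, BIAS = BIAS, MSB = MSB)
}
```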
Monte Carlo simulation study
It was assumed that there were two populations (classes) defined by p ≥ 1 predictors or features having underlying Gaussian distributions [10]. The first population was assumed to be distributed N(μ_{1}, Σ_{1}) with μ_{1} = 0_{(p)}^{'} and the second N(μ_{2}, Σ_{2}), with the two mean vectors separated by a distance Δ. Both equal (Σ_{1} = Σ_{2} = I_{(p)}) and unequal (Σ_{1} = I_{(p)}, Σ_{2} = 2I_{(p)}) covariance structures were simulated.
Feature dimensions of p = 1 and 5 were simulated, along with Δ = 1 and 3. For p = 1, sample sizes of n = 20, 50 and 100 were simulated (n/2 in each class), while for p = 5, only n = 50 and 100 were considered. Whereas Fu et al. [10] used quadratic discriminant analysis (QDA) to classify samples for some comparisons and a k-nearest neighbor (kNN) classifier for others, here QDA was used for all comparisons in light of the low dimensionality. For higher dimensions, where p > n, a method like kNN would be required.
As mentioned above, there is only one way to do LOOCV with a given sample. On the other hand, BCV as defined by Fu et al. [10] uses an average of the LOOCV prediction errors over B bootstrap (re)samples. Hence, BCV is based on B × n recomputations (retrainings of the classifier) while LOOCV is based on only n recomputations. For a more extensive comparison, three approaches were taken here to compare BCV to k-fold CV (henceforward kCV), as summarized in Table 1.
Table 1. Cross-validation methods to be compared
First, LOOCV was compared to BCV as was done by Fu and colleagues [10]. These methods are denoted by kCVn and BCVn, respectively, in Table 1. Second, n/2-fold CV (leave-two-out CV, denoted kCVn/2) was done in order to stay as close as possible to LOOCV (kCVn) while allowing multiple retrainings of the classifier. To keep the number of recomputations the same as for BCV, 2 × B repetitions of kCVn/2 were run (2 × B × n/2 = B × n). Also, a version of BCV based on n/2-fold CV (BCVn/2) was implemented with 2 × B repetitions for a head-to-head comparison with kCVn/2 based on the same number, B × n, of total retrainings. Third, a version of BCV based on 10-fold CV (BCV10) was implemented and compared to traditional 10-fold CV (kCV10), where again the number of recomputations was the same. Here, both BCV10 and kCV10 were based on B × n/10 repetitions for a total of B × n retrainings. BCVn/2 and BCV10 were defined like kCVn/2 and kCV10, except that in each repetition a bootstrap sample of size n was randomly divided into n/2 or 10 subsets, while with kCVn/2 and kCV10 the original n observations were randomly redivided into n/2 or 10 subsets (these are the same when n = 20). In this study, B = 50 [9,10].
The simulation study was implemented as follows. For each combination of p and Δ, a “superpopulation” of size 10000 was drawn, 5000 from N(μ_{1}, Σ_{1}) and 5000 from N(μ_{2}, Σ_{2}). Then, for each value of n, N = 1000 simulations were run. For each simulation run, a stratified random sample of size n was drawn without replacement: n/2 observations from the 5000 N(μ_{1}, Σ_{1}) population values and n/2 observations from the 5000 N(μ_{2}, Σ_{2}) population values. The QDA classifier was trained on the sample. Following Molinaro et al. [7], the true conditional error rate for each classifier was calculated as the proportion of times the trained classifier misclassified the remaining 10000 − n members of the superpopulation. Then kCVn, BCVn, kCVn/2, BCVn/2, kCV10 and BCV10 were each conducted on the sample to estimate the true conditional error. Their MSE, variance and bias were calculated from expressions (1) and (3).
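A single simulation run can be sketched in R as follows. This is our illustrative reconstruction: the exact mean vector μ_{2} follows Fu et al. [10], so the Δ-in-every-coordinate choice below is an assumption, flagged in the comments.

```r
library(MASS)  # mvrnorm for Gaussian draws, qda for the classifier

p <- 5; Delta <- 1; n <- 50
mu1 <- rep(0, p)
mu2 <- rep(Delta, p)                  # ASSUMED mean structure; the exact mu2 follows [10]
pop1 <- mvrnorm(5000, mu1, diag(p))   # superpopulation, 5000 draws per class
pop2 <- mvrnorm(5000, mu2, diag(p))

# Stratified sample of n/2 per class, drawn without replacement
i1 <- sample(5000, n / 2); i2 <- sample(5000, n / 2)
x <- rbind(pop1[i1, ], pop2[i2, ])
y <- factor(rep(c(1, 2), each = n / 2))
fit <- qda(x, y)

# True conditional error: misclassification rate over the remaining 10000 - n
# members of the superpopulation, as in Molinaro et al. [7]
rest  <- rbind(pop1[-i1, ], pop2[-i2, ])
ytrue <- factor(rep(c(1, 2), each = 5000 - n / 2))
e_true <- mean(predict(fit, rest)$class != ytrue)
```

Each of the six methods in Table 1 would then be run on (x, y) and compared against e_true.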
For p = 1 it was found that at least four distinct observations were needed in each class to avoid numerical problems in training the QDA classifier for the BCVn/2 and BCV10 methods. Hence, this requirement was imposed on all three BCV methods. (Fu et al. [10] required at least three distinct observations in each class for the original BCV method, BCVn.) In addition, for p = 5, the BCV methods were implemented with stratified sampling, i.e., n/2 bootstrap samples from each class, along with a requirement of at least eight distinct observations in each class.
The mean and standard deviation of the MSE, variance, and bias, as well as the MSB over the N = 1000 simulations, were calculated for BCV and kCV. With R representing the number of repetitions of each method (Table 1, column 2), the means are defined by

$$\overline{\mathrm{MSE}} = \frac{1}{N}\sum_{i=1}^{N}\mathrm{MSE}_i, \qquad (5)$$

$$\overline{\mathrm{VAR}} = \frac{1}{N}\sum_{i=1}^{N}\mathrm{VAR}_i, \qquad (6)$$

$$\overline{\mathrm{BIAS}} = \frac{1}{N}\sum_{i=1}^{N}\mathrm{BIAS}_i, \qquad (7)$$

$$\mathrm{MSB} = \frac{1}{N}\sum_{i=1}^{N}\mathrm{BIAS}_i^2 = \frac{1}{N}\sum_{i=1}^{N}\left(\bar{e}_{Ri} - e_i\right)^2, \qquad (8)$$

with VAR_{i} and BIAS_{i} as defined in (3). The three standard deviations for each method are defined by

$$\mathrm{SD}(\mathrm{MSE}) = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(\mathrm{MSE}_i - \overline{\mathrm{MSE}}\right)^2}, \qquad (9)$$

$$\mathrm{SD}(\mathrm{VAR}) = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(\mathrm{VAR}_i - \overline{\mathrm{VAR}}\right)^2}, \qquad (10)$$

$$\mathrm{SD}(\mathrm{BIAS}) = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(\mathrm{BIAS}_i - \overline{\mathrm{BIAS}}\right)^2}. \qquad (11)$$

The means and standard deviations defined in (5)–(11) were used to compare the performance of the methods.
R Version 2.6.0 was used to conduct the Monte Carlo simulation study, with an independently written SAS/IML program used to verify the mean calculations for the equal-variance case with p = 1 [13,14].
Results and discussion
The results of the simulation study are summarized in Tables 2 and 3 and Figures 1, 2, 3 and 4. Table 2 covers the same case as Table 1 of Fu and colleagues [10]. For brevity, all configuration results are discussed but only a limited portion of the results is displayed in Tables 2 and 3 (i.e., cases for LOOCV, BCVn, kCV10 and BCV10 where Σ_{1} = Σ_{2} = I_{(p)}). The interested reader is referred to the Additional files for the tables in their entirety and for the cases where Σ_{1} = I_{(p)} and Σ_{2} = 2I_{(p)}.
Table 2. Simulation results for p = 1, Σ_{1} = Σ_{2} = I_{(1)}, N = 1000
Table 3. Simulation results for p = 5, Σ_{1} = Σ_{2} = I_{(5)}, N = 1000
Figure 1. The individual values of (ē_{Ri} − e_{i}) that contribute to the MSB.
Figure 2. The individual values of (ē_{Ri} − e_{i}) that contribute to the MSB.
Figure 3. The mean relative bias for each of the twenty simulation configurations in Tables 2 and 3 and in Additional file 1: Table S1, Additional file 2: Table S2, Additional file 3: Table S3 and Additional file 4: Table S4, where each point is the average of N = 1000 values.
Beginning with the MSB (i.e., the mean squared bias defined in (8)), the individual values of (ē_{Ri} − e_{i}) that contribute to it are plotted in Figures 1 and 2.
Why there is large variation in general
The large variation shown in Figures 1 and 2, along with the correspondingly large values of SD(BIAS) in Tables 2 and 3 for both kCV and BCV, is consistent with the results of Efron and Tibshirani ([9], Tables 3 to 8 on pages 554–556) and Molinaro et al. ([7], Tables 1 and 4 on pages 3304 and 3305), both of whom showed large standard deviations and, in some cases, large values of bias for the classifiers and error-estimation methods they studied. In fact, Efron and Tibshirani [9] noted that none of the methods correlates very well with the conditional error rate on a sample-by-sample basis. This lack of correlation in the present investigation, reflected by the large values of SD(BIAS), appears to be partly due to a problem with the way the true classification error is defined and estimated. As mentioned in the Background, the problem appears to be that the quantity purportedly being estimated, the true misclassification error of the trained classifier conditional on the training set, is defined as a single fixed quantity for a given set of data.
Additional file 1: Table S1. Simulation results for p = 1, Σ_{1} = Σ_{2} = I_{(1)}, N = 1000.
Additional file 2: Table S2. Simulation results for p = 5, Σ_{1} = Σ_{2} = I_{(5)}, N = 1000.
Additional file 3: Table S3. Simulation results for p = 1, Σ_{1} = I_{(1)}, Σ_{2} = 2I_{(1)}, N = 1000.
Additional file 4: Table S4. Simulation results for p = 5, Σ_{1} = I_{(5)}, Σ_{2} = 2I_{(5)}, N = 1000.
Possible alternative approach for estimation of SD(BIAS)
It does not seem logical to take the misclassification error as a fixed quantity and then use cross-validation to estimate it, because the true conditional error for any classifier trained using only part of the data within a cross-validation is not the same as the true conditional error of the classifier trained on the complete set of data, i.e., the quantity to be estimated. This leads to an inflated estimate of SD(BIAS). Although it might prove to be computationally prohibitive, it seems more logical to define and calculate a true conditional error for each training set within each of the k partitions of a cross-validation, say e_{ijr} (i = 1,…,N; j = 1,…,k; r = 1,…,R), and then obtain a corresponding estimate, ê_{ijr}, of each such partition-specific error, so that the bias of each estimate is measured against the error of the classifier that actually produced it.
Average bias estimates are representative
On the other hand, even though individual-run biases are likely overstated because of the inflated variance when defined in terms of a fixed true conditional error, the average bias, defined in (7), is still representative: the individual-run deviations average out over the N simulation runs, so the Monte Carlo simulation can still correctly represent a method’s average over- or under-estimation of the true error.
Impact of BCV being negatively biased
The negative bias of BCV, i.e., underestimation of the true error, can be explained by the fact that the probability that a test sample appears in the training set is 1 − (1 − 1/n)^{n} ≈ 0.632. Borrowing the words of Efron and Tibshirani [9] to describe this phenomenon, BCV “uses training samples that are too close to the test points, leading to potential underestimation of the error rate.” It is important to note that the bootstrap methods of Efron and Tibshirani do not include test points in the training set.
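The approximation is easy to verify for the sample sizes simulated here:

```r
n <- c(20, 50, 100)
round(1 - (1 - 1/n)^n, 4)  # probability a given observation appears in a bootstrap sample
# 0.6415 0.6358 0.6340  -- approaching 1 - exp(-1) = 0.6321 as n grows
```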
The substantial negative bias of BCV means that BCV tends to underestimate the classification error on average. While the direction and magnitude of the bias of a cross-validation method might not matter a great deal if the performances of several competitive classification procedures are being compared, it definitely matters if the error rate of a specific classification procedure is of interest. Substantial negative bias, translating to underestimation of the true misclassification error, would be a serious concern. Figure 4 further illustrates the sizable negative bias of BCV.
Assessing reproducibility of error estimates
Because both BCV and kCV can be repeated multiple times, as they have been in the present simulation study, they can give information on the reproducibility among repeated cross-validations. The values of VAR_{i} across repetitions, summarized by (6) and (10) in Tables 2 and 3, reflect this reproducibility.
Fair comparison requires equalizing the number of retrainings
In this study, there were 100 to 500 repetitions of each method (1000 to 5000 retrainings) in order to put BCV and kCV on the same footing with respect to the number of retrainings of classifiers [9,11]. This is many more repetitions than the ten or twenty repetitions normally done with kCV10. Although nowadays CPU time is relatively inexpensive, 100 to 500 repetitions may be excessive. On the other hand, although the BT632 method of Efron and Tibshirani [9] did not perform as well overall as BCV in the study of Fu et al. [10], it did show competitive behavior in some cases. It seems likely that, if the number of retrainings were equalized while employing the economical algorithm of Efron and Tibshirani [9], the competitiveness of BT632 evaluated in terms of average squared bias and its component parts would improve. As Kim [11] reported recently, the BT632^{+} method based on 50 bootstraps performed better than 5 repetitions of kCV10 in terms of average squared bias for a pruned tree classifier, although it, too, had a downward bias.
Microarray example
We evaluated the performance of kCV and BCV in predicting prognosis based on the gene-expression profiles of breast cancer patients previously reported by van’t Veer and colleagues [17,18]. The van de Vijver et al. [18] study consisted of 295 patients with stage I or II breast cancer, and the patients’ prognosis and gene-expression data are publicly available at http://microarray-pubs.stanford.edu/would_NKI/explore.html. While the data set contained a 70-gene prognosis profile, we chose to perform our evaluation of kCV and BCV using only 5 genes, selected by a simple procedure based on a t-statistic with adjusted p-values [19] (a sketch is given after the steps below). In the study of Fu et al. [10], the authors chose the 5 genes that were most highly correlated with the patients’ prognosis. As noted by Fu et al. [10], such a gene-selection procedure is prone to bias; however, the purpose of this evaluation is not gene selection, and in practice often only a small subset of genes is of clinical interest.
Following the steps taken in Fu et al. [10] for comparison, the following procedure was carried out: (1) take a random sample S of size n = 50, with half of the patients having good prognosis and half having poor prognosis; (2) train a QDA classifier on the random sample S and compute its true conditional error rate as the proportion of the remaining patients misclassified by the trained classifier; (3) for each random sample S, estimate the true conditional error with kCVn (LOOCV), kCVn/2, kCV10, BCVn, BCVn/2 and BCV10; (4) calculate their MSE, variance and bias; (5) repeat steps (1)–(4) over 1000 simulation runs and calculate the mean and standard deviation of the MSE, variance, bias and MSB.
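A hedged sketch of the gene-screening step that precedes step (1) is given below; the helper name is ours, and the Benjamini-Hochberg adjustment via p.adjust is an assumed stand-in for the adjustment used in [19].

```r
# expr: patients x genes expression matrix; prognosis: two-level factor
select_genes <- function(expr, prognosis, m = 5) {
  pvals <- apply(expr, 2, function(g)
    t.test(g ~ prognosis)$p.value)       # per-gene two-sample t-test
  adj <- p.adjust(pvals, method = "BH")  # ASSUMED adjustment method
  order(adj)[1:m]                        # indices of the m top-ranked genes
}
```

The five selected columns of the expression matrix then play the role of x in the kCV and BCV sketches given earlier.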
The results presented in Table 4 and Figure 5 are consistent with our simulation results in Table 3 for Δ = 1. More specifically, the MSB is larger for BCV, and the negative bias of BCV is substantial.
Conclusions
Cross-validation is a widely accepted and sound practice for estimating the generalization error of a classifier. Of course, for small data sets with high-dimensional predictors, especially for p > n, the variation among cross-validated error estimates can be large. For methods like BCV and kCV that can be replicated, it is generally accepted that cross-validation should be repeated 10 to 30 times to account for this variation. However, using cross-validation to estimate the fixed misclassification error of a trained classifier conditional on the training set is problematic and should not be attempted. Although Monte Carlo simulation of this estimation exercise can correctly represent the average bias, it will overstate the variance of the bias. For the low-dimensional conditions simulated in the present study, kCV showed a consistent, but modest, positive bias. Conversely, BCV showed a consistent, and sometimes substantial, negative bias, which was much more pronounced for p = 5 than for p = 1. Increasing the complexity of the simulation to incorporate higher dimensions would be expected only to magnify the effect. The bias of BCV is too high a price to pay for its reduced variance; k-fold CV is recommended.
Abbreviations
CV: Cross-validation; MSE: Mean squared error; MSB: Mean squared bias; BCV: Bootstrap cross-validation; LOOCV: Leave-one-out cross-validation; RMS: Root mean square; MSRE: Mean squared relative error; kCV: k-fold cross-validation; QDA: Quadratic discriminant analysis; kNN: k-nearest neighbor.
Competing interests
The authors declare that they have no competing interests relevant to this article to disclose.
Authors’ contributions
SO and RLK conceived the problem and designed the simulations for the manuscript. SYL and HJS were in charge of the computational coding. All authors were involved in drafting the manuscript. All authors read and approved the final manuscript.
Authors’ information
Songthip Ounpraseuth is an Associate Professor in the Department of Biostatistics at the University of Arkansas for Medical Sciences, Little Rock, AR. His research interests include prediction-error estimation, computational statistics, dimension reduction and classification.
Ralph L Kodell is a Professor in the Department of Biostatistics at the University of Arkansas for Medical Sciences. His research interests include classification algorithms for biomedical decision making and statistical models and methods for toxicology and risk assessment.
Shelly Y. Lensing is a Research Associate biostatistician in the Department of Biostatistics at the University of Arkansas for Medical Sciences. Her research interests are the design and analysis of clinical trials and statistical computing.
Horace J. Spencer is a Research Associate biostatistician in the Department of Biostatistics at the University of Arkansas for Medical Sciences. His research interests are in all aspects of statistical computing and error estimation.
Financial disclosures
The authors have no financial relationships relevant to this article to disclose.
Acknowledgement
The authors are grateful to the referees for critically reading the manuscript and for their valuable suggestions and comments, which have led to an improved presentation.
References

1. Moon H, Ahn H, Kodell RL, Baek S, Lin CJ, Chen JJ: Ensemble methods for classification of patients for personalized medicine with high-dimensional data. Artif Intell Med 2007, 41:197–207.
2. Hastie T, Tibshirani R, Friedman J: The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer, New York; 2001.
3. Ambroise C, McLachlan GJ: Selection bias in gene extraction on the basis of microarray gene-expression data. Proc Natl Acad Sci 2002, 99:6562–6566.
4. Subramanian J, Simon R: An evaluation of resampling methods for assessment of survival risk prediction in high-dimensional settings. Stat Med 2011, 30:642–653.
5. Liu Y, Yao X, Higuchi T: Evolutionary ensembles with negative correlation learning. IEEE Trans Evol Comput 2000, 4:380–387.
6. Arena VC, Sussman NB, Mazumdar S, Yu S, Macina OT: The utility of structure-activity relationship (SAR) models for prediction and covariate selection in developmental toxicity: comparative analysis of logistic regression and decision tree models. SAR QSAR Environ Res 2004, 15:1–18.
7. Molinaro AM, Simon R, Pfeiffer RM: Prediction error estimation: a comparison of resampling methods. Bioinformatics 2005, 21:3301–3307.
8. Efron B: Estimating the error rate of a prediction rule: improvement on cross-validation. J Amer Stat Assoc 1983, 78:316–331.
9. Efron B, Tibshirani R: Improvements on cross-validation: the .632+ bootstrap method. J Amer Stat Assoc 1997, 92:548–560.
10. Fu WJ, Carroll RJ, Wang S: Estimating misclassification error with small samples via bootstrap cross-validation. Bioinformatics 2005, 21:1979–1986.
11. Kim JH: Estimating classification error rate: repeated cross-validation, repeated hold-out and bootstrap. Comput Stat Data Anal 2009, 53:3735–3745.
12. Friedman J: Regularized discriminant analysis. J Amer Stat Assoc 1989, 84:165–175.
13. R Development Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria; 2007. http://www.R-project.org, accessed November 2, 2007.
14. SAS Institute Inc: SAS/IML 9.1 User’s Guide. SAS Institute Inc., Cary, North Carolina; 2004.
15. Davison AC, Hall P: On the bias and variability of bootstrap and cross-validation estimates of error rate in discrimination problems. Biometrika 1992, 79(2):279–284.
16. Efron B, Tibshirani R: An Introduction to the Bootstrap. Chapman & Hall/CRC, Boca Raton, Florida; 1993.
17. Van’t Veer LJ, Dai H, Van de Vijver MJ, He YD, Hart AA, Mao M, et al.: Gene expression profiling predicts clinical outcome of breast cancer. Nature 2002, 415:530–536.
18. Van de Vijver MJ, He YD, et al.: A gene-expression signature as a predictor of survival in breast cancer. N Engl J Med 2002, 347:1999–2009.
19. Nguyen DV, Rocke DM: Tumor classification by partial least squares using microarray gene expression data. Bioinformatics 2002, 18:39–50.