Department of Epidemiology and Biostatistics, College of Public Health, University of Georgia, Athens, GA, USA

Biometric Research Branch, National Cancer Institute, National Institutes of Health, Rockville, MD, USA

Abstract

Background

We consider the problem of designing a study to develop a predictive classifier from high dimensional data. A common study design is to split the sample into a training set and an independent test set, where the former is used to develop the classifier and the latter to evaluate its performance. In this paper we address the question of what proportion of the samples should be devoted to the training set. How does this proportion impact the mean squared error (MSE) of the prediction accuracy estimate?

Results

We develop a nonparametric algorithm for determining an optimal splitting proportion that can be applied with a specific dataset and classifier algorithm. We also perform a broad simulation study to better understand the factors that determine the best split proportions and to evaluate commonly used splitting strategies (1/2 training or 2/3 training) under a wide variety of conditions. These methods are based on a decomposition of the MSE into three intuitive component parts.

Conclusions

By applying these approaches to a number of synthetic and real microarray datasets we show that for linear classifiers the optimal proportion depends on the overall number of samples available and the degree of differential expression between the classes. In particular, the optimal proportion was found to depend on the full dataset size (n) and the achievable classification accuracy, with higher accuracy and smaller n both shifting the optimal allocation.

Background

The split-sample approach is a widely used study design in high dimensional settings. This design divides the sample collection into a training set and a test set as a means of estimating classification accuracy. A classifier is developed on the training set and applied to each sample in the test set. In practice, statistical prediction models have often been developed without separating the data used for model development from the data used for estimation of prediction accuracy.

Two approaches to evaluating splits of the data are examined. The first approach is based on simulations designed to understand qualitatively the relationships among dataset characteristics and optimal split proportions. We also use these results to evaluate commonly used rules-of-thumb for allocating the data to training and test sets. Our second approach involves development of a nonparametric method that does not rely on distributional assumptions and can be applied directly to any existing dataset without stipulating any parameter values. The nonparametric method can be used with any predictor development method (e.g., nearest neighbor, support vector machine).

This paper addresses the situation in which the accuracy of a predictor will be assessed by its performance on a separate test set. An alternative approach is to apply resampling-based methods to the whole dataset. Because re-sampling strategies have been commonly mis-used, often resulting in highly biased estimates of prediction accuracy, the split-sample design considered here remains an important and widely used alternative.

The question addressed in this paper has not, to our knowledge, been addressed before. Sample splitting has been addressed in other contexts, such as comparing different modeling strategies.

In the next section we describe the parametric modeling approach and the nonparametric approach that can be applied to specific datasets. We also present the results of applying these methods to synthetic and real-world datasets. In the Conclusions section, recommendations for dividing a sample into a training set and test set are discussed.

Approach

The classifier taken forward from a split-sample study is often the one developed on the full dataset, obtained by recombining the training and test sets. This full-dataset classifier has an unknown accuracy, which is estimated by applying the classifier derived on the training set to the test set. The optimal split is then the one that minimizes the mean squared error (MSE) of this estimate with respect to the full-dataset classifier's accuracy. The MSE naturally penalizes both the bias (from using a training set smaller than the full dataset) and the variance of the accuracy estimate.
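This criterion can be written compactly. As a sketch, with notation introduced here for illustration ($\hat{A}_t$ for the test-set estimate of accuracy based on a training set of size $t$, and $A_n$ for the unknown true accuracy of the classifier developed on all $n$ samples):

$$\mathrm{MSE}(t) \;=\; E\big[(\hat{A}_t - A_n)^2\big].$$

The optimal split is the training set size $t$ that minimizes this quantity.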

MSE decomposition

In the supplemental material [Additional file], the mean squared error is decomposed into three intuitive components, MSE = A + V + B (Equation (1)), where A is an accuracy variance term, V a binomial variance term, and B a squared bias term.

**Article supplement**. Contains additional tables, figures, theoretical derivations and discussions.


These three components are described below and illustrated in the conceptual diagram.

Conceptual Diagram

**Conceptual Diagram**. Diagram of mean squared error decomposition.

A = Accuracy Variance Term

The first term in Equation (1) reflects the variance in the true accuracy of a classifier developed on a training set of size t.

V = Binomial Variance Term

The second term in Equation (1) is the variance in the estimated accuracy that results from applying the classifier to the test set. This is a binomial variance because the classifier developed for a specific training set has some fixed true accuracy (success probability), and there are n − t independent test samples on which it is evaluated.

B = Squared Bias Term

The third term in Equation (1) is the squared bias that results from using a classifier that was developed on only the t training samples rather than on the full dataset of n samples.
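To make the three terms concrete, the following sketch (not the authors' implementation; it assumes a one-dimensional Gaussian toy problem, a simple midpoint-threshold classifier, and hypothetical function names) estimates A, V, B, and the MSE by Monte Carlo. In this setting the three terms approximately account for the MSE (cross terms are ignored):

```python
import numpy as np
from math import erf, sqrt

def phi(z):
    # standard normal CDF
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def true_acc(c, delta):
    # True accuracy of the rule "assign class 2 if x > c" when
    # class 1 ~ N(0, 1), class 2 ~ N(delta, 1), equal prevalence.
    return 0.5 * phi(c) + 0.5 * phi(delta - c)

def mse_terms(n=50, t=24, delta=1.5, R=4000, seed=0):
    rng = np.random.default_rng(seed)
    acc_t, acc_hat, acc_full = [], [], []
    for _ in range(R):
        x1 = rng.normal(0.0, 1.0, n // 2)    # class 1 samples
        x2 = rng.normal(delta, 1.0, n // 2)  # class 2 samples
        # classifier trained on t samples: threshold at midpoint of class means
        c_t = 0.5 * (x1[: t // 2].mean() + x2[: t // 2].mean())
        acc_t.append(true_acc(c_t, delta))
        # accuracy estimated on the n - t held-out samples
        te1, te2 = x1[t // 2 :], x2[t // 2 :]
        correct = (te1 <= c_t).sum() + (te2 > c_t).sum()
        acc_hat.append(correct / (n - t))
        # classifier trained on the full dataset of n samples
        c_n = 0.5 * (x1.mean() + x2.mean())
        acc_full.append(true_acc(c_n, delta))
    acc_t, acc_hat, acc_full = map(np.asarray, (acc_t, acc_hat, acc_full))
    A = acc_t.var()                              # accuracy variance term
    V = np.mean(acc_t * (1 - acc_t)) / (n - t)   # binomial variance term
    B = (acc_t.mean() - acc_full.mean()) ** 2    # squared bias term
    mse = np.mean((acc_hat - acc_full) ** 2)     # MSE vs. full-dataset accuracy
    return A, V, B, mse

A, V, B, mse = mse_terms()
print(f"A={A:.5f} V={V:.5f} B={B:.5f} A+V+B={A + V + B:.5f} MSE={mse:.5f}")
```

With these particular (arbitrary) parameter values the test set is small, so the binomial variance term V dominates, mirroring the discussion of V below.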

Model-based simulations for high dimensional expression profiles

Associated with each sample is a p-dimensional expression profile, drawn from a multivariate normal distribution with mean vector μ_{1} for class 1 or μ_{2} for class 2 and common covariance matrix Σ. Of the p genes, m are assumed differentially expressed, with a difference in mean expression levels between the classes of 2δ.

Our simulations use the compound covariate predictor (CCP).

The MSE as a function of splitting proportion is estimated for each simulated dataset in the following way:

1. Given the model parameters (effect size 2δ, covariance matrix Σ, and sample size n), generate a simulated dataset of n samples.

2. For each candidate training set size t, randomly split the dataset into a training set of size t and a test set of size n − t; repeat this R times.

3. Using the optimal significance threshold for gene selection, develop a CCP classifier on each training set.

4. For each classifier developed on a training set of size t, apply the classifier to the corresponding test set of size n − t and estimate the classification accuracy. Average estimates over the R replicates to obtain the mean predicted accuracy estimate.

5. Develop a CCP classifier on the full dataset and compute its true accuracy under the model; the MSE of each splitting proportion is then evaluated with respect to this full-dataset accuracy.
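Steps 1-4 can be sketched as follows. This is an illustrative reimplementation, not the authors' C++ code: the gene-selection size (top 20 genes by t-statistic), the candidate training sizes, and all parameter values are arbitrary choices for demonstration.

```python
import numpy as np

def train_ccp(Xtr, ytr, n_genes=20):
    # Compound covariate predictor: weight the top-ranked genes by their
    # two-sample t-statistics and threshold the resulting score.
    m1, m2 = Xtr[ytr == 0].mean(0), Xtr[ytr == 1].mean(0)
    v1 = Xtr[ytr == 0].var(0, ddof=1) / (ytr == 0).sum()
    v2 = Xtr[ytr == 1].var(0, ddof=1) / (ytr == 1).sum()
    tstat = (m2 - m1) / np.sqrt(v1 + v2 + 1e-12)
    genes = np.argsort(-np.abs(tstat))[:n_genes]   # top-ranked genes
    w = tstat[genes]                               # t-statistic weights
    scores = Xtr[:, genes] @ w
    cut = 0.5 * (scores[ytr == 0].mean() + scores[ytr == 1].mean())
    return genes, w, cut

def predict_ccp(model, X):
    genes, w, cut = model
    return (X[:, genes] @ w > cut).astype(int)

def mean_test_accuracy(n=100, p=500, m=10, delta=1.0, R=30, seed=1):
    # Steps 1-4: for each candidate training size t, repeatedly split,
    # train the CCP on the training set, and estimate accuracy on the test set.
    rng = np.random.default_rng(seed)
    out = {}
    for t in range(20, n - 9, 20):
        accs = []
        for _ in range(R):
            X = rng.normal(size=(n, p))
            y = np.repeat([0, 1], n // 2)
            X[y == 1, :m] += delta            # m differentially expressed genes
            idx0 = rng.permutation(np.where(y == 0)[0])
            idx1 = rng.permutation(np.where(y == 1)[0])
            tr = np.concatenate([idx0[: t // 2], idx1[: t // 2]])
            te = np.concatenate([idx0[t // 2 :], idx1[t // 2 :]])
            model = train_ccp(X[tr], y[tr])
            accs.append(float((predict_ccp(model, X[te]) == y[te]).mean()))
        out[t] = float(np.mean(accs))
    return out

acc_by_t = mean_test_accuracy()
print(acc_by_t)   # mean estimated accuracy for each training-set size t
```

The output traces a learning curve: the mean estimated accuracy rises with the training-set size t, which is the behavior the squared bias term B captures.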

Simulation approach with empirical effect sizes and covariance matrix from real microarray dataset

In order to simulate from a model reflecting more closely real microarray data, data were generated from multivariate normal class distributions with mean vectors μ_{1} and μ_{2} and a covariance matrix estimated empirically from a real microarray dataset.

Full dataset accuracies were computed using the accuracy equation for a linear classifier under the multivariate normal model.
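One standard normal-theory computation of this kind is the closed-form accuracy of a fixed linear rule; whether it matches the article's exact equation is an assumption, and the symbols here (weight vector w, cutoff c, class means μ, covariance Σ) are notation introduced for illustration. The sketch below checks the closed form against Monte Carlo:

```python
import numpy as np
from math import erf, sqrt

def phi(z):
    # standard normal CDF
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def analytic_accuracy(w, c, mu1, mu2, Sigma):
    # Accuracy of the fixed rule "assign class 2 if w'x > c" when the
    # classes are MVN(mu1, Sigma) and MVN(mu2, Sigma), equal prevalence.
    s = sqrt(w @ Sigma @ w)
    return 0.5 * (phi((c - w @ mu1) / s) + phi((w @ mu2 - c) / s))

p = 5
mu1, mu2 = np.zeros(p), np.full(p, 0.8)
Sigma = 0.8 * np.eye(p) + 0.2 * np.ones((p, p))   # compound-symmetry covariance
w = np.linalg.solve(Sigma, mu2 - mu1)             # Fisher discriminant direction
c = 0.5 * w @ (mu1 + mu2)                         # midpoint cutoff

# Monte Carlo check of the closed form
rng = np.random.default_rng(4)
L = np.linalg.cholesky(Sigma)
N = 100_000
x1 = mu1 + rng.normal(size=(N, p)) @ L.T
x2 = mu2 + rng.normal(size=(N, p)) @ L.T
mc = 0.5 * ((x1 @ w < c).mean() + (x2 @ w > c).mean())
print(round(analytic_accuracy(w, c, mu1, mu2, Sigma), 3), round(mc, 3))
```

The analytic and simulated values agree to within Monte Carlo error, which is the property that makes exact full-dataset accuracies computable under the simulation model.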

A method for determining the optimal sample split for a particular dataset using a nonparametric resampling approach

The nonparametric bootstrap method of estimating standard error is used to estimate the variance terms of the MSE decomposition.

In order to estimate the squared bias term B, the learning curve of the classifier (its expected accuracy as a function of training set size) is estimated from the data.

The squared bias term is estimated as follows:

1. For a fixed training set size t, repeatedly split the dataset at random into a training set of size t and a test set of size n − t.

2. For each split, develop a classifier on the training set and estimate its accuracy on the corresponding test set; average the estimates for each t.

3. Fit a smoothing spline or isotonic regression of the average estimated accuracy on the training set size to obtain a learning curve.

4. For the full dataset size n, read off the fitted learning curve value as an estimate of the full-dataset accuracy.

5. Estimate the squared bias using the squared difference between the fitted accuracy at the full dataset size and the fitted accuracy at training set size t.
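A minimal sketch of the learning-curve idea behind steps 3-5, using a hand-rolled pool-adjacent-violators (isotonic) fit in place of R's `isoreg`. The saturating curve and noise level are fabricated for illustration, and reading step 5 as a difference of fitted values is an assumption:

```python
import numpy as np

def pava(y):
    # Pool-adjacent-violators: least-squares non-decreasing fit to y.
    blocks = [[float(v), 1] for v in y]          # [mean, count] per block
    i = 0
    while i < len(blocks) - 1:
        if blocks[i][0] > blocks[i + 1][0] + 1e-12:
            v0, c0 = blocks[i]
            v1, c1 = blocks[i + 1]
            blocks[i] = [(v0 * c0 + v1 * c1) / (c0 + c1), c0 + c1]
            del blocks[i + 1]
            i = max(i - 1, 0)                    # re-check the previous block
        else:
            i += 1
    return np.array([v for v, c in blocks for _ in range(c)])

def estimate_squared_bias(sizes, acc_hat, t):
    # Fit a monotone learning curve to the accuracy estimates and use the
    # fitted value at the largest size as a stand-in for full-dataset accuracy.
    fit = pava(acc_hat)
    return float((fit[-1] - fit[list(sizes).index(t)]) ** 2)

sizes = list(range(10, 100, 10))
rng = np.random.default_rng(2)
true_curve = [0.9 * (1 - np.exp(-s / 30)) + 0.05 for s in sizes]  # hypothetical
acc_hat = [a + rng.normal(0.0, 0.02) for a in true_curve]
B_hat = estimate_squared_bias(sizes, acc_hat, t=30)
print(round(B_hat, 4))
```

The isotonic fit enforces the monotonicity a learning curve should have, smoothing out noise in the individual accuracy estimates before the bias is read off.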

Results and Discussion

We applied the parametric method to high dimensional multivariate normal datasets, while varying the parameter settings and the class prevalences. Results are shown in Table 1.

Table of optimal allocations of the samples to the training sets

**Optimal number to training set**

|             | Effect = 0.5 | Effect = 1.0 | Effect = 1.5 | Effect = 2.0 |
|-------------|--------------|--------------|--------------|--------------|
| **n = 200** |              |              |              |              |
| DEG = 50    | 170 (86%)    | 70+ (>99%)   | 30+ (>99%)   | 20+ (>99%)   |
| DEG = 10    | 150 (64%)    | 130 (94%)    | 100 (99%)    | 60+ (>99%)   |
| DEG = 1     | 10 (52%)     | 150 (69%)    | 120 (77%)    | 80 (84%)     |
| **n = 100** |              |              |              |              |
| DEG = 50    | 70 (64%)     | 80 (>99%)    | 30+ (>99%)   | 20+ (>99%)   |
| DEG = 10    | 10 (55%)     | 80 (91%)     | 70 (99%)     | 40+ (>99%)   |
| DEG = 1     | 10 (51%)     | 40 (63%)     | 80 (77%)     | 70 (84%)     |
| **n = 50**  |              |              |              |              |
| DEG = 50    | 10 (59%)     | 40 (99%)     | 30+ (>99%)   | 20+ (>99%)   |
| DEG = 10    | 10 (52%)     | 40 (78%)     | 40 (98%)     | 40 (>99%)    |
| DEG = 1     | 10 (50%)     | 10 (54%)     | 30 (71%)     | 40 (83%)     |

Entries in the table are the optimal number of samples to allocate to the training set, with the corresponding mean classification accuracy in parentheses; a "+" indicates that this and all larger training set sizes performed comparably well. DEG is the number of differentially expressed genes.

Several features are apparent in Table 1.

Results for additional scenarios are provided in the supplement [Additional file].

The relative sizes of the three terms contributing to the mean squared error of Equation (1) for the scenarios of Table 1 are illustrated in the example figure below.

Example of MSE decomposition

**Example of MSE decomposition**. Example figure showing the relative contributions of the three sources of variation to the mean squared error. This is a scenario from one entry in Table 1. Plots for all other scenarios associated with Table 1 are provided in the supplement [Additional file].

The squared bias term B tends to be relatively large for small sample sizes and to dominate the other terms. When development of a good classifier is possible, the actual accuracy of classifiers developed on the training set may initially increase rapidly as the training set size increases. As the sample size increases, the bias term B decreases until it no longer dominates. This is because the accuracy of the classifier improves as the size of the training set increases, approaching the maximum accuracy possible for the problem at hand. The rate of decrease of the squared bias term B will depend somewhat on the type of classifier employed and on the separation of the classes. When the classes do not differ with regard to gene expression, learning is not possible and B will equal zero for all training set sizes.

The binomial variance term V is generally relatively small unless the test set becomes very small, at which point it often dominates. The exceptions to this general rule are cases where the prediction accuracy nears 1 for even modest training set sizes, since the binomial variance p(1 − p)/(n − t) shrinks toward zero as the accuracy p approaches 1, keeping V small even for small test sets.

The figure below compares two commonly used rules of thumb for allocating samples to the training set.

Comparing two rules of thumb

**Comparing two rules of thumb**. Comparison of two common rules-of-thumb: 1/2 of the samples to the training set versus 2/3 of the samples to the training set. The x-axis is the average accuracy (%) for training sets of size n. "Excess error" on the y-axis is the difference between the root mean squared error (RMSE) and the optimal RMSE. Each point corresponds to a cell in Table 1. Gray shading indicates scenarios where the mean accuracy for the full dataset size is below 60%.

• When the achievable true accuracy using the full dataset for training is very close to 1, both the 50% allotment and the 67% allotment to the training set result in similar excess error.

• When the achievable true full dataset accuracy is moderate, say between 60% and 99%, then in several cases, assigning 67% to the training set results in noticeably lower excess error, while in other cases the two allotment schemes are roughly equivalent.

• Finally, and not surprisingly, when the achievable true full dataset accuracy is below 60% (shaded area on graph), then allotment of 50% to the training set is preferable.

In sum, this graph shows that allotment of 2/3 of the samples to the training set is somewhat more robust than allotment of 1/2.

The nonparametric method was applied to simulated datasets and the MSE estimates compared to the parametric approach. Agreement between the two was very good [Additional file].

Table 2 presents results for simulations based on empirically estimated effect sizes and covariance.

Empirically estimated effects and covariance

| p   | Bayes Acc. | n   | Prev. | % t  | Full data Accuracy | Opt. vs. t = 2/3 | Opt. vs. t = 1/2 |
|-----|------------|-----|-------|------|--------------------|------------------|------------------|
| 0.9 | 0.962      | 240 | 50%   | 58.3 | 0.961              | 0.001            | 0.002            |
| 0.6 | 0.861      | 240 | 50%   | 54.2 | 0.860              | 0.003            | 0.002            |

Simulation results based on empirical estimates of the covariance matrix and effect sizes. Columns are: the parameter p; the Bayes accuracy; the sample size n; the class prevalence; the optimal percentage allocated to the training set (% t); the full dataset accuracy; and the excess error of the optimal split relative to the t = 2/3 and t = 1/2 allotments.

Table 3 presents the results of applying the nonparametric method to five real microarray datasets.

Applications to real datasets

| Dataset    | n   | Prevalence | % t | Full dataset accuracy | Optimal vs. t = 2/3 | Optimal vs. t = 1/2 |
|------------|-----|------------|-----|-----------------------|---------------------|---------------------|
| Rosenwald  | 240 | 52%        | 63% | 0.96                  | 0.001               | 0.002               |
| Boer       | 152 | 53%        | 53% | 0.98                  | 0.004               | 2e-4                |
| Golub      | 72  | 65%        | 56% | 0.95                  | 0.002               | 0.004               |
| Sun        | 131 | 62%        | 31% | 0.83                  | 0.022               | 0.008               |
| van't Veer | 117 | 67%        | 26% | 0.78                  | 0.004               | 0.001               |

Nonparametric bootstrap with smooth spline (or isotonic regression) learning curve method results; details are given in [Additional file].

Note that the rightmost two columns show the excess error when 2/3 and when 1/2 of the samples are allotted to the training set. For the Rosenwald et al. dataset, both allotments result in very small excess error.

For the Boer et al. dataset, the optimal allocation (53% to training) is close to an even split, and the 1/2 allotment shows the smaller excess error.

For the Golub et al. dataset, the optimal allocation is 56% to training, and both allotments again result in small excess error.

To distinguish oligodendroglioma from glioblastoma in the Sun et al. dataset, the optimal allocation was only 31% to training, and the 1/2 allotment clearly outperformed the 2/3 allotment.

A possible explanation for the Sun et al. result is its comparatively low full dataset accuracy (0.83) and the correspondingly small optimal training allocation.

The supplement provides figures related to the fitting on the real datasets [Additional file].

Conclusions

We have examined the optimal split of a set of samples into a training set and a test set in the context of developing a gene expression based classifier for a range of synthetic and real-world microarray datasets using a linear classifier. We found that the optimal proportion of cases for the training set tended to be in the range of 40% to 80% for the wide range of conditions studied. In some cases, the MSE function was flat over a wide range of training allocation proportions, indicating that near-optimal MSE performance was easy to obtain. In other cases, the MSE function was less flat, indicating a clearer optimal selection. In general, smaller total sample sizes led to larger proportions devoted to the training set being optimal. Intuitively this is because, for a given degree of class separation, developing an effective classifier requires a minimal number of cases for training, and that number is a greater proportion of a dataset with fewer total cases.

The number of cases needed for effective training depends on the "signal strength" or the extent of separation of the classes with regard to gene expression. "Easy" classification problems contain individual genes with large effects or multiple independent genes with moderately large effects. For such problems the potential classification accuracy is high (low Bayes error). The number of training cases required for near optimal classification for such datasets is smaller, and hence smaller proportions devoted to the training set could be near optimal (for a fixed total sample size n).

We found that when the average true accuracy of a classifier developed on the full dataset (size n) was at least moderate, allotting 2/3 of the samples to the training set performed at least as well as an even split, while the even split was preferable when the achievable accuracy was below about 60%.

Throughout the simulation studies, this paper has focused on common classifiers which are expected to perform well. Our simulation results should be applicable to the commonly used linear classifiers such as diagonal linear discriminant analysis, Fisher linear discriminant analysis and linear kernel support vector machines. However, there are many other types of classifiers that are currently being investigated. It is beyond the scope of this manuscript to comprehensively examine the MSE patterns of training set size variation for all these classifiers. The simulation results may not carry over to radically different types of classifiers, which may learn at a much different rate or have very different full dataset accuracies than those examined here. It is important not to over-interpret what is necessarily a limited simulation study.

This paper focused on the objective of obtaining a classifier with high accuracy. In some clinical contexts other objectives may be more appropriate, such as estimation of the positive and negative predictive values, or area under the ROC curve. If the prevalence is approximately equal for each class, however, then a high overall accuracy will be highly correlated with high negative and positive predictive values and AUC, so the guidelines here are likely to carry over to these other metrics.

The population prevalence of each class can be an important factor in classifier development. In this paper we examined equal prevalence for the two classes, and a 2/3 to 1/3 prevalence split, in our simulations. The real datasets had prevalences within this range as well. In cases where there is a significant prevalence imbalance between the classes (e.g., 90% versus 10%), there will often be a number of issues outside the scope of this paper. To modify our method for that context, one would need to address whether oversampling from the under-represented class is needed, and whether the cost of misclassification should differ by class.

We looked at a range of sample sizes from n = 50 to n = 200 in the simulations, and from n = 72 to n = 240 in the real datasets.

The data based resampling method presented in this paper can be used with any predictor development method by making minor modifications to the algorithm outlined in the Results.
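To illustrate the plug-in nature of the approach (a sketch only; the nearest-centroid classifier and all names here are illustrative stand-ins, not the method used in the paper), the core split-evaluation loop can be parameterized by user-supplied fit/predict callables so that any classifier can be substituted:

```python
import numpy as np

def split_performance(X, y, fit, predict, t, R=50, seed=0):
    # Mean test-set accuracy for training-set size t, with the classifier
    # supplied as a pair of fit/predict callables (the plug-in point).
    rng = np.random.default_rng(seed)
    n, accs = len(y), []
    for _ in range(R):
        perm = rng.permutation(n)
        tr, te = perm[:t], perm[t:]
        model = fit(X[tr], y[tr])
        accs.append(float((predict(model, X[te]) == y[te]).mean()))
    return float(np.mean(accs))

# Example plug-in: a nearest-centroid classifier.
def fit_nc(X, y):
    return {c: X[y == c].mean(0) for c in np.unique(y)}

def predict_nc(model, X):
    classes = np.array(sorted(model))
    d = np.stack([np.linalg.norm(X - model[c], axis=1) for c in classes])
    return classes[d.argmin(0)]

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 40))
y = np.repeat([0, 1], 30)
X[y == 1, :5] += 1.5          # five informative features
acc = split_performance(X, y, fit_nc, predict_nc, t=30)
print(round(acc, 3))
```

Swapping in a different method only requires replacing the two callables, which is the "minor modification" the algorithm permits.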

Methods

Computations were carried out in C++ using a Borland 5 compiler and Optivec 5.0 vector and matrix libraries, and R version 2.6.1 (including R "stats" package for smooth.spline and isoreg functions). Gene expression data were obtained from the BRB ArrayTools Data Archive for Human Cancer Gene expression (url:

Authors' contributions

Both authors contributed to all aspects of manuscript development.

Declaration of competing interests

The authors declare that they have no competing interests.

Acknowledgements

Kevin K. Dobbin's work was partially supported by the Distinguished Cancer Clinicians and Scientists Program of the Georgia Cancer Coalition.

Pre-publication history

The pre-publication history for this paper can be accessed here: