Table 3

Applications to real datasets

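| Dataset | n | Prevalence | %t | Full dataset accuracy | Optimal vs. 2/3 rule | Optimal vs. 1/2 rule |
|---|---|---|---|---|---|---|
| Rosenwald | 240 | 52% | 63% | 0.96 | 0.001 | 0.002 |
| Boer | 152 | 53% | 53% | 0.98 | 0.004 | 2e-4 |
| Golub | 72 | 65% | 56% | 0.95 | 0.002 | 0.004 |
| Sun | 131 | 62% | 31% | 0.83 | 0.022 | 0.008 |
| van't Veer | 117 | 67% | 26% | 0.78 | 0.004 | 0.001 |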


Results of the nonparametric bootstrap with the smooth spline (or isotonic regression) learning curve method [Additional file 1]. n is the total number of samples from the two classes, and "Prevalence" is the prevalence of the majority class. %t is the percent of samples allocated to the training set under optimal allocation, t/n · 100%. "Full dataset accuracy" is the estimated mean accuracy on the full dataset of size n. The first "Optimal vs." column is the difference between the root mean squared error for an optimal training set allocation and for the "2/3rds to training set" allocation rule. The rightmost column is the same difference for the "1/2 to training set" allocation rule. Classes for the datasets are: Germinal Center B-cell-like lymphoma versus other (Rosenwald et al., 2002), renal clear cell carcinoma primary tumor versus control normal kidney tissue (Boer et al., 2001), acute myelogenous leukemia versus acute lymphoblastic leukemia (Golub et al., 1999), glioblastoma versus oligodendroglioma (Sun et al., 2006), grade 1/2 versus grade 3 breast cancer (van't Veer et al., 2002).
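As a rough illustration of the relation %t = t/n · 100%, the sketch below inverts it to recover an approximate training set size t for each dataset in the table, and sets it beside the sizes implied by the 2/3 and 1/2 rules. The function names are hypothetical and not from the paper, and because the published percentages are rounded, the recovered t values are approximations only.

```python
# Values copied from Table 3: (dataset, n, %t under optimal allocation).
TABLE3 = [
    ("Rosenwald", 240, 63),
    ("Boer", 152, 53),
    ("Golub", 72, 56),
    ("Sun", 131, 31),
    ("van't Veer", 117, 26),
]


def implied_training_size(n, pct_t):
    """Invert %t = t/n * 100 to get an approximate t (hypothetical helper).

    The table reports %t rounded to a whole percent, so the result is
    only accurate to within about n/200 samples.
    """
    return round(n * pct_t / 100)


def rule_training_size(n, numerator, denominator):
    """Training set size under a fixed-fraction rule such as 2/3 or 1/2.

    Uses integer floor division, which is one possible rounding choice
    (an assumption; the caption does not say how fractions are rounded).
    """
    return (n * numerator) // denominator


if __name__ == "__main__":
    for name, n, pct_t in TABLE3:
        t_opt = implied_training_size(n, pct_t)
        t_two_thirds = rule_training_size(n, 2, 3)
        t_half = rule_training_size(n, 1, 2)
        print(f"{name}: n={n}, t_opt~{t_opt}, "
              f"t_2/3={t_two_thirds}, t_1/2={t_half}")
```

For the two datasets with the lowest full dataset accuracy (Sun and van't Veer), the implied optimal training sets are much smaller than either fixed rule would give, which matches the low %t values in the table.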

Dobbin and Simon BMC Medical Genomics 2011 4:31   doi:10.1186/1755-8794-4-31
