Table 3

Stability of variable (gene) selection evaluated using 200 bootstrap samples. "# Genes": number of genes selected on the original data set. "# Genes boot.": median (1st quartile, 3rd quartile) of the number of genes selected from the bootstrap samples. "Freq. genes": median (1st quartile, 3rd quartile) of the frequency with which each gene selected on the original data set appears among the genes selected from the bootstrap samples. Parameters for backwards elimination with random forest: mtryFactor = 1, s.e. = 0, ntree = 2000, ntreeIterat = 1000, fraction.dropped = 0.2.

| Data set | Error | # Genes | # Genes boot. | Freq. genes |
|---|---|---|---|---|
| Backwards elimination of genes from random forest | | | | |
| s.e. = 0 | | | | |
| Leukemia | 0.087 | 2 | 2 (2, 2) | 0.38 (0.29, 0.48)¹ |
| Breast 2 cl. | 0.337 | 14 | 9 (5, 23) | 0.15 (0.1, 0.28) |
| Breast 3 cl. | 0.346 | 110 | 14 (9, 31) | 0.08 (0.04, 0.13) |
| NCI 60 | 0.327 | 230 | 60 (30, 94) | 0.1 (0.06, 0.19) |
| Adenocar. | 0.185 | 6 | 3 (2, 8) | 0.14 (0.12, 0.15) |
| Brain | 0.216 | 22 | 14 (7, 22) | 0.18 (0.09, 0.25) |
| Colon | 0.159 | 14 | 5 (3, 12) | 0.29 (0.19, 0.42) |
| Lymphoma | 0.047 | 73 | 14 (4, 58) | 0.26 (0.18, 0.38) |
| Prostate | 0.061 | 18 | 5 (3, 14) | 0.22 (0.17, 0.43) |
| Srbct | 0.039 | 101 | 18 (11, 27) | 0.1 (0.04, 0.29) |
| s.e. = 1 | | | | |
| Leukemia | 0.075 | 2 | 2 (2, 2) | 0.4 (0.32, 0.5)¹ |
| Breast 2 cl. | 0.332 | 14 | 4 (2, 7) | 0.12 (0.07, 0.17) |
| Breast 3 cl. | 0.364 | 6 | 7 (4, 14) | 0.27 (0.22, 0.31) |
| NCI 60 | 0.353 | 24 | 30 (19, 60) | 0.26 (0.17, 0.38) |
| Adenocar. | 0.207 | 8 | 3 (2, 5) | 0.06 (0.03, 0.12) |
| Brain | 0.216 | 9 | 14 (7, 22) | 0.26 (0.14, 0.46) |
| Colon | 0.177 | 3 | 3 (2, 6) | 0.36 (0.32, 0.36) |
| Lymphoma | 0.042 | 58 | 12 (5, 73) | 0.32 (0.24, 0.42) |
| Prostate | 0.064 | 2 | 3 (2, 5) | 0.9 (0.82, 0.99)¹ |
| Srbct | 0.038 | 22 | 18 (11, 34) | 0.57 (0.4, 0.88) |
| Alternative approaches | | | | |
| SC.s | | | | |
| Leukemia | 0.062 | 82² | 46 (14, 504) | 0.48 (0.45, 0.59) |
| Breast 2 cl. | 0.326 | 31 | 55 (24, 296) | 0.54 (0.51, 0.66) |
| Breast 3 cl. | 0.401 | 2166 | 4341 (2379, 4804) | 0.84 (0.78, 0.88) |
| NCI 60 | 0.246 | 5118³ | 4919 (3711, 5243) | 0.84 (0.74, 0.92) |
| Adenocar. | 0.179 | 0 | 9 (0, 18) | NA (NA, NA) |
| Brain | 0.159 | 4177 | 1257 (295, 3483) | 0.38 (0.3, 0.5) |
| Colon | 0.122 | 15 | 22 (15, 34) | 0.8 (0.66, 0.87) |
| Lymphoma | 0.033 | 2796 | 2718 (2030, 3269) | 0.82 (0.68, 0.86) |
| Prostate | 0.089 | 4 | 3 (2, 4) | 0.72 (0.49, 0.92) |
| Srbct | 0.025 | 37⁴ | 18 (12, 40) | 0.45 (0.34, 0.61) |
| NN.vs | | | | |
| Leukemia | 0.056 | 512 | 23 (4, 134) | 0.17 (0.14, 0.24) |
| Breast 2 cl. | 0.337 | 88 | 23 (4, 110) | 0.24 (0.2, 0.31) |
| Breast 3 cl. | 0.424 | 9 | 45 (6, 214) | 0.66 (0.61, 0.72) |
| NCI 60 | 0.237 | 1718 | 880 (360, 1718) | 0.44 (0.34, 0.57) |
| Adenocar. | 0.181 | 9868 | 73 (8, 1324) | 0.13 (0.1, 0.18) |
| Brain | 0.194 | 1834 | 158 (52, 601) | 0.16 (0.12, 0.25) |
| Colon | 0.158 | 8 | 9 (4, 45) | 0.57 (0.45, 0.72) |
| Lymphoma | 0.04 | 15 | 15 (5, 39) | 0.5 (0.4, 0.6) |
| Prostate | 0.081 | 7 | 6 (3, 18) | 0.46 (0.39, 0.78) |
| Srbct | 0.031 | 11 | 17 (11, 33) | 0.7 (0.66, 0.85) |

1 Only two genes are selected from the complete data set; the values are the actual frequencies of those two genes.

2 [33] select 21 genes after visually inspecting the plot of cross-validation error rate vs. amount of shrinkage and number of genes. Their procedure is hard to automate and thus it is very difficult to obtain estimates of the error rate of their procedure.

3 [31] report obtaining more than 2000 genes when using shrunken centroids with this data set and show that the minimum error rate is achieved with about 5000 genes.

4 [33] select 43 genes. The difference is likely due to differences in the random partitions for cross-validation. When the gene selection process is repeated 100 times with the full data set, the median, 1st quartile, and 3rd quartile of the number of selected genes are 13, 8, and 147. For these data, [31] obtain 72 genes with shrunken centroids, which also falls within the above interval.
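
The three stability columns described in the caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: `select_genes` is a hypothetical stand-in for any gene-selection procedure (for example, backwards elimination with random forest), and the function names, the choice of quartile method, and the returned dictionary keys are assumptions made here for illustration only.

```python
import random
from statistics import median, quantiles


def summarize(values):
    """Return (median, 1st quartile, 3rd quartile), or None for an empty list."""
    if not values:
        return None  # corresponds to an "NA (NA, NA)" entry
    if len(values) == 1:
        return (values[0], values[0], values[0])
    # Assumption: "inclusive" quartiles; the source does not state the method.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return (median(values), q1, q3)


def selection_stability(select_genes, samples, n_boot=200, seed=0):
    """Summarize gene-selection stability over bootstrap resamples.

    select_genes: hypothetical callable that takes a list of samples and
                  returns a set of selected gene identifiers.
    samples:      list of observations in whatever form the selector accepts.
    """
    rng = random.Random(seed)
    original = select_genes(samples)
    counts = []                      # genes selected in each bootstrap sample
    hits = {g: 0 for g in original}  # times each original gene is reselected
    for _ in range(n_boot):
        # Resample observations with replacement, keeping the sample size.
        boot = [rng.choice(samples) for _ in samples]
        selected = select_genes(boot)
        counts.append(len(selected))
        for g in original:
            if g in selected:
                hits[g] += 1
    freqs = [hits[g] / n_boot for g in original]
    return {
        "n_genes": len(original),            # "# Genes"
        "n_genes_boot": summarize(counts),   # "# Genes boot."
        "freq_genes": summarize(freqs),      # "Freq. genes"
    }
```

When the selector returns no genes on the original data, there are no frequencies to summarize and the sketch returns `None` for "Freq. genes", which is the situation shown as NA (NA, NA) in the table.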

Díaz-Uriarte and Alvarez de Andrés BMC Bioinformatics 2006 7:3   doi:10.1186/1471-2105-7-3
