Pooling breast cancer datasets has a synergetic effect on classification performance and improves signature stability
* Corresponding author: Martin H van Vliet m.h.vanvliet@tudelft.nl
1 Information and Communication Theory Group, Faculty of Electrical Engineering, Mathematics and Computer Science, Delft University of Technology, Mekelweg 4, 2628 CD Delft, The Netherlands
2 Bioinformatics and Statistics group, Department of Molecular Biology, Netherlands Cancer Institute, Plesmanlaan 121, 1066 CX Amsterdam, The Netherlands
3 Department of Pathology, Netherlands Cancer Institute, Plesmanlaan 121, 1066 CX Amsterdam, The Netherlands
4 Department of Pathology, Academic Medical Center, Meibergdreef 9, 1100 DD, Amsterdam, The Netherlands
5 Department of Surgery, Institut Curie, 6 rue d'Ulm, 75005 Paris, France
BMC Genomics 2008, 9:375 doi:10.1186/1471-2164-9-375
Published: 6 August 2008
Additional files
Additional file 1:
Information on summing SNRs.
Format: PDF Size: 72KB
This file can be viewed with: Adobe Acrobat Reader
Additional file 2:
Scatterplot of the classification error against the number of pooled datasets, using a K-nearest-neighbor classifier (K-NN, K = 3). A) DLCV error. B) Error on a large independent validation set of 2000 samples. The color corresponds to the number of datasets that were used. Labels indicate which combination of datasets was used.
Format: PDF Size: 14KB
Additional file 3:
Scatterplot of the classification error against the number of pooled datasets, using a support vector machine classifier with a radial basis function kernel (SVM-RBF). A) DLCV error. B) Error on a large independent validation set of 2000 samples. The color corresponds to the number of datasets that were used. Labels indicate which combination of datasets was used.
Format: PDF Size: 14KB
Additional file 4:
Scatterplot of the classification error against the number of pooled samples. A) DLCV error. B) Error on the Vijver et al. [3] dataset. C) Number of genes selected by the DLCV protocol. The color corresponds to the number of datasets that were used. Labels indicate which combination of datasets was used.
Format: PDF Size: 62KB
Additional file 5:
Scatterplot of the classification error against the number of pooled datasets, using a K-nearest-neighbor classifier (K-NN, K = 3). A) DLCV error. B) Error on the Vijver et al. [3] dataset. The color corresponds to the number of datasets that were used. Labels indicate which combination of datasets was used.
Format: PDF Size: 14KB
Additional file 6:
Scatterplot of the classification error against the number of pooled datasets, using a support vector machine classifier with a radial basis function kernel (SVM-RBF). A) DLCV error. B) Error on the Vijver et al. [3] dataset. The color corresponds to the number of datasets that were used. Labels indicate which combination of datasets was used.
Format: PDF Size: 14KB
Additional file 7:
Network indicating the synergy between six real datasets (ER-positive samples only). Each node represents a dataset, and each edge represents the effect on the DLCV error of pooling the two datasets it connects. Four effects were considered: synergy (bright green), when the pooled error is lower than each of the separate errors; marginal synergy (light blue), when the pooled error is lower than the weighted mean of the separate errors; marginal anti-synergy (yellow), when it is higher than that weighted mean; and true anti-synergy (orange), which indicates a higher DLCV error for the pooled dataset.
Format: PDF Size: 9KB
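The four edge categories described for this network can be sketched as a simple decision rule. This is only an illustration under stated assumptions, not the authors' implementation: the function name and arguments are hypothetical, the weighted mean is assumed to be weighted by the number of samples in each dataset, and true anti-synergy is assumed to mean that the pooled error exceeds each of the separate errors (the caption does not state either explicitly).

```python
def classify_pooling_effect(err_a, n_a, err_b, n_b, err_pooled):
    """Label the effect of pooling two datasets on the DLCV error.

    err_a, err_b : DLCV errors of the separate datasets
    n_a, n_b     : sample counts, used as weights (an assumption)
    err_pooled   : DLCV error obtained on the pooled dataset
    """
    # Weighted mean of the separate errors (weights assumed to be sample counts).
    weighted_mean = (err_a * n_a + err_b * n_b) / (n_a + n_b)

    # Synergy: pooled error is lower than each of the separate errors.
    if err_pooled < min(err_a, err_b):
        return "synergy"
    # True anti-synergy (assumed definition): pooled error is higher than each.
    if err_pooled > max(err_a, err_b):
        return "anti-synergy"
    # Otherwise compare against the weighted mean of the separate errors.
    if err_pooled < weighted_mean:
        return "marginal synergy"
    # Ties with the weighted mean fall here by convention in this sketch.
    return "marginal anti-synergy"


if __name__ == "__main__":
    # Made-up error values, for illustration only.
    for pooled in (0.25, 0.33, 0.38, 0.45):
        print(pooled, classify_pooling_effect(0.30, 100, 0.40, 100, pooled))
```

Each label would correspond to one edge color in the network (bright green, light blue, yellow, orange).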
Additional file 8:
Scatterplot of the classification error against the number of pooled datasets (ER-positive samples only). A) DLCV error. B) Error on the Vijver et al. [3] dataset. C) Number of genes selected by the DLCV protocol. The color corresponds to the number of datasets that were used. Labels indicate which combination of datasets was used.
Format: PDF Size: 16KB
Additional file 9:
Heatmap of the Bonferroni-corrected p-values of the enrichment between each signature and a collection of gene sets (ER-positive samples only). Only categories with at least one significant association are shown.
Format: PDF Size: 14KB
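As a rough illustration of how such a heatmap could be filtered, the sketch below applies a Bonferroni correction to raw enrichment p-values and keeps only the categories with at least one significant association. Everything here is an assumption made for the example: the function names, the 0.05 threshold, the choice to correct for the total number of signature/category tests, and the gene-set names and p-values, none of which come from the paper. The enrichment test itself is not shown.

```python
def bonferroni(p_values, n_tests):
    """Bonferroni correction: multiply each p-value by the number of tests, cap at 1."""
    return [min(1.0, p * n_tests) for p in p_values]


def significant_categories(raw_p, alpha=0.05):
    """Keep categories with at least one significant corrected p-value.

    raw_p : dict mapping category name -> list of raw p-values,
            one per signature (hypothetical input layout).
    alpha : significance threshold (0.05 is an assumption).
    """
    # Assumed here: correct for every signature/category pair that was tested.
    n_tests = sum(len(ps) for ps in raw_p.values())
    kept = {}
    for category, ps in raw_p.items():
        corrected = bonferroni(ps, n_tests)
        if any(p < alpha for p in corrected):
            kept[category] = corrected
    return kept


if __name__ == "__main__":
    # Made-up p-values for two hypothetical categories and two signatures.
    raw = {"cell cycle": [0.001, 0.2], "apoptosis": [0.04, 0.5]}
    print(significant_categories(raw))
```

With these made-up inputs only the first category survives, because 0.04 is no longer below the threshold after multiplying by the four tests.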
Additional file 10:
Network indicating the synergy between six real datasets (ER-negative samples only). Each node represents a dataset, and each edge represents the effect on the DLCV error of pooling the two datasets it connects. Four effects were considered: synergy (bright green), when the pooled error is lower than each of the separate errors; marginal synergy (light blue), when the pooled error is lower than the weighted mean of the separate errors; marginal anti-synergy (yellow), when it is higher than that weighted mean; and true anti-synergy (orange), which indicates a higher DLCV error for the pooled dataset.
Format: PDF Size: 9KB
Additional file 11:
Scatterplot of the classification error against the number of pooled datasets (ER-negative samples only). A) DLCV error. B) Error on the Vijver et al. [3] dataset. C) Number of genes selected by the DLCV protocol. The color corresponds to the number of datasets that were used. Labels indicate which combination of datasets was used.
Format: PDF Size: 16KB
Additional file 12:
Heatmap of the Bonferroni-corrected p-values of the enrichment between each signature and a collection of gene sets (ER-negative samples only). Only categories with at least one significant association are shown.
Format: PDF Size: 13KB
Additional file 13:
Centroids for the 127-gene classifier that was extracted from the six pooled datasets, including detailed information on the selected reporters.
Format: XLS Size: 58KB
This file can be viewed with: Microsoft Excel Viewer
Additional file 14:
Distribution of various clinical parameters. In all cases the number of samples (#) and the percentage of samples (%) are indicated, except for tumor size, which is given in mm.
Format: PDF Size: 48KB
