BMC Medical Genomics

official impact factor 3.77

Open Access Research article

The removal of multiplicative, systematic bias allows integration of breast cancer gene expression datasets – improving meta-analysis and prediction of prognosis

Andrew H Sims1,2*, Graeme J Smethurst3, Yvonne Hey4, Michal J Okoniewski3,5, Stuart D Pepper4, Anthony Howell2, Crispin J Miller3 and Robert B Clarke2

Author Affiliations

1 Applied Bioinformatics of Cancer Research Group, Breakthrough Research Unit, Edinburgh Cancer Research Centre, Western General Hospital, Crewe Road South, Edinburgh, EH4 2XR, UK

2 Breast Biology Group, School of Cancer and Imaging Sciences, University of Manchester, UK

3 Cancer Research UK Applied Computational Biology and Bioinformatics Group

4 Cancer Research UK Affymetrix Service, Paterson Institute for Cancer Research, Wilmslow Road, Manchester M20 4BX, UK

5 Functional Genomics Center, UNI ETH Zurich, Winterthurerstrasse 190, CH-8057 Zurich, Switzerland

BMC Medical Genomics 2008, 1:42 doi:10.1186/1755-8794-1-42

Published: 21 September 2008

Additional files

Additional file 1:

Comparison of Affymetrix gene expression data generated using different generations of GeneChips, scanning hardware and protocols. A, Comparing the fold change between replicates across datasets is clearly impractical (grey). However, following mean batch-centering there is good correlation (black). B, Comparison of mean raw expression levels for amplified and unamplified MCF10A replicates before (grey) and after mean batch-centering (black). C, Overall transcriptome similarity of individual GeneChips demonstrated with Pearson clustering. D, Fold changes are unaffected by mean batch-centering.

Format: PDF Size: 1MB Download file
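As a rough illustration of the mean batch-centering referred to in panels A, B and D, the sketch below subtracts each probeset's within-batch mean from every sample in that batch. It assumes log2-scale expression values, so a multiplicative bias becomes an additive offset. The function name `mean_batch_center` and the toy data are hypothetical and not taken from the paper. The final comparison shows why fold changes between samples of the same batch are unaffected.

```python
import numpy as np

def mean_batch_center(log_expr, batch_labels):
    """Subtract each probeset's within-batch mean from the samples of that batch.

    log_expr: array of shape (n_probesets, n_samples), assumed log2 scale.
    batch_labels: one label per sample (column).
    """
    log_expr = np.asarray(log_expr, dtype=float)
    batch_labels = np.asarray(batch_labels)
    centered = np.empty_like(log_expr)
    for batch in np.unique(batch_labels):
        cols = batch_labels == batch
        batch_mean = log_expr[:, cols].mean(axis=1, keepdims=True)
        centered[:, cols] = log_expr[:, cols] - batch_mean
    return centered

# Toy data (hypothetical): 3 probesets, two batches of 4 samples each.
rng = np.random.default_rng(0)
raw = rng.normal(8.0, 1.0, size=(3, 8))
batches = np.array(["A"] * 4 + ["B"] * 4)

# Give batch B a probeset-specific offset, mimicking a systematic batch bias.
raw[:, batches == "B"] += np.array([[2.0], [-1.5], [0.5]])

centered = mean_batch_center(raw, batches)

# Log fold change between two samples of the same batch, before and after.
fc_raw = raw[:, 4] - raw[:, 5]
fc_centered = centered[:, 4] - centered[:, 5]
print(np.allclose(fc_raw, fc_centered))
```

Because the same per-probeset constant is subtracted from every sample in a batch, any within-batch difference is preserved, while each batch ends up with a mean of zero for every probeset.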


Additional file 2:

Concordance of mean expression values of data generated from different experiments. Pearson correlation coefficients are given for uncorrected and mean batch-corrected data, for RMA and MAS5 data, and using alternative cdf files [5].

Format: PDF Size: 25KB Download file


Additional file 3:

The top 50 differentially expressed probesets between basal and non basal-like/luminal tumours were identified across datasets. Those probesets in common are listed. Before: comparison was performed prior to mean batch-centering. After: comparison was performed following mean batch-centering.

Format: PDF Size: 19KB Download file


Additional file 4:

Summary of the effect of mean batch-centering on data generated from published studies. Lists of the top 50 differentially expressed probesets between basal and non basal-like/luminal tumours were identified within and across datasets, before and after mean batch-centering. SAM Common: for each column, two different pairwise comparisons were performed using SAM, and the top 50 probesets were identified for each comparison. The number reported is the size of the intersection of the two lists. UC = uncorrected. MC = mean-centering correction.

Format: PDF Size: 16KB Download file
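The "SAM Common" count described above can be sketched as a simple set intersection. This minimal example assumes each comparison has already produced a list of probeset IDs ranked by significance. It does not run SAM itself, and the function name `top_n_overlap` and the probeset IDs are hypothetical.

```python
def top_n_overlap(ranked_a, ranked_b, n=50):
    """Return the number of probesets shared by the top n of two ranked lists,
    together with the shared IDs in sorted order."""
    common = set(ranked_a[:n]) & set(ranked_b[:n])
    return len(common), sorted(common)

# Hypothetical ranked probeset IDs, most significant first.
list_one = ["ps_01", "ps_02", "ps_03", "ps_04", "ps_05"]
list_two = ["ps_03", "ps_09", "ps_01", "ps_07", "ps_05"]

# n=3 keeps the toy example small; the file described above uses the top 50.
count, shared = top_n_overlap(list_one, list_two, n=3)
print(count, shared)
```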


Additional file 5:

Examples of cross-validation and survival curves from supervised principal components analysis. Cross-validation plots (A, C) and Kaplan-Meier recurrence curves (B, D) using the Wang et al. dataset as the test set and either a single (Pawitan et al.) dataset (A, B) or five (Chin et al., Desmedt et al., Ivshina et al., Pawitan et al. and Sotiriou et al.) datasets (C, D) combined as the training set. Values at the top of the cross-validation plots are the numbers of probesets used to create the profiles; the black, red and green lines represent the 1st, 2nd and 3rd principal components respectively.

Format: PDF Size: 62KB Download file


Additional file 6:

Full matrix of the 1109 R² and p-values for all possible combinations of the six training and test sets. The R² statistic (Cox proportional hazards model) measures the percentage of the variation in time to recurrence that is explained by each combination of test datasets. The p-values are those of the associated log-rank statistic obtained when applying the test dataset to the training dataset.

Format: XLS Size: 33KB Download file


Additional file 7:

Comparison of published datasets composed of different ratios of basal and luminal tumours. The number of basal (red) and luminal (blue) tumours from the Farmer et al. (italics) and Richardson et al. studies was varied in order to compare the effect of dataset composition, between (A, B, C) and across (D, E, F) the studies. The datasets were either uncorrected (light grey dots), mean-centered (black open squares) or weighted mean-centered (dark grey open circles). UC = uncorrected, MC = mean-centered.

Format: PDF Size: 172KB Download file
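This listing does not define weighted mean-centering, so the sketch below shows only one possible reading, stated as an assumption: centre each probeset on the average of the class means (basal and luminal weighted equally) rather than on the overall mean. The function names and the toy values are hypothetical. The example illustrates why dataset composition matters: with plain mean-centering, the same basal expression level ends up at different centred values in two datasets that differ only in their basal-to-luminal ratio.

```python
import numpy as np

def plain_center(values):
    """Centre one probeset on the overall mean of the dataset."""
    values = np.asarray(values, dtype=float)
    return values - values.mean()

def class_balanced_center(values, classes):
    """Centre one probeset on the unweighted average of the class means.

    This is an assumed form of composition-aware centering, not necessarily
    the weighting scheme used in the paper.
    """
    values = np.asarray(values, dtype=float)
    classes = np.asarray(classes)
    class_means = [values[classes == c].mean() for c in np.unique(classes)]
    return values - np.mean(class_means)

# Two toy datasets with the same biology but different composition.
ds1_classes = np.array(["basal"] * 6 + ["luminal"] * 2)
ds2_classes = np.array(["basal"] * 2 + ["luminal"] * 6)
ds1 = np.where(ds1_classes == "basal", 10.0, 6.0)
ds2 = np.where(ds2_classes == "basal", 10.0, 6.0)

# Centred value of a basal tumour (index 0 is basal in both datasets).
plain_ds1 = plain_center(ds1)[0]
plain_ds2 = plain_center(ds2)[0]
balanced_ds1 = class_balanced_center(ds1, ds1_classes)[0]
balanced_ds2 = class_balanced_center(ds2, ds2_classes)[0]

print(plain_ds1, plain_ds2)        # differ between datasets
print(balanced_ds1, balanced_ds2)  # agree between datasets
```

A class-balanced centre requires the class labels to be known in advance, which is a stronger requirement than plain mean-centering.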


Additional file 8:

Effects of combining datasets composed solely of ER+ or ER- breast tumours. Datasets from Loi et al. [32] and Minn et al. [33] that are composed wholly of ER+ or ER- tumours have distorted levels of ESR1 transcript if integrated with datasets composed of both ER+ and ER- tumours. Replacing any of the six heterogeneous datasets with homogeneous datasets results in a dramatic reduction in the correlation between dataset or tumour number and the association with principal components and recurrence (B).

Format: PDF Size: 88KB Download file


Additional file 9:

Weighted-mean centering does not significantly improve prognostic prediction when combining datasets or tumours, compared with mean-centering. Five datasets with recorded ER status from immunohistochemistry were used to assess the correction methods as in Figure 4. The R² statistic (Cox proportional hazards model) is an assessment of the performance of the predictor generated using each combination of training datasets and the remaining test datasets, generated using supervised principal components analysis. Median values are used where a training dataset was used to assess more than one test dataset (up to 5). R² and p-value results for all possible combinations of training datasets and test datasets (1016) are given in the matrix in Additional Table 5.

Format: PDF Size: 16KB Download file


Additional file 10:

Distance-weighted discrimination (DWD) method. Comparison of the DWD method (green dots) between (A, B) and across (C, D) validation (A, C) and published (B, D) datasets with mean- (red dots) and weighted mean- (blue circles) centering (see Table 3 for SAM analysis). E, DWD correction of the two breast tumour gene expression profiles generated by the two published studies as in Figure 2. Clustering of tumours based upon 640 probesets representing Sorlie et al. [8] 'intrinsic' genes. Thumbnail shows all 640 probesets. i) Tumours classified by Richardson et al. [10]: red = basal-like, blue = non-basal-like, pink = BRCA1; tumours classified by Farmer et al. [11]: red = basal, blue = luminal, green = apocrine. Clusters of genes associated with the 'Sorlie subtypes' are highlighted as follows: ii) ERBB2 gene cluster, iii) luminal A gene cluster, iv) basal gene cluster. v) Centroid prediction was used to assign the tumours to the five Norway/Stanford subtypes – basal (red), luminal A (dark blue), luminal B (light blue), ERBB2 (purple) and normal-like (green); unassigned tumours are shown in grey.

Format: PDF Size: 241KB Download file
