<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1755-8794-1-42</ui>
   <ji>1755-8794</ji>
   <fm>
      <dochead>Research article</dochead>
      <bibl>
         <title>
            <p>The removal of multiplicative, systematic bias allows integration of breast cancer gene expression datasets &#8211; improving meta-analysis and prediction of prognosis</p>
         </title>
         <aug>
            <au id="A1" ca="yes">
               <snm>Sims</snm>
               <mi>H</mi>
               <fnm>Andrew</fnm>
               <insr iid="I1"/>
               <insr iid="I2"/>
               <email>andrew.sims@ed.ac.uk</email>
            </au>
            <au id="A2">
               <snm>Smethurst</snm>
               <mi>J</mi>
               <fnm>Graeme</fnm>
               <insr iid="I3"/>
               <email>graemesmethurst@yahoo.co.uk</email>
            </au>
            <au id="A3">
               <snm>Hey</snm>
               <fnm>Yvonne</fnm>
               <insr iid="I4"/>
               <email>yhey@picr.man.ac.uk</email>
            </au>
            <au id="A4">
               <snm>Okoniewski</snm>
               <mi>J</mi>
               <fnm>Michal</fnm>
               <insr iid="I3"/>
               <insr iid="I5"/>
               <email>michal.okoniewski@fgcz.ethz.ch</email>
            </au>
            <au id="A5">
               <snm>Pepper</snm>
               <mi>D</mi>
               <fnm>Stuart</fnm>
               <insr iid="I4"/>
               <email>spepper@picr.man.ac.uk</email>
            </au>
            <au id="A6">
               <snm>Howell</snm>
               <fnm>Anthony</fnm>
               <insr iid="I2"/>
               <email>anthony.howell@christie.nhs.uk</email>
            </au>
            <au id="A7">
               <snm>Miller</snm>
               <mi>J</mi>
               <fnm>Crispin</fnm>
               <insr iid="I3"/>
               <email>cmiller@picr.man.ac.uk</email>
            </au>
            <au id="A8">
               <snm>Clarke</snm>
               <mi>B</mi>
               <fnm>Robert</fnm>
               <insr iid="I2"/>
               <email>rclarke@picr.man.ac.uk</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Applied Bioinformatics of Cancer Research Group, Breakthrough Research Unit, Edinburgh Cancer Research Centre, Western General Hospital, Crewe Road South, Edinburgh, EH4 2XR, UK</p>
            </ins>
            <ins id="I2">
               <p>Breast Biology Group, School of Cancer and Imaging Sciences, University of Manchester, UK</p>
            </ins>
            <ins id="I3">
               <p>Cancer Research UK Applied Computational Biology and Bioinformatics Group</p>
            </ins>
            <ins id="I4">
               <p>Cancer Research UK Affymetrix Service, Paterson Institute for Cancer Research, Wilmslow Road, Manchester M20 4BX, UK</p>
            </ins>
            <ins id="I5">
               <p>Functional Genomics Center, UNI ETH Zurich, Winterthurerstrasse 190, CH-8057 Zurich, Switzerland</p>
            </ins>
         </insg>
         <source>BMC Medical Genomics</source>
         <issn>1755-8794</issn>
         <pubdate>2008</pubdate>
         <volume>1</volume>
         <issue>1</issue>
         <fpage>42</fpage>
         <url>http://www.biomedcentral.com/1755-8794/1/42</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">18803878</pubid>
               <pubid idtype="doi">10.1186/1755-8794-1-42</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>14</day>
               <month>5</month>
               <year>2008</year>
            </date>
         </rec>
         <acc>
            <date>
               <day>21</day>
               <month>9</month>
               <year>2008</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>21</day>
               <month>9</month>
               <year>2008</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2008</year>
         <collab>Sims et al; licensee BioMed Central Ltd.</collab>
         <note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>The number of gene expression studies in the public domain is rapidly increasing, representing a highly valuable resource. However, dataset-specific bias precludes meta-analysis at the raw transcript level, even when the RNA is from comparable sources and has been processed on the same microarray platform using similar protocols. Here, we demonstrate, using Affymetrix data, that much of this bias can be removed, allowing multiple datasets to be legitimately combined for meaningful meta-analyses.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>A series of validation datasets comparing breast cancer and normal breast cell lines (MCF7 and MCF10A) was generated to examine the variability between datasets produced using different amounts of starting RNA, alternative protocols, different generations of Affymetrix GeneChip or different scanning hardware. We demonstrate that systematic, multiplicative biases are introduced at the RNA, hybridization and image-capture stages of a microarray experiment. Simple batch mean-centering was found to significantly reduce the level of inter-experimental variation, allowing raw transcript levels to be compared across datasets with confidence. By accounting for dataset-specific bias, we were able to assemble the largest gene expression dataset of primary breast tumours to date (1107 tumours), from six previously published studies. Using this meta-dataset, we demonstrate that combining greater numbers of datasets or tumours leads to a greater overlap in differentially expressed genes and more accurate prognostic predictions. However, this is highly dependent upon the composition of the datasets and patient characteristics.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusion</p>
               </st>
               <p>Multiplicative, systematic biases are introduced at many stages of microarray experiments. When these are reconciled, raw data can be directly integrated from different gene expression datasets leading to new biological findings with increased statistical power.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <meta>
      <classifications>
         <classification type="bmc" subtype="user_supplied_xml" id="endnote"/>
      </classifications>
   </meta>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>Successful microarray experiments are reliant on sufficient care being taken to minimize and account for experimental variability. Formalization and control of all stages of the experimental pipeline is now routine, and the need to associate experiments with detailed descriptions of protocols and techniques is now widely accepted <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>. However, despite these efforts, it is still not possible to account for all potential sources of variation, and even identical experiments performed at different sites have been shown to produce significantly different results <abbrgrp><abbr bid="B2">2</abbr></abbrgrp>. This makes it difficult to routinely compare gene expression data generated from different experiments, even when using samples from comparable sources that have been processed on the same microarray platform using similar protocols.</p>
         <p>Issues of experimental reproducibility have become increasingly important with the advent of microarray databases and repositories (e.g. ArrayExpress <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>, GEO <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>), given the potential they offer for cross-experimental comparison and data mining. Even if inter-experiment variation can be controlled to the point where such comparisons are feasible, rapid developments in both the hybridization protocols and the arrays themselves mean that the technology continues to change. Lower requirements for the amount of starting RNA have enabled gene expression profiling to be combined with cell sorting methods or laser capture microdissection, while increases in the number of features represented on the arrays have resulted in progressively more detailed coverage of the transcriptome. Since each advance in technology leads to genuine improvements, there is a strong incentive to use the latest arrays and protocols whenever possible. This is, however, tempered by a lack of backward compatibility between datasets produced using different array and protocol versions, and any decision to move to a newer (better) iteration of the technology must be made with an appreciation of the difficulty in maintaining compatibility with previous studies.</p>
         <p>In this study we first demonstrate, using an extended series of validation data, that Affymetrix datasets cannot in general be compared at the raw expression level due to systematic, multiplicative biases. Secondly, we show that simple batch mean-centering can significantly reduce the level of inter-experimental variation and that this allows raw transcript levels to be compared across datasets. The approach is then applied to a series of published breast cancer studies and we show that the integrated datasets possess increased statistical power and improved prognostic ability, compared to the individual datasets alone.</p>
      </sec>
      <sec>
         <st>
            <p>Results</p>
         </st>
         <sec>
            <st>
               <p>Systematic bias in microarray data</p>
            </st>
            <p>All validation datasets (consisting of six GeneChips each) were produced by hybridizing triplicate RNA samples from a breast cancer cell line (MCF7) and an immortalised normal breast cell line (MCF10A) using a variety of different array types and sample preparation protocols. In all cases, the aim of the validation study was to assess the correspondence between the sets of differentially expressed probesets (or transcripts) identified when using different versions of the same underlying technology. In the hypothetical situation in which all datasets yield identical results, the same set of differentially expressed probesets would be identified, irrespective of the array type or protocol used to process the data; we considered how closely the data approach this ideal. For example, a comparison of the fold changes <it>between </it>datasets (MCF7-amplified/MCF10A-amplified vs. MCF7-unamplified/MCF10A-unamplified) generated using Affymetrix's small sample protocol and Affymetrix's standard protocol yields good, but not perfect, correspondence (Fig. <figr fid="F1">1A</figr>, Table <tblr tid="T1">1</tblr>). However, when fold changes are calculated <it>across </it>datasets (MCF7-amplified/MCF10A-unamplified vs. MCF7-unamplified/MCF10A-amplified), correspondence falls dramatically, from 92% to 59% of probesets within two-fold (Fig. <figr fid="F1">1B</figr> grey dots, Table <tblr tid="T1">1</tblr>).</p>
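            <p>The between/across comparison described above can be illustrated with a minimal sketch. It assumes that a fold change is the log2 ratio of mean replicate intensities and that "within two-fold" means the two log2 fold changes differ by less than 1; the function names, probeset labels and intensities are hypothetical and are not taken from the validation data.</p>

```python
import math


def log2_fold_change(numerator, denominator):
    """log2 ratio of the mean intensities of two replicate groups."""
    mean_num = sum(numerator) / len(numerator)
    mean_den = sum(denominator) / len(denominator)
    return math.log2(mean_num / mean_den)


def fraction_within_two_fold(fold_changes_x, fold_changes_y):
    """Fraction of probesets whose two log2 fold changes deviate by less than two-fold."""
    pairs = list(zip(fold_changes_x, fold_changes_y))
    consistent = sum(1 for x, y in pairs if 1.0 > abs(x - y))
    return consistent / len(pairs)


# Hypothetical linear-scale intensities: probeset -> cell line -> triplicates.
unamplified = {
    "p1": {"MCF7": [190, 200, 210], "MCF10A": [95, 100, 105]},
    "p2": {"MCF7": [100, 100, 100], "MCF10A": [100, 100, 100]},
}
# Same samples with a probeset-specific multiplicative bias
# (four-fold for p1, 1.2-fold for p2).
amplified = {
    "p1": {"MCF7": [760, 800, 840], "MCF10A": [380, 400, 420]},
    "p2": {"MCF7": [120, 120, 120], "MCF10A": [120, 120, 120]},
}
probesets = ["p1", "p2"]

# Between datasets: each fold change uses arrays from a single dataset.
between = fraction_within_two_fold(
    [log2_fold_change(amplified[p]["MCF7"], amplified[p]["MCF10A"]) for p in probesets],
    [log2_fold_change(unamplified[p]["MCF7"], unamplified[p]["MCF10A"]) for p in probesets],
)

# Across datasets: numerator and denominator come from different datasets.
across = fraction_within_two_fold(
    [log2_fold_change(amplified[p]["MCF7"], unamplified[p]["MCF10A"]) for p in probesets],
    [log2_fold_change(unamplified[p]["MCF7"], amplified[p]["MCF10A"]) for p in probesets],
)
```

            <p>In this toy example the bias cancels in every between-dataset fold change, so both probesets agree, whereas the four-fold bias on the hypothetical probeset p1 pushes its across-dataset fold changes well outside the two-fold window.</p>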
            <tbl id="T1">
               <title>
                  <p>Table 1</p>
               </title>
               <caption>
                  <p>Summary of the effect of mean batch-centering on data generated from different experiments.</p>
               </caption>
               <tblbdy cols="10">
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c cspan="4" ca="center">
                        <p>Within two-fold consistent (%)</p>
                     </c>
                     <c cspan="4" ca="center">
                        <p>SAM Common, top 1000</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c cspan="8">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Data from different experiments</p>
                     </c>
                     <c ca="center">
                        <p>Probesets</p>
                     </c>
                     <c cspan="2" ca="center">
                        <p>Between datasets</p>
                     </c>
                     <c cspan="2" ca="center">
                        <p>Across datasets</p>
                     </c>
                     <c cspan="2" ca="center">
                        <p>Between datasets</p>
                     </c>
                     <c cspan="2" ca="center">
                        <p>Across datasets</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c cspan="8">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>Before</p>
                     </c>
                     <c ca="center">
                        <p>After</p>
                     </c>
                     <c ca="center">
                        <p>Before</p>
                     </c>
                     <c ca="center">
                        <p>After</p>
                     </c>
                     <c ca="center">
                        <p>Before</p>
                     </c>
                     <c ca="center">
                        <p>After</p>
                     </c>
                     <c ca="center">
                        <p>Before</p>
                     </c>
                     <c ca="center">
                        <p>After</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="10">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Amplified 10 ng and unamplified 10 &#956;g protocols, RMA, U133A</p>
                     </c>
                     <c ca="center">
                        <p>22283</p>
                     </c>
                     <c ca="center">
                        <p>20645 (92%)</p>
                        <p>(Fig 1A)</p>
                     </c>
                     <c ca="center">
                        <p>20645 (92%)</p>
                        <p>(Fig 1A)</p>
                     </c>
                     <c ca="center">
                        <p>13221 (59%)</p>
                        <p>(Fig 1B)</p>
                     </c>
                     <c ca="center">
                        <p>22283 (100%)</p>
                        <p>(Fig 1B)</p>
                     </c>
                     <c ca="center">
                        <p>522</p>
                        <p>(0.031)</p>
                        <p>(0.032)</p>
                     </c>
                     <c ca="center">
                        <p>522</p>
                        <p>(0.031)</p>
                        <p>(0.032)</p>
                     </c>
                     <c ca="center">
                        <p>251</p>
                        <p>(0.023)</p>
                        <p>(0.037)</p>
                     </c>
                     <c ca="center">
                        <p>594</p>
                        <p>(0.025)</p>
                        <p>(0.035)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>U133A and plus 2.0 arrays common, MAS5 present = 4/6 chips</p>
                     </c>
                     <c ca="center">
                        <p>11198</p>
                     </c>
                     <c ca="center">
                        <p>10641 (95%)</p>
                     </c>
                     <c ca="center">
                        <p>10669 (95%)</p>
                     </c>
                     <c ca="center">
                        <p>2675 (24%)</p>
                     </c>
                     <c ca="center">
                        <p>11170 (100%)</p>
                     </c>
                     <c ca="center">
                        <p>493</p>
                        <p>(0.037)</p>
                        <p>(0.036)</p>
                     </c>
                     <c ca="center">
                        <p>493</p>
                        <p>(0.037)</p>
                        <p>(0.036)</p>
                     </c>
                     <c ca="center">
                        <p>112</p>
                        <p>(0.026)</p>
                        <p>(0.027)</p>
                     </c>
                     <c ca="center">
                        <p>954</p>
                        <p>(0.041)</p>
                        <p>(0.040)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Exon v U133 plus 2 (consensus mapping) Plier and MAS5</p>
                     </c>
                     <c ca="center">
                        <p>44280</p>
                     </c>
                     <c ca="center">
                        <p>37255 (84%)</p>
                     </c>
                     <c ca="center">
                        <p>37255 (84%)</p>
                     </c>
                     <c ca="center">
                        <p>9485 (21%)</p>
                     </c>
                     <c ca="center">
                        <p>44280 (100%)</p>
                     </c>
                     <c ca="center">
                        <p>528</p>
                        <p>(0.067)</p>
                        <p>(0.060)</p>
                     </c>
                     <c ca="center">
                        <p>528</p>
                        <p>(0.067)</p>
                        <p>(0.060)</p>
                     </c>
                     <c ca="center">
                        <p>569</p>
                        <p>(0.024)</p>
                        <p>(0.020)</p>
                     </c>
                     <c ca="center">
                        <p>847</p>
                        <p>(0.110)</p>
                        <p>(0.089)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Exon v U133 plus 2 (SIF mapping) Plier and MAS5</p>
                     </c>
                     <c ca="center">
                        <p>13730</p>
                     </c>
                     <c ca="center">
                        <p>12028 (88%)</p>
                     </c>
                     <c ca="center">
                        <p>12028 (88%)</p>
                     </c>
                     <c ca="center">
                        <p>3626 (26%)</p>
                     </c>
                     <c ca="center">
                        <p>13730 (100%)</p>
                     </c>
                     <c ca="center">
                        <p>524</p>
                        <p>(0.068)</p>
                        <p>(0.060)</p>
                     </c>
                     <c ca="center">
                        <p>524</p>
                        <p>(0.068)</p>
                        <p>(0.060)</p>
                     </c>
                     <c ca="center">
                        <p>731</p>
                        <p>(0.026)</p>
                        <p>(0.026)</p>
                     </c>
                     <c ca="center">
                        <p>916</p>
                        <p>(0.569)</p>
                        <p>(0.370)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Standard 10 &#956;g and revised 2 &#956;g protocols, RMA, U133A</p>
                     </c>
                     <c ca="center">
                        <p>22283</p>
                     </c>
                     <c ca="center">
                        <p>21303 (96%)</p>
                     </c>
                     <c ca="center">
                        <p>21303 (96%)</p>
                     </c>
                     <c ca="center">
                        <p>19779 (89%)</p>
                     </c>
                     <c ca="center">
                        <p>22283 (100%)</p>
                     </c>
                     <c ca="center">
                        <p>688</p>
                        <p>(0.035)</p>
                        <p>(0.050)</p>
                     </c>
                     <c ca="center">
                        <p>688</p>
                        <p>(0.035)</p>
                        <p>(0.050)</p>
                     </c>
                     <c ca="center">
                        <p>618</p>
                        <p>(0.039)</p>
                        <p>(0.044)</p>
                     </c>
                     <c ca="center">
                        <p>901</p>
                        <p>(0.044)</p>
                        <p>(0.042)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>GeneChip 3000 and GeneArray 2500 scanners, RMA, U133A</p>
                     </c>
                     <c ca="center">
                        <p>22283</p>
                     </c>
                     <c ca="center">
                        <p>22276 (100%)</p>
                     </c>
                     <c ca="center">
                        <p>22276 (100%)</p>
                     </c>
                     <c ca="center">
                        <p>22282 (100%)</p>
                     </c>
                     <c ca="center">
                        <p>22283 (100%)</p>
                     </c>
                     <c ca="center">
                        <p>883</p>
                        <p>(0)</p>
                        <p>(0)</p>
                     </c>
                     <c ca="center">
                        <p>883</p>
                        <p>(0)</p>
                        <p>(0)</p>
                     </c>
                     <c ca="center">
                        <p>872</p>
                        <p>(0)</p>
                        <p>(0)</p>
                     </c>
                     <c ca="center">
                        <p>879</p>
                        <p>(0)</p>
                        <p>(0)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>NuGEN 10 ng and EpiStem 2 ng amplification, RMA, U133 plus 2</p>
                     </c>
                     <c ca="center">
                        <p>54675</p>
                     </c>
                     <c ca="center">
                        <p>49308 (90%)</p>
                     </c>
                     <c ca="center">
                        <p>49308 (90%)</p>
                     </c>
                     <c ca="center">
                        <p>28276 (52%)</p>
                     </c>
                     <c ca="center">
                        <p>54675 (100%)</p>
                     </c>
                     <c ca="center">
                        <p>530</p>
                        <p>(0.060)</p>
                        <p>(0.066)</p>
                     </c>
                     <c ca="center">
                        <p>530</p>
                        <p>(0.060)</p>
                        <p>(0.066)</p>
                     </c>
                     <c ca="center">
                        <p>364</p>
                        <p>(0.035)</p>
                        <p>(0.034)</p>
                     </c>
                     <c ca="center">
                        <p>743</p>
                        <p>(0.070)</p>
                        <p>(0.069)</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>Sets of differentially expressed probesets comparing MCF7 and MCF10A replicates were identified for each experiment, before and after mean batch-centering. Comparisons between and across different validation experiments were performed. The number (%) of probesets with less than 2-fold deviation in the fold change found for each comparison is reported in the table. SAM Common: for each column two different pairwise comparisons using SAM were performed, and the top 1000 probesets identified for each comparison. The number reported is the intersection between the two sets. Before: comparison was performed prior to mean batch-centering. After: comparison was performed following mean batch-centering. Values in brackets are the FDRs for the two top 1000 probeset lists. See Additional File <supplr sid="S1">1</supplr> for plots.</p>
               </tblfn>
            </tbl>
            <fig id="F1">
               <title>
                  <p>Figure 1</p>
               </title>
               <caption>
                  <p>Comparison of Affymetrix gene expression data generated using amplified and unamplified protocols</p>
               </caption>
               <text>
                  <p><b>Comparison of Affymetrix gene expression data generated using amplified and unamplified protocols</b>. A, Comparing fold changes <it>between </it>unamplified and amplified datasets demonstrates reasonable correlation. B, Comparing fold changes <it>across </it>datasets (unamplified MCF7 with amplified MCF10A and vice versa) is clearly impractical (grey spots); however, following mean batch-centering there is excellent correlation across the datasets (black spots). C, Comparison of mean raw expression levels for amplified and unamplified MCF10A replicates before (grey) and after mean batch-centering (black). D, Pearson clustering of the GeneChips representing the same cell lines is tighter following mean-centering. E, Mean-centering has no effect on fold changes between datasets. F, Mean-centering of unbalanced datasets (duplicate rather than triplicate amplified MCF10A) results in a distortion of the comparison (black spots); however, this is rectified with weighted mean-centering (open dark grey spots). Both methods show a dramatic improvement over uncorrected data (light grey spots).</p>
               </text>
               <graphic file="1755-8794-1-42-1"/>
            </fig>
            <p>Batch mean-centering (see Methods) of the amplified and unamplified datasets was found to dramatically increase the correspondence <it>across </it>the datasets, with 100% of probesets having fold changes within two-fold between experimental branches (Fig. <figr fid="F1">1B</figr> black dots, Table <tblr tid="T1">1</tblr>). Similarly, when significance analysis of microarrays (SAM) was used to identify lists of probesets with statistically significant changes between the same replicate groupings used to generate Fig. <figr fid="F1">1</figr>, the intersection between sets was found to be greater following mean batch correction (Table <tblr tid="T1">1</tblr>), and the Pearson correlation of raw intensities also increased (Figs. <figr fid="F1">1C</figr>, <figr fid="F1">1D</figr> and Additional File <supplr sid="S1">1</supplr>). It is notable that while correction improves fold-change correspondence <it>across </it>protocols, fold changes <it>between </it>datasets are preserved (Fig. <figr fid="F1">1E</figr>, Table <tblr tid="T1">1</tblr>).</p>
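            <p>As a rough illustration of the correction, the sketch below subtracts each probeset's batch mean from its log2 expression values. Working on log2 values, the function names and the toy data are assumptions made for this example, and it should be read as a sketch rather than the exact procedure given in the Methods.</p>

```python
def batch_mean_center(batch):
    """Subtract each probeset's batch mean from its log2 values.

    batch maps a probeset id to a list of log2 expression values,
    one value per array in that batch.
    """
    centered = {}
    for probeset, values in batch.items():
        batch_mean = sum(values) / len(values)
        centered[probeset] = [value - batch_mean for value in values]
    return centered


def mean(values):
    return sum(values) / len(values)


def fold_change(mcf7_batch, mcf10a_batch, probeset="probeset_1"):
    """log2 fold change: MCF7 arrays of one batch over MCF10A arrays of another."""
    return mean(mcf7_batch[probeset][:3]) - mean(mcf10a_batch[probeset][3:])


# Hypothetical log2 data for one probeset; arrays 0-2 are MCF7, 3-5 are MCF10A.
# The amplified batch carries a four-fold (log2 = 2) multiplicative bias.
unamplified = {"probeset_1": [8.0, 8.0, 8.0, 7.0, 7.0, 7.0]}
amplified = {"probeset_1": [10.0, 10.0, 10.0, 9.0, 9.0, 9.0]}

centered_unamplified = batch_mean_center(unamplified)
centered_amplified = batch_mean_center(amplified)

# Across-dataset fold change before and after centering.
across_before = fold_change(amplified, unamplified)
across_after = fold_change(centered_amplified, centered_unamplified)

# Within-dataset fold change is unchanged by centering.
within_before = fold_change(amplified, amplified)
within_after = fold_change(centered_amplified, centered_amplified)
```

            <p>With these toy numbers the across-dataset log2 fold change falls from 3.0 to 1.0 after centering, matching the within-dataset value, which centering leaves unchanged. Both batches here are balanced (three arrays per cell line), which is the situation in which plain, unweighted centering is expected to behave well.</p>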
            <suppl id="S1">
               <title>
                  <p>Additional file 1</p>
               </title>
               <text>
                  <p><b>Comparison of Affymetrix gene expression data generated using different generations of GeneChips, scanning hardware and protocols</b>. A, Comparing the fold change between replicates across datasets is clearly impractical (grey). However, following mean batch-centering there is good correlation (black). B, Comparison of mean raw expression levels for amplified and unamplified MCF10A replicates before (grey) and after mean batch-centering (black). C, Overall transcriptome similarity of individual GeneChips demonstrated with Pearson clustering. D, Fold changes are unaffected by mean batch-centering.</p>
               </text>
               <file name="1755-8794-1-42-S1.pdf">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <p>The same approach was applied to a variety of other validation datasets that were designed to investigate the effect of using different generations of Affymetrix GeneChips, alternative protocols and scanning hardware (Table <tblr tid="T1">1</tblr>, Additional File <supplr sid="S1">1</supplr>). A systematic bias was found to be present in all datasets, and correspondence improved following mean-centering in all cases (median centering also performed similarly; data not shown). These results demonstrate that systematic, multiplicative bias is a widespread property within Affymetrix array data, and mean-centering was found to lead to improvements irrespective of the summarisation method used (RMA or MAS5 <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>) to process the data or when using alternative Chip Description Files (CDFs) to group probes according to Unigene cluster <abbrgrp><abbr bid="B5">5</abbr></abbrgrp> (Additional File <supplr sid="S2">2</supplr>).</p>
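            <p>Why centering removes a multiplicative bias can be written out briefly. The model below is an illustrative assumption rather than one stated in the text: y is the observed intensity of probeset g for sample i in batch b, x is the underlying expression level, beta is a probeset- and batch-specific scaling factor, and n_b is the number of arrays in batch b.</p>

```latex
\[ y_{gib} = \beta_{gb}\, x_{gi}
   \quad\Longrightarrow\quad
   \log_2 y_{gib} = \log_2 \beta_{gb} + \log_2 x_{gi} \]

\[ \overline{\log_2 y}_{gb} = \log_2 \beta_{gb}
   + \frac{1}{n_b} \sum_{j \in b} \log_2 x_{gj} \]

\[ \log_2 y_{gib} - \overline{\log_2 y}_{gb}
   = \log_2 x_{gi} - \frac{1}{n_b} \sum_{j \in b} \log_2 x_{gj} \]
```

            <p>Under this model the centred value no longer depends on the scaling factor, and the difference between two samples from the same batch is unchanged, in line with the preserved fold changes between datasets. The remaining batch-level term is the mean underlying expression, so centred values from two batches are directly comparable only when the batches contain a similar mix of samples.</p>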
            <suppl id="S2">
               <title>
                  <p>Additional file 2</p>
               </title>
               <text>
                  <p><b>Concordance of mean expression values of data generated from different experiments</b>. Pearson correlation coefficients are given for uncorrected and mean batch-corrected data, for RMA and MAS5 data, and using alternative CDF files <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>.</p>
               </text>
               <file name="1755-8794-1-42-S2.pdf">
                  <p>Click here for file</p>
               </file>
            </suppl>
         </sec>
         <sec>
            <st>
               <p>Integration of published breast tumour datasets</p>
            </st>
            <p>Breast tumours have been classified into five molecular subtypes; basal, luminal A, luminal B, ERBB2 and normal-like by identifying a set of genes with significantly greater variation in expression between different tumours than between paired samples from the same tumour <abbrgrp><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr><abbr bid="B8">8</abbr></abbrgrp>. Since members of this set appear to define properties 'intrinsic' to each subtype, the authors referred to the genes as an 'intrinsic gene set'. 640 Affymetrix probesets representing the 534 'intrinsic gene set' from <abbrgrp><abbr bid="B13">13</abbr></abbrgrp> were identified using NetAffx <abbrgrp><abbr bid="B9">9</abbr></abbrgrp>. These probesets were used to cluster MAS5 normalised expression data from two published Affymetrix gene expression studies <abbrgrp><abbr bid="B10">10</abbr><abbr bid="B11">11</abbr></abbrgrp> with similar numbers and composition of tumours. Despite using similar starting material (primary tumours) and the same microarray platform, when combined the two datasets formed two distinct, independent clusters representing the two datasets (Fig. <figr fid="F2">2A</figr>), suggesting a dataset-specific systematic bias as observed with the validation datasets described above. Although clusters of known luminal and basal-specific genes show similar patterns of differential expression in each of the two datasets (Figure <figr fid="F2">2A iii, iv</figr>), the majority of the probesets representing the full 'intrinsic gene set' show greater differences in expression between the two studies than between the different classes of tumours. Following mean-centering as before, the 'basal-like tumours' from the Richardson <it>et al. </it><abbrgrp><abbr bid="B10">10</abbr></abbrgrp> dataset were found to cluster alongside the 'basal tumours' from Farmer <it>et al. </it><abbrgrp><abbr bid="B11">11</abbr></abbrgrp>, and the 'non-basal tumours' with the 'luminal tumours' (Fig. 
<figr fid="F2">2B</figr>). A third cluster of tumours was identified with high expression of the ERBB2 cluster of probesets. This cluster contained all of the molecular apocrine tumours, plus a mixture of basal/basal-like and luminal/non-basal-like tumours. Using centroid prediction <abbrgrp><abbr bid="B12">12</abbr></abbrgrp> as described previously, the tumours from both datasets were assigned to the five Norway/Stanford subtypes (basal, luminal A, luminal B, ERBB2 and normal-like <abbrgrp><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr><abbr bid="B8">8</abbr></abbrgrp>). Of the tumours in this third cluster, one was assigned to the luminal B subtype, thirteen to the ERBB2 subtype and nine could not be clearly assigned to a subtype (Fig. <figr fid="F2">2B v</figr>).</p>
            <fig id="F2">
               <title>
                  <p>Figure 2</p>
               </title>
               <caption>
                  <p>Comparison of breast tumour gene expression profiles generated by two published studies</p>
               </caption>
               <text>
                  <p><b>Comparison of breast tumour gene expression profiles generated by two published studies</b>. The Farmer <it>et al. </it>study used U133A GeneChips with RNA amplification, whereas the Richardson <it>et al. </it>study used U133 plus 2.0 arrays and the standard labeling protocol. A, Before mean batch-centering. B, After mean batch-centering. Hierarchical clustering of tumours based upon 640 probesets representing Sorlie <it>et al. </it><abbrgrp><abbr bid="B8">8</abbr></abbrgrp> 'intrinsic' genes. Thumbnails show all 640 probesets. i) Tumours classified by Richardson <it>et al. </it><abbrgrp><abbr bid="B10">10</abbr></abbrgrp>: red = basal-like, blue = non-basal-like, pink = BRCA1; tumours classified by Farmer <it>et al. </it><abbrgrp><abbr bid="B11">11</abbr></abbrgrp>: red = basal, blue = luminal, green = apocrine. Clusters of genes associated with the 'Sorlie subtypes' are highlighted as follows: ii) ERBB2 gene cluster, iii) luminal A gene cluster, iv) basal gene cluster. v) Centroid prediction was used to assign the tumours to the five Norway/Stanford subtypes: basal (red), luminal A (dark blue), luminal B (light blue), ERBB2 (purple) and normal-like (green); unassigned tumours are shown in grey.</p>
               </text>
               <graphic file="1755-8794-1-42-2"/>
            </fig>
            <p>The greatest single difference between molecular subtypes has repeatedly been shown to be that between estrogen receptor-positive (ER+) luminal tumours and ER-negative basal tumours <abbrgrp><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr><abbr bid="B8">8</abbr><abbr bid="B13">13</abbr><abbr bid="B14">14</abbr></abbrgrp>. SAM analysis was used to identify probesets differentially expressed between the basal/basal-like and non-basal-like/luminal subtypes using the combined data from both sets of samples. Only after mean-centering did the analysis identify probesets representing genes known to characterize the differences between these subtypes (Additional File <supplr sid="S3">3</supplr>), including estrogen receptor alpha and GATA binding protein 3, which maintains differentiation into the luminal cell fate in the mammary gland <abbrgrp><abbr bid="B15">15</abbr></abbrgrp>. In addition, following mean-centering, a greater number of statistically significant probesets were found to be differentially expressed between the tumour subtypes than between the two initial sets of samples (Additional File <supplr sid="S4">4</supplr>).</p>
            <suppl id="S3">
               <title>
                  <p>Additional file 3</p>
               </title>
               <text>
                  <p><b>The top 50 differentially expressed probesets between basal and non-basal-like/luminal tumours were identified across datasets.</b> Those probesets in common are listed. Before: comparison was performed prior to mean batch-centering. After: comparison was performed following mean batch-centering.</p>
               </text>
               <file name="1755-8794-1-42-S3.pdf">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <suppl id="S4">
               <title>
                  <p>Additional file 4</p>
               </title>
               <text>
                  <p><b>Summary of the effect of mean batch-centering on data generated from published studies.</b> Lists of the top 50 differentially expressed probesets between basal and non-basal-like/luminal tumours were identified within and across datasets, before and after mean batch-centering. SAM Common: for each column, two different pairwise comparisons were performed using SAM and the top 50 probesets were identified for each comparison. The number reported is the size of the intersection between the two lists. UC = uncorrected. MC = mean-centering correction.</p>
               </text>
               <file name="1755-8794-1-42-S4.pdf">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <p>These results encouraged us to apply the approach to integrate six previously published datasets <abbrgrp><abbr bid="B16">16</abbr><abbr bid="B17">17</abbr><abbr bid="B18">18</abbr><abbr bid="B19">19</abbr><abbr bid="B20">20</abbr><abbr bid="B21">21</abbr></abbrgrp> (Table <tblr tid="T2">2</tblr>) of primary breast tumours processed on Affymetrix U133A, U133AA or U133 plus 2.0 GeneChips. Multidimensional scaling of 1107 tumours based upon the expression of all 22,215 probesets common to the three arrays demonstrated that global gene expression is highly influenced by dataset, with tumours clustering by study (Fig. <figr fid="F3">3A</figr>), again suggesting a systematic, dataset-specific bias. Following mean-centering, however, the tumours appear to cluster by breast cancer subtype (assigned using centroid prediction <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>), regardless of the dataset from which they were generated (Fig. <figr fid="F3">3B</figr>).</p>
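            <p>The mean batch-centering step can be illustrated with a minimal sketch in Python. It assumes log-scale expression values stored per dataset as a mapping from probeset to a list of values, one per tumour; the function name mean_center_batch, the data layout and the toy values are hypothetical and are not taken from the analysis itself. For each dataset, the mean of each probeset within that dataset is subtracted from its values, so that a constant dataset-specific offset on the log scale (a multiplicative bias on the raw scale) is removed before the datasets are combined.</p>

```python
from statistics import mean


def mean_center_batch(dataset):
    """Subtract each probeset's within-dataset mean from its values.

    `dataset` maps a probeset id to a list of log-scale expression
    values, one per tumour (hypothetical layout, for illustration only).
    """
    centered = {}
    for probeset, values in dataset.items():
        probeset_mean = mean(values)
        centered[probeset] = [v - probeset_mean for v in values]
    return centered


# Toy data: study_b carries a constant offset of +2 on the log scale,
# which corresponds to a multiplicative bias on the raw scale.
study_a = {"ESR1": [9.0, 10.0, 5.0, 4.0], "KRT5": [4.0, 5.0, 9.0, 10.0]}
study_b = {p: [v + 2.0 for v in vals] for p, vals in study_a.items()}

centered_a = mean_center_batch(study_a)
centered_b = mean_center_batch(study_b)

# After centering, the dataset-specific offset is gone and the two
# toy studies are directly comparable.
print(centered_a == centered_b)
```

            <p>In this toy case the two centered datasets are identical because they differ only by the offset; with real data the residual differences would reflect biology and noise rather than the study of origin.</p>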
            <tbl id="T2">
               <title>
                  <p>Table 2</p>
               </title>
               <caption>
                  <p>Published breast cancer datasets used in this study.</p>
               </caption>
               <tblbdy cols="9">
                  <r>
                     <c ca="left">
                        <p>
                           <b>Datasets</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>No. Tumours</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>ArrayExpress/GEO ID</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>GeneChip</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>ER+</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Age</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Tumour Size (cm)</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>FU (years)</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>Reference</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="9">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Chin <it>et al. </it>2006</p>
                     </c>
                     <c ca="center">
                        <p>114</p>
                     </c>
                     <c ca="center">
                        <p>E-TABM</p>
                     </c>
                     <c ca="center">
                        <p>U133AA</p>
                     </c>
                     <c ca="center">
                        <p>67%</p>
                     </c>
                     <c ca="center">
                        <p>51</p>
                     </c>
                     <c ca="center">
                        <p>2.3</p>
                     </c>
                     <c ca="center">
                        <p>6.1</p>
                     </c>
                     <c ca="center">
                        <p>
                           <abbrgrp>
                              <abbr bid="B16">16</abbr>
                           </abbrgrp>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Desmedt <it>et al. </it>2007</p>
                     </c>
                     <c ca="center">
                        <p>198</p>
                     </c>
                     <c ca="center">
                        <p>GSE7390</p>
                     </c>
                     <c ca="center">
                        <p>U133A</p>
                     </c>
                     <c ca="center">
                        <p>68%</p>
                     </c>
                     <c ca="center">
                        <p>47</p>
                     </c>
                     <c ca="center">
                        <p>2.0</p>
                     </c>
                     <c ca="center">
                        <p>13.6</p>
                     </c>
                     <c ca="center">
                        <p>
                           <abbrgrp>
                              <abbr bid="B17">17</abbr>
                           </abbrgrp>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Farmer <it>et al. </it>2005</p>
                     </c>
                     <c ca="center">
                        <p>49</p>
                     </c>
                     <c ca="center">
                        <p>GSE1561</p>
                     </c>
                     <c ca="center">
                        <p>U133A</p>
                     </c>
                     <c ca="center">
                        <p>58%</p>
                     </c>
                     <c ca="center">
                        <p>-</p>
                     </c>
                     <c ca="center">
                        <p>-</p>
                     </c>
                     <c ca="center">
                        <p>-</p>
                     </c>
                     <c ca="center">
                        <p>
                           <abbrgrp>
                              <abbr bid="B11">11</abbr>
                           </abbrgrp>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Ivshina <it>et al. </it>2006</p>
                     </c>
                     <c ca="center">
                        <p>249</p>
                     </c>
                     <c ca="center">
                        <p>GSE4922</p>
                     </c>
                     <c ca="center">
                        <p>U133A</p>
                     </c>
                     <c ca="center">
                        <p>85%</p>
                     </c>
                     <c ca="center">
                        <p>63</p>
                     </c>
                     <c ca="center">
                        <p>2.0</p>
                     </c>
                     <c ca="center">
                        <p>9.9</p>
                     </c>
                     <c ca="center">
                        <p>
                           <abbrgrp>
                              <abbr bid="B18">18</abbr>
                           </abbrgrp>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Loi <it>et al. </it>2007</p>
                     </c>
                     <c ca="center">
                        <p>119, 87</p>
                     </c>
                     <c ca="center">
                        <p>GSE6532</p>
                     </c>
                     <c ca="center">
                        <p>U133A, U133 plus2.0</p>
                     </c>
                     <c ca="center">
                        <p>100%</p>
                     </c>
                     <c ca="center">
                        <p>65, 62</p>
                     </c>
                     <c ca="center">
                        <p>2.4, 2.1</p>
                     </c>
                     <c ca="center">
                        <p>5.2, 11.4</p>
                     </c>
                     <c ca="center">
                        <p>
                           <abbrgrp>
                              <abbr bid="B32">32</abbr>
                           </abbrgrp>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Minn <it>et al. </it>2007</p>
                     </c>
                     <c ca="center">
                        <p>58</p>
                     </c>
                     <c ca="center">
                        <p>GSE5327</p>
                     </c>
                     <c ca="center">
                        <p>U133A</p>
                     </c>
                     <c ca="center">
                        <p>0%</p>
                     </c>
                     <c ca="center">
                        <p>-</p>
                     </c>
                     <c ca="center">
                        <p>-</p>
                     </c>
                     <c ca="center">
                        <p>7.2</p>
                     </c>
                     <c ca="center">
                        <p>
                           <abbrgrp>
                              <abbr bid="B33">33</abbr>
                           </abbrgrp>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Pawitan <it>et al. </it>2005</p>
                     </c>
                     <c ca="center">
                        <p>159</p>
                     </c>
                     <c ca="center">
                        <p>GSE1456</p>
                     </c>
                     <c ca="center">
                        <p>U133A</p>
                     </c>
                     <c ca="center">
                        <p>83%</p>
                     </c>
                     <c ca="center">
                        <p>58<sup>$</sup></p>
                     </c>
                     <c ca="center">
                        <p>2.2<sup>$</sup></p>
                     </c>
                     <c ca="center">
                        <p>7.1</p>
                     </c>
                     <c ca="center">
                        <p>
                           <abbrgrp>
                              <abbr bid="B19">19</abbr>
                           </abbrgrp>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Richardson <it>et al.</it></p>
                     </c>
                     <c ca="center">
                        <p>40</p>
                     </c>
                     <c ca="center">
                        <p>GSE3744</p>
                     </c>
                     <c ca="center">
                        <p>U133 plus2.0</p>
                     </c>
                     <c ca="center">
                        <p>38%</p>
                     </c>
                     <c ca="center">
                        <p>-</p>
                     </c>
                     <c ca="center">
                        <p>-</p>
                     </c>
                     <c ca="center">
                        <p>-</p>
                     </c>
                     <c ca="center">
                        <p>
                           <abbrgrp>
                              <abbr bid="B10">10</abbr>
                           </abbrgrp>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Sotiriou <it>et al. </it>2006</p>
                     </c>
                     <c ca="center">
                        <p>101*</p>
                     </c>
                     <c ca="center">
                        <p>GSE2990</p>
                     </c>
                     <c ca="center">
                        <p>U133A</p>
                     </c>
                     <c ca="center">
                        <p>71%</p>
                     </c>
                     <c ca="center">
                        <p>60</p>
                     </c>
                     <c ca="center">
                        <p>2.0</p>
                     </c>
                     <c ca="center">
                        <p>5.8</p>
                     </c>
                     <c ca="center">
                        <p>
                           <abbrgrp>
                              <abbr bid="B20">20</abbr>
                           </abbrgrp>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Wang <it>et al. </it>2005</p>
                     </c>
                     <c ca="center">
                        <p>286</p>
                     </c>
                     <c ca="center">
                        <p>GSE2034</p>
                     </c>
                     <c ca="center">
                        <p>U133A</p>
                     </c>
                     <c ca="center">
                        <p>73%</p>
                     </c>
                     <c ca="center">
                        <p>52</p>
                     </c>
                     <c ca="center">
                        <p>-</p>
                     </c>
                     <c ca="center">
                        <p>7.2</p>
                     </c>
                     <c ca="center">
                        <p>
                           <abbrgrp>
                              <abbr bid="B52">52</abbr>
                           </abbrgrp>
                        </p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>Continuous variables (age, size and follow up) are given as median values, except where <sup>$</sup> indicates that the mean was given. The follow up (FU) endpoints were recurrence-free survival for the datasets of Loi et al., Pawitan et al. and Sotiriou et al.; disease-free survival for Desmedt et al. and Ivshina et al.; and distant metastasis-free survival for Minn et al. and Wang et al. *The full dataset of Sotiriou et al. includes 189 tumours, but 88 of the Uppsala tumours are included in the dataset of Ivshina et al.</p>
               </tblfn>
            </tbl>
            <fig id="F3">
               <title>
                  <p>Figure 3</p>
               </title>
               <caption>
                  <p>Dataset-specific bias in published Affymetrix breast cancer studies</p>
               </caption>
               <text>
                  <p><b>Dataset-specific bias in published Affymetrix breast cancer studies</b>. Multidimensional scaling for all common probesets (22,215) for 1107 breast tumours from six published studies <abbrgrp><abbr bid="B16">16</abbr><abbr bid="B17">17</abbr><abbr bid="B18">18</abbr><abbr bid="B19">19</abbr><abbr bid="B20">20</abbr><abbr bid="B21">21</abbr></abbrgrp> on U133A, U133AA and U133 plus 2.0 GeneChips. Tumours from different datasets are distinguished by symbol. Tumours assigned to one of the five Sorlie <it>et al. </it>subtypes by centroid prediction are distinguished by colour. With uncorrected data the tumours cluster by study (A); following mean-centering the tumours cluster by molecular subtype (B).</p>
               </text>
               <graphic file="1755-8794-1-42-3"/>
            </fig>
            <p>A growing body of evidence supports the notion that gene expression profiling of primary breast tumours can be used to stratify patients by subtype and by the likelihood of disease progression (reviewed in <abbrgrp><abbr bid="B22">22</abbr></abbrgrp>). The approaches have included both unsupervised (intrinsic gene set <abbrgrp><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr><abbr bid="B8">8</abbr></abbrgrp>) and supervised (genes associated with patient follow up <abbrgrp><abbr bid="B17">17</abbr><abbr bid="B21">21</abbr><abbr bid="B23">23</abbr></abbrgrp>) methods, along with studies of cancer-associated pathways or tumour characteristics <abbrgrp><abbr bid="B24">24</abbr><abbr bid="B25">25</abbr><abbr bid="B26">26</abbr><abbr bid="B27">27</abbr></abbrgrp>. In all cases these signatures appear to predict recurrence, despite the lack of overlap between their respective gene lists <abbrgrp><abbr bid="B28">28</abbr></abbrgrp>. To establish whether integrating multiple datasets can improve prognostic prediction, the six published datasets (described above) were used individually or in combination as 'training sets' for supervised principal components analysis <abbrgrp><abbr bid="B29">29</abbr><abbr bid="B30">30</abbr></abbrgrp>. This method has been shown to produce more accurate predictions than several competing methods on both simulated and real microarray datasets <abbrgrp><abbr bid="B29">29</abbr><abbr bid="B30">30</abbr></abbrgrp>. Using the Superpc <abbrgrp><abbr bid="B30">30</abbr></abbrgrp> package for R <abbrgrp><abbr bid="B31">31</abbr></abbrgrp>, a Cox proportional hazards model was fitted to each predictor (generated for all combinations of one to five datasets used as the 'training set') and cross-validation curves were plotted to determine the optimum threshold for the predictor of survival as described previously <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>. 
The 1<sup>st </sup>supervised principal component was found to be the most significant in the vast majority of cases, which is consistent with the hypothesis that recurrence is an inherent property of primary tumour gene expression (examples of cross-validation and survival curves are shown in Additional File <supplr sid="S5">5</supplr>). The remaining dataset(s) were used as a test set for each predictor and an <it>R</it><sup>2 </sup>statistic was computed to assess performance. Combining greater numbers of datasets or tumours significantly improved prediction of prognosis based upon gene expression data (Fig. <figr fid="F4">4</figr>). Mean-centering of the datasets significantly increased the correlation between the supervised principal components and clinical follow up, thereby improving prognostic performance. Some combinations of training and test sets clearly gave more reliable predictions than others. Although only a limited number of patient and tumour characteristics were available (Table <tblr tid="T2">2</tblr>), it seems that the most accurate predictions are achieved for test datasets whose characteristics are most similar to those of the individual or combined training dataset. The <it>R</it><sup>2 </sup>statistics and p-values (log rank) for all possible combinations of training datasets and test datasets are given in Additional File <supplr sid="S6">6</supplr>.</p>
            <suppl id="S5">
               <title>
                  <p>Additional file 5</p>
               </title>
               <text>
                  <p><b>Examples of cross-validation and survival curves from supervised principal components analysis</b>. Cross-validation plots (A, C) and Kaplan-Meier recurrence curves (B, D) using the Wang <it>et al. </it>dataset as the test set and either a single (Pawitan <it>et al</it>.) dataset (A, B) or five (Chin <it>et al</it>., Desmedt <it>et al.</it>, Ivshina <it>et al</it>., Pawitan <it>et al. </it>and Sotiriou <it>et al.</it>) datasets (C, D) combined as the training set. Values at the top of the cross-validation plots are the numbers of probesets used to create the profiles; the black, red and green lines represent the 1<sup>st</sup>, 2<sup>nd </sup>and 3<sup>rd </sup>principal components respectively.</p>
               </text>
               <file name="1755-8794-1-42-S5.pdf">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <suppl id="S6">
               <title>
                  <p>Additional file 6</p>
               </title>
               <text>
                  <p><b>Full matrix of the 1109 <it>R</it><sup>2 </sup>and <it>p-values </it>for all possible combinations of the six training and test sets</b>. The <it>R</it><sup>2 </sup>statistic (Cox proportional hazards model) measures the percentage of the variation in time to recurrence that is explained by the predictor for each combination of training and test datasets. The <it>p-values </it>are those of the associated log-rank test, obtained when the predictor derived from the training dataset is applied to the test dataset.</p>
               </text>
               <file name="1755-8794-1-42-S6.xls">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <fig id="F4">
               <title>
                  <p>Figure 4</p>
               </title>
               <caption>
                  <p>Combining datasets or tumours and mean-centering significantly increases prognostic prediction</p>
               </caption>
               <text>
                  <p><b>Combining datasets or tumours and mean-centering significantly increases prognostic prediction</b>. A, Before mean batch-centering. B, After mean batch-centering. The <it>R</it><sup>2 </sup>statistic (Cox proportional hazards model) is an assessment of the performance of the predictor generated using each combination of training datasets and the remaining test datasets, generated using supervised principal components analysis. Median values are used where a training dataset was used to assess more than one test dataset (up to 5). <it>R</it><sup>2 </sup>and <it>p-value </it>results for all possible combinations of training datasets and test datasets (1016) are given in the matrix in Additional File <supplr sid="S6">6</supplr>.</p>
               </text>
               <graphic file="1755-8794-1-42-4"/>
            </fig>
         </sec>
         <sec>
            <st>
               <p>Dataset composition</p>
            </st>
            <p>We investigated the effect of altering the composition of luminal (ER+) and basal (ER-) tumours in the two published datasets of Farmer <it>et al. </it><abbrgrp><abbr bid="B11">11</abbr></abbrgrp> and Richardson <it>et al. </it><abbrgrp><abbr bid="B10">10</abbr></abbrgrp> compared above. Unbalancing the composition of the datasets from a 1:1 ratio of basal to luminal tumours to a 2:1 or 5:1 ratio reduced correspondence <it>between </it>datasets and caused a distortion <it>across </it>datasets (Additional File <supplr sid="S7">7</supplr>). Similar results were also observed with the MCF7/MCF10A datasets described above (Fig. <figr fid="F1">1F</figr>, Table <tblr tid="T3">3</tblr>). Weighted mean-centering for ER status removed the distortion but also reduced correspondence for the 2:1 ratio of luminal tumours, and increased correspondence in the 5:1 ratio luminal to basal comparison, at the expense of high false discovery rates (Table <tblr tid="T3">3</tblr>). An extreme example of the effect of dataset composition was seen in the expression level of the estrogen receptor in the homogeneous datasets of Loi <it>et al. </it><abbrgrp><abbr bid="B32">32</abbr></abbrgrp> and Minn <it>et al. </it><abbrgrp><abbr bid="B33">33</abbr></abbrgrp>, which are composed wholly of ER+ and ER- tumours, respectively. Following mean-centering, these datasets appear to contain a mixture of ER+ and ER- tumours (Additional File <supplr sid="S8">8A</supplr>). Replacing any of the six heterogeneous datasets above (containing 67&#8211;85% ER+ tumours) with homogeneous datasets (containing only ER+ or ER- tumours) dramatically reduced the correlation between dataset or tumour number and prediction of recurrence (Additional File <supplr sid="S8">8B</supplr>). Using weighted mean-centering to account for the differences in the proportion of ER+ tumours in five of the six datasets (individual ER status by immunohistochemistry was not available for tumours in the Pawitan <it>et al. </it>dataset) did not significantly improve prognostic performance over mean-centering alone (Additional File <supplr sid="S9">9</supplr>).</p>
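            <p>The effect of dataset composition on the centering value can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure used in the analysis: the weighting shown (the average of the ER+ mean and the ER- mean) is only one plausible reading of weighted mean-centering, and the function names and the toy ESR1 values are hypothetical.</p>

```python
from statistics import mean


def plain_offset(values):
    """Centering value for ordinary mean-centering: the overall mean."""
    return mean(values)


def er_weighted_offset(values, er_positive):
    """Average of the ER+ mean and the ER- mean.

    Assumed weighting scheme, for illustration only: the result does
    not depend on the proportion of ER+ tumours in the dataset.
    """
    pos = [v for v, er in zip(values, er_positive) if er]
    neg = [v for v, er in zip(values, er_positive) if not er]
    if not pos or not neg:
        raise ValueError("both ER classes are needed to weight the mean")
    return (mean(pos) + mean(neg)) / 2


def center(values, offset):
    return [v - offset for v in values]


# Toy ESR1 log-expression: ER+ tumours near 10, ER- tumours near 4.
unbalanced_esr1 = [10.0, 10.0, 10.0, 10.0, 10.0, 4.0]   # 5:1 ER+ to ER-
unbalanced_flags = [True, True, True, True, True, False]

plain = plain_offset(unbalanced_esr1)                       # pulled towards ER+
weighted = er_weighted_offset(unbalanced_esr1, unbalanced_flags)

# A homogeneous, all-ER+ toy dataset: after plain mean-centering the
# values straddle zero, as a mixed ER+/ER- dataset would.
all_er_positive = [11.0, 10.0, 9.0]
centered_homogeneous = center(all_er_positive, plain_offset(all_er_positive))
```

            <p>Under this assumed scheme the weighted centering value stays at the midpoint between the two classes whatever their ratio, whereas the plain mean drifts towards the majority class. A dataset containing only one ER class has no such midpoint, so its ER-dependent probesets cannot be corrected in this way.</p>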
            <suppl id="S7">
               <title>
                  <p>Additional file 7</p>
               </title>
               <text>
                  <p><b>Comparison of published datasets composed of different ratios of basal and luminal tumours</b>. The number of basal (red) and luminal (blue) tumours from the Farmer <it>et al. </it>(<it>italics</it>) and Richardson <it>et al. </it>studies was varied in order to compare the effect of dataset composition, between (A, B, C) and across (D, E, F) the studies. The datasets were either uncorrected (light grey dots), mean-centered (black open squares) or weighted mean-centered (dark grey open circles). UC = uncorrected, MC = mean-centered.</p>
               </text>
               <file name="1755-8794-1-42-S7.pdf">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <suppl id="S8">
               <title>
                  <p>Additional file 8</p>
               </title>
               <text>
                  <p><b>Effects of combining datasets composed solely of ER+ or ER- breast tumours</b>. Datasets from Loi <it>et al. </it><abbrgrp><abbr bid="B32">32</abbr></abbrgrp> and Minn <it>et al. </it><abbrgrp><abbr bid="B33">33</abbr></abbrgrp> that are composed wholly of ER+ or ER- tumours have distorted levels of ESR1 transcript if integrated with datasets composed of both ER+ and ER- tumours (A). Replacing any of the six heterogeneous datasets with homogeneous datasets results in a dramatic reduction in the correlation between dataset or tumour number and the association with principal components and recurrence (B).</p>
               </text>
               <file name="1755-8794-1-42-S8.pdf">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <suppl id="S9">
               <title>
                  <p>Additional file 9</p>
               </title>
               <text>
                  <p><b>Weighted mean-centering does not significantly improve prognostic prediction over mean-centering when combining datasets or tumours</b>. Five datasets with recorded ER status from immunohistochemistry were used to assess the correction methods as in Figure <figr fid="F4">4</figr>. The <it>R</it><sup>2 </sup>statistic (Cox proportional hazards model) is an assessment of the performance of the predictor generated using each combination of training datasets and the remaining test datasets, generated using supervised principal components analysis. Median values are used where a training dataset was used to assess more than one test dataset (up to 5). <it>R</it><sup>2 </sup>and <it>p-value </it>results for all possible combinations of training datasets and test datasets (1016) are given in the matrix in Additional Table 5.</p>
               </text>
               <file name="1755-8794-1-42-S9.pdf">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <tbl id="T3">
               <title>
                  <p>Table 3</p>
               </title>
               <caption>
                  <p>Effect of dataset composition on differential gene expression.</p>
               </caption>
               <tblbdy cols="9">
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="8" ca="center">
                        <p>SAM Common, top 1000</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="9">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Uneven comparisons</p>
                     </c>
                     <c cspan="4" ca="center">
                        <p>Between datasets</p>
                     </c>
                     <c cspan="4" ca="center">
                        <p>Across datasets</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="8">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>UC</p>
                     </c>
                     <c ca="center">
                        <p>MC</p>
                     </c>
                     <c ca="center">
                        <p>wMC</p>
                     </c>
                     <c ca="center">
                        <p>DWD</p>
                     </c>
                     <c ca="center">
                        <p>UC</p>
                     </c>
                     <c ca="center">
                        <p>MC</p>
                     </c>
                     <c ca="center">
                        <p>wMC</p>
                     </c>
                     <c ca="center">
                        <p>DWD</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="9">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Unamplified MCF7 (3) v MCF10A (3)</p>
                        <p>Amplified MCF7 (3) v MCF10A (3)</p>
                     </c>
                     <c ca="center">
                        <p>522</p>
                        <p>(0.031)</p>
                        <p>(0.032)</p>
                     </c>
                     <c ca="center">
                        <p>522</p>
                        <p>(0.031)</p>
                        <p>(0.032)</p>
                     </c>
                     <c ca="center">
                        <p>-</p>
                     </c>
                     <c ca="center">
                        <p>427</p>
                        <p>(0.029)</p>
                        <p>(0.028)</p>
                     </c>
                     <c ca="center">
                        <p>251</p>
                        <p>(0.023)</p>
                        <p>(0.037)</p>
                     </c>
                     <c ca="center">
                        <p>594</p>
                        <p>(0.025)</p>
                        <p>(0.035)</p>
                     </c>
                     <c ca="center">
                        <p>-</p>
                     </c>
                     <c ca="center">
                        <p>447</p>
                        <p>(0.023)</p>
                        <p>(0.032)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Unamplified MCF7 (3) v MCF10A (3)</p>
                        <p>Amplified MCF7 (3) v MCF10A (2)</p>
                     </c>
                     <c ca="center">
                        <p>495</p>
                        <p>(0.031)</p>
                        <p>(0.036)</p>
                     </c>
                     <c ca="center">
                        <p>495</p>
                        <p>(0.031)</p>
                        <p>(0.036)</p>
                     </c>
                     <c ca="center">
                        <p>495</p>
                        <p>(0.031)</p>
                        <p>(0.036)</p>
                     </c>
                     <c ca="center">
                        <p>469</p>
                        <p>(0.03)</p>
                        <p>(0.031)</p>
                     </c>
                     <c ca="center">
                        <p>232</p>
                        <p>(0.026)</p>
                        <p>(0.035)</p>
                     </c>
                     <c ca="center">
                        <p>600</p>
                        <p>(0.024)</p>
                        <p>(0.037)</p>
                     </c>
                     <c ca="center">
                        <p>597</p>
                        <p>(0.026)</p>
                        <p>(0.040)</p>
                     </c>
                     <c ca="center">
                        <p>550</p>
                        <p>(0.028)</p>
                        <p>(0.0)</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Richardson <it>et al. </it>Non-basal (12) v basal (12)</p>
                        <p>Farmer <it>et al. </it>Luminal A (12) v basal (12)</p>
                     </c>
                     <c ca="center">
                        <p>394</p>
                        <p>(0.003)</p>
                        <p>(0.019)</p>
                     </c>
                     <c ca="center">
                        <p>394</p>
                        <p>(0.003)</p>
                        <p>(0.019)</p>
                     </c>
                     <c ca="center">
                        <p>-</p>
                     </c>
                     <c ca="center">
                        <p>389</p>
                        <p>(0.003)</p>
                        <p>(0.019)</p>
                     </c>
                     <c ca="center">
                        <p>368</p>
                        <p>(0.001)</p>
                        <p>(0.019)</p>
                     </c>
                     <c ca="center">
                        <p>708</p>
                        <p>(0.047)</p>
                        <p>(0.02)</p>
                     </c>
                     <c ca="center">
                        <p>-</p>
                     </c>
                     <c ca="center">
                        <p>695</p>
                        <p>(0.046)</p>
                        <p>(0.014)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Richardson <it>et al. </it>Non-basal (7) v basal (19)</p>
                        <p>Farmer <it>et al. </it>Luminal A (15) v basal (14)</p>
                     </c>
                     <c ca="center">
                        <p>380</p>
                        <p>(0.019)</p>
                        <p>(0.001)</p>
                     </c>
                     <c ca="center">
                        <p>380</p>
                        <p>(0.019)</p>
                        <p>(0.001)</p>
                     </c>
                     <c ca="center">
                        <p>380</p>
                        <p>(0.019)</p>
                        <p>(0.001)</p>
                     </c>
                     <c ca="center">
                        <p>373</p>
                        <p>(0.001)</p>
                        <p>(0.017)</p>
                     </c>
                     <c ca="center">
                        <p>346</p>
                        <p>(0)</p>
                        <p>(0)</p>
                     </c>
                     <c ca="center">
                        <p>725</p>
                        <p>(0.003)</p>
                        <p>(0.078)</p>
                     </c>
                     <c ca="center">
                        <p>608</p>
                        <p>(0.002)</p>
                        <p>(0.038)</p>
                     </c>
                     <c ca="center">
                        <p>658</p>
                        <p>(0.005)</p>
                        <p>(0.021)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Richardson <it>et al. </it>Non-basal (3) v basal (19)</p>
                        <p>Farmer <it>et al. </it>Luminal A (15) v basal (3)</p>
                     </c>
                     <c ca="center">
                        <p>283</p>
                        <p>(0.1)</p>
                        <p>(0.194)</p>
                     </c>
                     <c ca="center">
                        <p>283</p>
                        <p>(0.1)</p>
                        <p>(0.194)</p>
                     </c>
                     <c ca="center">
                        <p>283</p>
                        <p>(0.1)</p>
                        <p>(0.194)</p>
                     </c>
                     <c ca="center">
                        <p>258</p>
                        <p>(0.195)</p>
                        <p>(0.099)</p>
                     </c>
                     <c ca="center">
                        <p>290</p>
                        <p>(0)</p>
                        <p>(0.027)</p>
                     </c>
                     <c ca="center">
                        <p>480</p>
                        <p>(0.093)</p>
                        <p>(0.9)</p>
                     </c>
                     <c ca="center">
                        <p>684</p>
                        <p>(0.001)</p>
                        <p>(0.789)</p>
                     </c>
                     <c ca="center">
                        <p>506</p>
                        <p>(0.112)</p>
                        <p>(0.9)</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>Sets of differentially expressed probesets comparing MCF7 and MCF10A replicates or basal/basal-like and luminal/nonbasal-like tumours were identified for each experiment, before and after mean batch-centering. Comparisons were performed both between and across datasets. SAM Common: for each column two different pairwise comparisons using SAM were performed, and the top 1000 probesets were identified for each comparison. The number reported is the size of the intersection between the two sets. Before: comparison was performed prior to mean batch-centering. After: comparison was performed following mean batch-centering. Values in brackets are the FDR for each set of top 1000 probesets. Weighted mean-centering values for datasets with even numbers of samples are not shown, as they are identical to those for mean-centering. UC = uncorrected, MC = batch mean-centered, wMC = weighted mean-centered, DWD = distance-weighted discrimination.</p>
               </tblfn>
            </tbl>
            <p>The mean-centering approach was compared with a previously described method for integrating breast cancer tumour microarray data generated on different platforms, using a distance weighted discrimination (DWD) method to adjust for systematic microarray data biases <abbrgrp><abbr bid="B34">34</abbr></abbrgrp>. For both the validation datasets and the published datasets, mean-centering out-performed DWD (Table <tblr tid="T3">3</tblr> and Additional File <supplr sid="S10">10</supplr>).</p>
            <suppl id="S10">
               <title>
                  <p>Additional file 10</p>
               </title>
               <text>
                  <p><b>Distance-weighted discrimination (DWD) method</b>. Comparison of the DWD method (green dots) between (A, B) and across (C, D) validation (A, C) and published (B, D) datasets with mean- (red dots) and weighted mean- (blue circles) centering (see Table <tblr tid="T3">3</tblr> for SAM analysis). E, DWD correction of the two breast tumour gene expression profiles generated by the two published studies as in Figure <figr fid="F2">2</figr>. Clustering of tumours based upon 640 probesets representing Sorlie <it>et al. </it><abbrgrp><abbr bid="B8">8</abbr></abbrgrp> 'intrinsic' genes. Thumbnail shows all 640 probesets. i) Tumours classified by Richardson <it>et al. </it><abbrgrp><abbr bid="B10">10</abbr></abbrgrp>: red = basal-like, blue = non-basal-like, pink = BRCA1; tumours classified by Farmer <it>et al. </it><abbrgrp><abbr bid="B11">11</abbr></abbrgrp>: red = basal, blue = luminal, green = apocrine. Clusters of genes associated with the 'Sorlie subtypes' are highlighted as follows: ii) ERBB2 gene cluster, iii) luminal A gene cluster, iv) basal gene cluster. v) Centroid prediction was used to assign the tumours to the five Norway/Stanford subtypes &#8211; basal (red), luminal A (dark blue), luminal B (light blue), ERBB2 (purple) and normal-like (green), unassigned (grey).</p>
               </text>
               <file name="1755-8794-1-42-S10.pdf">
                  <p>Click here for file</p>
               </file>
            </suppl>
         </sec>
         <sec>
            <st>
               <p>Most variable genes</p>
            </st>
            <p>An alternative approach to assess whether mean-centering improves comparisons across published datasets was to identify lists of the five hundred most highly differentially expressed probesets across each dataset (those with the highest variance) and compare these gene lists with the most differentially expressed probesets from other single or combined datasets. Bringing together greater numbers of datasets or tumours increased the overlap in differentially expressed probesets (Figure <figr fid="F5">5</figr>). The number of probesets in common was significantly greater with mean-centering (p = 3.6 &#215; 10<sup>-107</sup>) or weighted mean-centering (p = 9.2 &#215; 10<sup>-106</sup>) than with uncorrected data, although there was no significant difference between the two methods (p = 0.5). The mean number of genes in common was higher following weighted mean-centering than mean-centering when the dataset was made up of fewer ER-positive tumours (Chin <it>et al.</it>, Farmer <it>et al</it>. and Richardson <it>et al</it>.) and lower when the dataset was made up of more ER-positive tumours (Ivshina <it>et al</it>., Sotiriou <it>et al</it>., Wang <it>et al.</it>).</p>
            <fig id="F5">
               <title>
                  <p>Figure 5</p>
               </title>
               <caption>
                  <p>Combining greater numbers of datasets leads to a greater overlap in differentially expressed probesets</p>
               </caption>
               <text>
                  <p><b>Combining greater numbers of datasets leads to a greater overlap in differentially expressed probesets</b>. Lists of the five hundred probesets with the highest variance were generated for each dataset and combinations of up to six datasets and the number of probesets in common between these lists were plotted for each dataset. A, Plots show the number of common probesets between each individual dataset and other single or combined datasets. B, Overall mean numbers of genes in common for each dataset.</p>
               </text>
               <graphic file="1755-8794-1-42-5"/>
            </fig>
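            <p>The overlap measure described above can be illustrated with a minimal sketch. It is not the code used in this study: the function names, the dictionary layout (probeset identifier mapped to a list of log2 values) and the toy values are assumptions made purely for illustration, and the list size is reduced from five hundred to two so that the example can be followed by hand.</p>

```python
from statistics import variance

def top_variable_probesets(expression, n_top=500):
    """Return the set of probeset ids with the highest sample variance.

    expression: dict mapping a probeset id to its log2 values across
    the arrays of one dataset (hypothetical layout, for illustration).
    """
    ranked = sorted(expression, key=lambda pid: variance(expression[pid]), reverse=True)
    return set(ranked[:n_top])

def overlap_count(expr_a, expr_b, n_top=500):
    """Count the probesets shared by the two top-variance lists."""
    top_a = top_variable_probesets(expr_a, n_top)
    top_b = top_variable_probesets(expr_b, n_top)
    return len(top_a.intersection(top_b))

# Hypothetical toy datasets: four probesets, three arrays each.
dataset_1 = {
    "p1": [1.0, 5.0, 9.0],
    "p2": [4.0, 4.1, 4.2],
    "p3": [2.0, 6.0, 7.0],
    "p4": [3.0, 3.0, 3.1],
}
dataset_2 = {
    "p1": [2.0, 2.1, 2.2],
    "p2": [1.0, 8.0, 3.0],
    "p3": [0.0, 5.0, 9.0],
    "p4": [6.0, 6.0, 6.2],
}

shared = overlap_count(dataset_1, dataset_2, n_top=2)
print(shared)
```

            <p>Note that subtracting a probeset's mean within a single dataset leaves its variance unchanged, so in a sketch of this kind the correction would only alter the lists obtained once arrays from several datasets are pooled before ranking.</p>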
         </sec>
      </sec>
      <sec>
         <st>
            <p>Discussion</p>
         </st>
         <p>Mean-centering has been widely used in the past to compare the relative expression of highly and lowly expressed genes together within a single dataset, particularly for heatmaps and clustering programs <abbrgrp><abbr bid="B35">35</abbr></abbrgrp>. However, this is the first study to assess the utility of mean-centering for minimising the effects of dataset-specific bias and for the integration of multiple datasets. An unknown, systematic, multiplicative bias associated with each group of arrays processed together ('dataset') is simply removed when the GeneChips are considered relative to each other. The approach clearly shows significant improvements in the degree of correspondence found across datasets, without any loss of internal coherence within each of the initial datasets from which the integrated dataset is assembled. Relative intensities within each individual dataset are left unchanged (Figure <figr fid="F1">1D</figr>), with the consequence that both fold-changes and p-values produced by techniques such as SAM remain identical to those found prior to correction (Table <tblr tid="T1">1</tblr>). Therefore, balanced corrected datasets can be treated with at least as much confidence as the initial uncorrected data. We have also demonstrated that combining greater numbers of datasets or tumours increases the overlap in differentially expressed probesets between studies and that this is further improved with mean-centering.</p>
         <p>A number of previous studies have also investigated the level of consensus found between different experimental datasets. The mean-centering approach out-performed a distance weighted discrimination method <abbrgrp><abbr bid="B34">34</abbr></abbrgrp> that attempted to adjust for systematic microarray data biases for integrating breast cancer tumour microarray data generated on different platforms. This group stated that they had also applied this technique to 'merge two distinct Affymetrix breast tumor datasets together' and 'saw similar, but less dramatic results due to fewer systematic biases present in datasets performed on the same Affymetrix microarrays' <abbrgrp><abbr bid="B34">34</abbr></abbrgrp>. Our results suggest that there are many sources of systematic biases in Affymetrix data, which are highly significant and multiplicative, but that these can be largely corrected for, allowing the integration of datasets. An empirical Bayes method to adjust for batch effects <abbrgrp><abbr bid="B36">36</abbr></abbrgrp> (ComBat; <url>http://statistics.byu.edu/johnson/ComBat/</url>) has also been used to integrate published datasets for meta-analysis <abbrgrp><abbr bid="B37">37</abbr></abbrgrp>. This approach generated plots analogous to those in Additional File <supplr sid="S8">8</supplr> for mean-centering and weighted mean-centering when ER status was used as a covariate (data not shown). The mean-centering method described in this study was used in a recent meta-analysis whilst our manuscript was under review <abbrgrp><abbr bid="B38">38</abbr></abbrgrp>, although no attempt was made to account for differences in dataset composition.</p>
         <p>Combining two published studies without mean-centering clearly demonstrated how dataset-specific biases can mask the biological differences between breast cancer tumour subtypes (Figure <figr fid="F2">2</figr>). The Farmer <it>et al. </it><abbrgrp><abbr bid="B11">11</abbr></abbrgrp> dataset was generated from trucut tumour biopsies (4 &#215; 2.5 &#956;m sections), necessitating RNA amplification prior to hybridisation to U133A GeneChips. By contrast, RNA in the Richardson <it>et al. </it>2006 study <abbrgrp><abbr bid="B10">10</abbr></abbrgrp> was derived from tumours following surgical removal, so amplification was not required prior to hybridisation to U133 plus 2.0 GeneChips. Despite the experimental differences between the studies, both of which have been shown above to lead to significant deviations in measured raw intensities in our validation datasets, mean-centering appears to reconcile the data and leads to the identification of biologically plausible relationships not found when combining uncorrected data.</p>
         <p>The gold standard for demonstrating the power of a gene expression classifier is to test it against independent datasets. However, if the molecular profile of a set of tumours is representative of its patient characteristics, then any prognostic signature will be dependent upon the composition of the patient cohort and therefore be dataset-specific. Thus, in order to generate accurate prognostic predictions, this second 'test' dataset must have similar characteristics to the first 'training' set <abbrgrp><abbr bid="B22">22</abbr></abbrgrp>. Recently, strong time dependence was identified for a prognostic signature when comparing an independent validation dataset with a longer median follow-up time (14 years) to the original study (8 years) <abbrgrp><abbr bid="B17">17</abbr></abbrgrp>. A number of recent microarray studies have been performed after increasing the size of a dataset with additional samples <abbrgrp><abbr bid="B8">8</abbr><abbr bid="B18">18</abbr><abbr bid="B20">20</abbr><abbr bid="B33">33</abbr></abbrgrp>; however, it is unclear whether subsequent changes in the results are due to changes in the sample composition of the extended dataset or simply to technical effects arising from the microarrays being processed in different batches. Some studies have also based their findings upon combined data from more than one type of Affymetrix GeneChip without evaluating any GeneChip-specific effects.</p>
         <p>By integrating six published datasets with patient follow-up information, we have demonstrated that combining breast cancer datasets can increase the accuracy of prognosis prediction and that this can be improved by removing systematic, multiplicative bias. The most accurate prognosis predictions are generated when the test-sets closely share the patient and tumour characteristics of the training-sets. An alternative approach to building ever larger combined datasets representing the whole breast cancer population would be to concentrate on generating gene expression classifiers for clearly defined groups of patients (e.g. node-negative, ER-positive from patients aged 50&#8211;60, with 10 years of follow-up). Strict entry criteria would severely restrict the number of tumours eligible for inclusion, whilst taking no account of possible unknown confounding factors. In clinical practice, we urgently need single sample predictors <abbrgrp><abbr bid="B14">14</abbr></abbrgrp>, applicable to all patients, and our work strongly suggests that these will be best generated from the largest possible cohorts (or integrated datasets) representing the wider population, which will involve large international collaborations and public sharing of data. The current consensus for best practices for breast cancer treatment is based upon bringing together data from hundreds of trials representing thousands of women within the Early Breast Cancer Trialists' Collaborative Group (EBCTCG). If we can begin to bring together many large datasets of gene expression data free from dataset-specific bias, the opportunity exists to create a highly valuable resource. A possible 'new' breast cancer subtype characterized by the high expression of interferon regulated genes <abbrgrp><abbr bid="B14">14</abbr></abbrgrp> was identified by cluster analysis of a combined (non-Affymetrix) dataset of 315 breast tumours, which is consistent with the notion that rare molecular subtypes will only be detected with larger datasets. Our findings are in agreement with the conclusions of a recent study <abbrgrp><abbr bid="B39">39</abbr></abbrgrp> that integrated breast tumour datasets generated on two different microarray platforms; they also showed that the gene expression profile generated by integrated analysis of multiple datasets achieves better prediction of breast cancer recurrence, and that the performance of profiles is confounded by the known and unknown clinical background of patients <abbrgrp><abbr bid="B39">39</abbr></abbrgrp>. In the current study, however, we demonstrate improved prediction of prognosis in datasets derived from the <it>same </it>platform integrated using a simpler scaling method of the raw data rather than a normalisation method reliant on fold changes. One limitation of our study is that it was not possible to use a single definition of follow-up endpoint across the published datasets; in each case we used the most conservative indicator of relapse available (recurrence-free survival, disease-free survival and distant metastasis-free survival) rather than overall survival. There was also some variation in both patient age and tumour size between the studies (Table <tblr tid="T3">3</tblr>). Variation in gene expression due to the heterogeneity of patient characteristics has begun to be addressed by studies that investigate the effects of age <abbrgrp><abbr bid="B40">40</abbr></abbrgrp>, race <abbrgrp><abbr bid="B41">41</abbr><abbr bid="B42">42</abbr></abbrgrp> and differences in risk factors <abbrgrp><abbr bid="B43">43</abbr><abbr bid="B44">44</abbr></abbrgrp> for breast cancer. Integrating large breast tumour gene expression datasets will potentially enable us to uncover more subtle population-level associations, provided that all clinical details and follow-up information are consistent, complete and made publicly available.</p>
      </sec>
      <sec>
         <st>
            <p>Conclusion</p>
         </st>
         <p>Systematic multiplicative biases are introduced at many stages of microarray experiments; however, they can easily be accounted for, enabling raw data from different gene expression datasets to be directly integrated in order to generate results with improved statistical power and greater biological significance.</p>
      </sec>
      <sec>
         <st>
            <p>Methods</p>
         </st>
         <sec>
            <st>
               <p>RNA preparation</p>
            </st>
            <p>The generation and processing of RNA from the two breast cell lines, MCF7 (cancer) and MCF10A (normal immortalised mammary epithelial), have been described previously <abbrgrp><abbr bid="B45">45</abbr><abbr bid="B46">46</abbr></abbrgrp>. Briefly, cells were grown in Dulbecco's modified Eagle's medium (DMEM) with 10% fetal calf serum (MCF7) or DMEM/F12 with 5% horse serum, 2 ng/ml, 0.5 &#956;g/ml hydrocortisone, 0.5 &#956;g/ml cholera toxin, and 5 &#956;g/ml insulin (MCF10A). Minimally passaged cells (&lt; 20) were obtained from the American Type Culture Collection (ATCC). RNA was isolated using Trizol<sup>&#174; </sup>(Ambion) according to the manufacturer's instructions, purified using Qiagen RNeasy columns (Qiagen, Valencia, CA) and quantified using a Nanodrop spectrophotometer (Labtech). The quality and amount of starting RNA were confirmed with an Agilent Bioanalyzer 2100 (Agilent) prior to labelling and hybridisation to HG-U133A, HG-U133 plus 2.0, or Human Exon 1.0ST GeneChips (Affymetrix) using either the Affymetrix standard or small sample preparation protocols as previously described <abbrgrp><abbr bid="B45">45</abbr></abbrgrp>.</p>
         </sec>
         <sec>
            <st>
               <p>GeneChip processing and analysis</p>
            </st>
            <p>Each experiment was repeated in triplicate, with three samples per cell line for each amount of starting total RNA, protocol or GeneChip used (1 dataset = 3 &#215; MCF7 and 3 &#215; MCF10A = 6 GeneChips). The HG-U133A GeneChips for the standard protocol and amplification experiments were scanned using a GeneArray 2500 scanner, and the HG-U133 plus 2.0 and Exon 1.0ST GeneChips were scanned using a GeneChip Scanner 3000. All MCF7 and MCF10A microarray data are MIAME compliant and accessible via MIAME VICE <abbrgrp><abbr bid="B47">47</abbr></abbrgrp>. All protocols are described in full at <url>http://bioinformatics.picr.man.ac.uk/mbcf/downloads/</url>. Raw spot readings were processed using R <abbrgrp><abbr bid="B31">31</abbr></abbrgrp> and Bioconductor <abbrgrp><abbr bid="B48">48</abbr></abbrgrp>. Probeset summarisation was done using MAS 5.0 and RMA <abbrgrp><abbr bid="B49">49</abbr></abbrgrp> as implemented in the Simpleaffy package <abbrgrp><abbr bid="B50">50</abbr></abbrgrp>, or the PLIER algorithm from Affymetrix ExACT software. Mappings between the Exon and U133 plus 2.0 GeneChips were performed as described previously <abbrgrp><abbr bid="B46">46</abbr></abbrgrp>. Alternative cell description files relating probesets to UniGene sequences were implemented as described previously <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>. Lists of common significantly differentially expressed genes before and after mean batch-centering were identified using SAM <abbrgrp><abbr bid="B51">51</abbr></abbrgrp> analysis (siggenes Bioconductor package) by adjusting the &#916; value to find the top 1000 differentially expressed probesets using each protocol, as described previously <abbrgrp><abbr bid="B46">46</abbr></abbrgrp>.</p>
         </sec>
         <sec>
            <st>
               <p>Correction by batch mean-centering</p>
            </st>
            <p>For each dataset, the mean expression level of each probeset across the GeneChips in that dataset is subtracted from the individual GeneChip expression level on the log2 scale. This can simply be performed in R using the '<it>rowMeans</it>' function. Whilst preparing the manuscript we also noticed that this can be achieved using the '<it>pamr.batchadjust</it>' function within the pamr R package.</p>
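            <p>The correction can be sketched in a few lines. The example below is an illustrative Python rendering of the same row-wise subtraction, not the code used in this study; the function name, the list-of-rows layout (one row per probeset, one column per GeneChip) and the toy values are assumptions. The second batch is constructed as the first multiplied by four, which corresponds to adding 2 on the log2 scale, to show how a multiplicative dataset-level bias would be removed.</p>

```python
def batch_mean_center(log2_matrix):
    """Subtract each probeset's within-dataset mean from every array.

    log2_matrix: list of rows, one row per probeset and one column per
    GeneChip, with values already on the log2 scale (assumed layout).
    """
    centered = []
    for row in log2_matrix:
        row_mean = sum(row) / len(row)
        centered.append([value - row_mean for value in row])
    return centered

# Hypothetical toy data: two probesets measured on three arrays.
batch_a = [[8.0, 9.0, 10.0], [5.0, 5.5, 6.0]]
# Same signal with a fourfold multiplicative bias (+2 on the log2 scale).
batch_b = [[value + 2.0 for value in row] for row in batch_a]

centered_a = batch_mean_center(batch_a)
centered_b = batch_mean_center(batch_b)

# After centering, the two batches are directly comparable.
print(centered_a == centered_b)
```

            <p>Each dataset is centered separately before the corrected matrices are combined; differences between arrays within a dataset are preserved because the same value is subtracted from every array for a given probeset.</p>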
         </sec>
         <sec>
            <st>
               <p>Published data</p>
            </st>
            <p>Affymetrix data for a total of ten datasets from the published studies listed in Table <tblr tid="T2">2</tblr> were downloaded from the Gene Expression Omnibus <abbrgrp><abbr bid="B3">3</abbr></abbrgrp> or ArrayExpress <abbrgrp><abbr bid="B1">1</abbr></abbrgrp> repositories. Raw .cel files were not available for the Wang <it>et al. </it>dataset, so all other datasets were normalised as in this study using the MAS 5.0 algorithm with a target intensity of 600 as implemented in the Simpleaffy package <abbrgrp><abbr bid="B50">50</abbr></abbrgrp>, using R <abbrgrp><abbr bid="B31">31</abbr></abbrgrp> within Bioconductor <abbrgrp><abbr bid="B48">48</abbr></abbrgrp>. NetAffx <abbrgrp><abbr bid="B9">9</abbr></abbrgrp> was used to identify Affymetrix probesets representing the 'intrinsic gene set' previously used to classify human breast tumours <abbrgrp><abbr bid="B8">8</abbr></abbrgrp>. Centered average linkage clustering was performed using the Cluster <abbrgrp><abbr bid="B35">35</abbr></abbrgrp> and TreeView programs as described previously <abbrgrp><abbr bid="B7">7</abbr></abbrgrp>. Supervised principal components analysis using the superpc R package was used as previously described <abbrgrp><abbr bid="B29">29</abbr><abbr bid="B30">30</abbr></abbrgrp>, in order to compare the predictive power of combining different published datasets. The follow-up endpoints for the Loi <it>et al.</it>, Pawitan <it>et al. </it>and Sotiriou <it>et al. </it>datasets were recurrence-free survival, for the Desmedt <it>et al. </it>and Ivshina <it>et al. </it>datasets it was disease-free survival, and for the Minn <it>et al. </it>and Wang <it>et al. </it>datasets it was distant metastasis-free survival.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Competing interests</p>
         </st>
         <p>The authors declare that they have no competing interests.</p>
      </sec>
      <sec>
         <st>
            <p>Authors' contributions</p>
         </st>
         <p>AHS conceived and designed the study. RBC, YH and SDP processed the lab experiments. AHS, GS and MJO performed the computational experiments and analysis. AHS, CJM, AH and RBC drafted the manuscript. All authors read and approved the final manuscript.</p>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>AHS is supported by Breakthrough Breast Cancer; RBC is funded by a Breast Cancer Campaign Fellowship; YH, GS, MJO, SDP and CJM are supported by Cancer Research UK.</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>Data storage and analysis in ArrayExpress</p>
            </title>
            <aug>
               <au>
                  <snm>Brazma</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Kapushesky</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Parkinson</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Sarkans</snm>
                  <fnm>U</fnm>
               </au>
               <au>
                  <snm>Shojatalab</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Methods Enzymol</source>
            <pubdate>2006</pubdate>
            <volume>411</volume>
            <fpage>370</fpage>
            <lpage>386</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/S0076-6879(06)11020-4</pubid>
                  <pubid idtype="pmpid" link="fulltext">16939801</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B2">
            <title>
               <p>Cross-site comparison of gene expression data reveals high similarity</p>
            </title>
            <aug>
               <au>
                  <snm>Chu</snm>
                  <fnm>TM</fnm>
               </au>
               <au>
                  <snm>Deng</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Wolfinger</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Paules</snm>
                  <fnm>RS</fnm>
               </au>
               <au>
                  <snm>Hamadeh</snm>
                  <fnm>HK</fnm>
               </au>
            </aug>
            <source>Environ Health Perspect</source>
            <pubdate>2004</pubdate>
            <volume>112</volume>
            <issue>4</issue>
            <fpage>449</fpage>
            <lpage>455</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1241898</pubid>
                  <pubid idtype="pmpid" link="fulltext">15033594</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B3">
            <title>
               <p>NCBI GEO: mining millions of expression profiles &#8211; database and tools</p>
            </title>
            <aug>
               <au>
                  <snm>Barrett</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Suzek</snm>
                  <fnm>TO</fnm>
               </au>
               <au>
                  <snm>Troup</snm>
                  <fnm>DB</fnm>
               </au>
               <au>
                  <snm>Wilhite</snm>
                  <fnm>SE</fnm>
               </au>
               <au>
                  <snm>Ngau</snm>
                  <fnm>WC</fnm>
               </au>
               <au>
                  <snm>Ledoux</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Rudnev</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Lash</snm>
                  <fnm>AE</fnm>
               </au>
               <au>
                  <snm>Fujibuchi</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Edgar</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2005</pubdate>
            <volume>33</volume>
            <issue>Database issue</issue>
            <fpage>D562</fpage>
            <lpage>566</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">539976</pubid>
                  <pubid idtype="pmpid" link="fulltext">15608262</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B4">
            <title>
               <p>The utility of MAS5 expression summary and detection call algorithms</p>
            </title>
            <aug>
               <au>
                  <snm>Pepper</snm>
                  <fnm>SD</fnm>
               </au>
               <au>
                  <snm>Saunders</snm>
                  <fnm>EK</fnm>
               </au>
               <au>
                  <snm>Edwards</snm>
                  <fnm>LE</fnm>
               </au>
               <au>
                  <snm>Wilson</snm>
                  <fnm>CL</fnm>
               </au>
               <au>
                  <snm>Miller</snm>
                  <fnm>CJ</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2007</pubdate>
            <volume>8</volume>
            <fpage>273</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1950098</pubid>
                  <pubid idtype="pmpid" link="fulltext">17663764</pubid>
                  <pubid idtype="doi">10.1186/1471-2105-8-273</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B5">
            <title>
               <p>Evolving gene/transcript definitions significantly alter the interpretation of GeneChip data</p>
            </title>
            <aug>
               <au>
                  <snm>Dai</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Wang</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Boyd</snm>
                  <fnm>AD</fnm>
               </au>
               <au>
                  <snm>Kostov</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Athey</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Jones</snm>
                  <fnm>EG</fnm>
               </au>
               <au>
                  <snm>Bunney</snm>
                  <fnm>WE</fnm>
               </au>
               <au>
                  <snm>Myers</snm>
                  <fnm>RM</fnm>
               </au>
               <au>
                  <snm>Speed</snm>
                  <fnm>TP</fnm>
               </au>
               <au>
                  <snm>Akil</snm>
                  <fnm>H</fnm>
               </au>
               <etal/>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2005</pubdate>
            <volume>33</volume>
            <issue>20</issue>
            <fpage>e175</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1283542</pubid>
                  <pubid idtype="pmpid" link="fulltext">16284200</pubid>
                  <pubid idtype="doi">10.1093/nar/gni179</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B6">
            <title>
               <p>Molecular portraits of human breast tumours</p>
            </title>
            <aug>
               <au>
                  <snm>Perou</snm>
                  <fnm>CM</fnm>
               </au>
               <au>
                  <snm>Sorlie</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Eisen</snm>
                  <fnm>MB</fnm>
               </au>
               <au>
                  <snm>Rijn</snm>
                  <mnm>van de</mnm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Jeffrey</snm>
                  <fnm>SS</fnm>
               </au>
               <au>
                  <snm>Rees</snm>
                  <fnm>CA</fnm>
               </au>
               <au>
                  <snm>Pollack</snm>
                  <fnm>JR</fnm>
               </au>
               <au>
                  <snm>Ross</snm>
                  <fnm>DT</fnm>
               </au>
               <au>
                  <snm>Johnsen</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Akslen</snm>
                  <fnm>LA</fnm>
               </au>
               <etal/>
            </aug>
            <source>Nature</source>
            <pubdate>2000</pubdate>
            <volume>406</volume>
            <issue>6797</issue>
            <fpage>747</fpage>
            <lpage>752</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/35021093</pubid>
                  <pubid idtype="pmpid" link="fulltext">10963602</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B7">
            <title>
               <p>Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications</p>
            </title>
            <aug>
               <au>
                  <snm>Sorlie</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Perou</snm>
                  <fnm>CM</fnm>
               </au>
               <au>
                  <snm>Tibshirani</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Aas</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Geisler</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Johnsen</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Hastie</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Eisen</snm>
                  <fnm>MB</fnm>
               </au>
               <au>
                  <snm>Rijn</snm>
                  <mnm>van de</mnm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Jeffrey</snm>
                  <fnm>SS</fnm>
               </au>
               <etal/>
            </aug>
            <source>Proc Natl Acad Sci USA</source>
            <pubdate>2001</pubdate>
            <volume>98</volume>
            <issue>19</issue>
            <fpage>10869</fpage>
            <lpage>10874</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">58566</pubid>
                  <pubid idtype="pmpid" link="fulltext">11553815</pubid>
                  <pubid idtype="doi">10.1073/pnas.191367098</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B8">
            <title>
               <p>Repeated observation of breast tumor subtypes in independent gene expression data sets</p>
            </title>
            <aug>
               <au>
                  <snm>Sorlie</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Tibshirani</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Parker</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Hastie</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Marron</snm>
                  <fnm>JS</fnm>
               </au>
               <au>
                  <snm>Nobel</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Deng</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Johnsen</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Pesich</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Geisler</snm>
                  <fnm>S</fnm>
               </au>
               <etal/>
            </aug>
            <source>Proc Natl Acad Sci USA</source>
            <pubdate>2003</pubdate>
            <volume>100</volume>
            <issue>14</issue>
            <fpage>8418</fpage>
            <lpage>8423</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">166244</pubid>
                  <pubid idtype="pmpid" link="fulltext">12829800</pubid>
                  <pubid idtype="doi">10.1073/pnas.0932692100</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B9">
            <title>
               <p>NetAffx: Affymetrix probesets and annotations</p>
            </title>
            <aug>
               <au>
                  <snm>Liu</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Loraine</snm>
                  <fnm>AE</fnm>
               </au>
               <au>
                  <snm>Shigeta</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Cline</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Cheng</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Valmeekam</snm>
                  <fnm>V</fnm>
               </au>
               <au>
                  <snm>Sun</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Kulp</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Siani-Rose</snm>
                  <fnm>MA</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2003</pubdate>
            <volume>31</volume>
            <issue>1</issue>
            <fpage>82</fpage>
            <lpage>86</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">165568</pubid>
                  <pubid idtype="pmpid" link="fulltext">12519953</pubid>
                  <pubid idtype="doi">10.1093/nar/gkg121</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B10">
            <title>
               <p>X chromosomal abnormalities in basal-like human breast cancer</p>
            </title>
            <aug>
               <au>
                  <snm>Richardson</snm>
                  <fnm>AL</fnm>
               </au>
               <au>
                  <snm>Wang</snm>
                  <fnm>ZC</fnm>
               </au>
               <au>
                  <snm>De Nicolo</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Lu</snm>
                  <fnm>X</fnm>
               </au>
               <au>
                  <snm>Brown</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Miron</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Liao</snm>
                  <fnm>X</fnm>
               </au>
               <au>
                  <snm>Iglehart</snm>
                  <fnm>JD</fnm>
               </au>
               <au>
                  <snm>Livingston</snm>
                  <fnm>DM</fnm>
               </au>
               <au>
                  <snm>Ganesan</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Cancer Cell</source>
            <pubdate>2006</pubdate>
            <volume>9</volume>
            <issue>2</issue>
            <fpage>121</fpage>
            <lpage>132</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/j.ccr.2006.01.013</pubid>
                  <pubid idtype="pmpid" link="fulltext">16473279</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B11">
            <title>
               <p>Identification of molecular apocrine breast tumours by microarray analysis</p>
            </title>
            <aug>
               <au>
                  <snm>Farmer</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Bonnefoi</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Becette</snm>
                  <fnm>V</fnm>
               </au>
               <au>
                  <snm>Tubiana-Hulin</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Fumoleau</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Larsimont</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Macgrogan</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Bergh</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Cameron</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Goldstein</snm>
                  <fnm>D</fnm>
               </au>
               <etal/>
            </aug>
            <source>Oncogene</source>
            <pubdate>2005</pubdate>
            <volume>24</volume>
            <issue>29</issue>
            <fpage>4660</fpage>
            <lpage>4671</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/sj.onc.1208561</pubid>
                  <pubid idtype="pmpid" link="fulltext">15897907</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B12">
            <title>
               <p>Intrinsic molecular signature of breast cancer in a population-based cohort of 412 patients</p>
            </title>
            <aug>
               <au>
                  <snm>Calza</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Hall</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Auer</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Bjohle</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Klaar</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Kronenwett</snm>
                  <fnm>U</fnm>
               </au>
               <au>
                  <snm>Liu</snm>
                  <fnm>ET</fnm>
               </au>
               <au>
                  <snm>Miller</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Ploner</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Smeds</snm>
                  <fnm>J</fnm>
               </au>
               <etal/>
            </aug>
            <source>Breast Cancer Res</source>
            <pubdate>2006</pubdate>
            <volume>8</volume>
            <issue>4</issue>
            <fpage>R34</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1779468</pubid>
                  <pubid idtype="pmpid" link="fulltext">16846532</pubid>
                  <pubid idtype="doi">10.1186/bcr1517</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B13">
            <title>
               <p>Distinct molecular mechanisms underlying clinically relevant subtypes of breast cancer: Gene expression analyses across three different platforms</p>
            </title>
            <aug>
               <au>
                  <snm>Sorlie</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Wang</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Xiao</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Johnsen</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Naume</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Samaha</snm>
                  <fnm>RR</fnm>
               </au>
               <au>
                  <snm>Borresen-Dale</snm>
                  <fnm>AL</fnm>
               </au>
            </aug>
            <source>BMC Genomics</source>
            <pubdate>2006</pubdate>
            <volume>7</volume>
            <issue>1</issue>
            <fpage>127</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1489944</pubid>
                  <pubid idtype="pmpid" link="fulltext">16729877</pubid>
                  <pubid idtype="doi">10.1186/1471-2164-7-127</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B14">
            <title>
               <p>The molecular portraits of breast tumors are conserved across microarray platforms</p>
            </title>
            <aug>
               <au>
                  <snm>Hu</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Fan</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Oh</snm>
                  <fnm>DS</fnm>
               </au>
               <au>
                  <snm>Marron</snm>
                  <fnm>JS</fnm>
               </au>
               <au>
                  <snm>He</snm>
                  <fnm>X</fnm>
               </au>
               <au>
                  <snm>Qaqish</snm>
                  <fnm>BF</fnm>
               </au>
               <au>
                  <snm>Livasy</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Carey</snm>
                  <fnm>LA</fnm>
               </au>
               <au>
                  <snm>Reynolds</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Dressler</snm>
                  <fnm>L</fnm>
               </au>
               <etal/>
            </aug>
            <source>BMC Genomics</source>
            <pubdate>2006</pubdate>
            <volume>7</volume>
            <fpage>96</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1468408</pubid>
                  <pubid idtype="pmpid" link="fulltext">16643655</pubid>
                  <pubid idtype="doi">10.1186/1471-2164-7-96</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B15">
            <title>
               <p>GATA-3 maintains the differentiation of the luminal cell fate in the mammary gland</p>
            </title>
            <aug>
               <au>
                  <snm>Kouros-Mehr</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Slorach</snm>
                  <fnm>EM</fnm>
               </au>
               <au>
                  <snm>Sternlicht</snm>
                  <fnm>MD</fnm>
               </au>
               <au>
                  <snm>Werb</snm>
                  <fnm>Z</fnm>
               </au>
            </aug>
            <source>Cell</source>
            <pubdate>2006</pubdate>
            <volume>127</volume>
            <issue>5</issue>
            <fpage>1041</fpage>
            <lpage>1055</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/j.cell.2006.09.048</pubid>
                  <pubid idtype="pmpid" link="fulltext">17129787</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B16">
            <title>
               <p>Genomic and transcriptional aberrations linked to breast cancer pathophysiologies</p>
            </title>
            <aug>
               <au>
                  <snm>Chin</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>DeVries</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Fridlyand</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Spellman</snm>
                  <fnm>PT</fnm>
               </au>
               <au>
                  <snm>Roydasgupta</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Kuo</snm>
                  <fnm>WL</fnm>
               </au>
               <au>
                  <snm>Lapuk</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Neve</snm>
                  <fnm>RM</fnm>
               </au>
               <au>
                  <snm>Qian</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Ryder</snm>
                  <fnm>T</fnm>
               </au>
               <etal/>
            </aug>
            <source>Cancer Cell</source>
            <pubdate>2006</pubdate>
            <volume>10</volume>
            <issue>6</issue>
            <fpage>529</fpage>
            <lpage>541</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/j.ccr.2006.10.009</pubid>
                  <pubid idtype="pmpid" link="fulltext">17157792</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B17">
            <title>
               <p>Strong time dependence of the 76-gene prognostic signature for node-negative breast cancer patients in the TRANSBIG multicenter independent validation series</p>
            </title>
            <aug>
               <au>
                  <snm>Desmedt</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Piette</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Loi</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Wang</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Lallemand</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Haibe-Kains</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Viale</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Delorenzi</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Zhang</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>d'Assignies</snm>
                  <fnm>MS</fnm>
               </au>
               <etal/>
            </aug>
            <source>Clin Cancer Res</source>
            <pubdate>2007</pubdate>
            <volume>13</volume>
            <issue>11</issue>
            <fpage>3207</fpage>
            <lpage>3214</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1158/1078-0432.CCR-06-2765</pubid>
                  <pubid idtype="pmpid" link="fulltext">17545524</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B18">
            <title>
               <p>Genetic reclassification of histologic grade delineates new clinical subtypes of breast cancer</p>
            </title>
            <aug>
               <au>
                  <snm>Ivshina</snm>
                  <fnm>AV</fnm>
               </au>
               <au>
                  <snm>George</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Senko</snm>
                  <fnm>O</fnm>
               </au>
               <au>
                  <snm>Mow</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Putti</snm>
                  <fnm>TC</fnm>
               </au>
               <au>
                  <snm>Smeds</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Lindahl</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Pawitan</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Hall</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Nordgren</snm>
                  <fnm>H</fnm>
               </au>
               <etal/>
            </aug>
            <source>Cancer Res</source>
            <pubdate>2006</pubdate>
            <volume>66</volume>
            <issue>21</issue>
            <fpage>10292</fpage>
            <lpage>10301</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1158/0008-5472.CAN-05-4414</pubid>
                  <pubid idtype="pmpid" link="fulltext">17079448</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B19">
            <title>
               <p>Gene expression profiling spares early breast cancer patients from adjuvant therapy: derived and validated in two population-based cohorts</p>
            </title>
            <aug>
               <au>
                  <snm>Pawitan</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Bjohle</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Amler</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Borg</snm>
                  <fnm>AL</fnm>
               </au>
               <au>
                  <snm>Egyhazi</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Hall</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Han</snm>
                  <fnm>X</fnm>
               </au>
               <au>
                  <snm>Holmberg</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Huang</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Klaar</snm>
                  <fnm>S</fnm>
               </au>
               <etal/>
            </aug>
            <source>Breast Cancer Res</source>
            <pubdate>2005</pubdate>
            <volume>7</volume>
            <issue>6</issue>
            <fpage>R953</fpage>
            <lpage>R964</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1410752</pubid>
                  <pubid idtype="pmpid" link="fulltext">16280042</pubid>
                  <pubid idtype="doi">10.1186/bcr1325</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B20">
            <title>
               <p>Gene expression profiling in breast cancer: understanding the molecular basis of histologic grade to improve prognosis</p>
            </title>
            <aug>
               <au>
                  <snm>Sotiriou</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Wirapati</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Loi</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Harris</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Fox</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Smeds</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Nordgren</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Farmer</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Praz</snm>
                  <fnm>V</fnm>
               </au>
               <au>
                  <snm>Haibe-Kains</snm>
                  <fnm>B</fnm>
               </au>
               <etal/>
            </aug>
            <source>J Natl Cancer Inst</source>
            <pubdate>2006</pubdate>
            <volume>98</volume>
            <issue>4</issue>
            <fpage>262</fpage>
            <lpage>272</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">16478745</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B21">
            <title>
               <p>Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer</p>
            </title>
            <aug>
               <au>
                  <snm>Wang</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Klijn</snm>
                  <fnm>JG</fnm>
               </au>
               <au>
                  <snm>Zhang</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Sieuwerts</snm>
                  <fnm>AM</fnm>
               </au>
               <au>
                  <snm>Look</snm>
                  <fnm>MP</fnm>
               </au>
               <au>
                  <snm>Yang</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Talantov</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Timmermans</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Meijer-van Gelder</snm>
                  <fnm>ME</fnm>
               </au>
               <au>
                  <snm>Yu</snm>
                  <fnm>J</fnm>
               </au>
               <etal/>
            </aug>
            <source>Lancet</source>
            <pubdate>2005</pubdate>
            <volume>365</volume>
            <issue>9460</issue>
            <fpage>671</fpage>
            <lpage>679</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">15721472</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B22">
            <title>
               <p>High-throughput genomic technology in research and clinical management of breast cancer. Exploiting the potential of gene expression profiling: is it ready for the clinic?</p>
            </title>
            <aug>
               <au>
                  <snm>Sims</snm>
                  <fnm>AH</fnm>
               </au>
               <au>
                  <snm>Ong</snm>
                  <fnm>KR</fnm>
               </au>
               <au>
                  <snm>Clarke</snm>
                  <fnm>RB</fnm>
               </au>
               <au>
                  <snm>Howell</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>Breast Cancer Res</source>
            <pubdate>2006</pubdate>
            <volume>8</volume>
            <issue>5</issue>
            <fpage>214</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1779487</pubid>
                  <pubid idtype="pmpid" link="fulltext">17076877</pubid>
                  <pubid idtype="doi">10.1186/bcr1605</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B23">
            <title>
               <p>Gene expression profiling predicts clinical outcome of breast cancer</p>
            </title>
            <aug>
               <au>
                  <snm>van 't Veer</snm>
                  <fnm>LJ</fnm>
               </au>
               <au>
                  <snm>Dai</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Vijver</snm>
                  <mnm>van de</mnm>
                  <fnm>MJ</fnm>
               </au>
               <au>
                  <snm>He</snm>
                  <fnm>YD</fnm>
               </au>
               <au>
                  <snm>Hart</snm>
                  <fnm>AA</fnm>
               </au>
               <au>
                  <snm>Mao</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Peterse</snm>
                  <fnm>HL</fnm>
               </au>
               <au>
                  <snm>Kooy</snm>
                  <mnm>van der</mnm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Marton</snm>
                  <fnm>MJ</fnm>
               </au>
               <au>
                  <snm>Witteveen</snm>
                  <fnm>AT</fnm>
               </au>
               <etal/>
            </aug>
            <source>Nature</source>
            <pubdate>2002</pubdate>
            <volume>415</volume>
            <issue>6871</issue>
            <fpage>530</fpage>
            <lpage>536</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/415530a</pubid>
                  <pubid idtype="pmpid" link="fulltext">11823860</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B24">
            <title>
               <p>Genetic regulators of large-scale transcriptional signatures in cancer</p>
            </title>
            <aug>
               <au>
                  <snm>Adler</snm>
                  <fnm>AS</fnm>
               </au>
               <au>
                  <snm>Lin</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Horlings</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Nuyten</snm>
                  <fnm>DS</fnm>
               </au>
               <au>
                  <snm>Vijver</snm>
                  <mnm>van de</mnm>
                  <fnm>MJ</fnm>
               </au>
               <au>
                  <snm>Chang</snm>
                  <fnm>HY</fnm>
               </au>
            </aug>
            <source>Nat Genet</source>
            <pubdate>2006</pubdate>
            <volume>38</volume>
            <issue>4</issue>
            <fpage>421</fpage>
            <lpage>430</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1435790</pubid>
                  <pubid idtype="pmpid" link="fulltext">16518402</pubid>
                  <pubid idtype="doi">10.1038/ng1752</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B25">
            <title>
               <p>Robustness, scalability, and integration of a wound-response gene expression signature in predicting breast cancer survival</p>
            </title>
            <aug>
               <au>
                  <snm>Chang</snm>
                  <fnm>HY</fnm>
               </au>
               <au>
                  <snm>Nuyten</snm>
                  <fnm>DS</fnm>
               </au>
               <au>
                  <snm>Sneddon</snm>
                  <fnm>JB</fnm>
               </au>
               <au>
                  <snm>Hastie</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Tibshirani</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Sorlie</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Dai</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>He</snm>
                  <fnm>YD</fnm>
               </au>
               <au>
                  <snm>van 't Veer</snm>
                  <fnm>LJ</fnm>
               </au>
               <au>
                  <snm>Bartelink</snm>
                  <fnm>H</fnm>
               </au>
               <etal/>
            </aug>
            <source>Proc Natl Acad Sci USA</source>
            <pubdate>2005</pubdate>
            <volume>102</volume>
            <issue>10</issue>
            <fpage>3738</fpage>
            <lpage>3743</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">553302</pubid>
                  <pubid idtype="pmpid" link="fulltext">15738396</pubid>
                  <pubid idtype="doi">10.1073/pnas.0409462102</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B26">
            <title>
               <p>Gene expression programs in response to hypoxia: cell type specificity and prognostic significance in human cancers</p>
            </title>
            <aug>
               <au>
                  <snm>Chi</snm>
                  <fnm>JT</fnm>
               </au>
               <au>
                  <snm>Wang</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Nuyten</snm>
                  <fnm>DS</fnm>
               </au>
               <au>
                  <snm>Rodriguez</snm>
                  <fnm>EH</fnm>
               </au>
               <au>
                  <snm>Schaner</snm>
                  <fnm>ME</fnm>
               </au>
               <au>
                  <snm>Salim</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Wang</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Kristensen</snm>
                  <fnm>GB</fnm>
               </au>
               <au>
                  <snm>Helland</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Borresen-Dale</snm>
                  <fnm>AL</fnm>
               </au>
               <etal/>
            </aug>
            <source>PLoS Med</source>
            <pubdate>2006</pubdate>
            <volume>3</volume>
            <issue>3</issue>
            <fpage>e47</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1334226</pubid>
                  <pubid idtype="pmpid" link="fulltext">16417408</pubid>
                  <pubid idtype="doi">10.1371/journal.pmed.0030047</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B27">
            <title>
               <p>The prognostic role of a gene signature from tumorigenic breast-cancer cells</p>
            </title>
            <aug>
               <au>
                  <snm>Liu</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Wang</snm>
                  <fnm>X</fnm>
               </au>
               <au>
                  <snm>Chen</snm>
                  <fnm>GY</fnm>
               </au>
               <au>
                  <snm>Dalerba</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Gurney</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Hoey</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Sherlock</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Lewicki</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Shedden</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Clarke</snm>
                  <fnm>MF</fnm>
               </au>
            </aug>
            <source>N Engl J Med</source>
            <pubdate>2007</pubdate>
            <volume>356</volume>
            <issue>3</issue>
            <fpage>217</fpage>
            <lpage>226</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1056/NEJMoa063994</pubid>
                  <pubid idtype="pmpid" link="fulltext">17229949</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B28">
            <title>
               <p>Concordance among gene-expression-based predictors for breast cancer</p>
            </title>
            <aug>
               <au>
                  <snm>Fan</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Oh</snm>
                  <fnm>DS</fnm>
               </au>
               <au>
                  <snm>Wessels</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Weigelt</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Nuyten</snm>
                  <fnm>DS</fnm>
               </au>
               <au>
                  <snm>Nobel</snm>
                  <fnm>AB</fnm>
               </au>
               <au>
                  <snm>van 't Veer</snm>
                  <fnm>LJ</fnm>
               </au>
               <au>
                  <snm>Perou</snm>
                  <fnm>CM</fnm>
               </au>
            </aug>
            <source>N Engl J Med</source>
            <pubdate>2006</pubdate>
            <volume>355</volume>
            <issue>6</issue>
            <fpage>560</fpage>
            <lpage>569</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1056/NEJMoa052933</pubid>
                  <pubid idtype="pmpid" link="fulltext">16899776</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B29">
            <title>
               <p>Prediction by supervised principal components</p>
            </title>
            <aug>
               <au>
                  <snm>Bair</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Hastie</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Paul</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Tibshirani</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Stanford Tech Report</source>
            <pubdate>2004</pubdate>
         </bibl>
         <bibl id="B30">
            <title>
               <p>Semi-supervised methods to predict patient survival from gene expression data</p>
            </title>
            <aug>
               <au>
                  <snm>Bair</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Tibshirani</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>PLoS Biol</source>
            <pubdate>2004</pubdate>
            <volume>2</volume>
            <issue>4</issue>
            <fpage>E108</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">387275</pubid>
                  <pubid idtype="pmpid" link="fulltext">15094809</pubid>
                  <pubid idtype="doi">10.1371/journal.pbio.0020108</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B31">
            <title>
               <p>R: a language for data analysis and graphics</p>
            </title>
            <aug>
               <au>
                  <snm>Ihaka</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Gentleman</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Journal of Computational and Graphical Statistics</source>
            <pubdate>1996</pubdate>
            <volume>5</volume>
            <fpage>299</fpage>
            <lpage>314</lpage>
            <xrefbib>
               <pubid idtype="doi">10.2307/1390807</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B32">
            <title>
               <p>Definition of clinically distinct molecular subtypes in estrogen receptor-positive breast carcinomas through genomic grade</p>
            </title>
            <aug>
               <au>
                  <snm>Loi</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Haibe-Kains</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Desmedt</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Lallemand</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Tutt</snm>
                  <fnm>AM</fnm>
               </au>
               <au>
                  <snm>Gillet</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Ellis</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Harris</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Bergh</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Foekens</snm>
                  <fnm>JA</fnm>
               </au>
               <etal/>
            </aug>
            <source>J Clin Oncol</source>
            <pubdate>2007</pubdate>
            <volume>25</volume>
            <issue>10</issue>
            <fpage>1239</fpage>
            <lpage>1246</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1200/JCO.2006.07.1522</pubid>
                  <pubid idtype="pmpid" link="fulltext">17401012</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B33">
            <title>
               <p>Lung metastasis genes couple breast tumor size and metastatic spread</p>
            </title>
            <aug>
               <au>
                  <snm>Minn</snm>
                  <fnm>AJ</fnm>
               </au>
               <au>
                  <snm>Gupta</snm>
                  <fnm>GP</fnm>
               </au>
               <au>
                  <snm>Padua</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Bos</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Nguyen</snm>
                  <fnm>DX</fnm>
               </au>
               <au>
                  <snm>Nuyten</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Kreike</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Zhang</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Wang</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Ishwaran</snm>
                  <fnm>H</fnm>
               </au>
               <etal/>
            </aug>
            <source>Proc Natl Acad Sci USA</source>
            <pubdate>2007</pubdate>
            <volume>104</volume>
            <issue>16</issue>
            <fpage>6740</fpage>
            <lpage>6745</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1871856</pubid>
                  <pubid idtype="pmpid" link="fulltext">17420468</pubid>
                  <pubid idtype="doi">10.1073/pnas.0701138104</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B34">
            <title>
               <p>Adjustment of systematic microarray data biases</p>
            </title>
            <aug>
               <au>
                  <snm>Benito</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Parker</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Du</snm>
                  <fnm>Q</fnm>
               </au>
               <au>
                  <snm>Wu</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Xiang</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Perou</snm>
                  <fnm>CM</fnm>
               </au>
               <au>
                  <snm>Marron</snm>
                  <fnm>JS</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2004</pubdate>
            <volume>20</volume>
            <issue>1</issue>
            <fpage>105</fpage>
            <lpage>114</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/btg385</pubid>
                  <pubid idtype="pmpid" link="fulltext">14693816</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B35">
            <title>
               <p>Cluster analysis and display of genome-wide expression patterns</p>
            </title>
            <aug>
               <au>
                  <snm>Eisen</snm>
                  <fnm>MB</fnm>
               </au>
               <au>
                  <snm>Spellman</snm>
                  <fnm>PT</fnm>
               </au>
               <au>
                  <snm>Brown</snm>
                  <fnm>PO</fnm>
               </au>
               <au>
                  <snm>Botstein</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci USA</source>
            <pubdate>1998</pubdate>
            <volume>95</volume>
            <issue>25</issue>
            <fpage>14863</fpage>
            <lpage>14868</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">24541</pubid>
                  <pubid idtype="pmpid" link="fulltext">9843981</pubid>
                  <pubid idtype="doi">10.1073/pnas.95.25.14863</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B36">
            <title>
               <p>Adjusting batch effects in microarray expression data using empirical Bayes methods</p>
            </title>
            <aug>
               <au>
                  <snm>Johnson</snm>
                  <fnm>WE</fnm>
               </au>
               <au>
                  <snm>Li</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Rabinovic</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>Biostatistics</source>
            <pubdate>2007</pubdate>
            <volume>8</volume>
            <issue>1</issue>
            <fpage>118</fpage>
            <lpage>127</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/biostatistics/kxj037</pubid>
                  <pubid idtype="pmpid" link="fulltext">16632515</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B37">
            <title>
               <p>Gene expression signatures, clinicopathological features, and individualized therapy in breast cancer</p>
            </title>
            <aug>
               <au>
                  <snm>Acharya</snm>
                  <fnm>CR</fnm>
               </au>
               <au>
                  <snm>Hsu</snm>
                  <fnm>DS</fnm>
               </au>
               <au>
                  <snm>Anders</snm>
                  <fnm>CK</fnm>
               </au>
               <au>
                  <snm>Anguiano</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Salter</snm>
                  <fnm>KH</fnm>
               </au>
               <au>
                  <snm>Walters</snm>
                  <fnm>KS</fnm>
               </au>
               <au>
                  <snm>Redman</snm>
                  <fnm>RC</fnm>
               </au>
               <au>
                  <snm>Tuchman</snm>
                  <fnm>SA</fnm>
               </au>
               <au>
                  <snm>Moylan</snm>
                  <fnm>CA</fnm>
               </au>
               <au>
                  <snm>Mukherjee</snm>
                  <fnm>S</fnm>
               </au>
               <etal/>
            </aug>
            <source>JAMA</source>
            <pubdate>2008</pubdate>
            <volume>299</volume>
            <issue>13</issue>
            <fpage>1574</fpage>
            <lpage>1587</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1001/jama.299.13.1574</pubid>
                  <pubid idtype="pmpid" link="fulltext">18387932</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B38">
            <title>
               <p>An embryonic stem cell-like gene expression signature in poorly differentiated aggressive human tumors</p>
            </title>
            <aug>
               <au>
                  <snm>Ben-Porath</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Thomson</snm>
                  <fnm>MW</fnm>
               </au>
               <au>
                  <snm>Carey</snm>
                  <fnm>VJ</fnm>
               </au>
               <au>
                  <snm>Ge</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Bell</snm>
                  <fnm>GW</fnm>
               </au>
               <au>
                  <snm>Regev</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Weinberg</snm>
                  <fnm>RA</fnm>
               </au>
            </aug>
            <source>Nat Genet</source>
            <pubdate>2008</pubdate>
            <volume>40</volume>
            <issue>5</issue>
            <fpage>499</fpage>
            <lpage>507</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/ng.127</pubid>
                  <pubid idtype="pmpid" link="fulltext">18443585</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B39">
            <title>
               <p>Integrated analysis of independent gene expression microarray datasets improves the predictability of breast cancer outcome</p>
            </title>
            <aug>
               <au>
                  <snm>Zhang</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Chen</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Fenstermacher</snm>
                  <fnm>DA</fnm>
               </au>
            </aug>
            <source>BMC Genomics</source>
            <pubdate>2007</pubdate>
            <volume>8</volume>
            <issue>1</issue>
            <fpage>331</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">2064937</pubid>
                  <pubid idtype="pmpid" link="fulltext">17883867</pubid>
                  <pubid idtype="doi">10.1186/1471-2164-8-331</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B40">
            <title>
               <p>Aging impacts transcriptome but not genome of hormone-dependent breast cancers</p>
            </title>
            <aug>
               <au>
                  <snm>Yau</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Fedele</snm>
                  <fnm>V</fnm>
               </au>
               <au>
                  <snm>Roydasgupta</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Fridlyand</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Hubbard</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Gray</snm>
                  <fnm>JW</fnm>
               </au>
               <au>
                  <snm>Chew</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Dairkee</snm>
                  <fnm>SH</fnm>
               </au>
               <au>
                  <snm>Moore</snm>
                  <fnm>DH</fnm>
               </au>
               <au>
                  <snm>Schittulli</snm>
                  <fnm>F</fnm>
               </au>
               <etal/>
            </aug>
            <source>Breast Cancer Res</source>
            <pubdate>2007</pubdate>
            <volume>9</volume>
            <issue>5</issue>
            <fpage>R59</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">2216076</pubid>
                  <pubid idtype="pmpid" link="fulltext">17850661</pubid>
                  <pubid idtype="doi">10.1186/bcr1765</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B41">
            <title>
               <p>Breast cancer in African-American women: differences in tumor biology from European-American women</p>
            </title>
            <aug>
               <au>
                  <snm>Amend</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Hicks</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Ambrosone</snm>
                  <fnm>CB</fnm>
               </au>
            </aug>
            <source>Cancer Res</source>
            <pubdate>2006</pubdate>
            <volume>66</volume>
            <issue>17</issue>
            <fpage>8327</fpage>
            <lpage>8330</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1158/0008-5472.CAN-06-1927</pubid>
                  <pubid idtype="pmpid" link="fulltext">16951137</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B42">
            <title>
               <p>Race, breast cancer subtypes, and survival in the Carolina Breast Cancer Study</p>
            </title>
            <aug>
               <au>
                  <snm>Carey</snm>
                  <fnm>LA</fnm>
               </au>
               <au>
                  <snm>Perou</snm>
                  <fnm>CM</fnm>
               </au>
               <au>
                  <snm>Livasy</snm>
                  <fnm>CA</fnm>
               </au>
               <au>
                  <snm>Dressler</snm>
                  <fnm>LG</fnm>
               </au>
               <au>
                  <snm>Cowan</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Conway</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Karaca</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Troester</snm>
                  <fnm>MA</fnm>
               </au>
               <au>
                  <snm>Tse</snm>
                  <fnm>CK</fnm>
               </au>
               <au>
                  <snm>Edmiston</snm>
                  <fnm>S</fnm>
               </au>
               <etal/>
            </aug>
            <source>JAMA</source>
            <pubdate>2006</pubdate>
            <volume>295</volume>
            <issue>21</issue>
            <fpage>2492</fpage>
            <lpage>2502</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1001/jama.295.21.2492</pubid>
                  <pubid idtype="pmpid" link="fulltext">16757721</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B43">
            <title>
               <p>Epidemiology of basal-like breast cancer</p>
            </title>
            <aug>
               <au>
                  <snm>Millikan</snm>
                  <fnm>RC</fnm>
               </au>
               <au>
                  <snm>Newman</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Tse</snm>
                  <fnm>CK</fnm>
               </au>
               <au>
                  <snm>Moorman</snm>
                  <fnm>PG</fnm>
               </au>
               <au>
                  <snm>Conway</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Smith</snm>
                  <fnm>LV</fnm>
               </au>
               <au>
                  <snm>Labbok</snm>
                  <fnm>MH</fnm>
               </au>
               <au>
                  <snm>Geradts</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Bensen</snm>
                  <fnm>JT</fnm>
               </au>
               <au>
                  <snm>Jackson</snm>
                  <fnm>S</fnm>
               </au>
               <etal/>
            </aug>
            <source>Breast Cancer Res Treat</source>
            <pubdate>2007</pubdate>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pubmed">17578664</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B44">
            <title>
               <p>Differences in risk factors for breast cancer molecular subtypes in a population-based study</p>
            </title>
            <aug>
               <au>
                  <snm>Yang</snm>
                  <fnm>XR</fnm>
               </au>
               <au>
                  <snm>Sherman</snm>
                  <fnm>ME</fnm>
               </au>
               <au>
                  <snm>Rimm</snm>
                  <fnm>DL</fnm>
               </au>
               <au>
                  <snm>Lissowska</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Brinton</snm>
                  <fnm>LA</fnm>
               </au>
               <au>
                  <snm>Peplonska</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Hewitt</snm>
                  <fnm>SM</fnm>
               </au>
               <au>
                  <snm>Anderson</snm>
                  <fnm>WF</fnm>
               </au>
               <au>
                  <snm>Szeszenia-Dabrowska</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Bardin-Mikolajczak</snm>
                  <fnm>A</fnm>
               </au>
               <etal/>
            </aug>
            <source>Cancer Epidemiol Biomarkers Prev</source>
            <pubdate>2007</pubdate>
            <volume>16</volume>
            <issue>3</issue>
            <fpage>439</fpage>
            <lpage>443</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1158/1055-9965.EPI-06-0806</pubid>
                  <pubid idtype="pmpid" link="fulltext">17372238</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B45">
            <title>
               <p>Amplification protocols introduce systematic but reproducible errors into gene expression studies</p>
            </title>
            <aug>
               <au>
                  <snm>Wilson</snm>
                  <fnm>CL</fnm>
               </au>
               <au>
                  <snm>Pepper</snm>
                  <fnm>SD</fnm>
               </au>
               <au>
                  <snm>Hey</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Miller</snm>
                  <fnm>CJ</fnm>
               </au>
            </aug>
            <source>Biotechniques</source>
            <pubdate>2004</pubdate>
            <volume>36</volume>
            <issue>3</issue>
            <fpage>498</fpage>
            <lpage>506</lpage>
            <xrefbib>
               <pubid idtype="pmpid">15038166</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B46">
            <title>
               <p>High correspondence between Affymetrix exon and standard expression arrays</p>
            </title>
            <aug>
               <au>
                  <snm>Okoniewski</snm>
                  <fnm>MJ</fnm>
               </au>
               <au>
                  <snm>Hey</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Pepper</snm>
                  <fnm>SD</fnm>
               </au>
               <au>
                  <snm>Miller</snm>
                  <fnm>CJ</fnm>
               </au>
            </aug>
            <source>Biotechniques</source>
            <pubdate>2007</pubdate>
            <volume>42</volume>
            <issue>2</issue>
            <fpage>181</fpage>
            <lpage>185</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.2144/000112315</pubid>
                  <pubid idtype="pmpid">17373482</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B47">
            <title>
               <p>MIAME VICE</p>
            </title>
            <url>http://bioinformatics.picr.man.ac.uk/vice</url>
         </bibl>
         <bibl id="B48">
            <title>
               <p>Bioconductor: open software development for computational biology and bioinformatics</p>
            </title>
            <aug>
               <au>
                  <snm>Gentleman</snm>
                  <fnm>RC</fnm>
               </au>
               <au>
                  <snm>Carey</snm>
                  <fnm>VJ</fnm>
               </au>
               <au>
                  <snm>Bates</snm>
                  <fnm>DM</fnm>
               </au>
               <au>
                  <snm>Bolstad</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Dettling</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Dudoit</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Ellis</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Gautier</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Ge</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Gentry</snm>
                  <fnm>J</fnm>
               </au>
               <etal/>
            </aug>
            <source>Genome Biol</source>
            <pubdate>2004</pubdate>
            <volume>5</volume>
            <issue>10</issue>
            <fpage>R80</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">545600</pubid>
                  <pubid idtype="pmpid" link="fulltext">15461798</pubid>
                  <pubid idtype="doi">10.1186/gb-2004-5-10-r80</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B49">
            <title>
               <p>Exploration, normalization, and summaries of high density oligonucleotide array probe level data</p>
            </title>
            <aug>
               <au>
                  <snm>Irizarry</snm>
                  <fnm>RA</fnm>
               </au>
               <au>
                  <snm>Hobbs</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Collin</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Beazer-Barclay</snm>
                  <fnm>YD</fnm>
               </au>
               <au>
                  <snm>Antonellis</snm>
                  <fnm>KJ</fnm>
               </au>
               <au>
                  <snm>Scherf</snm>
                  <fnm>U</fnm>
               </au>
               <au>
                  <snm>Speed</snm>
                  <fnm>TP</fnm>
               </au>
            </aug>
            <source>Biostatistics</source>
            <pubdate>2003</pubdate>
            <volume>4</volume>
            <issue>2</issue>
            <fpage>249</fpage>
            <lpage>264</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/biostatistics/4.2.249</pubid>
                  <pubid idtype="pmpid" link="fulltext">12925520</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B50">
            <title>
               <p>Simpleaffy: a BioConductor package for Affymetrix Quality Control and data analysis</p>
            </title>
            <aug>
               <au>
                  <snm>Wilson</snm>
                  <fnm>CL</fnm>
               </au>
               <au>
                  <snm>Miller</snm>
                  <fnm>CJ</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2005</pubdate>
            <volume>21</volume>
            <issue>18</issue>
            <fpage>3683</fpage>
            <lpage>3685</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/bti605</pubid>
                  <pubid idtype="pmpid" link="fulltext">16076888</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B51">
            <title>
               <p>Significance analysis of microarrays applied to the ionizing radiation response</p>
            </title>
            <aug>
               <au>
                  <snm>Tusher</snm>
                  <fnm>VG</fnm>
               </au>
               <au>
                  <snm>Tibshirani</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Chu</snm>
                  <fnm>G</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci USA</source>
            <pubdate>2001</pubdate>
            <volume>98</volume>
            <issue>9</issue>
            <fpage>5116</fpage>
            <lpage>5121</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">33173</pubid>
                  <pubid idtype="pmpid" link="fulltext">11309499</pubid>
                  <pubid idtype="doi">10.1073/pnas.091062498</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B52">
            <title>
               <p>Distinct patterns of DNA copy number alteration are associated with different clinicopathological features and gene-expression subtypes of breast cancer</p>
            </title>
            <aug>
               <au>
                  <snm>Bergamaschi</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Kim</snm>
                  <fnm>YH</fnm>
               </au>
               <au>
                  <snm>Wang</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Sorlie</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Hernandez-Boussard</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Lonning</snm>
                  <fnm>PE</fnm>
               </au>
               <au>
                  <snm>Tibshirani</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Borresen-Dale</snm>
                  <fnm>AL</fnm>
               </au>
               <au>
                  <snm>Pollack</snm>
                  <fnm>JR</fnm>
               </au>
            </aug>
            <source>Genes Chromosomes Cancer</source>
            <pubdate>2006</pubdate>
            <volume>45</volume>
            <issue>11</issue>
            <fpage>1033</fpage>
            <lpage>1040</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1002/gcc.20366</pubid>
                  <pubid idtype="pmpid" link="fulltext">16897746</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
      </refgrp>
      <sec>
         <st>
            <p>Pre-publication history</p>
         </st>
         <p>The pre-publication history for this paper can be accessed here:</p>
         <p>
            <url>http://www.biomedcentral.com/1755-8794/1/42/prepub</url>
         </p>
      </sec>
   </bm>
</art>
