<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1471-2105-6-28</ui>
   <ji>1471-2105</ji>
   <fm>
      <dochead>Research article</dochead>
      <bibl>
         <title>
            <p>An adaptive method for cDNA microarray normalization</p>
         </title>
         <aug>
            <au id="A1" ce="yes">
               <snm>Zhao</snm>
               <fnm>Yingdong</fnm>
               <insr iid="I1"/>
               <email>zhaoy@helix.nih.gov</email>
            </au>
            <au id="A2" ce="yes">
               <snm>Li</snm>
               <fnm>Ming-Chung</fnm>
               <insr iid="I2"/>
               <email>mli@emmes.com</email>
            </au>
            <au id="A3" ca="yes">
               <snm>Simon</snm>
               <fnm>Richard</fnm>
               <insr iid="I1"/>
               <email>rsimon@mail.nih.gov</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Biometric Research Branch, National Cancer Institute, National Institutes of Health, Rockville, Maryland, USA</p>
            </ins>
            <ins id="I2">
               <p>The EMMES Corporation, Rockville, Maryland, USA</p>
            </ins>
         </insg>
         <source>BMC Bioinformatics</source>
         <issn>1471-2105</issn>
         <pubdate>2005</pubdate>
         <volume>6</volume>
         <issue>1</issue>
         <fpage>28</fpage>
         <url>http://www.biomedcentral.com/1471-2105/6/28</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">15707486</pubid>
               <pubid idtype="doi">10.1186/1471-2105-6-28</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>21</day>
               <month>10</month>
               <year>2004</year>
            </date>
         </rec>
         <acc>
            <date>
               <day>11</day>
               <month>2</month>
               <year>2005</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>11</day>
               <month>2</month>
               <year>2005</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2005</year>
         <collab>Zhao et al; licensee BioMed Central Ltd.</collab>
         <note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>Normalization is a critical step in analysis of gene expression profiles. For dual-labeled arrays, global normalization assumes that the majority of the genes on the array are non-differentially expressed between the two channels and that the number of over-expressed genes approximately equals the number of under-expressed genes. These assumptions can be inappropriate for custom arrays or arrays in which the reference RNA is very different from the experimental samples.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>We propose a mixture model based normalization method that adaptively identifies non-differentially expressed genes and thereby substantially improves normalization for dual-labeled arrays in settings where the assumptions of global normalization are problematic. The new method is evaluated using both simulated and real data.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusions</p>
               </st>
               <p>The new normalization method is effective for general microarray platforms when samples with very different expression profiles are co-hybridized and for custom arrays where the majority of genes are likely to be differentially expressed.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>Microarray technology provides simultaneous measurements of expression levels for thousands of genes. Each step from sample preparation to data analysis, however, contains potential sources of bias and variability. Proper normalization adjusts for differences which interfere with the comparison of intensities of different labels at a given probe and with the comparison of intensities of corresponding probes on different arrays. Proper data normalization should allow for the comparison of expression levels across different arrays. Subsequent data analysis results are heavily dependent on effective normalization.</p>
         <p>Normalization issues differ between dual-labeled platforms and single-labeled platforms such as the Affymetrix GeneChip arrays. In this paper we address normalization for dual-labeled arrays with either cDNA or oligonucleotide probes. The objective of normalization for dual-labeled arrays is to correct for differences in intensities between the two labels on the same array. These differences arise from factors such as differences in sample concentrations, differences in photomultiplier tube settings, and differences in the affinity of the two labels for DNA.</p>
         <p>Median or mean based global normalization methods use a single normalization factor applied to all genes on the array to adjust for labeling bias <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B2">2</abbr></abbrgrp>. Such methods are widely used because of their simplicity. Intensity-based and location-based methods account for the dependence of the dye-bias normalization factors on intensity and spatial location <abbrgrp><abbr bid="B3">3</abbr><abbr bid="B4">4</abbr></abbrgrp>. Both global and intensity/location based normalization methods assume that most of the genes are not differentially expressed between the two samples hybridized on the array, and that for the differentially expressed genes, the direction of the difference is symmetric between the two samples. In many important cases, however, these assumptions are not appropriate because: 1) more than half of the genes on the array are differentially expressed; 2) the numbers of over- and under-expressed genes on the array are unequal; or 3) the array is a customized array containing only genes of specific biological interest, which are highly variable across the samples. In these cases, the global and intensity/location based normalization methods become less accurate and a more sophisticated method is needed <abbrgrp><abbr bid="B5">5</abbr><abbr bid="B6">6</abbr></abbrgrp>.</p>
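         <p>The way a median-based global factor breaks down when differential expression is asymmetric can be illustrated with a minimal sketch. The function name global_median_factor, the dye bias of 0.5, and the toy log<sub>2</sub> ratios are assumptions made purely for illustration; they are not taken from the data analyzed in this paper.</p>

```python
import numpy as np


def global_median_factor(log_ratios):
    """Estimate the dye-bias offset as the median of the observed log2 ratios."""
    return float(np.median(log_ratios))


# Hypothetical dye bias on the log2 scale (an assumption for this sketch).
true_bias = 0.5

# True log2 ratios for five genes, shifted by the dye bias to give observed values.
# Balanced array: one under-expressed, three null, one over-expressed gene.
balanced = np.array([-1.0, 0.0, 0.0, 0.0, 1.0]) + true_bias
# Skewed array: three under-expressed and two null genes.
skewed = np.array([-1.0, -1.0, -1.0, 0.0, 0.0]) + true_bias

print(global_median_factor(balanced))  # 0.5, the bias is recovered
print(global_median_factor(skewed))    # -0.5, the estimate is off by a full log2 unit
```

         <p>In the balanced toy array the median falls on a non-differentially expressed gene, so the bias is recovered. In the skewed toy array the median falls on an under-expressed gene and the estimated factor absorbs the biological difference.</p>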
         <p>There are some methods which attempt to adaptively identify the subset of 'housekeeping' genes <abbrgrp><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr><abbr bid="B8">8</abbr></abbrgrp>. These methods require multiple arrays in order to identify the 'housekeeping' gene set, which does not always exist.</p>
         <p>Newton <it>et al. </it>proposed a Gamma-Gamma-Bernoulli model for identifying differentially expressed genes in dual-labeled arrays <abbrgrp><abbr bid="B9">9</abbr></abbrgrp>. We have generalized Newton's model and here propose an adaptive method based on a three-component mixture model for normalization of dual-labeled microarray data.</p>
      </sec>
      <sec>
         <st>
            <p>Results</p>
         </st>
         <p>As described in the Methods section, we applied our adaptive method to both simulated and real data. We also compared our method with the global method and the intensity-based lowess method.</p>
         <p>Results of the simulation studies are shown as bar plots in Figure <figr fid="F1">1</figr>. Figure <figr fid="F1">1A</figr> shows the comparison of our adaptive method, the global method and the lowess method when no noise was added. When the majority of genes in the array were non-differentially expressed (Case 1), or the numbers of over- and under-expressed genes on the array were equal (Case 2), the root mean squared error (RMSE) of the adaptive method was comparable with that of the other two methods; all were very small. When the array contained unequal numbers of over- and under-expressed genes and the majority of genes were differentially expressed (Cases 3&#8211;6), the RMSEs of the global normalization method and the lowess method were much larger than those of the adaptive method. The differences ranged from around a two-fold difference (0.895 in log<sub>2 </sub>scale) when the numbers of under-, null, and over-expressed genes were 200, 100, and 100, to more than a three-fold difference (1.617 in log<sub>2 </sub>scale) when the numbers of under-, null, and over-expressed genes were 200, 50, and 50. The RMSEs for the adaptive method ranged from 0.078 to 0.159 in log<sub>2 </sub>scale.</p>
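         <p>The fold differences quoted for the log<sub>2</sub> scale values can be checked with a short conversion. This is only an arithmetic sketch; the helper name log2_to_fold is an assumption introduced here for illustration.</p>

```python
def log2_to_fold(log2_difference):
    """Convert a difference on the log2 scale to a fold difference."""
    return 2.0 ** log2_difference


# The two log2-scale differences reported for the simulation study.
for value in (0.895, 1.617):
    fold = log2_to_fold(value)
    print(f"{value:.3f} on the log2 scale is about a {fold:.2f}-fold difference")
```

         <p>The two values correspond to fold differences of about 1.86 and 3.07, respectively.</p>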
         <fig id="F1">
            <title>
               <p>Figure 1</p>
            </title>
            <caption>
               <p>Bar plots show comparison of RMSEs by using the global method (black bar), the lowess method (grey bar), and the adaptive method (white bar) for normalization with simulated data generated from a mixture model with <it>c </it>= 1.5, <it>a </it>= 118, <it>a</it><sub>0 </sub>= 410, <it>&#947; </it>= 31, <it>&#947;</it><sub>1 </sub>= 23, and <it>&#947;</it><sub>2 </sub>= 29 at three different noise levels (A) SD = 0; (B) SD = 0.25; and (C) SD = 0.50</p>
            </caption>
            <text>
               <p>Bar plots show comparison of RMSEs by using the global method (black bar), the lowess method (grey bar), and the adaptive method (white bar) for normalization with simulated data generated from a mixture model with <it>c </it>= 1.5, <it>a </it>= 118, <it>a</it><sub>0 </sub>= 410, <it>&#947; </it>= 31, <it>&#947;</it><sub>1 </sub>= 23, and <it>&#947;</it><sub>2 </sub>= 29 at three different noise levels (A) SD = 0; (B) SD = 0.25; and (C) SD = 0.50.</p>
            </text>
            <graphic file="1471-2105-6-28-1"/>
         </fig>
         <p>We compared the histogram of observed intensities to the fitted marginal density from the adaptive method as a simple check of whether the proposed model and the estimation procedure are in line with the available data. Figure <figr fid="F2">2</figr> shows the histograms of log(ratio) and log intensities of red and green channels of the simulated data, and the curve in each plot is the estimated density obtained from the fitted model. The fitted model agrees well with the data.</p>
         <fig id="F2">
            <title>
               <p>Figure 2</p>
            </title>
            <caption>
               <p>Histograms and the estimated densities of log(ratio) and log(intensity) for a set of simulated data generated from a mixture model with <it>c </it>= 1.5, <it>a </it>= 118, <it>a</it><sub>0 </sub>= 410, <it>&#947; </it>= 31, <it>&#947;</it><sub>1 </sub>= 23, and <it>&#947;</it><sub>2 </sub>= 29. The superimposed curve on each plot is generated from the fitted model</p>
            </caption>
            <text>
               <p>Histograms and the estimated densities of log(ratio) and log(intensity) for a set of simulated data generated from a mixture model with <it>c </it>= 1.5, <it>a </it>= 118, <it>a</it><sub>0 </sub>= 410, <it>&#947; </it>= 31, <it>&#947;</it><sub>1 </sub>= 23, and <it>&#947;</it><sub>2 </sub>= 29. The superimposed curve on each plot is generated from the fitted model.</p>
            </text>
            <graphic file="1471-2105-6-28-2"/>
         </fig>
         <p>Gaussian noise with an SD of 0.25 or 0.50 was added so that the data were not generated from the same model used for analysis with the adaptive method. The RMSEs of the global normalization method and the lowess method remained large, while the RMSEs of the adaptive method remained small, ranging from 0.083 to 0.569 on the log<sub>2 </sub>scale (Figure <figr fid="F1">1B</figr> and <figr fid="F1">1C</figr>).</p>
         <p>In the above simulation, no apparent groups could be seen in the histograms of log(ratio) (Figure <figr fid="F2">2A</figr>). Better results for the adaptive method were also obtained for a simulation case where the three groups (under-expressed, non-differentially expressed, and over-expressed) are apparent in the histogram of log(ratio). The results can be seen in Figure 4 and Figure 5 [see Additional files <supplr sid="S3">3</supplr>, <supplr sid="S4">4</supplr>].</p>
         <suppl id="S3">
            <title>
               <p>Additional File 3</p>
            </title>
            <text>
               <p>Figure 4: Bar plots show comparison of RMSE by using the adaptive method (black bar) and global method (grey bar) with simulated data generated from a mixture model with <it>c </it>= 1.5, <it>a </it>= 90, <it>a</it><sub>0 </sub>= 120, <it>&#947; </it>= 8, <it>&#947;</it><sub>1 </sub>= 6, and <it>&#947;</it><sub>2 </sub>= 10 at three different noise levels (A) SD = 0; (B) SD = 0.25; and (C) SD = 0.50.</p>
            </text>
            <file name="1471-2105-6-28-S3.jpeg">
               <p>Click here for file</p>
            </file>
         </suppl>
         <suppl id="S4">
            <title>
               <p>Additional File 4</p>
            </title>
            <text>
               <p>Figure 5: Histograms and the estimated densities of log(ratio) and log(intensity) for a set of simulated data generated from a mixture model with <it>c </it>= 1.5, <it>a </it>= 90, <it>a</it><sub>0 </sub>= 120, <it>&#947; </it>= 8, <it>&#947;</it><sub>1 </sub>= 6, and <it>&#947;</it><sub>2 </sub>= 10. The superimposed curve on each plot is generated from the fitted model.</p>
            </text>
            <file name="1471-2105-6-28-S4.jpeg">
               <p>Click here for file</p>
            </file>
         </suppl>
         <p>Results comparing RMSEs for the adaptive method, the global method and the lowess method with real data are shown in Table <tblr tid="T2">2</tblr>. The RMSEs of the adaptive method on data generated from ten different arrays ranged from 0.128 to 0.529, in comparison with RMSEs of around 1.0 using the global normalization method. The average RMSE (0.607) of the lowess method is almost two times that of our adaptive method (0.328), although the lowess method performed better than the global method (average RMSE = 1.016). Figure <figr fid="F3">3</figr> shows the histograms of log(ratio) and log intensities of red and green channels of the real data, and the curve in each plot is the estimated density from the adaptive method.</p>
         <tbl id="T2">
            <title>
               <p>Table 2</p>
            </title>
            <caption>
               <p>Comparison of RMSEs by using the global method, the lowess method, and the adaptive method for normalization with real data.</p>
            </caption>
            <tblbdy cols="5">
               <r>
                  <c ca="center">
                     <p>Case</p>
                  </c>
                  <c ca="center">
                     <p>Array ID</p>
                  </c>
                  <c ca="center">
                     <p>Global</p>
                  </c>
                  <c ca="center">
                     <p>Lowess</p>
                  </c>
                  <c ca="center">
                     <p>Adaptive</p>
                  </c>
               </r>
               <r>
                  <c cspan="5">
                     <hr/>
                  </c>
               </r>
               <r>
                  <c ca="center">
                     <p>1</p>
                  </c>
                  <c ca="center">
                     <p>svcc134</p>
                  </c>
                  <c ca="center">
                     <p>1.082</p>
                  </c>
                  <c ca="center">
                     <p>0.600</p>
                  </c>
                  <c ca="center">
                     <p>0.315</p>
                  </c>
               </r>
               <r>
                  <c ca="center">
                     <p>2</p>
                  </c>
                  <c ca="center">
                     <p>svcc104</p>
                  </c>
                  <c ca="center">
                     <p>1.023</p>
                  </c>
                  <c ca="center">
                     <p>0.601</p>
                  </c>
                  <c ca="center">
                     <p>0.508</p>
                  </c>
               </r>
               <r>
                  <c ca="center">
                     <p>3</p>
                  </c>
                  <c ca="center">
                     <p>svcc120</p>
                  </c>
                  <c ca="center">
                     <p>1.005</p>
                  </c>
                  <c ca="center">
                     <p>0.552</p>
                  </c>
                  <c ca="center">
                     <p>0.435</p>
                  </c>
               </r>
               <r>
                  <c ca="center">
                     <p>4</p>
                  </c>
                  <c ca="center">
                     <p>svcc64</p>
                  </c>
                  <c ca="center">
                     <p>0.999</p>
                  </c>
                  <c ca="center">
                     <p>0.593</p>
                  </c>
                  <c ca="center">
                     <p>0.516</p>
                  </c>
               </r>
               <r>
                  <c ca="center">
                     <p>5</p>
                  </c>
                  <c ca="center">
                     <p>svcc106</p>
                  </c>
                  <c ca="center">
                     <p>0.967</p>
                  </c>
                  <c ca="center">
                     <p>0.704</p>
                  </c>
                  <c ca="center">
                     <p>0.128</p>
                  </c>
               </r>
               <r>
                  <c ca="center">
                     <p>6</p>
                  </c>
                  <c ca="center">
                     <p>svcc89</p>
                  </c>
                  <c ca="center">
                     <p>1.018</p>
                  </c>
                  <c ca="center">
                     <p>0.593</p>
                  </c>
                  <c ca="center">
                     <p>0.284</p>
                  </c>
               </r>
               <r>
                  <c ca="center">
                     <p>7</p>
                  </c>
                  <c ca="center">
                     <p>svcc109</p>
                  </c>
                  <c ca="center">
                     <p>1.014</p>
                  </c>
                  <c ca="center">
                     <p>0.577</p>
                  </c>
                  <c ca="center">
                     <p>0.264</p>
                  </c>
               </r>
               <r>
                  <c ca="center">
                     <p>8</p>
                  </c>
                  <c ca="center">
                     <p>svcc103</p>
                  </c>
                  <c ca="center">
                     <p>1.011</p>
                  </c>
                  <c ca="center">
                     <p>0.653</p>
                  </c>
                  <c ca="center">
                     <p>0.138</p>
                  </c>
               </r>
               <r>
                  <c ca="center">
                     <p>9</p>
                  </c>
                  <c ca="center">
                     <p>svcc98</p>
                  </c>
                  <c ca="center">
                     <p>1.022</p>
                  </c>
                  <c ca="center">
                     <p>0.631</p>
                  </c>
                  <c ca="center">
                     <p>0.159</p>
                  </c>
               </r>
               <r>
                  <c ca="center">
                     <p>10</p>
                  </c>
                  <c ca="center">
                     <p>svcc82</p>
                  </c>
                  <c ca="center">
                     <p>1.017</p>
                  </c>
                  <c ca="center">
                     <p>0.567</p>
                  </c>
                  <c ca="center">
                     <p>0.529</p>
                  </c>
               </r>
               <r>
                  <c ca="center">
                     <p>Average RMSE</p>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="center">
                     <p>1.016</p>
                  </c>
                  <c ca="center">
                     <p>0.607</p>
                  </c>
                  <c ca="center">
                     <p>0.328</p>
                  </c>
               </r>
            </tblbdy>
         </tbl>
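         <p>The averages in the last row of Table <tblr tid="T2">2</tblr> can be reproduced from the per-array values with the short sketch below. The values are transcribed from the table; the variable names are assumptions introduced for illustration.</p>

```python
from statistics import mean

# RMSE values transcribed from Table 2, in array order svcc134 ... svcc82.
rmse = {
    "global": [1.082, 1.023, 1.005, 0.999, 0.967, 1.018, 1.014, 1.011, 1.022, 1.017],
    "lowess": [0.600, 0.601, 0.552, 0.593, 0.704, 0.593, 0.577, 0.653, 0.631, 0.567],
    "adaptive": [0.315, 0.508, 0.435, 0.516, 0.128, 0.284, 0.264, 0.138, 0.159, 0.529],
}

# Average RMSE per method, rounded to three decimals as in the table.
averages = {method: round(mean(values), 3) for method, values in rmse.items()}
print(averages)

# Ratio of the lowess average to the adaptive average.
print(round(averages["lowess"] / averages["adaptive"], 2))
```

         <p>The computed averages agree with the tabulated 1.016, 0.607, and 0.328, and the lowess average is about 1.85 times the adaptive average.</p>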
         <fig id="F3">
            <title>
               <p>Figure 3</p>
            </title>
            <caption>
               <p>Histograms and the estimated densities of log(ratio) and log(intensity) for a set of real data generated from array svcc109</p>
            </caption>
            <text>
               <p>Histograms and the estimated densities of log(ratio) and log(intensity) for a set of real data generated from array svcc109. The superimposed curve on each plot is generated from the fitted model.</p>
            </text>
            <graphic file="1471-2105-6-28-3"/>
         </fig>
      </sec>
      <sec>
         <st>
            <p>Discussion</p>
         </st>
         <p>In this paper, we propose a new method for normalization of dual-labeled arrays in cases where the number of differentially expressed genes is substantial and not necessarily symmetric in direction. The method performed effectively with both simulated and real data.</p>
         <p>We started our model building initially by introducing an unknown constant <it>c </it>into Newton's Gamma-Gamma-Bernoulli model <abbrgrp><abbr bid="B9">9</abbr></abbrgrp>. The mixture model consisted of two groups: non-differentially expressed genes (Equation 1A) and differentially expressed genes (Equation 1B).</p>
         <p>log(<it>cR</it><sub><it>k</it></sub>) ~ <it>Gamma</it>(<it>a</it>, <it>s</it><sub><it>k</it></sub>)</p>
         <p>log(<it>G</it><sub><it>k</it></sub>) ~ <it>Gamma</it>(<it>a</it>, <it>s</it><sub><it>k</it></sub>) &#160;&#160;&#160; (1A)</p>
         <p><it>s</it><sub><it>k </it></sub>~ <it>Gamma</it>(<it>a</it><sub>0</sub>, <it>&#947;</it>)</p>
         <p>log(<it>cR</it><sub><it>k</it></sub>) ~ <it>Gamma</it>(<it>a</it>, <graphic file="1471-2105-6-28-i1.gif"/>)</p>
         <p>log(<it>G</it><sub><it>k</it></sub>) ~ <it>Gamma</it>(<it>a</it>, <graphic file="1471-2105-6-28-i2.gif"/>) &#160;&#160;&#160; (1B)</p>
         <p><graphic file="1471-2105-6-28-i1.gif"/> ~ <it>Gamma</it>(<it>a</it><sub>0</sub>, <it>&#947;</it>)</p>
         <p><graphic file="1471-2105-6-28-i2.gif"/> ~ <it>Gamma</it>(<it>a</it><sub>0</sub>, <it>&#947;</it>)</p>
         <p>We found that when the differential expression was symmetric between the two samples, the model worked well. However, the error increased significantly when the ratio of the numbers of under- to over-expressed genes shifted from 1.</p>
         <p>In order to make the model more flexible, we modified it by assigning different scale factors <it>&#947;</it><sub><it>R </it></sub>and <it>&#947;</it><sub><it>G </it></sub>for the red channel and green channel intensities. For this modified two-component mixture, the error still remained large. We then extended the model into a three-component mixture model listed as Equations 8A-8C in the additional material [see <supplr sid="S2">Additional file 2</supplr>]. This model was quite flexible, but it had too many parameters to be optimized. When we tested it with simulated data and real data, we found that the estimated model was unstable and difficult to optimize. We finally simplified the model to our final model given by Equations 2A-2C (see Methods section). When applied to real data or simulated data, the estimates converged close to the global optima. When different starting points were used, the optimizations remained relatively robust.</p>
         <suppl id="S2">
            <title>
               <p>Additional File 2</p>
            </title>
            <text>
               <p>Equations 8A-8C: a three-component mixture model.</p>
            </text>
            <file name="1471-2105-6-28-S2.pdf">
               <p>Click here for file</p>
            </file>
         </suppl>
         <p>Evaluation of normalization methods can be difficult since the true normalization factors are unknown with real data for custom arrays. We avoided this problem by synthesizing customized arrays based on real data for standard arrays containing thousands of genes. In order to make the distribution of each component group look smoother, we allowed a certain range of overlap between adjacent groups. We also tried an additional sampling method that divided the whole distribution range into many non-overlapping intervals. In each interval the number of genes sampled increased as the absolute value of log<sub>2</sub>(ratio) became larger (Table 3 [see <supplr sid="S6">Additional file 6</supplr>]). The model fitting results using data generated by this sampling method are listed in Table 4 [see <supplr sid="S7">Additional file 7</supplr>] and Figure 6 [see <supplr sid="S5">Additional file 5</supplr>].</p>
         <suppl id="S5">
            <title>
               <p>Additional File 5</p>
            </title>
            <text>
               <p>Figure 6: Histograms and the estimated densities of log(ratio) and log(intensity) for a set of real data generated from array svcc109. The superimposed curve on each plot is generated from the fitted model. The procedure to generate the data was described in the paper and the sampling rate was shown in Table 3 [see <supplr sid="S6">Additional file 6</supplr>].</p>
            </text>
            <file name="1471-2105-6-28-S5.jpeg">
               <p>Click here for file</p>
            </file>
         </suppl>
         <suppl id="S6">
            <title>
               <p>Additional File 6</p>
            </title>
            <text>
               <p>Table 3: Different number of genes sampled in each interval.</p>
            </text>
            <file name="1471-2105-6-28-S6.pdf">
               <p>Click here for file</p>
            </file>
         </suppl>
         <suppl id="S7">
            <title>
               <p>Additional File 7</p>
            </title>
            <text>
               <p>Table 4: Comparison of RMSE by using the adaptive method and global method with real data by a different sampling method. The procedure to generate the data was described in the paper and the sampling rate was shown in Table 3 [see <supplr sid="S6">Additional file 6</supplr>].</p>
            </text>
            <file name="1471-2105-6-28-S7.pdf">
               <p>Click here for file</p>
            </file>
         </suppl>
         <p>We compared our adaptive method with the global method and the intensity-based lowess method. The lowess method assumes that in each intensity interval either the majority of genes are non-differentially expressed or the numbers of up- and down-regulated genes are equal. The global median normalization makes these assumptions only over the array as a whole. It is not surprising that our method performed much better than the above two methods, because the global median method only works well when the assumptions are valid while the intensity-based lowess method is only effective when there are intensity-dependent biases.</p>
         <p>Correlation structure is complicated for the thousands of genes on a microarray. In our model, the intensity of each channel is conditionally independent given the scale parameter, but not marginally independent. Therefore, we are not assuming that the intensities in the two channels are independent. Although we did not generate correlated genes in our simulated data sets, correlations of genes do exist in the real data sets we tested. Spatial correlations are also possible but our method is not designed for that purpose. Yang <it>et al. </it>proposed using the lowess normalization separately within each grid on the array <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>. Our algorithm could be similarly applied within each grid to control for spatial effects.</p>
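         <p>The distinction between conditional and marginal independence can be illustrated with a small simulation of a Gamma-Gamma hierarchy. This is a sketch under stated assumptions, not the fitting procedure used in this paper: the second argument of each Gamma distribution is treated as a rate, the parameter values are borrowed from the caption of Figure <figr fid="F1">1</figr>, and the names log_red, log_green, and gamma_rate are introduced here for illustration.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Parameter values borrowed from the Figure 1 caption. Treating the second
# argument of each Gamma as a rate is an assumption of this sketch.
a, a0, gamma_rate = 118.0, 410.0, 31.0
n_genes = 20000

# One shared parameter s_k per gene, drawn from Gamma(a0, rate=gamma_rate).
s = rng.gamma(shape=a0, scale=1.0 / gamma_rate, size=n_genes)

# Given s_k, the two log intensities are independent Gamma(a, rate=s_k) draws.
log_red = rng.gamma(shape=a, scale=1.0 / s)
log_green = rng.gamma(shape=a, scale=1.0 / s)

# Marginally (with s_k integrated out) the two channels are correlated.
corr = np.corrcoef(log_red, log_green)[0, 1]
print(f"marginal correlation: {corr:.3f}")
```

         <p>Because both channels share the same gene-specific parameter, the simulated log intensities are positively correlated across genes even though they are drawn independently once that parameter is fixed. Under the assumptions of this sketch the theoretical correlation is <it>a</it>/(<it>a </it>+ <it>a</it><sub>0 </sub>- 1), about 0.22.</p>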
         <p>Limited simulations were performed in this study. We also tested our method on real data. Since an appropriate data set with a known normalization factor was not available, we synthesized such data sets by sub-setting large arrays in which the true normalization factor could be accurately estimated. In the process of synthesizing such small arrays we had to choose an empirical threshold to separate the differentially expressed genes from the non-differentially expressed genes. Although we do not believe that the superiority shown for our algorithm depends critically on the threshold chosen or on details of the synthesis, it would be preferable to evaluate the algorithm on real data sets with known normalization factors.</p>
         <p>Although our method is designed for dual-labeled cDNA arrays, it can be extended to single-channel Affymetrix chip data. The most popular normalization method for the Affymetrix chip compares each array to a single baseline array for probe set summaries. The assumptions behind this normalization method are that the majority of the genes are non-differentially expressed and that the numbers of over- and under-expressed genes are roughly equal; these are the same assumptions as those for dual-labeled cDNA arrays. We could treat the baseline array as the 'reference' channel and the other array as the 'test' channel and apply our algorithm to probe set summaries. For Affymetrix chip data, there are multiple probe pairs in a probe set and each probe has an intensity measurement. Several alternative normalization methods for Affymetrix arrays utilize the probe level information. For example, the 'invariant set' method proposed by Li and Wong assumes that probes of non-differentially expressed genes have similar ranks in two arrays, and uses an iterative procedure to identify the invariant set, which presumably consists of points from non-differentially expressed genes <abbrgrp><abbr bid="B10">10</abbr></abbrgrp>.</p>
      </sec>
      <sec>
         <st>
            <p>Conclusions</p>
         </st>
         <p>Our new normalization method does not require that the majority of genes be non-differentially expressed, and does not require multiple array replicates, dye swaps, spiked controls, or housekeeping genes. It appears much more effective than standard methods when the numbers of over- and under-expressed genes are unequal and the majority of the genes are differentially expressed. It can be very useful for general microarray platforms when samples with very different expression profiles are co-hybridized and for custom arrays where the majority of genes are likely to be differentially expressed. In both of these settings, standard normalization methods are problematic.</p>
      </sec>
      <sec>
         <st>
            <p>Methods</p>
         </st>
         <sec>
            <st>
               <p>Model</p>
            </st>
            <p>We define the true intensities for a specific gene <it>k </it>in the two channels as <it>R</it><sub><it>k </it></sub>(red) and <it>G</it><sub><it>k </it></sub>(green). Let <it>c </it>be a positive constant that is related to the normalization constant. The observed intensities for gene <it>k </it>in the two channels are <it>cR</it><sub><it>k </it></sub>and <it>G</it><sub><it>k</it></sub>. We assume that the logarithm of the intensity in each channel has a <it>Gamma </it>distribution. The genes on the array belong to three different groups: 1) non-differentially expressed; 2) under-expressed; and 3) over-expressed. The overall data are fitted with the mixture model given below.</p>
            <p>For a non-differentially expressed gene <it>k</it>,</p>
            <p>log(<it>cR</it><sub><it>k</it></sub>) ~ <it>Gamma</it>(<it>a</it>, <it>s</it><sub><it>k</it></sub>)</p>
            <p>log(<it>G</it><sub><it>k</it></sub>) ~ <it>Gamma</it>(<it>a</it>, <it>s</it><sub><it>k</it></sub>) &#160;&#160;&#160; (2A)</p>
            <p><it>s</it><sub><it>k </it></sub>~ <it>Gamma</it>(<it>a</it><sub>0</sub>, <it>&#947;</it>).</p>
            <p>For an under-expressed gene <it>k</it>,</p>
            <p>log(<it>cR</it><sub><it>k</it></sub>) ~ <it>Gamma</it>(<it>a</it>, <graphic file="1471-2105-6-28-i1.gif"/>)</p>
            <p>log(<it>G</it><sub><it>k</it></sub>) ~ <it>Gamma</it>(<it>a</it>, <graphic file="1471-2105-6-28-i2.gif"/>) &#160;&#160;&#160; (2B)</p>
            <p><graphic file="1471-2105-6-28-i1.gif"/> ~ <it>Gamma</it>(<it>a</it><sub>0</sub>, <it>&#947;</it><sub>1</sub>)</p>
            <p><graphic file="1471-2105-6-28-i2.gif"/> ~ <it>Gamma</it>(<it>a</it><sub>0</sub>, <it>&#947;</it><sub>2</sub>).</p>
            <p>For an over-expressed gene <it>k</it>,</p>
            <p>log(<it>cR</it><sub><it>k</it></sub>) ~ <it>Gamma</it>(<it>a</it>, <graphic file="1471-2105-6-28-i1.gif"/>)</p>
            <p>log(<it>G</it><sub><it>k</it></sub>) ~ <it>Gamma</it>(<it>a</it>, <graphic file="1471-2105-6-28-i2.gif"/>) &#160;&#160;&#160; (2C)</p>
            <p><graphic file="1471-2105-6-28-i1.gif"/> ~ <it>Gamma</it>(<it>a</it><sub>0</sub>, <it>&#947;</it><sub>2</sub>)</p>
            <p><graphic file="1471-2105-6-28-i2.gif"/> ~ <it>Gamma</it>(<it>a</it><sub>0</sub>, <it>&#947;</it><sub>1</sub>).</p>
            <p>In the above <it>Gamma </it>distributions, the parameters <it>a </it>and <it>a</it><sub>0 </sub>are shape factors, and the parameters <it>s</it><sub><it>k</it></sub>, <it>&#947;</it>, <it>&#947;</it><sub>1</sub>,<it>&#947;</it><sub>2</sub>, <graphic file="1471-2105-6-28-i1.gif"/>, <graphic file="1471-2105-6-28-i2.gif"/> are scale factors. The parameters <it>a</it>, <it>a</it><sub>0</sub>, <it>&#947;</it>, <it>&#947;</it><sub>1</sub>, <it>&#947;</it><sub>2 </sub>will be estimated from the data.</p>
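            <p>As an illustration only, the three-group model above can be simulated in a few lines of Python. This is a minimal sketch under stated assumptions, not the code used in this study: the function name sample_log_intensities is hypothetical, and the second argument of each <it>Gamma </it>distribution is treated as a rate (inverse scale), an assumption made because it gives log intensities of a plausible magnitude for the parameter values used in the simulation study.</p>
            <p><![CDATA[
```python
import random


def sample_log_intensities(group, a, a0, gamma, gamma1, gamma2, rng):
    """Draw (log(cR_k), log(G_k)) for one gene from the mixture model.

    ASSUMPTION: the second Gamma argument is a rate, so each draw uses
    scale = 1 / rate in random.gammavariate(shape, scale).
    """
    def draw(shape, rate):
        return rng.gammavariate(shape, 1.0 / rate)

    if group == "non":
        # Non-differentially expressed: both channels share one s_k.
        s_red = s_green = draw(a0, gamma)
    elif group == "under":
        s_red, s_green = draw(a0, gamma1), draw(a0, gamma2)
    elif group == "over":
        s_red, s_green = draw(a0, gamma2), draw(a0, gamma1)
    else:
        raise ValueError("group must be 'non', 'under' or 'over'")
    return draw(a, s_red), draw(a, s_green)


rng = random.Random(0)
params = dict(a=118, a0=410, gamma=31, gamma1=23, gamma2=29)
under = [sample_log_intensities("under", rng=rng, **params)
         for _ in range(2000)]
mean_red = sum(x for x, _ in under) / len(under)
mean_green = sum(y for _, y in under) / len(under)
print(mean_red, mean_green)
```
]]></p>
            <p>Under the rate assumption, genes drawn from the under-expressed group have a lower average red-channel value than green-channel value, and swapping <it>&#947;</it><sub>1 </sub>and <it>&#947;</it><sub>2 </sub>gives the over-expressed group.</p>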
            <p>Let <it>p</it><sub><it>u</it></sub>(<it>R</it><sub><it>k</it></sub>, <it>G</it><sub><it>k</it></sub>), <it>p</it><sub><it>o</it></sub>(<it>R</it><sub><it>k</it></sub>, <it>G</it><sub><it>k</it></sub>) and <it>p</it><sub><it>n</it></sub>(<it>R</it><sub><it>k</it></sub>, <it>G</it><sub><it>k</it></sub>) be the densities of (<it>R</it><sub><it>k</it></sub>, <it>G</it><sub><it>k</it></sub>) for under-expressed, over-expressed and non-differentially expressed genes, respectively. The joint distributions of (<it>R</it><sub><it>k</it></sub>, <it>G</it><sub><it>k</it></sub>) in the three groups can be derived as follows [for details, see <supplr sid="S1">Additional file 1</supplr>]:</p>
            <suppl id="S1">
               <title>
                  <p>Additional File 1</p>
               </title>
               <text>
                  <p>Derivation of joint distributions of (R, G) for <it>p</it><sub><it>n</it></sub>, <it>p</it><sub><it>u</it></sub>, and <it>p</it><sub><it>o </it></sub></p>
               </text>
               <file name="1471-2105-6-28-S1.pdf">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <p>
               <graphic file="1471-2105-6-28-i3.gif"/>
            </p>
            <p>
               <graphic file="1471-2105-6-28-i4.gif"/>
            </p>
            <p>
               <graphic file="1471-2105-6-28-i5.gif"/>
            </p>
            <p>Let <it>&#952; </it>denote the unknown parameter vector (<it>a</it>, <it>a</it><sub>0</sub>, <it>&#947;</it>, <it>&#947;</it><sub>1</sub>, <it>&#947;</it><sub>2</sub>, <it>c</it>), which can be estimated by maximizing the likelihood function of the observed data. We used the EM algorithm <abbrgrp><abbr bid="B11">11</abbr></abbrgrp> for this maximization. Let <it>p</it><sub>1 </sub>be the proportion of under-expressed genes and <it>p</it><sub>2 </sub>be the proportion of over-expressed genes. We define the binary indicator variable <it>z</it><sub><it>k</it>1 </sub>to be 1 if the <it>k</it>th gene is under-expressed and 0 otherwise, and <it>z</it><sub><it>k</it>2 </sub>to be 1 if the <it>k</it>th gene is over-expressed and 0 otherwise. The <it>complete-data log-likelihood </it>for all spots can be derived as follows:</p>
            <p>
               <graphic file="1471-2105-6-28-i6.gif"/>
            </p>
            <p>In the M-step, we first take the derivatives of Equation (4) with respect to <it>p</it><sub>1 </sub>and <it>p</it><sub>2 </sub>and set them to zero. This yields</p>
            <p>
               <graphic file="1471-2105-6-28-i7.gif"/>
            </p>
            <p>
               <graphic file="1471-2105-6-28-i8.gif"/>
            </p>
            <p>where <it>K </it>is the total number of genes on the array.</p>
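            <p>The proportion update can be sketched as follows. The equations themselves are given as images, so this sketch assumes they take the standard mixture-model form, in which each proportion is the mean of the corresponding indicator (or of its conditional expectation from the E-step) over the <it>K </it>genes. The function name update_proportions is hypothetical.</p>
            <p><![CDATA[
```python
def update_proportions(z1, z2):
    """Return (p1, p2) as the average group membership over K genes.

    z1[k] and z2[k] are the (expected) indicators that gene k is
    under-expressed and over-expressed, respectively.
    ASSUMPTION: standard mixture-model update, p_j = sum_k z_kj / K.
    """
    K = len(z1)
    if K == 0 or len(z2) != K:
        raise ValueError("z1 and z2 must be non-empty and of equal length")
    return sum(z1) / K, sum(z2) / K


p1, p2 = update_proportions([1.0, 0.0, 0.0, 0.5], [0.0, 1.0, 0.0, 0.5])
print(p1, p2)  # 0.375 0.375
```
]]></p>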
            <p>To maximize Equation (4), we only need to maximize Equation (6), because the omitted terms do not depend on the parameter <it>&#952;</it>.</p>
            <p>
               <graphic file="1471-2105-6-28-i9.gif"/>
            </p>
            <p>In the E-step, we compute the conditional expectations of <it>z</it><sub><it>k</it>1 </sub>and <it>z</it><sub><it>k</it>2 </sub>given the other parameters from the M-step.</p>
            <p>
               <graphic file="1471-2105-6-28-i10.gif"/>
            </p>
            <p>
               <graphic file="1471-2105-6-28-i11.gif"/>
            </p>
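            <p>A minimal sketch of this E-step is given below. It assumes that the conditional expectations follow from Bayes' rule applied to the three-component mixture, with the group densities <it>p</it><sub><it>u</it></sub>, <it>p</it><sub><it>o </it></sub>and <it>p</it><sub><it>n </it></sub>already evaluated at each gene under the current parameter estimates. The function name e_step and the toy density values are hypothetical.</p>
            <p><![CDATA[
```python
def e_step(dens_under, dens_over, dens_non, p1, p2):
    """Conditional expectations of z_k1 and z_k2 by Bayes' rule.

    dens_under, dens_over and dens_non hold the group densities
    p_u, p_o and p_n evaluated at each gene (toy inputs here).
    """
    z1, z2 = [], []
    for f_u, f_o, f_n in zip(dens_under, dens_over, dens_non):
        w_u = p1 * f_u                  # weight of the under-expressed group
        w_o = p2 * f_o                  # weight of the over-expressed group
        w_n = (1.0 - p1 - p2) * f_n     # weight of the non-differential group
        total = w_u + w_o + w_n
        z1.append(w_u / total)
        z2.append(w_o / total)
    return z1, z2


# Gene 1 is equally likely under every group; gene 2 favours "under".
z1, z2 = e_step([0.2, 0.9], [0.2, 0.05], [0.2, 0.05], p1=0.25, p2=0.25)
print(z1, z2)
```
]]></p>
            <p>When all three densities are equal for a gene, the expectations reduce to the current proportions <it>p</it><sub>1 </sub>and <it>p</it><sub>2</sub>, which is a convenient sanity check.</p>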
            <p>Once the constant <it>c </it>has been obtained, the normalization constant for the log intensity ratio data can be calculated as log(1/<it>c</it>).</p>
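            <p>The effect of this constant can be shown with a short worked example. Because log(<it>cR</it><sub><it>k</it></sub>/<it>G</it><sub><it>k</it></sub>) = log(<it>c</it>) + log(<it>R</it><sub><it>k</it></sub>/<it>G</it><sub><it>k</it></sub>), adding log(1/<it>c</it>) to an observed log ratio recovers the log ratio of the true intensities. The sketch below uses base-2 logarithms, and the function name and intensity values are illustrative assumptions.</p>
            <p><![CDATA[
```python
import math


def normalize_log2_ratios(red_observed, green_observed, c_hat):
    """Add log2(1/c_hat) to each observed log2(red/green) ratio."""
    shift = math.log2(1.0 / c_hat)
    return [math.log2(r / g) + shift
            for r, g in zip(red_observed, green_observed)]


# True intensities R = (100, 200), G = (100, 100); observed red is c * R.
c = 1.5
red_observed = [c * 100.0, c * 200.0]
green_observed = [100.0, 100.0]
normalized = normalize_log2_ratios(red_observed, green_observed, c_hat=c)
print(normalized)  # approximately [0.0, 1.0]
```
]]></p>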
         </sec>
      </sec>
      <sec>
         <st>
            <p>Model evaluation</p>
         </st>
         <p>Simulation studies were performed by generating two channel intensities from the mixture model with <it>c </it>= 1.5, <it>a </it>= 118, <it>a</it><sub>0 </sub>= 410, <it>&#947; </it>= 31, <it>&#947;</it><sub>1 </sub>= 23, and <it>&#947;</it><sub>2 </sub>= 29. Six scenarios were included using different proportions of non-differentially expressed genes and different ratios of under- to over- expressed genes, as listed in Table <tblr tid="T1">1</tblr>.</p>
         <tbl id="T1">
            <title>
               <p>Table 1</p>
            </title>
            <caption>
               <p>Six scenarios using different proportions of non-differentially expressed genes and different ratios of under- to over-expressed genes with simulated data.</p>
            </caption>
            <tblbdy cols="4">
               <r>
                  <c>
                     <p/>
                  </c>
                  <c cspan="3" ca="center">
                     <p>Number of genes</p>
                  </c>
               </r>
               <r>
                  <c ca="center">
                     <p>Case</p>
                  </c>
                  <c ca="center">
                     <p>Under-expressed</p>
                  </c>
                  <c ca="center">
                     <p>Non-differentially expressed</p>
                  </c>
                  <c ca="center">
                     <p>Over-expressed</p>
                  </c>
               </r>
               <r>
                  <c cspan="4">
                     <hr/>
                  </c>
               </r>
               <r>
                  <c ca="center">
                     <p>1</p>
                  </c>
                  <c ca="center">
                     <p>100</p>
                  </c>
                  <c ca="center">
                     <p>500</p>
                  </c>
                  <c ca="center">
                     <p>100</p>
                  </c>
               </r>
               <r>
                  <c ca="center">
                     <p>2</p>
                  </c>
                  <c ca="center">
                     <p>100</p>
                  </c>
                  <c ca="center">
                     <p>100</p>
                  </c>
                  <c ca="center">
                     <p>100</p>
                  </c>
               </r>
               <r>
                  <c ca="center">
                     <p>3</p>
                  </c>
                  <c ca="center">
                     <p>200</p>
                  </c>
                  <c ca="center">
                     <p>100</p>
                  </c>
                  <c ca="center">
                     <p>100</p>
                  </c>
               </r>
               <r>
                  <c ca="center">
                     <p>4</p>
                  </c>
                  <c ca="center">
                     <p>200</p>
                  </c>
                  <c ca="center">
                     <p>100</p>
                  </c>
                  <c ca="center">
                     <p>50</p>
                  </c>
               </r>
               <r>
                  <c ca="center">
                     <p>5</p>
                  </c>
                  <c ca="center">
                     <p>200</p>
                  </c>
                  <c ca="center">
                     <p>50</p>
                  </c>
                  <c ca="center">
                     <p>100</p>
                  </c>
               </r>
               <r>
                  <c ca="center">
                     <p>6</p>
                  </c>
                  <c ca="center">
                     <p>200</p>
                  </c>
                  <c ca="center">
                     <p>50</p>
                  </c>
                  <c ca="center">
                     <p>50</p>
                  </c>
               </r>
            </tblbdy>
         </tbl>
         <p>One hundred data sets were generated for each scenario and the RMSE between the estimated log<sub>2</sub>(1/<graphic file="1471-2105-6-28-i12.gif"/>) and the true log<sub>2</sub>(1/<it>c</it>) was calculated. The global method takes the median log<sub>2</sub>(ratio) of all genes in each data set as the normalization factor. The lowess method performs robust locally linear fits to the M-A plot and corrects the biases that depend on spot intensity <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>. The RMSE between the normalized log<sub>2</sub>(ratio) using the lowess method and the normalized log<sub>2</sub>(ratio) using the global method with the true normalization factor (log<sub>2</sub>(1/<it>c</it>)) for all genes was also calculated for the same 100 data sets.</p>
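         <p>The RMSE comparison for the global method can be sketched as follows. The sketch assumes that the ratios are the observed <it>cR</it><sub><it>k</it></sub>/<it>G</it><sub><it>k</it></sub>, so that the median log<sub>2</sub>(ratio) estimates log<sub>2</sub>(<it>c</it>) and its negative estimates log<sub>2</sub>(1/<it>c</it>). The function names and the toy data sets are hypothetical.</p>
         <p><![CDATA[
```python
import math
import statistics


def global_estimate(log2_ratios):
    """Global-method estimate of log2(1/c) for one data set.

    ASSUMPTION: ratios are observed cR/G, so the median estimates
    log2(c) and minus the median estimates log2(1/c).
    """
    return -statistics.median(log2_ratios)


def rmse(estimates, truth):
    """Root mean squared error of the estimates around a single true value."""
    return math.sqrt(sum((e - truth) ** 2 for e in estimates) / len(estimates))


true_value = math.log2(1.0 / 1.5)
# Two toy data sets of observed log2 ratios (illustrative numbers only).
datasets = [[0.5, 0.6, 0.7], [0.4, 0.6, 2.0]]
estimates = [global_estimate(d) for d in datasets]
print(rmse(estimates, true_value))
```
]]></p>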
         <p>Gaussian white noise was also added when generating the simulated data. We used a standard deviation of 0.25 on the log<sub>2 </sub>scale to reflect the experimental noise in inbred strains of mice or cell line data, and 0.5 on the log<sub>2 </sub>scale to reflect the larger experimental noise in human tissue data <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>.</p>
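         <p>One way to add such noise is sketched below. Adding Gaussian noise on the log<sub>2 </sub>scale is equivalent to multiplying an intensity by 2 raised to a normal deviate. The sketch assumes that the noise is applied independently to each intensity, which the text does not specify, and the function name add_log2_noise is hypothetical.</p>
         <p><![CDATA[
```python
import math
import random


def add_log2_noise(intensities, sd, rng):
    """Multiply each intensity by 2**e with e ~ Normal(0, sd).

    This equals adding Gaussian noise on the log2 scale.
    ASSUMPTION: noise is applied independently to each intensity.
    """
    return [x * 2.0 ** rng.gauss(0.0, sd) for x in intensities]


rng = random.Random(0)
clean = [1000.0] * 5000
noisy = add_log2_noise(clean, sd=0.25, rng=rng)
# Recover the log2-scale errors and their empirical spread.
errors = [math.log2(n / c) for n, c in zip(noisy, clean)]
sd_est = math.sqrt(sum(e * e for e in errors) / len(errors))
print(sd_est)
```
]]></p>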
         <p><it>In-silico </it>studies were performed on real data. We tested the method on ten arrays from publicly available breast cancer data <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>. Each array consists of 9216 genes. The common reference sample was a pool of RNA isolated from 11 different cultured cell lines (green channel, labeled with Cy3). RNA from tissues of breast cancer patients was used in the test channel (red channel, labeled with Cy5). Each array was first normalized by the global normalization method, and the median log<sub>2</sub>(ratio) of all genes was taken as the true normalization factor <it>c</it>. The genes were then divided into three groups: over-expressed genes (log<sub>2</sub>(ratio) &gt; 1), non-differentially expressed genes (-1.5 &lt; log<sub>2</sub>(ratio) &lt; 1.5), and under-expressed genes (log<sub>2</sub>(ratio) &lt; -1). We randomly sampled a specified number of genes from each group (100 non-differentially expressed, 200 under-expressed and 100 over-expressed genes) and combined them into an <it>in-silico </it>array. We constructed 100 datasets for each of the 10 arrays in this way and calculated the RMSE between the estimated log<sub>2</sub>(1/<graphic file="1471-2105-6-28-i12.gif"/>) and the true log<sub>2</sub>(1/<it>c</it>).</p>
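         <p>The construction of one <it>in-silico </it>array can be sketched as follows. The thresholds are taken as stated in the text, so a gene whose absolute log<sub>2</sub>(ratio) lies between 1 and 1.5 satisfies two group definitions and could be drawn twice; the sketch does not resolve this. The function name, its parameters and the toy ratios are hypothetical.</p>
         <p><![CDATA[
```python
import random


def build_in_silico_array(log2_ratios, n_under, n_non, n_over, rng,
                          de_cut=1.0, non_cut=1.5):
    """Return gene indices for one in-silico array, sampled by group."""
    under = [i for i, m in enumerate(log2_ratios) if m < -de_cut]
    non = [i for i, m in enumerate(log2_ratios) if -non_cut < m < non_cut]
    over = [i for i, m in enumerate(log2_ratios) if m > de_cut]
    # Random.sample draws without replacement within each group.
    return (rng.sample(under, n_under)
            + rng.sample(non, n_non)
            + rng.sample(over, n_over))


# Toy globally normalized log2 ratios (illustrative values only).
ratios = [-2.0] * 300 + [0.0] * 300 + [2.0] * 300
picked = build_in_silico_array(ratios, 200, 100, 100, random.Random(1))
print(len(picked))  # 400
```
]]></p>
         <p>Repeating the call with different seeds gives the repeated data sets per array described above.</p>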
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgement</p>
            </st>
            <p>We thank Dr. George Wright for reading our manuscript and for helpful discussions, and the editor and reviewers for their valuable suggestions.</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>Centralization: a new method for the normalization of gene expression data</p>
            </title>
            <aug>
               <au>
                  <snm>Zien</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Aigner</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Zimmer</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Lengauer</snm>
                  <fnm>T</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2001</pubdate>
            <volume>17</volume>
            <issue>Suppl 1</issue>
            <fpage>S323</fpage>
            <lpage>S331</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pubmed">11473024</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B2">
            <title>
               <p>Microarray data normalization and transformation</p>
            </title>
            <aug>
               <au>
                  <snm>Quackenbush</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Nature Genetics</source>
            <pubdate>2002</pubdate>
            <volume>32</volume>
            <issue>Suppl</issue>
            <fpage>496</fpage>
            <lpage>501</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/ng1032</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B3">
            <title>
               <p>Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation</p>
            </title>
            <aug>
               <au>
                  <snm>Yang</snm>
                  <fnm>YH</fnm>
               </au>
               <au>
                  <snm>Dudoit</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Luu</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Lin</snm>
                  <fnm>DM</fnm>
               </au>
               <au>
                  <snm>Peng</snm>
                  <fnm>V</fnm>
               </au>
               <au>
                  <snm>Ngai</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Speed</snm>
                  <fnm>TP</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Research</source>
            <pubdate>2002</pubdate>
            <volume>30</volume>
            <issue>4</issue>
            <fpage>e15</fpage>
            <lpage/>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">100354</pubid>
                  <pubid idtype="pmpid" link="fulltext">11842121</pubid>
                  <pubid idtype="doi">10.1093/nar/30.4.e15</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B4">
            <title>
               <p>Normalization and analysis of DNA microarray data by self-consistency and local regression</p>
            </title>
            <aug>
               <au>
                  <snm>Kepler</snm>
                  <fnm>TB</fnm>
               </au>
               <au>
                  <snm>Crosby</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Morgan</snm>
                  <fnm>KT</fnm>
               </au>
            </aug>
            <source>Genome Biology</source>
            <pubdate>2002</pubdate>
            <volume>3</volume>
            <fpage>research0037</fpage>
            <lpage/>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">126242</pubid>
                  <pubid idtype="pmpid" link="fulltext">12184811</pubid>
                  <pubid idtype="doi">10.1186/gb-2002-3-7-research0037</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B5">
            <title>
               <p>Organ-specific differences in gene expression and UniGene annotations describing source material</p>
            </title>
            <aug>
               <au>
                  <snm>Stivers</snm>
                  <fnm>DN</fnm>
               </au>
               <au>
                  <snm>Wang</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Rosner</snm>
                  <fnm>GL</fnm>
               </au>
               <au>
                  <snm>Coombes</snm>
                  <fnm>KR</fnm>
               </au>
            </aug>
            <source>Methods of Microarray Data Analysis III</source>
            <publisher>Kluwer Academic Publishers</publisher>
            <pubdate>2003</pubdate>
            <fpage>59</fpage>
            <lpage>72</lpage>
         </bibl>
         <bibl id="B6">
            <title>
               <p>New normalization methods for cDNA microarray data</p>
            </title>
            <aug>
               <au>
                  <snm>Wilson</snm>
                  <fnm>DL</fnm>
               </au>
               <au>
                  <snm>Buckley</snm>
                  <fnm>MJ</fnm>
               </au>
               <au>
                  <snm>Helliwell</snm>
                  <fnm>CA</fnm>
               </au>
               <au>
                  <snm>Wilson</snm>
                  <fnm>IW</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2003</pubdate>
            <volume>19</volume>
            <fpage>1325</fpage>
            <lpage>1332</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/btg146</pubid>
                  <pubid idtype="pmpid" link="fulltext">12874043</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B7">
            <title>
               <p>Iterative normalization of cDNA Microarray data</p>
            </title>
            <aug>
               <au>
                  <snm>Wang</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Lu</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Lee</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Gu</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Clarke</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>IEEE Trans On Info Tech in Biom</source>
            <pubdate>2002</pubdate>
            <volume>6</volume>
            <fpage>29</fpage>
            <lpage>37</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1109/4233.992159</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B8">
            <title>
               <p>Normalization of single-channel DNA array data by principal component analysis</p>
            </title>
            <aug>
               <au>
                  <snm>Stoyanova</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Querec</snm>
                  <fnm>TD</fnm>
               </au>
               <au>
                  <snm>Brown</snm>
                  <fnm>TR</fnm>
               </au>
               <au>
                  <snm>Patriotis</snm>
                  <fnm>C</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2004</pubdate>
            <volume>20</volume>
            <fpage>1772</fpage>
            <lpage>1784</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/bth170</pubid>
                  <pubid idtype="pmpid" link="fulltext">15037508</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B9">
            <title>
               <p>On differential variability of expression ratios: Improving statistical inference about gene expression changes from microarray data</p>
            </title>
            <aug>
               <au>
                  <snm>Newton</snm>
                  <fnm>MA</fnm>
               </au>
               <au>
                  <snm>Kendziorski</snm>
                  <fnm>CM</fnm>
               </au>
               <au>
                  <snm>Richmond</snm>
                  <fnm>CS</fnm>
               </au>
               <au>
                  <snm>Blattner</snm>
                  <fnm>FR</fnm>
               </au>
               <au>
                  <snm>Tsui</snm>
                  <fnm>KW</fnm>
               </au>
            </aug>
            <source>J Computat Biol</source>
            <pubdate>2001</pubdate>
            <volume>8</volume>
            <fpage>37</fpage>
            <lpage>52</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1089/106652701300099074</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B10">
            <title>
               <p>Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error application</p>
            </title>
            <aug>
               <au>
                  <snm>Li</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Wong</snm>
                  <fnm>WH</fnm>
               </au>
            </aug>
            <source>Genome Biology</source>
            <pubdate>2001</pubdate>
            <volume>2</volume>
            <issue>8</issue>
            <fpage>research0032</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">55329</pubid>
                  <pubid idtype="pmpid" link="fulltext">11532216</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B11">
            <title>
               <p>Maximum likelihood from incomplete data via the EM algorithm (with discussion)</p>
            </title>
            <aug>
               <au>
                  <snm>Dempster</snm>
                  <fnm>AP</fnm>
               </au>
               <au>
                  <snm>Laird</snm>
                  <fnm>NM</fnm>
               </au>
               <au>
                  <snm>Rubin</snm>
                  <fnm>DB</fnm>
               </au>
            </aug>
            <source>J Royal Statistical Society Series B</source>
            <pubdate>1977</pubdate>
            <volume>39</volume>
            <fpage>1</fpage>
            <lpage>38</lpage>
         </bibl>
         <bibl id="B12">
            <title>
               <p>Design and analysis of DNA microarray investigations</p>
            </title>
            <aug>
               <au>
                  <snm>Simon</snm>
                  <fnm>RM</fnm>
               </au>
               <au>
                  <snm>Korn</snm>
                  <fnm>EL</fnm>
               </au>
               <au>
                  <snm>McShane</snm>
                  <fnm>LM</fnm>
               </au>
               <au>
                  <snm>Radmacher</snm>
                  <fnm>MD</fnm>
               </au>
               <au>
                  <snm>Wright</snm>
                  <fnm>GW</fnm>
               </au>
               <au>
                  <snm>Zhao</snm>
                  <fnm>Y</fnm>
               </au>
            </aug>
            <source>Springer</source>
            <pubdate>2004</pubdate>
            <volume/>
            <fpage>24</fpage>
            <lpage>25</lpage>
         </bibl>
         <bibl id="B13">
            <title>
               <p>Molecular portraits of human breast tumours</p>
            </title>
            <aug>
               <au>
                  <snm>Perou</snm>
                  <fnm>CM</fnm>
               </au>
               <au>
                  <snm>S&#248;rlie</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Eisen</snm>
                  <fnm>MB</fnm>
               </au>
               <au>
                  <snm>van de Rijn</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Jeffrey</snm>
                  <fnm>SS</fnm>
               </au>
               <au>
                  <snm>Rees</snm>
                  <fnm>CA</fnm>
               </au>
               <au>
                  <snm>Pollack</snm>
                  <fnm>JR</fnm>
               </au>
               <au>
                  <snm>Ross</snm>
                  <fnm>DT</fnm>
               </au>
               <au>
                  <snm>Johnsen</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Akslen</snm>
                  <fnm>LA</fnm>
               </au>
               <au>
                  <snm>Fluge</snm>
                  <fnm>&#216;</fnm>
               </au>
               <au>
                  <snm>Pergamenschikov</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Williams</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Zhu</snm>
                  <fnm>SX</fnm>
               </au>
               <au>
                  <snm>L&#248;nning</snm>
                  <fnm>PE</fnm>
               </au>
               <au>
                  <snm>B&#248;rresen-Dale</snm>
                  <fnm>AL</fnm>
               </au>
               <au>
                  <snm>Brown</snm>
                  <fnm>PO</fnm>
               </au>
               <au>
                  <snm>Botstein</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Nature</source>
            <pubdate>2000</pubdate>
            <volume>406</volume>
            <fpage>747</fpage>
            <lpage>752</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/35021093</pubid>
                  <pubid idtype="pmpid" link="fulltext">10963602</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
      </refgrp>
   </bm>
</art>
