<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1471-2105-9-164</ui>
   <ji>1471-2105</ji>
   <fm>
      <dochead>Research article</dochead>
      <bibl>
         <title>
            <p>A comprehensive re-analysis of the Golden Spike data: Towards a benchmark for differential expression methods</p>
         </title>
         <aug>
            <au id="A1" ca="yes">
               <snm>Pearson</snm>
               <mi>D</mi>
               <fnm>Richard</fnm>
               <insr iid="I1"/>
               <email>richard.pearson@postgrad.manchester.ac.uk</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>School of Computer Science, University of Manchester, Oxford Road, Manchester, M13 9PL, UK</p>
            </ins>
         </insg>
         <source>BMC Bioinformatics</source>
         <issn>1471-2105</issn>
         <pubdate>2008</pubdate>
         <volume>9</volume>
         <issue>1</issue>
         <fpage>164</fpage>
         <url>http://www.biomedcentral.com/1471-2105/9/164</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">18366762</pubid>
               <pubid idtype="doi">10.1186/1471-2105-9-164</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>09</day>
               <month>11</month>
               <year>2007</year>
            </date>
         </rec>
         <acc>
            <date>
               <day>26</day>
               <month>3</month>
               <year>2008</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>26</day>
               <month>3</month>
               <year>2008</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2008</year>
         <collab>Pearson; licensee BioMed Central Ltd.</collab>
         <note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>The Golden Spike data set has been used to validate a number of methods for summarizing Affymetrix data sets, sometimes with seemingly contradictory results. Much less use has been made of this data set to evaluate differential expression methods. It has been suggested that this data set should not be used for method comparison due to a number of inherent flaws.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>We have used this data set in a comparison of methods which is far more extensive than any previous study. We outline six stages in the analysis pipeline where decisions need to be made, and show how the results of these decisions can lead to the apparently contradictory results previously found. We also show that, while flawed, this data set is still a useful tool for method comparison, particularly for identifying combinations of summarization and differential expression methods that are unlikely to perform well on real data sets. We describe a new benchmark, AffyDEComp, that can be used for such a comparison.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusion</p>
               </st>
               <p>We conclude with recommendations for preferred Affymetrix analysis tools, and for the development of future spike-in data sets.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>The issue of method validation is of great importance to the microarray community, arguably more important than the development of new methods <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>. The microarray analyst is faced with a seemingly endless choice of methods, many of which are presented with evidence of superiority over other approaches; at times these claims appear contradictory. Because of this, the choice of method is often determined not by a rigorous comparison of method performance, but by what a researcher is familiar with, what a researcher's colleagues have expertise in, or what was used in a researcher's favorite paper. Method validation is a difficult problem in microarray analysis because, for the vast majority of microarray data sets, we do not know what the "right answer" really is. For example, in a typical analysis of differential gene expression, we rarely know which genes are truly differentially expressed (DE) between different conditions. Perhaps worse still, we rarely have any strong evidence about the proportion of genes that are differentially expressed.</p>
         <p>Perhaps the best-known and most widely used benchmark for Affymetrix analysis methods is Affycomp <abbrgrp><abbr bid="B2">2</abbr></abbrgrp>. This is essentially a benchmark for normalization and summarization methods. While a very valuable tool for method validation, Affycomp is not ideal for comparing DE methods because:</p>
         <p>1. It uses data sets which only have a small number of DE spike-in probesets.</p>
         <p>2. It only uses fold change (FC) as a metric for DE detection, and hence cannot be used to compare other competing DE methods.</p>
         <p>More recently, the MicroArray Quality Control (MAQC) study <abbrgrp><abbr bid="B3">3</abbr></abbrgrp> has developed a large number of reference data sets. The primary goal of this study was to show that microarray results can be reproducible; a secondary goal, however, was to provide tools for benchmarking methods. The study concluded that using FC as a DE method gives results that are more reproducible than the other DE methods studied. However, the study could not give recommendations about other important metrics for DE methods such as sensitivity and specificity. The problem here is that we do not know for sure which genes are differentially expressed between the conditions. We could infer this by comparing results across the different microarray technologies used, but the different technologies may well have similar biases, invalidating the results. We could also infer which genes are differentially expressed by comparison with other technologies such as qRT-PCR, but again, there could be similar biases in these technologies. Furthermore, there are competing methods for detection of DE genes using qRT-PCR, so we may well get contradictory results when comparing different microarray DE methods against different qRT-PCR DE methods.</p>
         <p>The "Golden Spike" data set of Choe <it>et al</it>. <abbrgrp><abbr bid="B4">4</abbr></abbrgrp> includes two conditions, control (C) and sample (S), with 3 replicates per condition. Each array has 14,010 probesets. Of these, 3,866 probesets can be used to detect RNAs that have been spiked in. Of these spike-in probesets, 2,535 relate to RNAs that have been spiked in at equal concentrations in the two conditions. The remaining 1,331 probesets relate to RNAs that have been spiked in at higher concentrations in the S condition relative to the C condition. As such, this data set has a large number of probesets that are known to be DE, and a large number that are known to be not DE. This makes the Golden Spike data set potentially very valuable for validating DE methods.</p>
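         <p>The probeset counts quoted above can be checked with a few lines of arithmetic. The following is only an illustrative sketch; the variable names are our own and the derived proportions are simple consequences of the quoted counts, not additional results.</p>
         <p>
```python
# Probeset counts quoted in the text for the Golden Spike arrays.
total_probesets = 14010
spike_in = 3866
equal_concentration = 2535   # spiked in at equal concentrations in C and S
higher_in_s = 1331           # spiked in at higher concentration in S

# The two spike-in classes should account for every spike-in probeset.
assert equal_concentration + higher_in_s == spike_in

# Probesets that do not correspond to any spiked-in RNA.
no_spike_in = total_probesets - spike_in

print(no_spike_in)                              # 10144
print(round(higher_in_s / spike_in, 3))         # 0.344
print(round(no_spike_in / total_probesets, 3))  # 0.724
```
         </p>
         <p>In other words, roughly a third of the spike-in probesets are known to be DE, while most probesets on the array have no spiked-in target at all.</p>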
         <p>There have been criticisms of the Golden Spike data set from Dabney and Storey <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>, Irizarry <it>et al</it>. <abbrgrp><abbr bid="B6">6</abbr></abbrgrp> and Gaile and Miecznikowski <abbrgrp><abbr bid="B7">7</abbr></abbrgrp>. The main criticisms of <abbrgrp><abbr bid="B5">5</abbr></abbrgrp> and <abbrgrp><abbr bid="B7">7</abbr></abbrgrp> center around the fact that the non-DE probesets in the Golden Spike data set have non-uniform p-value distributions. This implies that any measure of significance of DE will be incorrect. Significance measures are valuable because they allow a researcher to make principled decisions about how many genes might be DE, which is a goal towards which we should strive. Unfortunately, we still have no way of knowing for sure whether the non-uniform p-value distributions of the non-DE probesets seen in the Golden Spike data set are particular to this data set, or are a general feature of microarray data sets. Indeed, a recent study by Fodor <it>et al</it>. <abbrgrp><abbr bid="B8">8</abbr></abbrgrp> has suggested non-uniform p-value distributions may be common. However, even if we cannot reliably predict the proportion of genes that are differentially expressed, we can still rank the genes from most likely to be DE to least likely to be DE. In many cases, a researcher might want a list of candidate genes which will be investigated further. A common though admittedly unprincipled approach is to choose the top N candidate genes where N is determined by available resources rather than statistical significance. In such situations it is the rank order of probability of being DE that is used. The tool that has been used most extensively for comparing methods on this data set is the receiver-operator characteristic (ROC) chart. The ROC chart only takes into account the rank order of DE probesets, and hence is not affected by concerns about non-uniform p-value distributions. 
Gaile and Miecznikowski <abbrgrp><abbr bid="B7">7</abbr></abbrgrp> show that the Golden Spike data set is not suitable for comparison of methods of false discovery rate (FDR) control, but say nothing about whether or not the data set can be used for comparing methods of ranking genes by propensity to be DE.</p>
         <p>Irizarry <it>et al</it>. <abbrgrp><abbr bid="B6">6</abbr></abbrgrp> detail three undesirable characteristics of the Golden Spike data set induced by the experimental design, and one artifact. The three undesirable characteristics are:</p>
         <p>1. Spike-in concentrations are unrealistically high.</p>
         <p>2. DE spike-ins are all one-way (up-regulated).</p>
         <p>3. Nominal concentrations and FC sizes are confounded.</p>
         <p>While we agree that these are indeed undesirable characteristics, and would recommend the creation of new spike-in data sets that do not have these characteristics, we do not believe that these completely invalidate the use of the Golden Spike data set as a useful comparison tool.</p>
         <p>Perhaps more serious is the artifact identified by Irizarry <it>et al</it>. <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>. They show that the FCs of the spike-ins that are spiked in at equal levels are lower than those of the "empty" probesets (i.e. those not spiked in). Schuster <it>et al</it>. <abbrgrp><abbr bid="B9">9</abbr></abbrgrp> have recently suggested that this difference is due to differences in non-specific binding, which in turn is due to differences in amounts of labeled cRNA between the C and S conditions. We agree that this artifact invalidates comparison methods that use the set of all unchanging (equal FC and empty) probesets as true negatives when creating ROC charts. However, we argue that we can still use the Golden Spike data set as a valid benchmark by using ROC charts with just the equal FC probesets as our true negatives (i.e. by ignoring the empty probesets).</p>
         <p>The Golden Spike data set has been used to validate many different methods for summarizing Affymetrix data sets. Choe <it>et al</it>. <abbrgrp><abbr bid="B4">4</abbr></abbrgrp> originally used this data set to show that a modified form of MAS5.0 (which we will refer to as CP for Choe Preferred) outperforms RMA <abbrgrp><abbr bid="B10">10</abbr></abbrgrp>, GCRMA <abbrgrp><abbr bid="B11">11</abbr></abbrgrp> and MBEI (the algorithm used in the dChip software) <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>. Liu <it>et al</it>. <abbrgrp><abbr bid="B13">13</abbr></abbrgrp> used the data set to show that multi-mgMOS <abbrgrp><abbr bid="B14">14</abbr></abbrgrp> can outperform CP. Hochreiter <it>et al</it>. <abbrgrp><abbr bid="B15">15</abbr></abbrgrp> used the data set to show that FARMS outperforms RMA, MAS5.0 and MBEI, and that RMA outperforms MAS5.0 and MBEI, in apparent contradiction to Choe <it>et al</it>. <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>. Chen <it>et al</it>. <abbrgrp><abbr bid="B16">16</abbr></abbrgrp> used the data to show that DFW and GCRMA outperform RMA, MAS5.0, MBEI, PLIER <abbrgrp><abbr bid="B17">17</abbr></abbrgrp>, FARMS and CP, again in apparent contradiction to Choe <it>et al</it>. <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>. All of these papers used some form of ROC curve in their analyses. The confusing and seemingly contradictory results make it difficult for typical Affymetrix users to decide between methods.</p>
         <p>The differing results arise from the different choices made at various stages of the analysis pipeline. In particular, different DE methods have been used in the papers cited above. Only Choe <it>et al</it>. <abbrgrp><abbr bid="B4">4</abbr></abbrgrp> and Liu <it>et al</it>. <abbrgrp><abbr bid="B13">13</abbr></abbrgrp> have compared different DE methods on the results of the same normalization and summarization methods. Choices for DE methods include: fold change (FC); t-tests; modified t-tests such as those used by limma <abbrgrp><abbr bid="B18">18</abbr></abbrgrp> and Cyber-T <abbrgrp><abbr bid="B19">19</abbr></abbrgrp>; and the probability of positive log ratio (PPLR) method <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>. In addition to choice of DE method, there are choices to be made at other stages of the analysis pipeline. We broadly summarize these as the following six choices, each of which can have a significant influence over results:</p>
         <p>1. Summary statistic used (e.g. RMA, GCRMA, MAS5.0, etc.). Note that Choe <it>et al</it>. <abbrgrp><abbr bid="B4">4</abbr></abbrgrp> broke this particular choice down to four separate sub-choices of methods for background correction, probe-level normalization, PM adjustment, and expression summary.</p>
         <p>2. Post-summarization normalization method. Choe <it>et al</it>. <abbrgrp><abbr bid="B4">4</abbr></abbrgrp> compared no further normalization against the use of a loess probeset-level normalization based on the known invariant probesets.</p>
         <p>3. Differential expression (DE) method. Choe <it>et al</it>. <abbrgrp><abbr bid="B4">4</abbr></abbrgrp> compared t-test, Cyber-T <abbrgrp><abbr bid="B19">19</abbr></abbrgrp> and SAM <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>.</p>
         <p>4. Direction of differential expression. Choe <it>et al</it>. <abbrgrp><abbr bid="B4">4</abbr></abbrgrp> used a 2-sided test (as opposed to, for example, a 1-sided test of up-regulation).</p>
         <p>5. Choice of true positives. Choe <it>et al</it>. <abbrgrp><abbr bid="B4">4</abbr></abbrgrp> used all spike-in probesets with fold-change (FC) greater than 1.</p>
         <p>6. Choice of true negatives. Choe <it>et al</it>. <abbrgrp><abbr bid="B4">4</abbr></abbrgrp> used all invariant probesets. This included both probesets that were spiked in at equal quantities, as well as the so-called "empty" probesets.</p>
         <p>Table <tblr tid="T1">1</tblr> shows the choices we believe were made in various studies of the Golden Spike data set. In addition to the studies identified in Table <tblr tid="T1">1</tblr>, Lemieux <abbrgrp><abbr bid="B21">21</abbr></abbrgrp> and Hess and Iyer <abbrgrp><abbr bid="B22">22</abbr></abbrgrp> report results of "probe-level" methods for detecting differential expression. We do not consider these approaches here. In addition to the choices at the six steps of the analysis pipeline highlighted above, there are choices to be made about how the data are displayed, and what metrics should be used for comparison. There are many types of "ROC-like" charts that can be created. An ROC chart is generally considered to be one where the x-axis shows the false-positive rate (FPR), and the y-axis the true-positive rate (TPR). This type of chart is used in the Liu <it>et al</it>. <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>, Hochreiter <it>et al</it>. <abbrgrp><abbr bid="B15">15</abbr></abbrgrp> and Chen <it>et al</it>. <abbrgrp><abbr bid="B16">16</abbr></abbrgrp> papers. Another type of ROC curve has the false-discovery rate (FDR) along the x-axis. This type of ROC curve was used in the original Choe <it>et al</it>. paper <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>. There is a large range of other types of chart for visualizing classifier performance <abbrgrp><abbr bid="B23">23</abbr></abbrgrp> that we have not considered. In addition, choices need to be made about whether to show the full ROC charts (with x- and y-axes both between 0 and 1), or whether to just display a part of the chart. While using the full ROC chart is the only way of assessing the performance of a method across the full range of data, this can result in charts where the lines of each method are very close together and hence difficult to distinguish. 
Often, an analyst is most interested in methods which will give the fewest false positives for a relatively small number of true positives, as only a small number of genes will be investigated further. In such cases it can often be informative to show the ROC chart for a much smaller range of FPRs, for example, between 0 and 0.05. The charts in the original Choe <it>et al</it>. paper <abbrgrp><abbr bid="B4">4</abbr></abbrgrp> use different x-axis cutoffs to show different aspects of the analysis.</p>
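         <p>To make the two chart types concrete, the following minimal sketch computes the points of an FPR-based ROC chart and of the FDR-based variant from a ranked list of probesets. The function name <it>roc_points</it>, the toy scores and the tie-handling rule are our own illustrative assumptions and are not taken from any of the cited studies.</p>
         <p>
```python
def roc_points(scores, is_true_positive):
    """Return (fpr, tpr, fdr) lists, one entry per cutoff in the ranking.

    scores: larger means more likely to be DE.
    is_true_positive: True for known DE probesets, False for true negatives.
    Ties are broken by input order, which is a simplification.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_pos = sum(1 for flag in is_true_positive if flag)
    n_neg = len(is_true_positive) - n_pos
    fpr, tpr, fdr = [0.0], [0.0], [0.0]
    tp = fp = 0
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        fpr.append(fp / n_neg)       # x-axis of the standard ROC chart
        tpr.append(tp / n_pos)       # y-axis of both chart types
        fdr.append(fp / (tp + fp))   # x-axis of the FDR-based variant
    return fpr, tpr, fdr


# Hypothetical scores for six probesets, three of them truly DE.
scores = [2.1, 1.9, 1.7, 0.9, 0.2, 1.1]
labels = [True, False, True, False, False, True]
fpr, tpr, fdr = roc_points(scores, labels)
print([round(x, 2) for x in tpr])  # [0.0, 0.33, 0.33, 0.67, 1.0, 1.0, 1.0]
print([round(x, 2) for x in fdr])  # [0.0, 0.0, 0.5, 0.33, 0.25, 0.4, 0.5]

# Restrict the chart to a small FPR range, here 0 to 0.05.
zoomed = [(x, y) for x, y in zip(fpr, tpr) if 0.05 >= x]
```
         </p>
         <p>Note that the FPR is non-decreasing along the ranking whereas the FDR is not, which is one reason the two chart types can look quite different for the same ranking.</p>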
         <tbl id="T1">
            <title>
               <p>Table 1</p>
            </title>
            <caption>
               <p>Analysis choices of various studies of the "Golden Spike" data set. These are choices we believe were made for each of the six stages of the analysis pipeline we have outlined.</p>
            </caption>
            <tblbdy cols="7">
               <r>
                  <c ca="left">
                     <p>Study</p>
                  </c>
                  <c ca="left">
                     <p>Summarization method</p>
                  </c>
                  <c ca="left">
                     <p>Post-summ Normalization</p>
                  </c>
                  <c ca="left">
                     <p>DE method</p>
                  </c>
                  <c ca="left">
                     <p>Dir</p>
                  </c>
                  <c ca="left">
                     <p>True positives</p>
                  </c>
                  <c ca="left">
                     <p>True negatives</p>
                  </c>
               </r>
               <r>
                  <c cspan="7">
                     <hr/>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Choe <it>et al</it>. [4]</p>
                  </c>
                  <c ca="left">
                     <p>CP, MAS5.0, RMA, GCRMA, MBEI plus many variants of these</p>
                  </c>
                  <c ca="left">
                     <p>none, loess_invariant</p>
                  </c>
                  <c ca="left">
                     <p>t-test, Cyber-T, SAM</p>
                  </c>
                  <c ca="left">
                     <p>either</p>
                  </c>
                  <c ca="left">
                     <p>FC >1</p>
                  </c>
                  <c ca="left">
                     <p>invariant</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Liu <it>et al</it>. [13]</p>
                  </c>
                  <c ca="left">
                     <p>CP, multi-mgMOS</p>
                  </c>
                  <c ca="left">
                     <p>loess_invariant</p>
                  </c>
                  <c ca="left">
                     <p>Cyber-T, PPLR</p>
                  </c>
                  <c ca="left">
                     <p>up</p>
                  </c>
                  <c ca="left">
                     <p>FC >1</p>
                  </c>
                  <c ca="left">
                     <p>invariant</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Hochreiter <it>et al</it>. [15]</p>
                  </c>
                  <c ca="left">
                     <p>MAS5.0, RMA, MBEI and FARMS</p>
                  </c>
                  <c ca="left">
                     <p>none</p>
                  </c>
                  <c ca="left">
                     <p>SAM</p>
                  </c>
                  <c ca="left">
                     <p>up</p>
                  </c>
                  <c ca="left">
                     <p>FC >1</p>
                  </c>
                  <c ca="left">
                     <p>invariant</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Chen <it>et al</it>. [16]</p>
                  </c>
                  <c ca="left">
                     <p>CP, MAS5.0, RMA, GCRMA, MBEI, PLIER, FARMS and DFW</p>
                  </c>
                  <c ca="left">
                     <p>none</p>
                  </c>
                  <c ca="left">
                     <p>FC</p>
                  </c>
                  <c ca="left">
                     <p>either</p>
                  </c>
                  <c ca="left">
                     <p>FC >1 and FC = x (for all x)</p>
                  </c>
                  <c ca="left">
                     <p>invariant</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Current study</p>
                  </c>
                  <c ca="left">
                     <p>CP, MAS5.0, RMA, GCRMA, MBEI, multi-mgMOS, FARMS, DFW, PLIER</p>
                  </c>
                  <c ca="left">
                     <p>none, loess_invariant, loess_equal, loess_all</p>
                  </c>
                  <c ca="left">
                     <p>FC, t-test, Cyber-T, limma and PPLR</p>
                  </c>
                  <c ca="left">
                     <p>either, up and down</p>
                  </c>
                  <c ca="left">
                     <p>FC >1, low FC, medium FC, high FC and FC = x (for all x)</p>
                  </c>
                  <c ca="left">
                     <p>equal and invariant</p>
                  </c>
               </r>
            </tblbdy>
         </tbl>
         <p>The most commonly used metric for assessing a DE detection method's performance is the Area Under the standard ROC Curve (AUC). This is typically calculated for the full ROC chart (i.e. FPR values from 0 to 1), but can also be calculated for a small portion of the chart (e.g. FPRs between 0 and 0.05). Other metrics that might be used are the number or proportion of true positives for a fixed number or proportion of false positives, or conversely the number or proportion of false positives for a fixed number or proportion of true positives.</p>
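         <p>As an illustration of the full and partial AUC described above, the following sketch integrates an ROC curve with the trapezoidal rule, optionally stopping at a maximum FPR. The function name <it>auc</it>, the <it>max_fpr</it> parameter and the toy curve are our own assumptions; the studies discussed here may have computed the area differently.</p>
         <p>
```python
def auc(fpr, tpr, max_fpr=1.0):
    """Trapezoidal area under an ROC curve for FPR in [0, max_fpr].

    fpr and tpr must be equal-length lists with fpr non-decreasing.
    """
    area = 0.0
    for k in range(1, len(fpr)):
        x0, x1, y0, y1 = fpr[k - 1], fpr[k], tpr[k - 1], tpr[k]
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Interpolate the curve at the cutoff and clip the segment.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# A hypothetical ROC curve and the line of no-discrimination.
fpr = [0.0, 0.0, 0.5, 0.5, 1.0]
tpr = [0.0, 0.5, 0.5, 1.0, 1.0]
print(auc(fpr, tpr))                  # 0.75
print(auc(fpr, tpr, max_fpr=0.25))    # 0.125
print(auc([0.0, 1.0], [0.0, 1.0]))    # 0.5
```
         </p>
         <p>The partial area is not rescaled here, so its maximum possible value equals the chosen FPR cutoff rather than 1.</p>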
         <p>In this study we have analyzed all combinations of the various options shown in the last row of Table <tblr tid="T1">1</tblr>. In addition, we have created charts displaying the data in different ways. In the next section we show how results can vary when making different choices at the stages of the analysis pipeline highlighted above. We also discuss what we believe are good choices. We detail a web resource called AffyDEComp which can be used as a limited benchmark for DE methods on Affymetrix data. We also highlight some issues of reproducibility in comparative studies. We conclude by making recommendations on choices of Affymetrix analysis methods, and desired characteristics of future spike-in data sets.</p>
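         <p>The idea of analyzing all combinations of options can be sketched as a Cartesian product over the six pipeline stages. The option labels below are copied from the last row of Table <tblr tid="T1">1</tblr>, but the dictionary keys are our own naming and "FC = x" is treated as a single placeholder for a whole family of choices, so the count printed describes this simplified grid only and is not the number of analyses performed in the study.</p>
         <p>
```python
from itertools import product

# Options from the "Current study" row of Table 1. "FC = x" stands in for a
# family of choices (one per spike-in fold change), which is an assumption
# made here to keep the sketch small.
options = {
    "summarization": ["CP", "MAS5.0", "RMA", "GCRMA", "MBEI",
                      "multi-mgMOS", "FARMS", "DFW", "PLIER"],
    "normalization": ["none", "loess_invariant", "loess_equal", "loess_all"],
    "de_method": ["FC", "t-test", "Cyber-T", "limma", "PPLR"],
    "direction": ["either", "up", "down"],
    "true_positives": ["FC > 1", "low FC", "medium FC", "high FC", "FC = x"],
    "true_negatives": ["equal", "invariant"],
}

# One dictionary per pipeline, mapping each stage to the option chosen.
pipelines = [dict(zip(options, combo)) for combo in product(*options.values())]

print(len(pipelines))               # 5400
print(pipelines[0]["de_method"])    # FC
```
         </p>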
      </sec>
      <sec>
         <st>
            <p>Results and Discussion</p>
         </st>
         <sec>
            <st>
               <p>Direction of Differential Expression</p>
            </st>
            <p>We can see from Table <tblr tid="T1">1</tblr> that studies to date have used either a 1-sided test or a 2-sided test for differential expression. A potential problem with using a 2-sided test on this data set becomes apparent if we compare the tests using the other analysis choices of Chen <it>et al</it>. <abbrgrp><abbr bid="B16">16</abbr></abbrgrp>. Figure <figr fid="F1">1</figr> shows the ROC charts created using a 2-sided test of differential expression, and 1-sided tests of up- and down-regulation. This was created using just those probesets that have a FC of 1.2 as true positives. Figure <figr fid="F1">1a</figr> is the equivalent of Figure 3 of Chen <it>et al</it>. <abbrgrp><abbr bid="B16">16</abbr></abbrgrp>. This appears to show that DFW has the strongest performance. However, if we look at Figure <figr fid="F1">1b</figr> and Figure <figr fid="F1">1c</figr> we see that the methods that appear to be performing strongly in Figure <figr fid="F1">1a</figr> are actually mainly detecting down-regulated genes. The reason for this becomes clear when we look at Figure 2 from Irizarry <it>et al</it>. <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>. There we see that spike-in genes with small fold changes greater than 1 actually have M values (i.e. log fold changes) that are generally lower than the M values of the "empty probesets", which form the majority of the negatives from which this chart was created.</p>
            <fig id="F1">
               <title>
                  <p>Figure 1</p>
               </title>
               <caption>
                  <p>Comparison of 1- and 2-sided tests of DE for very low FC genes</p>
               </caption>
               <text>
                  <p><b>Comparison of 1- and 2-sided tests of DE for very low FC genes</b>. ROC charts of Golden Spike data using a 2-sided and two 1-sided tests of DE. For these charts all unchanging probesets are used as true negatives, genes with FC of 1.2 are used as true positives, and no post-summarization normalization is used. We only show results for the FC DE detection method. The different charts show a.) probesets selected using a 2-sided test of DE, b.) probesets selected using a 1-sided test of up-regulation and c.) probesets selected using a 1-sided test of down-regulation. The diagonal line shows the "line of no-discrimination". This shows how well we would expect random guessing of class labels to perform.</p>
               </text>
               <graphic file="1471-2105-9-164-1"/>
            </fig>
            <p>The choice of whether 1-sided or 2-sided tests should be used for comparison of methods is debatable. A 1-sided test for down-regulation is clearly not a sensible choice given that all the known DE genes are up-regulated. We would expect a 1-sided test of up-regulation to give the strongest results, given that all the unequal spike-ins are up-regulated. However, in most real microarray data sets, we are likely to be interested in genes which show the highest likelihood of being DE, regardless of the direction of change. As such, we will continue to use both a 2-sided test, and a 1-sided test of up-regulation in the remainder of the paper. In our comprehensive analysis, however, we also include results for 1-sided tests of down-regulation for completeness.</p>
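            <p>For a simple ranking statistic such as the log fold change, the three directions of test correspond to three different ranking scores. The sketch below is illustrative only: the function names and the log fold change values are hypothetical, and other DE methods would derive their 1-sided and 2-sided rankings from their own statistics.</p>
            <p>
```python
def de_score(log_fc, direction):
    """Ranking score for one probeset; larger means more likely to be called DE."""
    if direction == "either":   # 2-sided: size of change, ignoring sign
        return abs(log_fc)
    if direction == "up":       # 1-sided test of up-regulation
        return log_fc
    if direction == "down":     # 1-sided test of down-regulation
        return -log_fc
    raise ValueError("unknown direction: " + direction)


def rank(log_fcs, direction):
    """Probeset names ordered from most to least likely to be DE."""
    return sorted(log_fcs, key=lambda name: -de_score(log_fcs[name], direction))


# Hypothetical log fold changes (S relative to C) for three probesets.
log_fcs = {"a": 0.8, "b": -1.2, "c": 0.1}
print(rank(log_fcs, "either"))  # ['b', 'a', 'c']
print(rank(log_fcs, "up"))      # ['a', 'c', 'b']
print(rank(log_fcs, "down"))    # ['b', 'c', 'a']
```
            </p>
            <p>Probeset "b" heads the 2-sided ranking purely because of a negative change, which mirrors how a 2-sided test can reward the detection of apparent down-regulation in this data set.</p>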
         </sec>
         <sec>
            <st>
               <p>True negatives</p>
            </st>
            <p>Figure <figr fid="F2">2</figr> shows the ROC charts created using the same choices as used in Figure <figr fid="F1">1</figr>, except that this time we use just the probesets which have been spiked in at equal concentrations as our true negatives. Here we see a very different picture. Firstly, the differences between different summarization methods are less pronounced when using a 2-sided test of DE. Also, the charts for detecting up- and down-regulated genes are quite similar. This indicates that it is actually very difficult for methods to distinguish these two classes. This is perhaps not surprising given the similarities in the fold changes (the true negatives have a FC of 1 and the true positives have a FC of 1.2). We should note, however, that the ROC curves detecting up-regulation (Figure <figr fid="F2">2b</figr>) are generally slightly above the diagonal (i.e. slightly better than random guessing), whereas the ROC curves detecting down-regulation (Figure <figr fid="F2">2c</figr>) are generally slightly below the diagonal (i.e. slightly worse than random guessing). This gives us confidence that by just using equal-valued spike-ins as our true negatives, our ROC curves can detect genuine improvements in detecting DE genes due to different methods.</p>
            <fig id="F2">
               <title>
                  <p>Figure 2</p>
               </title>
               <caption>
                  <p>Comparison of 1- and 2-sided tests using only equal spike-ins as true negatives</p>
               </caption>
               <text>
                  <p><b>Comparison of 1- and 2-sided tests using only equal spike-ins as true negatives</b>. ROC charts of Golden Spike data using a 2-sided and two 1-sided tests of DE, with only the equal spike-ins used as true negatives. Genes with FC of 1.2 are used as true positives, and no post-summarization normalization is used. We only show results for the FC DE detection method. The legend is the same as in Figure 1. The different charts show a.) probesets selected using a 2-sided test of DE, b.) probesets selected using a 1-sided test of up-regulation and c.) probesets selected using a 1-sided test of down-regulation. As with Figure 1, we include lines of no-discrimination.</p>
               </text>
               <graphic file="1471-2105-9-164-2"/>
            </fig>
            <p>Irizarry <it>et al</it>. <abbrgrp><abbr bid="B6">6</abbr></abbrgrp> showed that the FCs of the equal concentration spike-ins are quite different from those of the empty probesets. Another difference between these two sets of probesets is in their intensities. Figure <figr fid="F3">3</figr> shows density plots of the intensities of the equal and empty probesets. Figure <figr fid="F3">3</figr> also shows density plots of intensities of unchanging (i.e. equal or empty) probesets, and of the true positives (spike-ins with FC > 1). The first thing to note is that the plots for empty and unchanging probesets are very similar. This is to be expected as there are many more empty probesets than equal probesets. We also see that, although there are differences between the equal and TP plots (the confounding between concentration and FC identified by Irizarry <it>et al</it>. <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>), these are not nearly so pronounced as the differences between the unchanging and TP plots. Indeed, from Figure <figr fid="F3">3</figr> we can see that a classifier based on intensity alone would separate the unchanging probesets from the TPs well. This fact, together with the artifact identified by Irizarry <it>et al</it>. <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>, leads us to recommend using only the equal concentration spike-ins as the set of true negatives for method comparison. In our comprehensive analysis, however, we also include results when using all the unchanging probesets, for completeness.</p>
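            <p>The two choices of true negatives amount to two different ways of labeling probesets before an ROC chart is built. The following sketch shows one way this selection might be expressed; the function name, the class labels and the probeset identifiers are hypothetical and are used only for illustration.</p>
            <p>
```python
def evaluation_set(probeset_class, true_negatives="equal"):
    """Return {probeset_id: is_true_positive} for one choice of true negatives.

    probeset_class maps each id to "de", "equal" or "empty".
    true_negatives is "equal" (equal-concentration spike-ins only) or
    "invariant" (equal plus empty probesets).
    """
    if true_negatives == "equal":
        negative_classes = {"equal"}
    elif true_negatives == "invariant":
        negative_classes = {"equal", "empty"}
    else:
        raise ValueError("unknown true-negative choice: " + true_negatives)
    labels = {}
    for probeset_id, cls in probeset_class.items():
        if cls == "de":
            labels[probeset_id] = True
        elif cls in negative_classes:
            labels[probeset_id] = False
        # Any other probeset is left out of the evaluation entirely.
    return labels


# Hypothetical probesets: one DE, one equal spike-in, two empty.
classes = {"p1": "de", "p2": "equal", "p3": "empty", "p4": "empty"}
print(evaluation_set(classes, "equal"))           # {'p1': True, 'p2': False}
print(len(evaluation_set(classes, "invariant")))  # 4
```
            </p>
            <p>With the "equal" choice the empty probesets take no part in the calculation of the false-positive rate, so they can no longer dominate the set of negatives.</p>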
            <fig id="F3">
               <title>
                  <p>Figure 3</p>
               </title>
               <caption>
                  <p>Density plots of intensities for different choices of true negatives</p>
               </caption>
               <text>
                  <p><b>Density plots of intensities for different choices of true negatives</b>. These plots show the distributions of intensities of perfect match (PM) probes across all six arrays of the Golden Spike data, for different subsets of probesets. We show plots for three potential choices of true negative (TN) probesets. The Empty probesets are defined as those for which there are no corresponding spike-in RNAs. The Equal probesets are defined as those spiked in at equal concentrations in the C and S conditions. The Unchanging probesets are defined as the set of all Empty and Equal probesets. For this chart we have defined true positives (TP) as those probesets which have been spiked in at higher concentration in the S condition relative to the C condition.</p>
               </text>
               <graphic file="1471-2105-9-164-3"/>
            </fig>
         </sec>
         <sec>
            <st>
               <p>Post-Summarization Normalization</p>
            </st>
            <p>Thus far, we have not considered the effect of post-summarization normalization, which was shown by Choe <it>et al</it>. <abbrgrp><abbr bid="B4">4</abbr></abbrgrp> to have a significant effect on results. Figure <figr fid="F4">4</figr> shows the effect of such normalizations. Note that unlike Figures <figr fid="F1">1</figr> and <figr fid="F2">2</figr> we are here treating all of the spike-ins with FC > 1 as our true positives, not just those with FC = 1.2. Here we can see that post-summarization loess normalization improves results, which is consistent with the results of Choe <it>et al</it>. <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>. Furthermore, we see that post-summarization normalization using just the equal-valued spike-ins improves results to a greater extent than using a loess normalization based on all probesets.</p>
            <fig id="F4">
               <title>
                  <p>Figure 4</p>
               </title>
               <caption>
                  <p>Comparison of different post-summarization normalization strategies</p>
               </caption>
               <text>
                  <p><b>Comparison of different post-summarization normalization strategies</b>. ROC charts of Golden Spike data using a 2-sided and a 1-sided test of DE, and using three different post-summarization normalization strategies. For these charts only the equal spike-ins are used as true negatives, and all spike-ins with FC > 1 are used as true positives. We only show results for the FC DE detection method. The top row relates to data sets created without any post-summarization normalization. The middle row relates to data sets created using all probesets for the loess normalization. The bottom row relates to data sets created using only the equal spike-in probesets for the loess normalization. The left column shows probesets selected using a 2-sided test of DE. The right column shows probesets selected using a 1-sided test of up-regulation.</p>
               </text>
               <graphic file="1471-2105-9-164-4"/>
            </fig>
            <p>We agree with Gaile and Miecznikowski <abbrgrp><abbr bid="B7">7</abbr></abbrgrp> that "the invariant set of genes used for the pre-processing steps in Choe <it>et al</it>. should not have included the empty null probesets". As such, for the remainder of this paper we will not use the empty probesets in loess normalization. In our comprehensive analysis we also include, for completeness, results when using all of the following post-summarization normalization strategies: no post-summarization normalization, a loess normalization based on all spike-in probesets, a loess normalization based on all the unchanging probesets and a loess normalization based on the equal-valued spike-ins.</p>
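            <p>The idea of fitting a loess curve on an invariant subset of probesets and then applying it to all probesets can be sketched as follows. This is a simplified illustration and not the code used to produce the results in this paper: it assumes the curve is fitted to log-ratios (M) against average log-intensities (A), uses a local linear fit with tricube weights, and takes a span of 0.75 as an arbitrary illustrative value. All function names and the toy data are our own assumptions.</p>

```python
def local_linear(x0, xs, ys, span):
    """Tricube-weighted local linear fit evaluated at x0."""
    k = max(3, int(round(span * len(xs))))
    nearest = sorted(zip(xs, ys), key=lambda pt: abs(pt[0] - x0))[:k]
    # Slightly inflate the bandwidth so the farthest neighbour keeps
    # a small positive weight; fall back to 1.0 if all distances are 0.
    h = 1.001 * max(abs(x - x0) for x, _ in nearest) or 1.0
    w = [(1.0 - (abs(x - x0) / h) ** 3) ** 3 for x, _ in nearest]
    sw = sum(w)
    mx = sum(wi * x for wi, (x, _) in zip(w, nearest)) / sw
    my = sum(wi * y for wi, (_, y) in zip(w, nearest)) / sw
    sxx = sum(wi * (x - mx) ** 2 for wi, (x, _) in zip(w, nearest))
    sxy = sum(wi * (x - mx) * (y - my) for wi, (x, y) in zip(w, nearest))
    slope = sxy / sxx if sxx > 0 else 0.0
    return my + slope * (x0 - mx)


def loess_normalize(a_all, m_all, invariant_idx, span=0.75):
    """Subtract a curve fitted on the invariant probesets from every M value."""
    a_inv = [a_all[i] for i in invariant_idx]
    m_inv = [m_all[i] for i in invariant_idx]
    return [m - local_linear(a, a_inv, m_inv, span)
            for a, m in zip(a_all, m_all)]


# Toy data: an intensity-dependent bias affects every probeset; the
# first six probesets are "equal" spike-ins, the last two are truly
# changed by one log2 unit.
a = [6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 7.5, 9.5]
bias = [0.5 - 0.05 * x for x in a]
m = bias[:6] + [b + 1.0 for b in bias[6:]]

normalized = loess_normalize(a, m, invariant_idx=range(6))
print([round(v, 3) for v in normalized])
```

            <p>Changing the invariant set passed as invariant_idx corresponds to switching between the normalization strategies listed above; the choice of span is exactly the kind of parameter that needs to be reported for results to be reproducible.</p>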
         </sec>
         <sec>
            <st>
               <p>Differential Expression Detection Methods</p>
            </st>
            <p>We turn now to the issue of DE detection methods. Figure <figr fid="F5">5</figr> shows ROC charts created with different combinations of summarization and DE methods. Different colors are used to identify different DE methods, and different line types are used to identify different summarization methods. Tables <tblr tid="T2">2</tblr> and <tblr tid="T3">3</tblr> show the AUCs of the ROC charts of Figure <figr fid="F5">5</figr>, with the top 10 performing combinations of summarization and DE detection methods shown in bold. Of the DE methods, Cyber-T appears to have particularly good performance, with 5 of the top 10 AUCs when using a 2-sided test, and 4 of the top 10 AUCs when looking specifically for up-regulation. Of the other DE methods, limma is the only method to have more than 1 AUC in the top 10 for both 2-sided and 1-sided tests. Looking at the summarization methods, multi-mgMOS has 4 AUCs in the top 10 for both 2-sided and 1-sided tests, while both CP and GCRMA have 2 AUCs in the top 10 for both tests. The top AUC in both 2-sided and 1-sided tests is obtained using multi-mgMOS and PPLR.</p>
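            <p>The tallies quoted above for the 2-sided test can be checked directly from the values in Table <tblr tid="T2">2</tblr>. The following short sketch transcribes those AUCs, selects the ten highest and counts how often each DE detection method and each summarization method appears. The variable and function names are our own, and None marks combinations for which no PPLR value is given.</p>

```python
from collections import Counter

DE_METHODS = ["limma", "FC", "t-test", "Cyber-T", "PPLR"]

# AUC values transcribed from Table 2 (2-sided test of DE).
TABLE2 = {
    "mmgMOS": [0.903, 0.861, 0.902, 0.919, 0.922],
    "MAS5":   [0.884, 0.848, 0.879, 0.905, None],
    "CP":     [0.905, 0.873, 0.898, 0.919, 0.889],
    "PLIER":  [0.898, 0.889, 0.889, 0.911, None],
    "RMA":    [0.881, 0.885, 0.858, 0.886, 0.860],
    "GCRMA":  [0.890, 0.902, 0.883, 0.909, None],
    "DFW":    [0.764, 0.815, 0.732, 0.703, 0.806],
    "MBEI":   [0.885, 0.884, 0.870, 0.897, 0.855],
    "FARMS":  [0.842, 0.891, 0.805, 0.844, 0.772],
}


def top_combinations(table, methods, n=10):
    """Return the n (summarization, DE method, AUC) triples with highest AUC."""
    triples = [
        (summ, de, value)
        for summ, row in table.items()
        for de, value in zip(methods, row)
        if value is not None
    ]
    return sorted(triples, key=lambda t: t[2], reverse=True)[:n]


top10 = top_combinations(TABLE2, DE_METHODS)
by_de = Counter(de for _, de, _ in top10)
by_summ = Counter(summ for summ, _, _ in top10)
print(by_de)
print(by_summ)
```

            <p>The same function can be applied to the values in Table <tblr tid="T3">3</tblr> to obtain the corresponding tallies for the 1-sided test of up-regulation.</p>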
            <fig id="F5">
               <title>
                  <p>Figure 5</p>
               </title>
               <caption>
                  <p>Comparison of combinations of summarization/DE detection methods</p>
               </caption>
               <text>
                  <p><b>Comparison of combinations of summarization/DE detection methods</b>. ROC charts of Golden Spike data using a 2-sided and a 1-sided test of DE, using different combinations of summarization and DE detection methods. For these charts only the equal spike-ins are used as true negatives, and all spike-ins with FC > 1 are used as true positives. A post-summarization loess normalization based on the equal-valued spike-ins was used. The different charts show a.) probesets selected using a 2-sided test of DE, and b.) probesets selected using a 1-sided test of up-regulation. The two legends refer to both a.) and b.)</p>
               </text>
               <graphic file="1471-2105-9-164-5"/>
            </fig>
            <tbl id="T2">
               <title>
                  <p>Table 2</p>
               </title>
               <caption>
                  <p>AUCs for 2-sided test of DE. This table shows AUC values for different combinations of summarization and DE detection methods. The 10 highest AUC values are highlighted in bold. Note that the PPLR method is only applicable to summarization methods that give uncertainty estimates as well as mean expression levels for each probeset. These results were calculated using only the equal spike-ins as true negatives, and all spike-ins with FC > 1 as true positives. A post-summarization loess normalization using the equal-valued spike-ins was used. The results in this table are for 2-sided tests of DE.</p>
               </caption>
               <tblbdy cols="6">
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="right">
                        <p>limma</p>
                     </c>
                     <c ca="right">
                        <p>FC</p>
                     </c>
                     <c ca="right">
                        <p>t-test</p>
                     </c>
                     <c ca="right">
                        <p>Cyber-T</p>
                     </c>
                     <c ca="right">
                        <p>PPLR</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="6">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="right">
                        <p>mmgMOS</p>
                     </c>
                     <c ca="right">
                        <p>
                           <b>0.903</b>
                        </p>
                     </c>
                     <c ca="right">
                        <p>0.861</p>
                     </c>
                     <c ca="right">
                        <p>
                           <b>0.902</b>
                        </p>
                     </c>
                     <c ca="right">
                        <p>
                           <b>0.919</b>
                        </p>
                     </c>
                     <c ca="right">
                        <p>
                           <b>0.922</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c ca="right">
                        <p>MAS5</p>
                     </c>
                     <c ca="right">
                        <p>0.884</p>
                     </c>
                     <c ca="right">
                        <p>0.848</p>
                     </c>
                     <c ca="right">
                        <p>0.879</p>
                     </c>
                     <c ca="right">
                        <p>
                           <b>0.905</b>
                        </p>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="right">
                        <p>CP</p>
                     </c>
                     <c ca="right">
                        <p>
                           <b>0.905</b>
                        </p>
                     </c>
                     <c ca="right">
                        <p>0.873</p>
                     </c>
                     <c ca="right">
                        <p>0.898</p>
                     </c>
                     <c ca="right">
                        <p>
                           <b>0.919</b>
                        </p>
                     </c>
                     <c ca="right">
                        <p>0.889</p>
                     </c>
                  </r>
                  <r>
                     <c ca="right">
                        <p>PLIER</p>
                     </c>
                     <c ca="right">
                        <p>0.898</p>
                     </c>
                     <c ca="right">
                        <p>0.889</p>
                     </c>
                     <c ca="right">
                        <p>0.889</p>
                     </c>
                     <c ca="right">
                        <p>
                           <b>0.911</b>
                        </p>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="right">
                        <p>RMA</p>
                     </c>
                     <c ca="right">
                        <p>0.881</p>
                     </c>
                     <c ca="right">
                        <p>0.885</p>
                     </c>
                     <c ca="right">
                        <p>0.858</p>
                     </c>
                     <c ca="right">
                        <p>0.886</p>
                     </c>
                     <c ca="right">
                        <p>0.860</p>
                     </c>
                  </r>
                  <r>
                     <c ca="right">
                        <p>GCRMA</p>
                     </c>
                     <c ca="right">
                        <p>0.890</p>
                     </c>
                     <c ca="right">
                        <p>
                           <b>0.902</b>
                        </p>
                     </c>
                     <c ca="right">
                        <p>0.883</p>
                     </c>
                     <c ca="right">
                        <p>
                           <b>0.909</b>
                        </p>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="right">
                        <p>DFW</p>
                     </c>
                     <c ca="right">
                        <p>0.764</p>
                     </c>
                     <c ca="right">
                        <p>0.815</p>
                     </c>
                     <c ca="right">
                        <p>0.732</p>
                     </c>
                     <c ca="right">
                        <p>0.703</p>
                     </c>
                     <c ca="right">
                        <p>0.806</p>
                     </c>
                  </r>
                  <r>
                     <c ca="right">
                        <p>MBEI</p>
                     </c>
                     <c ca="right">
                        <p>0.885</p>
                     </c>
                     <c ca="right">
                        <p>0.884</p>
                     </c>
                     <c ca="right">
                        <p>0.870</p>
                     </c>
                     <c ca="right">
                        <p>0.897</p>
                     </c>
                     <c ca="right">
                        <p>0.855</p>
                     </c>
                  </r>
                  <r>
                     <c ca="right">
                        <p>FARMS</p>
                     </c>
                     <c ca="right">
                        <p>0.842</p>
                     </c>
                     <c ca="right">
                        <p>0.891</p>
                     </c>
                     <c ca="right">
                        <p>0.805</p>
                     </c>
                     <c ca="right">
                        <p>0.844</p>
                     </c>
                     <c ca="right">
                        <p>0.772</p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
            <tbl id="T3">
               <title>
                  <p>Table 3</p>
               </title>
               <caption>
                  <p>AUCs for 1-sided test of up-regulation. This table shows AUC values for different combinations of summarization and DE detection methods. The 10 highest AUC values are highlighted in bold. Note that the PPLR method is only applicable to summarization methods that give uncertainty estimates as well as mean expression levels for each probeset. These results were calculated using only the equal spike-ins as true negatives, and all spike-ins with FC > 1 as true positives. A post-summarization loess normalization using the equal-valued spike-ins was used. The results in this table are for 1-sided tests of up-regulation.</p>
               </caption>
               <tblbdy cols="6">
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="right">
                        <p>limma</p>
                     </c>
                     <c ca="right">
                        <p>FC</p>
                     </c>
                     <c ca="right">
                        <p>t-test</p>
                     </c>
                     <c ca="right">
                        <p>Cyber-T</p>
                     </c>
                     <c ca="right">
                        <p>PPLR</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="6">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="right">
                        <p>mmgMOS</p>
                     </c>
                     <c ca="right">
                        <p>
                           <b>0.940</b>
                        </p>
                     </c>
                     <c ca="right">
                        <p>0.920</p>
                     </c>
                     <c ca="right">
                        <p>
                           <b>0.938</b>
                        </p>
                     </c>
                     <c ca="right">
                        <p>
                           <b>0.949</b>
                        </p>
                     </c>
                     <c ca="right">
                        <p>
                           <b>0.951</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c ca="right">
                        <p>MAS5</p>
                     </c>
                     <c ca="right">
                        <p>0.924</p>
                     </c>
                     <c ca="right">
                        <p>0.908</p>
                     </c>
                     <c ca="right">
                        <p>0.921</p>
                     </c>
                     <c ca="right">
                        <p>0.934</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="right">
                        <p>CP</p>
                     </c>
                     <c ca="right">
                        <p>
                           <b>0.940</b>
                        </p>
                     </c>
                     <c ca="right">
                        <p>0.928</p>
                     </c>
                     <c ca="right">
                        <p>0.935</p>
                     </c>
                     <c ca="right">
                        <p>
                           <b>0.948</b>
                        </p>
                     </c>
                     <c ca="right">
                        <p>0.932</p>
                     </c>
                  </r>
                  <r>
                     <c ca="right">
                        <p>PLIER</p>
                     </c>
                     <c ca="right">
                        <p>0.934</p>
                     </c>
                     <c ca="right">
                        <p>0.929</p>
                     </c>
                     <c ca="right">
                        <p>0.930</p>
                     </c>
                     <c ca="right">
                        <p>
                           <b>0.941</b>
                        </p>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="right">
                        <p>RMA</p>
                     </c>
                     <c ca="right">
                        <p>0.929</p>
                     </c>
                     <c ca="right">
                        <p>0.932</p>
                     </c>
                     <c ca="right">
                        <p>0.914</p>
                     </c>
                     <c ca="right">
                        <p>0.932</p>
                     </c>
                     <c ca="right">
                        <p>0.917</p>
                     </c>
                  </r>
                  <r>
                     <c ca="right">
                        <p>GCRMA</p>
                     </c>
                     <c ca="right">
                        <p>0.926</p>
                     </c>
                     <c ca="right">
                        <p>
                           <b>0.946</b>
                        </p>
                     </c>
                     <c ca="right">
                        <p>0.921</p>
                     </c>
                     <c ca="right">
                        <p>
                           <b>0.944</b>
                        </p>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="right">
                        <p>DFW</p>
                     </c>
                     <c ca="right">
                        <p>0.817</p>
                     </c>
                     <c ca="right">
                        <p>0.918</p>
                     </c>
                     <c ca="right">
                        <p>0.794</p>
                     </c>
                     <c ca="right">
                        <p>0.830</p>
                     </c>
                     <c ca="right">
                        <p>0.912</p>
                     </c>
                  </r>
                  <r>
                     <c ca="right">
                        <p>MBEI</p>
                     </c>
                     <c ca="right">
                        <p>0.928</p>
                     </c>
                     <c ca="right">
                        <p>0.928</p>
                     </c>
                     <c ca="right">
                        <p>0.920</p>
                     </c>
                     <c ca="right">
                        <p>0.934</p>
                     </c>
                     <c ca="right">
                        <p>0.915</p>
                     </c>
                  </r>
                  <r>
                     <c ca="right">
                        <p>FARMS</p>
                     </c>
                     <c ca="right">
                        <p>0.883</p>
                     </c>
                     <c ca="right">
                        <p>
                           <b>0.938</b>
                        </p>
                     </c>
                     <c ca="right">
                        <p>0.847</p>
                     </c>
                     <c ca="right">
                        <p>0.908</p>
                     </c>
                     <c ca="right">
                        <p>0.893</p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
            <p>The end goal of an analysis is often to identify a small number of genes for further analysis. As such, we might be interested not in how well a method performs on the whole of a data set, but specifically in how well it performs in identifying those genes determined to be most likely to be DE. We are therefore particularly interested in the ROC chart at the lowest values of FPR. Figure <figr fid="F6">6</figr> shows the same ROC curves as Figure <figr fid="F5">5b</figr> up to FPR values of 0.04. From Figure <figr fid="F6">6</figr> we can see that, although the combination of multi-mgMOS and PPLR has the highest overall AUC, this method does not have the strongest performance for most values of FPR between 0 and 0.04. For FPR values between about 0.005 and 0.03, the combination of CP and Cyber-T has the strongest performance. For even lower FPR values, both FARMS and DFW in combination with FC are the strongest performers for small ranges of FPR.</p>
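            <p>One way to summarize performance at low FPR is to compute the area under the ROC curve only up to a chosen FPR cut-off. The sketch below builds the ROC curve by sweeping a threshold over the scores and integrates it with the trapezoid rule, interpolating at the cut-off. It is an illustration under our own assumptions: the function names are hypothetical, the scores and labels are synthetic, and the cut-off of 0.3 is chosen only to suit the tiny example.</p>

```python
def roc_points(scores, labels):
    """ROC curve as (FPR, TPR) points, sweeping the threshold downwards.

    labels are 1 for true positives and 0 for true negatives; tied
    scores are entered into the curve together as a single step.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda sl: sl[0], reverse=True)
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for i, (score, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        last_of_tie = i + 1 == len(ranked) or ranked[i + 1][0] != score
        if last_of_tie:
            points.append((fp / n_neg, tp / n_pos))
    return points


def partial_auc(points, max_fpr):
    """Trapezoid-rule area under the ROC curve for FPR from 0 to max_fpr."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Interpolate the TPR at the cut-off and clip the segment.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Synthetic example: 4 true positives and 5 true negatives.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0, 0]

curve = roc_points(scores, labels)
full_area = partial_auc(curve, 1.0)
low_fpr_area = partial_auc(curve, 0.3)
print(full_area, low_fpr_area)
```

            <p>Dividing the partial area by the cut-off rescales it to lie between 0 and 1, which can make values for different cut-offs easier to compare.</p>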
            <fig id="F6">
               <title>
                  <p>Figure 6</p>
               </title>
               <caption>
                  <p>Comparison of combinations of summarization/DE detection methods at low false positive rates</p>
               </caption>
               <text>
                  <p><b>Comparison of combinations of summarization/DE detection methods at low false positive rates</b>. ROC charts of Golden Spike data using a 1-sided test of DE, using different combinations of summarization and DE detection methods, and showing only false positive rates between 0 and 0.04, and false negative rates between 0.5 and 0.9. For these charts only the equal spike-ins are used as true negatives, and all spike-ins with FC > 1 are used as true positives. A post-summarization loess normalization based on the equal-valued spike-ins was used. The legend is the same as in Figure 5.</p>
               </text>
               <graphic file="1471-2105-9-164-6"/>
            </fig>
            <p>Figure <figr fid="F6">6</figr> can be used for overall comparisons of DE methods. In general, we see that Cyber-T tends to outperform limma, and both of these methods generally outperform the use of standard t-tests. The performance of FC as a DE detection method varies much more, depending on the summarization method used. When FC is used in combination with DFW, FARMS or GCRMA, performance is generally amongst the best. However, performance of FC with RMA, MBEI and PLIER is less strong, and the combination of FC with multi-mgMOS, MAS5.0 or CP is particularly poor. Of the summarization methods that perform well with FC, FARMS and DFW have generally poor performance when used in combination with other methods. GCRMA has reasonable performance in combination with Cyber-T, but is in the lower half of summarization methods when used in combination with either limma or standard t-tests.</p>
         </sec>
         <sec>
            <st>
               <p>True positives</p>
            </st>
            <p>So far we have used all of the genes that are spiked-in at higher concentrations in the S samples relative to the C samples as our true positives. This is perhaps the best and fairest way to determine overall performance of a DE detection method. However, we might also be interested in whether certain methods perform particularly well in "easier" or "more difficult" cases. Indeed, many analysts are only interested in genes which are determined not only to have a probability of being DE that is significant, but also to have a FC which is greater than some pre-determined threshold. In order to determine which methods perform more strongly in "easy" or "difficult" cases, we can restrict our true positives to just those genes that are known to be DE by just a small FC, or to those that are very highly DE.</p>
            <p>Figure <figr fid="F7">7</figr> shows AUC values where the true positives are a subset of all the DE genes. The subsets are determined by the known FCs. The first thing to note from Figure <figr fid="F7">7</figr> is that methods generally perform much better at detecting high FC genes than at detecting low FC genes, as is to be expected. From Figure <figr fid="F7">7</figr> we can also see that methods that perform well overall tend to also perform well regardless of whether the FCs are low, medium or high. There are, nonetheless, differences in the ranking of methods in each case. For example, although the combination of multi-mgMOS and PPLR was shown to have the highest AUC overall, it is outperformed by the combination of RMA and FC when considering either medium or high FC genes as true positives. Conversely, RMA/FC is outperformed by many other summarization/DE detection combinations for low FC genes. These results show us that the performance of a method may depend on the balance of easy and difficult cases.</p>
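            <p>The grouping of true positives by known FC, and the transformation used on the y-axis of Figure <figr fid="F7">7</figr>, can be sketched as follows. The class boundaries are those given in the legend of Figure <figr fid="F7">7</figr>; the function names are our own, and the use of the natural logarithm is an assumption, since any base gives the same rank order.</p>

```python
import math


def fc_class(fc):
    """Assign a known fold change to the Low, Medium or High class.

    Returns None for values that fall outside all three classes,
    including the equal spike-ins (FC of 1).
    """
    if fc >= 3.0:
        return "High"
    if 2.5 >= fc >= 2.0:
        return "Medium"
    if 1.7 >= fc > 1.0:
        return "Low"
    return None


def neg_log_error(auc):
    """Transform AUC to -log(1 - AUC); undefined for an AUC of exactly 1."""
    return -math.log(1.0 - auc)


classes = [fc_class(fc) for fc in (1.0, 1.2, 1.7, 2.0, 2.5, 3.0, 4.0)]
print(classes)

# Two AUCs that differ by 0.011 are further apart after the transform.
spread = neg_log_error(0.951) - neg_log_error(0.940)
print(round(spread, 3))
```

            <p>Because the transformation is strictly increasing in AUC, it preserves the ranking of methods while spreading out values close to 1.</p>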
            <fig id="F7">
               <title>
                  <p>Figure 7</p>
               </title>
               <caption>
                  <p>Comparison of different choices of true positives</p>
               </caption>
               <text>
                  <p><b>Comparison of different choices of true positives</b>. Areas under ROC curves of Golden Spike data using different combinations of summarization and DE detection methods, and different sets of true positives. For these charts only the equal spike-ins are used as true negatives. The chart shows probesets selected using a 1-sided test of up-regulation. The Low true positives are those spike-ins with a FC greater than 1 but less than or equal to 1.7. The Medium true positives are those spike-ins with a FC between 2 and 2.5 inclusive. The High true positives are those spike-ins with a FC greater than or equal to 3. The y-axis shows -log(1-AUC) rather than AUC, as this gives a better separation between the higher AUC values, but retains the same rank order of methods. The x-axis is categorical, with points jittered to avoid placement on top of each other.</p>
               </text>
               <graphic file="1471-2105-9-164-7"/>
            </fig>
         </sec>
         <sec>
            <st>
               <p>Comprehensive Analysis</p>
            </st>
            <p>We have created ROC charts for each combination of analysis choices from the final row of Table <tblr tid="T1">1</tblr>. For each of these combinations we have created ROC charts where the x-axis shows FPR, and where the x-axis shows FDR. We have also created charts where FPR/FDR has the full range of 0 to 1, and where FPR/FDR has the range 0 to 0.05. We have created a web resource called AffyDEComp <abbrgrp><abbr bid="B24">24</abbr></abbrgrp> where ROC charts can be displayed by specifying the analysis pipeline choices. In addition, AUC charts similar to Figure <figr fid="F7">7</figr> are also shown for different combinations of analysis pipeline choices. AffyDEComp also includes a table of thirteen key performance metrics for each combination of summarization and DE detection methods. The metrics used are:</p>
            <p>1. AUC where equal-valued spike-ins are used as true negatives, spike-ins with FC > 1 are used as true positives, a post-summarization loess normalization based on the equal-valued spike-ins is used, and a 1-sided test of up-regulation is the DE metric. This gives the values shown in Table <tblr tid="T3">3</tblr>.</p>
            <p>2. as 1. but using a 2-sided test of DE. This gives the values shown in Table <tblr tid="T2">2</tblr>.</p>
            <p>3. as 1. but with low FC spike-ins used as true positives. This gives the values shown in Figure <figr fid="F7">7</figr>.</p>
            <p>4. as 1. but with medium FC spike-ins used as true positives. This gives the values shown in Figure <figr fid="F7">7</figr>.</p>
            <p>5. as 1. but with high FC spike-ins used as true positives. This gives the values shown in Figure <figr fid="F7">7</figr>.</p>
            <p>6. as 1. but with all unchanging probesets used as true negatives.</p>
            <p>7. as 1. but with all unchanging probesets used as true negatives, and a post-summarization loess normalization based on the unchanging probesets.</p>
            <p>8. as 1. but with a post-summarization loess normalization based on all spike-in probesets.</p>
            <p>9. as 1. but with no post-summarization normalization.</p>
            <p>10. as 1. but giving the AUC for FPRs up to 0.01.</p>
            <p>11. the proportion of true positives without any false positives (i.e. the TPR for a FPR of 0), using the same conditions as 1.</p>
            <p>12. the TPR for a FPR of 0.5, using the same conditions as 1.</p>
            <p>13. the FPR for a TPR of 0.5, using the same conditions as 1.</p>
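            <p>Metrics 11 to 13 are single points read off the ROC curve. The following minimal sketch shows one way such values could be computed from a ranked list of probesets. It is not the AffyDEComp implementation: the function names and the synthetic scores are our own assumptions, and for simplicity it assumes that no two probesets have exactly the same score.</p>

```python
def ranked_labels(scores, labels):
    """Labels ordered from the highest score to the lowest."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in order]


def tpr_at_fpr(ranked, max_fpr):
    """Largest TPR reachable while the FPR stays at or below max_fpr."""
    n_pos = sum(ranked)
    n_neg = len(ranked) - n_pos
    tp = fp = best = 0
    for label in ranked:
        tp += label
        fp += 1 - label
        if fp / n_neg > max_fpr:
            break
        best = tp
    return best / n_pos


def fpr_at_tpr(ranked, min_tpr):
    """Smallest FPR at which the TPR first reaches min_tpr."""
    n_pos = sum(ranked)
    n_neg = len(ranked) - n_pos
    tp = fp = 0
    for label in ranked:
        tp += label
        fp += 1 - label
        if tp / n_pos >= min_tpr:
            return fp / n_neg
    return 1.0


# Synthetic example: 1 marks a true positive, 0 a true negative.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
labels = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
ranked = ranked_labels(scores, labels)

tpr_no_fp = tpr_at_fpr(ranked, 0.0)      # in the style of metric 11
tpr_half_fpr = tpr_at_fpr(ranked, 0.5)   # in the style of metric 12
fpr_half_tpr = fpr_at_tpr(ranked, 0.5)   # in the style of metric 13
print(tpr_no_fp, tpr_half_fpr, fpr_half_tpr)
```

            <p>With tied scores the tied probesets would need to be entered into the ranking together, so that the result does not depend on their arbitrary order.</p>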
            <p>We are happy to include other methods if they are made available through Bioconductor packages. We also intend to extend AffyDEComp to include future spike-in data sets as they become available. In this way we expect this web resource to become a valuable tool in comparing the performance of both summarization and DE detection methods.</p>
         </sec>
         <sec>
            <st>
               <p>Reproducible Research</p>
            </st>
            <p>One of the main problems with comparing different analyses of the same data sets is knowing exactly what code has been used to create results. As an example, the loess normalization used in a number of the papers shown in Table <tblr tid="T1">1</tblr> has a "span" parameter. None of the papers mention what value has been used for this parameter, though Choe <it>et al</it>. <abbrgrp><abbr bid="B4">4</abbr></abbrgrp> have made all their source code available, albeit on their website rather than as supplementary information to their paper. We believe that the only way to provide analysis results that are reproducible is to either:</p>
            <p>1. provide full details of all parameter choices used in the paper's Methods section, or</p>
            <p>2. make the code used to create the results available, ideally as supplementary information to ensure a permanent record.</p>
            <p>We recommend that journals should not accept method comparison papers unless at least one of these is done. This paper was prepared as a "Sweave" document <abbrgrp><abbr bid="B25">25</abbr></abbrgrp>. The source code for this document is a mixture of LaTeX and R code. We have made the source code available as Additional file <supplr sid="S1">1</supplr>. This means that all the code used to create all the results in this paper, and in AffyDEComp <abbrgrp><abbr bid="B24">24</abbr></abbrgrp>, is available and all results can be recreated using open source tools.</p>
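            <p>As a small illustration of the first option above, every parameter choice can be written to a machine-readable record that refuses to be created while any choice is left unstated. The sketch below is in Python rather than the Sweave and R code used for this paper, and the function names, parameter names and the span value are hypothetical placeholders; in particular, 0.75 is not the span used in any of the papers discussed. The software versions are those of this paper's own analysis.</p>
            <p>
```python
import json


def build_manifest(parameters, software_versions):
    """Return one record of every parameter choice and software version.

    A parameter whose value is None is treated as unstated and rejected,
    so that an incomplete record cannot be produced silently.
    """
    unstated = sorted(name for name, value in parameters.items() if value is None)
    if unstated:
        raise ValueError("unstated parameters: " + ", ".join(unstated))
    return {
        "parameters": dict(parameters),
        "software_versions": dict(software_versions),
    }


def manifest_to_json(manifest):
    """Serialize the record with sorted keys so that two runs can be compared."""
    return json.dumps(manifest, indent=2, sort_keys=True)


# Hypothetical parameter names; the span value is a placeholder only.
parameters = {"loess_span": 0.75, "test_sides": 1}
software_versions = {"R": "2.6.0", "affy": "1.16.0", "puma": "1.4.1"}

print(manifest_to_json(build_manifest(parameters, software_versions)))
```
            </p>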
            <suppl id="S1">
               <title>
                  <p>Additional file 1</p>
               </title>
               <text>
                  <p><b>Source code used to create this paper and AffyDEComp</b>. This is a zip file containing R and Sweave code. A Sweave file is a text document that contains both LaTeX and R code, and as such can be used to recreate exactly all the results in this paper using open source tools. Also included is R code to recreate all the charts available through AffyDEComp. See the README file for further details.</p>
               </text>
               <file name="1471-2105-9-164-S1.zip">
                  <p>Click here for file</p>
               </file>
            </suppl>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Conclusion</p>
         </st>
         <p>We have performed the most comprehensive analysis to date of the Golden Spike data set. In doing so we have identified six stages in the analysis pipeline where choices need to be made. We have made firm recommendations about the choices that should be made for just one of these stages if using the Golden Spike data for comparison of summarization and DE detection methods using ROC curves: we recommend that only the probesets that have been spiked-in should be used as the true negatives for the ROC curves. By doing this we overcome the problems due to the artifact identified by Irizarry <it>et al</it>. <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>. We would also recommend the following choices:</p>
         <p>1. The use of a post-summarization loess normalization, with the equal spike-in probesets used as the subset to normalize with. This is also recommended by Gaile and Miecznikowski <abbrgrp><abbr bid="B7">7</abbr></abbrgrp>.</p>
         <p>2. The use of a 1-sided test for up-regulation of genes between the C and S conditions. This mimics the actual situation because all the non-equal spike-ins are up-regulated.</p>
         <p>3. The use of all up-regulated probesets as the true positives for the ROC chart.</p>
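         <p>The choice of true positives and true negatives, together with the one-sided test, can be sketched as follows. This is a minimal Python illustration and not the R code used for this paper: the category labels "up", "equal" and "empty", the probeset names and the expression values are hypothetical, and a signed log fold change stands in for whichever DE method is being assessed.</p>
         <p>
```python
from statistics import mean


def one_sided_score(c_values, s_values):
    """Signed log fold change: large only when S exceeds C.

    Inputs are log-scale expression values for the C and S replicates.
    A two-sided score would take the absolute value instead.
    """
    return mean(s_values) - mean(c_values)


def select_for_roc(probesets):
    """Keep spiked-in probesets only and return (scores, labels).

    Probesets in category "up" are true positives, those in category
    "equal" are true negatives, and all others are left out.
    """
    scores, labels = [], []
    for name, (category, c_values, s_values) in probesets.items():
        if category not in ("up", "equal"):
            continue  # not spiked in, so excluded from the ROC chart
        scores.append(one_sided_score(c_values, s_values))
        labels.append(category == "up")
    return scores, labels


# Hypothetical log2 expression values for three replicates per condition.
probesets = {
    "ps_up": ("up", [8.0, 8.0, 8.0], [9.0, 9.0, 9.0]),
    "ps_equal": ("equal", [8.0, 8.0, 8.0], [8.0, 8.0, 8.0]),
    "ps_empty": ("empty", [5.0, 5.0, 5.0], [5.5, 5.5, 5.5]),
    "ps_equal_low": ("equal", [8.0, 8.0, 8.0], [7.0, 7.0, 7.0]),
}

scores, labels = select_for_roc(probesets)
print(scores)
print(labels)
```
         </p>
         <p>In this toy example the equal spike-in whose estimate drifts downwards receives the lowest one-sided score, whereas a two-sided score would rank it level with the genuinely up-regulated probeset.</p>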
         <p>Using the above recommendations, we created ROC charts for all combinations of summarization and DE methods (Figure <figr fid="F5">5b</figr> and Table <tblr tid="T3">3</tblr>). This showed us that there was no clear DE detection method that stood out, but that what is important is the combination of summarization and DE method. We saw that the combination of multi-mgMOS and PPLR gave the largest AUC. One of the downsides with the PPLR approach is that there is no principled way of determining the proportion of genes that are DE, something that some FDR methods claim to provide. Other combinations that had strong performance included GCRMA/FC, and Cyber-T used in conjunction with various normalization methods. By looking at very small FPRs (Figure <figr fid="F6">6</figr>), CP/Cyber-T, FARMS/FC and DFW/FC were all shown to be potentially valuable when identifying a small number of potential targets. If looking only for genes with larger FCs (Figure <figr fid="F7">7</figr>), RMA/FC was seen to give the strongest performance.</p>
         <p>It should be noted that the design of this experiment could favor certain methods. We have seen that the intensities of the spike-in probesets are particularly high. Estimates of expression levels are known to be more accurate for high intensity probesets. This could favor the FC method of determining DE.</p>
         <p>Furthermore, the replicates in the Golden Spike study are technical rather than biological, and hence the variability between arrays might be expected to be lower in this data set than in a typical data set. Again, this might favor the FC DE method.</p>
         <p>We agree with Irizarry <it>et al</it>. <abbrgrp><abbr bid="B6">6</abbr></abbrgrp> that the Golden Spike data set is flawed. In particular, we recognize that in creating ROC charts from just those probesets which were spiked-in, we are using a data set where the probe intensities are higher than in many typical microarray data sets. Also, applying a post-summarization normalization is not something that many typical analysts will perform, but is believed to be necessary to overcome some of the limitations of this data set, namely that the experiment is unbalanced due to the fact that all the DE spike-ins are up-regulated. We believe that using only the equal-valued spike-in probesets, both as true negatives and for the post-summarization normalization, is the most appropriate way of analyzing this particular data set. Furthermore, given the issues highlighted in the introduction regarding Affycomp and comparisons with qRT-PCR results, we believe that the Golden Spike data set is still the most appropriate tool for comparing DE methods. To this end we have created the AffyDEComp benchmark to enable researchers to compare DE methods. However, we should stress that we are not, at this stage, recommending that AffyDEComp be used as a reliable benchmark as the Golden Spike data set might not be representative of data sets more generally. In particular, the fact that a method does well here does not necessarily mean that it will do well generally. At this time, AffyDEComp might better be suited to identifying combinations of summarization and DE detection methods that perform particularly poorly.</p>
         <p>We encourage the community to develop further spike-in data sets with large numbers of DE probesets. In particular, we encourage the generation of data sets where:</p>
         <p>1. Spike-in concentrations are realistic</p>
         <p>2. DE spike-ins are a mixture of up- and down-regulated</p>
         <p>3. Nominal concentrations and FC sizes are not confounded</p>
         <p>4. The number of arrays used is large enough to be representative of some of the larger studies being performed today</p>
         <p>We believe that only by creating such data sets will we be able to ascertain whether the artifact noted by Irizarry <it>et al</it>. <abbrgrp><abbr bid="B6">6</abbr></abbrgrp> is a peculiarity of the Golden Spike data set, or is a general feature of spike-in data sets. More importantly, the creation of such data sets should improve the AffyDEComp benchmark, and hence enable the community to better evaluate DE detection methods for Affymetrix data.</p>
      </sec>
      <sec>
         <st>
            <p>Methods</p>
         </st>
         <p>The raw data from the Choe <it>et al</it>. <abbrgrp><abbr bid="B4">4</abbr></abbrgrp> study was originally downloaded from the authors' website <abbrgrp><abbr bid="B26">26</abbr></abbrgrp>. All analysis was carried out using the R language (version 2.6.0). MAS5.0, CP, RMA and MBEI expression measures were created using the Bioconductor <abbrgrp><abbr bid="B27">27</abbr></abbrgrp><it> affy </it>package (version 1.16.0). GCRMA expression measures were created using the Bioconductor <it>gcrma </it>package (version 2.10.0). PLIER expression measures were created using the Bioconductor <it>plier </it>package (version 1.8.0). multi-mgMOS expression measures were created using the Bioconductor <it>puma </it>package (version 1.4.1). FARMS expression measures were created using the <it>FARMS </it>package (version 1.1.1) from the authors' website <abbrgrp><abbr bid="B28">28</abbr></abbrgrp>. DFW expression measures were created using the <it>affy </it>package and code from the authors' website <abbrgrp><abbr bid="B29">29</abbr></abbrgrp>. Cyber-T results and loess normalization were obtained using the <it>goldenspike </it>package (version 0.4) <abbrgrp><abbr bid="B26">26</abbr></abbrgrp>. All other analysis was carried out using the Bioconductor <it>puma </it>package (version 1.4.1).</p>
         <p>The code used to create all results in this document is included as Additional file <supplr sid="S1">1</supplr>.</p>
      </sec>
      <sec>
         <st>
            <p>List of abbreviations</p>
         </st>
         <p>DE &#8211; differentially expressed or differential expression, as appropriate. FC &#8211; fold change. MAQC &#8211; MicroArray Quality Control. ROC &#8211; receiver operating characteristic. FPR &#8211; false-positive rate. TPR &#8211; true-positive rate. FDR &#8211; false-discovery rate. AUC &#8211; area under curve (in this paper this refers to the area under the ROC curve).</p>
      </sec>
      <sec>
         <st>
            <p>Authors' contributions</p>
         </st>
         <p>RDP designed the study, performed all analysis, developed the AffyDEComp website, and wrote the manuscript.</p>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>The author thanks Magnus Rattray for a careful reading of the manuscript and useful comments. This work was supported by an NERC "Environmental Genomics/EPSRC" studentship.</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>Microarray data analysis: from disarray to consolidation and consensus</p>
            </title>
            <aug>
               <au>
                  <snm>Allison</snm>
                  <fnm>DB</fnm>
               </au>
               <au>
                  <snm>Cui</snm>
                  <fnm>X</fnm>
               </au>
               <au>
                  <snm>Page</snm>
                  <fnm>GP</fnm>
               </au>
               <au>
                  <snm>Sabripour</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Nat Rev Genet</source>
            <pubdate>2006</pubdate>
            <volume>7</volume>
            <fpage>55</fpage>
            <lpage>65</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/nrg1749</pubid>
                  <pubid idtype="pmpid" link="fulltext">16369572</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B2">
            <title>
               <p>A benchmark for Affymetrix GeneChip expression measures</p>
            </title>
            <aug>
               <au>
                  <snm>Cope</snm>
                  <fnm>LM</fnm>
               </au>
               <au>
                  <snm>Irizarry</snm>
                  <fnm>RA</fnm>
               </au>
               <au>
                  <snm>Jaffee</snm>
                  <fnm>HA</fnm>
               </au>
               <au>
                  <snm>Wu</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Speed</snm>
                  <fnm>TP</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2004</pubdate>
            <volume>20</volume>
            <issue>3</issue>
            <fpage>323</fpage>
            <lpage>31</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/btg410</pubid>
                  <pubid idtype="pmpid" link="fulltext">14960458</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B3">
            <title>
               <p>The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements</p>
            </title>
            <aug>
               <au>
                  <snm>Shi</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Reid</snm>
                  <fnm>LH</fnm>
               </au>
               <au>
                  <snm>Jones</snm>
                  <fnm>WD</fnm>
               </au>
               <au>
                  <snm>Shippy</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Warrington</snm>
                  <fnm>JA</fnm>
               </au>
               <au>
                  <snm>Baker</snm>
                  <fnm>SC</fnm>
               </au>
               <au>
                  <snm>Collins</snm>
                  <fnm>PJ</fnm>
               </au>
               <au>
                  <snm>de Longueville</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Kawasaki</snm>
                  <fnm>ES</fnm>
               </au>
               <au>
                  <snm>Lee</snm>
                  <fnm>KY</fnm>
               </au>
               <etal/>
            </aug>
            <source>Nat Biotechnol</source>
            <pubdate>2006</pubdate>
            <volume>24</volume>
            <issue>9</issue>
            <fpage>1151</fpage>
            <lpage>61</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/nbt1239</pubid>
                  <pubid idtype="pmpid" link="fulltext">16964229</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B4">
            <title>
               <p>Preferred analysis methods for Affymetrix GeneChips revealed by a wholly defined control dataset</p>
            </title>
            <aug>
               <au>
                  <snm>Choe</snm>
                  <fnm>SE</fnm>
               </au>
               <au>
                  <snm>Boutros</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Michelson</snm>
                  <fnm>AM</fnm>
               </au>
               <au>
                  <snm>Church</snm>
                  <fnm>GM</fnm>
               </au>
               <au>
                  <snm>Halfon</snm>
                  <fnm>MS</fnm>
               </au>
            </aug>
            <source>Genome Biol</source>
            <pubdate>2005</pubdate>
            <volume>6</volume>
            <issue>2</issue>
            <fpage>R16</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">551536</pubid>
                  <pubid idtype="pmpid" link="fulltext">15693945</pubid>
                  <pubid idtype="doi">10.1186/gb-2005-6-2-r16</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B5">
            <title>
               <p>A reanalysis of a published Affymetrix GeneChip control dataset</p>
            </title>
            <aug>
               <au>
                  <snm>Dabney</snm>
                  <fnm>AR</fnm>
               </au>
               <au>
                  <snm>Storey</snm>
                  <fnm>JD</fnm>
               </au>
            </aug>
            <source>Genome Biol</source>
            <pubdate>2006</pubdate>
            <volume>7</volume>
            <issue>3</issue>
            <fpage>401</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1557755</pubid>
                  <pubid idtype="pmpid" link="fulltext">16563185</pubid>
                  <pubid idtype="doi">10.1186/gb-2006-7-3-401</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B6">
            <title>
               <p>Feature-level exploration of a published Affymetrix GeneChip control dataset</p>
            </title>
            <aug>
               <au>
                  <snm>Irizarry</snm>
                  <fnm>RA</fnm>
               </au>
               <au>
                  <snm>Cope</snm>
                  <fnm>LM</fnm>
               </au>
               <au>
                  <snm>Wu</snm>
                  <fnm>Z</fnm>
               </au>
            </aug>
            <source>Genome Biol</source>
            <pubdate>2006</pubdate>
            <volume>7</volume>
            <issue>8</issue>
            <fpage>404</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1779590</pubid>
                  <pubid idtype="pmpid" link="fulltext">16953902</pubid>
                  <pubid idtype="doi">10.1186/gb-2006-7-8-404</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B7">
            <title>
               <p>Putative null distributions corresponding to tests of differential expression in the Golden Spike dataset are intensity dependent</p>
            </title>
            <aug>
               <au>
                  <snm>Gaile</snm>
                  <fnm>DP</fnm>
               </au>
               <au>
                  <snm>Miecznikowski</snm>
                  <fnm>JC</fnm>
               </au>
            </aug>
            <source>BMC Genomics</source>
            <pubdate>2007</pubdate>
            <volume>8</volume>
            <fpage>105</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1892022</pubid>
                  <pubid idtype="pmpid" link="fulltext">17445265</pubid>
                  <pubid idtype="doi">10.1186/1471-2164-8-105</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B8">
            <title>
               <p>Towards the uniform distribution of null p-values on Affymetrix microarrays</p>
            </title>
            <aug>
               <au>
                  <snm>Fodor</snm>
                  <fnm>AA</fnm>
               </au>
               <au>
                  <snm>Tickle</snm>
                  <fnm>TL</fnm>
               </au>
               <au>
                  <snm>Richardson</snm>
                  <fnm>C</fnm>
               </au>
            </aug>
            <source>Genome Biol</source>
            <pubdate>2007</pubdate>
            <volume>8</volume>
            <issue>5</issue>
            <fpage>R69</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1929139</pubid>
                  <pubid idtype="pmpid" link="fulltext">17472745</pubid>
                  <pubid idtype="doi">10.1186/gb-2007-8-5-r69</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B9">
            <title>
               <p>Estimation and correction of non-specific binding in a large-scale spike-in experiment</p>
            </title>
            <aug>
               <au>
                  <snm>Schuster</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Blanc</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Partridge</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Thornton</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Genome Biol</source>
            <pubdate>2007</pubdate>
            <volume>8</volume>
            <fpage>R126</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1186/gb-2007-8-6-r126</pubid>
                  <pubid idtype="pmpid" link="fulltext">17594493</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B10">
            <title>
               <p>Exploration, normalization, and summaries of high density oligonucleotide array probe level data</p>
            </title>
            <aug>
               <au>
                  <snm>Irizarry</snm>
                  <fnm>RA</fnm>
               </au>
               <au>
                  <snm>Hobbs</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Collin</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Beazer-Barclay</snm>
                  <fnm>YD</fnm>
               </au>
               <au>
                  <snm>Antonellis</snm>
                  <fnm>KJ</fnm>
               </au>
               <au>
                  <snm>Scherf</snm>
                  <fnm>U</fnm>
               </au>
               <au>
                  <snm>Speed</snm>
                  <fnm>TP</fnm>
               </au>
            </aug>
            <source>Biostatistics</source>
            <pubdate>2003</pubdate>
            <volume>4</volume>
            <issue>2</issue>
            <fpage>249</fpage>
            <lpage>64</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/biostatistics/4.2.249</pubid>
                  <pubid idtype="pmpid" link="fulltext">12925520</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B11">
            <title>
               <p>A Model-Based Background Adjustment for Oligonucleotide Expression Arrays</p>
            </title>
            <aug>
               <au>
                  <snm>Wu</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Irizarry</snm>
                  <fnm>RA</fnm>
               </au>
               <au>
                  <snm>Gentleman</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Martinez-Murillo</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Spencer</snm>
                  <fnm>F</fnm>
               </au>
            </aug>
            <source>Journal of the American Statistical Association</source>
            <pubdate>2004</pubdate>
            <volume>99</volume>
            <issue>468</issue>
            <fpage>909</fpage>
            <lpage>918</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1198/016214504000000683</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B12">
            <title>
               <p>Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection</p>
            </title>
            <aug>
               <au>
                  <snm>Li</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Wong</snm>
                  <fnm>WH</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci USA</source>
            <pubdate>2001</pubdate>
            <volume>98</volume>
            <fpage>31</fpage>
            <lpage>6</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">14539</pubid>
                  <pubid idtype="pmpid" link="fulltext">11134512</pubid>
                  <pubid idtype="doi">10.1073/pnas.011404098</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B13">
            <title>
               <p>Probe-level measurement error improves accuracy in detecting differential gene expression</p>
            </title>
            <aug>
               <au>
                  <snm>Liu</snm>
                  <fnm>X</fnm>
               </au>
               <au>
                  <snm>Milo</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Lawrence</snm>
                  <fnm>ND</fnm>
               </au>
               <au>
                  <snm>Rattray</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2006</pubdate>
            <volume>22</volume>
            <issue>17</issue>
            <fpage>2107</fpage>
            <lpage>13</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/btl361</pubid>
                  <pubid idtype="pmpid" link="fulltext">16820429</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B14">
            <title>
               <p>A tractable probabilistic model for Affymetrix probe-level analysis across multiple chips</p>
            </title>
            <aug>
               <au>
                  <snm>Liu</snm>
                  <fnm>X</fnm>
               </au>
               <au>
                  <snm>Milo</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Lawrence</snm>
                  <fnm>ND</fnm>
               </au>
               <au>
                  <snm>Rattray</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2005</pubdate>
            <volume>21</volume>
            <issue>18</issue>
            <fpage>3637</fpage>
            <lpage>44</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/bti583</pubid>
                  <pubid idtype="pmpid" link="fulltext">16020470</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B15">
            <title>
               <p>A new summarization method for Affymetrix probe level data</p>
            </title>
            <aug>
               <au>
                  <snm>Hochreiter</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Clevert</snm>
                  <fnm>DA</fnm>
               </au>
               <au>
                  <snm>Obermayer</snm>
                  <fnm>K</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2006</pubdate>
            <volume>22</volume>
            <issue>8</issue>
            <fpage>943</fpage>
            <lpage>9</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/btl033</pubid>
                  <pubid idtype="pmpid" link="fulltext">16473874</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B16">
            <title>
               <p>A distribution free summarization method for Affymetrix GeneChip arrays</p>
            </title>
            <aug>
               <au>
                  <snm>Chen</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>McGee</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Liu</snm>
                  <fnm>Q</fnm>
               </au>
               <au>
                  <snm>Scheuermann</snm>
                  <fnm>RH</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2007</pubdate>
            <volume>23</volume>
            <issue>3</issue>
            <fpage>321</fpage>
            <lpage>7</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/btl609</pubid>
                  <pubid idtype="pmpid" link="fulltext">17148508</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B17">
            <title>
               <p>PLIER White Paper</p>
            </title>
            <aug>
               <au>
                  <snm>Hubbell</snm>
                  <fnm>E</fnm>
               </au>
            </aug>
            <source>Affymetrix, Santa Clara, California</source>
            <pubdate>2005</pubdate>
         </bibl>
         <bibl id="B18">
            <title>
               <p>Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments</p>
            </title>
            <aug>
               <au>
                  <snm>Smyth</snm>
                  <fnm>G</fnm>
               </au>
            </aug>
            <source>Statistical Applications in Genetics and Molecular Biology</source>
            <pubdate>2004</pubdate>
            <volume>3</volume>
            <fpage>Article 3</fpage>
            <xrefbib>
               <pubid idtype="doi">10.2202/1544-6115.1027</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B19">
            <title>
               <p>A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes</p>
            </title>
            <aug>
               <au>
                  <snm>Baldi</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Long</snm>
                  <fnm>AD</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2001</pubdate>
            <volume>17</volume>
            <issue>6</issue>
            <fpage>509</fpage>
            <lpage>19</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/17.6.509</pubid>
                  <pubid idtype="pmpid" link="fulltext">11395427</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B20">
            <title>
               <p>Significance analysis of microarrays applied to the ionizing radiation response</p>
            </title>
            <aug>
               <au>
                  <snm>Tusher</snm>
                  <fnm>VG</fnm>
               </au>
               <au>
                  <snm>Tibshirani</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Chu</snm>
                  <fnm>G</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci USA</source>
            <pubdate>2001</pubdate>
            <volume>98</volume>
            <issue>9</issue>
            <fpage>5116</fpage>
            <lpage>21</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">33173</pubid>
                  <pubid idtype="pmpid" link="fulltext">11309499</pubid>
                  <pubid idtype="doi">10.1073/pnas.091062498</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B21">
            <title>
               <p>Probe-level linear model fitting and mixture modeling results in high accuracy detection of differential gene expression</p>
            </title>
            <aug>
               <au>
                  <snm>Lemieux</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2006</pubdate>
            <volume>7</volume>
            <fpage>391</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1579233</pubid>
                  <pubid idtype="pmpid" link="fulltext">16934150</pubid>
                  <pubid idtype="doi">10.1186/1471-2105-7-391</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B22">
            <title>
               <p>Fisher's combined p-value for detecting differentially expressed genes using Affymetrix expression arrays</p>
            </title>
            <aug>
               <au>
                  <snm>Hess</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Iyer</snm>
                  <fnm>H</fnm>
               </au>
            </aug>
            <source>BMC Genomics</source>
            <pubdate>2007</pubdate>
            <volume>8</volume>
            <fpage>96</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1854896</pubid>
                  <pubid idtype="pmpid" link="fulltext">17419876</pubid>
                  <pubid idtype="doi">10.1186/1471-2164-8-96</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B23">
            <title>
               <p>ROCR: visualizing classifier performance in R</p>
            </title>
            <aug>
               <au>
                  <snm>Sing</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Sander</snm>
                  <fnm>O</fnm>
               </au>
               <au>
                  <snm>Beerenwinkel</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Lengauer</snm>
                  <fnm>T</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2005</pubdate>
            <volume>21</volume>
            <issue>20</issue>
            <fpage>3940</fpage>
            <lpage>3941</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/bti623</pubid>
                  <pubid idtype="pmpid" link="fulltext">16096348</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B24">
            <title>
               <p>AffyDEComp</p>
            </title>
            <url>http://manchester.ac.uk/bioinformatics/affydecomp</url>
         </bibl>
         <bibl id="B25">
            <title>
               <p>Sweave: Dynamic generation of statistical reports using literate data analysis</p>
            </title>
            <aug>
               <au>
                  <snm>Leisch</snm>
                  <fnm>F</fnm>
               </au>
            </aug>
            <source>Compstat</source>
            <pubdate>2002</pubdate>
            <fpage>575</fpage>
            <lpage>580</lpage>
         </bibl>
         <bibl id="B26">
            <title>
               <p>Golden Spike Experiment</p>
            </title>
            <url>http://www.elwood9.net/spike/</url>
         </bibl>
         <bibl id="B27">
            <title>
               <p>Bioconductor: open software development for computational biology and bioinformatics</p>
            </title>
            <aug>
               <au>
                  <snm>Gentleman</snm>
                  <fnm>RC</fnm>
               </au>
               <au>
                  <snm>Carey</snm>
                  <fnm>VJ</fnm>
               </au>
               <au>
                  <snm>Bates</snm>
                  <fnm>DM</fnm>
               </au>
               <au>
                  <snm>Bolstad</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Dettling</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Dudoit</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Ellis</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Gautier</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Ge</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Gentry</snm>
                  <fnm>J</fnm>
               </au>
               <etal/>
            </aug>
            <source>Genome Biol</source>
            <pubdate>2004</pubdate>
            <volume>5</volume>
            <issue>10</issue>
            <fpage>R80</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">545600</pubid>
                  <pubid idtype="pmpid" link="fulltext">15461798</pubid>
                  <pubid idtype="doi">10.1186/gb-2004-5-10-r80</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B28">
            <title>
               <p>FARMS package</p>
            </title>
            <url>http://www.bioinf.jku.at/software/farms/farms_1.1.1.tar.gz</url>
         </bibl>
         <bibl id="B29">
            <title>
               <p>Distribution Free Weighted Fold Change Summarization Method (DFW)</p>
            </title>
            <url>http://faculty.smu.edu/mmcgee/dfwcode.pdf</url>
         </bibl>
      </refgrp>
   </bm>
</art>
