<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1471-2105-8-207</ui>
   <ji>1471-2105</ji>
   <fm>
      <dochead>Methodology article</dochead>
      <bibl>
         <title>
            <p>Orthogonal projections to latent structures as a strategy for microarray data normalization</p>
         </title>
         <aug>
            <au id="A1" ca="yes">
               <snm>Bylesj&#246;</snm>
               <fnm>Max</fnm>
               <insr iid="I1"/>
               <email>max.bylesjo@chem.umu.se</email>
            </au>
            <au id="A2">
               <snm>Eriksson</snm>
               <fnm>Daniel</fnm>
               <insr iid="I2"/>
               <email>daniel.eriksson@genfys.slu.se</email>
            </au>
            <au id="A3">
               <snm>Sj&#246;din</snm>
               <fnm>Andreas</fnm>
               <insr iid="I3"/>
               <email>andreas.sjodin@plantphys.umu.se</email>
            </au>
            <au id="A4">
               <snm>Jansson</snm>
               <fnm>Stefan</fnm>
               <insr iid="I3"/>
               <email>stefan.jansson@plantphys.umu.se</email>
            </au>
            <au id="A5">
               <snm>Moritz</snm>
               <fnm>Thomas</fnm>
               <insr iid="I2"/>
               <email>thomas.moritz@genfys.slu.se</email>
            </au>
            <au id="A6">
               <snm>Trygg</snm>
               <fnm>Johan</fnm>
               <insr iid="I1"/>
               <email>johan.trygg@chem.umu.se</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Research group for Chemometrics, Department of Chemistry, Ume&#229; University, SE-901 87 Ume&#229;, Sweden</p>
            </ins>
            <ins id="I2">
               <p>Ume&#229; Plant Science Centre, Department of Forest Genetics and Plant Physiology, Swedish University of Agricultural Sciences, SE-901 83 Ume&#229;, Sweden</p>
            </ins>
            <ins id="I3">
               <p>Ume&#229; Plant Science Centre, Department of Plant Physiology, Ume&#229; University, SE-901 87 Ume&#229;, Sweden</p>
            </ins>
         </insg>
         <source>BMC Bioinformatics</source>
         <issn>1471-2105</issn>
         <pubdate>2007</pubdate>
         <volume>8</volume>
         <issue>1</issue>
         <fpage>207</fpage>
         <url>http://www.biomedcentral.com/1471-2105/8/207</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">17577396</pubid>
               <pubid idtype="doi">10.1186/1471-2105-8-207</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>12</day>
               <month>2</month>
               <year>2007</year>
            </date>
         </rec>
         <acc>
            <date>
               <day>18</day>
               <month>6</month>
               <year>2007</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>18</day>
               <month>6</month>
               <year>2007</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2007</year>
         <collab>Bylesj&#246; et al; licensee BioMed Central Ltd.</collab>
         <note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>During generation of microarray data, various forms of systematic bias are frequently introduced, which limits the accuracy and precision of the results. To properly estimate biological effects, these biases must be identified and discarded.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>We introduce a normalization strategy for multi-channel microarray data based on orthogonal projections to latent structures (OPLS), a multivariate regression method. The effect of applying the normalization methodology to single-channel Affymetrix data as well as dual-channel cDNA data is illustrated. We provide a parallel comparison to a wide range of commonly employed normalization methods with diverse properties and strengths based on sensitivity and specificity from external (spike-in) controls. On the illustrated data sets, the OPLS normalization strategy exhibits leading average true negative and true positive rates in comparison to other evaluated methods.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusion</p>
               </st>
               <p>The OPLS methodology identifies joint variation within biological samples to enable the removal of sources of variation that are non-correlated (orthogonal) to the within-sample variation. This ensures that structured variation related to the underlying biological samples is separated from the remaining, bias-related sources of systematic variation. As a consequence, the methodology does not require any explicit knowledge regarding the presence or characteristics of certain biases. Furthermore, there is no underlying assumption that the majority of elements should be non-differentially expressed, making it applicable to specialized boutique arrays.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <meta>
      <classifications>
         <classification type="bmc" subtype="user_supplied_xml" id="endnote"/>
      </classifications>
   </meta>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>Microarray technology <abbrgrp><abbr bid="B1">1</abbr></abbrgrp> is now a standard technique in many genomics laboratories due to its high-throughput capacity and relatively low cost in detecting gene expression levels <it>en masse</it>. Since its introduction, a vast number of biological studies have utilized the technology to identify regulatory patterns in various organisms <abbrgrp><abbr bid="B2">2</abbr><abbr bid="B3">3</abbr><abbr bid="B4">4</abbr></abbrgrp>.</p>
         <p>In the commonly used spotted microarray platform, probes are attached to a solid surface on pre-defined positions. Sample RNA is reverse transcribed to cDNA, labeled with fluorescent dyes and allowed to hybridize to the probes. After washing away superfluous material, the remaining fluorescence signal from the probes is assumed to reflect the relative expression levels of the sample RNA. Typically, two RNA samples, labeled with different fluorophores (for instance Cy5 and Cy3), are measured in parallel on the same surface to partially compensate for variability in probe dispersion and concentration. Extensions from the current two-channel standard into a multi-channel platform have recently been gaining in popularity <abbrgrp><abbr bid="B5">5</abbr><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr></abbrgrp>.</p>
         <p>During data generation, numerous factors alter the outcome through the introduction of systematic biases. Different properties of the dyes (such as degree of dye incorporation and sensitivity to dye bleaching), irregular or overall disparities of the slide surfaces, variation in printing as well as scanner-introduced bias influence the RNA quantification process. We will generally refer to the main effects as dye, spatial and array bias in the following sections, which have been shown to be the most influential forms of systematic biases present in data from the spotted cDNA microarray platform <abbrgrp><abbr bid="B8">8</abbr><abbr bid="B9">9</abbr></abbrgrp>.</p>
         <p>As a means to identify and remove systematic biases, data normalization is typically performed. A considerable number of published studies concern the subject of microarray normalization; see, for instance, <abbrgrp><abbr bid="B10">10</abbr></abbrgrp> for a comprehensive comparison of existing methods or <abbrgrp><abbr bid="B11">11</abbr></abbrgrp> for a review.</p>
         <p>The most widespread normalization methods aim to address the dye and possibly also spatial effects within each array. We will refer to these methods as within-array normalization methods in the following text. Global median normalization is a straightforward normalization method that addresses labeling issues by adjusting the median intensity value within each array. Global loess normalization <abbrgrp><abbr bid="B12">12</abbr></abbrgrp> appeared early on as a means to address intensity-dependent dye bias. Subsequently, the loess method was applied locally within each print-tip group to additionally assess fixed spatial effects <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>. The methodology of local regression normalization has recently been generalized to handle non-fixed dye and spatial effects in the OLIN normalization method <abbrgrp><abbr bid="B14">14</abbr></abbrgrp>. All of the mentioned methods explicitly or implicitly assume that the majority of the genes on the array (or in localized regions) are unaffected, i.e. that the log-transformed ratios <b>M </b>= log<sub>2</sub>(<b>R</b>/<b>G</b>) are centered at zero.</p>
         <p>As typical microarray experiments involve multiple arrays to characterize multiple samples, systematic differences between the arrays (array bias) are frequently introduced. Several normalization methods for independently addressing this bias have been suggested in the literature <abbrgrp><abbr bid="B15">15</abbr><abbr bid="B16">16</abbr><abbr bid="B17">17</abbr></abbrgrp>. We will refer to these methods as between-array normalization methods. Between-array normalization methods are typically applied after within-array normalization methods. The general strategy has been to normalize the empirical distributions of intensities across arrays, such as the Aquantile normalization <abbrgrp><abbr bid="B15">15</abbr><abbr bid="B17">17</abbr></abbrgrp> that ensures that distributions of <b>A </b>= log<sub>2 </sub>(<inline-formula><m:math name="1471-2105-8-207-i1" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msqrt><m:mrow><m:mi>R</m:mi><m:mi>G</m:mi></m:mrow></m:msqrt></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaGcaaqaaGqabiab=jfasjab=DeahbWcbeaaaaa@2F0D@</m:annotation></m:semantics></m:math></inline-formula>) values are maintained across the slides without altering the dye ratios. Another, closely related approach is the Tquantile normalization methodology <abbrgrp><abbr bid="B15">15</abbr><abbr bid="B17">17</abbr></abbrgrp> that performs quantile normalization separately per group, where a group is defined as an arbitrary collection of quantified RNA samples (such as technical replicates of the same biological sample).</p>
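         <p>As a brief illustration of the two quantities used above, the following Python sketch computes the log-ratio M and the average log-intensity A for a few spots. The function name <it>ma_values</it> and the intensity values are hypothetical and chosen only for illustration; they are not taken from the study.</p>

```python
import math


def ma_values(red, green):
    """Return (M, A) lists for paired red/green channel intensities.

    M = log2(R/G) is the log-ratio and A = log2(sqrt(R*G)) is the
    average log-intensity, following the definitions in the text.
    """
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [math.log2(math.sqrt(r * g)) for r, g in zip(red, green)]
    return m, a


# Hypothetical intensities for three spots (illustration only).
red = [1024.0, 512.0, 256.0]
green = [256.0, 512.0, 1024.0]
m, a = ma_values(red, green)
```

         <p>In this toy example the three spots share the same A value but differ in M, which is the situation that ratio-preserving methods such as Aquantile normalization are designed to leave untouched.</p>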
         <p>Different approaches to microarray normalization have emerged that do not easily fall into any of these groups. For instance, the VSN normalization method <abbrgrp><abbr bid="B18">18</abbr><abbr bid="B19">19</abbr></abbrgrp> performs channel-wise linear and non-linear transformations to reduce the dependence of the variance on the mean value. Potentially powerful is the analysis of variance (ANOVA) approach <abbrgrp><abbr bid="B8">8</abbr><abbr bid="B9">9</abbr></abbrgrp> where all effects are assessed simultaneously in one global model. Wolfinger <it>et al</it>. explicitly used two interconnected models: one for normalization purposes and the latter for identification of differential expression (DE). The ANOVA approach is conceptually related to the presented methodology. Consequently, similarities and discrepancies will be elaborated further in a suitable context.</p>
         <p>Orthogonal signal correction (OSC) <abbrgrp><abbr bid="B20">20</abbr></abbrgrp> is a technique originally developed and used for spectral data. The general concept of OSC is straightforward: structured variation that is orthogonal (non-correlated) to a given problem is identified and can subsequently be studied and discarded. Formally rephrased, systematic variation in the descriptor matrix <b>X </b>(containing, for instance, spectral measurements or microarray signal intensities) is recognized by utilizing information in the response matrix <b>Y </b>(containing, for instance, toxicity measurements or replicate sample information). Orthogonal projections to latent structures (OPLS) <abbrgrp><abbr bid="B21">21</abbr></abbrgrp> was later introduced as an extension to the supervised multivariate regression method partial least squares (PLS) <abbrgrp><abbr bid="B22">22</abbr></abbrgrp> featuring an integrated OSC-filter. OPLS employs information in the <b>Y </b>matrix to decompose the <b>X </b>matrix into correlated, orthogonal and residual structures of information, respectively. Further details of the OPLS method and related methods are described in the upcoming paragraphs.</p>
         <p>The following notation will be used throughout. Vectors are denoted by bold, lower-case letters and are assumed to be column vectors unless indicated by a transposition, e.g. <b>p</b><sup>T</sup>. Matrices are denoted by bold upper-case letters, for instance <b>X</b>, with optional dimensionality information, e.g. (<it>N </it>&#215; <it>K</it>). Matrix inverses are denoted by <b>X</b><sup>-1</sup>. All matrices are assumed to be column-wise mean-centered unless explicitly stated.</p>
         <sec>
            <st>
               <p>Linear regression methods</p>
            </st>
            <p>Linear regression relates two data matrices <b>X </b>(<it>N </it>&#215; <it>K</it>) and <b>Y </b>(<it>N </it>&#215; <it>M</it>) in the general form given in Equation 1.</p>
            <p>
               <display-formula id="M1">
                  <b>Y = XB + F </b>
               </display-formula>
            </p>
            <p>The difficulty in linear regression lies in identifying <b>B </b>(<it>K </it>&#215; <it>M</it>) while maintaining certain objectives, such as minimization of the residual <b>F </b>(<it>N </it>&#215; <it>M</it>), high-quality predictions of future (unknown) <b>Y</b><sub>pred </sub>as well as high interpretability of <b>B</b>. One of the most frequently employed methods for estimating <b>B </b>is the multiple linear regression (MLR) method. As MLR is a least-squares solution, <b>B </b>is chosen so that the sum of squares of the residual matrix <b>F </b>is minimized (Equation 2A). The <b>X</b><sup>+ </sup>matrix denotes the generalized (Moore-Penrose) inverse (Equation 2B).</p>
            <p>
               <display-formula id="M2A">
                  <b>B = X</b>
                  <sup>+</sup>
                  <b>Y </b>
               </display-formula>
            </p>
            <p>
               <display-formula id="M2B"><b>X</b><sup>+ </sup>= (<b>X</b><sup>T</sup><b>X</b>)<sup>-1</sup><b>X</b><sup>T </sup></display-formula>
            </p>
            <p>If <b>X </b>is rank-deficient, (<b>X</b><sup>T</sup><b>X</b>)<sup>-1 </sup>will be undefined and, consequently, the method inapplicable. This happens when there is exact multi-collinearity between the columns (variables) in <b>X </b>and always when the number of variables exceeds the number of observations (<it>K </it>&gt; <it>N</it>); strong but inexact multi-collinearity makes the inverse numerically unstable. This scenario is typical for data matrices in the areas of biology and bioinformatics as biological systems are inherently full of co-variance patterns stemming from pathway regulations.</p>
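            <p>The rank-deficiency problem can be illustrated with a minimal Python sketch of Equations 2A and 2B, restricted to two variables and a single response so that the inverse can be written out by hand. The function names and the small data matrices are hypothetical; this is an illustration under those assumptions, not the procedure used in the study.</p>

```python
import math


def gram(x):
    """Return the K x K matrix X^T X for a list-of-rows matrix x."""
    k = len(x[0])
    return [[sum(row[i] * row[j] for row in x) for j in range(k)]
            for i in range(k)]


def mlr_two_variables(x, y):
    """Least-squares B = (X^T X)^-1 X^T y for K = 2 and one response.

    Raises ValueError when X^T X is singular, i.e. X is rank-deficient.
    """
    (a, b), (c, d) = gram(x)
    det = a * d - b * c
    if math.isclose(det, 0.0, abs_tol=1e-12):
        raise ValueError("X^T X is singular; MLR is not applicable")
    xty = [sum(row[i] * yi for row, yi in zip(x, y)) for i in range(2)]
    inv = [[d / det, -b / det], [-c / det, a / det]]
    return [inv[0][0] * xty[0] + inv[0][1] * xty[1],
            inv[1][0] * xty[0] + inv[1][1] * xty[1]]


# Hypothetical mean-centered data: y = 2 * column 1 + 0.5 * column 2.
x_full_rank = [[-1.0, 1.0], [0.0, -2.0], [1.0, 1.0]]
y = [-1.5, -1.0, 2.5]
b_hat = mlr_two_variables(x_full_rank, y)

# Second column is exactly twice the first, so X is rank-deficient.
x_collinear = [[-1.0, -2.0], [0.0, 0.0], [1.0, 2.0]]
try:
    mlr_two_variables(x_collinear, y)
    collinear_failed = False
except ValueError:
    collinear_failed = True
```

            <p>With the full-rank matrix the generating coefficients are recovered, whereas the collinear matrix makes the determinant of <b>X</b><sup>T</sup><b>X </b>vanish and the estimate cannot be formed.</p>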
            <p>One alternative to traditional MLR is to employ latent variable regression (LVR) methods. The general assumption behind LVR methods is that a system can be described in terms of a small number of latent variables that characterize the main properties of the system. Multi-collinearity is, in such a system, both expected and handled appropriately. This is, for instance, employed in the PLS regression method <abbrgrp><abbr bid="B22">22</abbr></abbrgrp>, where <b>X </b>is decomposed into latent variable structures <b>T</b>, which circumvents the problems with potential rank-deficiency in <b>X </b>(Equation 3).</p>
            <p>
               <display-formula id="M3"><b>X </b>= <b>TP</b><sup>T </sup>+ <b>E </b></display-formula>
            </p>
            <p>The definition and calculation of <b>B </b>are distinctly different from the MLR case (Equations 4 and 5), as they utilize the latent variable structures in <b>T</b>.</p>
            <p>
               <display-formula id="M4">
                  <b>B = W(P</b>
                  <sup>T</sup>
                  <b>W)</b>
                  <sup>-1</sup>
                  <b>C</b>
                  <sup>T </sup>
               </display-formula>
            </p>
            <p>
               <display-formula id="M5"><b>C</b><sup>T </sup>= (<b>T</b><sup>T</sup><b>T</b>)<sup>-1</sup><b>T</b><sup>T</sup><b>Y </b></display-formula>
            </p>
            <p>In Equations 3, 4 and 5, <b>T </b>(<it>N </it>&#215; <it>A</it>) is the score matrix, describing properties at a sample (observational) level; <b>P</b><sup>T </sup>(<it>A </it>&#215; <it>K</it>) is the loading matrix, describing properties at a variable (descriptor) level; <b>W </b>(<it>K </it>&#215; <it>A</it>) is a weight matrix describing covariance between <b>X </b>and <b>Y</b>; and <b>E </b>(<it>N </it>&#215; <it>K</it>) is the residual matrix of <b>X</b>. <it>N </it>denotes the number of observations (microarray channels) and <it>K </it>the number of variables (microarray elements). <it>A </it>is the number of latent variables and thus determines the latent variable rank of the solution, which is typically far lower than the algebraic rank. A suitable value of <it>A </it>is determined using a resampling method such as cross-validation <abbrgrp><abbr bid="B23">23</abbr></abbrgrp>. See the supplied reference for further details.</p>
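            <p>To make Equations 3 to 5 concrete, the following Python sketch fits a single PLS component (<it>A </it>= 1) for a single response (<it>M </it>= 1) using the usual NIPALS-style steps. It is a minimal illustration under these assumptions, with hypothetical function names and data, and is not the implementation used in the study.</p>

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def xt_times(x, v):
    """X^T v for a list-of-rows matrix x (N x K) and a length-N vector v."""
    return [dot([row[j] for row in x], v) for j in range(len(x[0]))]


def pls1_one_component(x, y):
    """One-component PLS for a single response (A = 1, M = 1).

    Returns the weight w, score t, loading p, Y-weight c and the
    regression vector b = w (p^T w)^-1 c (Equation 4 with A = 1).
    """
    w = xt_times(x, y)
    norm = dot(w, w) ** 0.5
    w = [wi / norm for wi in w]            # unit-length weight vector
    t = [dot(row, w) for row in x]         # scores: t = X w
    tt = dot(t, t)
    p = [v / tt for v in xt_times(x, t)]   # loadings: p = X^T t / (t^T t)
    c = dot(t, y) / tt                     # Equation 5 with A = 1
    b = [wi * c / dot(p, w) for wi in w]
    return w, t, p, c, b


# Hypothetical column-wise mean-centered data (N = 4, K = 3).
x = [[-1.0, -1.0, 1.0],
     [-1.0, 1.0, 0.0],
     [1.0, -1.0, 0.0],
     [1.0, 1.0, -1.0]]
y = [-2.0, -1.0, 1.0, 2.0]

w, t, p, c, b = pls1_one_component(x, y)
fitted = [dot(row, b) for row in x]        # predictions X b
```

            <p>Note that no inverse of <b>X</b><sup>T</sup><b>X </b>is required: only the scalar <b>t</b><sup>T</sup><b>t </b>is inverted, which is why the approach remains usable when <it>K </it>is much larger than <it>N</it>.</p>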
         </sec>
         <sec>
            <st>
               <p>The OPLS method</p>
            </st>
            <p>OPLS is a multivariate LVR method where the objective function is to find predictive components that simultaneously maximize the covariance and correlation between <b>X </b>and <b>Y </b>as in Equation 1. Compared to the PLS representation of <b>X </b>(Equation 3), OPLS utilizes information in the response matrix <b>Y </b>to further decompose the <b>X </b>matrix into three distinct structures as described in Equation 6. Here, <b>T</b><sub>p </sub>(<it>N </it>&#215; <it>A</it><sub>p</sub>) denotes the predictive score matrix for <b>X</b>, <b>P</b><sub>p</sub><sup>T </sup>(<it>A</it><sub>p </sub>&#215; <it>K</it>) denotes the predictive loading matrix for <b>X</b>, <b>T</b><sub>o </sub>(<it>N </it>&#215; <it>A</it><sub>o</sub>) denotes the corresponding <b>Y</b>-orthogonal score matrix, <b>P</b><sub>o</sub><sup>T </sup>(<it>A</it><sub>o </sub>&#215; <it>K</it>) denotes the loading matrix of <b>Y</b>-orthogonal components and <b>E </b>denotes the residual matrix of <b>X</b>.</p>
            <p>
               <display-formula id="M6"><b>X </b>= <b>T</b><sub>p</sub><b>P</b><sub>p</sub><sup>T </sup>+ <b>T</b><sub>o</sub><b>P</b><sub>o</sub><sup>T </sup>+ <b>E </b></display-formula>
            </p>
            <p>Note that, in Equation 6, <b>T</b><sub>p</sub><b>P</b><sub>p</sub><sup>T </sup>contains systematic covariance and correlation structures in relation to <b>Y</b>, <b>T</b><sub>o</sub><b>P</b><sub>o</sub><sup>T </sup>contains systematic <b>Y</b>-orthogonal (bias-related) variation and the residual matrix <b>E </b>contains the remaining un-modeled variation. The <it>A</it><sub>p </sub>and <it>A</it><sub>o </sub>parameters define the rank of the solution and will be discussed in more detail at a later point. The algorithm for identifying <b>T</b><sub>p</sub>, <b>P</b><sub>p</sub><sup>T</sup>, <b>T</b><sub>o </sub>and <b>P</b><sub>o</sub><sup>T </sup>is described in detail in <abbrgrp><abbr bid="B21">21</abbr><abbr bid="B24">24</abbr></abbrgrp>.</p>
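            <p>The role of the <b>Y</b>-orthogonal term in Equation 6 can be sketched in Python for the simplest case of a single response and a single <b>Y</b>-orthogonal component. The steps follow the general outline of the single-response OPLS algorithm as commonly described; the function names and data are hypothetical, and the sketch should not be read as the implementation used in the study.</p>

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def xt_times(x, v):
    """X^T v for a list-of-rows matrix x (N x K) and a length-N vector v."""
    return [dot([row[j] for row in x], v) for j in range(len(x[0]))]


def unit(v):
    """Scale v to unit length."""
    norm = dot(v, v) ** 0.5
    return [vi / norm for vi in v]


def opls_orthogonal_component(x, y):
    """Return (t_o, p_o, x_filtered) for one Y-orthogonal component."""
    w = unit(xt_times(x, y))                      # predictive weight
    t = [dot(row, w) for row in x]                # predictive score
    p = [v / dot(t, t) for v in xt_times(x, t)]   # predictive loading
    # The part of p that is not along w defines the orthogonal weight.
    w_o = unit([pi - dot(w, p) * wi for pi, wi in zip(p, w)])
    t_o = [dot(row, w_o) for row in x]            # Y-orthogonal score
    p_o = [v / dot(t_o, t_o) for v in xt_times(x, t_o)]
    # Remove t_o p_o^T from X (one term of T_o P_o^T in Equation 6).
    x_filtered = [[xij - ti * pj for xij, pj in zip(row, p_o)]
                  for row, ti in zip(x, t_o)]
    return t_o, p_o, x_filtered


# Hypothetical column-wise mean-centered data (N = 4, K = 3).
x = [[-1.0, -1.0, 1.0],
     [-1.0, 1.0, 0.0],
     [1.0, -1.0, 0.0],
     [1.0, 1.0, -1.0]]
y = [-2.0, -1.0, 1.0, 2.0]

t_o, p_o, x_filtered = opls_orthogonal_component(x, y)
```

            <p>By construction the score vector <b>t</b><sub>o </sub>has zero inner product with <b>y</b>, so subtracting <b>t</b><sub>o</sub><b>p</b><sub>o</sub><sup>T </sup>removes structured variation from <b>X </b>without removing variation that is linearly related to the response.</p>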
         </sec>
         <sec>
            <st>
               <p>Study summary</p>
            </st>
            <p>We will, in the upcoming sections, show how proper construction of the <b>X </b>and <b>Y </b>matrices and subsequent use of OPLS can be utilized as an efficient normalization step for multi-channel microarray data. Dual-channel microarray data will primarily be used in direct comparison with a set of common normalization methods to highlight differences. Additional data sets, both dual-channel and single-channel, have been evaluated and are presented in additional data file <supplr sid="S1">1</supplr>. The evaluation will primarily be based on differential expression for external controls where the true ratios are known <it>a priori</it>.</p>
            <suppl id="S1">
               <title>
                  <p>Additional data file 1</p>
               </title>
               <text>
                  <p>Supplementary information regarding the outlined strategy, including results from additional data sets, details regarding the compared normalization methods, details regarding the cross-validation procedure and additional figures depicting various forms of biases.</p>
               </text>
               <file name="1471-2105-8-207-S1.pdf">
                  <p>Click here for file</p>
               </file>
            </suppl>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Results</p>
         </st>
         <p>A brief summary of the outlined strategy is provided in the next paragraphs; for a more comprehensive description, consult the Methods section.</p>
         <p>The presented methodology identifies joint variation within biological samples to enable removal of sources of variation that are mathematically independent of (orthogonal to) the within-sample variation. This ensures that systematic variation related to the underlying biological samples is separated from the remaining, bias-related sources of structured variation. The raw microarray data is, in the following text, contained in the <b>X </b>matrix whereas information regarding the different biological samples is contained in the <b>Y </b>matrix. The systematic covariance and correlation structures associated with the biological samples are characterized by the predictive score matrix <b>T</b><sub>p </sub>(<it>N </it>&#215; <it>A</it><sub>p</sub>) and predictive loading matrix <b>P</b><sub>p</sub><sup>T </sup>(<it>A</it><sub>p </sub>&#215; <it>K</it>) from the OPLS model. Here, the <b>T</b><sub>p </sub>matrix describes relations at a sample level whereas the <b>P</b><sub>p</sub><sup>T </sup>matrix describes corresponding characteristics at a variable (gene) level. The bias-related variation, henceforth referred to as the <b>Y</b>-orthogonal variation, is captured in the <b>Y</b>-orthogonal score matrix <b>T</b><sub>o </sub>(<it>N </it>&#215; <it>A</it><sub>o</sub>) and the <b>Y</b>-orthogonal loading matrix <b>P</b><sub>o</sub><sup>T </sup>(<it>A</it><sub>o </sub>&#215; <it>K</it>). In a similar fashion, the <b>T</b><sub>o </sub>matrix describes relations at a sample level whereas the <b>P</b><sub>o</sub><sup>T </sup>matrix describes corresponding characteristics at a variable (gene) level. Dimensionality of the solution is primarily related to the data set-specific parameter <it>A</it><sub>o</sub>, which is estimated by means of Monte Carlo Cross-Validation (MCCV) <abbrgrp><abbr bid="B25">25</abbr></abbrgrp>. Please consult additional data file <supplr sid="S1">1</supplr> for further information regarding the cross-validation procedure.</p>
         <p>In the presented study, we will explicitly illustrate the effects of the suggested normalization methodology primarily on a public dual-channel data set. This data set, which we will refer to as the <it>H8k </it>data set, contains 26 two-channel cDNA microarrays from a previously published study <abbrgrp><abbr bid="B26">26</abbr></abbrgrp>. The experimental design is a traditional dye-swap design containing a treated sample and a reference sample measured using technical replication. Further details regarding the data set are available in the supplied reference.</p>
         <p>We have further evaluated two different data sets using the presented methodology. The first is an in-house produced dual-channel data set (referred to as the <it>POP2.3 </it>data set), whereas the second is a public single-channel Affymetrix (<it>HGU95</it>) spike-in data set, available at the Affymetrix web page <abbrgrp><abbr bid="B27">27</abbr></abbrgrp>. Characteristics and results for these two data sets are mainly available in additional data file <supplr sid="S1">1</supplr>.</p>
         <p>The data has been normalized in parallel using a set of existing normalization methods of varying categories, which we believe to be in common use. Properties of the evaluated normalization methods and a list of abbreviations are available in Table <tblr tid="T1">1</tblr>. Note that we by no means aim to provide a comprehensive comparison of normalization methods; see <abbrgrp><abbr bid="B10">10</abbr></abbrgrp> for such a study.</p>
         <tbl id="T1">
            <title>
               <p>Table 1</p>
            </title>
            <caption>
               <p>Overview of the evaluated normalization methods. The compared normalization methods and their corresponding properties. <sup>a </sup>The spatial effect is constrained to print-tip based effects. <sup>b </sup>The method can be extended to support this feature.</p>
            </caption>
            <tblbdy cols="5">
               <r>
                  <c ca="left">
                     <p>
                        <b>Method name</b>
                     </p>
                  </c>
                  <c ca="left">
                     <p>
                        <b>Short name</b>
                     </p>
                  </c>
                  <c ca="left">
                     <p>
                        <b>Ratio-based</b>
                     </p>
                  </c>
                  <c ca="left">
                     <p>
                        <b>Spatial</b>
                     </p>
                  </c>
                  <c ca="left">
                     <p>
                        <b>Between-array</b>
                     </p>
                  </c>
               </r>
               <r>
                  <c cspan="5">
                     <hr/>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>
                        <b>Global median</b>
                     </p>
                  </c>
                  <c ca="left">
                     <p>Median</p>
                  </c>
                  <c ca="left">
                     <p>Yes</p>
                  </c>
                  <c ca="left">
                     <p>No</p>
                  </c>
                  <c ca="left">
                     <p>No</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>
                        <b>Global loess</b>
                     </p>
                  </c>
                  <c ca="left">
                     <p>Loess</p>
                  </c>
                  <c ca="left">
                     <p>Yes</p>
                  </c>
                  <c ca="left">
                     <p>No</p>
                  </c>
                  <c ca="left">
                     <p>No</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>
                        <b>Print-tip loess</b>
                     </p>
                  </c>
                  <c ca="left">
                     <p>PT-loess</p>
                  </c>
                  <c ca="left">
                     <p>Yes</p>
                  </c>
                  <c ca="left">
                     <p>Yes<sup>a</sup></p>
                  </c>
                  <c ca="left">
                     <p>No</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p><b>Print-tip loess with Tquantile normalization</b></p>
                  </c>
                  <c ca="left">
                     <p>PT-loess/Tq</p>
                  </c>
                  <c ca="left">
                     <p>Yes/Yes</p>
                  </c>
                  <c ca="left">
                     <p>Yes<sup>a</sup>/No</p>
                  </c>
                  <c ca="left">
                     <p>No/Yes</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>
                        <b>OLIN</b>
                     </p>
                  </c>
                  <c ca="left">
                     <p>OLIN</p>
                  </c>
                  <c ca="left">
                     <p>Yes</p>
                  </c>
                  <c ca="left">
                     <p>Yes</p>
                  </c>
                  <c ca="left">
                     <p>No</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>
                        <b>VSN</b>
                     </p>
                  </c>
                  <c ca="left">
                     <p>VSN</p>
                  </c>
                  <c ca="left">
                     <p>No</p>
                  </c>
                  <c ca="left">
                     <p>No<sup>b</sup></p>
                  </c>
                  <c ca="left">
                     <p>No</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>
                        <b>Global loess with ANOVA</b>
                     </p>
                  </c>
                  <c ca="left">
                     <p>ANOVA</p>
                  </c>
                  <c ca="left">
                     <p>Yes/No</p>
                  </c>
                  <c ca="left">
                     <p>No/Yes</p>
                  </c>
                  <c ca="left">
                     <p>No/Yes</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>
                        <b>OPLS</b>
                     </p>
                  </c>
                  <c ca="left">
                     <p>OPLS</p>
                  </c>
                  <c ca="left">
                     <p>No</p>
                  </c>
                  <c ca="left">
                     <p>Yes</p>
                  </c>
                  <c ca="left">
                     <p>Yes</p>
                  </c>
               </r>
            </tblbdy>
         </tbl>
         <p><b>X </b>and <b>Y </b>matrices for the H8k data set were constructed as described in the Methods section and fitted with an OPLS model with one predictive component and 10 <b>Y</b>-orthogonal components (<it>A</it><sub>p </sub>= 1 and <it>A</it><sub>o </sub>= 10) as recommended by group-balanced MCCV. Consult additional data file <supplr sid="S1">1</supplr> for details regarding the cross-validation. The total number of elements determined as differentially expressed for each method based on all microarray elements is available in Figure <figr fid="F1">1A</figr>. The true negative (TN) and true positive (TP) rates for each method, based on the external controls, are available in Figure <figr fid="F1">1B</figr>. The total number of identified differentially expressed genes is highest with the OPLS method, while its TN and TP rates remain high. The TN rate of the OPLS method is lower than that of some other methods (98.2%, compared to 100.0% for raw data), but its TP rate is 100.0%.</p>
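         <p>For readers unfamiliar with the two rates, the following Python sketch shows how TP and TN rates can be computed from external controls whose true status is known. The function name, the control labels and the calls are hypothetical and do not reproduce the values reported for the H8k data set.</p>

```python
def tp_tn_rates(called_de, truly_de):
    """Return (TP rate, TN rate) for paired boolean lists.

    called_de[i] is True when control i was flagged as differentially
    expressed; truly_de[i] is True when the control is known to be
    differentially expressed.
    """
    tp = sum(1 for c, t in zip(called_de, truly_de) if c and t)
    tn = sum(1 for c, t in zip(called_de, truly_de) if not c and not t)
    positives = sum(1 for t in truly_de if t)
    negatives = len(truly_de) - positives
    return tp / positives, tn / negatives


# Hypothetical controls: 4 truly DE, 6 truly unchanged.
truly_de = [True] * 4 + [False] * 6
# Hypothetical calls: every true positive found, one false positive.
called_de = [True] * 4 + [True] + [False] * 5

tp_rate, tn_rate = tp_tn_rates(called_de, truly_de)
```

         <p>In this toy case every truly changed control is detected, while one unchanged control is wrongly flagged, which lowers the TN rate but not the TP rate.</p>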
         <fig id="F1">
            <title>
               <p>Figure 1</p>
            </title>
            <caption>
               <p>Normalization results for the H8k data set</p>
            </caption>
            <text>
               <p><b>Normalization results for the H8k data set</b>. In <b>A</b>, differences in the total number of identified DE microarray elements between the different normalization methods are displayed for the H8k data set. In <b>B</b>, the TP and TN rates for the H8k data set are displayed based on the DE of the external controls. The TP rates are presented using solid black bars whereas the TN rates are presented using striped bars. <it>Raw </it>refers to the un-normalized data.</p>
            </text>
            <graphic file="1471-2105-8-207-1"/>
         </fig>
         <p>The information in the <b>Y</b>-orthogonal <b>T</b><sub>o</sub><b>P</b><sub>o</sub><sup>T </sup>matrices is readily accessible for interpretational purposes. Recall that the <b>T</b><sub>o </sub>matrix describes relations at a sample level whereas the <b>P</b><sub>o</sub><sup>T </sup>matrix describes characteristics at a variable (gene) level. For this particular data set, <b>T</b><sub>o </sub>is composed of 10 score vectors that are orthogonal to <b>Y </b>and mutually orthogonal. We will explicitly interpret a selection of <b>Y</b>-orthogonal vectors to justify discarding this variation and to demonstrate the interpretational possibilities available when employing OPLS as a normalization method.</p>
         <p>The first <b>Y</b>-orthogonal score vector <b>t</b><sub>o,1 </sub>is depicted in Figure <figr fid="F2">2</figr> in parallel with the average <b>A </b>values, representing the average intensity level of a slide, for each of the 26 slides. The Pearson correlation coefficient between the two series is 0.992, implying that the vector mainly identifies a baseline difference between the arrays (i.e. array bias). The corresponding loading vector <b>p</b><sub>o,1</sub><sup>T </sup>displays no systematic trends (not shown), which suggests that there are no evident array-dye or array-spatial interaction effects. The variation captured in this vector accounts for 68.0% of the total variation in <b>X</b>, making it by far the largest single source of structured variation in the data set.</p>
         <fig id="F2">
            <title>
               <p>Figure 2</p>
            </title>
            <caption>
               <p>Illustration of the array baseline difference</p>
            </caption>
            <text>
               <p><b>Illustration of the array baseline difference</b>. The first <b>Y</b>-orthogonal score vector <b>t</b><sub>o,1 </sub>is shown together with the average <b>A </b>values for each slide. The <b>t</b><sub>o,1 </sub>values (averaged per slide) are displayed using point-up, light gray triangles whereas the average <b>A </b>values are displayed using point-down, dark gray triangles. The Pearson correlation coefficient between the two series is 0.992, suggesting that the score vector captures an array bias.</p>
            </text>
            <graphic file="1471-2105-8-207-2"/>
         </fig>
         <p>In the second <b>Y</b>-orthogonal score vector <b>t</b><sub>o,2</sub>, we noted that the score value of the sample labeled with the Cy3 dye was consistently higher than that of the sample labeled with the Cy5 dye on the same slide (see additional data file <supplr sid="S1">1</supplr>). This suggests that an independent dye effect is contained in this vector, which accounts for 7.8% of the total variation in <b>X</b>. The remaining score and loading vectors describe various forms of dye-spatial effects, which are primarily confined to several problematic print-tip groups. This is most noticeable in the eighth <b>Y</b>-orthogonal loading vector <b>p</b><sub>o,8</sub><sup>T</sup>, shown in Figure <figr fid="F3">3</figr>. The print-tip effect partly explains the success of print-tip based normalizations as compared to global normalizations.</p>
         <fig id="F3">
            <title>
               <p>Figure 3</p>
            </title>
            <caption>
               <p>Illustration of a print-tip group effect</p>
            </caption>
            <text>
               <p><b>Illustration of a print-tip group effect</b>. The eighth <b>Y</b>-orthogonal loading vector <b>p</b><sub>o,8</sub><sup>T </sup>is displayed using a spatial representation of the array layout. The 48 print-tip groups are delimited using solid lines. Darker areas denote higher absolute loading values whereas brighter areas denote lower absolute loading values. One distinct print-tip group with high-magnitude loading values can be seen in the upper right corner of the figure (indicated by the arrow), capturing a print-tip group effect.</p>
            </text>
            <graphic file="1471-2105-8-207-3"/>
         </fig>
         <p>The encouraging results from the H8k data set are supported by results from the dual-channel, in-house produced POP2.3 data set as well as the public single-channel HGU95 data set (see additional data file <supplr sid="S1">1</supplr> for details). For the POP2.3 data set, OPLS-normalized data exhibit the leading average TP and TN rates. Furthermore, the first score vector <b>t</b><sub>o,1 </sub>characterizes a distinct array bias, consistent with the behavior of the H8k data set. For the HGU95 data set, OPLS-normalized data also display the leading average TP and TN rates, indicating that the method is applicable to single-channel data as well.</p>
      </sec>
      <sec>
         <st>
            <p>Discussion</p>
         </st>
         <p>Microarray measurements frequently host various forms of systematic and data-set specific experimental errors that limit the accuracy and precision of the results. We have outlined a strategy based on recent advances in multivariate regression for identification of such bias. Using the OPLS method <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>, information across biological samples is employed to discard non-correlated information. With a sound underlying experimental design, this <b>Y</b>-orthogonal information will contain various forms of biases (such as array, dye, spatial and batch-related biases), which can subsequently be discarded from the data.</p>
         <p>The general form of the methodology makes it likely to be of broad utility, for the reasons discussed in the following paragraphs.</p>
         <p>First, the methodology is intensity-based and thus not restricted to two-channel data and the explicit formation of ratios (<b>M </b>values). The main rationale behind usage of ratios is related to biases originating from spot size and overall intensity baseline disparities across arrays, but this effect is clearly captured with the present methodology (Figure <figr fid="F2">2</figr>). The intensity-based approach has obvious auxiliary advantages, in particular when it comes to evaluation of complex designs where treated samples are not consistently hybridized against a reference sample. Furthermore, the general arrangement supports normalization of single-channel data; such a setup is shown in additional data file <supplr sid="S1">1</supplr> with promising results. Moreover, the intensity-based approach enables future extensions to data containing more than two channels, which is presumably becoming an increasingly attractive choice. The extra information in the additional channel(s) could be used to increase the number of measurements <abbrgrp><abbr bid="B6">6</abbr></abbrgrp> or for quality control <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>.</p>
         <p>Second, the methodology does not rely on assumptions that the majority of genes on the array, globally or in localized regions, are non-differentially expressed. Thus, the approach is also applicable to custom-made arrays where the majority of genes are in fact assumed to be DE (commonly referred to as <it>boutique </it>arrays). This is not true for the majority of the currently available methods, although recent extensions of the loess normalization method support such data <abbrgrp><abbr bid="B28">28</abbr></abbrgrp>. There is an apparent danger in applying traditional normalization methods if this underlying assumption of abundance of non-DE genes is not met. Specifically, true biological effects will be eliminated by the normalization to an unknown extent in such situations, which may ultimately obscure the final interpretation of the results. Furthermore, it is not always evident beforehand if the assumption is valid without prior knowledge of the studied system and the anticipated effects.</p>
         <p>Third, unlike ANOVA and print-tip based methods, the methodology does not assume the presence or absence of certain categories of biases, or particular characteristics of these biases. For instance, assume that there exists an (unknown) structured variation related to the production of the microarray slides in different batches. The general ANOVA model will not capture this variation unless such an effect is anticipated and included in the model, whereas the OPLS model requires no such anticipation. The only prerequisite for the present methodology to fully identify and discard bias-related variation is that the variation is systematic (structured) and orthogonal to the differences related to the biological samples.</p>
         <p>The evaluation of, and rationale behind, the potential strengths of the method are to a great extent based on the use of external controls to certify the reliability of the results. We believe that the employment of external (spike-in) controls is a very powerful approach, as one estimates the accuracy of the arrays and not only the precision across replicates. See also <abbrgrp><abbr bid="B29">29</abbr></abbrgrp> for useful discussions regarding evaluation of microarray performance and external validation.</p>
         <p>One common criticism concerning the usage of global models, such as ANOVA, for normalization purposes is that their construction and evaluation require statistical expertise (see, for instance, the discussion section of <abbrgrp><abbr bid="B14">14</abbr></abbrgrp> on this subject). For the outlined method, the only input required from the user is a specification of the sample groups. The remaining tasks, including model fitting, are fully automated using MCCV and need not be any more complicated for novice users than methods for within-slide normalization based on local regression. Model evaluation, as described in the Results section, is a recommended but not mandatory step in the outlined strategy if high throughput is required.</p>
         <p>One known limitation of the methodology arises in situations where the group information is unavailable. This applies in unsupervised analyses, for instance when one is interested in detecting subclasses of a particular cancer type. As the true origin of the samples is unknown, this information cannot be utilized for normalization purposes.</p>
         <p>In the main text, we have briefly discussed the similarities of the outlined method as compared to a two-step ANOVA approach as described in <abbrgrp><abbr bid="B9">9</abbr></abbrgrp>. From a conceptual point of view, the approaches are related as both techniques aim to assess specific effects that can subsequently be retained or discarded. In the first step (normalization step) of the two-step ANOVA approach, various forms of bias are explicitly characterized. In the second step, the biological effects are estimated on the remaining sources of variation (residual). The presented OPLS approach roughly operates in the reverse order, as the biological effects are estimated at an initial stage and the systematic <b>Y</b>-orthogonal effects (bias-related) are discarded at a subsequent step. The OPLS normalization procedure could analogously be arranged to explicitly model unwanted effects in <b>Y </b>(such as array and print-tip effects) and subsequently retain <b>T</b><sub>o</sub><b>P</b><sub>o</sub><sup>T </sup>+ <b>E </b>posterior to modeling. Differences in the results are essentially related to overlapping covariance structures. Assume that there exists structured variation in a data set that is co-varying with both a biological effect as well as an unwanted effect. In the two-step ANOVA approach, this variation will be identified and discarded in the first (normalization) step. Consequently, systematic biological information can be discarded if co-varying with unwanted effects, which is a stringent normalization criterion. In the presented OPLS approach, only variation that is completely unrelated (orthogonal) to the biological sample variation will be discarded. Using the same hypothetical example as for the two-step ANOVA approach, the OPLS method will thus retain the biological variation in the data set after normalization. 
We see that in some cases (as in the presented examples) this approach can be more powerful in identifying differential expression. This is essentially a consequence of the fact that explicit modeling is not viable in practice unless we are aware of all the bias-related effects that are present.</p>
         <p>One could easily imagine situations where one is interested in non-categorical information, for instance exact spike-in sample concentration gradients for the single-channel platform. It is certainly possible to use OPLS for such purposes; specifically to calibrate the measured concentrations to the known concentrations and subsequently predict the unknown (but measured) concentrations according to the same model. This is an example of multivariate calibration <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>, which is an established field of linear modeling. However, since this would involve a different setup, aim and partly also notation compared to the presented method, we will not discuss such a potential normalization strategy in detail.</p>
         <p>Several remaining features of the OPLS methodology, when utilized for normalization purposes, are left un-evaluated in this study. As the normalization is model-based, a finite model space is covered where the regression is defined to be valid. This implies that one can test for model outliers, which can for instance be exploited as a quality control step to detect flawed hybridizations. Furthermore, the outlined strategy makes no explicit use of the predictive information in <b>T</b><sub>p</sub><b>P</b><sub>p</sub><sup>T</sup>, reflecting biological differences at a sample (<b>T</b><sub>p</sub>) and variable (<b>P</b><sub>p</sub><sup>T</sup>) level. In relation to the two-step ANOVA method, this would roughly correspond to the second step where biological effects are differentiated. The OPLS method hosts numerous capabilities for interpretation of this information (see for instance <abbrgrp><abbr bid="B31">31</abbr></abbrgrp> on this subject), but this remains the subject of a future paper.</p>
      </sec>
      <sec>
         <st>
            <p>Conclusion</p>
         </st>
         <p>Presented is a methodology for normalization of microarray data using multivariate regression as implemented in the OPLS method. The strengths of the strategy are demonstrated based on both public and in-house produced data, where identification of known differential expression is shown to be augmented compared to other evaluated methods. Illustrated examples are based on data from the dual-channel microarray platform but the general setup of the strategy allows simple extensions to multi-channel platforms as well.</p>
      </sec>
      <sec>
         <st>
            <p>Methods</p>
         </st>
         <sec>
            <st>
               <p>Constructing the data tables</p>
            </st>
            <p>The following text refers to the two-channel platform but can easily be generalized to single-channel or multi-channel data. Let <b>X </b>consist of the log<sub>2</sub>-transformed intensity values from each channel, i.e. not using ratios for the intensity estimate within the same array. If we are measuring intensity values on <it>S </it>arrays on array layouts containing <it>K </it>elements, the dimensionality of <b>X </b>will thus be (2<it>S </it>&#215; <it>K</it>).</p>
            <p>Now let us assume that the data consist of <it>L </it>groups, which are measured in replicates. In the demonstrated examples, the <it>L </it>groups are the biological replicates of different treatments, each measured several times, but they could also represent some other effect of interest. <b>Y </b>is constructed as a sparse binary matrix of dimensionality (2<it>S </it>&#215; <it>L</it>), where each element in <b>Y </b>is either 0 (sample does not belong to group) or 1 (sample belongs to group). For the sake of simplicity, we will assume that no sample belongs to multiple groups, which implies that the algebraic rank of <b>Y </b>is <it>L</it>-1 when <b>Y </b>is mean-centered, but this is not a general restriction. In the H8k and POP2.3 data sets, one treated sample and one reference sample have been used with a varying number of biological and technical replicates. Each measured channel corresponds to one row in the <b>Y </b>matrix. In this particular case, <b>Y </b>will consist of two columns (one for each treatment) and will, after mean-centering, have the algebraic rank <it>L</it>-1 = 1. Readers familiar with discriminant analysis theory will note that the structure of <b>Y </b>essentially describes a classification problem <abbrgrp><abbr bid="B32">32</abbr></abbrgrp>.</p>
            <p>An example of the <b>Y </b>matrix is provided in Equation 7, where four slides, containing the samples <it>S</it><sub>1 </sub>- <it>S</it><sub>4</sub>, have been hybridized in a dye-swap fashion. Columns in the un-centered <b>Y</b><sub>e </sub>(8 &#215; 4) correspond to samples and rows correspond to channel-wise measurements, whereas the elements conceptually correspond to presence or absence of the sample in the channel. Note that the demonstrated example matrix <b>Y</b><sub>e </sub>is un-centered and thus has algebraic rank <it>L</it>; after column-wise mean-centering it will have the algebraic rank <it>L</it>-1 (not shown). The mean-centered <b>Y </b>matrix is subsequently used in OPLS modeling. Note also that no information regarding the utilized array or fluorophores is explicitly used; a sound underlying experimental design is required to separate array and dye effects from sample effects. See <abbrgrp><abbr bid="B33">33</abbr></abbrgrp> for an excellent review on the subject of experimental design for the two-channel microarray platform or <abbrgrp><abbr bid="B34">34</abbr></abbrgrp> for design issues regarding multi-channel data.</p>
            <p>
               <display-formula id="M7">
                  <m:math name="1471-2105-8-207-i2" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mtable>
                              <m:mtr>
                                 <m:mtd>
                                    <m:mrow>
                                       <m:mtable>
                                          <m:mtr>
                                             <m:mtd>
                                                <m:mrow>
                                                   <m:msub>
                                                      <m:mi>S</m:mi>
                                                      <m:mn>1</m:mn>
                                                   </m:msub>
                                                </m:mrow>
                                             </m:mtd>
                                             <m:mtd>
                                                <m:mrow>
                                                   <m:msub>
                                                      <m:mi>S</m:mi>
                                                      <m:mn>2</m:mn>
                                                   </m:msub>
                                                </m:mrow>
                                             </m:mtd>
                                             <m:mtd>
                                                <m:mrow>
                                                   <m:msub>
                                                      <m:mi>S</m:mi>
                                                      <m:mn>3</m:mn>
                                                   </m:msub>
                                                </m:mrow>
                                             </m:mtd>
                                             <m:mtd>
                                                <m:mrow>
                                                   <m:msub>
                                                      <m:mi>S</m:mi>
                                                      <m:mn>4</m:mn>
                                                   </m:msub>
                                                </m:mrow>
                                             </m:mtd>
                                          </m:mtr>
                                       </m:mtable>
                                    </m:mrow>
                                 </m:mtd>
                              </m:mtr>
                              <m:mtr>
                                 <m:mtd>
                                    <m:mrow>
                                       <m:msub>
                                          <m:mi>Y</m:mi>
                                          <m:mi>e</m:mi>
                                       </m:msub>
                                       <m:mo>=</m:mo>
                                       <m:mrow>
                                          <m:mo>[</m:mo>
                                          <m:mrow>
                                             <m:mtable>
                                                <m:mtr>
                                                   <m:mtd>
                                                      <m:mn>1</m:mn>
                                                   </m:mtd>
                                                   <m:mtd>
                                                      <m:mn>0</m:mn>
                                                   </m:mtd>
                                                   <m:mtd>
                                                      <m:mn>0</m:mn>
                                                   </m:mtd>
                                                   <m:mtd>
                                                      <m:mn>0</m:mn>
                                                   </m:mtd>
                                                </m:mtr>
                                                <m:mtr>
                                                   <m:mtd>
                                                      <m:mn>0</m:mn>
                                                   </m:mtd>
                                                   <m:mtd>
                                                      <m:mn>1</m:mn>
                                                   </m:mtd>
                                                   <m:mtd>
                                                      <m:mn>0</m:mn>
                                                   </m:mtd>
                                                   <m:mtd>
                                                      <m:mn>0</m:mn>
                                                   </m:mtd>
                                                </m:mtr>
                                                <m:mtr>
                                                   <m:mtd>
                                                      <m:mn>0</m:mn>
                                                   </m:mtd>
                                                   <m:mtd>
                                                      <m:mn>0</m:mn>
                                                   </m:mtd>
                                                   <m:mtd>
                                                      <m:mn>1</m:mn>
                                                   </m:mtd>
                                                   <m:mtd>
                                                      <m:mn>0</m:mn>
                                                   </m:mtd>
                                                </m:mtr>
                                                <m:mtr>
                                                   <m:mtd>
                                                      <m:mn>0</m:mn>
                                                   </m:mtd>
                                                   <m:mtd>
                                                      <m:mn>0</m:mn>
                                                   </m:mtd>
                                                   <m:mtd>
                                                      <m:mn>0</m:mn>
                                                   </m:mtd>
                                                   <m:mtd>
                                                      <m:mn>1</m:mn>
                                                   </m:mtd>
                                                </m:mtr>
                                                <m:mtr>
                                                   <m:mtd>
                                                      <m:mn>1</m:mn>
                                                   </m:mtd>
                                                   <m:mtd>
                                                      <m:mn>0</m:mn>
                                                   </m:mtd>
                                                   <m:mtd>
                                                      <m:mn>0</m:mn>
                                                   </m:mtd>
                                                   <m:mtd>
                                                      <m:mn>0</m:mn>
                                                   </m:mtd>
                                                </m:mtr>
                                                <m:mtr>
                                                   <m:mtd>
                                                      <m:mn>0</m:mn>
                                                   </m:mtd>
                                                   <m:mtd>
                                                      <m:mn>1</m:mn>
                                                   </m:mtd>
                                                   <m:mtd>
                                                      <m:mn>0</m:mn>
                                                   </m:mtd>
                                                   <m:mtd>
                                                      <m:mn>0</m:mn>
                                                   </m:mtd>
                                                </m:mtr>
                                                <m:mtr>
                                                   <m:mtd>
                                                      <m:mn>0</m:mn>
                                                   </m:mtd>
                                                   <m:mtd>
                                                      <m:mn>0</m:mn>
                                                   </m:mtd>
                                                   <m:mtd>
                                                      <m:mn>1</m:mn>
                                                   </m:mtd>
                                                   <m:mtd>
                                                      <m:mn>0</m:mn>
                                                   </m:mtd>
                                                </m:mtr>
                                                <m:mtr>
                                                   <m:mtd>
                                                      <m:mn>0</m:mn>
                                                   </m:mtd>
                                                   <m:mtd>
                                                      <m:mn>0</m:mn>
                                                   </m:mtd>
                                                   <m:mtd>
                                                      <m:mn>0</m:mn>
                                                   </m:mtd>
                                                   <m:mtd>
                                                      <m:mn>1</m:mn>
                                                   </m:mtd>
                                                </m:mtr>
                                             </m:mtable>
                                          </m:mrow>
                                          <m:mo>]</m:mo>
                                       </m:mrow>
                                    </m:mrow>
                                 </m:mtd>
                              </m:mtr>
                           </m:mtable>
                        </m:mrow>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
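            <p>As an illustration only, the construction of <b>Y</b><sub>e </sub>in Equation 7 and the effect of column-wise mean-centering on its rank can be sketched in Python with NumPy. The variable names are hypothetical and not part of the described method; the row order simply follows Equation 7, and the assignment of rows to particular slides or dyes is deliberately not encoded.</p>

```python
import numpy as np

# Sample index carried by each measured channel (one row of Y per channel).
# Row order follows Equation 7: S1, S2, S3, S4, then S1, S2, S3, S4 again.
# These names are illustrative assumptions, not taken from the original work.
channel_samples = np.array([0, 1, 2, 3, 0, 1, 2, 3])
n_groups = 4  # L in the text

# Un-centered indicator matrix Y_e (8 x 4): 1 if the channel holds the sample.
Y_e = np.eye(n_groups)[channel_samples]

# Column-wise mean-centering, as applied before OPLS modeling.
Y = Y_e - Y_e.mean(axis=0)

rank_uncentered = np.linalg.matrix_rank(Y_e)  # expected to equal L
rank_centered = np.linalg.matrix_rank(Y)      # expected to equal L - 1
print(rank_uncentered, rank_centered)
```

            <p>Mean-centering removes one degree of freedom because the columns of an indicator matrix with exactly one non-zero entry per row sum to a constant vector.</p>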
            <p>As previously stated, the objective function of OPLS is to find predictive components that simultaneously maximize the covariance and correlation between <b>X </b>and <b>Y</b>. As a consequence of the structure of <b>Y</b>, the predictive information in <b>T</b><sub>p</sub><b>P</b><sub>p</sub><sup>T </sup>describes the maximum difference between the groups, which corresponds to the main biological differences given that the groups denote different biological samples, as in the presented examples. In relation to the ANOVA strategy outlined in <abbrgrp><abbr bid="B8">8</abbr></abbrgrp>, the information in <b>T</b><sub>p </sub>resembles what is characterized by the <it>V </it>(variety) term and <b>P</b><sub>p</sub><sup>T </sup>what is characterized by the <it>G </it>(gene) term (see also the Discussion on this subject). The <b>Y</b>-orthogonal variation in <b>T</b><sub>o</sub><b>P</b><sub>o</sub><sup>T </sup>will then portray the remaining structured variation, which is independent of the sample groups. Array effects, dye effects, spatial effects and possible interactions between these effects will all fall into this category. Fundamental to the concept is that these effects are not confounded with the sample group effects in <b>Y</b>, as they could be under an improper experimental design. Note also that we are not using degrees of freedom to explicitly distinguish these sources of systematic biases from each other.</p>
            <p>The normalized data matrix <b>X</b><sub>norm </sub>(2<it>S </it>&#215; <it>K</it>) is subsequently reconstructed as in Equation 8, i.e. without the <b>Y</b>-orthogonal structures.</p>
            <p>
               <display-formula id="M8"><b>X</b><sub>norm </sub>= <b>T</b><sub>p</sub><b>P</b><sub>p</sub><sup>T </sup>+ <b>E </b></display-formula>
            </p>
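            <p>As a hedged illustration of Equation 8, the following minimal Python sketch removes <b>Y</b>-orthogonal components from a data matrix in the special case of a single response vector (two sample groups), following the single-response OPLS algorithm described in <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>. It is not the R implementation used in this study: the function and variable names are illustrative assumptions, the columns of <b>X </b>and the response are assumed to be mean-centered, and the multi-column <b>Y </b>used in the paper is not handled.</p>

```python
from math import sqrt


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def x_times(X, w):
    """Return X w for a row-major list-of-lists X."""
    return [dot(row, w) for row in X]


def xt_times(X, v):
    """Return X^T v for a row-major list-of-lists X."""
    return [dot(col, v) for col in zip(*X)]


def opls_filter(X, y, n_ortho):
    """Remove up to n_ortho y-orthogonal components from X.

    Assumes (illustration only) that the columns of X and the vector y
    are mean-centered and that X^T y is not the zero vector.
    Returns (X_norm, T_o, P_o), where X_norm equals X minus T_o P_o^T.
    """
    X = [list(row) for row in X]  # work on a copy
    # The predictive weight w is proportional to X^T y. It is unchanged by
    # the deflation below because every t_o is orthogonal to y.
    w = xt_times(X, y)
    norm_w = sqrt(dot(w, w))
    w = [wi / norm_w for wi in w]
    T_o, P_o = [], []
    for _ in range(n_ortho):
        t = x_times(X, w)                      # predictive scores
        tt = dot(t, t)
        p = [v / tt for v in xt_times(X, t)]   # predictive loadings
        wp = dot(w, p)
        # Part of the loading vector that is orthogonal to w.
        w_o = [pi - wp * wi for pi, wi in zip(p, w)]
        norm_sq = dot(w_o, w_o)
        if 1e-12 >= norm_sq:
            break  # no y-orthogonal structure left to remove
        w_o = [v / sqrt(norm_sq) for v in w_o]
        t_o = x_times(X, w_o)                  # y-orthogonal scores
        tt_o = dot(t_o, t_o)
        p_o = [v / tt_o for v in xt_times(X, t_o)]  # y-orthogonal loadings
        # Deflate: subtract the rank-one term t_o p_o^T from X.
        X = [[x - ti * pj for x, pj in zip(row, p_o)]
             for row, ti in zip(X, t_o)]
        T_o.append(t_o)
        P_o.append(p_o)
    return X, T_o, P_o
```

            <p>In this sketch every orthogonal score vector has zero inner product with the response, so subtracting the orthogonal terms leaves the predictive part and the residual, which is the content of Equation 8. The second argument plays the role of one column of the mean-centered <b>Y</b>, and the third argument plays the role of <it>A</it><sub>o</sub>.</p>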
         </sec>
         <sec>
            <st>
               <p>Model estimation</p>
            </st>
            <p>In OPLS modeling, two parameters <it>A</it><sub>p </sub>and <it>A</it><sub>o </sub>need to be estimated, which are related to the dimensionality of <b>T</b><sub>p</sub><b>P</b><sub>p</sub><sup>T </sup>and <b>T</b><sub>o</sub><b>P</b><sub>o</sub><sup>T</sup>, respectively. For the problems described here, we set <it>A</it><sub>p </sub>to the algebraic rank of the mean-centered <b>Y</b>, i.e. to <it>L</it>-1. This corresponds to the fundamental assumption that discriminatory variation between the groups is present in <b>X</b>. The remaining parameter <it>A</it><sub>o </sub>determines the amount of variance that is peeled off from the <b>X </b>matrix (in this case, microarray signals). The value of <it>A</it><sub>o </sub>is essentially dataset-specific. Too low a value of <it>A</it><sub>o </sub>implies that there is still systematic variation in <b>X </b>that is unrelated to <b>Y</b>, which reduces the chance of identifying differential expression (increases type II errors). Too high a value of <it>A</it><sub>o </sub>will, on the contrary, increase the risk of false positives (type I errors) due to the decrease in variance in <b>T</b><sub>p</sub><b>P</b><sub>p</sub><sup>T</sup>. For the data sets described here, we have utilized group-balanced Monte Carlo Cross-Validation (MCCV) <abbrgrp><abbr bid="B25">25</abbr></abbrgrp> to estimate a suitable value of <it>A</it><sub>o</sub>. A more detailed description of the employed MCCV strategy, which is fully automated, is available in additional data file <supplr sid="S1">1</supplr>.</p>
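            <p>The choice of <it>A</it><sub>p </sub>= <it>L</it>-1 can be checked numerically. The short Python sketch below builds a dummy matrix for a hypothetical design with four sample groups, each observed twice, mean-centers its columns and computes the rank by Gaussian elimination. The design, the helper names and the tolerance are assumptions made for illustration only.</p>

```python
def dummy_matrix(labels, n_groups):
    """One row per observation, one 0/1 indicator column per sample group."""
    return [[1.0 if g == k else 0.0 for k in range(n_groups)] for g in labels]


def center_columns(M):
    """Subtract the column mean from every column of M."""
    n = len(M)
    means = [sum(col) / n for col in zip(*M)]
    return [[v - m for v, m in zip(row, means)] for row in M]


def matrix_rank(M, tol=1e-9):
    """Rank of M by Gaussian elimination with partial pivoting."""
    A = [list(row) for row in M]
    n_rows, n_cols = len(A), len(A[0])
    rank = 0
    for c in range(n_cols):
        # Pick the remaining row with the largest absolute entry in column c.
        pivot = max(range(rank, n_rows), key=lambda r: abs(A[r][c]), default=None)
        if pivot is None or tol >= abs(A[pivot][c]):
            continue  # column c adds no new independent direction
        A[rank], A[pivot] = A[pivot], A[rank]
        for r in range(rank + 1, n_rows):
            f = A[r][c] / A[rank][c]
            A[r] = [a - f * b for a, b in zip(A[r], A[rank])]
        rank += 1
    return rank


# Hypothetical design: L = 4 sample groups, each observed twice.
labels = [0, 1, 2, 3, 0, 1, 2, 3]
Y = dummy_matrix(labels, 4)
A_p = matrix_rank(center_columns(Y))  # one less than the number of groups
```

            <p>Every row of the dummy matrix sums to one, so after mean-centering the columns are linearly dependent and one dimension is lost, which is why the rank is <it>L</it>-1 rather than <it>L</it>.</p>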
         </sec>
         <sec>
            <st>
               <p>External controls</p>
            </st>
            <p>The H8k data set contains external (spike-in) controls based on the Lucidea Universal Scorecard (GE Healthcare, Uppsala, Sweden) system, where expression ratios are known beforehand. The external controls are essentially of two different types. The <it>calibration </it>clones are printed at a 1:1 ratio in various concentrations on the slide. As these clones are known not to be differentially expressed (DE), any calibration clone assessed as DE is a false positive (FP). We utilize the calibration clones to determine the true negative (TN) rate, where TN = 1 &#8211; FP. The <it>ratio </it>clones are printed at ratios of 1:3, 3:1, 1:10 and 10:1 in different concentrations on the slide. As these clones are known to be DE, we use them to determine the true positive (TP) rate. Other capabilities of the Lucidea scorecard system, such as the utility clones, have not been utilized in this study. The calibration and ratio clones are spatially scattered across the arrays and constitute a representative subset of approximately two percent of the total number of elements on the microarrays.</p>
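            <p>The two control-based rates can be computed as in the following minimal Python sketch. It assumes that each control clone carries a boolean DE call and a type label; the function name and the label strings are hypothetical and are not taken from the Lucidea system or from the software used in this study.</p>

```python
def control_rates(is_de, control_type):
    """Return (TN rate, TP rate) from DE calls on external control clones.

    is_de        : list of booleans, True if the clone was called DE
    control_type : list of strings, "calibration" or "ratio" (assumed labels)
    """
    calib = [de for de, kind in zip(is_de, control_type) if kind == "calibration"]
    ratio = [de for de, kind in zip(is_de, control_type) if kind == "ratio"]
    fp_rate = sum(calib) / len(calib)   # calibration clones wrongly called DE
    tn_rate = 1.0 - fp_rate             # TN = 1 - FP, as defined in the text
    tp_rate = sum(ratio) / len(ratio)   # ratio clones correctly called DE
    return tn_rate, tp_rate


# Toy input: four calibration clones (one false call), three ratio clones.
calls = [False, True, False, False, True, True, False]
kinds = ["calibration"] * 4 + ["ratio"] * 3
tn, tp = control_rates(calls, kinds)
```

            <p>Both rates are fractions of the respective control subset, so they can be compared across normalization methods irrespective of the number of controls on the array.</p>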
         </sec>
         <sec>
            <st>
               <p>Differential expression</p>
            </st>
            <p>The results of the normalization methods are illustrated in terms of the true negative (TN) rates from the calibration controls, the true positive (TP) rates from the ratio controls and the total number of differentially expressed genes. Differential expression was defined as <it>p</it><sub>adjusted </sub>&lt; 0.05 based on Student's <it>t</it>-test after applying the step-wise false discovery rate method of Benjamini and Hochberg <abbrgrp><abbr bid="B35">35</abbr></abbrgrp> to account for multiple testing. All available clones were included in the multiple-test correction, not only the external (spike-in) control subset. All calculations of differential expression are, for consistency, based on the log<sub>2</sub>-transformed ratios (<b>M </b>values) within each slide, even for methods that do not employ ratios for normalization purposes.</p>
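            <p>The multiple-testing step can be sketched as follows. This minimal Python example computes Benjamini and Hochberg adjusted <it>p</it>-values from a list of raw <it>p</it>-values and flags clones with an adjusted value below 0.05. It assumes that the raw <it>p</it>-values from the <it>t</it>-tests are already available; the function names are illustrative and do not refer to the software used in this study.</p>

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p-value so that the adjusted
    # values never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def call_de(pvalues, alpha=0.05):
    """Flag a clone as DE when its adjusted p-value is below alpha."""
    return [alpha > q for q in benjamini_hochberg(pvalues)]


# Toy input with four raw p-values.
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
flags = call_de(raw)
```

            <p>Because the adjustment depends on the total number of tests, the list passed to the function should contain all clones on the array, in line with the statement above that the correction was not restricted to the spike-in controls.</p>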
         </sec>
         <sec>
            <st>
               <p>Implementation and availability</p>
            </st>
            <p>An R package <abbrgrp><abbr bid="B36">36</abbr></abbrgrp> including all required sources is available on request from the corresponding author.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Abbreviations</p>
         </st>
         <p>ANOVA Analysis of variance</p>
         <p>OLIN Optimized local intensity-dependent normalization</p>
         <p>VSN Variance stabilization</p>
         <p>OSC Orthogonal signal correction</p>
         <p>OPLS Orthogonal projections to latent structures</p>
         <p>MCCV Monte Carlo Cross-Validation</p>
         <p>MLR Multiple linear regression</p>
         <p>LVR Latent variable regression</p>
         <p>DE Differential expression</p>
         <p>TN True negative</p>
         <p>TP True positive</p>
      </sec>
      <sec>
         <st>
            <p>Authors' contributions</p>
         </st>
         <p>MB conceived the study, evaluated the various normalization methods and drafted the manuscript. DE generated the POP2.3 data set and helped to draft the manuscript. AS provided expertise primarily regarding the ANOVA normalization and helped to draft the manuscript. TM, SJ and JT supervised the project. All authors read and approved the final manuscript.</p>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>The authors are grateful to Dr. Gordon K. Smyth and colleagues at the Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia for kindly supplying the H8k data set. This work was supported by grants from the Swedish Foundation for Strategic Research (MB, JT), the Knut and Alice Wallenberg Foundation (JT), the European Commission through the Directorate General Research within the Fifth Framework for Research (AS, SJ), the Swedish Research Council (MB, DE, AS, TM, SJ, JT), EU-strategic funding (DE) and the Functional Genomics Initiative at Swedish University of Agricultural Sciences (DE, TM).</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>Quantitative monitoring of gene expression patterns with a complementary DNA microarray</p>
            </title>
            <aug>
               <au>
                  <snm>Schena</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Shalon</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Davis</snm>
                  <fnm>RW</fnm>
               </au>
               <au>
                  <snm>Brown</snm>
                  <fnm>PO</fnm>
               </au>
            </aug>
            <source>Science</source>
            <pubdate>1995</pubdate>
            <volume>270</volume>
            <issue>5235</issue>
            <fpage>467</fpage>
            <lpage>470</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1126/science.270.5235.467</pubid>
                  <pubid idtype="pmpid" link="fulltext">7569999</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B2">
            <title>
               <p>The transcriptional program in the response of human fibroblasts to serum</p>
            </title>
            <aug>
               <au>
                  <snm>Iyer</snm>
                  <fnm>VR</fnm>
               </au>
               <au>
                  <snm>Eisen</snm>
                  <fnm>MB</fnm>
               </au>
               <au>
                  <snm>Ross</snm>
                  <fnm>DT</fnm>
               </au>
               <au>
                  <snm>Schuler</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Moore</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Lee</snm>
                  <fnm>JC</fnm>
               </au>
               <au>
                  <snm>Trent</snm>
                  <fnm>JM</fnm>
               </au>
               <au>
                  <snm>Staudt</snm>
                  <fnm>LM</fnm>
               </au>
               <au>
                  <snm>Hudson</snm>
                  <fnm>J</fnm>
                  <suf>Jr.</suf>
               </au>
               <au>
                  <snm>Boguski</snm>
                  <fnm>MS</fnm>
               </au>
               <au>
                  <snm>Lashkari</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Shalon</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Botstein</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Brown</snm>
                  <fnm>PO</fnm>
               </au>
            </aug>
            <source>Science</source>
            <pubdate>1999</pubdate>
            <volume>283</volume>
            <issue>5398</issue>
            <fpage>83</fpage>
            <lpage>87</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1126/science.283.5398.83</pubid>
                  <pubid idtype="pmpid" link="fulltext">9872747</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B3">
            <title>
               <p>A genomic approach to investigate developmental cell death in woody tissues of Populus trees</p>
            </title>
            <aug>
               <au>
                  <snm>Moreau</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Aksenov</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Lorenzo</snm>
                  <fnm>MG</fnm>
               </au>
               <au>
                  <snm>Segerman</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Funk</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Nilsson</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Jansson</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Tuominen</snm>
                  <fnm>H</fnm>
               </au>
            </aug>
            <source>Genome Biol</source>
            <pubdate>2005</pubdate>
            <volume>6</volume>
            <issue>4</issue>
            <fpage>R34</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1088962</pubid>
                  <pubid idtype="pmpid" link="fulltext">15833121</pubid>
                  <pubid idtype="doi">10.1186/gb-2005-6-4-r34</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B4">
            <title>
               <p>Global analysis of carbohydrate utilization by Lactobacillus acidophilus using cDNA microarrays</p>
            </title>
            <aug>
               <au>
                  <snm>Barrangou</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Azcarate-Peril</snm>
                  <fnm>MA</fnm>
               </au>
               <au>
                  <snm>Duong</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Conners</snm>
                  <fnm>SB</fnm>
               </au>
               <au>
                  <snm>Kelly</snm>
                  <fnm>RM</fnm>
               </au>
               <au>
                  <snm>Klaenhammer</snm>
                  <fnm>TR</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci U S A</source>
            <pubdate>2006</pubdate>
            <volume>103</volume>
            <issue>10</issue>
            <fpage>3816</fpage>
            <lpage>3821</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1533782</pubid>
                  <pubid idtype="pmpid" link="fulltext">16505367</pubid>
                  <pubid idtype="doi">10.1073/pnas.0511287103</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B5">
            <title>
               <p>Three color cDNA microarrays: quantitative assessment through the use of fluorescein-labeled probes</p>
            </title>
            <aug>
               <au>
                  <snm>Hessner</snm>
                  <fnm>MJ</fnm>
               </au>
               <au>
                  <snm>Wang</snm>
                  <fnm>X</fnm>
               </au>
               <au>
                  <snm>Hulse</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Meyer</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Wu</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Nye</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Guo</snm>
                  <fnm>SW</fnm>
               </au>
               <au>
                  <snm>Ghosh</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2003</pubdate>
            <volume>31</volume>
            <issue>4</issue>
            <fpage>e14</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">150246</pubid>
                  <pubid idtype="pmpid" link="fulltext">12582259</pubid>
                  <pubid idtype="doi">10.1093/nar/gng014</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B6">
            <title>
               <p>Use of three-color cDNA microarray experiments to assess the therapeutic and side effect of drugs</p>
            </title>
            <aug>
               <au>
                  <snm>Zhao</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Wong</snm>
                  <fnm>RNS</fnm>
               </au>
               <au>
                  <snm>Fang</snm>
                  <fnm>KT</fnm>
               </au>
               <au>
                  <snm>Yue</snm>
                  <fnm>PYK</fnm>
               </au>
            </aug>
            <source>Chemometrics Intell Lab Syst</source>
            <pubdate>2006</pubdate>
            <volume>82</volume>
            <issue>1-2</issue>
            <fpage>31</fpage>
            <lpage>36</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1016/j.chemolab.2005.06.021</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B7">
            <title>
               <p>Triple-target microarray experiments: a novel experimental strategy</p>
            </title>
            <aug>
               <au>
                  <snm>Forster</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Costa</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Roy</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Cooke</snm>
                  <fnm>HJ</fnm>
               </au>
               <au>
                  <snm>Maratou</snm>
                  <fnm>K</fnm>
               </au>
            </aug>
            <source>BMC Genomics</source>
            <pubdate>2004</pubdate>
            <volume>5</volume>
            <issue>1</issue>
            <fpage>13</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">365026</pubid>
                  <pubid idtype="pmpid" link="fulltext">15018645</pubid>
                  <pubid idtype="doi">10.1186/1471-2164-5-13</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B8">
            <title>
               <p>Analysis of variance for gene expression microarray data</p>
            </title>
            <aug>
               <au>
                  <snm>Kerr</snm>
                  <fnm>MK</fnm>
               </au>
               <au>
                  <snm>Martin</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Churchill</snm>
                  <fnm>GA</fnm>
               </au>
            </aug>
            <source>J Comput Biol</source>
            <pubdate>2000</pubdate>
            <volume>7</volume>
            <issue>6</issue>
            <fpage>819</fpage>
            <lpage>837</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1089/10665270050514954</pubid>
                  <pubid idtype="pmpid" link="fulltext">11382364</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B9">
            <title>
               <p>Assessing gene significance from cDNA microarray expression data via mixed models</p>
            </title>
            <aug>
               <au>
                  <snm>Wolfinger</snm>
                  <fnm>RD</fnm>
               </au>
               <au>
                  <snm>Gibson</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Wolfinger</snm>
                  <fnm>ED</fnm>
               </au>
               <au>
                  <snm>Bennett</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Hamadeh</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Bushel</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Afshari</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Paules</snm>
                  <fnm>RS</fnm>
               </au>
            </aug>
            <source>J Comput Biol</source>
            <pubdate>2001</pubdate>
            <volume>8</volume>
            <issue>6</issue>
            <fpage>625</fpage>
            <lpage>637</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1089/106652701753307520</pubid>
                  <pubid idtype="pmpid" link="fulltext">11747616</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B10">
            <title>
               <p>Evaluation of normalization methods for cDNA microarray data by k-NN classification</p>
            </title>
            <aug>
               <au>
                  <snm>Wu</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Xing</snm>
                  <fnm>EP</fnm>
               </au>
               <au>
                  <snm>Myers</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Mian</snm>
                  <fnm>IS</fnm>
               </au>
               <au>
                  <snm>Bissell</snm>
                  <fnm>MJ</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2005</pubdate>
            <volume>6</volume>
            <fpage>191</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1201132</pubid>
                  <pubid idtype="pmpid" link="fulltext">16045803</pubid>
                  <pubid idtype="doi">10.1186/1471-2105-6-191</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B11">
            <title>
               <p>Microarray data normalization and transformation</p>
            </title>
            <aug>
               <au>
                  <snm>Quackenbush</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Nat Genet</source>
            <pubdate>2002</pubdate>
            <volume>32 Suppl</volume>
            <fpage>496</fpage>
            <lpage>501</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/ng1032</pubid>
                  <pubid idtype="pmpid" link="fulltext">12454644</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B12">
            <title>
               <p>Normalization for cDNA microarray data</p>
            </title>
            <aug>
               <au>
                  <snm>Yang</snm>
                  <fnm>YH</fnm>
               </au>
               <au>
                  <snm>Dudoit</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Luu</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Speed</snm>
                  <fnm>TP</fnm>
               </au>
            </aug>
            <source>Microarrays: Optical Technologies and Informatics</source>
            <editor>Bittner ML, Chen Y, Dorsel AN, Dougherty ER</editor>
            <series>
               <title>
                  <p>Proceedings of SPIE</p>
               </title>
            </series>
            <pubdate>2001</pubdate>
            <volume>4266</volume>
            <fpage>141</fpage>
            <lpage>152</lpage>
         </bibl>
         <bibl id="B13">
            <title>
               <p>Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation</p>
            </title>
            <aug>
               <au>
                  <snm>Yang</snm>
                  <fnm>YH</fnm>
               </au>
               <au>
                  <snm>Dudoit</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Luu</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Lin</snm>
                  <fnm>DM</fnm>
               </au>
               <au>
                  <snm>Peng</snm>
                  <fnm>V</fnm>
               </au>
               <au>
                  <snm>Ngai</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Speed</snm>
                  <fnm>TP</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2002</pubdate>
            <volume>30</volume>
            <issue>4</issue>
            <fpage>e15</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">100354</pubid>
                  <pubid idtype="pmpid" link="fulltext">11842121</pubid>
                  <pubid idtype="doi">10.1093/nar/30.4.e15</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B14">
            <title>
               <p>Model selection and efficiency testing for normalization of cDNA microarray data</p>
            </title>
            <aug>
               <au>
                  <snm>Futschik</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Crompton</snm>
                  <fnm>T</fnm>
               </au>
            </aug>
            <source>Genome Biol</source>
            <pubdate>2004</pubdate>
            <volume>5</volume>
            <issue>8</issue>
            <fpage>R60</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">507885</pubid>
                  <pubid idtype="pmpid" link="fulltext">15287982</pubid>
                  <pubid idtype="doi">10.1186/gb-2004-5-8-r60</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B15">
            <title>
               <p>A comparison of normalization methods for high density oligonucleotide array data based on variance and bias</p>
            </title>
            <aug>
               <au>
                  <snm>Bolstad</snm>
                  <fnm>BM</fnm>
               </au>
               <au>
                  <snm>Irizarry</snm>
                  <fnm>RA</fnm>
               </au>
               <au>
                  <snm>Astrand</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Speed</snm>
                  <fnm>TP</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2003</pubdate>
            <volume>19</volume>
            <issue>2</issue>
            <fpage>185</fpage>
            <lpage>193</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/19.2.185</pubid>
                  <pubid idtype="pmpid" link="fulltext">12538238</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B16">
            <title>
               <p>Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error application</p>
            </title>
            <aug>
               <au>
                  <snm>Li</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Wong</snm>
                  <fnm>WH</fnm>
               </au>
            </aug>
            <source>Genome Biol</source>
            <pubdate>2001</pubdate>
            <volume>2</volume>
            <issue>8</issue>
            <fpage>RESEARCH0032</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">55329</pubid>
                  <pubid idtype="pmpid" link="fulltext">11532216</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B17">
            <title>
               <p>Normalization for two-color cDNA microarray data</p>
            </title>
            <aug>
               <au>
                  <snm>Yang</snm>
                  <fnm>YH</fnm>
               </au>
               <au>
                  <snm>Thorne</snm>
                  <fnm>NP</fnm>
               </au>
            </aug>
            <source>Science and Statistics: A Festschrift for Terry Speed</source>
            <publisher>IMS Lecture Notes - Monograph Series</publisher>
            <editor>Goldstein DR</editor>
            <pubdate>2003</pubdate>
            <volume>40</volume>
            <fpage>403</fpage>
            <lpage>418</lpage>
         </bibl>
         <bibl id="B18">
            <title>
               <p>Variance stabilization applied to microarray data calibration and to the quantification of differential expression</p>
            </title>
            <aug>
               <au>
                  <snm>Huber</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>von Heydebreck</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Sultmann</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Poustka</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Vingron</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2002</pubdate>
            <volume>18 Suppl 1</volume>
            <fpage>S96</fpage>
            <lpage>104</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">12169536</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B19">
            <title>
               <p>Parameter estimation for the calibration and variance stabilization of microarray data</p>
            </title>
            <aug>
               <au>
                  <snm>Huber</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>von Heydebreck</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Sueltmann</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Poustka</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Vingron</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Stat Appl Genet Mol Biol</source>
            <pubdate>2003</pubdate>
            <volume>2</volume>
            <issue>1</issue>
            <fpage>Article3</fpage>
            <xrefbib>
               <pubid idtype="pmpid">16646781</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B20">
            <title>
               <p>Orthogonal signal correction of near-infrared spectra</p>
            </title>
            <aug>
               <au>
                  <snm>Wold</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Antti</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Lindgren</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>&#214;hman</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Chemometrics Intell Lab Syst</source>
            <pubdate>1998</pubdate>
            <volume>44</volume>
            <fpage>175</fpage>
            <lpage>185</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1016/S0169-7439(98)00109-9</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B21">
            <title>
               <p>Orthogonal projections to latent structures (O-PLS)</p>
            </title>
            <aug>
               <au>
                  <snm>Trygg</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Wold</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>J Chemometrics</source>
            <pubdate>2002</pubdate>
            <volume>16</volume>
            <fpage>119</fpage>
            <lpage>128</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1002/cem.695</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B22">
            <title>
               <p>PLS-regression: a basic tool of chemometrics</p>
            </title>
            <aug>
               <au>
                  <snm>Wold</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Sj&#246;str&#246;m</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Eriksson</snm>
                  <fnm>L</fnm>
               </au>
            </aug>
            <source>Chemometrics Intell Lab Syst</source>
            <pubdate>2001</pubdate>
            <volume>58</volume>
            <issue>2</issue>
            <fpage>109</fpage>
            <lpage>130</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1016/S0169-7439(01)00155-1</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B23">
            <title>
               <p>Cross Validatory Estimation of the Number of Components in Factor and Principal Components Models</p>
            </title>
            <aug>
               <au>
                  <snm>Wold</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Technometrics</source>
            <pubdate>1978</pubdate>
            <volume>20</volume>
            <fpage>397</fpage>
            <lpage>406</lpage>
            <xrefbib>
               <pubid idtype="doi">10.2307/1267639</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B24">
            <title>
               <p>O2-PLS for qualitative and quantitative analysis in multivariate calibration</p>
            </title>
            <aug>
               <au>
                  <snm>Trygg</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>J Chemometrics</source>
            <pubdate>2002</pubdate>
            <volume>16</volume>
            <fpage>283</fpage>
            <lpage>293</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1002/cem.724</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B25">
            <title>
               <p>Linear-Model Selection by Cross-Validation</p>
            </title>
            <aug>
               <au>
                  <snm>Shao</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>J Am Stat Assoc</source>
            <pubdate>1993</pubdate>
            <volume>88</volume>
            <issue>422</issue>
            <fpage>486</fpage>
            <lpage>494</lpage>
            <xrefbib>
               <pubid idtype="doi">10.2307/2290328</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B26">
            <title>
               <p>Use of within-array replicate spots for assessing differential expression in microarray experiments</p>
            </title>
            <aug>
               <au>
                  <snm>Smyth</snm>
                  <fnm>GK</fnm>
               </au>
               <au>
                  <snm>Michaud</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Scott</snm>
                  <fnm>HS</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2005</pubdate>
            <volume>21</volume>
            <issue>9</issue>
            <fpage>2067</fpage>
            <lpage>2075</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/bti270</pubid>
                  <pubid idtype="pmpid" link="fulltext">15657102</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B27">
            <title>
               <p>Affymetrix sample data set repository</p>
            </title>
            <url>http://www.affymetrix.com/support/technical/sample_data/datasets.affx</url>
         </bibl>
         <bibl id="B28">
            <title>
               <p>Normalization of boutique two-color microarrays with a high proportion of differentially expressed probes</p>
            </title>
            <aug>
               <au>
                  <snm>Oshlack</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Emslie</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Corcoran</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Smyth</snm>
                  <fnm>GK</fnm>
               </au>
            </aug>
            <source>Genome Biol</source>
            <pubdate>2007</pubdate>
            <volume>8</volume>
            <issue>1</issue>
            <fpage>R2</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1839120</pubid>
                  <pubid idtype="pmpid" link="fulltext">17204140</pubid>
                  <pubid idtype="doi">10.1186/gb-2007-8-1-r2</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B29">
            <title>
               <p>In control: systematic assessment of microarray performance</p>
            </title>
            <aug>
               <au>
                  <snm>van Bakel</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Holstege</snm>
                  <fnm>FC</fnm>
               </au>
            </aug>
            <source>EMBO Rep</source>
            <pubdate>2004</pubdate>
            <volume>5</volume>
            <issue>10</issue>
            <fpage>964</fpage>
            <lpage>969</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1299153</pubid>
                  <pubid idtype="pmpid" link="fulltext">15459748</pubid>
                  <pubid idtype="doi">10.1038/sj.embor.7400253</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B30">
            <title>
               <p>Multivariate Calibration</p>
            </title>
            <aug>
               <au>
                  <snm>Martens</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Naes</snm>
                  <fnm>T</fnm>
               </au>
            </aug>
            <publisher>Chichester, John Wiley &amp; Sons</publisher>
            <pubdate>1992</pubdate>
         </bibl>
         <bibl id="B31">
            <title>
               <p>Prediction and spectral profile estimation in multivariate calibration</p>
            </title>
            <aug>
               <au>
                  <snm>Trygg</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>J Chemometrics</source>
            <pubdate>2004</pubdate>
            <volume>18</volume>
            <fpage>166</fpage>
            <lpage>172</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1002/cem.860</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B32">
            <title>
               <p>OPLS discriminant analysis: combining the strengths of PLS-DA and SIMCA classification</p>
            </title>
            <aug>
               <au>
                  <snm>Bylesj&#246;</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Rantalainen</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Cloarec</snm>
                  <fnm>O</fnm>
               </au>
               <au>
                  <snm>Nicholson</snm>
                  <fnm>JK</fnm>
               </au>
               <au>
                  <snm>Holmes</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Trygg</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>J Chemometrics</source>
            <pubdate>2006</pubdate>
            <volume>20</volume>
            <fpage>341</fpage>
            <lpage>351</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1002/cem.1006</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B33">
            <title>
               <p>Fundamentals of experimental design for cDNA microarrays</p>
            </title>
            <aug>
               <au>
                  <snm>Churchill</snm>
                  <fnm>GA</fnm>
               </au>
            </aug>
            <source>Nat Genet</source>
            <pubdate>2002</pubdate>
            <volume>32 Suppl</volume>
            <fpage>490</fpage>
            <lpage>495</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/ng1031</pubid>
                  <pubid idtype="pmpid" link="fulltext">12454643</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B34">
            <title>
               <p>Experimental design for three-color and four-color gene expression microarrays</p>
            </title>
            <aug>
               <au>
                  <snm>Woo</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Krueger</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Kaur</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Churchill</snm>
                  <fnm>G</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2005</pubdate>
            <volume>21 Suppl 1</volume>
            <fpage>i459</fpage>
            <lpage>i467</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/bti1031</pubid>
                  <pubid idtype="pmpid" link="fulltext">15961491</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B35">
            <title>
               <p>Controlling the false discovery rate: a practical and powerful approach to multiple testing</p>
            </title>
            <aug>
               <au>
                  <snm>Benjamini</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Hochberg</snm>
                  <fnm>Y</fnm>
               </au>
            </aug>
            <source>J R Stat Soc B</source>
            <pubdate>1995</pubdate>
            <volume>57</volume>
            <issue>1</issue>
            <fpage>289</fpage>
            <lpage>300</lpage>
         </bibl>
         <bibl id="B36">
            <title>
               <p>The R project for statistical computing</p>
            </title>
            <url>http://www.r-project.org/</url>
         </bibl>
      </refgrp>
   </bm>
</art>
