<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1471-2105-11-27</ui>
   <ji>1471-2105</ji>
   <fm>
      <dochead>Research article</dochead>
      <bibl>
         <title>
            <p>Biomarker discovery in heterogeneous tissue samples - taking the in-silico deconfounding approach</p>
         </title>
         <aug>
            <au ca="yes" id="A1">
               <snm>Repsilber</snm>
               <fnm>Dirk</fnm>
               <insr iid="I1"/>
               <email>repsilber@fbn-dummerstorf.de</email>
            </au>
            <au id="A2">
               <snm>Kern</snm>
               <fnm>Sabine</fnm>
               <insr iid="I2"/>
               <email>SabineKern.Privat@web.de</email>
            </au>
            <au id="A3">
               <snm>Telaar</snm>
               <fnm>Anna</fnm>
               <insr iid="I1"/>
               <email>telaar@fbn-dummerstorf.de</email>
            </au>
            <au id="A4">
               <snm>Walzl</snm>
               <fnm>Gerhard</fnm>
               <insr iid="I3"/>
               <email>GWALZL@sun.ac.za</email>
            </au>
            <au id="A5">
               <snm>Black</snm>
               <mi>F</mi>
               <fnm>Gillian</fnm>
               <insr iid="I3"/>
               <email>gfb@sun.ac.za</email>
            </au>
            <au id="A6">
               <snm>Selbig</snm>
               <fnm>Joachim</fnm>
               <insr iid="I2"/>
               <email>Selbig@mpimp-golm.mpg.de</email>
            </au>
            <au id="A7">
               <snm>Parida</snm>
               <mi>K</mi>
               <fnm>Shreemanta</fnm>
               <insr iid="I4"/>
               <email>parida@mpiib-berlin.mpg.de</email>
            </au>
            <au id="A8">
               <snm>Kaufmann</snm>
               <mi>HE</mi>
               <fnm>Stefan</fnm>
               <insr iid="I4"/>
               <email>kaufmann@mpiib-berlin.mpg.de</email>
            </au>
            <au id="A9">
               <snm>Jacobsen</snm>
               <fnm>Marc</fnm>
               <insr iid="I4"/>
               <insr iid="I5"/>
               <email>jacobsen@bni-hamburg.de</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Department of Genetics and Biometry, Research Institute for the Biology of Farm Animals, Wilhelm-Stahl Allee 2, D 18196 Dummerstorf, Germany</p>
            </ins>
            <ins id="I2">
               <p>Bioinformatics Chair, Institute for Biochemistry and Biology at the University of Potsdam, Karl-Liebknecht-Str. 24-25, D 14476 Potsdam-Golm, Germany</p>
            </ins>
            <ins id="I3">
               <p>Molecular Biology and Human Genetics, University of Stellenbosch, Tygerberg, Cape Town 7505, South Africa</p>
            </ins>
            <ins id="I4">
               <p>Department of Immunology, Max-Planck-Institute for Infection Biology, Charit&#233;platz 1, D 10117 Berlin, Germany</p>
            </ins>
            <ins id="I5">
               <p>Department of Immunology, Bernhard-Nocht-Institute for Tropical Medicine, Bernhard-Nocht-Str. 74, D 20359 Hamburg, Germany</p>
            </ins>
         </insg>
         <source>BMC Bioinformatics</source>
         <issn>1471-2105</issn>
         <pubdate>2010</pubdate>
         <volume>11</volume>
         <issue>1</issue>
         <fpage>27</fpage>
         <url>http://www.biomedcentral.com/1471-2105/11/27</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="doi">10.1186/1471-2105-11-27</pubid>
               <pubid idtype="pmpid">20070912</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>3</day>
               <month>9</month>
               <year>2009</year>
            </date>
         </rec>
         <acc>
            <date>
               <day>14</day>
               <month>1</month>
               <year>2010</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>14</day>
               <month>1</month>
               <year>2010</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2010</year>
         <collab>Repsilber et al; licensee BioMed Central Ltd.</collab>
         <note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>For heterogeneous tissues, such as blood, measurements of gene expression are confounded by relative proportions of cell types involved. Conclusions have to rely on estimation of gene expression signals for homogeneous cell populations, e.g. by applying micro-dissection, fluorescence activated cell sorting, or <it>in-silico </it>deconfounding. We studied feasibility and validity of a non-negative matrix decomposition algorithm using experimental gene expression data for blood and sorted cells from the same donor samples. Our objective was to optimize the algorithm regarding detection of differentially expressed genes and to enable its use for classification in the difficult scenario of reversely regulated genes. This would be of importance for the identification of candidate biomarkers in heterogeneous tissues.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>Experimental data and simulation studies involving noise parameters estimated from these data revealed that for valid detection of differential gene expression, quantile normalization and use of non-log data are optimal. We demonstrate the feasibility of predicting proportions of constituting cell types from gene expression data of single samples, as a prerequisite for a deconfounding-based classification approach.</p>
               <p>Classification cross-validation errors with and without using deconfounding results are reported as well as sample-size dependencies. Implementation of the algorithm, simulation and analysis scripts are available.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusions</p>
               </st>
               <p>The deconfounding algorithm without decorrelation using quantile normalization on non-log data is proposed for biomarkers that are difficult to detect, and for cases where confounding by varying proportions of cell types is the suspected reason. In this case, a deconfounding ranking approach can be used as a powerful alternative to, or complement of, other statistical learning approaches to define candidate biomarkers for molecular diagnosis and prediction in biomedicine, in realistically noisy conditions and with moderate sample sizes.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>For studies involving heterogeneous tissue samples, detection of differential gene expression from molecular profiles, as well as identification of biomarkers, is a problem of validity: molecular profile variation and changes in cell type proportions between tissue samples are confounded <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B2">2</abbr><abbr bid="B3">3</abbr><abbr bid="B4">4</abbr></abbrgrp>. Nevertheless, heterogeneous tissues (e.g. blood, tumor) are frequently used, and confounding is aggravated in pathological situations where diseased tissue is often infiltrated by immune cell populations. The most widely used material is blood, which is routinely sampled for diagnostic or prognostic purposes and serves as surrogate tissue in many clinical studies for reasons of accessibility and ease of storage and processing. Valid biomarkers from blood are thus often targeted <abbrgrp><abbr bid="B2">2</abbr></abbrgrp>. Regarding tissue heterogeneity, however, blood is an extreme example since inter-individual differences and disease-specific changes, amongst other reasons, lead to high variability in composition (<abbrgrp><abbr bid="B5">5</abbr><abbr bid="B6">6</abbr></abbrgrp>, cf. our data, figure <figr fid="F1">1</figr>).</p>
         <fig id="F1">
            <title>
               <p>Figure 1</p>
            </title>
            <caption>
               <p>Experimentally defined proportions of different blood cell types</p>
            </caption>
            <text>
               <p><b>Experimentally defined proportions of different blood cell types</b>. Individual proportions of PBMCs are depicted (CD3<sup>+ </sup>cells, CD14<sup>+ </sup>cells, and Others) for groups of TB patients and TST+ individuals. Cell type proportions are highly variable, even between individuals within a group.</p>
            </text>
            <graphic file="1471-2105-11-27-1" hint_layout="single"/>
         </fig>
         <p>Cell sorting of blood cells or, in the case of solid tissues, micro-dissection <abbrgrp><abbr bid="B7">7</abbr></abbrgrp> depends on sophisticated equipment. Hence, biomarker studies under field conditions, especially in resource-poor countries, have to rely on molecular profiling from whole blood samples. Ideally, biomarkers with prominent and clear signals can be used which remain detectable in spite of varying cell type populations. However, biomarker signals for more subtle differences are most likely not detectable due to confounding tissue compositions. Figure <figr fid="F2">2</figr> gives an overview of possible scenarios:</p>
         <fig id="F2">
            <title>
               <p>Figure 2</p>
            </title>
            <caption>
               <p>Cases of gene expression in tissue context</p>
            </caption>
            <text>
               <p><b>Cases of gene expression in tissue context</b>. (A) Non-problematic cases; (B) Confounding of cell type proportions and cell-type specific gene expression: simple and problematic case, deconfounding is possible; (C) worst case: gene expression depends on cell type proportion, deconfounding not possible.</p>
            </text>
            <graphic file="1471-2105-11-27-2" hint_layout="single"/>
         </fig>
         <p>Figure <figr fid="F2">2A</figr> shows the non-problematic case for homogeneous tissue (e.g. culture of homogeneous cell populations under synchronizing conditions) without any confounding or interpretation problems (left), or tissue with fixed cell type proportions. For these cases, there is no confounding problem <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>.</p>
         <p>Figure <figr fid="F2">2B</figr> refers to two cases for which <it>in-silico </it>approaches exist for deconfounding. The simple case (figure <figr fid="F2">2B</figr>, left) refers to a situation in which a gene of interest is exclusively expressed in a certain cell type (one amongst others, in varying proportions), and the proportions of this cell type in the study samples have been determined.</p>
         <p>Such a cell type-specific gene is differentially expressed if the interaction term in the linear model</p>
         <p>
            <display-formula id="M1">
               <graphic file="1471-2105-11-27-i1.gif"/>
            </display-formula>
         </p>
         <p>is significant. Here, <it>y</it><sub><it>i </it></sub>denotes the log-ratio of gene expression signals for a specific gene in a common reference design (sample <it>i</it>), but it could also be a vector of log-intensities for one-color chips after normalization. <it>&#946;</it><sub>0 </sub>is the overall mean for this gene, representing the background signal (without any cells of the cell type exclusively expressing this gene). The binary factor <it>g </it>represents patient (<it>g </it>= 1) or control status (<it>g </it>= 0) for the respective sample, and <it>p</it><sub><it>i </it></sub>denotes the proportion of the immune cell population in question as confounding factor. The variable <it>g </it>&#215; <it>p </it>indicates the interaction effect of study group and immune cell proportion. Finally, <it>&#949;</it><sub><it>i </it></sub>denotes the residual for sample <it>i</it>. An important assumption for this modeling approach is that single-cell gene expression is independent of cell type proportions. For an example of this type of analysis see the contribution by Jacobsen et al. (2006) <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>. Similar problems and their solutions were presented by Kriete and Boyce (2005) <abbrgrp><abbr bid="B8">8</abbr></abbrgrp>, who combined tissue composition data and gene expression data, as well as by Ghosh (2004) <abbrgrp><abbr bid="B9">9</abbr></abbrgrp>, the latter without estimates of the cell type proportions. If gene expression is no longer restricted to a specific cell type (as in figure <figr fid="F2">2B</figr>, right), we are dealing with the problematic case, for which it is harder to disentangle influences of single-cell gene expression and variation in cell type proportions. A few similar approaches exist for this case, all employing an iterative optimization of the decomposition as given by equation 2:</p>
         <p>
            <display-formula id="M2">
               <graphic file="1471-2105-11-27-i2.gif"/>
            </display-formula>
         </p>
         <p>Here, <it>X </it>denotes the classical <it>gene expression matrix </it>(genes by samples). <it>S</it>, the <it>signature matrix</it>, gives the cell type specific gene expression profiles (genes by cell types), and <it>C</it>, the <it>concentration matrix</it>, gives cell type proportions over samples (cell types by samples).</p>
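         <p>The roles of the three matrices can be illustrated with a minimal Python sketch. The toy values and the helper name <monospace>mix</monospace> are hypothetical and serve only to show the dimensions involved: each column of <it>X </it>is a mixture of the columns of <it>S</it>, weighted by the corresponding column of <it>C</it>.</p>
         <p>
            <monospace><![CDATA[
```python
def mix(S, C):
    """Return X = S * C for S (genes x cell types) and C (cell types x samples)."""
    n_types = len(C)
    n_samples = len(C[0])
    return [[sum(row[t] * C[t][k] for t in range(n_types))
             for k in range(n_samples)]
            for row in S]


# Hypothetical toy values: 3 genes, 2 cell types, 2 samples.
S = [[10.0, 2.0],    # gene 1: mainly expressed in cell type 1
     [4.0, 4.0],     # gene 2: equally expressed in both cell types
     [1.0, 9.0]]     # gene 3: mainly expressed in cell type 2
C = [[0.75, 0.25],   # proportion of cell type 1 in sample 1 and sample 2
     [0.25, 0.75]]   # proportion of cell type 2 in sample 1 and sample 2
X = mix(S, C)        # genes by samples
```
]]></monospace>
         </p>
         <p>In this toy example the measured signal of gene 2 is identical in both samples, whereas genes 1 and 3 differ between the samples solely because the cell type proportions differ.</p>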
         <p>An alternative formulation is given in equation 3 for the mixture of two cell types (with cell type specific expression signatures <it>s</it><sub>1,<it>i </it></sub>and <it>s</it><sub>2,<it>i </it></sub>for gene <it>i</it>):</p>
         <p>
            <display-formula id="M3">
               <graphic file="1471-2105-11-27-i3.gif"/>
            </display-formula>
         </p>
         <p>where <inline-formula><graphic file="1471-2105-11-27-i4.gif"/></inline-formula> denotes the expression value of the <it>i</it>th gene in the <it>k</it>th heterogeneous sample and 0 &#8804; <it>c</it><sub><it>k </it></sub>&#8804; 1 denotes the fraction of the first cell type in the <it>k</it>th mixture; equivalent expressions are used in <abbrgrp><abbr bid="B4">4</abbr><abbr bid="B10">10</abbr><abbr bid="B11">11</abbr></abbrgrp>. Venet et al. (2001) <abbrgrp><abbr bid="B10">10</abbr></abbrgrp> were the first to study this approach. They made use of a de-correlation step, which tends to improve the reconstruction of simulated cell type-specific gene expression profiles. Experimental data were also used, but without the possibility of validating the deconfounding results in a straightforward way. Lu et al. (2003) <abbrgrp><abbr bid="B11">11</abbr></abbrgrp> described a similar approach for analyzing yeast cell cycle expression patterns. Likewise, Stuart et al. <abbrgrp><abbr bid="B12">12</abbr></abbrgrp> investigated prostate tumor tissue. Lahdesmaki et al. (2005) <abbrgrp><abbr bid="B4">4</abbr></abbrgrp> were the first to introduce an approach that also estimates the appropriate number of cell types for deconfounding analysis.</p>
         <p>However, none of these prior approaches systematically studied:</p>
         <p indent="1">- reconstruction of cell type specific gene expression profiles validated with experimental data;</p>
         <p indent="1">- sample size effects;</p>
         <p indent="1">- realistic simulation parameter settings derived from appropriate experimental data, with noise conditions as in a typical clinical study;</p>
         <p indent="1">- the power of detection of differential gene expression in comparison with a classical approach;</p>
         <p indent="1">- how to use a deconfounding approach in a classification task.</p>
         <p>These are the core questions which our study aims to address.</p>
         <p>The experimental basis is a gene expression data set of 40 Agilent two-color arrays for two groups of a field study: tuberculosis patients (denoted TB cases) and healthy household contacts with a positive tuberculin skin test (denoted TST+, healthy controls). This dataset is part of the Grand Challenges in Global Health Project: Grant Number 37772, &#8220;Biomarkers of protective immunity against Tuberculosis in the context of HIV/AIDS in Africa&#8221; (funded by the Bill &amp; Melinda Gates Foundation through the Grand Challenges in Global Health Initiative). From each of the enrolled individuals, RNA was prepared from a whole-blood sample. From the same samples, cells with active gene expression, peripheral blood mononuclear cells (PBMC), were isolated and cell type proportions determined. CD3<sup>+</sup>-cells (T-lymphocytes) were enriched in these samples and collected for RNA preparation (for more details on the experimental dataset see Methods). The resulting data contain the proportion and the cell type-specific gene expression profile for the most prominent RNA-containing cell type in blood, as well as the whole blood gene expression signal of the same samples. This design constitutes a valuable validation dataset for testing and further developing an algorithm for deconfounding, as estimated cell type-specific gene expression profiles can be compared to those of FACS-sorted cells.</p>
         <p>In our contribution, we study applicability and optimization of the deconfounding approach for detection of differential regulation of features in a univariate approach, as well as an approach using deconfounding for the classification task, towards identification of biomarker panels in heterogeneous tissues.</p>
      </sec>
      <sec>
         <st>
            <p>Methods</p>
         </st>
         <sec>
            <st>
               <p>Experimental data</p>
            </st>
            <p>Gene expression data are part of the Grand Challenges in Global Health Project: Grant Number 37772, &#8220;Biomarkers of protective immunity against Tuberculosis in the context of HIV/AIDS in Africa&#8221; (funded by the Bill &amp; Melinda Gates Foundation through the Grand Challenges in Global Health Initiative; <url>http://www.biomarkers-for-tb.net/</url>). PBMC from 40 TB cases and from 40 healthy household contact controls were extracted and analyzed by flow cytometry for proportions of CD3<sup>+ </sup>T-lymphocytes and CD14<sup>+ </sup>mononuclear phagocytes as described before <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>. All donors gave informed consent. This study was approved by local ethics committees in Stellenbosch (South Africa) (N05/11/187) and Berlin (EA 10 1/176/07, Germany).</p>
            <p>Signals of gene expression in whole blood as well as in CD3<sup>+ </sup>cells, for the Human Whole Genome Oligo 44K Agilent arrays (GE2_44k_1005) were measured according to manufacturer's protocols. The microarray design was an <it>independent swop design </it>as recommended by Landgrebe <it>et al</it>. <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>: 50% of each group ("TB", "TST+", "TST-") were labelled with Cy3, the other half using Cy5. Pairs for hybridization on an array were chosen to match regarding age and gender. For validation of the deconfounding algorithm we used CD3<sup>+ </sup>proportions. CD3<sup>+ </sup>cells sorted by fluorescence-activated cell sorting (FACS) were subjected to RNA extraction and microarray measurements of gene expression following the same procedure as for the whole blood samples. More details about the observational field study as well as the gene expression dataset will be published separately (see <url>http://www.biomarkers-for-tb.net/publications</url>).</p>
            <p>Gene expression data were normalized using the R package <it>limma </it><abbrgrp><abbr bid="B14">14</abbr></abbrgrp>: background correction used the method <it>normexp </it><abbrgrp><abbr bid="B15">15</abbr></abbrgrp>, lowess normalization was applied to each array (within-array normalization), and quantile normalization was applied to the set of all arrays (between-array normalization), as recommended <abbrgrp><abbr bid="B16">16</abbr></abbrgrp>. As proposed by <abbrgrp><abbr bid="B17">17</abbr></abbrgrp>, gene expression intensities for both groups were obtained as in Equations 4 and 5 by re-parameterizing the normalized log-ratios (M) and mean log-intensities (A) resulting from the <it>limma </it>analysis.</p>
            <p>
               <display-formula id="M4">
                  <graphic file="1471-2105-11-27-i5.gif"/>
               </display-formula>
            </p>
            <p>
               <display-formula id="M5">
                  <graphic file="1471-2105-11-27-i6.gif"/>
               </display-formula>
            </p>
            <p>Summarizing, for each of the 40 TB cases and the 40 healthy household contact controls we were able to analyze gene expression data of whole blood as well as for the sorted CD3<sup>+ </sup>cells of the same samples together with their FACS-measured cell type proportions for the CD3<sup>+ </sup>cell population.</p>
         </sec>
         <sec>
            <st>
               <p>Deconfounding algorithm: implementation and enhancements</p>
            </st>
            <p>The basis of our deconfounding algorithm was implemented as proposed by Venet et al. (2001) <abbrgrp><abbr bid="B10">10</abbr></abbrgrp> and Lahdesmaki et al. (2005) <abbrgrp><abbr bid="B4">4</abbr></abbrgrp> using R <abbrgrp><abbr bid="B18">18</abbr></abbrgrp>:</p>
            <p>
               <monospace>&#160;&#160;&#160;input X and n</monospace>
            </p>
            <p>
                <monospace>&#160;&#160;&#160;normalize columns of X (either centre, or by quantile normalization)</monospace>
            </p>
            <p>
               <monospace>&#160;&#160;&#160;generate start values for S and C</monospace>
            </p>
            <p>
               <monospace>&#160;&#160;&#160;apply constraints to S and C (see below)</monospace>
            </p>
            <p>
               <monospace>(*) fix S, calculate C using lsqnonneg-algorithm</monospace>
            </p>
            <p>
               <monospace>&#160;&#160;&#160;apply constraints for S</monospace>
            </p>
            <p>
               <monospace>&#160;&#160;&#160;fix C, calculate S using lsqnonneg-algorithm</monospace>
            </p>
            <p>
               <monospace>&#160;&#160;&#160;apply constraints for C</monospace>
            </p>
            <p>
               <monospace>&#160;&#160;&#160;if | X - SC | &lt; a or number iterations > b then EXIT and report S and C</monospace>
            </p>
            <p>
               <monospace>&#160;&#160;&#160;else continue at (*)</monospace>
            </p>
            <p>where <it>X </it>is the gene expression matrix measured from heterogeneous tissue (rows: genes, columns: samples) and <it>S </it>and <it>C </it>are as in equation 2; the iteration exit criteria were set to <it>a </it>= 0.1 and <it>b </it>= 100. The non-negative least squares step is implemented as in the MATLAB function lsqnonneg <abbrgrp><abbr bid="B19">19</abbr></abbrgrp>. The constraints are:</p>
            <p indent="1">1. <it>S </it>non-negative and normalized (either centered, or by quantile normalization <abbrgrp><abbr bid="B16">16</abbr></abbrgrp>)</p>
            <p indent="1">2. 0 &#8804; <it>c</it><sub><it>ij </it></sub>&#8804; 1 for all elements of <it>C </it>(cell type <it>i</it>, sample <it>j</it>)</p>
            <p indent="1">3. &#8721;<sub><it>i </it></sub><it>c<sub>ij </sub></it>= 1 for all samples <it>j </it>(i.e. cell type proportions sum to 100%)</p>
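            <p>The alternating scheme can be sketched in a few lines of Python for the special case of two cell types. This is an illustration under stated assumptions, not the R implementation described here: the general lsqnonneg step is replaced by closed-form solutions that exist only for two cell types, the normalization step is omitted, start values for <it>S </it>are passed in by the caller, and all function names are hypothetical.</p>
            <p>
               <monospace><![CDATA[
```python
def solve_proportions(X, S):
    """Fix S: per sample, the c in [0, 1] minimising ||x - c*s1 - (1 - c)*s2||^2."""
    d = [s1 - s2 for s1, s2 in S]
    dd = sum(v * v for v in d)
    first = []
    for k in range(len(X[0])):
        num = sum((X[g][k] - S[g][1]) * d[g] for g in range(len(X)))
        c = num / dd if dd > 0 else 0.5
        first.append(min(1.0, max(0.0, c)))      # constraint 2
    return [first, [1.0 - c for c in first]]     # constraint 3


def solve_signatures(X, C):
    """Fix C: per gene, a two-variable non-negative least squares fit."""
    a11 = sum(c * c for c in C[0])
    a22 = sum(c * c for c in C[1])
    a12 = sum(c1 * c2 for c1, c2 in zip(C[0], C[1]))
    det = a11 * a22 - a12 * a12
    S = []
    for row in X:
        b1 = sum(x * c for x, c in zip(row, C[0]))
        b2 = sum(x * c for x, c in zip(row, C[1]))
        s1 = s2 = -1.0
        if det > 1e-12:
            s1 = (a22 * b1 - a12 * b2) / det
            s2 = (a11 * b2 - a12 * b1) / det
        if s1 < 0 or s2 < 0:
            # Unconstrained optimum is infeasible: search the two boundaries.
            on_axis = [(max(0.0, b1 / a11) if a11 > 0 else 0.0, 0.0),
                       (0.0, max(0.0, b2 / a22) if a22 > 0 else 0.0)]
            s1, s2 = min(on_axis, key=lambda s: sum(
                (x - s[0] * c1 - s[1] * c2) ** 2
                for x, c1, c2 in zip(row, C[0], C[1])))
        S.append([s1, s2])                       # constraint 1 (non-negativity)
    return S


def residual(X, S, C):
    """Frobenius norm of X - S*C."""
    return sum((X[g][k] - S[g][0] * C[0][k] - S[g][1] * C[1][k]) ** 2
               for g in range(len(X)) for k in range(len(X[0]))) ** 0.5


def deconfound(X, S_start, a=0.1, b=100):
    """Alternate the two fits until |X - SC| < a or b iterations are reached."""
    S = [list(row) for row in S_start]
    for iteration in range(1, b + 1):
        C = solve_proportions(X, S)
        S = solve_signatures(X, C)
        if residual(X, S, C) < a:
            break
    return S, C, iteration
```
]]></monospace>
            </p>
            <p>Because each step is an exact constrained fit with the other matrix held fixed, the residual in this sketch cannot increase from one iteration to the next, provided the start values for <it>S </it>are non-negative. The decomposition itself is in general not unique, so the sketch makes no claim about recovering the true profiles.</p>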
            <p>Our implementation is available as an R-package and has additional options for using quantile normalization instead of the global normalization proposed previously <abbrgrp><abbr bid="B10">10</abbr></abbrgrp>. Moreover, it is possible to run the deconfounding on log-values of the normalized intensities or on non-log data. Finally, our implementation does not apply the de-correlation proposed by Venet et al. <abbrgrp><abbr bid="B10">10</abbr></abbrgrp>.</p>
            <p>To assign the right cell type to each of the estimated profiles, our implementation relies on a majority count decision involving the estimated gene expression profiles from <it>n</it><sub>marker </sub>= 9 markers. Five of these markers are considered to be expressed exclusively in a specific cell type (positive marker genes) and the remaining four exclusively <it>not </it>in this cell type (negative marker genes). Marker genes were chosen according to a priori molecular immunological knowledge. For our experimental dataset we used CD3D, CD3E, CD3G, CD2 and CD7 as positive markers, and CD19, FCGR1A, CD14 and MARCO as negative markers for the CD3<sup>+ </sup>T cells.</p>
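            <p>One way to read the majority count decision is sketched below in Python. The exact voting rule is an assumption made for illustration: each positive marker votes for the estimated profile in which it is highest, each negative marker for the profile in which it is lowest, and the profile with most votes is labelled as the cell type in question. The function name and the expression values are hypothetical; only the marker gene names are taken from the text.</p>
            <p>
               <monospace><![CDATA[
```python
def assign_cell_type(S_hat, gene_names, positive, negative):
    """Return (winning column index, votes per column) for the marker vote."""
    votes = [0] * len(S_hat[0])
    for gene, row in zip(gene_names, S_hat):
        if gene in positive:
            votes[row.index(max(row))] += 1   # highest estimated expression
        elif gene in negative:
            votes[row.index(min(row))] += 1   # lowest estimated expression
    return votes.index(max(votes)), votes


positive = {"CD3D", "CD3E", "CD3G", "CD2", "CD7"}
negative = {"CD19", "FCGR1A", "CD14", "MARCO"}
gene_names = ["CD3D", "CD3E", "CD3G", "CD2", "CD7",
              "CD19", "FCGR1A", "CD14", "MARCO"]
# Hypothetical estimated profiles (rows: marker genes, columns: two profiles).
S_hat = [[6.0, 12.0], [6.5, 11.0], [7.0, 11.5], [9.0, 8.0], [6.2, 10.9],
         [11.0, 6.0], [12.0, 6.3], [11.5, 6.1], [10.0, 6.4]]
cd3_column, votes = assign_cell_type(S_hat, gene_names, positive, negative)
```
]]></monospace>
            </p>
            <p>With an odd number of markers and two estimated profiles, the vote in this sketch cannot end in a tie, and a single noisy marker (CD2 in the toy values) does not change the assignment.</p>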
         </sec>
         <sec>
            <st>
               <p>Simulated data</p>
            </st>
            <p>Cell type specific gene expression profiles (columns of the signature matrix <it>S</it>) were simulated according to a gamma distribution such that expectation value and variance were those of the experimental data (shape <it>a </it>= 12.5 and scale <it>b </it>= 0.65):</p>
            <p>
               <display-formula id="M6">
                  <graphic file="1471-2105-11-27-i7.gif"/>
               </display-formula>
            </p>
            <p>As in Venet et al. <abbrgrp><abbr bid="B10">10</abbr></abbrgrp>, biological variance was modeled as a multiplicative error term &#1013;, and technical variance as an additive error term &#1013;. For our experimental data, variation was found to increase with mean signal intensities. Therefore, we decided to model a constant coefficient of variation instead of a constant standard deviation:</p>
            <p>
               <display-formula id="M7">
                  <graphic file="1471-2105-11-27-i8.gif"/>
               </display-formula>
            </p>
            <p>where <it>&#951; </it>= 0.17 and &#1013; ~ N(0, <it>&#967; </it>&#183; <it>I</it><sub>gene</sub>), using <it>&#967; </it>= 0.1 as estimated from our experimental data.</p>
            <p>Gene expression values for negative marker genes had expression <it>X</it><sub>marker, neg </sub>= 6.0, positive marker genes had <it>X</it><sub>marker, pos </sub>= 12.0 in the expressing cell type - as observed for the marker genes in our experimental study. Cell type proportions, <it>C</it><sub>sim</sub>, were drawn from the uniform distribution between cell type specific maximum and minimum values as in our experimental flow cytometry data. The simulated gene expression matrix, <it>X</it><sub>sim</sub>, was calculated from simulated cell type-specific gene expression profiles, <it>S</it><sub>sim</sub>, and simulated cell type proportions, <it>C</it><sub>sim</sub>, corresponding to equation 2:</p>
            <p>
               <display-formula id="M8">
                  <graphic file="1471-2105-11-27-i9.gif"/>
               </display-formula>
            </p>
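            <p>The noise-free part of this simulation can be sketched in Python as follows. The sketch assumes two cell types, draws the proportion of the first cell type uniformly and sets the second to its complement so that proportions sum to one, and leaves out the marker genes and the noise terms. The range 0.4 to 0.8 for the proportions, the seed and the function names are hypothetical; shape and scale of the gamma distribution are those given above.</p>
            <p>
               <monospace><![CDATA[
```python
import random


def simulate_signatures(n_genes, rng, shape=12.5, scale=0.65):
    """Columns of S_sim: gamma-distributed expression for two cell types."""
    return [[rng.gammavariate(shape, scale) for _ in range(2)]
            for _ in range(n_genes)]


def simulate_proportions(n_samples, rng, c_min, c_max):
    """Rows of C_sim: first cell type uniform in [c_min, c_max], second its complement."""
    first = [rng.uniform(c_min, c_max) for _ in range(n_samples)]
    return [first, [1.0 - c for c in first]]


def mix(S, C):
    """Noise-free X_sim = S_sim * C_sim (genes by samples)."""
    return [[s1 * c1 + s2 * c2 for c1, c2 in zip(C[0], C[1])] for s1, s2 in S]


rng = random.Random(2010)                       # fixed seed for reproducibility
S_sim = simulate_signatures(1000, rng)
C_sim = simulate_proportions(100, rng, c_min=0.4, c_max=0.8)  # hypothetical range
X_sim = mix(S_sim, C_sim)
```
]]></monospace>
            </p>
            <p>With shape 12.5 and scale 0.65 the simulated expression values have expectation 12.5 &#183; 0.65 = 8.125, and every noise-free mixed value lies between the two cell type-specific values of the respective gene.</p>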
            <p>To investigate the algorithm's capabilities regarding detection of differential expression of single features and for classification, two groups of gene expression profiles were simulated, e.g. corresponding to TB patients and TST+ controls in our experimental data. We simulated <it>n</it><sub>sample </sub>= 100 individuals in each group. For each gene expression profile <it>n</it><sub>genes </sub>&#8712; {1000, 10000} genes were considered, with <it>n</it><sub>markers </sub>= 10 and <it>n</it><sub>diff </sub>&#8712; {20, 600} differentially expressed biomarkers.</p>
            <p>Differential expression was simulated by adding &#916;<sub>diff </sub>&#8712; {1, 2, 5} to the expression values of the biomarker genes in the first cell type. Figure <figr fid="F3">3</figr> illustrates the generation of simulated profiles.</p>
            <fig id="F3">
               <title>
                  <p>Figure 3</p>
               </title>
               <caption>
                  <p>Simulated cell type-specific gene expression profiles</p>
               </caption>
               <text>
                   <p><b>Simulated cell type-specific gene expression profiles</b>. Left: <it>S </it>matrices for cell-types CD3<sup>+ </sup>and Others. Right: <it>I </it>matrix before and after adding empirical noise.</p>
               </text>
               <graphic file="1471-2105-11-27-3" hint_layout="single"/>
            </fig>
         </sec>
         <sec>
            <st>
               <p>Power study: valid biomarkers with and without deconfounding</p>
            </st>
            <p>We simulated a gene expression experiment with samples mixed out of two cell types (CD3 and other) for 10,000 genes, where 600 genes were differentially expressed. For the differentially expressed genes we simulated all eight possible combinations of NEUTRAL, UP and DOWN across the two cell types, i.e. every pairing except NEUTRAL in both. Sample sizes of the two groups under comparison (analogous to TB and TST<sup>+ </sup>healthy controls) were varied over <it>n</it><sub>samples </sub>&#8712; {4, 10, 20, 40, 80, 120}. Simulated gene expression data were analyzed in the same way as the experimental data. As for the latter, we were able to analyze a simulated whole blood sample (mixture of the two cell types) as well as the two cell type-specific gene expression profiles after deconfounding. Simulated whole blood gene expression data were analyzed using the <it>t</it>-test, ranking candidates for differential expression using <it>p</it>-values and - to enable a direct comparison - considering the 100 top candidates as positive candidates for differential expression. The cell type-specific gene expression profiles (columns of the signature matrix <inline-formula><graphic file="1471-2105-11-27-i10.gif"/></inline-formula>) estimated from deconfounding were ranked using absolute log-fold-change values. Here, too, the 100 top candidates were chosen.</p>
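            <p>The two ranking rules compared in this power study can be sketched in Python as follows. The sketch assumes a pooled-variance two-sample <it>t</it>-test and ranks by the absolute <it>t </it>statistic, which gives the same order as ranking by <it>p</it>-value as long as all genes share the same degrees of freedom; the base of the logarithm does not affect the ranking. Function names and example values are hypothetical.</p>
            <p>
               <monospace><![CDATA[
```python
import math


def t_statistic(a, b):
    """Two-sample Student t statistic with pooled variance."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    ss = sum((v - ma) ** 2 for v in a) + sum((v - mb) ** 2 for v in b)
    se = math.sqrt(ss / (na + nb - 2) * (1.0 / na + 1.0 / nb))
    if se == 0.0:
        return 0.0 if ma == mb else math.copysign(math.inf, ma - mb)
    return (ma - mb) / se


def top_by_t_test(X_a, X_b, k):
    """Indices of the k genes with the largest |t| in the mixed (whole blood) data."""
    scores = [abs(t_statistic(ra, rb)) for ra, rb in zip(X_a, X_b)]
    return sorted(range(len(scores)), key=lambda g: -scores[g])[:k]


def top_by_log_fold_change(s_a, s_b, k):
    """Indices of the k genes with the largest |log2(s_a / s_b)| in one estimated profile."""
    scores = [abs(math.log2(x / y)) for x, y in zip(s_a, s_b)]
    return sorted(range(len(scores)), key=lambda g: -scores[g])[:k]


# Hypothetical mixed data: 3 genes, 4 samples per group; only gene 2 differs.
X_a = [[1.0, 1.1, 0.9, 1.0], [5.0, 5.2, 4.8, 5.1], [3.0, 3.1, 2.9, 3.0]]
X_b = [[1.0, 0.9, 1.1, 1.0], [5.1, 4.9, 5.0, 5.2], [6.0, 6.1, 5.9, 6.0]]
top_mixed = top_by_t_test(X_a, X_b, k=1)

# Hypothetical estimated profiles of one cell type in the two groups.
top_profile = top_by_log_fold_change([8.0, 4.0, 2.0], [8.0, 1.0, 4.0], k=2)
```
]]></monospace>
            </p>
            <p>Counting how many of the selected indices belong to the truly differentially expressed genes then allows both rankings to be compared on equal terms, since both return the same number of candidates.</p>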
         </sec>
         <sec>
            <st>
               <p>Classification in the case of reversely regulated differentially expressed biomarkers</p>
            </st>
            <p>The worst-case scenario for biomarker detection in heterogeneous tissues arises when cell types involved express differentially regulated biomarkers in opposite directions. In this case, in the tissue RNA isolate, signals for differential expression likely cancel each other and hamper detection of respective biomarkers markedly. To identify a possible exit strategy, we conducted a simulation study for this worst-case scenario, again considering noise values estimated from the experimental data in this study.</p>
            <p>To exemplify the worst-case classification task, we simulated differential gene expression as above, but also subtracted the same value from the expression values of the second cell type. This way, for all cells in the mixture averaged over all samples, no differential expression is expected, while for the single cell types it is more or less evident. Gene expression profiles for new samples, for validation of the trained classifiers in the classification scenario, were generated using the identical signature matrices, <it>S</it><sub>sim</sub>, as for the training step, but with new values for the concentration matrices as well as for the noise term realizations.</p>
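            <p>The cancellation can be checked with a short calculation, sketched here in Python under stated assumptions: two cell types, a shift &#916; added to the first and subtracted from the second, and hypothetical proportions of the first cell type whose mean is 0.5. For a sample with proportion <it>c </it>the mixed signal then shifts by &#916;<it>c </it>- &#916;(1 - <it>c</it>) = &#916;(2<it>c </it>- 1), which averages to zero only under this assumed balance of proportions.</p>
            <p>
               <monospace><![CDATA[
```python
def mixture_shift(delta, c):
    """Shift of the mixed signal when cell type 1 gains delta and cell type 2 loses delta."""
    return delta * c - delta * (1.0 - c)


delta = 2.0
proportions = [0.3, 0.4, 0.5, 0.6, 0.7]          # hypothetical, mean 0.5
shifts = [mixture_shift(delta, c) for c in proportions]
mean_shift = sum(shifts) / len(shifts)           # close to zero
```
]]></monospace>
            </p>
            <p>Individual samples still show a shift of either sign depending on their proportions, which is the information a deconfounding-based approach can exploit while a comparison of group means in the mixture cannot.</p>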
            <sec>
               <st>
                  <p>Canonical classification approach</p>
               </st>
               <p>For feature selection, <it>t</it>-tests were used to identify biomarker candidates from the simulated heterogeneous tissue gene expression data: the top <it>n</it><sub>cand </sub>&#8712; {10, 20} candidates were chosen to train a linear discriminant function as classifier. Classification errors in a validation step (500 newly simulated cases) for this classical classification approach were then compared to those of a deconfounding ranking approach, which is described in the following.</p>
            </sec>
            <sec>
               <st>
                  <p>Deconfounding ranking approach</p>
               </st>
               <p>For the training dataset, a deconfounding analysis was run and the <it>n</it><sub>cand </sub>top-ranked candidates for differential expression were picked based on gene-wise mean absolute differences between the corresponding columns of the estimated signature matrices, <inline-formula><graphic file="1471-2105-11-27-i10.gif"/></inline-formula>, for the two groups. In addition, using the simulated whole blood expression data, <it>X</it><sub>sim</sub>, from the training dataset, a random forest predictor was trained to estimate the cell type proportions <inline-formula><graphic file="1471-2105-11-27-i11.gif"/></inline-formula> that resulted from running the deconfounding algorithm on the same training data <abbrgrp><abbr bid="B20">20</abbr><abbr bid="B21">21</abbr></abbrgrp>.</p>
               <p>Input to this statistical learning step were the gene expression data in <it>X</it><sub>sim </sub>for the <it>n</it><sub>markers </sub>= 20 marker genes. For each new individual during the validation part of the study, cell type proportions were estimated from the simulated whole blood gene expression profile using the trained random forest. Deconfounding results <inline-formula><graphic file="1471-2105-11-27-i10.gif"/></inline-formula> of the training dataset for the two groups A and B were then multiplied with the estimated cell type proportions for the new individual, resulting in group-specific gene expression profiles <inline-formula><graphic file="1471-2105-11-27-i12.gif"/></inline-formula> and <inline-formula><graphic file="1471-2105-11-27-i13.gif"/></inline-formula> with the cell type proportions of the sample in question. The actual gene expression signals of the sample at the chosen <it>n</it><sub>cand </sub>biomarker loci were then compared to these group-specific gene expression profiles and the following summary score was computed:</p>
               <p>
                  <display-formula id="M9">
                     <graphic file="1471-2105-11-27-i14.gif"/>
                  </display-formula>
               </p>
               <p>Classification was based on choosing the group for which <it>&#947;</it><sub>group </sub>was minimal.</p>
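               <p>The decision rule can be illustrated with a minimal sketch, written here in Python rather than taken from the <monospace>deconf</monospace> package. It assumes, purely for illustration, that the summary score is the sum of absolute differences between the observed signal and the expected group-specific signal at the candidate loci; the exact score is the one given by the display formula above. The names <monospace>expected_profile</monospace>, <monospace>summary_score</monospace> and <monospace>classify</monospace>, as well as all numbers, are hypothetical.</p>

```python
def expected_profile(signature, proportions):
    """Mix cell type-specific profiles into one expected signal per gene.

    signature: list of rows, one per gene, each row holding the estimated
    expression in every cell type; proportions: one weight per cell type.
    """
    return [sum(s * c for s, c in zip(row, proportions)) for row in signature]


def summary_score(observed, expected, loci):
    """Assumed score (illustration only): summed absolute deviation at the loci."""
    return sum(abs(observed[g] - expected[g]) for g in loci)


def classify(observed, signatures, proportions, loci):
    """Return the group whose expected profile gives the minimal score."""
    scores = {
        group: summary_score(observed, expected_profile(sig, proportions), loci)
        for group, sig in signatures.items()
    }
    return min(scores, key=scores.get), scores


# Toy example: 3 genes, 2 cell types. In group A, genes 0 and 2 are
# regulated in opposite directions in the two cell types relative to group B.
signatures = {
    "A": [[10.0, 6.0], [8.0, 8.0], [5.0, 9.0]],
    "B": [[8.0, 8.0], [8.0, 8.0], [7.0, 7.0]],
}
proportions = [0.7, 0.3]    # hypothetical estimate for the new sample
observed = [8.9, 8.1, 6.1]  # hypothetical new whole blood profile
label, scores = classify(observed, signatures, proportions, loci=[0, 2])
```

               <p>With equal cell type proportions the two toy groups would be indistinguishable at these loci, which is why the sample-specific proportion estimate enters the expected profile before the comparison.</p>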
            </sec>
         </sec>
         <sec>
            <st>
               <p>Implementation and availability</p>
            </st>
            <p>The <monospace>R</monospace>-package <monospace>deconf</monospace> implementing the deconfounding algorithm and options, <monospace>R</monospace>-scripts for data simulation and data analysis, and an anonymized part of the experimental dataset are available as additional file <supplr sid="S1">1</supplr> (Windows R-package) and additional file <supplr sid="S2">2</supplr> (tar-gz archive).</p>
            <suppl id="S1">
               <title>
                  <p>Additional file 1</p>
               </title>
               <text>
                  <p><b>R-package </b><monospace><b>deconf</b></monospace><b> (Windows) including example data and script</b>. <it>R</it>-package <monospace>deconf</monospace> (Windows version) which implements the deconfounding algorithm together with options for normalization, run-time options for the iteration process, and number of cell type-specific gene expression profiles to be estimated. Also, some toy examples and part of the experimental dataset are included together with executable example scripts for demonstration purposes.</p>
               </text>
               <file name="1471-2105-11-27-S1.ZIP">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <suppl id="S2">
               <title>
                  <p>Additional file 2</p>
               </title>
               <text>
                  <p><b>R-package </b><monospace><b>deconf</b></monospace><b> (tar-gz archive)</b>. <it>R</it>-package <monospace>deconf</monospace> (tar-gz archive).</p>
               </text>
               <file name="1471-2105-11-27-S2.GZ">
                  <p>Click here for file</p>
               </file>
            </suppl>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Results</p>
         </st>
         <p>As the experimental data offered gene expression profiles for whole blood, i.e. a heterogeneous tissue which is a mixture of several cell types, and in addition the gene expression profiles from CD3<sup>+ </sup>cells of the same samples, and the respective CD3<sup>+ </sup>proportions (determined by FACS), we were able to use this information as a basis for a validation study for the proposed deconfounding algorithm.</p>
         <p>In addition, to methodologically optimize the deconfounding algorithm and to investigate its usability for detecting differentially expressed genes and biomarkers suitable for classification of new patients (with only whole blood expression profiles measured), we had to rely on simulation studies.</p>
         <p>Summarizing, our study was designed to answer four questions. For which data scale and algorithm settings do we achieve:</p>
         <p indent="1">- The best estimate of cell type-specific expression profiles (columns of signature matrix)? Data basis: experimental data.</p>
         <p indent="1">- The best marker-based identification of reconstructed cell type-specific gene expression profiles (columns of <inline-formula><graphic file="1471-2105-11-27-i10.gif"/></inline-formula>)? Data basis: simulation study (parameters estimated from experimental data).</p>
         <p indent="1">- The largest power to detect differential expression? Data basis: simulation study (parameters estimated from experimental data).</p>
         <p indent="1">- The smallest prediction errors for the classification task? Data basis: simulation study (parameters estimated from experimental data).</p>
         <sec>
            <st>
               <p>Reconstruction of cell type-specific gene expression profiles and cell type proportions in experimental data</p>
            </st>
            <p>The deconfounding algorithm was applied to the whole blood gene expression matrices for both groups of individuals (TB and TST<sup>+</sup>), using either quantile normalization or global mean normalization, for both log- and non-log-intensities. The number of cell types was set to <it>n</it><sub>CT </sub>= 2. Deconfounding results - estimated cell type-specific gene expression profiles <inline-formula><graphic file="1471-2105-11-27-i10.gif"/></inline-formula> as well as cell type proportions <inline-formula><graphic file="1471-2105-11-27-i11.gif"/></inline-formula> - could be compared to the actual experimental data (see figure <figr fid="F4">4</figr>, figure <figr fid="F5">5</figr> and table <tblr tid="T1">1</tblr>):</p>
            <tbl id="T1">
               <title>
                  <p>Table 1</p>
               </title>
               <caption>
                  <p>Profile reconstruction versus differential gene expression: alternatives for deconfounding algorithm settings</p>
               </caption>
               <tblbdy cols="5">
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="4" ca="center">
                        <p>
                           <b>Optimal deconfounding algorithm settings</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="5">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>
                           <b>log/quant</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>log/not quant</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>not log/quant</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>not log/not quant</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="5">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>cor reconstr</p>
                     </c>
                     <c ca="center">
                        <p>0.86</p>
                     </c>
                     <c ca="center">
                        <p>0.85</p>
                     </c>
                     <c ca="center">
                        <p>0.73</p>
                     </c>
                     <c ca="center">
                        <p>0.68</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>DGE power</p>
                     </c>
                     <c ca="center">
                        <p>0.47</p>
                     </c>
                     <c ca="center">
                        <p>0.46</p>
                     </c>
                     <c ca="center">
                        <p>0.70</p>
                     </c>
                     <c ca="center">
                        <p>0.62</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>Correlations of measured and estimated cell type-specific gene expression profiles ("cor reconstr.") as well as power for detection of differential expression ("DGE power", see text) -- for all four combinations of using logs or not, applying quantile or global mean normalization, respectively, for the deconfounding algorithm.</p>
               </tblfn>
            </tbl>
            <fig id="F4">
               <title>
                  <p>Figure 4</p>
               </title>
               <caption>
                  <p>Validation of estimated gene expression profiles</p>
               </caption>
               <text>
                  <p><b>Validation of estimated gene expression profiles</b>. Validation of gene expression profile estimates with experimental data from FACS-sorted CD3<sup>+</sup> cells. Left panel: measured gene expression intensities for CD3<sup>+</sup> cells versus intensities estimated for cell type 1. Right panel: measured gene expression intensities for CD3<sup>+</sup> cells versus intensities estimated for cell type 2.</p>
               </text>
               <graphic file="1471-2105-11-27-4" hint_layout="single"/>
            </fig>
            <fig id="F5">
               <title>
                  <p>Figure 5</p>
               </title>
               <caption>
                  <p>Validation of estimated cell type proportions</p>
               </caption>
               <text>
                  <p><b>Validation of estimated cell type proportions</b>. Validation of cell type proportion estimates with experimental data (FACS counts for CD3<sup>+</sup> cells). A: measured proportion of CD3<sup>+ </sup>cells versus estimated proportions for cell type 1. B: measured proportion of CD3<sup>+ </sup>cells versus estimated proportions for cell type 2. C: measured proportion of non-CD3<sup>+</sup> cells versus estimated proportions for cell type 1. D: measured proportion of non-CD3<sup>+</sup> cells versus estimated proportions for cell type 2. Linear regression lines are displayed in red.</p>
               </text>
               <graphic file="1471-2105-11-27-5" hint_layout="single"/>
            </fig>
            <p>Figure <figr fid="F4">4</figr> displays mean values of the measured CD3<sup>+ </sup>expression profile in TB patients against both estimated columns of the signature matrix <inline-formula><graphic file="1471-2105-11-27-i10.gif"/></inline-formula>.</p>
            <p>For the displayed example in figure <figr fid="F4">4</figr>, non-log data were quantile normalized: Experimental data show considerable variation when compared to the estimates after deconfounding. As expected, cell type 1 is evidently better correlated with the experimental CD3<sup>+ </sup>profile than cell type 2. The correlation is best for large expression values.</p>
            <p>Also, referring to figure <figr fid="F5">5</figr>, even though there is lower correlation between experimental and estimated cell type proportions, the indicated regression lines in the scatter plots for experimental and estimated proportions show the correct tendencies for the respective cell types.</p>
            <p>Table <tblr tid="T1">1</tblr> (first row) depicts correlations of mean measured profiles with the estimates from deconfounding results for the comparison between the four methodological algorithmic alternatives.</p>
         </sec>
         <sec>
            <st>
               <p>Deconfounding quality as function of sample size (simulation study)</p>
            </st>
            <p>To investigate the influence of sample size on the quality of deconfounding results, we had to rely on simulation studies which were aimed at mirroring experimental data distribution and noise as realistically as possible. Figure <figr fid="F6">6</figr> (middle panel) shows the simulation results for <it>n</it><sub>sample </sub>= 20, which approximates the sample size for the GC6 experimental data (cf. figure <figr fid="F4">4</figr>) and is also a typical value for this type of clinical study involving high-throughput analyses. The effect of sample size is clearly visible when comparing simulation results for <it>n</it><sub>sample </sub>= 4 (figure <figr fid="F6">6</figr>, left) and <it>n</it><sub>sample </sub>= 120 (figure <figr fid="F6">6</figr>, right).</p>
            <fig id="F6">
               <title>
                  <p>Figure 6</p>
               </title>
               <caption>
                  <p>Profile estimates for simulated data</p>
               </caption>
               <text>
                  <p><b>Profile estimates for simulated data</b>. Gene expression profile estimates for simulated data, realistic noise, quantile normalisation, and sample sizes of 4 (left panel), 20 (middle panel), or 120 samples (right panel).</p>
               </text>
               <graphic file="1471-2105-11-27-6" hint_layout="single"/>
            </fig>
         </sec>
         <sec>
            <st>
               <p>Cell type assignment using markers (simulation study)</p>
            </st>
            <p>The deconfounding algorithm itself does not assign a cell type to the estimated cell type-specific expression profiles (columns of <inline-formula><graphic file="1471-2105-11-27-i10.gif"/></inline-formula>). Therefore, to find out in which of the two possible orders the two estimated cell type profiles (CD3<sup>+ </sup>and others) appear, one has to rely on expression signals of cell type-specific markers. For the analysis of the experimental data, such markers were chosen based on <it>a priori </it>immunological knowledge. In our simulation studies, we simulated 5-10 positive CD3<sup>+ </sup>marker genes, which were expressed at high levels (simulated level for <it>X</it><sub>marker, CD3 </sub>= 12), whereas these marker genes showed a low mean expression in the alternative cell type (simulated level for <it>X</it><sub>marker, other </sub>= 6). These expression levels were chosen as observed in the experimental data. Another group of marker genes was simulated in the reverse manner. Figure <figr fid="F7">7</figr> shows the distributions of estimated marker gene expression levels from simulated data after deconfounding employing global mean (A) or quantile normalization (B). Here, use of the robust quantile normalization paid off for this critical step: without the possibility to assign the right cell types, the analysis as a whole fails. It is also evident that marker gene expression levels were estimated mostly correctly regarding relative values in both cell types, whereas absolute gene expression levels were scaled down in the estimates. However, to use these estimated cell type-specific marker gene expression levels for assigning the right cell types, it is only necessary that positive markers have top expression levels in the cell type exclusively expressing them.</p>
            <fig id="F7">
               <title>
                  <p>Figure 7</p>
               </title>
               <caption>
                  <p>Cell type-specific assignment using markers</p>
               </caption>
               <text>
                  <p><b>Cell type-specific assignment using markers</b>. Mean and range for marker intensities after deconfounding without (A) and with quantile normalization (B). Cell types CD3<sup>+</sup> (red) and Other (blue). The first five marker positions are positive markers (exclusively expressed in CD3<sup>+</sup>), the remaining five are negative markers (not expressed in CD3<sup>+</sup>).</p>
               </text>
               <graphic file="1471-2105-11-27-7" hint_layout="single"/>
            </fig>
         </sec>
         <sec>
            <st>
               <p>Valid detection of cell type-specific differential gene expression (simulation study)</p>
            </st>
            <p>Because we want to study the use of deconfounding for biomarker discovery, our power study compared the <it>t</it>-test and our deconfounding approach regarding their power to detect differential gene expression (candidate biomarkers). Figure <figr fid="F8">8</figr> shows the central results: <it>t</it>-test and deconfounding approach show comparable results for higher sample sizes (40 &#8804; <it>n</it><sub>sample </sub>&#8804; 120) in cases A and B, in which differential gene expression either has the same direction in both cell types or occurs in one cell type only. However, for small sample sizes in all cases, and in figure <figr fid="F8">8C</figr> also for large sample sizes, the deconfounding ranking approach detects more of the truly differentially expressed genes than the <it>t</it>-test. For this worst-case scenario (figure <figr fid="F8">8C</figr>), in which the differential expression signals of the cell types involved cancel each other, we therefore went on to assess the deconfounding ranking approach for the <it>classification objective</it>.</p>
            <fig id="F8">
               <title>
                  <p>Figure 8</p>
               </title>
               <caption>
                  <p>Power comparison for detection of differential gene expression</p>
               </caption>
               <text>
                  <p><b>Power comparison for detection of differential gene expression</b>. Power comparison for detection of differentially expressed genes in the simulation study with realistic noise. Three cases: (A) gene is up-regulated in both cell types, CD3<sup>+</sup> and Other; (B) up-regulation only in the CD3<sup>+</sup> cell type, no regulation in Other; (C) up-regulation in CD3<sup>+</sup>, but down-regulation in Other. Mean values and range of numbers of detected candidates are displayed.</p>
               </text>
               <graphic file="1471-2105-11-27-8" hint_layout="single"/>
            </fig>
            <p>We define the power for detection of differential expression ("DGE power") as the proportion of truly differentially expressed genes among the 100 top-ranked candidates. Table <tblr tid="T1">1</tblr> (second row) lists this power for the four algorithmic alternatives. Choosing quantile normalization for intensity values and using non-log values gives optimal results.</p>
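            <p>The "DGE power" measure is simple enough to state as code. The following Python sketch is an illustration only, not part of the <monospace>deconf</monospace> package; the function name <monospace>dge_power</monospace>, the gene labels and the scores are hypothetical, and the ranking statistic is assumed to be one where larger values indicate stronger candidates (for example an absolute log-fold-change).</p>

```python
def dge_power(scores, truly_de, top_k=100):
    """Fraction of the top_k highest-scoring genes that are truly DE.

    scores: dict mapping gene to a ranking statistic (larger = stronger
    candidate); truly_de: set of genes simulated as differentially expressed.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    top = ranked[:top_k]
    hits = sum(1 for gene in top if gene in truly_de)
    return hits / len(top)


# Toy example with 6 genes, taking the top 4 as candidates.
scores = {"g1": 2.1, "g2": 0.2, "g3": 1.7, "g4": 0.9, "g5": 1.2, "g6": 0.1}
truly_de = {"g1", "g3", "g4", "g6"}
power = dge_power(scores, truly_de, top_k=4)
```

            <p>Because the number of candidates is fixed, this quantity describes the quality of a ranking rather than the power of a test at a given significance level, which is what allows the <it>t</it>-test ranking and the deconfounding ranking to be compared directly.</p>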
         </sec>
         <sec>
            <st>
               <p>Applying the deconfounding approach for classification (simulation study)</p>
            </st>
            <p>As an important objective is to find biomarkers from the estimated cell type specific gene expression signatures resulting from the deconfounding, we have to show how such biomarkers could be applied to a <it>new </it>patient's whole blood expression dataset. The deconfounding algorithm results in estimates for the signature matrix and the concentration matrix for a given group of samples. In our case, the procedure uses simulated gene expression profiles of 40 individuals (per study group) to estimate two cell type-specific gene expression profiles (CD3+ and NotCD3+). It is, however, not possible to use a single individual's profile for deconfounding, as for a single case there is no information available about how a change in cell type proportions influences measured gene expression signals. To enable the use of the deconfounding results for classification of a new individual, we have to either measure or estimate a single individual's cell type proportions. To estimate cell type proportions from a single whole blood expression profile we employed a random forest machine to learn to predict cell type proportions from simulated whole blood gene expression data using the training dataset and the deconfounding estimates of <inline-formula><graphic file="1471-2105-11-27-i11.gif"/></inline-formula>. For a new individual, this trained random forest was then used to estimate cell type proportions.</p>
            <p>These were multiplied with the group-specific signature matrices estimated by deconfounding from the two groups in the training data. The resulting group-specific gene expression matrices - based on cell type proportions as in the new individual - were used in a majority-vote comparison approach and the individual was classified accordingly.</p>
            <p>We show that this deconfounding ranking approach significantly improves classification results regarding prediction error rates if the differential expression of a biomarker panel relies on genes that are regulated in opposite directions in the cell types involved. Figure <figr fid="F9">9</figr> shows distributions of classification errors in 100 validation runs. Clearly, the <it>t</it>-test-LDA approach is not better than mere guessing, whereas - depending on noise and numbers of differentially expressed genes - the deconfounding ranking approach correctly classifies most of the simulated cases.</p>
            <fig id="F9">
               <title>
                  <p>Figure 9</p>
               </title>
               <caption>
                  <p>Applying the deconfounding approach for classification: defining biosignatures in a simulated scenario</p>
               </caption>
               <text>
                  <p><b>Applying the deconfounding approach for classification: defining biosignatures in a simulated scenario</b>. Classification error rates in a simulated scenario using realistic empirical noise and differentially expressed genes, which are reversely regulated in a mixture of two cell-types for a <it>t</it>-test-LDA approach (blue boxes) and a deconfounded-biosignature approach (red boxes). Boxplots comprise median, 1<sup>st </sup>and 3<sup>rd </sup>quartiles, as well as the 95% confidence interval (assuming normality).</p>
               </text>
               <graphic file="1471-2105-11-27-9" hint_layout="single"/>
            </fig>
         </sec>
         <sec>
            <st>
               <p>Predicting cell type proportions in a single whole blood profile in experimental data</p>
            </st>
            <p>We also regressed cell type proportions on marker gene expression (CD3G and MARCO) in the experimental whole blood dataset and achieved a correlation of <it>r </it>= 0.34 between the true and the estimated proportions of CD3+ cells in leave-one-out samples. Figure <figr fid="F10">10</figr> shows a scatterplot of the leave-one-out samples and their estimated proportions, as well as the distribution of correlations obtained with 200 permuted values for the cell type proportions. Prediction is significant, and its precision is comparable to what the deconfounding is able to reproduce in the simulated data (compare figure <figr fid="F5">5</figr>).</p>
            <fig id="F10">
               <title>
                  <p>Figure 10</p>
               </title>
               <caption>
                  <p>Prediction of cell type proportions of single whole blood marker expression profiles</p>
               </caption>
               <text>
                  <p><b>Prediction of cell type proportions of single whole blood marker expression profiles</b>. A leave-one-out cross validation approach was used to predict cell type proportions (CD3+) in single samples from their marker gene expression profiles (CD3G and MARCO). A: Scatterplot of estimated CD3+ proportions against true proportions (<it>r </it>= 0.34). B: Significance of this prediction precision based on 200 permutations of the true CD3+ proportions and identical analysis as for (A).</p>
               </text>
               <graphic file="1471-2105-11-27-10" hint_layout="single"/>
            </fig>
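            <p>The permutation check behind such a significance statement can be sketched as follows. This is a hedged illustration in Python, not the <monospace>R</monospace> code used in the study: the helper names <monospace>pearson</monospace> and <monospace>permutation_p_value</monospace> are hypothetical, the proportions are made-up numbers, and the leave-one-out regression that would produce the predictions is assumed to have been run beforehand.</p>

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def permutation_p_value(predicted, true, n_perm=200, seed=0):
    """One-sided p-value: share of permutations with r at least as large."""
    rng = random.Random(seed)
    observed = pearson(predicted, true)
    shuffled = list(true)
    at_least = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the pairing of predictions and truth
        if pearson(predicted, shuffled) >= observed:
            at_least += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return observed, (at_least + 1) / (n_perm + 1)


# Hypothetical proportions, for illustration only (not study data).
true = [0.12, 0.18, 0.22, 0.25, 0.31, 0.35, 0.40, 0.44]
predicted = [0.15, 0.17, 0.26, 0.21, 0.30, 0.38, 0.36, 0.47]
r, p = permutation_p_value(predicted, true)
```

            <p>Permuting the true proportions while keeping the predictions fixed yields a reference distribution of correlations under the hypothesis of no association, against which the observed correlation is compared.</p>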
         </sec>
      </sec>
      <sec>
         <st>
            <p>Discussion</p>
         </st>
         <p>Tissue heterogeneity thwarts valid interpretation of gene expression results and hampers detection of differential expression, especially for cell type-specific regulation in opposite directions; it hence represents a major obstacle to the definition of biomarkers in difficult cases. We propose a modified version of an <it>in-silico </it>deconfounding ranking approach which estimates cell type-specific gene expression profiles from tissue expression data, even under realistic noisy conditions. We were able to validate these results with experimental data, both from heterogeneous tissue (peripheral blood) and sorted cells. In a realistically simulated example we show how deconfounding ranking can help in detecting differential gene expression in heterogeneous tissues. We developed an approach to use deconfounding results for the task of finding biomarker candidates for classification of a new patient on the basis of the patient's whole blood gene expression profile and information about the patient's cell type proportions (either predicted or measured). This way, deconfounding ranking can propose biomarker signatures even in the worst-case scenario where biomarkers are regulated in opposite directions in different tissue cell types under investigation. The resulting tissue-specific biomarkers can be considered as an initial step towards the identification of candidate biomarkers for classification. Clearly, any candidate molecular biomarker has to be tested against existing markers, especially clinical markers, and demonstrate a diagnostic or prognostic gain. In our contribution, however, we targeted the principal problem of detecting molecular biomarkers in heterogeneous tissue. Our experimental example and the simulation studies demonstrate the problem of confounding cell type proportions and a solution using the <it>in-silico </it>deconfounding approach. Our results show that by estimating cell type proportions and cell type-specific gene expression patterns, the search for biomarker candidates for classification can be significantly enhanced.</p>
         <sec>
            <st>
               <p>Significance and applicability of the proposed deconfounding ranking approach</p>
            </st>
            <p>For the purpose of biomarker detection, homogeneous cell populations are not generally a prerequisite as there may be markers so clear that their signal can be read in spite of the considerable variation introduced by tissue heterogeneity. This is mostly a desired result. However, especially in experiments where biomarkers are sought for cases which are not easily separable otherwise (e.g. prospective studies), they might be detected better after taking tissue heterogeneity into account - with our work and manuscript we want to propose an approach for such cases.</p>
            <p>Others have implemented and studied principles of <it>in-silico </it>deconfounding <abbrgrp><abbr bid="B4">4</abbr><abbr bid="B8">8</abbr><abbr bid="B9">9</abbr><abbr bid="B10">10</abbr><abbr bid="B11">11</abbr><abbr bid="B12">12</abbr><abbr bid="B22">22</abbr></abbrgrp>, but our study is the first to combine the following contributions. It:</p>
            <p indent="1">- validates <it>in-silico </it>deconfounding results using experimental data of a molecular field study;</p>
            <p indent="1">- implements a realistic simulation study with noise parameters estimated from the experimental dataset;</p>
            <p indent="1">- systematically investigates the influence of sample size on quality of estimated cell type specific gene expression profiles;</p>
            <p indent="1">- compares the power to detect differential expression (i.e. univariate biomarker candidates) with a classical <it>t</it>-test approach;</p>
            <p indent="1">- optimizes the deconfounding algorithm employing a quantile normalization step as well as marker-assisted cell type profile recognition under realistic noise conditions;</p>
            <p indent="1">- proposes a classification approach using the results of a deconfounding ranking analysis and compares these results with a classical <it>t</it>-test-LDA approach for the worst-case scenario of biomarkers regulated in opposite directions.</p>
            <p>Our results show that, even under the noisy, realistic conditions of a molecular field study - involving field-collected whole blood samples and considerable variation between enrolled individuals - the deconfounding ranking approach using non-log, quantile-normalized gene expression data from whole-blood RNA can facilitate identification of valid differential gene expression signals. These biomarker candidates can then be used in a classification approach which - for the case where biomarkers are regulated in opposite directions in different cell types - is far more powerful than canonical discriminant analysis. In the applied clinical situation, our approach will of course be no more than an initial step for the identification of candidate biomarkers for classification, which would then be entered into further validation studies before becoming applicable for cost-efficient clinical routine diagnostics.</p>
         </sec>
         <sec>
            <st>
               <p>Methodological constraints and requirements</p>
            </st>
            <p>A critical prerequisite of our deconfounding approach is that, in principle, we assume independence of a cell type-specific gene expression profile and the proportion of the respective cell type within the heterogeneous tissue. Figure <figr fid="F2">2C</figr> illustrates the unfavorable case in which gene expression on the single-cell level is regulated as a function of the expressing cell type's proportion in the tissue. It is conceivable that such a regulation is indeed real for some genes - and this would not only blur estimates of cell type-specific gene expression profiles, but also produce false estimates for these specific genes. As shown in our validation study, however, in general the independence assumption does not lead to false results for the estimated profile as a whole. Thus, biosignature detection will still be enhanced by use of deconfounding ranking even if the independence assumption for single-cell gene expression and cell type proportion does not hold in every respect.</p>
            <p>Some methodological choices in our study are illustrative rather than definitive, and further investigations are called for. The normalization procedure clearly influences both the quality of cell type-specific profile reconstruction and the power to detect differential expression. We chose quantile normalization because the original overall mean normalization used by Venet et al. <abbrgrp><abbr bid="B10">10</abbr></abbrgrp> led to poor recognition of cell types from marker gene expression signals: single outlier measurements could significantly shift the whole profile, thus thwarting cell-type identification. Quantile normalization resulted in more robust and reliable marker-assisted cell type recognition. The algorithm's capability to reconstruct cell type-specific gene expression profiles could be improved if the starting profiles for the iterative optimization were seeded with an approximate guess of the specific cell type profile. Such information could be provided by FACS analysis, or by expression profiles available in the literature (see for example the work of Watkins et al., 2009 <abbrgrp><abbr bid="B23">23</abbr></abbrgrp>). Caution is necessary, however, to avoid unduly influencing study group-specific differences. Averaging multiple deconfounding optimization runs could also stabilize the resulting predicted cell-type profiles. Here as well, detailed studies are necessary.</p>
            <p>Estimates of cell type-specific gene expression profiles were optimal when deconfounding was run on log-intensities, whereas detection of differential expression was optimal with non-log input values. We can only speculate about the reasons for this difference. Possibly, non-log inputs filter out or down-weight small expression values, which in turn often play a minor role in differential expression.</p>
            <p>For the simulated worst-case scenarios, i.e. genes which are reciprocally regulated in the participating cell types, the deconfounding ranking approach produced promising results, both for obtaining valid estimates of differential gene expression and for the classification task. However, the existing implementation could be improved by adding a bootstrap test for differential expression, so that it yields not only a ranking of candidates for differential expression but also an estimate of the number of differentially expressed features. A first approach could be to draw bootstrap samples and compute 95% confidence intervals as quantiles of the bootstrap distribution of the resulting estimates for <inline-formula><graphic file="1471-2105-11-27-i15.gif"/></inline-formula> (<it>b </it>denoting a bootstrap index). Such a bootstrap approach could also enable gene set enrichment analysis with currently available methods (e.g. <abbrgrp><abbr bid="B24">24</abbr><abbr bid="B25">25</abbr></abbrgrp>).</p>
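            <p>A percentile bootstrap of this kind could be sketched as follows in Python. This is an illustration under stated assumptions, not an existing part of our implementation: the statistic used here (the difference of group means for a single gene) is only a placeholder for the deconfounded estimate referred to above, and the names <it>bootstrap_ci</it>, <it>percentile</it> and <it>is_candidate</it> as well as the toy values are hypothetical.</p>

```python
import random
import statistics


def percentile(sorted_values, q):
    """Quantile q (between 0 and 1) of a sorted list, by linear interpolation."""
    pos = q * (len(sorted_values) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] * (1 - frac) + sorted_values[upper] * frac


def bootstrap_ci(group_a, group_b, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for a placeholder statistic.

    The statistic is mean(group_a) - mean(group_b) for one gene; in practice
    it would be replaced by the estimate obtained after deconfounding.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        # Resample individuals with replacement within each study group.
        resample_a = rng.choices(group_a, k=len(group_a))
        resample_b = rng.choices(group_b, k=len(group_b))
        estimates.append(
            statistics.fmean(resample_a) - statistics.fmean(resample_b)
        )
    estimates.sort()
    alpha = 1 - level
    return percentile(estimates, alpha / 2), percentile(estimates, 1 - alpha / 2)


def is_candidate(ci):
    """Flag a feature whose bootstrap interval excludes zero."""
    lower, upper = ci
    return lower > 0 or 0 > upper


# Toy expression values for one gene in two study groups (made up).
cases = [10.0, 11.0, 12.0, 10.5, 11.5]
controls = [5.0, 6.0, 5.5, 6.5, 5.2]
interval = bootstrap_ci(cases, controls)
print(interval, is_candidate(interval))
```

            <p>Counting the features whose interval excludes zero would then give a first, uncorrected estimate of the number of differentially expressed features; an adjustment for multiple testing would still have to be considered.</p>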
         </sec>
         <sec>
            <st>
               <p>Outlook</p>
            </st>
            <p>The proposed deconfounding ranking approach to classification has to be considered a first heuristic approach. Its performance demonstrates superiority over approaches that do not take confounding with cell-type proportions into account (figure <figr fid="F9">9</figr>). However, a multivariate model of gene expression patterns (biosignatures) is still missing. It would be desirable to arrive at an analysis interface that enables the use of the plethora of available statistical learning methods. Also, the classification approach depends on either measurements or estimates of cell type proportions in the sample to be classified. If the field of application is gene expression signatures in blood, it is certainly conceivable that a cell type proportion profile is measured, as the necessary laboratory equipment is now available in labs all over the world. In our work, however, we propose a regression approach based on the expression profiles of the marker genes which are also used to identify the cell type-specific expression signatures after deconfounding. This approach worked well in our simulation study, and figure <figr fid="F10">10</figr> shows that it also delivers sufficient results for experimental data, comparable to what the deconfounding algorithm delivers (compare figure <figr fid="F5">5</figr>). However, there is certainly room for improvement, as better estimates of cell type proportions based on single-sample whole blood expression profiles would enable improved classification performance.</p>
            <p>The presence of up- and down-regulated biomarkers suggests two further possible improvements. First, gene filtering with regard to absolute expression signals, i.e. focusing on medium to highly expressed genes, may provide more robust signatures. Second, the identification of gene pairs, as in the top scoring pair method <abbrgrp><abbr bid="B26">26</abbr><abbr bid="B27">27</abbr></abbrgrp>, may be an alternative to the ranking approach taken in our initial study and may improve reliability in the presence of noisy field measurements.</p>
            <p>There are also alternatives to the non-negative matrix factorization approach taken here and in <abbrgrp><abbr bid="B4">4</abbr><abbr bid="B10">10</abbr></abbrgrp>. For example, Ghosh proposes a mixture model approach <abbrgrp><abbr bid="B9">9</abbr></abbrgrp>, and Bayesian approaches to this task also exist <abbrgrp><abbr bid="B22">22</abbr><abbr bid="B28">28</abbr></abbrgrp>. A comparison of existing methods applied to biological data from heterogeneous tissues would certainly be an exciting and rewarding field of further work. Modern Bayesian methods in particular promise to further improve the results, including for more than two cell types in the heterogeneous tissue to be resolved.</p>
            <p>In our contribution, the deconfounding ranking approach is applied to gene expression profiles in peripheral blood samples. In principle, it is also applicable to other molecular profiles from heterogeneous tissues, e.g. metabolome or proteome profiles.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Conclusions</p>
         </st>
         <p>In heterogeneous tissue samples, molecular profiling is confounded by variable cell type proportions. If confounding is severe, as in the important surrogate tissue blood, valid molecular profile measurements are hampered. If micro-dissection or cell sorting is unavailable or too expensive, <it>in-silico </it>deconfounding offers an alternative. We have demonstrated possible algorithmic adjustments and approaches for detection of cell type-specific differential gene expression and for molecular profile-based classification. Neither of these objectives has previously been studied for <it>in-silico </it>deconfounding approaches. The strength of our study lies in the use of an experimental validation dataset, which also served to select realistic simulation parameters that emulate the conditions of a molecular field study.</p>
      </sec>
      <sec>
         <st>
            <p>Authors' contributions</p>
         </st>
         <p>DR conceived the study and its approaches, was responsible for design and coordination, ran part of the simulation studies and prepared the manuscript. SKe performed most of the statistical programming, implemented the algorithms, estimated the simulation parameters and ran the simulations. Most of the work in this contribution was part of SKe's diploma thesis at the University of Potsdam. AT consolidated the existing R scripts into a publishable R package. GW designed the clinical study and supervised sample preparation in the laboratory. GB managed and supervised the clinical study. MJ contributed significantly to the development of our deconfounding approaches, co-organized the data collection and helped to draft the manuscript. SHEK supervised the Grand Challenges consortium GC6: "Biomarkers for protection against Tuberculosis on the background of AIDS/HIV in Africa" and helped to design the study and write the manuscript. SKP contributed significantly with scientific input and design of the GC6 study as well as project coordination. JS contributed to study design and methodological discussions and helped to prepare the manuscript. All authors read and approved the final manuscript.</p>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>We thank Marco Ende for his support of our work on the Linux Cluster of the Institute for Biochemistry and Biology, University of Potsdam. The Grand Challenges in Global Health Project: Grant Number 37772, &#8220;Biomarkers of protective immunity against Tuberculosis in the context of HIV/AIDS in Africa&#8221;, was funded by a grant from the Bill &amp; Melinda Gates Foundation through the Grand Challenges in Global Health Initiative.</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>Deconfounding microarray analysis: independent measurements of cell type proportions used in a regression model to resolve tissue heterogeneity bias</p>
            </title>
            <aug>
               <au>
                  <snm>Jacobsen</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Repsilber</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Gutschmidt</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Neher</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Feldmann</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Mollenkopf</snm>
                  <fnm>HJ</fnm>
               </au>
               <au>
                  <snm>Kaufmann</snm>
                  <fnm>SH</fnm>
               </au>
               <au>
                  <snm>Ziegler</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>Methods of Information in Medicine</source>
            <pubdate>2006</pubdate>
            <volume>45</volume>
            <issue>5</issue>
            <fpage>557</fpage>
            <lpage>563</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">17019511</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B2">
            <title>
               <p>Novel strategies to identify biomarkers in tuberculosis</p>
            </title>
            <aug>
               <au>
                  <snm>Jacobsen</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Mattow</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Repsilber</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Kaufmann</snm>
                  <fnm>SHE</fnm>
               </au>
            </aug>
            <source>Biological Chemistry</source>
            <pubdate>2008</pubdate>
            <volume>389</volume>
            <fpage>487</fpage>
            <lpage>495</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1515/BC.2008.053</pubid>
                  <pubid idtype="pmpid">18953715</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B3">
            <title>
               <p>Sample selection for microarray gene expression studies</p>
            </title>
            <aug>
               <au>
                  <snm>Repsilber</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Fink</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Jacobsen</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Bl&#228;asing</snm>
                  <fnm>O</fnm>
               </au>
               <au>
                  <snm>Ziegler</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>Methods of Information in Medicine</source>
            <pubdate>2005</pubdate>
            <volume>44</volume>
            <issue>3</issue>
            <fpage>461</fpage>
            <lpage>467</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">16113774</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B4">
            <title>
               <p>In silico microdissection of microarray data from heterogeneous cell populations</p>
            </title>
            <aug>
               <au>
                  <snm>Lahdesmaki</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Shmulevich</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Dunmire</snm>
                  <fnm>V</fnm>
               </au>
               <au>
                  <snm>Yli-Harja</snm>
                  <fnm>O</fnm>
               </au>
               <au>
                  <snm>Zhang</snm>
                  <fnm>W</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2005</pubdate>
            <volume>6</volume>
            <fpage>54</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1186/1471-2105-6-54</pubid>
                  <pubid idtype="pmcid">1274251</pubid>
                  <pubid idtype="pmpid" link="fulltext">15766384</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B5">
            <title>
               <p>Lymphocytes. 3. Distribution: Distribution of lymphocytes in health</p>
            </title>
            <aug>
               <au>
                  <snm>Ford</snm>
                  <fnm>WL</fnm>
               </au>
            </aug>
            <source>Journal of Clinical Pathology</source>
            <pubdate>1979</pubdate>
            <volume>32</volume>
            <issue>13</issue>
            <fpage>63</fpage>
            <lpage>69</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1136/jcp.s3-13.1.63</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B6">
            <title>
               <p>Monoclonal antibodies and the FACS: complementary tools for immunobiology and medicine</p>
            </title>
            <aug>
               <au>
                  <snm>Herzenberg</snm>
                  <fnm>LA</fnm>
               </au>
               <au>
                  <snm>De Rosa</snm>
                  <fnm>SC</fnm>
               </au>
               <au>
                  <snm>Herzenberg</snm>
                  <fnm>LA</fnm>
               </au>
            </aug>
            <source>Immunology Today</source>
            <pubdate>2000</pubdate>
            <volume>21</volume>
            <issue>8</issue>
            <fpage>383</fpage>
            <lpage>390</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/S0167-5699(00)01678-9</pubid>
                  <pubid idtype="pmpid" link="fulltext">10916141</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B7">
            <title>
               <p>Laser capture microdissection</p>
            </title>
            <aug>
               <au>
                  <snm>Emmert-Buck</snm>
                  <fnm>MR</fnm>
               </au>
            </aug>
            <source>Science</source>
            <pubdate>1996</pubdate>
            <volume>274</volume>
            <issue>5289</issue>
            <fpage>998</fpage>
            <lpage>1001</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1126/science.274.5289.998</pubid>
                  <pubid idtype="pmpid" link="fulltext">8875945</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B8">
            <title>
               <p>Automated Tissue Analysis a Bioinformatics Perspective</p>
            </title>
            <aug>
               <au>
                  <snm>Kriete</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Boyce</snm>
                  <fnm>K</fnm>
               </au>
            </aug>
            <source>Methods of Information in Medicine</source>
            <pubdate>2005</pubdate>
            <volume>44</volume>
            <fpage>32</fpage>
            <lpage>37</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">15778792</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B9">
            <title>
               <p>Mixture models for assessing differential expression in complex tissues using microarray data</p>
            </title>
            <aug>
               <au>
                  <snm>Ghosh</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2004</pubdate>
            <volume>20</volume>
            <issue>11</issue>
            <fpage>1663</fpage>
            <lpage>1669</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/bth139</pubid>
                  <pubid idtype="pmpid" link="fulltext">14988124</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B10">
            <title>
               <p>Separation of samples into their constituents using gene expression data</p>
            </title>
            <aug>
               <au>
                  <snm>Venet</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Pecasse</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Maenhaut</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Bersini</snm>
                  <fnm>H</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2001</pubdate>
            <volume>17</volume>
            <issue>Suppl.1</issue>
            <fpage>S279</fpage>
            <lpage>S287</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">11473019</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B11">
            <title>
               <p>Expression deconvolution: a reinterpretation of DNA microarray data reveals dynamic changes in cell populations</p>
            </title>
            <aug>
               <au>
                  <snm>Lu</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Nakorchevskiy</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Marcotte</snm>
                  <fnm>E</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci USA</source>
            <pubdate>2003</pubdate>
            <volume>100</volume>
            <issue>18</issue>
            <fpage>10370</fpage>
            <lpage>10375</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1073/pnas.1832361100</pubid>
                  <pubid idtype="pmcid">193568</pubid>
                  <pubid idtype="pmpid" link="fulltext">12934019</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B12">
            <title>
               <p>In silico dissection of cell-type-associated patterns of gene expression in prostate cancer</p>
            </title>
            <aug>
               <au>
                  <snm>Stuart</snm>
                  <fnm>RO</fnm>
               </au>
               <au>
                  <snm>Wachsman</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Berry</snm>
                  <fnm>CC</fnm>
               </au>
               <au>
                  <snm>Wang-Rodriguez</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Wasserman</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Klacansky</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Masys</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Arden</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Goodison</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>McClelland</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Wang</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Sawyers</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Kalcheva</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Tarin</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Mercola</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci USA</source>
            <pubdate>2004</pubdate>
            <volume>101</volume>
            <issue>2</issue>
            <fpage>615</fpage>
            <lpage>620</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1073/pnas.2536479100</pubid>
                  <pubid idtype="pmcid">327196</pubid>
                  <pubid idtype="pmpid" link="fulltext">14722351</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B13">
            <title>
               <p>Efficient two-sample designs for microarray experiments with biological replications</p>
            </title>
            <aug>
               <au>
                  <snm>Landgrebe</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Bretz</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Brunner</snm>
                  <fnm>E</fnm>
               </au>
            </aug>
            <source>In Silico Biology</source>
            <pubdate>2004</pubdate>
            <volume>4</volume>
            <fpage>0038</fpage>
         </bibl>
         <bibl id="B14">
            <title>
               <p>Limma: linear models for microarray data. In Bioinformatics and Computational Biology Solutions using R and Bioconductor</p>
            </title>
            <aug>
               <au>
                  <snm>Smyth</snm>
                  <fnm>GK</fnm>
               </au>
            </aug>
            <publisher>New York: Springer</publisher>
            <editor>Gentleman R, Carey V, Dudoit S, Irizarry R, Huber W</editor>
            <pubdate>2005</pubdate>
            <fpage>397</fpage>
            <lpage>420</lpage>
         </bibl>
         <bibl id="B15">
            <title>
               <p>A comparison of background correction methods for two-colour microarrays</p>
            </title>
            <aug>
               <au>
                  <snm>Ritchie</snm>
                  <fnm>ME</fnm>
               </au>
               <au>
                  <snm>Silver</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Oshlack</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Holmes</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Diyagama</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Holloway</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Smyth</snm>
                  <fnm>GK</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2007</pubdate>
            <volume>23</volume>
            <fpage>2700</fpage>
            <lpage>2707</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/btm412</pubid>
                  <pubid idtype="pmpid" link="fulltext">17720982</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B16">
            <title>
               <p>Normalization of cDNA microarray data</p>
            </title>
            <aug>
               <au>
                  <snm>Smyth</snm>
                  <fnm>GK</fnm>
               </au>
               <au>
                  <snm>Speed</snm>
                  <fnm>T</fnm>
               </au>
            </aug>
            <source>Methods</source>
            <pubdate>2003</pubdate>
            <volume>31</volume>
            <fpage>265</fpage>
            <lpage>273</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/S1046-2023(03)00155-5</pubid>
                  <pubid idtype="pmpid" link="fulltext">14597310</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B17">
            <title>
               <p>Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation</p>
            </title>
            <aug>
               <au>
                  <snm>Yang</snm>
                  <fnm>YH</fnm>
               </au>
               <au>
                  <snm>Dudoit</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Luu</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Lin</snm>
                  <fnm>DM</fnm>
               </au>
               <au>
                  <snm>Peng</snm>
                  <fnm>V</fnm>
               </au>
               <au>
                  <snm>Ngai</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Speed</snm>
                  <fnm>TP</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2002</pubdate>
            <volume>30</volume>
            <issue>4</issue>
            <fpage>e15</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/nar/30.4.e15</pubid>
                  <pubid idtype="pmcid">100354</pubid>
                  <pubid idtype="pmpid" link="fulltext">11842121</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B18">
            <aug>
               <au>
                  <cnm>R Development Core Team</cnm>
               </au>
            </aug>
            <source>R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria</source>
            <pubdate>2005</pubdate>
         </bibl>
         <bibl id="B19">
            <aug>
               <au>
                  <snm>Lawson</snm>
                  <fnm>CL</fnm>
               </au>
               <au>
                  <snm>Hanson</snm>
                  <fnm>RJ</fnm>
               </au>
            </aug>
            <source>Solving Least-Squares Problems</source>
            <publisher>Englewood Cliffs, New Jersey: Prentice-Hall</publisher>
            <pubdate>1974</pubdate>
            <note>[Chapter 23].</note>
         </bibl>
         <bibl id="B20">
            <title>
               <p>Random Forests</p>
            </title>
            <aug>
               <au>
                  <snm>Breiman</snm>
                  <fnm>L</fnm>
               </au>
            </aug>
            <source>Machine Learning</source>
            <pubdate>2001</pubdate>
            <volume>45</volume>
            <fpage>5</fpage>
            <lpage>32</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1023/A:1010933404324</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B21">
            <title>
               <p>Classification and Regression by randomForest</p>
            </title>
            <aug>
               <au>
                  <snm>Liaw</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Wiener</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>R News</source>
            <pubdate>2002</pubdate>
            <volume>2</volume>
            <issue>3</issue>
            <fpage>18</fpage>
            <lpage>22</lpage>
         </bibl>
         <bibl id="B22">
            <title>
               <p>Application of Bayesian decomposition for analysing microarray data</p>
            </title>
            <aug>
               <au>
                  <snm>Moloshok</snm>
                  <fnm>TD</fnm>
               </au>
               <au>
                  <snm>Klevecz</snm>
                  <fnm>RR</fnm>
               </au>
               <au>
                  <snm>Grant</snm>
                  <fnm>JD</fnm>
               </au>
               <au>
                  <snm>Manion</snm>
                  <fnm>FJ</fnm>
               </au>
               <au>
                  <snm>Speier</snm>
                  <fnm>WF</fnm>
               </au>
               <au>
                  <snm>Ochs</snm>
                  <fnm>MF</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2002</pubdate>
            <volume>18</volume>
            <issue>4</issue>
            <fpage>566</fpage>
            <lpage>575</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/18.4.566</pubid>
                  <pubid idtype="pmpid" link="fulltext">12016054</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B23">
            <title>
               <p>A HaemAtlas: characterizing gene expression in differentiated human blood cells</p>
            </title>
            <aug>
               <au>
                  <snm>Watkins</snm>
                  <fnm>NA</fnm>
               </au>
               <au>
                  <snm>Gusnanto</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>de Bono</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>De</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Miranda-Saavedra</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Hardie</snm>
                  <fnm>DL</fnm>
               </au>
               <au>
                  <snm>Angenent</snm>
                  <fnm>WG</fnm>
               </au>
               <au>
                  <snm>Attwood</snm>
                  <fnm>AP</fnm>
               </au>
               <au>
                  <snm>Ellis</snm>
                  <fnm>PD</fnm>
               </au>
               <au>
                  <snm>Erber</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Foad</snm>
                  <fnm>NS</fnm>
               </au>
               <au>
                  <snm>Garner</snm>
                  <fnm>SF</fnm>
               </au>
               <au>
                  <snm>Isacke</snm>
                  <fnm>CM</fnm>
               </au>
               <au>
                  <snm>Jolley</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Koch</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Macaulay</snm>
                  <fnm>IC</fnm>
               </au>
               <au>
                  <snm>Morley</snm>
                  <fnm>SL</fnm>
               </au>
               <au>
                  <snm>Rendon</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Rice</snm>
                  <fnm>KM</fnm>
               </au>
               <au>
                  <snm>Taylor</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Thijssen-Timmer</snm>
                  <fnm>DC</fnm>
               </au>
               <au>
                  <snm>Tijssen</snm>
                  <fnm>MR</fnm>
               </au>
               <au>
                  <snm>Schoot</snm>
                  <mnm>van der</mnm>
                  <fnm>CE</fnm>
               </au>
               <au>
                  <snm>Wernisch</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Winzer</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Dudbridge</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Buckley</snm>
                  <fnm>CD</fnm>
               </au>
               <au>
                  <snm>Langford</snm>
                  <fnm>CF</fnm>
               </au>
               <au>
                  <snm>Teichmann</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Gottgens</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Ouwehand</snm>
                  <fnm>WH</fnm>
               </au>
            </aug>
            <source>Blood</source>
            <pubdate>2009</pubdate>
            <volume>113</volume>
            <issue>19</issue>
            <fpage>e1</fpage>
            <lpage>9</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1182/blood-2008-06-162958</pubid>
                  <pubid idtype="pmcid">2680378</pubid>
                  <pubid idtype="pmpid" link="fulltext">19228925</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B24">
            <title>
               <p>On testing the significance of sets of genes</p>
            </title>
            <aug>
               <au>
                  <snm>Efron</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Tibshirani</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>The Annals of Applied Statistics</source>
            <pubdate>2007</pubdate>
            <volume>1</volume>
            <fpage>107</fpage>
            <lpage>129</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1214/07-AOAS101</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B25">
            <title>
               <p>GlobalANCOVA: exploration and assessment of gene group effects</p>
            </title>
            <aug>
               <au>
                  <snm>Hummel</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Meister</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Mansmann</snm>
                  <fnm>U</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2008</pubdate>
            <volume>24</volume>
            <fpage>78</fpage>
            <lpage>85</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/btm531</pubid>
                  <pubid idtype="pmpid" link="fulltext">18024976</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B26">
            <title>
               <p>Classifying gene expression profiles from pairwise mRNA comparisons</p>
            </title>
            <aug>
               <au>
                  <snm>Geman</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>d'Avignon</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Naiman</snm>
                  <fnm>DQ</fnm>
               </au>
               <au>
                  <snm>Winslow</snm>
                  <fnm>RL</fnm>
               </au>
            </aug>
            <source>Stat Appl Genet Mol Biol</source>
            <pubdate>2004</pubdate>
            <volume>3</volume>
            <note>Article 19.</note>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1989150</pubid>
                  <pubid idtype="pmpid" link="fulltext">16646797</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B27">
            <title>
               <p>Weighted Top Score Pair Method for Gene Selection and Classification</p>
            </title>
            <aug>
               <au>
                  <snm>Luo</snm>
                  <fnm>Huaien</fnm>
               </au>
               <au>
                  <snm>Sudibyo</snm>
                  <fnm>Yuliansa</fnm>
               </au>
               <au>
                  <snm>Miller</snm>
                  <mi>D</mi>
                  <fnm>Lance</fnm>
               </au>
               <au>
                  <snm>Karuturi</snm>
                  <mnm>Murthy</mnm>
                  <fnm>R Krishna</fnm>
               </au>
            </aug>
            <source>Lecture Notes in Computer Science: Pattern Recognition in Bioinformatics</source>
            <pubdate>2008</pubdate>
            <volume>5265</volume>
            <fpage>323</fpage>
            <lpage>333</lpage>
         </bibl>
         <bibl id="B28">
            <title>
               <p>Bayesian Factor Regression Models in the "Large p, Small n" Paradigm</p>
            </title>
            <aug>
               <au>
                  <snm>West</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Bayesian Statistics</source>
            <pubdate>2003</pubdate>
            <volume>7</volume>
            <fpage>723</fpage>
            <lpage>732</lpage>
         </bibl>
      </refgrp>
   </bm>
</art>

