<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art><ui>1471-2105-12-496</ui><ji>1471-2105</ji><fm>
<dochead>Methodology article</dochead>
<bibl>
<title>
<p>A mixture model with a reference-based automatic selection of components for disease classification from protein and/or gene expression levels</p>
</title>
<aug>
<au id="A1" ca="yes"><snm>Kopriva</snm><fnm>Ivica</fnm><insr iid="I1"/><email>ikopriva@irb.hr</email></au>
<au id="A2"><snm>Filipovi&#263;</snm><fnm>Marko</fnm><insr iid="I1"/><email>filipov@irb.hr</email></au>
</aug>
<insg>
<ins id="I1"><p>Division of Laser and Atomic R&amp;D, Ru&#273;er Bo&#353;kovi&#263; Institute, Bijeni&#269;ka cesta 54, 10000 Zagreb, Croatia</p></ins>
</insg>
<source>BMC Bioinformatics</source>
<issn>1471-2105</issn>
<pubdate>2011</pubdate>
<volume>12</volume>
<issue>1</issue>
<fpage>496</fpage>
<url>http://www.biomedcentral.com/1471-2105/12/496</url>
<xrefbib><pubidlist><pubid idtype="doi">10.1186/1471-2105-12-496</pubid><pubid idtype="pmpid">22208882</pubid></pubidlist></xrefbib>
</bibl>
<history><rec><date><day>29</day><month>6</month><year>2011</year></date></rec><acc><date><day>30</day><month>12</month><year>2011</year></date></acc><pub><date><day>30</day><month>12</month><year>2011</year></date></pub></history>
<cpyrt><year>2011</year><collab>Kopriva and Filipovi&#263;; licensee BioMed Central Ltd.</collab><note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note></cpyrt>
<abs>
<sec>
<st>
<p>Abstract</p>
</st>
<sec>
<st>
<p>Background</p>
</st>
<p>Bioinformatics data analysis often uses a linear mixture model that represents samples as additive mixtures of components. Properly constrained blind matrix factorization methods extract those components using mixture samples only. However, automatic selection of the extracted components to be retained for classification analysis remains an open issue.</p>
</sec>
<sec>
<st>
<p>Results</p>
</st>
<p>The method proposed here is applied to well-studied protein and genomic datasets of ovarian, prostate and colon cancers to extract components for disease prediction. It achieves average sensitivities of 96.2% (sd = 2.7%), 97.6% (sd = 2.8%) and 90.8% (sd = 5.5%) and average specificities of 93.6% (sd = 4.1%), 99% (sd = 2.2%) and 79.4% (sd = 9.8%) in 100 independent two-fold cross-validations.</p>
</sec>
<sec>
<st>
<p>Conclusions</p>
</st>
<p>We propose an additive mixture model of a sample for feature extraction using, in principle, sparseness constrained factorization on a sample-by-sample basis. In contrast, existing methods factorize the complete dataset simultaneously. The sample model is composed of a reference sample representing control and/or case (disease) groups and a test sample. Each sample is decomposed into two or more components that are selected automatically (without using label information) as control specific, case specific and not differentially expressed (neutral). The number of components is determined by cross-validation. Automatic assignment of features (<it>m</it>/<it>z </it>ratios or genes) to a particular component is based on thresholds estimated from each sample directly. Due to the locality of decomposition, the strength of the expression of each feature can vary across the samples. Yet the features will still be allocated to the related disease and/or control specific component. Since label information is not used in the selection process, case and control specific components can be used for classification. That is not the case with standard factorization methods. Moreover, the component selected by the proposed method as disease specific can be interpreted as a <it>sub-mode </it>and retained for further analysis to identify potential biomarkers. As opposed to standard matrix factorization methods, this can be achieved on a sample (experiment)-by-sample basis. Postulating one or more components with indifferent features enables their removal from disease and control specific components on a sample-by-sample basis. This yields selected components with reduced complexity and generally increases prediction accuracy.</p>
</sec>
</sec>
</abs>
</fm><bdy>
<sec>
<st>
<p>Background</p>
</st>
<p>Bioinformatics data analysis is often based on the use of a linear mixture model (LMM) of a sample <abbrgrp>
<abbr bid="B1">1</abbr>
<abbr bid="B2">2</abbr>
<abbr bid="B3">3</abbr>
<abbr bid="B4">4</abbr>
<abbr bid="B5">5</abbr>
<abbr bid="B6">6</abbr>
<abbr bid="B7">7</abbr>
<abbr bid="B8">8</abbr>
<abbr bid="B9">9</abbr>
<abbr bid="B10">10</abbr>
<abbr bid="B11">11</abbr>
<abbr bid="B12">12</abbr>
<abbr bid="B13">13</abbr>
<abbr bid="B14">14</abbr>
<abbr bid="B15">15</abbr>
</abbrgrp>, in which the mixture is composed of components generated by an unknown number of interfering sources. As an example, components can be generated during disease progression, which causes cancerous cells to produce proteins and/or other molecules that can serve as early indicators (biomarkers) representing disease-correlated chemical entities. Their correct identification may be very beneficial for early detection and diagnosis of disease <abbrgrp>
<abbr bid="B16">16</abbr>
</abbrgrp>. However, identification of individual components within a sample is complicated by the fact that they can be "buried" within multiple substances. In addition, the dynamic range of their concentrations can span several orders of magnitude <abbrgrp>
<abbr bid="B16">16</abbr>
</abbrgrp>, i.e., single components may no longer be recognizable <abbrgrp>
<abbr bid="B1">1</abbr>
</abbrgrp>. Nevertheless, there are algorithms able to extract either individual components or a group of components with similar concentrations within a sample. These algorithms are known as blind source separation (BSS) methods <abbrgrp>
<abbr bid="B17">17</abbr>
</abbrgrp>, and they commonly include independent component analysis (ICA) <abbrgrp>
<abbr bid="B18">18</abbr>
</abbrgrp>, and nonnegative matrix factorization (NMF) <abbrgrp>
<abbr bid="B19">19</abbr>
</abbrgrp>. However, BSS methods perform unsupervised decomposition of the mixture samples. Thus, it is not clear which of the extracted components are to be retained for further prediction/classification analysis. Several contributions toward a solution of this problem have been published in <abbrgrp>
<abbr bid="B1">1</abbr>
<abbr bid="B2">2</abbr>
<abbr bid="B3">3</abbr>
<abbr bid="B4">4</abbr>
<abbr bid="B5">5</abbr>
<abbr bid="B8">8</abbr>
</abbrgrp>. In <abbrgrp>
<abbr bid="B1">1</abbr>
</abbrgrp>, a matrix factorization approach to the decomposition of infrared spectra of a sample is proposed that takes class labels into account, i.e., the classification phase and the component inference task are unified. Thus, the concept proposed in <abbrgrp>
<abbr bid="B1">1</abbr>
</abbrgrp> is classifier specific. It is formulated as a multiclass assignment problem where the number of components equals the number of classes and must be less than the number of samples available. As opposed to <abbrgrp>
<abbr bid="B1">1</abbr>
</abbrgrp>, the method proposed here automatically selects the case and control specific components on a sample-by-sample basis. Afterwards, these components can be used to train an arbitrary classifier. In <abbrgrp>
<abbr bid="B2">2</abbr>
</abbrgrp> the gene expression profile is modelled as a linear superposition of three components comprised of up-regulated, down-regulated and not differentially expressed genes, where the existence of two <it>fixed thresholds </it>is assumed to enable a decision as to which of the three components a particular gene belongs. The thresholds are defined heuristically and in each specific case the optimal value must be obtained by cross-validation. Moreover, the upper threshold <it>c</it>
<sub>u </sub>and the lower one <it>c</it>
<sub>l </sub>are mutually related through <it>c</it>
<sub>u </sub>= 1/<it>c</it>
<sub>l</sub>. As opposed to that, the method proposed here decomposes each sample (experiment) into components comprised of up-regulated, down-regulated and not differentially expressed features using data adaptive thresholds. They are based on mixing angles of an innovative linear mixture model of a sample. The method proposed in <abbrgrp>
<abbr bid="B3">3</abbr>
</abbrgrp> uses available sample labels (the clinical diagnosis of the experiments) to select component(s), extracted by independent component analysis (ICA) or nonnegative matrix factorization (NMF), for further analysis. ICA or NMF is used to factorize the whole dataset simultaneously and one selected component (gene expression mode for ICA and metagene for NMF) is used for further analysis related to gene marker extraction. This component cannot be used for classification. Alternatively, the basis matrix with labelled column vectors (for ICA) or row vectors (for NMF) can be used for classification, in which case the test sample needs to be projected onto the space spanned by the column/row vectors, respectively. However, in this case no feature extraction can be performed. As opposed to the ICA/NMF method proposed in <abbrgrp>
<abbr bid="B3">3</abbr>
</abbrgrp>, the method proposed here extracts disease and control specific components from each sample separately. Since no label information is used in the selection process, extracted components can be used for classification, which is the goal of this paper. The disease specific component can, however, also be retained for further biomarker related analysis as in <abbrgrp>
<abbr bid="B3">3</abbr>
</abbrgrp>. The important difference is that with the method proposed here such a component can be obtained from each sample separately, while the method in <abbrgrp>
<abbr bid="B3">3</abbr>
</abbrgrp>, as well as in <abbrgrp>
<abbr bid="B4">4</abbr>
<abbr bid="B5">5</abbr>
<abbr bid="B8">8</abbr>
</abbrgrp>, needs the whole dataset. The method in <abbrgrp>
<abbr bid="B4">4</abbr>
</abbrgrp> again uses ICA (the FastICA algorithm <abbrgrp>
<abbr bid="B20">20</abbr>
</abbrgrp>) to factorize the microarray dataset. Extracted components (gene expression modes) were analyzed to discriminate between those with biological significance and those representing noise. The biologically significant components can be used for further gene marker related analysis, but not for classification. The reason is that, as in <abbrgrp>
<abbr bid="B3">3</abbr>
</abbrgrp>, the whole dataset composed of case and control samples is reduced to several biologically interesting components only. In the extreme case there can be only one such component. In <abbrgrp>
<abbr bid="B5">5</abbr>
</abbrgrp> the JADE ICA algorithm is used to decompose the whole dataset into components (gene expression modes). As in <abbrgrp>
<abbr bid="B3">3</abbr>
<abbr bid="B4">4</abbr>
</abbrgrp> these components cannot be used for classification. They are used for further decomposition into sub-modes to identify a regulating network in the problem considered there. We want to emphasize that the component selected as disease specific by the method proposed here can also be interpreted as a sub-mode and used for a similar type of analysis. However, since it is extracted from an individual and labelled sample it can be used for classification as well. That is the main goal of this paper. The method in <abbrgrp>
<abbr bid="B8">8</abbr>
</abbrgrp> again uses ICA (the maximum likelihood with natural gradient <abbrgrp>
<abbr bid="B18">18</abbr>
</abbrgrp>) to extract components (gene expression modes). Similarly, as in <abbrgrp>
<abbr bid="B3">3</abbr>
<abbr bid="B4">4</abbr>
<abbr bid="B5">5</abbr>
</abbrgrp> these components are not used for classification. Instead, they are further analyzed by data clustering to determine biological relevance and extract gene markers. Comments similar to those made in relation to <abbrgrp>
<abbr bid="B3">3</abbr>
<abbr bid="B4">4</abbr>
<abbr bid="B5">5</abbr>
<abbr bid="B8">8</abbr>
</abbrgrp> also apply to other methods that use either ICA or NMF to extract components from the whole dataset <abbrgrp>
<abbr bid="B6">6</abbr>
<abbr bid="B7">7</abbr>
<abbr bid="B10">10</abbr>
<abbr bid="B11">11</abbr>
<abbr bid="B12">12</abbr>
</abbrgrp>. Hence, although related to the component selection methods <abbrgrp>
<abbr bid="B1">1</abbr>
<abbr bid="B3">3</abbr>
<abbr bid="B4">4</abbr>
<abbr bid="B5">5</abbr>
<abbr bid="B8">8</abbr>
</abbrgrp> the method proposed here differs from all of them in that it extracts the most interesting components on a sample (experiment)-by-sample basis. To achieve this, the linear mixture model (LMM) used for component extraction is composed of a test sample and a reference sample representing the control and/or case group. Hence, a test sample is, in principle, associated with two LMMs. Each LMM describes a sample as an additive mixture of two or more components. Two of them are selected automatically (no thresholds need to be predefined) as case (disease) and control specific, while the rest are considered neutral, i.e. not differentially expressed. Decomposition of each LMM is enabled by enforcing a sparseness constraint on the components to be extracted. This implies that each feature (<it>m/z </it>ratio or gene) belongs to at most two components (disease and neutral, or control and neutral). The model formally presumes that disease specific features are present in prevailing concentration in disease samples and that control specific features are present in prevailing concentration in control samples. However, the features do not have to be expressed equally strongly across the whole dataset in order to be selected as a part of disease or control specific components. This is because decomposition is performed locally (on a sample-by-sample basis), which should prevent the loss of features that are important for classification. Accordingly, the level of expression of indifferent features can also vary between the samples. Thus, postulating one or more components with indifferent features enables their sample-adaptive removal. In contrast, existing methods try to optimize a single threshold for the whole dataset. 
Geometric interpretation of the LMM based on a reference sample enables automatic selection of disease and control specific components (Figure <figr fid="F1">1</figr> in section 1.2), without using label information. Hence, the selected components can be further used for disease prediction. By postulating the existence of one or more components with not differentially expressed features, the complexity of the selected components can be controlled (see discussion in section 1.7), while the overall number of components is selected by cross-validation. Although feature selection is the main goal of the proposed method, the component extracted from the sample as disease specific can also be interpreted as a sub-mode as in <abbrgrp>
<abbr bid="B3">3</abbr>
<abbr bid="B4">4</abbr>
</abbrgrp>. It can be used for further biomarker identification related analysis. We see the linearity of the model used to describe a sample as a potential limitation of the proposed method. Although linear models dominate in bioinformatics, it has been discussed in <abbrgrp>
<abbr bid="B8">8</abbr>
</abbrgrp> that nonlinear models might be a more accurate description of biological processes. The assumption that a reference sample is available might also be seen as a potential weakness. Yet, we have demonstrated that in the absence of expert information the reference sample can be obtained by a simple average of all the samples within the same class. The proposed method is demonstrated in sections 1.4 to 1.7 on disease prediction problems using a computational model as well as experimental datasets related to the prediction of ovarian, prostate and colon cancers from protein and gene expression profiles.</p>
<fig id="F1"><title><p>Figure 1</p></title><caption><p>Geometrical interpretation of the linear mixture model</p></caption><text>
   <p><b>Geometrical interpretation of the linear mixture model</b>. Concentration vectors of the linear mixture model comprised of the control reference sample and the test sample, (2a) and Figure 1a, or of the disease reference sample and the test sample, (2b) and Figure 1b, are confined to the first quadrant of the plane spanned by the two mixture samples. Features (<it>m</it>/<it>z </it>ratios or genes) with prevailing concentration in the disease sample are linearly combined into the component associated with the red relative concentration vector. Likewise, features with prevailing concentration in the control sample are combined linearly into the component associated with the blue relative concentration vector. Features that are not differentially expressed are combined linearly into one or more neutral components associated with the green relative concentration vectors.</p>
</text><graphic file="1471-2105-12-496-1" hint_layout="single"/></fig>
</sec>
<sec>
<st>
<p>Methods</p>
</st>
<p>This section derives a sparse component analysis (SCA) approach to unsupervised decomposition of protein (mass spectra) and gene expression profiles using a novel mixture model of a sample. The model enables automatic selection of two of the extracted components as case and control specific. They are retained for classification. In what follows, the problem motivation and definition are presented first. Then, the LMM of a sample is introduced and its interpretation is described. Afterwards, a two-stage implementation of the SCA algorithm is described and discussed in detail.</p>
<sec>
<st>
<p>1.1 Problem formulation</p>
</st>
<p>As mentioned previously, bioinformatics problems often deal with data containing components that are imprinted in a sample by several interfering sources. As an example, a brief description of the endocrine signalling system, which secretes hormones into the blood stream, is given in <abbrgrp>
<abbr bid="B1">1</abbr>
</abbrgrp>. Likewise, reference <abbrgrp>
<abbr bid="B21">21</abbr>
</abbrgrp> describes how different organs imprint their substances (metabolites) into a urine sample. As pointed out in <abbrgrp>
<abbr bid="B1">1</abbr>
</abbrgrp> and <abbrgrp>
<abbr bid="B16">16</abbr>
</abbrgrp> disease samples are combinations of several co-regulated components (signals) originating from different sources (organs) and the disease specific component is actually "buried" within a sample. Hence we are dealing with two problems simultaneously: a sample decomposition (component inference) problem and a classification (disease prediction) problem that is based on sample decomposition. Thus, automatic selection of one or more extracted components is of practical importance. It is also important that component selection is done without the use of label information, so that the selected components can be used for classification.</p>
<p>Matrix factorization is conveniently used in signal processing to solve decomposition problems <abbrgrp>
<abbr bid="B17">17</abbr>
<abbr bid="B18">18</abbr>
<abbr bid="B19">19</abbr>
</abbrgrp>. It is assumed that the data matrix <b>X </b>&#8712; &#8477;<sup>
<it>N </it>&#215; <it>K </it>
</sup>is comprised of <it>N </it>row vectors representing mixture samples, where each sample is in turn comprised of <it>K </it>features (<it>m</it>/<it>z </it>ratios or genes). It is also assumed that the <it>N </it>samples are labelled: <inline-formula>
<m:math name="1471-2105-12-496-i1" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:msubsup>
   <m:mrow>
      <m:mfenced separators="" open="{" close="}">
         <m:mrow>
            <m:msub>
               <m:mrow>
                  <m:mstyle mathvariant="bold">
                     <m:mi mathvariant="normal">x</m:mi>
                  </m:mstyle>
               </m:mrow>
               <m:mrow>
                  <m:mi>n</m:mi>
               </m:mrow>
            </m:msub>
            <m:mo class="MathClass-rel">&#8712;</m:mo>
            <m:msup>
               <m:mrow>
                  <m:mi>&#8477;</m:mi>
               </m:mrow>
               <m:mrow>
                  <m:mi>K</m:mi>
               </m:mrow>
            </m:msup>
            <m:mo class="MathClass-punc">,</m:mo>
            <m:msub>
               <m:mrow>
                  <m:mi>y</m:mi>
               </m:mrow>
               <m:mrow>
                  <m:mi>n</m:mi>
               </m:mrow>
            </m:msub>
            <m:mo class="MathClass-rel">&#8712;</m:mo>
            <m:mrow>
               <m:mo class="MathClass-open">{</m:mo>
               <m:mrow>
                  <m:mn>1</m:mn>
                  <m:mo class="MathClass-punc">,</m:mo>
                  <m:mo class="MathClass-bin">-</m:mo>
                  <m:mn>1</m:mn>
               </m:mrow>
               <m:mo class="MathClass-close">}</m:mo>
            </m:mrow>
         </m:mrow>
      </m:mfenced>
   </m:mrow>
   <m:mrow>
      <m:mi>n</m:mi>
      <m:mo class="MathClass-rel">=</m:mo>
      <m:mn>1</m:mn>
   </m:mrow>
   <m:mrow>
      <m:mi>N</m:mi>
   </m:mrow>
</m:msubsup>
</m:math>
</inline-formula>, where 1 denotes a positive (disease) sample and -1 stands for a negative (control) sample. The data matrix <b>X </b>is modelled as a product of two factor matrices:</p>
<p>
<display-formula id="M1">
<m:math name="1471-2105-12-496-i2" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mrow>
   <m:mstyle mathvariant="bold">
      <m:mi mathvariant="normal">X</m:mi>
   </m:mstyle>
   <m:mstyle class="text">
      <m:mtext class="textsf" mathvariant="sans-serif">&#160;=&#160;</m:mtext>
   </m:mstyle>
   <m:mstyle mathvariant="bold">
      <m:mi mathvariant="normal">A</m:mi>
      <m:mi mathvariant="normal">S</m:mi>
   </m:mstyle>
</m:mrow>
</m:math>
</display-formula>
</p>
<p>where <b>A </b>&#8712; &#8477;<sup>
<it>N </it>&#215; <it>M </it>
</sup>and <b>S </b>&#8712; &#8477;<sup>
<it>M </it>&#215; <it>K </it>
</sup>, and <it>M </it>represents an <it>unknown </it>number of components present in a sample. Each component <inline-formula>
<m:math name="1471-2105-12-496-i3" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:msubsup>
   <m:mrow>
      <m:mfenced separators="" open="{" close="}">
         <m:mrow>
            <m:msub>
               <m:mrow>
                  <m:mstyle mathvariant="bold">
                     <m:mi mathvariant="normal">s</m:mi>
                  </m:mstyle>
               </m:mrow>
               <m:mrow>
                  <m:mi>m</m:mi>
               </m:mrow>
            </m:msub>
            <m:mo class="MathClass-rel">&#8712;</m:mo>
            <m:msup>
               <m:mrow>
                  <m:mi>&#8477;</m:mi>
               </m:mrow>
               <m:mrow>
                  <m:mi>K</m:mi>
               </m:mrow>
            </m:msup>
         </m:mrow>
      </m:mfenced>
   </m:mrow>
   <m:mrow>
      <m:mi>m</m:mi>
      <m:mo class="MathClass-rel">=</m:mo>
      <m:mn>1</m:mn>
   </m:mrow>
   <m:mrow>
      <m:mi>M</m:mi>
   </m:mrow>
</m:msubsup>
</m:math>
</inline-formula> is represented by a row vector of matrix <b>S</b>. Nonnegative relative concentration profiles <inline-formula>
<m:math name="1471-2105-12-496-i4" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:msubsup>
   <m:mrow>
      <m:mfenced separators="" open="{" close="}">
         <m:mrow>
            <m:msub>
               <m:mrow>
                  <m:mstyle mathvariant="bold">
                     <m:mi mathvariant="normal">a</m:mi>
                  </m:mstyle>
               </m:mrow>
               <m:mrow>
                  <m:mi>m</m:mi>
               </m:mrow>
            </m:msub>
            <m:mo class="MathClass-rel">&#8712;</m:mo>
            <m:msubsup>
               <m:mrow>
                  <m:mi>&#8477;</m:mi>
               </m:mrow>
               <m:mrow>
                  <m:mo class="MathClass-bin">+</m:mo>
               </m:mrow>
               <m:mrow>
                  <m:mi>N</m:mi>
               </m:mrow>
            </m:msubsup>
         </m:mrow>
      </m:mfenced>
   </m:mrow>
   <m:mrow>
      <m:mi>m</m:mi>
      <m:mo class="MathClass-rel">=</m:mo>
      <m:mn>1</m:mn>
   </m:mrow>
   <m:mrow>
      <m:mi>M</m:mi>
   </m:mrow>
</m:msubsup>
</m:math>
</inline-formula> are represented by column vectors of matrix <b>A </b>and are associated with the particular components. Here, we show how an innovative version of the LMM (1) of a sample <inline-formula>
<m:math name="1471-2105-12-496-i5" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:msubsup>
   <m:mrow>
      <m:mfenced separators="" open="{" close="}">
         <m:mrow>
            <m:msub>
               <m:mrow>
                  <m:mstyle mathvariant="bold">
                     <m:mi mathvariant="normal">x</m:mi>
                  </m:mstyle>
               </m:mrow>
               <m:mrow>
                  <m:mi>n</m:mi>
               </m:mrow>
            </m:msub>
            <m:mo class="MathClass-rel">&#8712;</m:mo>
            <m:msup>
               <m:mrow>
                  <m:mi>&#8477;</m:mi>
               </m:mrow>
               <m:mrow>
                  <m:mi>K</m:mi>
               </m:mrow>
            </m:msup>
         </m:mrow>
      </m:mfenced>
   </m:mrow>
   <m:mrow>
      <m:mi>n</m:mi>
      <m:mo class="MathClass-rel">=</m:mo>
      <m:mn>1</m:mn>
   </m:mrow>
   <m:mrow>
      <m:mi>N</m:mi>
   </m:mrow>
</m:msubsup>
</m:math>
</inline-formula> enables automatic selection of the case (disease) and control specific components out of <inline-formula>
<m:math name="1471-2105-12-496-i6" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:msubsup>
   <m:mrow>
      <m:mfenced separators="" open="{" close="}">
         <m:mrow>
            <m:msub>
               <m:mrow>
                  <m:mstyle mathvariant="bold">
                     <m:mi mathvariant="normal">s</m:mi>
                  </m:mstyle>
               </m:mrow>
               <m:mrow>
                  <m:mi>m</m:mi>
               </m:mrow>
            </m:msub>
         </m:mrow>
      </m:mfenced>
   </m:mrow>
   <m:mrow>
      <m:mi>m</m:mi>
      <m:mo class="MathClass-rel">=</m:mo>
      <m:mn>1</m:mn>
   </m:mrow>
   <m:mrow>
      <m:mi>M</m:mi>
   </m:mrow>
</m:msubsup>
</m:math>
</inline-formula> components extracted by an unsupervised factorization method: a two-stage SCA. The method will then be demonstrated on a computational model as well as on a cancer prediction problem using well-known proteomic and genomic datasets.</p>
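<p>The dimensions and roles of the factors in the LMM (1) can be checked numerically. The following minimal Python/NumPy sketch is an illustration only and is not part of the proposed method: it assumes synthetic random data, and the helper name make_mixture as well as the sizes used are hypothetical choices. It builds a data matrix from nonnegative concentration profiles and component rows, and verifies that a single sample is the additive mixture of the components weighted by its row of the mixing matrix.</p>

```python
import numpy as np


def make_mixture(n_samples, n_components, n_features, seed=0):
    """Build a synthetic X = A S (hypothetical helper, illustration only)."""
    rng = np.random.default_rng(seed)
    # Rows of S are the components s_m, each with K = n_features entries.
    S = rng.random((n_components, n_features))
    # Columns of A are nonnegative relative concentration profiles a_m.
    A = rng.random((n_samples, n_components))
    # Each row of X is one mixture sample x_n.
    X = A @ S
    return X, A, S


# Assumed toy sizes: N = 4 samples, M = 3 components, K = 10 features.
X, A, S = make_mixture(n_samples=4, n_components=3, n_features=10)

# Sample 0 rebuilt as the additive mixture sum_m A[0, m] * S[m, :].
x0 = sum(A[0, m] * S[m, :] for m in range(S.shape[0]))
print(X.shape, np.allclose(X[0], x0))
```

<p>In this sketch the factors are known by construction. In the setting considered here only the mixture samples are observed, and both factors, as well as the number of components <it>M</it>, have to be inferred.</p>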
</sec>
<sec>
<st>
<p>1.2 Novel additive linear mixture model of a sample</p>
</st>
<p>The LMM (1) is widely used in various bioinformatics problems <abbrgrp>
<abbr bid="B1">1</abbr>
<abbr bid="B2">2</abbr>
<abbr bid="B3">3</abbr>
<abbr bid="B4">4</abbr>
<abbr bid="B5">5</abbr>
<abbr bid="B6">6</abbr>
<abbr bid="B7">7</abbr>
<abbr bid="B8">8</abbr>
<abbr bid="B9">9</abbr>
<abbr bid="B10">10</abbr>
<abbr bid="B11">11</abbr>
<abbr bid="B12">12</abbr>
<abbr bid="B13">13</abbr>
<abbr bid="B14">14</abbr>
<abbr bid="B15">15</abbr>
</abbrgrp>. Unless constraints are imposed on <b>A </b>and/or <b>S</b>, the matrix factorization implied by (1) is not unique. Typical constraints are non-Gaussianity and statistical independence between components in ICA algorithms <abbrgrp>
<abbr bid="B6">6</abbr>
<abbr bid="B18">18</abbr>
</abbrgrp>, and non-negativity and sparseness in NMF algorithms <abbrgrp>
<abbr bid="B7">7</abbr>
<abbr bid="B11">11</abbr>
<abbr bid="B12">12</abbr>
<abbr bid="B19">19</abbr>
<abbr bid="B22">22</abbr>
<abbr bid="B23">23</abbr>
</abbrgrp>. In addition to that, many ICA algorithms, as well as many NMF algorithms, also require the <it>unknown </it>number of components <it>M </it>to be less than or equal to the number of mixture samples <it>N</it>.</p>
<p>Depending on the context, this constraint can be considered as restrictive. There are, however, ICA methods developed for the solution of underdetermined problems that are known as overcomplete ICA, see Chapter 16 in <abbrgrp>
<abbr bid="B18">18</abbr>
</abbrgrp>, as well as <abbrgrp>
<abbr bid="B24">24</abbr>
<abbr bid="B25">25</abbr>
</abbrgrp>. However, as discussed in detail in <abbrgrp>
<abbr bid="B18">18</abbr>
</abbrgrp>, overcomplete ICA methods also assume that unknown components are sparse. Two exemplary overcomplete ICA methods based on the sparseness assumption are described in <abbrgrp>
<abbr bid="B24">24</abbr>
</abbrgrp> and <abbrgrp>
<abbr bid="B25">25</abbr>
</abbrgrp>. In <abbrgrp>
<abbr bid="B24">24</abbr>
</abbrgrp> it is assumed that components are sparse and approximately uncorrelated ("quasi-uncorrelated"). This basically means that each feature belongs to one component only. That is an even stronger assumption than the one used by the method proposed here. Likewise, in the maximum likelihood (ML) approach to the overcomplete problem in <abbrgrp>
<abbr bid="B25">25</abbr>
</abbrgrp> it is assumed that marginal distributions of the components are Laplacian. In this case the component estimation problem (assuming the mixing matrix is estimated by clustering) is reduced to a linear program with equality constraints. In other words, a probabilistic ML problem is converted into a deterministic linear programming task. Hence, overcomplete ICA effectively becomes SCA. This further justifies our choice of the state-of-the-art SCA method (described in section 1.3) for the component extraction task. Here, we propose a novel type of LMM which is composed of two samples only:</p>
<p>
<display-formula id="M2a">
<m:math name="1471-2105-12-496-i7" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mrow>
   <m:mfenced separators="" open="[" close="]">
      <m:mrow>
         <m:mtable equalrows="false" columnlines="none none none none none none none none none none none none none none none none none none none" equalcolumns="false" class="array">
            <m:mtr>
               <m:mtd class="array" columnalign="center">
                  <m:msub>
                     <m:mrow>
                        <m:mstyle mathvariant="bold">
                           <m:mi mathvariant="normal">x</m:mi>
                        </m:mstyle>
                     </m:mrow>
                     <m:mrow>
                        <m:mstyle class="text">
                           <m:mtext class="textsf" mathvariant="sans-serif">control</m:mtext>
                        </m:mstyle>
                     </m:mrow>
                  </m:msub>
               </m:mtd>
            </m:mtr>
            <m:mtr>
               <m:mtd class="array" columnalign="center">
                  <m:mstyle mathvariant="bold">
                     <m:mi mathvariant="normal">x</m:mi>
                  </m:mstyle>
               </m:mtd>
            </m:mtr>
            <m:mtr>
               <m:mtd class="array" columnalign="center"/>
            </m:mtr>
         </m:mtable>
      </m:mrow>
   </m:mfenced>
   <m:mo class="MathClass-rel">=</m:mo>
   <m:msub>
      <m:mrow>
         <m:mstyle mathvariant="bold">
            <m:mi mathvariant="normal">A</m:mi>
         </m:mstyle>
      </m:mrow>
      <m:mrow>
         <m:mstyle class="text">
            <m:mtext class="textsf" mathvariant="sans-serif">control</m:mtext>
         </m:mstyle>
      </m:mrow>
   </m:msub>
   <m:msub>
      <m:mrow>
         <m:mstyle mathvariant="bold">
            <m:mi mathvariant="normal">S</m:mi>
         </m:mstyle>
      </m:mrow>
      <m:mrow>
         <m:mstyle class="text">
            <m:mtext class="textsf" mathvariant="sans-serif">control</m:mtext>
         </m:mstyle>
      </m:mrow>
   </m:msub>
</m:mrow>
</m:math>
</display-formula>
</p>
<p>
<display-formula id="M2b">
<m:math name="1471-2105-12-496-i8" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mrow>
   <m:mfenced separators="" open="[" close="]">
      <m:mrow>
         <m:mtable equalrows="false" columnlines="none none none none none none none none none none none none none none none none none none none" equalcolumns="false" class="array">
            <m:mtr>
               <m:mtd class="array" columnalign="center">
                  <m:msub>
                     <m:mrow>
                        <m:mstyle mathvariant="bold">
                           <m:mi mathvariant="normal">x</m:mi>
                        </m:mstyle>
                     </m:mrow>
                     <m:mrow>
                        <m:mstyle class="text">
                           <m:mtext class="textsf" mathvariant="sans-serif">disease</m:mtext>
                        </m:mstyle>
                     </m:mrow>
                  </m:msub>
               </m:mtd>
            </m:mtr>
            <m:mtr>
               <m:mtd class="array" columnalign="center">
                  <m:mstyle mathvariant="bold">
                     <m:mi mathvariant="normal">x</m:mi>
                  </m:mstyle>
               </m:mtd>
            </m:mtr>
            <m:mtr>
               <m:mtd class="array" columnalign="center"/>
            </m:mtr>
         </m:mtable>
      </m:mrow>
   </m:mfenced>
   <m:mo class="MathClass-rel">=</m:mo>
   <m:msub>
      <m:mrow>
         <m:mstyle mathvariant="bold">
            <m:mi mathvariant="normal">A</m:mi>
         </m:mstyle>
      </m:mrow>
      <m:mrow>
         <m:mstyle class="text">
            <m:mtext class="textsf" mathvariant="sans-serif">disease</m:mtext>
         </m:mstyle>
      </m:mrow>
   </m:msub>
   <m:msub>
      <m:mrow>
         <m:mstyle mathvariant="bold">
            <m:mi mathvariant="normal">S</m:mi>
         </m:mstyle>
      </m:mrow>
      <m:mrow>
         <m:mstyle class="text">
            <m:mtext class="textsf" mathvariant="sans-serif">disease</m:mtext>
         </m:mstyle>
      </m:mrow>
   </m:msub>
</m:mrow>
</m:math>
</display-formula>
</p>
<p>The first sample is a reference sample representing the control group, <b>x</b>
<sub>control </sub>&#8712; &#8477;<it>
<sup>K</sup>
</it>, in (2a) and the case (disease) group, <b>x</b>
<sub>disease </sub>&#8712; &#8477;<it>
<sup>K</sup>
</it>, in (2b). The second sample is the actual test sample: <inline-formula>
<m:math name="1471-2105-12-496-i9" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mstyle mathvariant="bold">
   <m:mi mathvariant="normal">x</m:mi>
</m:mstyle>
<m:mo class="MathClass-rel">&#8712;</m:mo>
<m:msubsup>
   <m:mrow>
      <m:mfenced separators="" open="{" close="}">
         <m:mrow>
            <m:msub>
               <m:mrow>
                  <m:mstyle mathvariant="bold">
                     <m:mi mathvariant="normal">x</m:mi>
                  </m:mstyle>
               </m:mrow>
               <m:mrow>
                  <m:mi>n</m:mi>
               </m:mrow>
            </m:msub>
            <m:mo class="MathClass-rel">&#8712;</m:mo>
            <m:msup>
               <m:mrow>
                  <m:mi>&#8477;</m:mi>
               </m:mrow>
               <m:mrow>
                  <m:mi>K</m:mi>
               </m:mrow>
            </m:msup>
         </m:mrow>
      </m:mfenced>
   </m:mrow>
   <m:mrow>
      <m:mi>n</m:mi>
      <m:mo class="MathClass-rel">=</m:mo>
      <m:mn>1</m:mn>
   </m:mrow>
   <m:mrow>
      <m:mi>N</m:mi>
   </m:mrow>
</m:msubsup>
</m:math>
</inline-formula>. Coefficients of matrices <inline-formula>
<m:math name="1471-2105-12-496-i10" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:msub>
   <m:mrow>
      <m:mstyle mathvariant="bold">
         <m:mi mathvariant="normal">A</m:mi>
      </m:mstyle>
   </m:mrow>
   <m:mrow>
      <m:mstyle class="text">
         <m:mtext class="textsf" mathvariant="sans-serif">control</m:mtext>
      </m:mstyle>
   </m:mrow>
</m:msub>
<m:mo class="MathClass-rel">&#8712;</m:mo>
<m:msubsup>
   <m:mrow>
      <m:mi>&#8477;</m:mi>
   </m:mrow>
   <m:mrow>
      <m:mo class="MathClass-bin">+</m:mo>
   </m:mrow>
   <m:mrow>
      <m:mn>2</m:mn>
      <m:mo class="MathClass-bin">&#215;</m:mo>
      <m:mi>M</m:mi>
   </m:mrow>
</m:msubsup>
</m:math>
</inline-formula> and <inline-formula>
<m:math name="1471-2105-12-496-i11" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:msub>
   <m:mrow>
      <m:mstyle mathvariant="bold">
         <m:mi mathvariant="normal">A</m:mi>
      </m:mstyle>
   </m:mrow>
   <m:mrow>
      <m:mstyle class="text">
         <m:mtext class="textsf" mathvariant="sans-serif">disease</m:mtext>
      </m:mstyle>
   </m:mrow>
</m:msub>
<m:mo class="MathClass-rel">&#8712;</m:mo>
<m:msubsup>
   <m:mrow>
      <m:mi>&#8477;</m:mi>
   </m:mrow>
   <m:mrow>
      <m:mo class="MathClass-bin">+</m:mo>
   </m:mrow>
   <m:mrow>
      <m:mn>2</m:mn>
      <m:mo class="MathClass-bin">&#215;</m:mo>
      <m:mi>M</m:mi>
   </m:mrow>
</m:msubsup>
</m:math>
</inline-formula> in (2a) and (2b) represent the relative concentrations at which the related components are present in the mixture samples <b>x </b>and <b>x</b>
<sub>control </sub>in (2a) or <b>x </b>and <b>x</b>
<sub>disease </sub>in (2b). Source matrices <b>S</b>
<sub>control </sub>&#8712; &#8477;<sup>
<it>M </it>&#215; <it>K </it>
</sup>and <b>S</b>
<sub>disease </sub>&#8712; &#8477;<sup>
<it>M </it>&#215; <it>K </it>
</sup>contain, as row vectors, the disease and control specific components and, possibly, components that are not differentially expressed. The number of components <it>M </it>is assumed to be greater than or equal to 2. Evidently, for <it>M </it>= 2 the existence of components that are not differentially expressed is not postulated. Postulating components with indifferent features yields less complex disease and control specific components for classification (see also the discussion in section 1.7). These neutral components absorb features that do not vary substantially across the sample population, so such features are removed automatically from each sample. The concentration is relative because BSS methods estimate the mixing and source matrices only up to a scaling constant. Therefore, it is customary to constrain the column vectors of the mixing matrix to unit &#8467;<sub>2 </sub>or &#8467;<sub>1 </sub>norm. The LMM proposed here is built upon the implicit assumption that disease specific features (<it>m</it>/<it>z </it>ratios or genes) are present in prevailing concentration in disease specific samples and in minor concentration in control specific samples. Conversely, control specific features are present in prevailing concentration in control specific samples and in minor concentration in disease specific samples. Features that are not differentially expressed are present in similar concentrations in both control and disease specific samples. These groups of features constitute the components, and the similarity of their concentration profiles enables automatic selection of the components extracted by unsupervised factorization. The assumption on the prevailing concentrations of up- and down-regulated features needs to be understood in a relative sense. It is justified by the locality of the proposed method, since the components are extracted on a sample-by-sample basis. 
Thus, to be allocated to the same component (a case or a control specific one), a feature does not need to be expressed equally strongly in each sample. Since the LMMs (2a)/(2b) considered here comprise two samples only, the non-negative mixing vectors are confined to the first quadrant of the plane spanned by the control reference sample and the test sample, see Figure <figr fid="F1">1a</figr>, or by the disease reference sample and the test sample, see Figure <figr fid="F1">1b</figr>. Thus, upon decomposition of the LMM (2a) into <it>M </it>components, the one associated with the mixing vector that forms the maximal angle with the axis defined by the control reference sample is selected as the disease specific component, Figure <figr fid="F1">1a</figr>. Conversely, the one associated with the mixing vector that forms the minimal angle with the axis defined by the control reference sample is selected as the control specific component. When decomposition is performed with respect to a disease reference sample, LMM (2b), the logic of the angle-based automatic selection is reversed: the minimal angle identifies the disease specific component and the maximal angle identifies the control specific component, see Figure <figr fid="F1">1b</figr>. The components not selected as disease or control specific are considered neutral, i.e. not differentially expressed. Thus, the LMMs (2a)/(2b) enable automatic selection of the components extracted by unsupervised factorization of mixture samples. Unlike the selection method presented in <abbrgrp>
<abbr bid="B2">2</abbr>
</abbrgrp>, which is based on fixed thresholds that need to be determined by cross-validation, the thresholds (mixing angles) used in the method presented here are sample adaptive. The assumption that each feature is contained in the disease specific component and one of the neutral components, or in the control specific component and one of the neutral components, represents a sparseness constraint. It enables solution of the related BSS problems through the, in principle, two-stage SCA method described in section 1.3. However, the sparseness constraint is justified not only by mathematical reasons but also, as emphasized in <abbrgrp>
<abbr bid="B3">3</abbr>
<abbr bid="B6">6</abbr>
<abbr bid="B11">11</abbr>
<abbr bid="B12">12</abbr>
</abbrgrp>, by biological reasons. As noted in <abbrgrp>
<abbr bid="B6">6</abbr>
</abbrgrp>, this is necessary if an underlying component (source signal) is to be indicative of ongoing biological processes in a sample (cell, tissue, serum, etc.). The same conjecture has also been used in the three-component gene discovery method in <abbrgrp>
<abbr bid="B2">2</abbr>
</abbrgrp>. In this respect, the sparseness constrained NMF methods for microarray data analysis proposed in <abbrgrp>
<abbr bid="B7">7</abbr>
<abbr bid="B11">11</abbr>
<abbr bid="B12">12</abbr>
</abbrgrp> also assume the same working hypothesis. As discussed in <abbrgrp>
<abbr bid="B11">11</abbr>
<abbr bid="B12">12</abbr>
</abbrgrp>, it is the sparseness constraint that enabled the biological relevance of the obtained results. In microarray data analysis, enforcing a sparseness constraint is biologically justified because a sparser <b>S </b>gives rise to metagenes (if factorization is performed by NMF), or to expression modes (if factorization is performed by ICA), that comprise a few dominantly co-expressed genes, which may indicate good local features for a specific disease <abbrgrp>
<abbr bid="B11">11</abbr>
</abbrgrp>. A closer interpretation of the reference-based mixture model (2a)/(2b) reveals several of its important characteristics. Since the assignment of features to each of the two or more postulated components is based on sample adaptive thresholds (decomposition is localized), one gene (or <it>m</it>/<it>z </it>ratio) may be highly up-regulated in one sample and significantly less expressed in another sample. Yet, if it is present in prevailing concentration in both samples, it will in both cases be contained in the component automatically selected as disease or control specific. Moreover, sample adaptive component (feature) selection allows features selected as up- (or down-) regulated in one sample to be less (or more) expressed than features that are not differentially expressed in another sample. Thus, the extracted components selected as disease or control specific are composed of multiple features with different expression levels and joint discriminative power, rather than of a few (or even single) features only.</p>
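<p>The angle-based selection rule described above can be illustrated with a minimal sketch. It assumes that a non-negative 2 x M mixing matrix has already been estimated, and that its first row refers to the reference sample and its second row to the test sample; this row order, the function name select_components and the returned dictionary are illustrative assumptions, not part of the published method.</p>

```python
import numpy as np

def select_components(A, reference="control"):
    """Pick disease and control specific components from a 2 x M mixing matrix.

    Row 0 of A holds the relative concentrations in the reference sample and
    row 1 those in the test sample (this row order is an assumption of the
    sketch). Returns column indices of the selected components.
    """
    A = np.asarray(A, dtype=float)
    if A.ndim != 2 or A.shape[0] != 2 or 2 > A.shape[1]:
        raise ValueError("A must have shape (2, M) with M >= 2")
    if np.any(0 > A):
        raise ValueError("mixing coefficients must be non-negative")
    # Angle of each mixing vector measured from the reference-sample axis;
    # for non-negative vectors it lies between 0 and pi/2.
    angles = np.arctan2(A[1], A[0])
    far, near = int(np.argmax(angles)), int(np.argmin(angles))
    if reference == "control":
        # Model (2a): maximal angle -> disease specific, minimal -> control specific.
        disease, control = far, near
    elif reference == "disease":
        # Model (2b): the logic is reversed.
        disease, control = near, far
    else:
        raise ValueError("reference must be 'control' or 'disease'")
    # Everything not selected is treated as neutral (not differentially expressed).
    neutral = [m for m in range(A.shape[1]) if m not in (disease, control)]
    return {"disease": disease, "control": control,
            "neutral": neutral, "angles": angles}
```

<p>Because the angles are recomputed for every test sample, the selection threshold adapts to each sample, which is the property that distinguishes this rule from a fixed, cross-validated threshold.</p>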
<p>For disease prediction, both disease and control specific components can be used to train a classifier. The reason is that in the LMMs (2a) and (2b) they are extracted with respect to different reference samples and thus carry different but specific information. Hence, the proposed method yields four components to be retained for classifier training. In accordance with Figure <figr fid="F1">1</figr>, they are denoted as <inline-formula>
<m:math name="1471-2105-12-496-i12" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:msubsup>
   <m:mrow>
      <m:mstyle mathvariant="bold">
         <m:mi mathvariant="normal">s</m:mi>
      </m:mstyle>
   </m:mrow>
   <m:mrow>
      <m:mstyle class="text">
         <m:mtext class="textsf" mathvariant="sans-serif">control&#160;ref</m:mtext>
      </m:mstyle>
      <m:mstyle class="text">
         <m:mtext class="textsf" mathvariant="sans-serif">.;</m:mtext>
      </m:mstyle>
      <m:mi>n</m:mi>
   </m:mrow>
   <m:mrow>
      <m:mstyle class="text">
         <m:mtext class="textsf" mathvariant="sans-serif">disease</m:mtext>
      </m:mstyle>
   </m:mrow>
</m:msubsup>
</m:math>
</inline-formula>, <inline-formula>
<m:math name="1471-2105-12-496-i13" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:msubsup>
   <m:mrow>
      <m:mstyle mathvariant="bold">
         <m:mi mathvariant="normal">s</m:mi>
      </m:mstyle>
   </m:mrow>
   <m:mrow>
      <m:mstyle class="text">
         <m:mtext class="textsf" mathvariant="sans-serif">control&#160;ref</m:mtext>
      </m:mstyle>
      <m:mstyle class="text">
         <m:mtext class="textsf" mathvariant="sans-serif">.;</m:mtext>
      </m:mstyle>
      <m:mi>n</m:mi>
   </m:mrow>
   <m:mrow>
      <m:mstyle class="text">
         <m:mtext class="textsf" mathvariant="sans-serif">control</m:mtext>
      </m:mstyle>
   </m:mrow>
</m:msubsup>
</m:math>
</inline-formula>, <inline-formula>
<m:math name="1471-2105-12-496-i14" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:msubsup>
   <m:mrow>
      <m:mstyle mathvariant="bold">
         <m:mi mathvariant="normal">s</m:mi>
      </m:mstyle>
   </m:mrow>
   <m:mrow>
      <m:mstyle class="text">
         <m:mtext class="textsf" mathvariant="sans-serif">disease&#160;ref</m:mtext>
      </m:mstyle>
      <m:mstyle class="text">
         <m:mtext class="textsf" mathvariant="sans-serif">.;</m:mtext>
      </m:mstyle>
      <m:mi>n</m:mi>
   </m:mrow>
   <m:mrow>
      <m:mstyle class="text">
         <m:mtext class="textsf" mathvariant="sans-serif">control</m:mtext>
      </m:mstyle>
   </m:mrow>
</m:msubsup>
</m:math>
</inline-formula>, and <inline-formula>
<m:math name="1471-2105-12-496-i15" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:msubsup>
   <m:mrow>
      <m:mstyle mathvariant="bold">
         <m:mi mathvariant="normal">s</m:mi>
      </m:mstyle>
   </m:mrow>
   <m:mrow>
      <m:mstyle class="text">
         <m:mtext class="textsf" mathvariant="sans-serif">disease&#160;ref</m:mtext>
      </m:mstyle>
      <m:mstyle class="text">
         <m:mtext class="textsf" mathvariant="sans-serif">.;</m:mtext>
      </m:mstyle>
      <m:mi>n</m:mi>
   </m:mrow>
   <m:mrow>
      <m:mstyle class="text">
         <m:mtext class="textsf" mathvariant="sans-serif">disease</m:mtext>
      </m:mstyle>
   </m:mrow>
</m:msubsup>
</m:math>
</inline-formula>, where <it>n </it>denotes the index of the test sample <b>x</b>
<it>
<sub>n </sub>
</it>used in the current decomposition. Components extracted from <it>N </it>mixture samples form four sets of labelled feature vectors as follows: <inline-formula>
<m:math name="1471-2105-12-496-i16" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:msubsup>
   <m:mrow>
      <m:mfenced separators="" open="{" close="}">
         <m:mrow>
            <m:msubsup>
               <m:mrow>
                  <m:mstyle mathvariant="bold">
                     <m:mi mathvariant="normal">s</m:mi>
                  </m:mstyle>
               </m:mrow>
               <m:mrow>
                  <m:mstyle class="text">
                     <m:mtext class="textsf" mathvariant="sans-serif">control&#160;ref</m:mtext>
                  </m:mstyle>
                  <m:mstyle class="text">
                     <m:mtext class="textsf" mathvariant="sans-serif">.;</m:mtext>
                  </m:mstyle>
                  <m:mi>n</m:mi>
               </m:mrow>
               <m:mrow>
                  <m:mstyle class="text">
                     <m:mtext class="textsf" mathvariant="sans-serif">disease</m:mtext>
                  </m:mstyle>
               </m:mrow>
            </m:msubsup>
            <m:mo class="MathClass-punc">,</m:mo>
            <m:msub>
               <m:mrow>
                  <m:mi>y</m:mi>
               </m:mrow>
               <m:mrow>
                  <m:mi>n</m:mi>
               </m:mrow>
            </m:msub>
         </m:mrow>
      </m:mfenced>
   </m:mrow>
   <m:mrow>
      <m:mi>n</m:mi>
      <m:mo class="MathClass-rel">=</m:mo>
      <m:mn>1</m:mn>
   </m:mrow>
   <m:mrow>
      <m:mi>N</m:mi>
   </m:mrow>
</m:msubsup>
</m:math>
</inline-formula>, <inline-formula>
<m:math name="1471-2105-12-496-i17" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:msubsup>
   <m:mrow>
      <m:mfenced separators="" open="{" close="}">
         <m:mrow>
            <m:msubsup>
               <m:mrow>
                  <m:mstyle mathvariant="bold">
                     <m:mi mathvariant="normal">s</m:mi>
                  </m:mstyle>
               </m:mrow>
               <m:mrow>
                  <m:mstyle class="text">
                     <m:mtext class="textsf" mathvariant="sans-serif">control&#160;ref</m:mtext>
                  </m:mstyle>
                  <m:mstyle class="text">
                     <m:mtext class="textsf" mathvariant="sans-serif">.;</m:mtext>
                  </m:mstyle>
                  <m:mi>n</m:mi>
               </m:mrow>
               <m:mrow>
                  <m:mstyle class="text">
                     <m:mtext class="textsf" mathvariant="sans-serif">control</m:mtext>
                  </m:mstyle>
               </m:mrow>
            </m:msubsup>
            <m:mo class="MathClass-punc">,</m:mo>
            <m:msub>
               <m:mrow>
                  <m:mi>y</m:mi>
               </m:mrow>
               <m:mrow>
                  <m:mi>n</m:mi>
               </m:mrow>
            </m:msub>
         </m:mrow>
      </m:mfenced>
   </m:mrow>
   <m:mrow>
      <m:mi>n</m:mi>
      <m:mo class="MathClass-rel">=</m:mo>
      <m:mn>1</m:mn>
   </m:mrow>
   <m:mrow>
      <m:mi>N</m:mi>
   </m:mrow>
</m:msubsup>
</m:math>
</inline-formula>, <inline-formula>
<m:math name="1471-2105-12-496-i18" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:msubsup>
   <m:mrow>
      <m:mfenced separators="" open="{" close="}">
         <m:mrow>
            <m:msubsup>
               <m:mrow>
                  <m:mstyle mathvariant="bold">
                     <m:mi mathvariant="normal">s</m:mi>
                  </m:mstyle>
               </m:mrow>
               <m:mrow>
                  <m:mstyle class="text">
                     <m:mtext class="textsf" mathvariant="sans-serif">disease&#160;ref</m:mtext>
                  </m:mstyle>
                  <m:mstyle class="text">
                     <m:mtext class="textsf" mathvariant="sans-serif">.;</m:mtext>
                  </m:mstyle>
                  <m:mi>n</m:mi>
               </m:mrow>
               <m:mrow>
                  <m:mstyle class="text">
                     <m:mtext class="textsf" mathvariant="sans-serif">control</m:mtext>
                  </m:mstyle>
               </m:mrow>
            </m:msubsup>
            <m:mo class="MathClass-punc">,</m:mo>
            <m:msub>
               <m:mrow>
                  <m:mi>y</m:mi>
               </m:mrow>
               <m:mrow>
                  <m:mi>n</m:mi>
               </m:mrow>
            </m:msub>
         </m:mrow>
      </m:mfenced>
   </m:mrow>
   <m:mrow>
      <m:mi>n</m:mi>
      <m:mo class="MathClass-rel">=</m:mo>
      <m:mn>1</m:mn>
   </m:mrow>
   <m:mrow>
      <m:mi>N</m:mi>
   </m:mrow>
</m:msubsup>
</m:math>
</inline-formula> and <inline-formula>
<m:math name="1471-2105-12-496-i19" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:msubsup>
   <m:mrow>
      <m:mfenced separators="" open="{" close="}">
         <m:mrow>
            <m:msubsup>
               <m:mrow>
                  <m:mstyle mathvariant="bold">
                     <m:mi mathvariant="normal">s</m:mi>
                  </m:mstyle>
               </m:mrow>
               <m:mrow>
                  <m:mstyle class="text">
                     <m:mtext class="textsf" mathvariant="sans-serif">disease&#160;ref</m:mtext>
                  </m:mstyle>
                  <m:mstyle class="text">
                     <m:mtext class="textsf" mathvariant="sans-serif">.;</m:mtext>
                  </m:mstyle>
                  <m:mi>n</m:mi>
               </m:mrow>
               <m:mrow>
                  <m:mstyle class="text">
                     <m:mtext class="textsf" mathvariant="sans-serif">disease</m:mtext>
                  </m:mstyle>
               </m:mrow>
            </m:msubsup>
            <m:mo class="MathClass-punc">,</m:mo>
            <m:msub>
               <m:mrow>
                  <m:mi>y</m:mi>
               </m:mrow>
               <m:mrow>
                  <m:mi>n</m:mi>
               </m:mrow>
            </m:msub>
         </m:mrow>
      </m:mfenced>
   </m:mrow>
   <m:mrow>
      <m:mi>n</m:mi>
      <m:mo class="MathClass-rel">=</m:mo>
      <m:mn>1</m:mn>
   </m:mrow>
   <m:mrow>
      <m:mi>N</m:mi>
   </m:mrow>
</m:msubsup>
</m:math>
</inline-formula>. One or more classifiers can be trained on them, and the one with the highest accuracy achieved through cross-validation is selected for disease diagnosis.</p>
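<p>The cross-validated selection step can be sketched as follows. For brevity, the sketch scores a single classifier over the four sets of labelled feature vectors; the nearest-centroid classifier is only a simple stand-in chosen so that the example needs nothing beyond NumPy, and it is not the classifier used in this study. The function names, the number of folds and the random seed are likewise illustrative assumptions.</p>

```python
import numpy as np

def nearest_centroid_cv_accuracy(S, y, n_folds=5, seed=0):
    """k-fold cross-validated accuracy of a nearest-centroid classifier.

    S: (N, K) array, one extracted component (feature vector) per test sample.
    y: (N,) array of class labels.
    """
    S, y = np.asarray(S, dtype=float), np.asarray(y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    correct = 0
    for test_idx in folds:
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False
        classes = np.unique(y[train])
        # One centroid per class, estimated from the training part only.
        centroids = np.stack([S[train][y[train] == c].mean(axis=0)
                              for c in classes])
        # Distance of every held-out vector to every class centroid.
        diff = S[test_idx][:, None, :] - centroids[None, :, :]
        dist = np.linalg.norm(diff, axis=2)
        predicted = classes[dist.argmin(axis=1)]
        correct += int(np.sum(predicted == y[test_idx]))
    return correct / len(y)

def pick_best_feature_set(feature_sets, y, **cv_kwargs):
    """Return the name of the best-scoring feature set and all CV accuracies."""
    scores = {name: nearest_centroid_cv_accuracy(S, y, **cv_kwargs)
              for name, S in feature_sets.items()}
    best = max(scores, key=scores.get)
    return best, scores
```

<p>In use, feature_sets would map four hypothetical names, one per component type, to (N, K) arrays, and y would hold the N class labels. Scoring several classifiers in the same loop and keeping the best (classifier, feature set) pair follows the same pattern.</p>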
<p>Selection of the <it>unknown </it>number of components <it>M </it>is generally a non-trivial problem in matrix factorization and is part of the model validation procedure. Here, <it>M </it>is selected through cross-validation from the candidate values 2, 3, 4 and 5, because it directly determines the number of features used for classification. This follows from the previously described interpretation of the LMMs (2a) and (2b). Since disease prediction is based on the four components selected as disease and control specific, it is important that they are composed of features with high discriminative power. That is, they should contain features which are truly disease or control specific. A component considered here as disease or control specific (as well as neutral) can actually be composed of features (<it>m/z </it>ratios or genes) belonging to multiple substances (metabolites, analytes) that share similar relative concentrations. This is practically important since it makes the decomposition much less sensitive to an underestimation of the true total number of substances present in a sample. By setting the number of components to a predefined value <it>M</it>, the proposed method forces substances with similar concentrations to be linearly combined into more complex components composed of disease, neutral or control specific features. Provided that the concentration variability of these features across the samples is small, it would suffice to select the overall number of components as <it>M </it>= 3 or even <it>M </it>= 2. (In the latter case, the existence of features that are not differentially expressed is not postulated at all.) However, since we are dealing with biological samples, it is more realistic to expect that relative concentrations vary across the sample population. 
This is illustrated in Figures <figr fid="F1">1a</figr> and <figr fid="F1">1b</figr> by the ellipsoids around the vectors that represent the <it>average </it>concentration profiles of each group of features (components). As seen from Figure <figr fid="F1">1</figr>, some features considered neutral can, in a certain number of samples, be present in a higher concentration than the features that are considered disease (or control) specific in the majority of the samples. To partially remove such features from the disease and/or control specific components, the number of components <it>M </it>should be increased to <it>M </it>= 4 or perhaps even to <it>M </it>= 5. Thus, the existence of two or three neutral components should be postulated. This is expected to yield less complex disease and control specific components, in agreement with the principle of parsimony (see also the discussion in section 1.7). Model validation presented in section 1.4 suggests that this is indeed the case when concentration variability across the samples is significant. For real-world datasets, the number of components will not be known in advance. The strategy for dealing with this uncertainty is to use cross-validation to verify whether an increased number of components <it>M </it>indeed increases the accuracy of disease prediction.</p>
</sec>
<sec>
<st>
<p>1.3 Sparse component analysis algorithm</p>
</st>
<p>The proposed feature extraction/component selection method is based on a decomposition of the LMMs (2a)/(2b), comprised of two samples (reference sample and test sample), into <it>M </it>&#8805; 2 components. From the BSS point of view, this yields a determined BSS problem when <it>M </it>= 2 and an underdetermined BSS problem when <it>M </it>&#8805; 3 [26, 27, Chapter 10 in 17]. The enabling constraint for solving underdetermined BSS problems is sparseness of the components, and the corresponding methods are known under the common name of sparse component analysis (SCA) [26-29, Chapter 10 in 17]. As commented at the beginning of section 1.2, the overcomplete ICA [Chapter 16 in 18, 24, 25] basically reduces to SCA and also demands sparse sources. SCA has already been applied to microarray data analysis in <abbrgrp>
<abbr bid="B3">3</abbr>
<abbr bid="B6">6</abbr>
<abbr bid="B7">7</abbr>
<abbr bid="B11">11</abbr>
<abbr bid="B12">12</abbr>
</abbrgrp>. It has also been used in <abbrgrp>
<abbr bid="B22">22</abbr>
<abbr bid="B23">23</abbr>
</abbrgrp> to extract more than two components from two mixture samples of nuclear magnetic resonance and mass spectra. A sparseness constraint implies that each particular feature point <it>k </it>= 1, ...,<it>K </it>(<it>m/z </it>ratio or gene) belongs to a few components only. To this end, for the two-sample LMMs (2a)/(2b) used here, it is assumed that each feature point belongs to at most two components: either disease specific and neutral, or control specific and neutral. The biological plausibility of this assumption has been discussed above.</p>
<p>Algorithmic approaches used to solve the underdetermined BSS problem associated with (2a)/(2b) fall into two main categories: (<it>i</it>) estimating the concentration/mixing matrix and the component matrix simultaneously by minimizing the data fidelity term <inline-formula>
<m:math name="1471-2105-12-496-i20" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:msubsup>
   <m:mrow>
      <m:mfenced separators="" open="&#8741;" close="&#8741;">
         <m:mrow>
            <m:mstyle mathvariant="bold">
               <m:mi mathvariant="normal">X</m:mi>
            </m:mstyle>
            <m:mo class="MathClass-bin">-</m:mo>
            <m:msub>
               <m:mrow>
                  <m:mstyle mathvariant="bold">
                     <m:mi mathvariant="normal">A</m:mi>
                  </m:mstyle>
               </m:mrow>
               <m:mrow>
                  <m:mstyle class="text">
                     <m:mtext class="textsf" mathvariant="sans-serif">control</m:mtext>
                  </m:mstyle>
               </m:mrow>
            </m:msub>
            <m:msub>
               <m:mrow>
                  <m:mstyle mathvariant="bold">
                     <m:mi mathvariant="normal">S</m:mi>
                  </m:mstyle>
               </m:mrow>
               <m:mrow>
                  <m:mstyle class="text">
                     <m:mtext class="textsf" mathvariant="sans-serif">control</m:mtext>
                  </m:mstyle>
               </m:mrow>
            </m:msub>
         </m:mrow>
      </m:mfenced>
   </m:mrow>
   <m:mrow>
      <m:mi>F</m:mi>
   </m:mrow>
   <m:mrow>
      <m:mn>2</m:mn>
   </m:mrow>
</m:msubsup>
</m:math>
</inline-formula> or <inline-formula>
<m:math name="1471-2105-12-496-i21" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:msubsup>
   <m:mrow>
      <m:mfenced separators="" open="&#8741;" close="&#8741;">
         <m:mrow>
            <m:mstyle mathvariant="bold">
               <m:mi mathvariant="normal">X</m:mi>
            </m:mstyle>
            <m:mo class="MathClass-bin">-</m:mo>
            <m:msub>
               <m:mrow>
                  <m:mstyle mathvariant="bold">
                     <m:mi mathvariant="normal">A</m:mi>
                  </m:mstyle>
               </m:mrow>
               <m:mrow>
                  <m:mstyle class="text">
                     <m:mtext class="textsf" mathvariant="sans-serif">disease</m:mtext>
                  </m:mstyle>
               </m:mrow>
            </m:msub>
            <m:msub>
               <m:mrow>
                  <m:mstyle mathvariant="bold">
                     <m:mi mathvariant="normal">S</m:mi>
                  </m:mstyle>
               </m:mrow>
               <m:mrow>
                  <m:mstyle class="text">
                     <m:mtext class="textsf" mathvariant="sans-serif">disease</m:mtext>
                  </m:mstyle>
               </m:mrow>
            </m:msub>
         </m:mrow>
      </m:mfenced>
   </m:mrow>
   <m:mrow>
      <m:mi>F</m:mi>
   </m:mrow>
   <m:mrow>
      <m:mn>2</m:mn>
   </m:mrow>
</m:msubsup>
</m:math>
</inline-formula>, where <b>X </b>denotes the left-hand side of (2a) or (2b). The minimization is usually carried out through the alternating least squares (ALS) methodology with a sparseness constraint imposed on the source matrices <b>S</b>
<sub>control </sub>and <b>S</b>
<sub>disease</sub>, <abbrgrp>
<abbr bid="B19">19</abbr>
<abbr bid="B22">22</abbr>
<abbr bid="B23">23</abbr>
<abbr bid="B30">30</abbr>
<abbr bid="B31">31</abbr>
<abbr bid="B32">32</abbr>
</abbrgrp>; (<it>ii</it>) estimating the concentration/mixing matrices first by clustering, and the source/component matrices afterwards by solving an underdetermined system of linear equations through minimization of the &#8467;<it>
<sub>p </sub>
</it>norm, 0 &lt; <it>p &#8804; </it>1, of the column vectors <b>s</b>
<it>
<sub>k </sub>
</it>&#8712; &#8477;<it>
<sup>M </sup>
</it>of <b>S</b>
<sub>control </sub>and <b>S</b>
<sub>disease</sub>, <abbrgrp>
<abbr bid="B25">25</abbr>
<abbr bid="B26">26</abbr>
<abbr bid="B27">27</abbr>
<abbr bid="B28">28</abbr>
<abbr bid="B29">29</abbr>
<abbr bid="B33">33</abbr>
<abbr bid="B34">34</abbr>
<abbr bid="B35">35</abbr>
</abbrgrp>. As discussed in <abbrgrp>
<abbr bid="B6">6</abbr>
</abbrgrp>, a sparseness constrained minimization of the data fidelity term is sensitive to the choice of the sparseness constraint. On the other hand, it has been recognized in <abbrgrp>
<abbr bid="B33">33</abbr>
<abbr bid="B34">34</abbr>
<abbr bid="B35">35</abbr>
</abbrgrp> that accurate estimation of the concentration matrix enables accurate solution of even determined BSS problems. To this end, the selection of feature points where only a single component is present is of special importance. At these points, the feature vector and the corresponding mixing vector are collinear. For example, if feature <it>k </it>belongs to component <it>m</it>, then: <b>x</b>
<it>
<sub>k </sub>
</it>&#8776; <b>a</b>
<it>
<sub>m </sub>s<sub>mk</sub>
</it>. Thus, clustering a set of single-component points (SCPs) ought to yield an accurate estimate of the mixing matrix, whose columns are given by the cluster centroids. It has been demonstrated in <abbrgrp>
<abbr bid="B33">33</abbr>
</abbrgrp> that such estimation of the mixing matrix, where hierarchical clustering was used, yields a more accurate solution of the determined BSS problem, <b>S </b>= <it>pinv</it>(<b>A</b>)<b>X</b>, than the one obtained by ICA algorithms. The selection of SCPs is therefore essential for accurate estimation of the mixing matrix. Such feature points are identified among all <it>K </it>points using a geometric criterion based on the observation that, at these points, the real and imaginary parts of the mixture samples point either in the same or in the opposite direction <abbrgrp>
<abbr bid="B33">33</abbr>
<abbr bid="B34">34</abbr>
</abbrgrp>. Since protein (mass spectra) and gene expression levels are real-valued sequences, an analytic continuation <abbrgrp>
<abbr bid="B22">22</abbr>
</abbrgrp> of mixture samples:</p>
<p>
<inline-formula>
<m:math name="1471-2105-12-496-i22" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:msub>
   <m:mrow>
      <m:mstyle mathvariant="bold">
         <m:mi mathvariant="normal">x</m:mi>
      </m:mstyle>
   </m:mrow>
   <m:mrow>
      <m:mi>n</m:mi>
   </m:mrow>
</m:msub>
<m:mo class="MathClass-rel">&#8614;</m:mo>
<m:msub>
   <m:mrow>
      <m:mover accent="true">
         <m:mrow>
            <m:mstyle mathvariant="bold">
               <m:mi mathvariant="normal">x</m:mi>
            </m:mstyle>
         </m:mrow>
         <m:mo class="MathClass-op">&#771;</m:mo>
      </m:mover>
   </m:mrow>
   <m:mrow>
      <m:mi>n</m:mi>
   </m:mrow>
</m:msub>
<m:mo class="MathClass-rel">=</m:mo>
<m:msub>
   <m:mrow>
      <m:mstyle mathvariant="bold">
         <m:mi mathvariant="normal">x</m:mi>
      </m:mstyle>
   </m:mrow>
   <m:mrow>
      <m:mi>n</m:mi>
   </m:mrow>
</m:msub>
<m:mo class="MathClass-bin">+</m:mo>
<m:msqrt>
   <m:mrow>
      <m:mo class="MathClass-bin">-</m:mo>
      <m:mn>1</m:mn>
   </m:mrow>
</m:msqrt>
<m:mi>H</m:mi>
<m:mrow>
   <m:mo class="MathClass-open">(</m:mo>
   <m:mrow>
      <m:msub>
         <m:mrow>
            <m:mstyle mathvariant="bold">
               <m:mi mathvariant="normal">x</m:mi>
            </m:mstyle>
         </m:mrow>
         <m:mrow>
            <m:mi>n</m:mi>
         </m:mrow>
      </m:msub>
   </m:mrow>
   <m:mo class="MathClass-close">)</m:mo>
</m:mrow>
</m:math>
</inline-formula> is used to obtain a complex representation, where <it>H</it>(<b>x</b>
<it>
<sub>n</sub>
</it>) denotes the Hilbert transform of <b>x</b>
<it>
<sub>n</sub>
</it>. The feature point <it>k </it>is added to the set of <it>J </it>SCPs provided that the following criterion is satisfied:</p>
<p>
<display-formula>
<m:math name="1471-2105-12-496-i23" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mrow>
   <m:mfenced separators="" open="&#8741;" close="&#8741;">
      <m:mrow>
         <m:mfrac>
            <m:mrow>
               <m:mi>R</m:mi>
               <m:msup>
                  <m:mrow>
                     <m:mrow>
                        <m:mo class="MathClass-open">(</m:mo>
                        <m:mrow>
                           <m:msub>
                              <m:mrow>
                                 <m:mover accent="true">
                                    <m:mrow>
                                       <m:mstyle mathvariant="bold">
                                          <m:mi mathvariant="normal">x</m:mi>
                                       </m:mstyle>
                                    </m:mrow>
                                    <m:mo class="MathClass-op">&#771;</m:mo>
                                 </m:mover>
                              </m:mrow>
                              <m:mrow>
                                 <m:mi>k</m:mi>
                              </m:mrow>
                           </m:msub>
                        </m:mrow>
                        <m:mo class="MathClass-close">)</m:mo>
                     </m:mrow>
                  </m:mrow>
                  <m:mrow>
                     <m:mstyle class="text">
                        <m:mtext class="textsf" mathvariant="sans-serif">T</m:mtext>
                     </m:mstyle>
                  </m:mrow>
               </m:msup>
               <m:mi>I</m:mi>
               <m:mrow>
                  <m:mo class="MathClass-open">(</m:mo>
                  <m:mrow>
                     <m:msub>
                        <m:mrow>
                           <m:mover accent="true">
                              <m:mrow>
                                 <m:mstyle mathvariant="bold">
                                    <m:mi mathvariant="normal">x</m:mi>
                                 </m:mstyle>
                              </m:mrow>
                              <m:mo class="MathClass-op">&#771;</m:mo>
                           </m:mover>
                        </m:mrow>
                        <m:mrow>
                           <m:mi>k</m:mi>
                        </m:mrow>
                     </m:msub>
                  </m:mrow>
                  <m:mo class="MathClass-close">)</m:mo>
               </m:mrow>
            </m:mrow>
            <m:mrow>
               <m:mfenced separators="" open="&#8741;" close="&#8741;">
                  <m:mrow>
                     <m:mi>R</m:mi>
                     <m:mrow>
                        <m:mo class="MathClass-open">(</m:mo>
                        <m:mrow>
                           <m:msub>
                              <m:mrow>
                                 <m:mover accent="true">
                                    <m:mrow>
                                       <m:mstyle mathvariant="bold">
                                          <m:mi mathvariant="normal">x</m:mi>
                                       </m:mstyle>
                                    </m:mrow>
                                    <m:mo class="MathClass-op">&#771;</m:mo>
                                 </m:mover>
                              </m:mrow>
                              <m:mrow>
                                 <m:mi>k</m:mi>
                              </m:mrow>
                           </m:msub>
                        </m:mrow>
                        <m:mo class="MathClass-close">)</m:mo>
                     </m:mrow>
                  </m:mrow>
               </m:mfenced>
               <m:mfenced separators="" open="&#8741;" close="&#8741;">
                  <m:mrow>
                     <m:mi>I</m:mi>
                     <m:mrow>
                        <m:mo class="MathClass-open">(</m:mo>
                        <m:mrow>
                           <m:msub>
                              <m:mrow>
                                 <m:mover accent="true">
                                    <m:mrow>
                                       <m:mstyle mathvariant="bold">
                                          <m:mi mathvariant="normal">x</m:mi>
                                       </m:mstyle>
                                    </m:mrow>
                                    <m:mo class="MathClass-op">&#771;</m:mo>
                                 </m:mover>
                              </m:mrow>
                              <m:mrow>
                                 <m:mi>k</m:mi>
                              </m:mrow>
                           </m:msub>
                        </m:mrow>
                        <m:mo class="MathClass-close">)</m:mo>
                     </m:mrow>
                  </m:mrow>
               </m:mfenced>
            </m:mrow>
         </m:mfrac>
      </m:mrow>
   </m:mfenced>
   <m:mo class="MathClass-rel">&#8805;</m:mo>
   <m:mo class="qopname"> cos</m:mo>
   <m:mrow>
      <m:mo class="MathClass-open">(</m:mo>
      <m:mrow>
         <m:mi mathvariant="normal">&#916;</m:mi>
         <m:mi>&#952;</m:mi>
      </m:mrow>
      <m:mo class="MathClass-close">)</m:mo>
   </m:mrow>
   <m:mspace width="1em" class="quad"/>
   <m:mi>k</m:mi>
   <m:mo class="MathClass-rel">&#8712;</m:mo>
   <m:mrow>
      <m:mo class="MathClass-open">{</m:mo>
      <m:mrow>
         <m:mn>1</m:mn>
         <m:mo class="MathClass-punc">,</m:mo>
         <m:mo class="MathClass-punc">.</m:mo>
         <m:mo class="MathClass-punc">.</m:mo>
         <m:mo class="MathClass-punc">.</m:mo>
         <m:mo class="MathClass-punc">,</m:mo>
         <m:mi>K</m:mi>
      </m:mrow>
      <m:mo class="MathClass-close">}</m:mo>
   </m:mrow>
</m:mrow>
</m:math>
</display-formula>
</p>
<p>where <inline-formula>
<m:math name="1471-2105-12-496-i24" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mi>R</m:mi>
<m:mrow>
   <m:mo class="MathClass-open">(</m:mo>
   <m:mrow>
      <m:msub>
         <m:mrow>
            <m:mover accent="true">
               <m:mrow>
                  <m:mstyle mathvariant="bold">
                     <m:mi mathvariant="normal">x</m:mi>
                  </m:mstyle>
               </m:mrow>
               <m:mo class="MathClass-op">&#771;</m:mo>
            </m:mover>
         </m:mrow>
         <m:mrow>
            <m:mi>k</m:mi>
         </m:mrow>
      </m:msub>
   </m:mrow>
   <m:mo class="MathClass-close">)</m:mo>
</m:mrow>
</m:math>
</inline-formula> and <inline-formula>
<m:math name="1471-2105-12-496-i25" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mi>I</m:mi>
<m:mrow>
   <m:mo class="MathClass-open">(</m:mo>
   <m:mrow>
      <m:msub>
         <m:mrow>
            <m:mover accent="true">
               <m:mrow>
                  <m:mstyle mathvariant="bold">
                     <m:mi mathvariant="normal">x</m:mi>
                  </m:mstyle>
               </m:mrow>
               <m:mo class="MathClass-op">&#771;</m:mo>
            </m:mover>
         </m:mrow>
         <m:mrow>
            <m:mi>k</m:mi>
         </m:mrow>
      </m:msub>
   </m:mrow>
   <m:mo class="MathClass-close">)</m:mo>
</m:mrow>
</m:math>
</inline-formula> denote the real and imaginary parts of <inline-formula>
<m:math name="1471-2105-12-496-i26" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:msub>
   <m:mrow>
      <m:mover accent="true">
         <m:mrow>
            <m:mstyle mathvariant="bold">
               <m:mi mathvariant="normal">x</m:mi>
            </m:mstyle>
         </m:mrow>
         <m:mo class="MathClass-op">&#771;</m:mo>
      </m:mover>
   </m:mrow>
   <m:mrow>
      <m:mi>k</m:mi>
   </m:mrow>
</m:msub>
</m:math>
</inline-formula>, respectively, 'T' denotes the transpose operation, <inline-formula>
<m:math name="1471-2105-12-496-i27" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mfenced separators="" open="&#8741;" close="&#8741;">
   <m:mrow>
      <m:mi>R</m:mi>
      <m:mrow>
         <m:mo class="MathClass-open">(</m:mo>
         <m:mrow>
            <m:msub>
               <m:mrow>
                  <m:mover accent="true">
                     <m:mrow>
                        <m:mstyle mathvariant="bold">
                           <m:mi mathvariant="normal">x</m:mi>
                        </m:mstyle>
                     </m:mrow>
                     <m:mo class="MathClass-op">&#771;</m:mo>
                  </m:mover>
               </m:mrow>
               <m:mrow>
                  <m:mi>k</m:mi>
               </m:mrow>
            </m:msub>
         </m:mrow>
         <m:mo class="MathClass-close">)</m:mo>
      </m:mrow>
   </m:mrow>
</m:mfenced>
</m:math>
</inline-formula> and <inline-formula>
<m:math name="1471-2105-12-496-i28" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mfenced separators="" open="&#8741;" close="&#8741;">
   <m:mrow>
      <m:mi>I</m:mi>
      <m:mrow>
         <m:mo class="MathClass-open">(</m:mo>
         <m:mrow>
            <m:msub>
               <m:mrow>
                  <m:mover accent="true">
                     <m:mrow>
                        <m:mstyle mathvariant="bold">
                           <m:mi mathvariant="normal">x</m:mi>
                        </m:mstyle>
                     </m:mrow>
                     <m:mo class="MathClass-op">&#771;</m:mo>
                  </m:mover>
               </m:mrow>
               <m:mrow>
                  <m:mi>k</m:mi>
               </m:mrow>
            </m:msub>
         </m:mrow>
         <m:mo class="MathClass-close">)</m:mo>
      </m:mrow>
   </m:mrow>
</m:mfenced>
</m:math>
</inline-formula> denote the &#8467;<sub>2</sub>-norms of <inline-formula>
<m:math name="1471-2105-12-496-i29" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mi>R</m:mi>
<m:mrow>
   <m:mo class="MathClass-open">(</m:mo>
   <m:mrow>
      <m:msub>
         <m:mrow>
            <m:mover accent="true">
               <m:mrow>
                  <m:mstyle mathvariant="bold">
                     <m:mi mathvariant="normal">x</m:mi>
                  </m:mstyle>
               </m:mrow>
               <m:mo class="MathClass-op">&#771;</m:mo>
            </m:mover>
         </m:mrow>
         <m:mrow>
            <m:mi>k</m:mi>
         </m:mrow>
      </m:msub>
   </m:mrow>
   <m:mo class="MathClass-close">)</m:mo>
</m:mrow>
</m:math>
</inline-formula> and <inline-formula>
<m:math name="1471-2105-12-496-i30" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mi>I</m:mi>
<m:mrow>
   <m:mo class="MathClass-open">(</m:mo>
   <m:mrow>
      <m:msub>
         <m:mrow>
            <m:mover accent="true">
               <m:mrow>
                  <m:mstyle mathvariant="bold">
                     <m:mi mathvariant="normal">x</m:mi>
                  </m:mstyle>
               </m:mrow>
               <m:mo class="MathClass-op">&#771;</m:mo>
            </m:mover>
         </m:mrow>
         <m:mrow>
            <m:mi>k</m:mi>
         </m:mrow>
      </m:msub>
   </m:mrow>
   <m:mo class="MathClass-close">)</m:mo>
</m:mrow>
</m:math>
</inline-formula>, while &#916;<it>&#952; </it>stands for the angular displacement from the direction of either 0 or &#960; radians. Evidently, &#916;&#952; determines the quality of the selected SCPs and, thus, the accuracy of the estimated mixing matrices <b>A</b>
<sub>control </sub>and <b>A</b>
<sub>disease</sub>. Setting &#916;&#952; to a small value (e.g., the radian equivalent of 1&#176;) enforces, with overwhelming probability, the selection of feature points that contain one component only. If, however, some component never occurs alone in at least one feature point, the corresponding columns of the mixing matrices may be estimated inaccurately. This problem can be alleviated by increasing the value of &#916;&#952;, in which case the selected feature points need not contain one component only but may instead be composed of one dominant component and one or more components present in small amounts.</p>
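<p>The selection criterion above can be illustrated by a minimal Python sketch that uses the standard library only. It is not the implementation used in this work. The function names <it>analytic_signal</it>, <it>select_scps_complex</it> and <it>select_scps</it> are hypothetical, the Hilbert transform is obtained from a direct discrete Fourier transform (adequate only for short sequences), and skipping points whose real or imaginary part has zero norm is an assumption made here because the direction is undefined in that case.</p>

```python
import cmath
import math


def analytic_signal(x):
    """Return x + 1j*H(x) for a real sequence x, using a direct O(N^2) DFT."""
    n = len(x)
    if n == 0:
        return []
    # Forward DFT of the real input.
    spectrum = [
        sum(x[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n))
        for f in range(n)
    ]
    # One-sided weights: keep DC (and Nyquist when n is even), double the
    # positive frequencies, zero the negative frequencies.
    weights = [0.0] * n
    weights[0] = 1.0
    if n % 2 == 0:
        weights[n // 2] = 1.0
        last_positive = n // 2
    else:
        last_positive = (n + 1) // 2
    for f in range(1, last_positive):
        weights[f] = 2.0
    # Inverse DFT of the one-sided spectrum gives the analytic signal.
    return [
        sum(weights[f] * spectrum[f] * cmath.exp(2j * math.pi * f * t / n)
            for f in range(n)) / n
        for t in range(n)
    ]


def select_scps_complex(x_tilde, delta_theta_deg=1.0):
    """Return indices k where Re and Im of column k are nearly collinear.

    x_tilde is a list of N complex rows (one per mixture sample), each of
    length K. Column k is kept when |R^T I| / (||R|| ||I||) is at least
    cos(delta_theta), with delta_theta given in degrees.
    """
    threshold = math.cos(math.radians(delta_theta_deg))
    selected = []
    for k in range(len(x_tilde[0])):
        re = [row[k].real for row in x_tilde]
        im = [row[k].imag for row in x_tilde]
        norm_re = math.sqrt(sum(v * v for v in re))
        norm_im = math.sqrt(sum(v * v for v in im))
        if norm_re == 0.0 or norm_im == 0.0:
            continue  # assumption: direction undefined, so the point is skipped
        cosine = abs(sum(r * i for r, i in zip(re, im))) / (norm_re * norm_im)
        if cosine >= threshold:
            selected.append(k)
    return selected


def select_scps(x, delta_theta_deg=1.0):
    """Apply the analytic continuation to each mixture sample, then select."""
    return select_scps_complex([analytic_signal(row) for row in x],
                               delta_theta_deg)
```

<p>In this sketch each row of the input plays the role of one mixture sample <b>x</b><it><sub>n</sub></it>, and the returned indices are the candidate SCPs that would subsequently be clustered. Taking the absolute value of the inner product accepts both the 0 and the &#960; direction.</p>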
<p>Thus, in practice, &#916;&#952; needs to be selected through cross-validation. In the experiments described in sections 1.4 to 1.7, &#916;&#952; has been selected from the set of radian equivalents of {1&#176;, 3&#176;, 5&#176;}, together with the postulated number of components <it>M </it>and the regularization parameter related to the sparseness constraint imposed on <b>S</b>
<sub>control </sub>and <b>S</b>
<sub>disease </sub>(see eq. (3) below). Hierarchical clustering, implemented by the MATLAB clusterdata command (with a '<it>cosine</it>' distance metric and '<it>complete</it>' linkage option), has been used to cluster the set of <it>J </it>selected single-component feature points. The number of clusters has been set in advance to equal the postulated number of components <it>M</it>. The cluster centres represent the estimated concentration vectors <inline-formula>
<m:math name="1471-2105-12-496-i31" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:msubsup>
   <m:mrow>
      <m:mfenced separators="" open="{" close="}">
         <m:mrow>
            <m:msub>
               <m:mrow>
                  <m:mstyle mathvariant="bold">
                     <m:mi mathvariant="normal">a</m:mi>
                  </m:mstyle>
               </m:mrow>
               <m:mrow>
                  <m:mi>m</m:mi>
               </m:mrow>
            </m:msub>
            <m:mo class="MathClass-rel">&#8712;</m:mo>
            <m:msubsup>
               <m:mrow>
                  <m:mi>&#8477;</m:mi>
               </m:mrow>
               <m:mrow>
                  <m:mo class="MathClass-bin">+</m:mo>
               </m:mrow>
               <m:mrow>
                  <m:mn>2</m:mn>
               </m:mrow>
            </m:msubsup>
         </m:mrow>
      </m:mfenced>
   </m:mrow>
   <m:mrow>
      <m:mi>m</m:mi>
      <m:mo class="MathClass-rel">=</m:mo>
      <m:mn>1</m:mn>
   </m:mrow>
   <m:mrow>
      <m:mi>M</m:mi>
   </m:mrow>
</m:msubsup>
</m:math>
</inline-formula>. It is also possible to use other clustering methods, such as <it>k</it>-means, as an alternative to hierarchical clustering. The <it>k</it>-means objective, however, is non-convex, and its result depends strongly on the initial values chosen for the cluster centroids. By contrast, hierarchical clustering produces a repeatable result, i.e. for a given set of SCPs it yields the same estimate of the mixing matrix in each run. Since the number of selected SCPs is modest, the computational complexity of the hierarchical clustering approach is not too high. That is why hierarchical clustering is used to estimate the mixing matrices in (2a) and (2b). After the mixing matrices are estimated, estimation of the component matrices proceeds by minimizing the sparseness-constrained cost functions:</p>
<p>
<display-formula id="M3a">
<m:math name="1471-2105-12-496-i32" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mrow>
   <m:msub>
      <m:mrow>
         <m:mover accent="true">
            <m:mrow>
               <m:mstyle mathvariant="bold">
                  <m:mi mathvariant="normal">S</m:mi>
               </m:mstyle>
            </m:mrow>
            <m:mo class="MathClass-op">^</m:mo>
         </m:mover>
      </m:mrow>
      <m:mrow>
         <m:mstyle class="text">
            <m:mtext class="textsf" mathvariant="sans-serif">control</m:mtext>
         </m:mstyle>
      </m:mrow>
   </m:msub>
   <m:mo class="MathClass-rel">=</m:mo>
   <m:munder class="msub">
      <m:mrow>
         <m:mo class="qopname">min</m:mo>
      </m:mrow>
      <m:mrow>
         <m:mstyle class="text">
            <m:mtext class="textsf" mathvariant="sans-serif">S</m:mtext>
         </m:mstyle>
      </m:mrow>
   </m:munder>
   <m:mfenced separators="" open="{" close="}">
      <m:mrow>
         <m:mfrac>
            <m:mrow>
               <m:mn>1</m:mn>
            </m:mrow>
            <m:mrow>
               <m:mn>2</m:mn>
            </m:mrow>
         </m:mfrac>
         <m:msubsup>
            <m:mrow>
               <m:mfenced separators="" open="&#8741;" close="&#8741;">
                  <m:mrow>
                     <m:msub>
                        <m:mrow>
                           <m:mover accent="true">
                              <m:mrow>
                                 <m:mstyle mathvariant="bold">
                                    <m:mi mathvariant="normal">A</m:mi>
                                 </m:mstyle>
                              </m:mrow>
                              <m:mo class="MathClass-op">^</m:mo>
                           </m:mover>
                        </m:mrow>
                        <m:mrow>
                           <m:mstyle class="text">
                              <m:mtext class="textsf" mathvariant="sans-serif">control</m:mtext>
                           </m:mstyle>
                        </m:mrow>
                     </m:msub>
                     <m:mstyle mathvariant="bold">
                        <m:mi mathvariant="normal">S</m:mi>
                     </m:mstyle>
                     <m:mo class="MathClass-bin">-</m:mo>
                     <m:mfenced separators="" open="[" close="]">
                        <m:mrow>
                           <m:mtable equalrows="false" columnlines="none none none none none none none none none none none none none none none none none none none" equalcolumns="false" class="array">
                              <m:mtr>
                                 <m:mtd class="array" columnalign="center">
                                    <m:msub>
                                       <m:mrow>
                                          <m:mstyle mathvariant="bold">
                                             <m:mi mathvariant="normal">x</m:mi>
                                          </m:mstyle>
                                       </m:mrow>
                                       <m:mrow>
                                          <m:mstyle class="text">
                                             <m:mtext class="textsf" mathvariant="sans-serif">control</m:mtext>
                                          </m:mstyle>
                                       </m:mrow>
                                    </m:msub>
                                 </m:mtd>
                              </m:mtr>
                              <m:mtr>
                                 <m:mtd class="array" columnalign="center">
                                    <m:mstyle mathvariant="bold">
                                       <m:mi mathvariant="normal">x</m:mi>
                                    </m:mstyle>
                                 </m:mtd>
                              </m:mtr>
                              <m:mtr>
                                 <m:mtd class="array" columnalign="center"/>
                              </m:mtr>
                           </m:mtable>
                        </m:mrow>
                     </m:mfenced>
                  </m:mrow>
               </m:mfenced>
            </m:mrow>
            <m:mrow>
               <m:mi>F</m:mi>
            </m:mrow>
            <m:mrow>
               <m:mn>2</m:mn>
            </m:mrow>
         </m:msubsup>
         <m:mo class="MathClass-bin">+</m:mo>
         <m:mi>&#955;</m:mi>
         <m:msub>
            <m:mrow>
               <m:mfenced separators="" open="&#8741;" close="&#8741;">
                  <m:mrow>
                     <m:mstyle mathvariant="bold">
                        <m:mi mathvariant="normal">S</m:mi>
                     </m:mstyle>
                  </m:mrow>
               </m:mfenced>
            </m:mrow>
            <m:mrow>
               <m:mn>1</m:mn>
            </m:mrow>
         </m:msub>
      </m:mrow>
   </m:mfenced>
</m:mrow>
</m:math>
</display-formula>
</p>
<p>
<display-formula id="M3b">
<m:math name="1471-2105-12-496-i33" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mrow>
   <m:msub>
      <m:mrow>
         <m:mover accent="true">
            <m:mrow>
               <m:mstyle mathvariant="bold">
                  <m:mi mathvariant="normal">S</m:mi>
               </m:mstyle>
            </m:mrow>
            <m:mo class="MathClass-op">^</m:mo>
         </m:mover>
      </m:mrow>
      <m:mrow>
         <m:mstyle class="text">
            <m:mtext class="textsf" mathvariant="sans-serif">disease</m:mtext>
         </m:mstyle>
      </m:mrow>
   </m:msub>
   <m:mo class="MathClass-rel">=</m:mo>
   <m:munder class="msub">
      <m:mrow>
         <m:mo class="qopname">min</m:mo>
      </m:mrow>
      <m:mrow>
         <m:mstyle class="text">
            <m:mtext class="textsf" mathvariant="sans-serif">S</m:mtext>
         </m:mstyle>
      </m:mrow>
   </m:munder>
   <m:mfenced separators="" open="{" close="}">
      <m:mrow>
         <m:mfrac>
            <m:mrow>
               <m:mn>1</m:mn>
            </m:mrow>
            <m:mrow>
               <m:mn>2</m:mn>
            </m:mrow>
         </m:mfrac>
         <m:msubsup>
            <m:mrow>
               <m:mfenced separators="" open="&#8741;" close="&#8741;">
                  <m:mrow>
                     <m:msub>
                        <m:mrow>
                           <m:mover accent="true">
                              <m:mrow>
                                 <m:mstyle mathvariant="bold">
                                    <m:mi mathvariant="normal">A</m:mi>
                                 </m:mstyle>
                              </m:mrow>
                              <m:mo class="MathClass-op">^</m:mo>
                           </m:mover>
                        </m:mrow>
                        <m:mrow>
                           <m:mstyle class="text">
                              <m:mtext class="textsf" mathvariant="sans-serif">disease</m:mtext>
                           </m:mstyle>
                        </m:mrow>
                     </m:msub>
                     <m:mstyle mathvariant="bold">
                        <m:mi mathvariant="normal">S</m:mi>
                     </m:mstyle>
                     <m:mo class="MathClass-bin">-</m:mo>
                     <m:mfenced separators="" open="[" close="]">
                        <m:mrow>
                           <m:mtable equalrows="false" columnlines="none none none none none none none none none none none none none none none none none none none" equalcolumns="false" class="array">
                              <m:mtr>
                                 <m:mtd class="array" columnalign="center">
                                    <m:msub>
                                       <m:mrow>
                                          <m:mstyle mathvariant="bold">
                                             <m:mi mathvariant="normal">x</m:mi>
                                          </m:mstyle>
                                       </m:mrow>
                                       <m:mrow>
                                          <m:mstyle class="text">
                                             <m:mtext class="textsf" mathvariant="sans-serif">disease</m:mtext>
                                          </m:mstyle>
                                       </m:mrow>
                                    </m:msub>
                                 </m:mtd>
                              </m:mtr>
                              <m:mtr>
                                 <m:mtd class="array" columnalign="center">
                                    <m:mstyle mathvariant="bold">
                                       <m:mi mathvariant="normal">x</m:mi>
                                    </m:mstyle>
                                 </m:mtd>
                              </m:mtr>
                              <m:mtr>
                                 <m:mtd class="array" columnalign="center"/>
                              </m:mtr>
                           </m:mtable>
                        </m:mrow>
                     </m:mfenced>
                  </m:mrow>
               </m:mfenced>
            </m:mrow>
            <m:mrow>
               <m:mi>F</m:mi>
            </m:mrow>
            <m:mrow>
               <m:mn>2</m:mn>
            </m:mrow>
         </m:msubsup>
         <m:mo class="MathClass-bin">+</m:mo>
         <m:mi>&#955;</m:mi>
         <m:msub>
            <m:mrow>
               <m:mfenced separators="" open="&#8741;" close="&#8741;">
                  <m:mrow>
                     <m:mstyle mathvariant="bold">
                        <m:mi mathvariant="normal">S</m:mi>
                     </m:mstyle>
                  </m:mrow>
               </m:mfenced>
            </m:mrow>
            <m:mrow>
               <m:mn>1</m:mn>
            </m:mrow>
         </m:msub>
      </m:mrow>
   </m:mfenced>
</m:mrow>
</m:math>
</display-formula>
</p>
<p>where the hat sign denotes estimates of the model variables <b>A</b>
<sub>control</sub>/<b>A</b>
<sub>disease</sub> and <b>S</b>
<sub>control</sub>/<b>S</b>
<sub>disease</sub>. Problems (3) amount to sparseness-constrained solutions of underdetermined systems of linear equations. For a decomposition of gene expression profiles, a non-negativity constraint is additionally imposed on <b>S</b>: <b>S </b>&#8805; <b>0</b>. Problem (3) can be solved by the LASSO algorithm <abbrgrp>
<abbr bid="B36">36</abbr>
</abbrgrp> or by another solver for underdetermined systems of linear equations <abbrgrp>
<abbr bid="B37">37</abbr>
</abbrgrp>. Here, for problem (3), we have used an iterative shrinkage thresholding (IST) type of method <abbrgrp>
<abbr bid="B38">38</abbr>
</abbrgrp>, with MATLAB code available at <abbrgrp>
<abbr bid="B39">39</abbr>
</abbrgrp>. This approach has been shown to be fast, and it can easily be implemented in batch mode as in (3a)/(3b), i.e. by solving all <it>K </it>systems of equations simultaneously. In relation to standard IST methods, the method <abbrgrp>
<abbr bid="B38">38</abbr>
</abbrgrp> has a guaranteed better global rate of convergence. In addition, over the course of the iterations it shrinks to zero the small nonzero elements of <b>S </b>that are influenced by noise. This prevents them from determining the level of sparseness of <b>S</b>. As discussed in <abbrgrp>
<abbr bid="B6">6</abbr>
</abbrgrp>, this shrinking operation is important in preventing the selection of a less sparse <b>S </b>over the sparse version of <b>S</b>. With the non-negativity constraint <b>S </b>&#8805; <b>0</b>, problem (3) becomes a quadratic program. Thus, we have used gradient descent with projection onto the non-negative orthant: max(<b>0</b>,<b>S</b>). The sparsity of the solution is controlled by the parameter &#955;. There is a maximal value of &#955; (denoted by &#955;<sub>max </sub>here) above which the solution of problems (3) is maximally sparse, i.e. equal to zero. Thus, in the experiments reported in sections 1.5 to 1.7, the value of &#955; has been selected by cross-validation (together with &#916;&#952; and <it>M</it>) with respect to &#955;<sub>max </sub>as: &#955;&#8712;{10<sup>-2</sup>&#183;&#955;<sub>max</sub>, 10<sup>-4</sup>&#183;&#955;<sub>max</sub>, 10<sup>-6</sup>&#183;&#955;<sub>max</sub>}. We conclude this section with the observation that the situation suggested in <abbrgrp>
<abbr bid="B6">6</abbr>
</abbrgrp>: <b>X </b>= <b>AS </b>= <b>A</b>
<it>
<sup>pseu</sup>
</it>
<b>S</b>
<it>
<sup>pseu</sup>
</it>, where (<b>A</b>
<it>
<sup>pseu</sup>
</it>, <b>S</b>
<it>
<sup>pseu</sup>
</it>) represents alternative factorization of <b>X </b>such that <b>S</b>
<it>
<sup>pseu </sup>
</it>would be less sparse than <b>S</b>, cannot occur during minimization of (3). That is due to the IST algorithm <abbrgrp>
<abbr bid="B38">38</abbr>
</abbrgrp> as well as to the accurate estimation of the mixing matrices that is enabled by clustering the set of SCPs. First, this is a consequence of the fact that the shrinking operation used by the IST algorithm <abbrgrp>
<abbr bid="B38">38</abbr>
</abbrgrp> imposes sparseness constraint of the type given by eq.(7) in <abbrgrp>
<abbr bid="B6">6</abbr>
</abbrgrp>:</p>
<p>
<display-formula>
<m:math name="1471-2105-12-496-i34" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mrow>
   <m:mn>0</m:mn>
   <m:mo class="MathClass-rel">&#8804;</m:mo>
   <m:msub>
      <m:mrow>
         <m:mi>&#963;</m:mi>
      </m:mrow>
      <m:mrow>
         <m:mi>&#964;</m:mi>
      </m:mrow>
   </m:msub>
   <m:mrow>
      <m:mo class="MathClass-open">(</m:mo>
      <m:mrow>
         <m:msub>
            <m:mrow>
               <m:mstyle mathvariant="bold">
                  <m:mi mathvariant="normal">s</m:mi>
               </m:mstyle>
            </m:mrow>
            <m:mrow>
               <m:mi>k</m:mi>
            </m:mrow>
         </m:msub>
      </m:mrow>
      <m:mo class="MathClass-close">)</m:mo>
   </m:mrow>
   <m:mo class="MathClass-rel">=</m:mo>
   <m:mfrac>
      <m:mrow>
         <m:mstyle class="text">
            <m:mtext class="textsf" mathvariant="sans-serif">number&#160;of&#160;elements&#160;of&#160;</m:mtext>
         </m:mstyle>
         <m:msub>
            <m:mrow>
               <m:mstyle mathvariant="bold">
                  <m:mi mathvariant="normal">s</m:mi>
               </m:mstyle>
            </m:mrow>
            <m:mrow>
               <m:mi>k</m:mi>
            </m:mrow>
         </m:msub>
         <m:mo class="MathClass-rel">&#8804;</m:mo>
         <m:mi>&#964;</m:mi>
         <m:mo class="MathClass-bin">&#8901;</m:mo>
         <m:msubsup>
            <m:mrow>
               <m:mstyle mathvariant="bold">
                  <m:mi mathvariant="normal">s</m:mi>
               </m:mstyle>
            </m:mrow>
            <m:mrow>
               <m:mi>k</m:mi>
            </m:mrow>
            <m:mrow>
               <m:mo class="qopname">max</m:mo>
            </m:mrow>
         </m:msubsup>
      </m:mrow>
      <m:mrow>
         <m:mstyle class="text">
            <m:mtext class="textsf" mathvariant="sans-serif">number&#160;of&#160;elements&#160;of&#160;</m:mtext>
         </m:mstyle>
         <m:msub>
            <m:mrow>
               <m:mstyle mathvariant="bold">
                  <m:mi mathvariant="normal">s</m:mi>
               </m:mstyle>
            </m:mrow>
            <m:mrow>
               <m:mi>k</m:mi>
            </m:mrow>
         </m:msub>
      </m:mrow>
   </m:mfrac>
   <m:mo class="MathClass-rel">&#8804;</m:mo>
   <m:mn>1</m:mn>
   <m:mo class="MathClass-punc">,</m:mo>
   <m:mi>&#964;</m:mi>
   <m:mo class="MathClass-rel">&#8712;</m:mo>
   <m:mrow>
      <m:mo class="MathClass-open">[</m:mo>
      <m:mrow>
         <m:mn>0</m:mn>
         <m:mo class="MathClass-punc">,</m:mo>
         <m:mn>1</m:mn>
      </m:mrow>
      <m:mo class="MathClass-close">]</m:mo>
   </m:mrow>
   <m:mo class="MathClass-punc">,</m:mo>
</m:mrow>
</m:math>
</display-formula>
</p>
<p>i.e. small nonzero elements of <b>s</b>
<it>
<sub>k </sub>
</it>are set to zero. This prevents selection of the less sparse <b>S</b>
<it>
<sup>pseu </sup>
</it>over the sparser <b>S</b>. Second, the SCA method used here is a two-stage method in which <b>A </b>is estimated accurately by clustering on a set of SCPs. This, in addition to the sparseness measure discussed above, prevents the estimate of <b>S </b>from deviating significantly from the true value, because the estimate of <b>A </b>is held fixed while <b>S </b>is being estimated by the IST algorithm. As opposed to the case when <b>A </b>and <b>S </b>are estimated simultaneously, as in <abbrgrp>
<abbr bid="B6">6</abbr>
</abbrgrp>, the estimate of <b>A </b>cannot now be adjusted by the algorithm to some value <b>A</b>
<it>
<sup>pseu </sup>
</it>that will counteract changes in <b>S</b>. Hence, selecting <b>S</b>
<it>
<sup>pseu </sup>
</it>would increase the data fidelity term in the cost function. Thus, the situation suggested in <abbrgrp>
<abbr bid="B6">6</abbr>
</abbrgrp>: <b>X </b>= <b>AS </b>= <b>A</b>
<it>
<sup>pseu</sup>
</it>
<b>S</b>
<it>
<sup>pseu </sup>
</it>cannot occur. The proposed two-stage SCA approach to feature extraction/component selection is presented in concise form in Table <tblr tid="T1">1</tblr>. The MATLAB code is provided with the paper in the Additional Material Files section as Additional File <supplr sid="S1">1</supplr>.</p>
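<p>The interplay between the shrinking step, the non-negativity projection and the fixed estimate of <b>A </b>can be illustrated by a minimal sketch for a single sample. It assumes that problem (3) has the form 0.5&#183;||<b>As </b>- <b>x</b>||<sup>2 </sup>+ &#955;&#183;sum(<b>s</b>) subject to <b>s </b>&#8805; <b>0</b>; this form, the function names, the step-size rule and the toy values are assumptions made for illustration and are not taken from the MATLAB code in Additional File 1. Under this assumption &#955;<sub>max </sub>equals the largest entry of <b>A</b><sup>T</sup><b>x</b>.</p>

```python
# Sketch only: non-negative iterative soft thresholding with the mixing matrix held fixed.
# Assumed cost (hypothetical form of problem (3)): 0.5*||A s - x||^2 + lam*sum(s), s >= 0.


def matvec(A, v):
    """Return A v for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def transpose(A):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]


def lambda_max(A, x):
    """Smallest lam for which s = 0 solves the assumed non-negative problem."""
    return max(matvec(transpose(A), x))


def nonneg_ist(A, x, lam, n_iter=500):
    """Estimate s for a fixed A: gradient step, shrink by step*lam, project with max(0, .)."""
    At = transpose(A)
    # 1 / (squared Frobenius norm) never exceeds 1 / (largest eigenvalue of A^T A),
    # so this step size is a conservative but safe choice.
    step = 1.0 / sum(a * a for row in A for a in row)
    s = [0.0] * len(A[0])
    for _ in range(n_iter):
        residual = [r - xi for r, xi in zip(matvec(A, s), x)]
        grad = matvec(At, residual)
        s = [max(0.0, si - step * (gi + lam)) for si, gi in zip(s, grad)]
    return s


def sparseness(s, tau):
    """Fraction of entries of s that do not exceed tau times the largest entry."""
    s_max = max(s)
    return sum(1 for v in s if tau * s_max >= v) / len(s)


if __name__ == "__main__":
    A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy mixing matrix, kept fixed
    x = [2.0, 0.0, 2.0]                       # toy sample generated from s = [2, 0]
    lam = 1e-2 * lambda_max(A, x)             # one of the three cross-validated levels
    s = nonneg_ist(A, x, lam)
    print(s, sparseness(s, tau=0.1))
```

<p>Because <b>A </b>is not updated inside the loop, any move of <b>s </b>away from the sparse solution can only increase the data fidelity term, which is the argument made above against the alternative factorization.</p>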
<tbl id="T1"><title><p>Table 1</p></title><caption><p>A mixture model with a reference-based algorithm for feature extraction/component selection</p></caption><tblbdy cols="1">
      <r>
         <c ca="left">
            <p><b>Inputs. <inline-formula><m:math name="1471-2105-12-496-i35" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:msubsup><m:mrow><m:mfenced separators="" open="{" close="}"><m:mrow><m:msub><m:mrow><m:mstyle mathvariant="bold"><m:mi mathvariant="normal">x</m:mi></m:mstyle></m:mrow><m:mrow><m:mi>n</m:mi></m:mrow></m:msub><m:mo class="MathClass-rel">&#8712;</m:mo><m:msup><m:mrow><m:mi>&#8477;</m:mi></m:mrow><m:mrow><m:mi>K</m:mi></m:mrow></m:msup><m:mo class="MathClass-punc">,</m:mo><m:msub><m:mrow><m:mi>y</m:mi></m:mrow><m:mrow><m:mi>n</m:mi></m:mrow></m:msub><m:mo class="MathClass-rel">&#8712;</m:mo><m:mrow><m:mo class="MathClass-open">{</m:mo><m:mrow><m:mn>1</m:mn><m:mo class="MathClass-punc">,</m:mo><m:mo class="MathClass-bin">-</m:mo><m:mn>1</m:mn></m:mrow><m:mo class="MathClass-close">}</m:mo></m:mrow></m:mrow></m:mfenced></m:mrow><m:mrow><m:mi>n</m:mi><m:mo class="MathClass-rel">=</m:mo><m:mn>1</m:mn></m:mrow><m:mrow><m:mi>N</m:mi></m:mrow></m:msubsup></m:math></inline-formula></b> samples and sample labels, where <it>K </it>is the number of feature points (<it>m</it>/<it>z </it>ratios or genes).</p>
         </c>
      </r>
      <r>
         <c indent="1" ca="left">
            <p><b>x</b><sub>control </sub>&#8712; &#8477;<it><sup>K </sup></it>and <b>x</b><sub>disease </sub>&#8712; &#8477;<it><sup>K </sup></it>representing control and disease (case) groups of samples.</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p><b>Nested two-fold cross-validation</b>. Parameters: single component point (SCP) selection threshold &#916;&#952; &#8712; {1&#176;, 3&#176;, 5&#176;} (expressed in radian equivalents); regularization constant &#955; &#8712; {10<sup>-2</sup>&#955;<sub>max</sub>, 10<sup>-4</sup>&#955;<sub>max</sub>, 10<sup>-6</sup>&#955;<sub>max</sub>}; number of components <it>M </it>&#8712; {2, 3, 4, 5}; parameters of the selected classifier.</p>
         </c>
      </r>
      <r>
         <c indent="1" ca="left">
            <p><b>Components selection from mixture samples</b>.</p>
         </c>
      </r>
      <r>
         <c indent="2" ca="left">
            <p><b>1. <inline-formula><m:math name="1471-2105-12-496-i36" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mo class="MathClass-op">&#8704;</m:mo><m:mstyle mathvariant="bold"><m:mi mathvariant="normal">x</m:mi></m:mstyle><m:mo class="MathClass-rel">&#8712;</m:mo><m:msubsup><m:mrow><m:mfenced separators="" open="{" close="}"><m:mrow><m:msub><m:mrow><m:mstyle mathvariant="bold"><m:mi mathvariant="normal">x</m:mi></m:mstyle></m:mrow><m:mrow><m:mi>n</m:mi></m:mrow></m:msub><m:mo class="MathClass-rel">&#8712;</m:mo><m:msup><m:mrow><m:mi>&#8477;</m:mi></m:mrow><m:mrow><m:mi>K</m:mi></m:mrow></m:msup></m:mrow></m:mfenced></m:mrow><m:mrow><m:mi>n</m:mi><m:mo class="MathClass-rel">=</m:mo><m:mn>1</m:mn></m:mrow><m:mrow><m:mi>N</m:mi></m:mrow></m:msubsup></m:math></inline-formula></b> form the linear mixture models (LMMs) (2a) and (2b).</p>
         </c>
      </r>
      <r>
         <c indent="2" ca="left">
            <p><b>2</b>. For LMMs (2a)/(2b) select a set of single component points for a given &#916;&#952;.</p>
         </c>
      </r>
      <r>
         <c indent="2" ca="left">
            <p><b>3</b>. On the sets of SCPs use hierarchical clustering (other clustering methods can also be used) to estimate the mixing matrices <b>A</b><sub>control </sub>and <b>A</b><sub>disease </sub>for a given <it>M</it>.</p>
         </c>
      </r>
      <r>
         <c indent="2" ca="left">
            <p><b>4</b>. Estimate source matrices <b>S</b><sub>control </sub>and <b>S</b><sub>disease </sub>by solving (3a) and (3b) respectively for a given regularization parameter &#955;.</p>
         </c>
      </r>
      <r>
         <c indent="2" ca="left">
            <p><b>5</b>. Use minimal and maximal mixing angles estimated from mixing matrices <b>A</b><sub>control </sub>and <b>A</b><sub>disease </sub>to select, following the logic illustrated in Fig. 2a and Fig. 2b, disease and control specific components: <inline-formula><m:math name="1471-2105-12-496-i37" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:msubsup><m:mrow><m:mstyle mathvariant="bold"><m:mi mathvariant="normal">s</m:mi></m:mstyle></m:mrow><m:mrow><m:mstyle class="text"><m:mtext class="textsf" mathvariant="sans-serif">control&#160;ref</m:mtext></m:mstyle><m:mstyle class="text"><m:mtext class="textsf" mathvariant="sans-serif">.;</m:mtext></m:mstyle><m:mi>n</m:mi></m:mrow><m:mrow><m:mstyle class="text"><m:mtext class="textsf" mathvariant="sans-serif">disease</m:mtext></m:mstyle></m:mrow></m:msubsup></m:math></inline-formula>, <inline-formula><m:math name="1471-2105-12-496-i38" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:msubsup><m:mrow><m:mstyle mathvariant="bold"><m:mi mathvariant="normal">s</m:mi></m:mstyle></m:mrow><m:mrow><m:mstyle class="text"><m:mtext class="textsf" mathvariant="sans-serif">control&#160;ref</m:mtext></m:mstyle><m:mstyle class="text"><m:mtext class="textsf" mathvariant="sans-serif">.;</m:mtext></m:mstyle><m:mi>n</m:mi></m:mrow><m:mrow><m:mstyle class="text"><m:mtext class="textsf" mathvariant="sans-serif">control</m:mtext></m:mstyle></m:mrow></m:msubsup></m:math></inline-formula>, <inline-formula><m:math name="1471-2105-12-496-i39" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:msubsup><m:mrow><m:mstyle mathvariant="bold"><m:mi mathvariant="normal">s</m:mi></m:mstyle></m:mrow><m:mrow><m:mstyle class="text"><m:mtext class="textsf" mathvariant="sans-serif">disease&#160;ref</m:mtext></m:mstyle><m:mstyle class="text"><m:mtext class="textsf" mathvariant="sans-serif">.;</m:mtext></m:mstyle><m:mi>n</m:mi></m:mrow><m:mrow><m:mstyle class="text"><m:mtext class="textsf" mathvariant="sans-serif">control</m:mtext></m:mstyle></m:mrow></m:msubsup></m:math></inline-formula> and <inline-formula><m:math name="1471-2105-12-496-i40" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:msubsup><m:mrow><m:mstyle mathvariant="bold"><m:mi mathvariant="normal">s</m:mi></m:mstyle></m:mrow><m:mrow><m:mstyle class="text"><m:mtext class="textsf" mathvariant="sans-serif">disease&#160;ref</m:mtext></m:mstyle><m:mstyle class="text"><m:mtext class="textsf" mathvariant="sans-serif">.;</m:mtext></m:mstyle><m:mi>n</m:mi></m:mrow><m:mrow><m:mstyle class="text"><m:mtext class="textsf" mathvariant="sans-serif">disease</m:mtext></m:mstyle></m:mrow></m:msubsup></m:math></inline-formula>.</p>
         </c>
      </r>
      <r>
         <c indent="1" ca="left">
            <p><b>End of component selection</b>.</p>
         </c>
      </r>
      <r>
         <c ca="left">
            <p><b>End of nested two-fold cross-validation</b>.</p>
         </c>
      </r>
   </tblbdy></tbl>
<suppl id="S1">
<title>
<p>Additional file 1</p>
</title>
<text>
<p>
<b>MATLAB code implementing the proposed feature extraction/component selection method</b>.</p>
</text>
<file name="1471-2105-12-496-S1.ZIP">
   <p>Click here for file</p>
</file>
</suppl>
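<p>The parameter search listed in Table 1 can be sketched as an enumeration of all combinations of &#916;&#952;, &#955; and <it>M</it>. The sketch below is illustrative only: the function names are hypothetical, the classifier parameters are omitted, and the <monospace>score</monospace> callable is a placeholder for the inner cross-validation accuracy of the full pipeline rather than part of the posted code.</p>

```python
import itertools
import math

# Candidate values as listed in Table 1.
DELTA_THETA_DEG = (1.0, 3.0, 5.0)      # SCP selection threshold, in degrees
LAMBDA_FRACTIONS = (1e-2, 1e-4, 1e-6)  # regularization constant as a multiple of lambda_max
NUM_COMPONENTS = (2, 3, 4, 5)          # postulated number of components M


def parameter_grid(lam_max):
    """Yield (delta_theta in radians, lambda, M) for every combination of the three lists."""
    for deg, frac, m in itertools.product(DELTA_THETA_DEG, LAMBDA_FRACTIONS, NUM_COMPONENTS):
        yield math.radians(deg), frac * lam_max, m


def select_best(lam_max, score):
    """Return the combination with the highest score.

    `score` is a hypothetical callable (delta_theta, lam, m) -> float that stands in
    for the accuracy obtained on the inner cross-validation folds.
    """
    return max(parameter_grid(lam_max), key=lambda params: score(*params))
```

<p>With three values of &#916;&#952;, three values of &#955; and four values of <it>M </it>the grid contains 36 combinations, each of which requires one run of steps 1 to 5.</p>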
</sec>
</sec>
<sec>
<st>
<p>Results and Discussion</p>
</st>
<p>This section presents the model validation procedure. It is demonstrated how an increased number of postulated components retains, or slightly improves, prediction accuracy when the concentration variability of the features across the sample population is significant. Moreover, an increased number of postulated components yields disease and control specific components that are used for classification with a smaller number of features. This is in agreement with the principle of parsimony, which states that a less complex solution ought to be preferred over a more complex one. The proposed method for feature extraction/component selection is also applied to the prediction of ovarian, prostate and colon cancers from three well-studied datasets. Prediction accuracy (sensitivity and specificity with standard deviations) is estimated by 100 independent two-fold cross-validations. The proposed SCA component selection method compares favourably against state-of-the-art predictors tested on the same datasets, including our implementation of the methods proposed in <abbrgrp>
<abbr bid="B1">1</abbr>
<abbr bid="B2">2</abbr>
</abbrgrp>. Regarding our implementation of a predictive matrix factorization method <abbrgrp>
<abbr bid="B1">1</abbr>
</abbrgrp>, we have used the MATLAB <monospace>fminsearch</monospace> function to minimize the negative value of the target function suggested in <abbrgrp>
<abbr bid="B1">1</abbr>
</abbrgrp> while selecting the threshold vector. We set <monospace>TolFun</monospace> to 10<sup>-10</sup>, <monospace>TolX</monospace> to 10<sup>-10 </sup>and <monospace>MaxFunEvals</monospace> to 10,000. The initial value of the two-dimensional threshold vector was set to [0 0]<sup>T</sup>. Regarding the gene discovery method proposed in <abbrgrp>
<abbr bid="B2">2</abbr>
</abbrgrp>, we have cross-validated three values of the threshold <it>c</it>
<sub>u </sub>&#8712; {2, 2.5, 3.0} (<it>c</it>
<sub>l </sub>is set automatically as <it>c</it>
<sub>l </sub>= 1/<it>c</it>
<sub>u</sub>). The best result is presented in section 1.7. Regarding the comparison of the proposed component selection method against the many methods in sections 1.5 to 1.7, our intention has been to describe the methods briefly and to provide a fair comparison, given that code for the compared methods was not available to us. That was the main reason for choosing well-known datasets such as those in sections 1.5 to 1.7, since a rich list of published results exists for them. We are aware that results for many other methods were obtained under different cross-validation settings. Our reasoning is therefore that a fair comparison is possible as long as the results to be compared were obtained on the same datasets under conditions that are less favourable to the method proposed here. That is why we have chosen to perform two-fold cross-validation, since it is known to yield the least optimistic result. Thus, if such results compare favourably against those obtained under milder (ten- and three-fold) cross-validation settings, it can be concluded that the proposed feature extraction/component selection method represents a contribution to the field. As opposed to the two-fold cross-validation applied here, cross-validation details for many cited results were not specified; sometimes ten-fold or three-fold cross-validation was performed. Hence, we believe that the performance assessment of the proposed component selection method is more realistic than that of the majority of methods cited in the comparative analysis. For each of the three types of cancer, three classifiers were trained on four sets of extracted components: <inline-formula>
<m:math name="1471-2105-12-496-i41" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:msubsup>
   <m:mrow>
      <m:mfenced separators="" open="{" close="}">
         <m:mrow>
            <m:msubsup>
               <m:mrow>
                  <m:mstyle mathvariant="bold">
                     <m:mi mathvariant="normal">s</m:mi>
                  </m:mstyle>
               </m:mrow>
               <m:mrow>
                  <m:mstyle class="text">
                     <m:mtext class="textsf" mathvariant="sans-serif">control&#160;ref</m:mtext>
                  </m:mstyle>
                  <m:mstyle class="text">
                     <m:mtext class="textsf" mathvariant="sans-serif">.;</m:mtext>
                  </m:mstyle>
                  <m:mi>n</m:mi>
               </m:mrow>
               <m:mrow>
                  <m:mstyle class="text">
                     <m:mtext class="textsf" mathvariant="sans-serif">disease</m:mtext>
                  </m:mstyle>
               </m:mrow>
            </m:msubsup>
            <m:mo class="MathClass-punc">,</m:mo>
            <m:msub>
               <m:mrow>
                  <m:mi>y</m:mi>
               </m:mrow>
               <m:mrow>
                  <m:mi>n</m:mi>
               </m:mrow>
            </m:msub>
         </m:mrow>
      </m:mfenced>
   </m:mrow>
   <m:mrow>
      <m:mi>n</m:mi>
      <m:mo class="MathClass-rel">=</m:mo>
      <m:mn>1</m:mn>
   </m:mrow>
   <m:mrow>
      <m:mi>N</m:mi>
   </m:mrow>
</m:msubsup>
</m:math>
</inline-formula>, <inline-formula>
<m:math name="1471-2105-12-496-i42" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:msubsup>
   <m:mrow>
      <m:mfenced separators="" open="{" close="}">
         <m:mrow>
            <m:msubsup>
               <m:mrow>
                  <m:mstyle mathvariant="bold">
                     <m:mi mathvariant="normal">s</m:mi>
                  </m:mstyle>
               </m:mrow>
               <m:mrow>
                  <m:mstyle class="text">
                     <m:mtext class="textsf" mathvariant="sans-serif">control&#160;ref</m:mtext>
                  </m:mstyle>
                  <m:mstyle class="text">
                     <m:mtext class="textsf" mathvariant="sans-serif">.;</m:mtext>
                  </m:mstyle>
                  <m:mi>n</m:mi>
               </m:mrow>
               <m:mrow>
                  <m:mstyle class="text">
                     <m:mtext class="textsf" mathvariant="sans-serif">control</m:mtext>
                  </m:mstyle>
               </m:mrow>
            </m:msubsup>
            <m:mo class="MathClass-punc">,</m:mo>
            <m:msub>
               <m:mrow>
                  <m:mi>y</m:mi>
               </m:mrow>
               <m:mrow>
                  <m:mi>n</m:mi>
               </m:mrow>
            </m:msub>
         </m:mrow>
      </m:mfenced>
   </m:mrow>
   <m:mrow>
      <m:mi>n</m:mi>
      <m:mo class="MathClass-rel">=</m:mo>
      <m:mn>1</m:mn>
   </m:mrow>
   <m:mrow>
      <m:mi>N</m:mi>
   </m:mrow>
</m:msubsup>
</m:math>
</inline-formula>, <inline-formula>
<m:math name="1471-2105-12-496-i43" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:msubsup>
   <m:mrow>
      <m:mfenced separators="" open="{" close="}">
         <m:mrow>
            <m:msubsup>
               <m:mrow>
                  <m:mstyle mathvariant="bold">
                     <m:mi mathvariant="normal">s</m:mi>
                  </m:mstyle>
               </m:mrow>
               <m:mrow>
                  <m:mstyle class="text">
                     <m:mtext class="textsf" mathvariant="sans-serif">disease&#160;ref</m:mtext>
                  </m:mstyle>
                  <m:mstyle class="text">
                     <m:mtext class="textsf" mathvariant="sans-serif">.;</m:mtext>
                  </m:mstyle>
                  <m:mi>n</m:mi>
               </m:mrow>
               <m:mrow>
                  <m:mstyle class="text">
                     <m:mtext class="textsf" mathvariant="sans-serif">control</m:mtext>
                  </m:mstyle>
               </m:mrow>
            </m:msubsup>
            <m:mo class="MathClass-punc">,</m:mo>
            <m:msub>
               <m:mrow>
                  <m:mi>y</m:mi>
               </m:mrow>
               <m:mrow>
                  <m:mi>n</m:mi>
               </m:mrow>
            </m:msub>
         </m:mrow>
      </m:mfenced>
   </m:mrow>
   <m:mrow>
      <m:mi>n</m:mi>
      <m:mo class="MathClass-rel">=</m:mo>
      <m:mn>1</m:mn>
   </m:mrow>
   <m:mrow>
      <m:mi>N</m:mi>
   </m:mrow>
</m:msubsup>
</m:math>
</inline-formula> and <inline-formula>
<m:math name="1471-2105-12-496-i44" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:msubsup>
   <m:mrow>
      <m:mfenced separators="" open="{" close="}">
         <m:mrow>
            <m:msubsup>
               <m:mrow>
                  <m:mstyle mathvariant="bold">
                     <m:mi mathvariant="normal">s</m:mi>
                  </m:mstyle>
               </m:mrow>
               <m:mrow>
                  <m:mstyle class="text">
                     <m:mtext class="textsf" mathvariant="sans-serif">disease&#160;ref</m:mtext>
                  </m:mstyle>
                  <m:mstyle class="text">
                     <m:mtext class="textsf" mathvariant="sans-serif">.;</m:mtext>
                  </m:mstyle>
                  <m:mi>n</m:mi>
               </m:mrow>
               <m:mrow>
                  <m:mstyle class="text">
                     <m:mtext class="textsf" mathvariant="sans-serif">disease</m:mtext>
                  </m:mstyle>
               </m:mrow>
            </m:msubsup>
            <m:mo class="MathClass-punc">,</m:mo>
            <m:msub>
               <m:mrow>
                  <m:mi>y</m:mi>
               </m:mrow>
               <m:mrow>
                  <m:mi>n</m:mi>
               </m:mrow>
            </m:msub>
         </m:mrow>
      </m:mfenced>
   </m:mrow>
   <m:mrow>
      <m:mi>n</m:mi>
      <m:mo class="MathClass-rel">=</m:mo>
      <m:mn>1</m:mn>
   </m:mrow>
   <m:mrow>
      <m:mi>N</m:mi>
   </m:mrow>
</m:msubsup>
</m:math>
</inline-formula>. The three classifiers used were a linear SVM and nonlinear SVMs with radial basis function (RBF) and polynomial kernels <abbrgrp>
<abbr bid="B40">40</abbr>
</abbrgrp>, with <it>C </it>= 1. Parameters of the nonlinear SVM classifiers were selected by cross-validation. Prior to classification, the sets of extracted components were standardized to zero mean and unit variance. Although standardization across the features is used more often, standardization across the components (which coincide with the samples from which they were extracted) has been performed here. It yielded much better accuracy, as has also been observed in Chapter 18 of <abbrgrp>
<abbr bid="B41">41</abbr>
</abbrgrp>, where standardization across the samples has also been preferred over standardization across the features in microarray data analysis. In the comparative performance analysis presented in Tables <tblr tid="T2">2</tblr>, <tblr tid="T3">3</tblr> and <tblr tid="T4">4</tblr>, the component selection method proposed here is represented by the best result over all four sets of selected components (obtained by a nested two-fold cross-validation with respect to the parameters of the classifiers, the single component point selection threshold &#916;&#952;, the regularization constant &#955; and the postulated number of components <it>M</it>). Since many components extracted with other combinations of the parameters also yielded good prediction accuracy, we have posted the complete results with the paper in the Additional Material Files section (Additional Files <supplr sid="S2">2</supplr>, <supplr sid="S3">3</supplr>, <supplr sid="S4">4</supplr> and <supplr sid="S5">5</supplr>). The reference samples used to represent the disease and control groups were obtained by averaging all the samples in the disease group, <inline-formula>
<m:math name="1471-2105-12-496-i45" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:msub>
   <m:mrow>
      <m:mstyle mathvariant="bold">
         <m:mi mathvariant="normal">x</m:mi>
      </m:mstyle>
   </m:mrow>
   <m:mrow>
      <m:mstyle class="text">
         <m:mtext class="textsf" mathvariant="sans-serif">disease</m:mtext>
      </m:mstyle>
   </m:mrow>
</m:msub>
<m:mo class="MathClass-rel">=</m:mo>
<m:mfrac>
   <m:mrow>
      <m:mn>1</m:mn>
   </m:mrow>
   <m:mrow>
      <m:msub>
         <m:mrow>
            <m:mi>N</m:mi>
         </m:mrow>
         <m:mrow>
            <m:mn>1</m:mn>
         </m:mrow>
      </m:msub>
   </m:mrow>
</m:mfrac>
<m:msubsup>
   <m:mrow>
      <m:mo class="MathClass-op"> &#8721;</m:mo>
   </m:mrow>
   <m:mrow>
      <m:mi>i</m:mi>
      <m:mo class="MathClass-rel">=</m:mo>
      <m:mn>1</m:mn>
   </m:mrow>
   <m:mrow>
      <m:msub>
         <m:mrow>
            <m:mi>N</m:mi>
         </m:mrow>
         <m:mrow>
            <m:mn>1</m:mn>
         </m:mrow>
      </m:msub>
   </m:mrow>
</m:msubsup>
<m:msub>
   <m:mrow>
      <m:mstyle mathvariant="bold">
         <m:mi mathvariant="normal">x</m:mi>
      </m:mstyle>
   </m:mrow>
   <m:mrow>
      <m:mi>i</m:mi>
   </m:mrow>
</m:msub>
</m:math>
</inline-formula> where <inline-formula>
<m:math name="1471-2105-12-496-i46" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:msub>
   <m:mrow>
      <m:mstyle mathvariant="bold">
         <m:mi mathvariant="normal">x</m:mi>
      </m:mstyle>
   </m:mrow>
   <m:mrow>
      <m:mi>i</m:mi>
   </m:mrow>
</m:msub>
<m:mo class="MathClass-rel">&#8712;</m:mo>
<m:msubsup>
   <m:mrow>
      <m:mfenced separators="" open="{" close="}">
         <m:mrow>
            <m:msub>
               <m:mrow>
                  <m:mstyle mathvariant="bold">
                     <m:mi mathvariant="normal">x</m:mi>
                  </m:mstyle>
               </m:mrow>
               <m:mrow>
                  <m:mi>n</m:mi>
               </m:mrow>
            </m:msub>
            <m:mo class="MathClass-punc">:</m:mo>
            <m:msub>
               <m:mrow>
                  <m:mi>y</m:mi>
               </m:mrow>
               <m:mrow>
                  <m:mi>n</m:mi>
               </m:mrow>
            </m:msub>
            <m:mo class="MathClass-rel">=</m:mo>
            <m:mn>1</m:mn>
         </m:mrow>
      </m:mfenced>
   </m:mrow>
   <m:mrow>
      <m:mi>n</m:mi>
      <m:mo class="MathClass-rel">=</m:mo>
      <m:mn>1</m:mn>
   </m:mrow>
   <m:mrow>
      <m:msub>
         <m:mrow>
            <m:mi>N</m:mi>
         </m:mrow>
         <m:mrow>
            <m:mn>1</m:mn>
         </m:mrow>
      </m:msub>
   </m:mrow>
</m:msubsup>
</m:math>
</inline-formula>, and control group, <inline-formula>
<m:math name="1471-2105-12-496-i47" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:msub>
   <m:mrow>
      <m:mstyle mathvariant="bold">
         <m:mi mathvariant="normal">x</m:mi>
      </m:mstyle>
   </m:mrow>
   <m:mrow>
      <m:mstyle class="text">
         <m:mtext class="textsf" mathvariant="sans-serif">control</m:mtext>
      </m:mstyle>
   </m:mrow>
</m:msub>
<m:mo class="MathClass-rel">=</m:mo>
<m:mfrac>
   <m:mrow>
      <m:mn>1</m:mn>
   </m:mrow>
   <m:mrow>
      <m:msub>
         <m:mrow>
            <m:mi>N</m:mi>
         </m:mrow>
         <m:mrow>
            <m:mn>2</m:mn>
         </m:mrow>
      </m:msub>
   </m:mrow>
</m:mfrac>
<m:msubsup>
   <m:mrow>
      <m:mo class="MathClass-op"> &#8721;</m:mo>
   </m:mrow>
   <m:mrow>
      <m:mi>i</m:mi>
      <m:mo class="MathClass-rel">=</m:mo>
      <m:mn>1</m:mn>
   </m:mrow>
   <m:mrow>
      <m:msub>
         <m:mrow>
            <m:mi>N</m:mi>
         </m:mrow>
         <m:mrow>
            <m:mn>2</m:mn>
         </m:mrow>
      </m:msub>
   </m:mrow>
</m:msubsup>
<m:msub>
   <m:mrow>
      <m:mstyle mathvariant="bold">
         <m:mi mathvariant="normal">x</m:mi>
      </m:mstyle>
   </m:mrow>
   <m:mrow>
      <m:mi>i</m:mi>
   </m:mrow>
</m:msub>
</m:math>
</inline-formula> where <inline-formula>
<m:math name="1471-2105-12-496-i48" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:msub>
   <m:mrow>
      <m:mstyle mathvariant="bold">
         <m:mi mathvariant="normal">x</m:mi>
      </m:mstyle>
   </m:mrow>
   <m:mrow>
      <m:mi>i</m:mi>
   </m:mrow>
</m:msub>
<m:mo class="MathClass-rel">&#8712;</m:mo>
<m:msubsup>
   <m:mrow>
      <m:mfenced separators="" open="{" close="}">
         <m:mrow>
            <m:msub>
               <m:mrow>
                  <m:mstyle mathvariant="bold">
                     <m:mi mathvariant="normal">x</m:mi>
                  </m:mstyle>
               </m:mrow>
               <m:mrow>
                  <m:mi>n</m:mi>
               </m:mrow>
            </m:msub>
            <m:mo class="MathClass-punc">:</m:mo>
            <m:msub>
               <m:mrow>
                  <m:mi>y</m:mi>
               </m:mrow>
               <m:mrow>
                  <m:mi>n</m:mi>
               </m:mrow>
            </m:msub>
            <m:mo class="MathClass-rel">=</m:mo>
            <m:mo class="MathClass-bin">-</m:mo>
            <m:mn>1</m:mn>
         </m:mrow>
      </m:mfenced>
   </m:mrow>
   <m:mrow>
      <m:mi>n</m:mi>
      <m:mo class="MathClass-rel">=</m:mo>
      <m:mn>1</m:mn>
   </m:mrow>
   <m:mrow>
      <m:msub>
         <m:mrow>
            <m:mi>N</m:mi>
         </m:mrow>
         <m:mrow>
            <m:mn>2</m:mn>
         </m:mrow>
      </m:msub>
   </m:mrow>
</m:msubsup>
</m:math>
</inline-formula> and <it>N</it>
<sub>1 </sub>+ <it>N</it>
<sub>2 </sub>= <it>N</it>. We consider this the fairest approach in the absence of any <it>prior </it>information that could suggest which labelled sample could serve as a gold standard. We conclude this section with an assessment of the computational cost of the proposed method. It has been implemented in the MATLAB 7.7 environment on a desktop computer with a 3 GHz dual-core processor and 2 GB of RAM. Processing of the proteomic and genomic datasets used in sections 1.5 to 1.7 took 10, 7 and 3 minutes, respectively.</p>
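<p>The two preprocessing steps described above, averaging the samples of each group to obtain the reference samples and standardizing each extracted component, can be sketched as follows. The function names are hypothetical, the label convention (1 for disease, -1 for control) follows the definitions given above, and the use of the population variance (division by the number of features) is an assumption of this sketch.</p>

```python
import math


def class_reference_samples(X, y):
    """Return (x_control, x_disease): feature-wise means of the samples in each group.

    X is a list of samples (each a list of K feature values) and y the matching
    labels, with 1 marking disease and -1 marking control.
    """
    def mean_of(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    disease = [x for x, label in zip(X, y) if label == 1]
    control = [x for x, label in zip(X, y) if label == -1]
    return mean_of(control), mean_of(disease)


def standardize_component(s):
    """Scale one extracted component (one per sample) to zero mean and unit variance.

    The variance is computed with division by len(s); a constant component would
    give a zero standard deviation and is not handled in this sketch.
    """
    mean = sum(s) / len(s)
    std = math.sqrt(sum((v - mean) ** 2 for v in s) / len(s))
    return [(v - mean) / std for v in s]
```

<p>Here the standardization runs over the feature values of a single component, i.e. across the sample from which it was extracted, and not over the values that one feature takes across all samples.</p>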
<tbl id="T2"><title><p>Table 2</p></title><caption><p>Comparative performance results in ovarian cancer prediction. Sensitivities and specificities were estimated by 100 two-fold cross-validations (standard deviations are in brackets).</p></caption><tblbdy cols="2">
      <r>
         <c ca="left">
            <p>
               <b>Method</b>
            </p>
         </c>
         <c ca="left">
            <p>
               <b>Sensitivity/Specificity/Accuracy</b>
            </p>
         </c>
      </r>
      <r>
         <c cspan="2">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Proposed method <it>M </it>= 3, &#916;&#952; = 5&#176;</p>
            <p>&#955; = 10<sup>-4</sup>&#955;<sub>max</sub></p>
            <p>Linear SVM</p>
         </c>
         <c ca="left">
            <p>Sensitivity: 96.2 (2.7)%; specificity: 93.6 (4.1)%; accuracy: 94.9%</p>
            <p>Control specific component extracted with respect to a cancer reference sample.</p>
         </c>
      </r>
      <r>
         <c cspan="2">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Proposed method <it>M </it>= 4, &#916;&#952; = 3&#176;</p>
            <p>&#955; = 10<sup>-6</sup>&#955;<sub>max</sub></p>
            <p>Linear SVM</p>
         </c>
         <c ca="left">
            <p>Sensitivity: 95.4 (3)%; specificity: 94 (3.7)%; accuracy: 94.7%</p>
            <p>Control specific component extracted with respect to a cancer reference sample.</p>
         </c>
      </r>
      <r>
         <c cspan="2">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>
               <abbrgrp>
                  <abbr bid="B1">1</abbr>
               </abbrgrp>
            </p>
         </c>
         <c ca="left">
            <p>Sensitivity: 81.4 (7.1)%; specificity: 71.7 (6.6)%</p>
         </c>
      </r>
      <r>
         <c cspan="2">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>
               <abbrgrp>
                  <abbr bid="B42">42</abbr>
               </abbrgrp>
            </p>
         </c>
         <c ca="left">
            <p>Sensitivity: 100%; specificity: 95% (<ul>one partition only:</ul> 50/50 training; 66/50 test).</p>
         </c>
      </r>
      <r>
         <c cspan="2">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>
               <abbrgrp>
                  <abbr bid="B44">44</abbr>
               </abbrgrp>
            </p>
         </c>
         <c ca="left">
            <p>Accuracy averaged over 10 ten-fold partitions: 98-99% (sd: 0.3-0.8)</p>
         </c>
      </r>
      <r>
         <c cspan="2">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>
               <abbrgrp>
                  <abbr bid="B13">13</abbr>
               </abbrgrp>
            </p>
         </c>
         <c ca="left">
            <p>Sensitivity: 98%, specificity: 95%, two-fold CV with 100 partitions.</p>
         </c>
      </r>
      <r>
         <c cspan="2">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>
               <abbrgrp>
                  <abbr bid="B45">45</abbr>
               </abbrgrp>
            </p>
         </c>
         <c ca="left">
            <p>Average error rate of 4.1% with three-fold CV.</p>
         </c>
      </r>
   </tblbdy></tbl>
<tbl id="T3"><title><p>Table 3</p></title><caption><p>Comparative performance results in prostate cancer prediction. Sensitivities and specificities were estimated by 100 two-fold cross-validations (standard deviations are in brackets).</p></caption><tblbdy cols="2">
      <r>
         <c ca="left">
            <p>
               <b>Methods</b>
            </p>
         </c>
         <c ca="left">
            <p>
               <b>Sensitivity/Specificity/Accuracy</b>
            </p>
         </c>
      </r>
      <r>
         <c cspan="2">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Proposed method <it>M </it>= 5, &#916;&#952; = 1<sup>0</sup></p>
            <p>&#955; = 10<sup>-4</sup>&#955;<sub>max </sub>Linear SVM</p>
         </c>
         <c ca="left">
            <p>Sensitivity: 97.6 (2.8)%; specificity: 99 (2.2)%; accuracy: 98.3%</p>
            <p>Control specific component extracted with respect to a cancer reference sample.</p>
         </c>
      </r>
      <r>
         <c cspan="2">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Proposed method <it>M </it>= 4, &#916;&#952; = 1<sup>0</sup></p>
            <p>&#955; = 10<sup>-4</sup>&#955;<sub>max </sub>Linear SVM</p>
         </c>
         <c ca="left">
            <p>Sensitivity: 97.7 (2.3)%; specificity: 98 (2.4)%; accuracy: 97.9%</p>
            <p>Control specific component extracted with respect to a cancer reference sample.</p>
         </c>
      </r>
      <r>
         <c cspan="2">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>
               <abbrgrp>
                  <abbr bid="B1">1</abbr>
               </abbrgrp>
            </p>
         </c>
         <c ca="left">
            <p>Sensitivity: 86 (6.6)%; specificity: 67.8 (12.9)%; accuracy: 76.9%.</p>
         </c>
      </r>
      <r>
         <c cspan="2">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>
               <abbrgrp>
                  <abbr bid="B46">46</abbr>
               </abbrgrp>
            </p>
         </c>
         <c ca="left">
            <p>Sensitivity: 94.7%; specificity: 75.9%; accuracy: 85.3%. 253 benign and 69 cancers. Results were obtained on independent test set comprised of 38 cancers and 228 benign samples.</p>
         </c>
      </r>
      <r>
         <c cspan="2">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>
               <abbrgrp>
                  <abbr bid="B47">47</abbr>
               </abbrgrp>
            </p>
         </c>
         <c ca="left">
            <p>Sensitivity: 97.1%; specificity: 96.8%; accuracy: 97%. 253 benign and 69 cancers. Cross-validation details not reported.</p>
         </c>
      </r>
      <r>
         <c cspan="2">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>
               <abbrgrp>
                  <abbr bid="B45">45</abbr>
               </abbrgrp>
            </p>
         </c>
         <c ca="left">
            <p>Average error rate of 28.97% on a four-class problem with three-fold cross-validation.</p>
         </c>
      </r>
   </tblbdy></tbl>
<tbl id="T4"><title><p>Table 4</p></title><caption><p>Comparative performance results in colon cancer prediction. Sensitivities and specificities were estimated by 100 two-fold cross-validations (standard deviations are in brackets).</p></caption><tblbdy cols="2">
      <r>
         <c ca="left">
            <p>
               <b>Methods</b>
            </p>
         </c>
         <c ca="left">
            <p>
               <b>Sensitivity/Specificity/Accuracy</b>
            </p>
         </c>
      </r>
      <r>
         <c cspan="2">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Proposed method <it>M </it>= 2, &#916;&#952; = 1<sup>0</sup></p>
            <p>RBF SVM (&#963;<sup>2 </sup>= 1200, C = 1)</p>
         </c>
         <c ca="left">
            <p>Sensitivity: 90.8 (5.5)%, specificity: 79.4 (9.8)%; accuracy: 85.1%</p>
            <p>Control specific component extracted with respect to a cancer reference sample.</p>
         </c>
      </r>
      <r>
         <c cspan="2">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>Proposed method <it>M </it>= 4, &#916;&#952; = 5<sup>0 </sup>&#955; = 10<sup>-2</sup>&#955;<sub>max</sub></p>
            <p>RBF SVM (&#963;<sup>2 </sup>= 1000, C = 1)</p>
         </c>
         <c ca="left">
            <p>Sensitivity: 89.8 (6.2)%, specificity: 78.6 (12.8)%; accuracy: 84.2%.</p>
            <p>Control specific component extracted with respect to a control reference sample.</p>
         </c>
      </r>
      <r>
         <c cspan="2">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>
               <abbrgrp>
                  <abbr bid="B1">1</abbr>
               </abbrgrp>
            </p>
         </c>
         <c ca="left">
            <p>Sensitivity: 89.7 (6.4)%, specificity: 84.3 (8.4)%; accuracy: 87%. 100 two-fold cross-validations.</p>
         </c>
      </r>
      <r>
         <c cspan="2">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>
               <abbrgrp>
                  <abbr bid="B2">2</abbr>
               </abbrgrp>
            </p>
         </c>
         <c ca="left">
            <p>Sensitivity: 92.1 (4.7)%, specificity: 85 (10.1)%; accuracy: 88.55%. 100 two-fold cross-validations. <it>c</it>u = 2.0.</p>
         </c>
      </r>
      <r>
         <c cspan="2">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>
               <abbrgrp>
                  <abbr bid="B48">48</abbr>
               </abbrgrp>
            </p>
         </c>
         <c ca="left">
            <p>Sensitivity: 92-95% calculated from Figure 5. Specificity not reported.</p>
         </c>
      </r>
      <r>
         <c cspan="2">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>
               <abbrgrp>
                  <abbr bid="B15">15</abbr>
               </abbrgrp>
            </p>
         </c>
         <c ca="left">
            <p>Accuracy 85%. Cross-validation details not reported.</p>
         </c>
      </r>
      <r>
         <c cspan="2">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>
               <abbrgrp>
                  <abbr bid="B50">50</abbr>
               </abbrgrp>
            </p>
         </c>
         <c ca="left">
            <p>Accuracy 82.5%, ten-fold cross-validation (RFE with linear SVM).</p>
         </c>
      </r>
      <r>
         <c cspan="2">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="left">
            <p>
               <abbrgrp>
                  <abbr bid="B51">51</abbr>
               </abbrgrp>
            </p>
         </c>
         <c ca="left">
            <p>Accuracy 88.84%, two-fold cross-validation (RFE with linear SVM and optimized penalty parameter C).</p>
         </c>
      </r>
   </tblbdy></tbl>
<suppl id="S2">
<title>
<p>Additional file 2</p>
</title>
<text>
<p>
<b>classification results obtained by the linear SVM applied to disease and control specific components extracted from the ovarian cancer dataset for various combinations of parameters <it>M</it>, &#955; and &#916;&#952;</b>.</p>
</text>
<file name="1471-2105-12-496-S2.XLSX">
   <p>Click here for file</p>
</file>
</suppl>
<suppl id="S3">
<title>
<p>Additional file 3</p>
</title>
<text>
<p>
<b>classification results obtained by the linear SVM applied to disease and control specific components extracted from the prostate cancer dataset for various combinations of parameters <it>M</it>, &#955; and &#916;&#952;</b>.</p>
</text>
<file name="1471-2105-12-496-S3.XLSX">
   <p>Click here for file</p>
</file>
</suppl>
<suppl id="S4">
<title>
<p>Additional file 4</p>
</title>
<text>
<p>
<b>classification results obtained by the linear SVM applied to disease and control specific components extracted from the colon cancer dataset for various combinations of parameters <it>M</it>, &#955; and &#916;&#952;</b>.</p>
</text>
<file name="1471-2105-12-496-S4.XLSX">
   <p>Click here for file</p>
</file>
</suppl>
<suppl id="S5">
<title>
<p>Additional file 5</p>
</title>
<text>
<p>
<b>best classification results obtained by the RBF SVM applied to disease and control specific components extracted from the colon cancer dataset for <it>M </it>= 4, &#955; = 10<sup>-2</sup>&#955;<sub>max </sub>and &#916;&#952; = 5<sup>0 </sup>and <it>M </it>= 2 and &#916;&#952; = 1<sup>0</sup>
</b>.</p>
</text>
<file name="1471-2105-12-496-S5.XLSX">
   <p>Click here for file</p>
</file>
</suppl>
<sec>
<st>
<p>1.4 Model validation</p>
</st>
<p>This section presents model validation results obtained on simulated data using LMM (2a)/(2b). To this end, each mixture sample has been composed of ten orthogonal components comprising <it>K </it>= 15000 features. Orthogonality implies that each feature belongs to one component only. By convention, the first component has been selected to contain disease specific features, the tenth component to contain control specific features, and components two to nine to contain features that are not differentially expressed and share similar concentrations in control and disease labelled samples. Concentration variability across the sample population is simulated using the following model for the disease group of samples:</p>
<p>
<display-formula>
<m:math name="1471-2105-12-496-i49" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mrow>
   <m:msub>
      <m:mrow>
         <m:mstyle mathvariant="bold">
            <m:mi mathvariant="normal">x</m:mi>
         </m:mstyle>
      </m:mrow>
      <m:mrow>
         <m:mi>n</m:mi>
      </m:mrow>
   </m:msub>
   <m:mo class="MathClass-rel">=</m:mo>
   <m:msubsup>
      <m:mrow>
         <m:mo mathsize="big"> &#8721;</m:mo>
      </m:mrow>
      <m:mrow>
         <m:mi>m</m:mi>
         <m:mo class="MathClass-rel">=</m:mo>
         <m:mn>1</m:mn>
      </m:mrow>
      <m:mrow>
         <m:mi>M</m:mi>
      </m:mrow>
   </m:msubsup>
   <m:msup>
      <m:mrow>
         <m:mo class="qopname">sin</m:mo>
      </m:mrow>
      <m:mrow>
         <m:mn>2</m:mn>
      </m:mrow>
   </m:msup>
   <m:mrow>
      <m:mo class="MathClass-open">(</m:mo>
      <m:mrow>
         <m:msub>
            <m:mrow>
               <m:mi>&#952;</m:mi>
            </m:mrow>
            <m:mrow>
               <m:mi>n</m:mi>
               <m:mi>m</m:mi>
            </m:mrow>
         </m:msub>
      </m:mrow>
      <m:mo class="MathClass-close">)</m:mo>
   </m:mrow>
   <m:msub>
      <m:mrow>
         <m:mstyle mathvariant="bold">
            <m:mi mathvariant="normal">s</m:mi>
         </m:mstyle>
      </m:mrow>
      <m:mrow>
         <m:mi>m</m:mi>
      </m:mrow>
   </m:msub>
</m:mrow>
</m:math>
</display-formula>
</p>
<p>and for the control group of samples:</p>
<p>
<display-formula id="M4">
<m:math name="1471-2105-12-496-i50" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mrow>
   <m:msub>
      <m:mrow>
         <m:mstyle mathvariant="bold">
            <m:mi mathvariant="normal">x</m:mi>
         </m:mstyle>
      </m:mrow>
      <m:mrow>
         <m:mi>n</m:mi>
      </m:mrow>
   </m:msub>
   <m:mo class="MathClass-rel">=</m:mo>
   <m:msubsup>
      <m:mrow>
         <m:mo mathsize="big"> &#8721;</m:mo>
      </m:mrow>
      <m:mrow>
         <m:mi>m</m:mi>
         <m:mo class="MathClass-rel">=</m:mo>
         <m:mn>1</m:mn>
      </m:mrow>
      <m:mrow>
         <m:mi>M</m:mi>
      </m:mrow>
   </m:msubsup>
   <m:msup>
      <m:mrow>
         <m:mo class="qopname">cos</m:mo>
      </m:mrow>
      <m:mrow>
         <m:mn>2</m:mn>
      </m:mrow>
   </m:msup>
   <m:mrow>
      <m:mo class="MathClass-open">(</m:mo>
      <m:mrow>
         <m:msub>
            <m:mrow>
               <m:mi>&#952;</m:mi>
            </m:mrow>
            <m:mrow>
               <m:mi>n</m:mi>
               <m:mi>m</m:mi>
            </m:mrow>
         </m:msub>
      </m:mrow>
      <m:mo class="MathClass-close">)</m:mo>
   </m:mrow>
   <m:msub>
      <m:mrow>
         <m:mstyle mathvariant="bold">
            <m:mi mathvariant="normal">s</m:mi>
         </m:mstyle>
      </m:mrow>
      <m:mrow>
         <m:mi>m</m:mi>
      </m:mrow>
   </m:msub>
</m:mrow>
</m:math>
</display-formula>
</p>
<p>Thus, by controlling the mixing angles <inline-formula>
<m:math name="1471-2105-12-496-i51" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:msubsup>
   <m:mrow>
      <m:mrow>
         <m:mo class="MathClass-open">{</m:mo>
         <m:mrow>
            <m:msub>
               <m:mrow>
                  <m:mi>&#952;</m:mi>
               </m:mrow>
               <m:mrow>
                  <m:mi>n</m:mi>
                  <m:mi>m</m:mi>
               </m:mrow>
            </m:msub>
         </m:mrow>
         <m:mo class="MathClass-close">}</m:mo>
      </m:mrow>
   </m:mrow>
   <m:mrow>
      <m:mi>n</m:mi>
      <m:mo class="MathClass-rel">=</m:mo>
      <m:mn>1</m:mn>
      <m:mo class="MathClass-punc">,</m:mo>
      <m:mi>m</m:mi>
      <m:mo class="MathClass-rel">=</m:mo>
      <m:mn>1</m:mn>
   </m:mrow>
   <m:mrow>
      <m:mi>N</m:mi>
      <m:mo class="MathClass-punc">,</m:mo>
      <m:mi>M</m:mi>
   </m:mrow>
</m:msubsup>
</m:math>
</inline-formula> the concentration of each component in the disease and control samples is controlled. The amount of concentration variability is also controlled by selecting <inline-formula>
<m:math name="1471-2105-12-496-i52" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:msubsup>
   <m:mrow>
      <m:mrow>
         <m:mo class="MathClass-open">{</m:mo>
         <m:mrow>
            <m:msub>
               <m:mrow>
                  <m:mi>&#952;</m:mi>
               </m:mrow>
               <m:mrow>
                  <m:mi>n</m:mi>
                  <m:mi>m</m:mi>
               </m:mrow>
            </m:msub>
         </m:mrow>
         <m:mo class="MathClass-close">}</m:mo>
      </m:mrow>
   </m:mrow>
   <m:mrow>
      <m:mi>n</m:mi>
      <m:mo class="MathClass-rel">=</m:mo>
      <m:mn>1</m:mn>
      <m:mo class="MathClass-punc">,</m:mo>
      <m:mi>m</m:mi>
      <m:mo class="MathClass-rel">=</m:mo>
      <m:mn>1</m:mn>
   </m:mrow>
   <m:mrow>
      <m:mi>N</m:mi>
      <m:mo class="MathClass-punc">,</m:mo>
      <m:mi>M</m:mi>
   </m:mrow>
</m:msubsup>
</m:math>
</inline-formula> to be confined within (non-)overlapping angular sectors. Note that, since sin<sup>2</sup>(&#952;) + cos<sup>2</sup>(&#952;) = 1, (4) implies that component <b>s</b>
<it>
<sub>m </sub>
</it>is contained in related disease and control samples in an overall concentration of 100%. To simulate biological variability between the samples, the relative concentration has been varied across the sample population, where the disease and control groups contained 100 samples each. The concentration vectors overlapped in the mixing angle domain, i.e. the concentration vector for disease specific features was confined to the sector [50<sup>0</sup>, 89.99<sup>0</sup>], for the neutral features to the sector [25<sup>0</sup>, 65<sup>0</sup>] and for control specific features to the sector [0.01<sup>0</sup>, 40<sup>0</sup>]. Thus, the amount of overlap between concentration profiles was significant, implying that in many cases neutral features were contained in greater concentrations in disease labelled samples than disease specific features, and that neutral features were contained in greater concentrations in control labelled samples than control specific features. Figures <figr fid="F2">2a</figr> and <figr fid="F2">2b</figr> show disease prediction results using four extracted disease and control specific components with the postulated overall number of components equal to <it>M </it>= 2 (red bars), <it>M </it>= 3 (green bars), <it>M </it>= 4 (blue bars) and <it>M </it>= 5 (magenta bars). Reference samples used in LMM (2a)/(2b) were obtained by averaging all the samples in the control and disease groups, respectively. Results reported in terms of sensitivity (Figure <figr fid="F2">2a</figr>) and specificity (Figure <figr fid="F2">2b</figr>) were obtained by the linear support vector machine (SVM) classifier using 100 independent two-fold cross-validations. The SCP selection parameter has been set to &#916;&#952; = 3<sup>0 </sup>and the sparseness regularization parameter in (3a)/(3b) to &#955; = 10<sup>-6</sup>&#183;&#955;<sub>max</sub>.
These parameters were not selected through cross-validation since the purpose of the computational experiment has been to evaluate the influence of the assumed number of components <it>M </it>on the prediction accuracy when the concentration varies across the sample population. The presented results demonstrate that a greater number of postulated components does not decrease prediction accuracy (on average it even increases slightly). However, an increased number of postulated components <it>M </it>reduces the number of features contained in the disease and control specific components selected for classification. As discussed previously, a greater <it>M </it>yields less complex disease and control specific components. Following the principle of parsimony, such a solution should be preferred over the more complex ones that are obtained for smaller <it>M</it>. Thus, the selected disease and control specific components are expected to be more discriminative and less sensitive to over-fitting when the number of postulated components is increased. In a practical implementation of the proposed approach to component selection, the optimal overall number of components needs to be determined by cross-validation. In the three real world experiments reported below, the number of components has been selected by cross-validation from <it>M </it>&#8712; {2, 3, 4, 5}. If the prediction accuracy achieved for two values of <it>M </it>is approximately equal, the components extracted with the greater value of <it>M </it>should be preferred.</p>
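<p>The simulation model (4) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used to produce Figure 2: the mixing angles are assumed to be drawn uniformly within each angular sector, the component amplitudes are assumed to be uniform random values, and the function names <it>make_components</it>, <it>sector</it> and <it>simulate_sample</it> are hypothetical.</p>

```python
import math
import random


def make_components(num_features, num_components, rng):
    """Build orthogonal components: each feature index belongs to one component only.

    Amplitudes are drawn uniformly from [0.5, 1.5]; this range is an assumption.
    Features left over when num_features is not divisible stay at zero.
    """
    block = num_features // num_components
    components = []
    for m in range(num_components):
        s = [0.0] * num_features
        for k in range(m * block, (m + 1) * block):
            s[k] = rng.uniform(0.5, 1.5)
        components.append(s)
    return components


def sector(m, num_components):
    """Angular sector (degrees) of component m, following the text.

    First component: disease specific; last: control specific; others: neutral.
    """
    if m == 0:
        return (50.0, 89.99)
    if m == num_components - 1:
        return (0.01, 40.0)
    return (25.0, 65.0)


def simulate_sample(components, label, rng):
    """Mix the components with sin^2 (disease) or cos^2 (control) weights, as in (4)."""
    num_components = len(components)
    x = [0.0] * len(components[0])
    for m, s in enumerate(components):
        low, high = sector(m, num_components)
        theta = math.radians(rng.uniform(low, high))  # uniform draw is an assumption
        if label == "disease":
            weight = math.sin(theta) ** 2
        else:
            weight = math.cos(theta) ** 2
        for k, value in enumerate(s):
            x[k] += weight * value
    return x
```

<p>Under these assumptions, calling <it>make_components</it> with 15000 features and ten components, and then <it>simulate_sample</it> 100 times per label, would give a dataset of the size described above.</p>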
<fig id="F2"><title><p>Figure 2</p></title><caption><p>model validation</p></caption><text>
   <p><b>model validation</b>. Sensitivities (Figure 2a) and specificities (Figure 2b), with standard deviations as error bars, estimated by the linear SVM classifier and 100 independent two-fold cross-validations using two disease specific and two control specific components. Components were extracted from the linear mixture models based on the control reference (c.r.) sample, model (2a), and the disease reference (d.r.) sample, model (2b), where each sample comprised ten orthogonal components containing <it>K </it>= 15000 features. One component contained, in prevailing concentration, disease specific features, one contained control specific features, and eight components contained features equally expressed in control and disease labelled samples. The relative concentration (expressed through a mixing angle) across the sample population was: for disease specific features in the range of 50<sup>0 </sup>to 89.99<sup>0</sup>; for features that are not differentially expressed in the range of 25<sup>0 </sup>to 65<sup>0</sup>; and for control specific features in the range of 0.01<sup>0 </sup>to 40<sup>0</sup>. The assumed overall number of components was <it>M </it>= 2 (red bars), <it>M </it>= 3 (green bars), <it>M </it>= 4 (blue bars) and <it>M </it>= 5 (magenta bars).</p>
</text><graphic file="1471-2105-12-496-2" hint_layout="single"/></fig>
</sec>
<sec>
<st>
<p>1.5 Ovarian cancer prediction from protein mass spectra</p>
</st>
<p>Low resolution surface-enhanced laser desorption ionization time-of-flight (SELDI-TOF) mass spectra of 100 controls and 100 cases have been used for the ovarian cancer prediction study <abbrgrp>
<abbr bid="B42">42</abbr>
</abbrgrp>. See also the website of the clinical proteomics program of the National Cancer Institute (NCI), <abbrgrp>
<abbr bid="B43">43</abbr>
</abbrgrp>, where the dataset used is labelled as "Ovarian 4-3-02". All spectra were baseline corrected. Thus, some intensities have negative values. Table <tblr tid="T2">2</tblr> presents the best result obtained by the proposed SCA-based component selection method together with results obtained for the same dataset by competing methods reported in the cited references as well as by the predictive factorization method proposed in <abbrgrp>
<abbr bid="B1">1</abbr>
</abbrgrp>. The described SCA method has been used to extract four sets of components with the overall number of components <it>M </it>assumed to be 2, 3, 4 and 5. Figure <figr fid="F3">3</figr> shows sensitivities and specificities estimated by 100 independent two-fold cross-validations using the linear SVM classifier, which yielded the best results when compared against nonlinear SVM classifiers based on polynomial and RBF kernels. A performance improvement is visible when the assumed number of components is increased from 2 to 3, 4 or 5. The error bars are dictated by the sample size and would decrease with a larger sample. Thus, the mean values should be examined to observe the trend in performance as a function of <it>M</it>. The best result (shown in Table <tblr tid="T2">2</tblr>) has been obtained with the linear SVM classifier for <it>M </it>= 3 with a sensitivity of 96.2% and a specificity of 93.6%, but results of very similar quality have been obtained for several combinations of the parameters <it>M</it>, &#916;&#952; and &#955;, see Figure <figr fid="F3">3</figr>, most notably for <it>M </it>= 4 (see the second row of Table <tblr tid="T2">2</tblr> and the Additional File <supplr sid="S2">2</supplr>). As seen in Table <tblr tid="T2">2</tblr>, only <abbrgrp>
<abbr bid="B13">13</abbr>
</abbrgrp> reported a better result for a two-fold cross-validation with the same number of partitions. There, a combination of a genetic algorithm and the <it>k</it>-nearest neighbours method, originally developed for mining of high-dimensional microarray gene expression data, has been used for the analysis of proteomics data. However, the method <abbrgrp>
<abbr bid="B13">13</abbr>
</abbrgrp> was tested on the proteomic ovarian cancer dataset only, while the method proposed here exhibited excellent performance in the prediction of prostate cancer from proteomic data (reported in section 1.6), as well as of colon cancer from genomic data (presented in section 1.7). The method shown in <abbrgrp>
<abbr bid="B42">42</abbr>
</abbrgrp> used 50 samples from the control group and 50 samples from the ovarian cancer group to discover a pattern that discriminated cancer from non-cancer group. This pattern has then been used to classify an independent set of 50 samples with ovarian cancer and 66 samples unaffected by ovarian cancer. In <abbrgrp>
<abbr bid="B44">44</abbr>
</abbrgrp>, a fuzzy rule based classifier fusion is proposed for feature selection and classification (diagnosis) of ovarian cancer from protein mass spectra. The demonstrated accuracy of 98-99% has been estimated through 10 ten-fold cross-validations (as opposed to the 100 two-fold cross-validations used here). Moreover, as demonstrated in sections 1.6 and 1.7, the method proposed here exhibited good performance in the diagnosis of prostate and colon cancers from proteomic and gene expression levels, respectively. In <abbrgrp>
<abbr bid="B45">45</abbr>
</abbrgrp>, a clustering based method for feature selection from mass spectrometry data is derived by combining <it>k</it>-means clustering and a genetic algorithm. The method exhibited an accuracy of 95.8% (error rate 4.1%), but this has been assessed through three-fold cross-validations (as opposed to the two-fold cross-validations used here).</p>
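<p>The evaluation protocol used for the proposed method (100 independent two-fold cross-validations, reporting the mean and standard deviation of sensitivity and specificity) can be sketched as follows. This is an illustrative sketch rather than the implementation behind Table 2: the stratified split, the averaging of the two folds within each partition, and the nearest-centroid classifier standing in for the linear SVM are assumptions made to keep the example free of external dependencies, and all function names are hypothetical.</p>

```python
import random
import statistics


def sens_spec(y_true, y_pred):
    """Sensitivity and specificity for labels 1 (disease) and 0 (control)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def nearest_centroid(X_train, y_train, X_test):
    """Stand-in classifier (assumption): NOT the linear SVM used in the paper."""
    def centroid(label):
        rows = [x for x, t in zip(X_train, y_train) if t == label]
        return [sum(col) / len(rows) for col in zip(*rows)]

    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    c0, c1 = centroid(0), centroid(1)
    return [1 if sq_dist(x, c0) > sq_dist(x, c1) else 0 for x in X_test]


def repeated_two_fold(X, y, fit_predict, repeats=100, seed=0):
    """Return ((mean, sd) of sensitivity, (mean, sd) of specificity).

    Each repeat splits every class in half at random (stratification is an
    assumption), trains on one half, tests on the other, and swaps the roles.
    Needs at least two samples per class and repeats of 2 or more.
    """
    rng = random.Random(seed)
    pos = [i for i, t in enumerate(y) if t == 1]
    neg = [i for i, t in enumerate(y) if t == 0]
    sens, spec = [], []
    for _ in range(repeats):
        rng.shuffle(pos)
        rng.shuffle(neg)
        half_a = pos[: len(pos) // 2] + neg[: len(neg) // 2]
        half_b = pos[len(pos) // 2:] + neg[len(neg) // 2:]
        fold_sens, fold_spec = [], []
        for train, test in ((half_a, half_b), (half_b, half_a)):
            pred = fit_predict([X[i] for i in train],
                               [y[i] for i in train],
                               [X[i] for i in test])
            se, sp = sens_spec([y[i] for i in test], pred)
            fold_sens.append(se)
            fold_spec.append(sp)
        sens.append(sum(fold_sens) / 2)
        spec.append(sum(fold_spec) / 2)
    return ((statistics.mean(sens), statistics.stdev(sens)),
            (statistics.mean(spec), statistics.stdev(spec)))
```

<p>Any classifier with the same call signature as <it>nearest_centroid</it> could be passed as <it>fit_predict</it>, for instance a wrapper around a linear SVM from an external library.</p>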
<fig id="F3"><title><p>Figure 3</p></title><caption><p>ovarian cancer prediction</p></caption><text>
   <p><b>ovarian cancer prediction</b>. Sensitivities (a) and specificities (b) (with standard deviations as error bars) estimated in ovarian cancer prediction from protein expression levels using 100 independent two-fold cross-validations and linear SVM classifier. Four sets of selected components were extracted by SCA-based factorization using LMMs (2a) and (2b) with control reference (c.r.) and disease reference (d.r.) samples respectively, where the overall number of components <it>M </it>has been set to 2 (red bars), 3 (green bars), 4 (blue bars) and 5 (magenta bars). Optimal values of the parameters &#955; and &#916;&#952; were used for each <it>M</it>. Performance improvement is visible when number of components is increased from 2 to 3, 4 or 5.</p>
</text><graphic file="1471-2105-12-496-3" hint_layout="single"/></fig>
</sec>
<sec>
<st>
<p>1.6 Prostate cancer prediction from protein mass spectra</p>
</st>
<p>Low resolution SELDI-TOF mass spectra of 63 controls: no evidence of cancer with prostate-specific antigen (PSA)&lt;1, and 69 cases (prostate cancers): 26 with 4&lt;PSA&lt;10 and 43 with PSA&gt;10, have been used for the prostate cancer prediction study <abbrgrp>
<abbr bid="B46">46</abbr>
</abbrgrp>. An additional 190 control samples with benign conditions (4&lt;PSA&lt;10) are available as well (see the website of the clinical proteomics program of the NCI, <abbrgrp>
<abbr bid="B43">43</abbr>
</abbrgrp>), in the dataset labelled as "JNCI_Data_7-3-02". However, these samples were not used in the two-class comparative performance analysis reported here. The proposed SCA-based method has been used to extract four sets of components with the overall number of components <it>M </it>assumed to be 2, 3, 4 and 5. The best result has been achieved for <it>M </it>= 5 with a sensitivity of 97.6% and a specificity of 99%, but results of very similar quality have been obtained for several combinations of the parameters <it>M</it>, &#916;&#952; and &#955; (see Figure <figr fid="F4">4</figr> and the Additional File <supplr sid="S3">3</supplr>). Table <tblr tid="T3">3</tblr> presents the two best results achieved by the proposed SCA-based approach to component selection together with the results obtained by competing methods reported in the cited references. The linear SVM classifier yielded the best results when compared against nonlinear SVM classifiers based on polynomial and RBF kernels. According to Table <tblr tid="T3">3</tblr>, a comparable (although slightly worse) result is reported in reference <abbrgrp>
<abbr bid="B47">47</abbr>
</abbrgrp> only. The method <abbrgrp>
<abbr bid="B47">47</abbr>
</abbrgrp> was proposed for the analysis of mass spectra for prostate cancer screening. The system is composed of three stages: feature selection using a statistical significance test, classification by radial basis function and probabilistic neural networks, and optimization of the results through receiver-operating-characteristic analysis. The method achieved a sensitivity of 97.1% and a specificity of 96.8%, but the cross-validation setting has not been described in detail. In <abbrgrp>
<abbr bid="B46">46</abbr>
</abbrgrp>, the training group has been used to discover a pattern that discriminated the cancer group from the non-cancer group. This pattern has then been used to classify an independent set of 38 patients with prostate cancer and 228 patients with benign conditions. The obtained specificity is low. The predictive matrix factorization method <abbrgrp>
<abbr bid="B1">1</abbr>
</abbrgrp> yielded a significantly worse result than the method proposed here. In <abbrgrp>
<abbr bid="B45">45</abbr>
</abbrgrp> a clustering based method for feature selection from mass spectrometry data is derived by combining <it>k</it>-means clustering and a genetic algorithm. Despite the use of three-fold cross-validation, the reported error was 28.97%. Figure <figr fid="F4">4</figr> shows sensitivities and specificities estimated by 100 independent two-fold cross-validations using the linear SVM classifier on components selected by the method proposed here. For each <it>M</it>, the optimal values of the parameters &#955; and &#916;&#952; (obtained by cross-validation) have been used to obtain the results shown in Figure <figr fid="F4">4</figr>. Increasing the postulated number of components from 2 to 5 increased accuracy from 97.4% to 98.3%. Thus, better accuracy is achieved with a smaller number of features (<it>m</it>/<it>z </it>ratios) contained in the selected components.</p>
<fig id="F4"><title><p>Figure 4</p></title><caption><p>prostate cancer prediction</p></caption><text>
   <p><b>prostate cancer prediction</b>. Sensitivities (a) and specificities (b) (with standard deviations as error bars) estimated in prostate cancer prediction from protein expression levels using 100 independent two-fold cross-validations and linear SVM classifier. Four sets of selected components were extracted by SCA-based factorization using LMMs (2a) and (2b) with control reference (c.r.) and disease reference (d.r.) samples respectively, where the overall number of components <it>M </it>has been set to 2 (red bars), 3 (green bars), 4 (blue bars) and 5 (magenta bars). Optimal values of the parameters &#955; and &#916;&#952; were used for each <it>M</it>. Performance improvement is visible when number of components is increased from 2 to 5.</p>
</text><graphic file="1471-2105-12-496-4" hint_layout="single"/></fig>
</sec>
<sec>
<st>
<p>1.7 Colon cancer prediction from gene expression profiles</p>
</st>
<p>Gene expression profiles of 40 colon cancer and 22 normal colon tissue samples obtained by an Affymetrix oligonucleotide array <abbrgrp>
<abbr bid="B48">48</abbr>
</abbrgrp>, have also been used for validation and comparative performance analysis of the proposed feature extraction method. Gene expression profiles have been downloaded from <abbrgrp>
<abbr bid="B49">49</abbr>
</abbrgrp>. The original data produced by the oligonucleotide array contained more than 6500 genes, but only 2000 high-intensity genes have been used for cluster analysis in <abbrgrp>
<abbr bid="B48">48</abbr>
</abbrgrp> and are provided for download on the cited website. The proposed SCA-based approach to feature extraction/component selection has been used to extract four sets of components with up- and down-regulated genes, with the overall number of components <it>M </it>assumed to be 2, 3, 4 and 5. The linear SVM classifier has been applied to groups of the four sets of selected components extracted from gene expression levels for specific combinations of the parameters &#916;&#952;, &#955; and <it>M</it>. The best result in terms of sensitivity and specificity for each <it>M </it>has been selected and is shown in Figure <figr fid="F5">5</figr>. The complete list of results obtained by the linear SVM classifier is presented in Additional File <supplr sid="S4">4</supplr>. An increased number of postulated components <it>M </it>did not decrease accuracy, but it yielded components selected for classification with a reduced number of genes. This is verified in Figure <figr fid="F6">6</figr>, which shows the component with up-regulated genes <inline-formula>
<m:math name="1471-2105-12-496-i53" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:msubsup>
   <m:mrow>
      <m:mstyle mathvariant="bold">
         <m:mi mathvariant="normal">s</m:mi>
      </m:mstyle>
   </m:mrow>
   <m:mrow>
      <m:mstyle class="text">
         <m:mtext class="textsf" mathvariant="sans-serif">control</m:mtext>
      </m:mstyle>
   </m:mrow>
   <m:mrow>
      <m:mstyle class="text">
         <m:mtext class="textsf" mathvariant="sans-serif">disease</m:mtext>
      </m:mstyle>
   </m:mrow>
</m:msubsup>
</m:math>
</inline-formula> extracted from a cancer-labelled sample w.r.t. the control reference for the assumed number of components <it>M </it>= 2 and <it>M </it>= 4. Thus, it is confirmed again that an increased <it>M </it>yields less complex components that, following the principle of parsimony, should be preferred over the more complex ones obtained with smaller <it>M</it>. In order to (possibly) increase the prediction accuracy, we have applied nonlinear (polynomial and RBF) SVM classifiers to the two groups of the four sets of components that yielded the best results with the linear SVM classifier: <it>M </it>= 2 (&#916;&#952; = 1&#176;) and <it>M </it>= 4 (&#955; = 10<sup>-2</sup>&#955;<sub>max</sub> and &#916;&#952; = 5&#176;). The polynomial SVM classifier has been cross-validated for polynomial degrees d = 2, 3 and 4. The RBF SVM classifier with the kernel <inline-formula>
<m:math name="1471-2105-12-496-i54" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mi>&#954;</m:mi>
<m:mrow>
   <m:mo class="MathClass-open">(</m:mo>
   <m:mrow>
      <m:mstyle mathvariant="bold">
         <m:mi mathvariant="normal">x</m:mi>
      </m:mstyle>
      <m:mo class="MathClass-punc">,</m:mo>
      <m:mstyle mathvariant="bold">
         <m:mi mathvariant="normal">y</m:mi>
      </m:mstyle>
   </m:mrow>
   <m:mo class="MathClass-close">)</m:mo>
</m:mrow>
<m:mo class="MathClass-rel">=</m:mo>
<m:mo class="qopname"> exp</m:mo>
<m:mfenced separators="" open="(" close=")">
   <m:mrow>
      <m:mo class="MathClass-bin">-</m:mo>
      <m:msubsup>
         <m:mrow>
            <m:mfenced separators="" open="&#8741;" close="&#8741;">
               <m:mrow>
                  <m:mstyle mathvariant="bold">
                     <m:mi mathvariant="normal">x</m:mi>
                  </m:mstyle>
                  <m:mo class="MathClass-bin">-</m:mo>
                  <m:mstyle mathvariant="bold">
                     <m:mi mathvariant="normal">y</m:mi>
                  </m:mstyle>
               </m:mrow>
            </m:mfenced>
         </m:mrow>
         <m:mrow>
            <m:mn>2</m:mn>
         </m:mrow>
         <m:mrow>
            <m:mn>2</m:mn>
         </m:mrow>
      </m:msubsup>
      <m:mo>/</m:mo>
      <m:mn>2</m:mn>
      <m:msup>
         <m:mrow>
            <m:mi>&#963;</m:mi>
         </m:mrow>
         <m:mrow>
            <m:mn>2</m:mn>
         </m:mrow>
      </m:msup>
   </m:mrow>
</m:mfenced>
</m:math>
</inline-formula> has been cross-validated for the variance &#963;<sup>2 </sup>in the range 5 &#215; 10<sup>2 </sup>to 1.5 &#215; 10<sup>3 </sup>in steps of 10<sup>2</sup>. The best result has been obtained with &#963;<sup>2 </sup>= 1.2 &#215; 10<sup>3 </sup>for <it>M </it>= 2 and with &#963;<sup>2 </sup>= 1.0 &#215; 10<sup>3 </sup>for <it>M </it>= 4. The achieved accuracy is comparable with that of other reported state-of-the-art methods, as shown in Table <tblr tid="T4">4</tblr> and in Additional File <supplr sid="S5">5</supplr>. The predictive matrix factorization method <abbrgrp>
<abbr bid="B1">1</abbr>
</abbrgrp> yielded slightly better results here, but it showed significantly worse results in the cases of ovarian (see Table <tblr tid="T2">2</tblr>) and prostate (see Table <tblr tid="T3">3</tblr>) cancers. The gene discovery method <abbrgrp>
<abbr bid="B2">2</abbr>
</abbrgrp> has been applied for three values of the threshold <it>c</it>
<sub>u </sub>&#8712; {2, 2.5, 3} used to select up-regulated genes. Maximum <it>a posteriori </it>probability has been used to assign genes to each of the three components containing up-regulated, down-regulated and not differentially expressed genes. Thus, for each threshold value, two components were obtained for training a classifier. The base-10 logarithm was applied to the gene fold-change values before gene discovery/selection took place. The best result reported in Table <tblr tid="T4">4</tblr> has been obtained for a component containing up-regulated genes with <it>c</it>
<sub>u </sub>= 2.0 and an RBF SVM classifier, where &#963;<sup>2 </sup>has been cross-validated in the range 10<sup>2 </sup>to 10<sup>3 </sup>in steps of 10<sup>2</sup>. The best result has been obtained for &#963;<sup>2 </sup>= 5 &#215; 10<sup>2</sup>. The gene discovery method <abbrgrp>
<abbr bid="B2">2</abbr>
</abbrgrp> slightly outperformed the method proposed here. However, as opposed to the proposed method, the gene discovery method <abbrgrp>
<abbr bid="B2">2</abbr>
</abbrgrp> is not applicable to the analysis of mass spectra. The gene selection method in <abbrgrp>
<abbr bid="B15">15</abbr>
</abbrgrp> is model-driven: it tries to take into account the genes' group behaviours and interactions by developing an ensemble dependence model (EDM). The microarray dataset is clustered first. The EDM is based on modelling dependencies that represent inter-cluster relationships. The inter-cluster dependence matrix is the basis for discrimination between cancerous and non-cancerous samples. The classification accuracy of 85% reported in <abbrgrp>
<abbr bid="B15">15</abbr>
</abbrgrp> is very close to the one obtained by the SCA-based method proposed here. However, while SCA-based performance has been assessed through two-fold cross-validation, no cross-validation details were reported in <abbrgrp>
<abbr bid="B15">15</abbr>
</abbrgrp>. Similarly, sensitivity had to be estimated indirectly from Figure <figr fid="F5">5</figr> in <abbrgrp>
<abbr bid="B48">48</abbr>
</abbrgrp>. The method in <abbrgrp>
<abbr bid="B50">50</abbr>
</abbrgrp> combines recursive feature elimination and the linear SVM to yield an accuracy of 82.5%. This is also less accurate than what has been achieved by the proposed method. Moreover, the accuracy reported in <abbrgrp>
<abbr bid="B50">50</abbr>
</abbrgrp> has been assessed by ten-fold cross-validation only, which is known to yield an overly optimistic performance assessment. In this regard, the accuracy reported in <abbrgrp>
<abbr bid="B51">51</abbr>
</abbrgrp> can be regarded as closer to a realistic one, since it has been assessed by two-fold cross-validation. This method, like <abbrgrp>
<abbr bid="B50">50</abbr>
</abbrgrp>, again combines recursive feature elimination with the SVM, but it additionally takes into account the parameter <it>C</it>. The reported accuracy of 88.84% is slightly better than the one obtained by the method proposed here. However, the proposed method is classifier-independent and, as demonstrated in sections 1.5 and 1.6, it yields good results on cancer diagnosis from proteomic datasets as well.</p>
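<p>The RBF kernel and the variance grid described above can be written out as a short sketch. It is illustrative only: the function names are assumptions, and the <it>score</it> argument is a hypothetical placeholder for whatever cross-validated performance measure is used to compare the candidate values of &#963;<sup>2</sup>.</p>

```python
import math


def rbf_kernel(x, y, sigma2):
    """RBF kernel exp(-||x - y||^2 / (2 * sigma2)) for two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma2))


def sigma2_grid(start=500, stop=1500, step=100):
    """Candidate variances from 5 x 10^2 to 1.5 x 10^3 in steps of 10^2."""
    return list(range(start, stop + 1, step))


def select_sigma2(grid, score):
    """Return the candidate variance with the highest score.

    `score` is a hypothetical callable mapping a variance to a
    cross-validated performance value (higher is better).
    """
    return max(grid, key=score)
```

<p>With the default arguments the grid contains the eleven values 500, 600, ..., 1500. The same helper with the arguments 100, 1000 and 100 gives the grid used for the gene discovery method.</p>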
<fig id="F5"><title><p>Figure 5</p></title><caption><p>colon cancer prediction</p></caption><text>
   <p><b>colon cancer prediction</b>. Sensitivities (a) and specificities (b) (with standard deviations as error bars) estimated in colon cancer prediction from gene expression levels using 100 independent two-fold cross-validations and a linear SVM classifier. Four sets of selected components were extracted by using LMMs (2a) and (2b) with control reference (c.r.) and disease reference (d.r.) samples respectively, where the overall number of components <it>M </it>has been set to 2 (red bars), 3 (green bars), 4 (blue bars) and 5 (magenta bars). Optimal values of the parameters &#955; and &#916;&#952; were used for each <it>M</it>. Increasing the number of components <it>M </it>did not decrease prediction accuracy but did reduce the number of features (genes) in the components used for classification (see Figure 6).</p>
</text><graphic file="1471-2105-12-496-5" hint_layout="single"/></fig>
<fig id="F6"><title><p>Figure 6</p></title><caption><p>colon cancer feature vectors</p></caption><text>
   <p><b>colon cancer feature vectors</b>. Component containing up-regulated genes extracted from a cancerous sample w.r.t. a control reference sample using LMM (2a): a) assumed number of components <it>M </it>= 2; b) assumed number of components <it>M </it>= 4.</p>
</text><graphic file="1471-2105-12-496-6" hint_layout="single"/></fig>
</sec>
</sec>
<sec>
<st>
<p>Conclusions</p>
</st>
<p>This work presents a feature extraction/component selection method based on an innovative additive linear mixture model of a sample (protein or gene expression levels represented respectively by mass spectra or microarray data) and on sparseness-constrained factorization that operates on a sample(experiment)-by-sample basis. This differs from existing methods, which factorize the complete dataset simultaneously. The sample model comprises a test sample and a reference sample representing the disease and/or control group. Each sample is decomposed into several components (their number is determined by cross-validation) that are selected automatically, without using label information, as disease-specific, control-specific and not differentially expressed. The automatic selection is based on mixing angles that are estimated directly from each sample. Hence, due to the locality of the decomposition, the strength of the expression of each feature can vary from sample to sample. However, the feature can still be allocated to the same (disease- or control-specific) component in different samples. In contrast, feature allocation/selection algorithms that operate on a whole dataset simultaneously try to optimize a single threshold for the whole dataset. Selected components can be used for classification because label information is not used in the selection. Moreover, disease-specific component(s) can also be used for further biomarker-related analysis. As opposed to the existing matrix factorization methods, such a disease-specific component can be obtained from one sample (experiment) only. By postulating one or more components with not differentially expressed features, the method yields less complex disease- and control-specific components that are composed of a smaller number of features with higher discriminative power. This has been demonstrated to improve prediction accuracy. 
Moreover, decomposing a sample with one or more components containing indifferent features (indirectly) performs sample-adaptive preprocessing, in that it removes features that do not vary significantly across the sample population. The proposed feature extraction/component selection method is demonstrated on real-world proteomic datasets used for prediction of ovarian and prostate cancers, as well as on the genomic dataset used for colon cancer prediction. Results obtained by 100 two-fold cross-validations compare favourably with most of the state-of-the-art methods cited in the literature and used for cancer prediction on the same datasets.</p>
</sec>
<sec>
<st>
<p>Authors' contributions</p>
</st>
<p>IK proposed the novel linear mixture model of the samples and the methodology for automatic selection of disease- and control-specific components extracted from the samples by means of sparse component analysis. He also performed model validation and implemented the clustering phase of the sparse component analysis method. MF implemented the iterative thresholding-based shrinkage algorithm for extraction of the components and performed cross-validation-based component extraction and classification. All authors read and approved the final manuscript.</p>
</sec>
</bdy><bm>
<ack>
<sec>
<st>
<p>Acknowledgements</p>
</st>
<p>This work has been supported by the Ministry of Science, Education and Sports, Republic of Croatia, through Grant 098-0982903-2558. The help of Professor Vojislav Kecman and Dr. Ivanka Jeri&#263; in proofreading the manuscript is gratefully acknowledged.</p>
</sec>
</ack>
<refgrp><bibl id="B1"><title><p>A factorization method for the classification of infrared spectra</p></title><aug><au><snm>Henneges</snm><fnm>C</fnm></au><au><snm>Laskov</snm><fnm>P</fnm></au><au><snm>Darmawan</snm><fnm>E</fnm></au><au><snm>Backhaus</snm><fnm>J</fnm></au><au><snm>Kammerer</snm><fnm>B</fnm></au><au><snm>Zell</snm><fnm>A</fnm></au></aug><source>BMC Bioinformatics</source><pubdate>2010</pubdate><volume>11</volume><fpage>561</fpage><xrefbib><pubidlist><pubid idtype="doi">10.1186/1471-2105-11-561</pubid><pubid idtype="pmcid">3247165</pubid><pubid idtype="pmpid" link="fulltext">21078178</pubid></pubidlist></xrefbib></bibl><bibl id="B2"><title><p>A Three Component Latent Class Model for Robust Semiparametric Gene Discovery</p></title><aug><au><snm>Alfo</snm><fnm>M</fnm></au><au><snm>Farcomeni</snm><fnm>A</fnm></au><au><snm>Tardella</snm><fnm>L</fnm></au></aug><source>Stat Appl in Genet and Mol Biol</source><pubdate>2011</pubdate><volume>10</volume><issue>1</issue><note>Article 7</note></bibl><bibl id="B3"><title><p>Knowledge-based gene expression classification via matrix factorization</p></title><aug><au><snm>Schachtner</snm><fnm>R</fnm></au><au><snm>Lutter</snm><fnm>D</fnm></au><au><snm>Knollm&#252;ller</snm><fnm>P</fnm></au><au><snm>Tom&#233;</snm><fnm>AM</fnm></au><au><snm>Theis</snm><fnm>FJ</fnm></au><au><snm>Schmitz</snm><fnm>G</fnm></au><au><snm>Stetter</snm><fnm>M</fnm></au><au><snm>Vilda</snm><fnm>PG</fnm></au><au><snm>Lang</snm><fnm>EW</fnm></au></aug><source>Bioinformatics</source><pubdate>2008</pubdate><volume>24</volume><fpage>1688</fpage><lpage>1697</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/btn245</pubid><pubid idtype="pmcid">2638868</pubid><pubid idtype="pmpid" link="fulltext">18535085</pubid></pubidlist></xrefbib></bibl><bibl id="B4"><title><p>Linear modes of gene expression determined by independent component 
analysis</p></title><aug><au><snm>Liebermeister</snm><fnm>W</fnm></au></aug><source>Bioinformatics</source><pubdate>2002</pubdate><volume>18</volume><fpage>51</fpage><lpage>60</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/18.1.51</pubid><pubid idtype="pmpid" link="fulltext">11836211</pubid></pubidlist></xrefbib></bibl><bibl id="B5"><title><p>Analyzing M-CSF dependent monocyte/macrophage differentiation: Expression modes and meta-modes derived from an independent component analysis</p></title><aug><au><snm>Lutter</snm><fnm>D</fnm></au><au><snm>Ugocsai</snm><fnm>P</fnm></au><au><snm>Grandl</snm><fnm>M</fnm></au><au><snm>Orso</snm><fnm>E</fnm></au><au><snm>Theis</snm><fnm>F</fnm></au><au><snm>Lang</snm><fnm>EW</fnm></au><au><snm>Schmitz</snm><fnm>G</fnm></au></aug><source>BMC Bioinformatics</source><pubdate>2008</pubdate><volume>9</volume><fpage>100</fpage><xrefbib><pubidlist><pubid idtype="doi">10.1186/1471-2105-9-100</pubid><pubid idtype="pmcid">2277398</pubid><pubid idtype="pmpid" link="fulltext">18279525</pubid></pubidlist></xrefbib></bibl><bibl id="B6"><title><p>Hybridizing Sparse Component Analysis with Genetic Algorithms for Microarray Analysis</p></title><aug><au><snm>Stadtlthanner</snm><fnm>K</fnm></au><au><snm>Theis</snm><fnm>FJ</fnm></au><au><snm>Lang</snm><fnm>EW</fnm></au><au><snm>Tom&#233;</snm><fnm>AM</fnm></au><au><snm>Puntonet</snm><fnm>CG</fnm></au><au><snm>G&#243;rriz</snm><fnm>JM</fnm></au></aug><source>Neurocomputing</source><pubdate>2008</pubdate><volume>71</volume><fpage>2356</fpage><lpage>2376</lpage><xrefbib><pubid idtype="doi">10.1016/j.neucom.2007.09.017</pubid></xrefbib></bibl><bibl id="B7"><title><p>Biclustering of gene expression data by non-smooth non-negative matrix 
factorization</p></title><aug><au><snm>Carmona-Saez</snm><fnm>P</fnm></au><au><snm>Pascual-Marqui</snm><fnm>RD</fnm></au><au><snm>Tirado</snm><fnm>F</fnm></au><au><snm>Carazo</snm><fnm>JM</fnm></au><au><snm>Pascual-Montano</snm><fnm>A</fnm></au></aug><source>BMC Bioinformatics</source><pubdate>2006</pubdate><volume>7</volume><fpage>78</fpage><xrefbib><pubidlist><pubid idtype="doi">10.1186/1471-2105-7-78</pubid><pubid idtype="pmcid">1434777</pubid><pubid idtype="pmpid" link="fulltext">16503973</pubid></pubidlist></xrefbib></bibl><bibl id="B8"><title><p>Application of independent component analysis to microarrays</p></title><aug><au><snm>Lee</snm><fnm>SI</fnm></au><au><snm>Batzoglou</snm><fnm>S</fnm></au></aug><source>Genome Biol</source><pubdate>2003</pubdate><volume>4</volume><fpage>R76</fpage><xrefbib><pubidlist><pubid idtype="doi">10.1186/gb-2003-4-11-r76</pubid><pubid idtype="pmcid">329130</pubid><pubid idtype="pmpid" link="fulltext">14611662</pubid></pubidlist></xrefbib></bibl><bibl id="B9"><title><p>Biologically valid linear factor models of gene expression</p></title><aug><au><snm>Girolami</snm><fnm>M</fnm></au><au><snm>Breitling</snm><fnm>R</fnm></au></aug><source>Bioinformatics</source><pubdate>2004</pubdate><volume>20</volume><fpage>3021</fpage><lpage>3033</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/bth354</pubid><pubid idtype="pmpid" link="fulltext">15201181</pubid></pubidlist></xrefbib></bibl><bibl id="B10"><title><p>Metagenes and molecular pattern discovery using matrix factorization</p></title><aug><au><snm>Brunet</snm><fnm>JP</fnm></au><au><snm>Tamayo</snm><fnm>P</fnm></au><au><snm>Golub</snm><fnm>TR</fnm></au><au><snm>Mesirov</snm><fnm>JP</fnm></au></aug><source>Proc Natl Acad Sci USA</source><pubdate>2004</pubdate><volume>101</volume><fpage>4164</fpage><lpage>4169</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1073/pnas.0308531101</pubid><pubid idtype="pmcid">384712</pubid><pubid idtype="pmpid" 
link="fulltext">15016911</pubid></pubidlist></xrefbib></bibl><bibl id="B11"><title><p>Improving molecular cancer class discovery through sparse non-negative matrix factorization</p></title><aug><au><snm>Gao</snm><fnm>Y</fnm></au><au><snm>Church</snm><fnm>G</fnm></au></aug><source>Bioinformatics</source><pubdate>2005</pubdate><volume>21</volume><fpage>3970</fpage><lpage>3975</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/bti653</pubid><pubid idtype="pmpid" link="fulltext">16244221</pubid></pubidlist></xrefbib></bibl><bibl id="B12"><title><p>Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis</p></title><aug><au><snm>Kim</snm><fnm>H</fnm></au><au><snm>Park</snm><fnm>H</fnm></au></aug><source>Bioinformatics</source><pubdate>2007</pubdate><volume>23</volume><fpage>1495</fpage><lpage>1502</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/btm134</pubid><pubid idtype="pmpid" link="fulltext">17483501</pubid></pubidlist></xrefbib></bibl><bibl id="B13"><title><p>Application of the GA/KNN method to SELDI proteomics data</p></title><aug><au><snm>Li</snm><fnm>L</fnm></au><au><snm>Umbach</snm><fnm>DM</fnm></au><au><snm>Terry</snm><fnm>P</fnm></au><au><snm>Taylor</snm><fnm>JA</fnm></au></aug><source>Bioinformatics</source><pubdate>2004</pubdate><volume>20</volume><fpage>1638</fpage><lpage>1640</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/bth098</pubid><pubid idtype="pmpid" link="fulltext">14962943</pubid></pubidlist></xrefbib></bibl><bibl id="B14"><title><p>Ovarian cancer identification based on dimensionality reduction for high-throughput mass spectrometry 
data</p></title><aug><au><snm>Yu</snm><fnm>JS</fnm></au><au><snm>Ongarello</snm><fnm>S</fnm></au><au><snm>Fiedler</snm><fnm>R</fnm></au><au><snm>Chen</snm><fnm>XW</fnm></au><au><snm>Toffolo</snm><fnm>G</fnm></au><au><snm>Cobelli</snm><fnm>C</fnm></au><au><snm>Trajanoski</snm><fnm>Z</fnm></au></aug><source>Bioinformatics</source><pubdate>2005</pubdate><volume>21</volume><fpage>2200</fpage><lpage>2209</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/bti370</pubid><pubid idtype="pmpid" link="fulltext">15784749</pubid></pubidlist></xrefbib></bibl><bibl id="B15"><title><p>Ensemble dependence model for classification and prediction of cancer and normal gene expression data</p></title><aug><au><snm>Qiu</snm><fnm>P</fnm></au><au><snm>Wang</snm><fnm>ZJ</fnm></au><au><snm>Liu</snm><fnm>RKJ</fnm></au></aug><source>Bioinformatics</source><pubdate>2005</pubdate><volume>21</volume><fpage>3114</fpage><lpage>3121</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/bti483</pubid><pubid idtype="pmpid" link="fulltext">15879455</pubid></pubidlist></xrefbib></bibl><bibl id="B16"><title><p>Capillary electrophoresis-mass spectrometry as powerful tool in biomarker discovery and clinical diagnosis: an update of recent developments</p></title><aug><au><snm>Mischak</snm><fnm>H</fnm></au><au><snm>Coon</snm><fnm>JJ</fnm></au><au><snm>Novak</snm><fnm>J</fnm></au><au><snm>Weissinger</snm><fnm>EM</fnm></au><au><snm>Schanstra</snm><fnm>J</fnm></au><au><snm>Dominiczak</snm><fnm>AF</fnm></au></aug><source>Mass Spectrom Rev</source><pubdate>2008</pubdate><volume>28</volume><fpage>703</fpage><lpage>724</lpage></bibl><bibl id="B17"><aug><au><snm>Comon</snm><fnm>P</fnm></au><au><snm>Jutten</snm><fnm>C</fnm></au></aug><source>Handbook on Blind Source Separation: Independent Component Analysis and Applications</source><publisher>Academic Press</publisher><pubdate>2010</pubdate></bibl><bibl 
id="B18"><aug><au><snm>Hyv&#228;rinen</snm><fnm>A</fnm></au><au><snm>Karhunen</snm><fnm>J</fnm></au><au><snm>Oja</snm><fnm>E</fnm></au></aug><source>Independent Component Analysis</source><publisher>Wiley Interscience</publisher><pubdate>2001</pubdate></bibl><bibl id="B19"><aug><au><snm>Cichocki</snm><fnm>A</fnm></au><au><snm>Zdunek</snm><fnm>R</fnm></au><au><snm>Phan</snm><fnm>AH</fnm></au><au><snm>Amari</snm><fnm>SI</fnm></au></aug><source>Nonnegative Matrix and Tensor Factorizations - Applications to Exploratory Multi-way Data Analysis and Blind Source Separation</source><publisher>Chichester: John Wiley</publisher><pubdate>2009</pubdate></bibl><bibl id="B20"><title><p>A fast fixed-point algorithm for independent component analysis</p></title><aug><au><snm>Hyv&#228;rinen</snm><fnm>A</fnm></au><au><snm>Oja</snm><fnm>E</fnm></au></aug><source>Neural Computation</source><pubdate>1997</pubdate><volume>9</volume><fpage>1483</fpage><lpage>1492</lpage><xrefbib><pubid idtype="doi">10.1162/neco.1997.9.7.1483</pubid></xrefbib></bibl><bibl id="B21"><title><p>Urine in clinical proteomics</p></title><aug><au><snm>Decramer</snm><fnm>S</fnm></au><au><snm>Gonzalez de Peredo</snm><fnm>A</fnm></au><au><snm>Breuil</snm><fnm>B</fnm></au><au><snm>Mischak</snm><fnm>H</fnm></au><au><snm>Monsarrat</snm><fnm>B</fnm></au><au><snm>Bascands</snm><fnm>JL</fnm></au><au><snm>Schanstra</snm><fnm>JP</fnm></au></aug><source>Mol Cell Proteomics</source><pubdate>2008</pubdate><volume>7</volume><fpage>1850</fpage><lpage>1862</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1074/mcp.R800001-MCP200</pubid><pubid idtype="pmpid" link="fulltext">18667409</pubid></pubidlist></xrefbib></bibl><bibl id="B22"><title><p>Blind separation of analytes in nuclear magnetic resonance spectroscopy and mass spectrometry: sparseness-based robust multicomponent analysis</p></title><aug><au><snm>Kopriva</snm><fnm>I</fnm></au><au><snm>Jeric</snm><fnm>I</fnm></au></aug><source>Analytical 
Chemistry</source><pubdate>2010</pubdate><volume>82</volume><fpage>1911</fpage><lpage>1920</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1021/ac902640y</pubid><pubid idtype="pmpid" link="fulltext">20131872</pubid></pubidlist></xrefbib></bibl><bibl id="B23"><title><p>Multi-component Analysis: Blind Extraction of Pure Components Mass Spectra using Sparse Component Analysis</p></title><aug><au><snm>Kopriva</snm><fnm>I</fnm></au><au><snm>Jeri&#263;</snm><fnm>I</fnm></au></aug><source>Journal of Mass Spectrometry</source><pubdate>2009</pubdate><volume>44</volume><fpage>1378</fpage><lpage>1388</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1002/jms.1627</pubid><pubid idtype="pmpid" link="fulltext">19670286</pubid></pubidlist></xrefbib></bibl><bibl id="B24"><title><p>A fast algorithm for estimating overcomplete ICA bases for image windows</p></title><aug><au><snm>Hyv&#228;rinen</snm><fnm>A</fnm></au><au><snm>Cristescu</snm><fnm>R</fnm></au><au><snm>Oja</snm><fnm>E</fnm></au></aug><source>Proc Int Joint Conf On Neural Networks</source><publisher>Washington DC, USA</publisher><pubdate>1999</pubdate><fpage>894</fpage><lpage>899</lpage></bibl><bibl id="B25"><title><p>Learning overcomplete representations</p></title><aug><au><snm>Lewicki</snm><fnm>M</fnm></au><au><snm>Sejnowski</snm><fnm>TJ</fnm></au></aug><source>Neural Comput</source><pubdate>2000</pubdate><volume>12</volume><fpage>337</fpage><lpage>365</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1162/089976600300015826</pubid><pubid idtype="pmpid">10636946</pubid></pubidlist></xrefbib></bibl><bibl id="B26"><title><p>Underdetermined blind source separation using sparse representations</p></title><aug><au><snm>Bofill</snm><fnm>P</fnm></au><au><snm>Zibulevsky</snm><fnm>M</fnm></au></aug><source>Signal Proc</source><pubdate>2001</pubdate><volume>81</volume><fpage>2353</fpage><lpage>2362</lpage><xrefbib><pubid idtype="doi">10.1016/S0165-1684(01)00120-7</pubid></xrefbib></bibl><bibl id="B27"><title><p>Sparse 
component analysis and blind source separation of underdetermined mixtures</p></title><aug><au><snm>Georgiev</snm><fnm>P</fnm></au><au><snm>Theis</snm><fnm>F</fnm></au><au><snm>Cichocki</snm><fnm>A</fnm></au></aug><source>IEEE Trans Neural Net</source><pubdate>2005</pubdate><volume>16</volume><fpage>992</fpage><lpage>996</lpage><xrefbib><pubid idtype="doi">10.1109/TNN.2005.849840</pubid></xrefbib></bibl><bibl id="B28"><title><p>Analysis of Sparse Representation and Blind Source Separation</p></title><aug><au><snm>Li</snm><fnm>Y</fnm></au><au><snm>Cichocki</snm><fnm>A</fnm></au><au><snm>Amari</snm><fnm>S</fnm></au></aug><source>Neural Comput</source><pubdate>2004</pubdate><volume>16</volume><fpage>1193</fpage><lpage>1234</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1162/089976604773717586</pubid><pubid idtype="pmpid" link="fulltext">15130247</pubid></pubidlist></xrefbib></bibl><bibl id="B29"><title><p>Underdetermined Blind Source Separation Based on Sparse Representation</p></title><aug><au><snm>Li</snm><fnm>Y</fnm></au><au><snm>Amari</snm><fnm>S</fnm></au><au><snm>Cichocki</snm><fnm>A</fnm></au><au><snm>Ho</snm><fnm>DWC</fnm></au><au><snm>Xie</snm><fnm>S</fnm></au></aug><source>IEEE Trans Signal Process</source><pubdate>2006</pubdate><volume>54</volume><fpage>423</fpage><lpage>437</lpage></bibl><bibl id="B30"><title><p>Hierarchical ALS Algorithms for Nonnegative Matrix Factorization and 3D Tensor Factorization</p></title><aug><au><snm>Cichocki</snm><fnm>A</fnm></au><au><snm>Zdunek</snm><fnm>R</fnm></au><au><snm>Amari</snm><fnm>SI</fnm></au></aug><source>LNCS</source><pubdate>2007</pubdate><volume>4666</volume><fpage>169</fpage><lpage>176</lpage></bibl><bibl id="B31"><title><p>Blind decomposition of low-dimensional multi-spectral image by sparse component analysis</p></title><aug><au><snm>Kopriva</snm><fnm>I</fnm></au><au><snm>Cichocki</snm><fnm>A</fnm></au></aug><source>J of 
Chemometrics</source><pubdate>2009</pubdate><volume>23</volume><fpage>590</fpage><lpage>597</lpage><xrefbib><pubid idtype="doi">10.1002/cem.1257</pubid></xrefbib></bibl><bibl id="B32"><title><p>Non-negative matrix factorization with sparseness constraints</p></title><aug><au><snm>Hoyer</snm><fnm>PO</fnm></au></aug><source>Journal of Machine Learning Research</source><pubdate>2004</pubdate><volume>5</volume><fpage>1457</fpage><lpage>1469</lpage></bibl><bibl id="B33"><title><p>An algorithm for mixing matrix estimation in instantaneous blind source separation</p></title><aug><au><snm>Reju</snm><fnm>VG</fnm></au><au><snm>Koh</snm><fnm>SN</fnm></au><au><snm>Soon</snm><fnm>IY</fnm></au></aug><source>Signal Proc</source><pubdate>2009</pubdate><volume>89</volume><fpage>1762</fpage><lpage>1773</lpage><xrefbib><pubid idtype="doi">10.1016/j.sigpro.2009.03.017</pubid></xrefbib></bibl><bibl id="B34"><title><p>Underdetermined Blind Source Separation Based on Subspace Representation</p></title><aug><au><snm>Kim</snm><fnm>SG</fnm></au><au><snm>Yoo</snm><fnm>CD</fnm></au></aug><source>IEEE Trans Sig Proc</source><pubdate>2009</pubdate><volume>57</volume><fpage>2604</fpage><lpage>2614</lpage></bibl><bibl id="B35"><title><p>Estimating the mixing matrix in Sparse Component Analysis (SCA) based on partial <it>k</it>-dimensional subspace clustering</p></title><aug><au><snm>Naini</snm><fnm>FM</fnm></au><au><snm>Mohimani</snm><fnm>GH</fnm></au><au><snm>Babaie-Zadeh</snm><fnm>M</fnm></au><au><snm>Jutten</snm><fnm>C</fnm></au></aug><source>Neurocomputing</source><pubdate>2008</pubdate><volume>71</volume><fpage>2330</fpage><lpage>2343</lpage><xrefbib><pubid idtype="doi">10.1016/j.neucom.2007.07.035</pubid></xrefbib></bibl><bibl id="B36"><title><p>Regression shrinkage and selection via the lasso</p></title><aug><au><snm>Tibshirani</snm><fnm>R</fnm></au></aug><source>J Royal Statist Soc B</source><pubdate>1996</pubdate><volume>58</volume><fpage>267</fpage><lpage>288</lpage></bibl><bibl 
id="B37"><title><p>Computational Methods for Sparse Solution of Linear Inverse Problems</p></title><aug><au><snm>Tropp</snm><fnm>JA</fnm></au><au><snm>Wright</snm><fnm>SJ</fnm></au></aug><source>Proc of the IEEE</source><pubdate>2010</pubdate><volume>98</volume><fpage>948</fpage><lpage>958</lpage></bibl><bibl id="B38"><title><p>A fast iterative shrinkage-thresholding algorithm for linear inverse problems</p></title><aug><au><snm>Beck</snm><fnm>A</fnm></au><au><snm>Teboulle</snm><fnm>M</fnm></au></aug><source>SIAM J on Imag Sci</source><pubdate>2009</pubdate><volume>2</volume><fpage>183</fpage><lpage>202</lpage><xrefbib><pubid idtype="doi">10.1137/080716542</pubid></xrefbib></bibl><bibl id="B39"><title><p>Selected publications list of professor Amir Beck</p></title><url>http://ie.technion.ac.il/Home/Users/becka.html</url></bibl><bibl id="B40"><aug><au><snm>Kecman</snm><fnm>V</fnm></au></aug><source>Learning and Soft Computing - Support Vector Machines, Neural Networks and Fuzzy Logic Models</source><publisher>The MIT Press</publisher><pubdate>2001</pubdate></bibl><bibl id="B41"><aug><au><snm>Hastie</snm><fnm>T</fnm></au><au><snm>Tibshirani</snm><fnm>R</fnm></au><au><snm>Friedman</snm><fnm>J</fnm></au></aug><source>The Elements of Statistical Learning: Data Mining, Inference, and Prediction</source><publisher>Springer</publisher><edition>3</edition><pubdate>2009</pubdate><fpage>649</fpage><lpage>698</lpage></bibl><bibl id="B42"><title><p>Use of proteomic patterns in serum to identify ovarian cancer</p></title><aug><au><snm>Petricoin</snm><fnm>EF</fnm></au><au><snm>Ardekani</snm><fnm>AM</fnm></au><au><snm>Hitt</snm><fnm>BA</fnm></au><au><snm>Levine</snm><fnm>PJ</fnm></au><au><snm>Fusaro</snm><fnm>VA</fnm></au><au><snm>Steinberg</snm><fnm>SM</fnm></au><au><snm>Mills</snm><fnm>GB</fnm></au><au><snm>Simone</snm><fnm>C</fnm></au><au><snm>Fishman</snm><fnm>DA</fnm></au><au><snm>Kohn</snm><fnm>EC</fnm></au><au><snm>Liotta</snm><fnm>LA</fnm></au></aug><source>The 
Lancet</source><pubdate>2002</pubdate><volume>359</volume><fpage>572</fpage><lpage>577</lpage><xrefbib><pubid idtype="doi">10.1016/S0140-6736(02)07746-2</pubid></xrefbib></bibl><bibl id="B43"><title><p>National Cancer Institute clinical proteomics program</p></title><url>http://home.ccr.cancer.gov/ncifdaproteomics/ppatterns.asp</url></bibl><bibl id="B44"><title><p>Fuzzy rule based classifier fusion for protein mass spectra based ovarian cancer diagnosis</p></title><aug><au><snm>Assareh</snm><fnm>A</fnm></au><au><snm>Volkert</snm><fnm>LG</fnm></au></aug><source>Proceedings of the 2009 IEEE Symposium Computational Intelligence in Bioinformatics and Computational Biology (CIBCB&apos;09)</source><pubdate>2009</pubdate><fpage>193</fpage><lpage>199</lpage></bibl><bibl id="B45"><title><p>A clustering based hybrid system for biomarker selection and sample classification of mass spectrometry data</p></title><aug><au><snm>Yang</snm><fnm>P</fnm></au><au><snm>Zhang</snm><fnm>Z</fnm></au><au><snm>Zhou</snm><fnm>BB</fnm></au><au><snm>Zomaya</snm><fnm>AY</fnm></au></aug><source>Neurocomputing</source><pubdate>2010</pubdate><volume>73</volume><fpage>2317</fpage><lpage>2331</lpage><xrefbib><pubid idtype="doi">10.1016/j.neucom.2010.02.022</pubid></xrefbib></bibl><bibl id="B46"><title><p>Serum proteomic patterns for detection of prostate 
cancer</p></title><aug><au><snm>Petricoin</snm><fnm>EF</fnm></au><au><snm>Ornstein</snm><fnm>DK</fnm></au><au><snm>Paweletz</snm><fnm>CP</fnm></au><au><snm>Ardekani</snm><fnm>A</fnm></au><au><snm>Hackett</snm><fnm>PS</fnm></au><au><snm>Hitt</snm><fnm>BA</fnm></au><au><snm>Velassco</snm><fnm>A</fnm></au><au><snm>Trucco</snm><fnm>C</fnm></au><au><snm>Wiegand</snm><fnm>L</fnm></au><au><snm>Wood</snm><fnm>K</fnm></au><au><snm>Simone</snm><fnm>CB</fnm></au><au><snm>Levine</snm><fnm>PJ</fnm></au><au><snm>Linehan</snm><fnm>WM</fnm></au><au><snm>Emmert-Buck</snm><fnm>MR</fnm></au><au><snm>Steinberg</snm><fnm>SM</fnm></au><au><snm>Kohn</snm><fnm>EC</fnm></au><au><snm>Liotta</snm><fnm>LA</fnm></au></aug><source>J Natl Canc Institute</source><pubdate>2002</pubdate><volume>94</volume><fpage>1576</fpage><lpage>1578</lpage><xrefbib><pubid idtype="doi">10.1093/jnci/94.20.1576</pubid></xrefbib></bibl><bibl id="B47"><title><p>Mass spectrometry-based proteomic pattern analysis for prostate cancer detection using neural networks with statistical significance test-based feature selection</p></title><aug><au><snm>Xu</snm><fnm>Q</fnm></au><au><snm>Mohamed</snm><fnm>SS</fnm></au><au><snm>Salama</snm><fnm>MMA</fnm></au><au><snm>Kamel</snm><fnm>M</fnm></au></aug><source>Proceedings of the 2009 IEEE Conference Science and Technology for Humanity (TIC-STH)</source><pubdate>2009</pubdate><fpage>837</fpage><lpage>842</lpage></bibl><bibl id="B48"><title><p>Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays</p></title><aug><au><snm>Alon</snm><fnm>U</fnm></au><au><snm>Barkai</snm><fnm>N</fnm></au><au><snm>Notterman</snm><fnm>DA</fnm></au><au><snm>Gish</snm><fnm>K</fnm></au><au><snm>Ybarra</snm><fnm>S</fnm></au><au><snm>Mack</snm><fnm>D</fnm></au><au><snm>Levine</snm><fnm>AJ</fnm></au></aug><source>Proc Natl Acad Sci 
USA</source><pubdate>1999</pubdate><volume>96</volume><fpage>6745</fpage><lpage>6750</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1073/pnas.96.12.6745</pubid><pubid idtype="pmcid">21986</pubid><pubid idtype="pmpid" link="fulltext">10359783</pubid></pubidlist></xrefbib></bibl><bibl id="B49"><title><p>Data pertaining to the article 'Broad patterns of gene expression revealed by clustering of tumor and normal colon tissues probed by oligonucleotide arrays'</p></title><url>http://genomics-pubs.princeton.edu/oncology/affydata/index.html</url></bibl><bibl id="B50"><title><p>Selection bias in gene extraction on the basis of microarray gene-expression data</p></title><aug><au><snm>Ambroise</snm><fnm>C</fnm></au><au><snm>McLachlan</snm><fnm>GJ</fnm></au></aug><source>Proc Natl Acad Sci USA</source><pubdate>2002</pubdate><volume>99</volume><fpage>6562</fpage><lpage>6566</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1073/pnas.102102699</pubid><pubid idtype="pmcid">124442</pubid><pubid idtype="pmpid" link="fulltext">11983868</pubid></pubidlist></xrefbib></bibl><bibl id="B51"><title><p>Gene extraction for cancer diagnosis using support vector machines</p></title><aug><au><snm>Huang</snm><fnm>TM</fnm></au><au><snm>Kecman</snm><fnm>V</fnm></au></aug><source>Artificial Intelligence in Medicine</source><pubdate>2005</pubdate><volume>35</volume><fpage>185</fpage><lpage>194</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1016/j.artmed.2005.01.006</pubid><pubid idtype="pmpid" link="fulltext">16026974</pubid></pubidlist></xrefbib></bibl></refgrp>
</bm></art>