Abstract
Background
Bioinformatics data analysis often relies on a linear mixture model that represents samples as additive mixtures of components. Properly constrained blind matrix factorization methods can extract those components from the mixture samples alone. However, automatic selection of the extracted components to be retained for classification analysis remains an open issue.
Results
The method proposed here is applied to well-studied protein and genomic datasets of ovarian, prostate and colon cancers to extract components for disease prediction. It achieves average sensitivities of 96.2% (sd = 2.7%), 97.6% (sd = 2.8%) and 90.8% (sd = 5.5%), and average specificities of 93.6% (sd = 4.1%), 99% (sd = 2.2%) and 79.4% (sd = 9.8%), in 100 independent two-fold cross-validations.
Conclusions
We propose an additive mixture model of a sample for feature extraction using, in principle, sparseness-constrained factorization on a sample-by-sample basis. In contrast, existing methods factorize the complete dataset simultaneously. The sample model is composed of a reference sample representing the control and/or case (disease) group and a test sample. Each sample is decomposed into two or more components that are selected automatically (without using label information) as control specific, case specific and not differentially expressed (neutral). The number of components is determined by cross-validation. Automatic assignment of features (m/z ratios or genes) to a particular component is based on thresholds estimated from each sample directly. Due to the locality of the decomposition, the strength of expression of each feature can vary across the samples, yet features will still be allocated to the related disease and/or control specific component. Since label information is not used in the selection process, case and control specific components can be used for classification; that is not the case with standard factorization methods. Moreover, the component selected by the proposed method as disease specific can be interpreted as a sub-mode and retained for further analysis to identify potential biomarkers. As opposed to standard matrix factorization methods, this can be achieved on a sample (experiment)-by-sample basis. Postulating one or more components with indifferent features enables their removal from the disease and control specific components on a sample-by-sample basis. This yields selected components with reduced complexity and, generally, increases prediction accuracy.
Background
Bioinformatics data analysis is often based on the use of a linear mixture model (LMM) of a sample [1-15], where the mixture is composed of components generated by an unknown number of interfering sources. As an example, components can be generated during disease progression that causes cancerous cells to produce proteins and/or other molecules that can serve as early indicators (biomarkers) representing disease-correlated chemical entities. Their correct identification may be very beneficial for early detection and diagnosis of disease [16]. However, identification of individual components within a sample is complicated by the fact that they can be "buried" within multiple substances. In addition, the dynamic range of their concentrations can vary over several orders of magnitude [16], i.e., single components may no longer be recognizable [1]. Nevertheless, there are algorithms able to extract either individual components or groups of components with similar concentrations within a sample. These algorithms are known under the name blind source separation (BSS) [17], and they commonly include independent component analysis (ICA) [18] and non-negative matrix factorization (NMF) [19]. However, BSS methods perform unsupervised decomposition of the mixture samples. Thus, it is not clear which of the extracted components are to be retained for further prediction/classification analysis. To this end, several contributions toward the solution of this problem have been published in [1-5,8]. In [1], a matrix factorization approach to the decomposition of infrared spectra of a sample is proposed that takes class labels into account, i.e., the classification phase and the component inference tasks are unified. Thus, the concept proposed in [1] is classifier specific. It is formulated as a multiclass assignment problem where the number of components equals the number of classes and must be less than the number of samples available.
As opposed to [1], the method proposed here automatically selects the case and control specific components on a sample-by-sample basis. Afterwards, these components can be used to train an arbitrary classifier. In [2] a gene expression profile is modelled as a linear superposition of three components comprised of up-regulated, down-regulated and not differentially expressed genes, whereby the existence of two fixed thresholds is assumed to enable a decision on which of the three components a particular gene belongs to. The thresholds are defined heuristically, and in each specific case the optimal value must be obtained by cross-validation. Moreover, the upper threshold c_u and the lower one c_l are mutually related through c_u = 1/c_l. As opposed to that, the method proposed here decomposes each sample (experiment) into components comprised of up-regulated, down-regulated and not differentially expressed features using data-adaptive thresholds. These are based on the mixing angles of an innovative linear mixture model of a sample. The method proposed in [3] uses available sample labels (the clinical diagnosis of the experiments) to select component(s), extracted by independent component analysis (ICA) or non-negative matrix factorization (NMF), for further analysis. ICA or NMF is used to factorize the whole dataset simultaneously, and one selected component (gene expression mode for ICA and metagene for NMF) is used for further analysis related to gene marker extraction. This component cannot be used for classification. Alternatively, the basis matrix with labelled column vectors (for ICA) or row vectors (for NMF) can be used for classification, in which case the test sample needs to be projected onto the space spanned by the column/row vectors, respectively. However, in this case no feature extraction can be performed. As opposed to the ICA/NMF method proposed in [3], the method proposed here extracts disease and control specific components from each sample separately.
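For concreteness, the fixed-threshold scheme of [2] described above can be sketched as follows. This is a minimal illustration, not the authors' implementation; the gene names and fold-change ratios are hypothetical, and c_l is derived from c_u as c_l = 1/c_u:

```python
def partition_by_threshold(ratios, c_u):
    """Split genes into up-regulated, down-regulated and neutral groups
    using a fixed upper threshold c_u and lower threshold c_l = 1/c_u,
    in the spirit of the scheme of [2]."""
    c_l = 1.0 / c_u
    up, down, neutral = [], [], []
    for gene, r in ratios.items():
        if r >= c_u:
            up.append(gene)        # prevailing in case (disease) samples
        elif r <= c_l:
            down.append(gene)      # prevailing in control samples
        else:
            neutral.append(gene)   # not differentially expressed
    return up, down, neutral

# Hypothetical case-vs-control fold-change ratios for five genes.
ratios = {"g1": 3.1, "g2": 0.2, "g3": 1.1, "g4": 2.5, "g5": 0.9}
up, down, neutral = partition_by_threshold(ratios, c_u=2.5)
```

In this scheme a single threshold pair governs the whole dataset, which is exactly the rigidity that the data-adaptive, per-sample thresholds proposed here are meant to avoid.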
Since no label information is used in the selection process, the extracted components can be used for classification, and that is the goal in this paper. The disease specific component can, however, also be retained for further biomarker-related analysis as in [3]. The important difference is that with the method proposed here such a component can be obtained from each sample separately, while the method in [3], as well as those in [4,5,8], needs the whole dataset. The method [4] again uses ICA (the FastICA algorithm [20]) to factorize the microarray dataset. Extracted components (gene expression modes) were analyzed to discriminate between those with biological significance and those representing noise. However, biologically significant components can be used for further gene marker related analysis but not for classification. The reason is that, as in [3], the whole dataset composed of case and control samples is reduced to several biologically interesting components only. In the extreme case there can be only one such component. In [5] the JADE ICA algorithm is used to decompose the whole dataset into components (gene expression modes). As in [3,4], these components cannot be used for classification. They are used for further decomposition into sub-modes to identify a regulating network in the problem considered there. We want to emphasize that the component selected as disease specific by the method proposed here can also be interpreted as a sub-mode and used for a similar type of analysis. However, since it is extracted from an individual and labelled sample, it can be used for classification as well. That is the main goal in this paper. The method in [8] again uses ICA (maximum likelihood with natural gradient [18]) to extract components (gene expression modes). Similarly, as in [3-5], these components are not used for classification. Instead, they are further analyzed by data clustering to determine biological relevance and extract gene markers.
Similar comments to those discussed in relation to [3-5,8] can also be raised about other methods that use either ICA or NMF to extract components from the whole dataset [6,7,10-12]. Hence, although related to the component selection methods [1,3-5,8], the method proposed here differs from all of them in that it extracts the most interesting components on a sample (experiment)-by-sample basis. To achieve this, the linear mixture model (LMM) used for component extraction is composed of a test sample and a reference sample representing the control and/or case group. Hence, a test sample is, in principle, associated with two LMMs. Each LMM describes a sample as an additive mixture of two or more components. Two of them are selected automatically (no thresholds need to be predefined) as case (disease) and control specific, while the rest are considered neutral, i.e. not differentially expressed. Decomposition of each LMM is enabled by enforcing a sparseness constraint on the components to be extracted. This implies that each feature (m/z ratio or gene) belongs to at most two components (disease and neutral, or control and neutral). The model formally presumes that disease specific features are present in prevailing concentration in disease samples, and that control specific features are present in prevailing concentration in control samples. However, the features do not have to be expressed equally strongly across the whole dataset in order to be selected as part of the disease or case specific components, because the decomposition is performed locally (on a sample-by-sample basis). This should prevent losing some features important for classification. Accordingly, the level of expression of indifferent features can also vary between the samples. Thus, postulating one or more components with indifferent features enables their removal in a sample-adaptive manner.
As opposed to that, existing methods try to optimize a single threshold for a whole dataset. Geometric interpretation of the LMM based on a reference sample enables automatic selection of the disease and control specific components (Figure 1 in section 1.2) without using label information. Hence, the selected components can be further used for disease prediction. By postulating the existence of one or more components with not differentially expressed features, the complexity of the selected components can be controlled (see discussion in section 1.7), whereas the overall number of components is selected by cross-validation. Although feature selection is the main goal of the proposed method, the component extracted from a sample as disease specific can also be interpreted as a sub-mode, as in [3,4], and can be used for further biomarker identification related analysis. We see the linearity of the model used to describe a sample as a potential limitation of the proposed method. Although linear models dominate in bioinformatics, it has been discussed in [8] that nonlinear models might be a more accurate description of biological processes. The assumption of the availability of a reference sample might also be seen as a potential weakness. Yet, we have demonstrated that in the absence of expert information the reference sample can be obtained as a simple average of all the samples within the same class. The proposed method is demonstrated in sections 1.4 to 1.7 on disease prediction problems using a computational model as well as on experimental datasets related to the prediction of ovarian, prostate and colon cancers from protein and gene expression profiles.
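The mixing-angle geometry underlying the data-adaptive thresholds can be illustrated with a small sketch. This is a simplified reading of Figure 1, not the exact selection rule of section 1.3; the intensity values are hypothetical:

```python
import math

def mixing_angles(x_ref, x_test):
    """Angle of each feature's concentration pair (x_ref[k], x_test[k])
    in the first quadrant of the mixture plane: angles near 90 degrees
    indicate features dominated by the test sample, angles near 0 degrees
    indicate features dominated by the reference sample, and intermediate
    angles indicate features that are not differentially expressed."""
    return [math.degrees(math.atan2(t, r)) for r, t in zip(x_ref, x_test)]

# Hypothetical intensities of four features in a reference and a test sample.
x_ref  = [5.0, 0.1, 1.0, 2.0]
x_test = [0.1, 5.0, 1.0, 2.1]
angles = mixing_angles(x_ref, x_test)
```

Because the angles are computed from each (reference, test) pair directly, the resulting thresholds adapt to the sample at hand rather than being fixed for the whole dataset.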
Figure 1. Geometrical interpretation of the linear mixture model. Concentration vectors of the linear mixture model comprised of a control reference sample and a test sample, (2a) and Figure 1a, i.e. a disease reference sample and a test sample, (2b) and Figure 1b, are confined to the first quadrant of the plane spanned by the two mixture samples. Features (m/z ratios or genes) with prevailing concentration in the disease sample are linearly combined into the component associated with the red relative concentration vector. Likewise, features with prevailing concentration in the control sample are linearly combined into the component associated with the blue relative concentration vector. Features that are not differentially expressed are linearly combined into one or more neutral components associated with the green relative concentration vectors.
Methods
This section derives a sparse component analysis (SCA) approach to unsupervised decomposition of protein (mass spectra) and gene expression profiles using a novel mixture model of a sample. The model enables automatic selection of two of the extracted components as case and control specific. They are retained for classification. In what follows, the problem motivation and definition are presented first. Then, the LMM of a sample is introduced and its interpretation is described. Afterwards, a two-stage implementation of the SCA algorithm is described and discussed in detail.
1.1 Problem formulation
As mentioned previously, bioinformatics problems often deal with data containing components that are imprinted in a sample by several interfering sources. As an example, a brief description of the endocrine signalling system, which secretes hormones into the blood stream, is given in [1]. Likewise, reference [21] describes how different organs imprint their substances (metabolites) into a urine sample. As pointed out in [1] and [16], disease samples are combinations of several co-regulated components (signals) originating from different sources (organs), and the disease specific component is actually "buried" within a sample. Hence, we are dealing with two problems simultaneously: a sample decomposition (component inference) problem and a classification (disease prediction) problem that is based on the sample decomposition. Thus, automatic selection of one or more extracted components is of practical importance. It is also important that component selection is done without the use of label information, in which case the selected components can be used for classification.
Matrix factorization is conveniently used in signal processing to solve decomposition problems [17-19]. It is assumed that the data matrix X ∈ ℝ^{N × K} is comprised of N row vectors representing mixture samples, where each sample is further comprised of K features (m/z ratios or genes). It is also assumed that the N samples are labelled:

X = AS     (1)

where A ∈ ℝ^{N × M} and S ∈ ℝ^{M × K}, and M represents an unknown number of components present in a sample. Each component
1.2 Novel additive linear mixture model of a sample
The LMM (1) is widely used in various bioinformatics problems [1-15]. Unless constraints are imposed on A and/or S, the matrix factorization implied by (1) is not unique. Typical constraints involve non-Gaussianity and statistical independence between components for ICA algorithms [6,18], and non-negativity and sparseness constraints for NMF algorithms [7,11,12,19,22,23]. In addition, many ICA algorithms, as well as many NMF algorithms, also require the unknown number of components M to be less than or equal to the number of mixture samples N.
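The factorization (1), with the non-negativity and sparseness constraints just mentioned, can be illustrated at the level of matrix shapes. This is a shape check with random matrices, not a decomposition algorithm; here M < N (the determined case), whereas the two-sample model proposed below is deliberately underdetermined:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 10, 3, 100          # samples, components, features (illustrative sizes)
A = rng.random((N, M))        # non-negative mixing/concentration matrix
S = rng.random((M, K))        # non-negative component matrix
S[S < 0.7] = 0.0              # impose sparseness: most entries are exactly zero
X = A @ S                     # each mixture sample is an additive mixture of components
```

Each row of X is a sample, and each row of S collects the features (m/z ratios or genes) belonging to one component; the sparseness of S encodes the assumption that a feature participates in few components.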
Depending on the context, this constraint can be considered restrictive. There are, however, ICA methods developed for the solution of underdetermined problems that are known as overcomplete ICA; see Chapter 16 in [18], as well as [24,25]. However, as discussed in detail in [18], overcomplete ICA methods also assume that the unknown components are sparse. Two exemplary overcomplete ICA methods based on the sparseness assumption are described in [24] and [25]. In [24] it is assumed that components are sparse and approximately uncorrelated ("quasi-uncorrelated"). This basically means that each feature belongs to one component only. That is an even stronger assumption than the one used by the method proposed here. Likewise, in the maximum likelihood (ML) approach to the overcomplete problem in [25] it is assumed that the marginal distributions of the components are Laplacian. In this case the component estimation problem (assuming the mixing matrix is estimated by clustering) is reduced to a linear program with an equality constraint. In other words, a probabilistic ML problem is converted into a deterministic linear programming task. Hence, overcomplete ICA effectively becomes SCA. This further justifies our choice of the state-of-the-art SCA method (described in section 1.3) to be used in the component extraction task. Here, we propose a novel type of LMM which is composed of two samples only:
The first sample is a reference sample representing the control group, x_{control} ∈ ℝ^{K}, in (2a) or the case (disease) group, x_{disease} ∈ ℝ^{K}, in (2b). The second sample is the actual test sample:
For disease prediction, the disease and control specific components can be used to train a classifier. The reason is that in each LMM (2a)/(2b) they are extracted with respect to a different reference sample and thus carry different but specific information. Hence, the proposed method yields four components to be retained for classifier training. In accordance with Figure 1 they are denoted as
Selection of the unknown number of components M is generally a nontrivial problem in matrix factorization and is part of the model validation procedure. M is selected through cross-validation and postulated to be 2, 3, 4 or 5, because it directly determines the number of features used for classification. This follows from the previously described interpretation of the LMMs (2a) and (2b). Since disease prediction is based on the four components selected as disease and control specific, it is important that they are composed of features with high discriminative power. This means that they should contain features which are truly disease or control specific. The component considered here as disease or control specific (as well as neutral) can actually be composed of features (m/z ratios or genes) belonging to multiple substances (metabolites, analytes) that share similar relative concentrations. This is practically important since it makes the decomposition much less sensitive to an underestimation of the true total number of substances present in a sample. By setting the number of substances to a predefined value M, the proposed method forces substances with similar concentrations to be linearly combined into one more complex component composed of disease, neutral or control specific features. Provided that the concentration variability of these features across the samples is small, it would suffice to select the overall number of components as M = 3 or even M = 2. (In the latter case, the existence of not differentially expressed features is not postulated at all.) However, since we are dealing with biological samples, it is more realistic to expect that relative concentrations could vary across the sample population. This is illustrated in Figures 1a and 1b by ellipsoids around the vectors that represent the average concentration profiles of each group of features (components).
As seen from Figure 1, some features considered neutral can, in a certain number of samples, be present in a concentration that prevails over that of the features considered disease (or control) specific in the majority of samples. To partially remove such features from the disease and/or control specific components, the unknown number of components M should be increased to M = 4 or perhaps even to M = 5. Thus, the existence of two or three neutral components should be postulated. This is expected to yield less complex disease and control specific components, which is in agreement with the principle of parsimony (see also the discussion in section 1.7). Model validation presented in section 1.4 suggests that this indeed is the case when concentration variability across the samples is significant. When it comes to real-world datasets, the number of components will not be known in advance. The strategy to deal with this uncertainty is to use cross-validation and to verify whether an increased number of components M indeed contributes to increased accuracy in disease prediction.
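The cross-validation selection of M (together with the parameters Δθ and λ introduced in section 1.3) amounts to a grid search. The sketch below is a hypothetical scaffold only: `cv_accuracy` is a toy deterministic stand-in for the real scoring step (decompose the training samples with M components, train an SVM on the selected components, and return the two-fold cross-validation accuracy):

```python
import itertools

def cv_accuracy(M, dtheta_deg, lam_scale):
    """Hypothetical stand-in for the real scoring step; in the actual
    method this would return a cross-validated prediction accuracy.
    The surrogate below is an arbitrary deterministic function so the
    grid search is runnable; it does not reflect real results."""
    return 0.9 - 0.01 * abs(M - 4) - 0.001 * dtheta_deg

grid = itertools.product(
    [2, 3, 4, 5],             # postulated number of components M
    [1, 3, 5],                # delta-theta, in degrees
    [1e-2, 1e-4, 1e-6],       # lambda as a fraction of lambda_max
)
best = max(grid, key=lambda params: cv_accuracy(*params))
```

The winning parameter triple is then used to decompose the test samples and train the final classifier.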
1.3 Sparse component analysis algorithm
The proposed feature extraction/component selection method is based on a decomposition of the LMMs (2a)/(2b), comprised of two samples (a reference sample and a test sample), into M ≥ 2 components. From the BSS point of view this yields a determined BSS problem when M = 2 and an underdetermined BSS problem when M ≥ 3 [26,27; Chapter 10 in 17]. The enabling constraint for solving underdetermined BSS problems is sparseness of the components, and the methods are known under the common name of sparse component analysis (SCA) [26-29; Chapter 10 in 17]. As commented at the beginning of section 1.2, overcomplete ICA [Chapter 16 in 18; 24,25] basically reduces to SCA and also demands sparse sources. SCA has already been applied to microarray data analysis in [3,6,7,11,12]. It has also been used in [22,23] to extract more than two components from two mixture samples of nuclear magnetic resonance and mass spectra. A sparseness constraint implies that each particular feature point k = 1,...,K (m/z ratio or gene) belongs to a few components only. To this end, for the two-sample LMMs (2a)/(2b) used here, it is assumed that each feature point belongs to at most two components: either disease specific and neutral, or control specific and neutral. From the viewpoint of biology, the plausibility of this assumption has been elaborated above.
Algorithmic approaches used to solve the underdetermined BSS problem associated with (2a)/(2b) belong to two main categories: (i) estimating the concentration/mixing matrix and the component matrix simultaneously by minimizing data fidelity terms
where
Thus, in practice, Δθ needs to be selected through cross-validation. In the experiments described in sections 1.4 to 1.7, Δθ has been selected from the set of radians equivalent to {1°, 3°, 5°}, together with the postulated number of components M and a regularization parameter related to the sparseness constraint imposed on S_{control} and S_{disease} (see eq. (3) below). Hierarchical clustering implemented by the MATLAB clusterdata command (with the 'cosine' distance metric and 'complete' linkage option) has been used to cluster the set of the J selected feature points with single-component belonging. The number of clusters has been set in advance to equal the postulated number of components M. Cluster centres represent the estimated concentration vectors
where the hat sign denotes estimates of the model variables A_{control}/A_{disease} and S_{control}/S_{disease}. Problems (3) relate to the sparseness constrained solution of underdetermined systems of linear equations. For the decomposition of gene expression profiles, a non-negativity constraint is additionally imposed on S: S ≥ 0. Problem (3) can be solved by the LASSO algorithm [36] or by some other solver for underdetermined systems of linear equations [37]. Here, for problem (3), we have used an iterative shrinkage thresholding (IST) type of method [38], with MATLAB code available at [39]. This approach has been shown to be fast, and it can easily be implemented in batch mode as in (3a)/(3b), i.e. solving all K systems of equations simultaneously. In relation to standard IST methods, the method [38] has a guaranteed better global rate of convergence. In addition, through the effect of the iterations, it shrinks to zero small nonzero elements of S that are influenced by noise. This prevents them from determining the level of sparseness of S. As discussed in [6], this shrinking operation is important in preventing selection of a less sparse S over the sparse version of S. With the non-negativity constraint S ≥ 0, problem (3) becomes a quadratic program. Thus, we have used gradient descent with projection onto the non-negative orthant: max(0,S). The sparsity of the solution is controlled by the parameter λ. There is a maximal value of λ (denoted by λ_{max} here) above which the solution of problems (3) is maximally sparse, i.e. it is equal to zero. Thus, in the experiments reported in sections 1.5 to 1.7 the value of λ has been selected by cross-validation (together with Δθ and M) relative to λ_{max} as: λ ∈ {10^{-2}·λ_{max}, 10^{-4}·λ_{max}, 10^{-6}·λ_{max}}.
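The sparseness constrained problem for a single feature column, min_s ½‖x − As‖² + λ‖s‖₁ with optional non-negativity, can be sketched with a plain iterative shrinkage-thresholding loop. This is a generic ISTA in NumPy, not the accelerated method of [38]; the toy A, x and λ below are illustrative only:

```python
import numpy as np

def ista(A, x, lam, n_iter=500, nonneg=False):
    """Generic ISTA for min_s 0.5*||x - A s||^2 + lam*||s||_1.
    The soft-thresholding step shrinks small, noise-driven entries
    of s to exactly zero, which enforces sparseness."""
    L = np.linalg.norm(A, 2) ** 2              # Lipschitz constant of the gradient
    s = np.zeros(A.shape[1])
    for _ in range(n_iter):
        g = s - (A.T @ (A @ s - x)) / L        # gradient step on the fidelity term
        s = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)  # soft thresholding
        if nonneg:
            s = np.maximum(s, 0.0)             # projection onto the non-negative orthant
    return s

# Underdetermined toy system: 2 equations (mixture samples), 4 unknowns (components).
A = np.array([[1.0, 0.0, 1.0, 0.5],
              [0.0, 1.0, 1.0, 0.5]])
x = A @ np.array([2.0, 0.0, 0.0, 0.0])         # sparse ground truth
s_hat = ista(A, x, lam=0.05, nonneg=True)
```

Although the system has fewer equations than unknowns, the ℓ₁ penalty recovers a solution concentrated on the single truly active component, with the remaining entries driven to (near) zero.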
We conclude this section with the observation that the situation suggested in [6], X = AS = A^{pseu}S^{pseu}, where (A^{pseu}, S^{pseu}) represents an alternative factorization of X such that S^{pseu} would be less sparse than S, cannot occur during minimization of (3). That is due to the IST algorithm [38], as well as to the accurate estimation of the mixing matrices enabled by clustering the set of SCPs. First, this is a consequence of the fact that the shrinking operation used by the IST algorithm [38] imposes a sparseness constraint of the type given by eq. (7) in [6]:
i.e. small nonzero elements of s_{k} are set to zero. This prevents selection of a less sparse S^{pseu} over a sparser S. Second, the SCA method used here is a two-stage method in which A is estimated accurately by clustering on a set of SCPs. This, in addition to the sparseness measure discussed above, prevents the estimate of S from deviating significantly from the true value. This is because, when S is being estimated by means of the IST algorithm, the estimate of A is fixed. As opposed to the case when A and S are estimated simultaneously, as in [6], the estimate of A cannot now be adjusted by the algorithm to some value A^{pseu} that would counteract changes in S. Hence, selecting S^{pseu} would increase the data fidelity term in the cost function. Thus, the situation suggested in [6], X = AS = A^{pseu}S^{pseu}, cannot occur. The proposed two-stage SCA approach to feature extraction/component selection is presented in concise form in Table 1. MATLAB code is posted in the Additional Material Files section accompanying the paper as Additional File 1.
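The first stage of this two-stage procedure, estimating the mixing matrix by clustering single-component feature points, can be sketched in Python with SciPy's hierarchical clustering standing in for MATLAB's clusterdata (same 'cosine' metric and 'complete' linkage). The feature points below are synthetic:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic single-component feature points in the mixture plane
# (reference intensity, test intensity), lying along three directions
# at roughly 3, 45 and 87 degrees.
rng = np.random.default_rng(1)
directions = np.array([[1.0, 0.05], [0.7, 0.7], [0.05, 1.0]])
points = np.vstack([d * rng.uniform(0.5, 2.0, (20, 1)) for d in directions])

# Complete-linkage hierarchical clustering with cosine distance, M = 3 clusters.
Z = linkage(points, method='complete', metric='cosine')
labels = fcluster(Z, t=3, criterion='maxclust')

# Normalized cluster centres estimate the mixing (concentration) directions.
A_hat = np.array([points[labels == c].mean(axis=0) for c in (1, 2, 3)])
A_hat /= np.linalg.norm(A_hat, axis=1, keepdims=True)
```

With the mixing directions fixed by this clustering stage, the second stage solves for S alone (e.g. with the IST iteration above), which is what prevents the factorization from drifting to a less sparse alternative.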
Table 1. A mixture model with a reference-based algorithm for feature extraction/component selection
Additional file 1. Code implementing the proposed feature extraction/component selection method.
Format: ZIP Size: 11KB Download file
Results and Discussion
This section presents the model validation procedure. It is demonstrated how an increased number of postulated components retains, or slightly improves, prediction accuracy when the concentration variability of the features across the sample population is significant. Moreover, an increased number of postulated components yields disease and control specific components used for classification with a smaller number of features. This is in agreement with the principle of parsimony, which states that a less complex solution ought to be preferred over a more complex one. The proposed method for feature extraction/component selection is also applied to the prediction of ovarian, prostate and colon cancers from three well-studied datasets. Prediction accuracy (sensitivity and specificity with standard deviations) is estimated by 100 independent two-fold cross-validations. The proposed SCA component selection method is compared (favourably) against state-of-the-art predictors tested on the same datasets, including our implementation of the methods proposed in [1,2]. Regarding our implementation of the predictive matrix factorization method [1], we have used the MATLAB fminsearch function to minimize the negative value of the target function suggested in [1] while selecting the threshold vector. We have set TolFun to 10^{-10}, TolX to 10^{-10} and MaxFunEvals to 10,000. The initial value of the two-dimensional threshold vector has been set to [0 0]^{T}. Regarding the gene discovery method proposed in [2], we have cross-validated three values of the threshold c_{u} ∈ {2, 2.5, 3.0} (c_{l} is set automatically as c_{l} = 1/c_{u}). The best result is presented in section 1.7.
Regarding the comparison of the proposed component selection method against many methods in sections 1.5 to 1.7, our intention has been to provide a brief description of the methods and a fair comparison, given that the code for the compared methods has not been available to us. That was the main reason for choosing well-known datasets such as those in sections 1.5 to 1.7, since a rich list of published results exists for them. We are aware that results for many other methods were obtained under different cross-validation settings. Therefore, our reasoning is that a fair comparison is possible as long as the results to be compared were obtained on the same datasets under conditions that favour the method proposed here less. That is the reason why we have chosen to perform two-fold cross-validation, since it is known to yield the least optimistic result. Thus, if such results compare favourably against those obtained under milder (ten- and three-fold) cross-validation settings, the conclusion can be made that the proposed feature extraction/component selection method represents a contribution to the field. As opposed to the two-fold cross-validation applied here, cross-validation details for many cited results were not specified. Sometimes ten-fold or three-fold cross-validations have been performed. Hence, it is believed that the performance assessment of the proposed component selection method is more realistic than that of the majority of methods cited in the comparative analysis. For each of the three types of cancers, three classifiers were trained on four sets of extracted components:
Table 2. Comparative performance results in ovarian cancer prediction. Sensitivities and specificities were estimated by 100 two-fold cross-validations (standard deviations are in brackets).
Table 3. Comparative performance results in prostate cancer prediction. Sensitivities and specificities were estimated by 100 two-fold cross-validations (standard deviations are in brackets).
Table 4. Comparative performance results in colon cancer prediction. Sensitivities and specificities were estimated by 100 two-fold cross-validations (standard deviations are in brackets).
Additional file 2. Classification results obtained by the linear SVM applied to disease and control specific components extracted from the ovarian cancer dataset for various combinations of the parameters M, λ and Δθ.
Format: XLSX Size: 15KB Download file
Additional file 3. Classification results obtained by the linear SVM applied to disease and control specific components extracted from the prostate cancer dataset for various combinations of the parameters M, λ and Δθ.
Format: XLSX Size: 16KB Download file
Additional file 4. Classification results obtained by the linear SVM applied to disease and control specific components extracted from the colon cancer dataset for various combinations of the parameters M, λ and Δθ.
Format: XLSX Size: 14KB Download file
Additional file 5. Best classification results obtained by the RBF SVM applied to disease and control specific components extracted from the colon cancer dataset for M = 4, λ = 10^{-2}λ_{max} and Δθ = 5°, and for M = 2 and Δθ = 1°.
Format: XLSX Size: 12KB Download file
1.4 Model validation
This section presents model validation results obtained on simulated data using the LMMs (2a)/(2b). To this end, each mixture sample has been composed of ten orthogonal components comprised of K = 15000 features. The orthogonality implies that each feature belongs to one component only. By convention, the first component has been selected to contain disease specific features, the tenth component to contain control specific features, and components two to nine contain features that are not differentially expressed and share similar concentrations in control and disease labelled samples. Concentration variability across the sample population is simulated using the following model for the disease group of samples:
and for the control group of samples:
Thus, by controlling the mixing angles, the relative concentrations of the disease specific, control specific and differentially not expressed features are varied across the sample population.
Figure 2. Model validation. Sensitivities (a) and specificities (b) (with standard deviations as error bars) estimated by the linear SVM classifier and 100 independent two-fold cross-validations using two disease specific and two control specific components. Components were extracted from the linear mixture models based on the control reference (c.r.) sample, model (2a), and the disease reference (d.r.) sample, model (2b), where each sample was comprised of ten orthogonal components containing K = 15000 features. One component contained disease specific features in prevailing concentration, one contained control specific features, and eight components contained features equally expressed in control and disease labelled samples. The relative concentration (expressed through a mixing angle) across the sample population has been: for disease specific features in the range of 50° to 89.99°; for differentially not expressed features in the range of 25° to 65°; and for control specific features in the range of 0.01° to 40°. The assumed overall number of components has been M = 2 (red bars), M = 3 (green bars), M = 4 (blue bars) and M = 5 (magenta bars).
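The simulated-data setup described above can be sketched in a few lines of code. This is an illustrative reconstruction only: the disjoint feature supports, the uniform component amplitudes, and the mapping of a mixing angle θ to a concentration via sin θ are assumptions for illustration, not the authors' exact generator.

```python
import numpy as np

rng = np.random.default_rng(0)
K, M = 15000, 10  # number of features and of orthogonal components

# Disjoint feature supports make the ten components mutually orthogonal:
# component 0 holds the disease specific features, component 9 the control
# specific ones, components 1..8 features not differentially expressed.
supports = np.array_split(rng.permutation(K), M)
components = np.zeros((M, K))
for m, idx in enumerate(supports):
    components[m, idx] = rng.uniform(0.5, 1.0, size=idx.size)

def mixture_sample(theta_disease, theta_neutral, theta_control):
    """Mix the components with concentrations derived from mixing angles
    (in degrees); concentration = sin(theta) is an assumed mapping."""
    w = np.empty(M)
    w[0] = np.sin(np.deg2rad(theta_disease))
    w[1:9] = np.sin(np.deg2rad(theta_neutral))
    w[9] = np.sin(np.deg2rad(theta_control))
    return w @ components  # one mixture sample of K features

# Disease group: disease angle in [50, 89.99] degrees, control angle in
# [0.01, 40]; the control group is the converse. Neutral angles in [25, 65].
disease = mixture_sample(rng.uniform(50, 89.99), rng.uniform(25, 65),
                         rng.uniform(0.01, 40))
control = mixture_sample(rng.uniform(0.01, 40), rng.uniform(25, 65),
                         rng.uniform(50, 89.99))
```

With these angle ranges the disease specific features are expressed in prevailing concentration in the disease sample and vice versa, which is what the component selection step exploits.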
1.5 Ovarian cancer prediction from protein mass spectra
Low resolution surface-enhanced laser desorption ionization time-of-flight (SELDI-TOF) mass spectra of 100 controls and 100 cases have been used for the ovarian cancer prediction study [42]. See also the website of the clinical proteomics program of the National Cancer Institute (NCI) [43], where the dataset used here is labelled "Ovarian 4302". All spectra were baseline corrected; thus, some intensities have negative values. Table 2 presents the best result obtained by the proposed SCA-based component selection method together with the results obtained on the same dataset by the competing methods reported in the cited references, as well as by the predictive factorization method proposed in [1]. The described SCA method has been used to extract four sets of components, with the overall number of components M assumed to be 2, 3, 4 and 5. Figure 3 shows sensitivities and specificities estimated by 100 independent two-fold cross-validations using the linear SVM classifier, which yielded the best results compared against nonlinear SVM classifiers based on polynomial and RBF kernels. Performance improvement is visible when the assumed number of components is increased from 2 to 3, 4 or 5. The error bars are dictated by the sample size and would decrease with a larger sample; thus, the mean values should be examined to observe the trend in performance as a function of M. The best result (shown in Table 2) has been obtained with the linear SVM classifier for M = 3, with a sensitivity of 96.2% and a specificity of 93.6%, but results of very similar quality have been obtained for several combinations of the parameters M, Δθ and λ (see Figure 3), most notably M = 4 (see the second column in Table 2 and the Additional File 2). As seen in Table 2, only [13] reported a better result for a two-fold cross-validation with the same number of partitions.
There, a combination of a genetic algorithm and the k-nearest neighbours method, originally developed for mining high-dimensional microarray gene expression data, has been used for the analysis of proteomics data. However, the method [13] has been tested on the proteomic ovarian cancer dataset only, while the method proposed here exhibited excellent performance in the prediction of prostate cancer from proteomic data (reported in section 1.6), as well as of colon cancer from genomic data (presented in section 1.7). The method in [42] used 50 samples from the control group and 50 samples from the ovarian cancer group to discover a pattern that discriminated the cancer from the non-cancer group. This pattern has then been used to classify an independent set of 50 samples with ovarian cancer and 66 samples unaffected by ovarian cancer. In [44], a fuzzy rule based classifier fusion is proposed for feature selection and classification (diagnosis) of protein mass spectra based ovarian cancer. The demonstrated accuracy of 98-99% has been estimated through 10 ten-fold cross-validations (as opposed to the 100 two-fold cross-validations used here). Moreover, as demonstrated in sections 1.6 and 1.7, the method proposed here exhibited good performance in the diagnosis of prostate and colon cancers from proteomic data and gene expression levels, respectively. In [45], a clustering based method for feature selection from mass spectrometry data is derived by combining k-means clustering and a genetic algorithm. The method exhibited an accuracy of 95.8% (error rate 4.1%), but this has been assessed through three-fold cross-validations (as opposed to the two-fold cross-validations used here).
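The evaluation protocol used throughout these sections — repeated independent two-fold cross-validations reporting mean sensitivity and specificity with their standard deviations — can be sketched as follows. The sketch uses numpy only; a nearest-centroid rule stands in for the linear SVM, and all function names are illustrative, not the authors' implementation.

```python
import numpy as np

def repeated_twofold_cv(X, y, fit, predict, repeats=100, seed=0):
    """Estimate sensitivity/specificity by `repeats` independent
    two-fold cross-validations (each half used once for training)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    sens, spec = [], []
    for _ in range(repeats):
        perm = rng.permutation(n)
        halves = (perm[: n // 2], perm[n // 2:])
        tp = fp = tn = fn = 0
        for train, test in (halves, halves[::-1]):
            model = fit(X[train], y[train])
            pred = predict(model, X[test])
            tp += np.sum((pred == 1) & (y[test] == 1))
            fn += np.sum((pred == 0) & (y[test] == 1))
            tn += np.sum((pred == 0) & (y[test] == 0))
            fp += np.sum((pred == 1) & (y[test] == 0))
        sens.append(tp / (tp + fn))
        spec.append(tn / (tn + fp))
    return np.mean(sens), np.std(sens), np.mean(spec), np.std(spec)

# Stand-in classifier (nearest class centroid) in place of the linear SVM:
def fit(Xtr, ytr):
    return Xtr[ytr == 1].mean(axis=0), Xtr[ytr == 0].mean(axis=0)

def predict(model, Xte):
    c1, c0 = model
    d1 = np.linalg.norm(Xte - c1, axis=1)
    d0 = np.linalg.norm(Xte - c0, axis=1)
    return (d1 < d0).astype(int)
```

The standard deviations returned by this protocol are the error bars shown in Figures 2 to 5; as noted above, their size is dictated by the sample size of each dataset.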
Figure 3. Ovarian cancer prediction. Sensitivities (a) and specificities (b) (with standard deviations as error bars) estimated in ovarian cancer prediction from protein expression levels using 100 independent two-fold cross-validations and the linear SVM classifier. Four sets of selected components were extracted by SCA-based factorization using LMMs (2a) and (2b) with control reference (c.r.) and disease reference (d.r.) samples, respectively, where the overall number of components M has been set to 2 (red bars), 3 (green bars), 4 (blue bars) and 5 (magenta bars). Optimal values of the parameters λ and Δθ were used for each M. Performance improvement is visible when the number of components is increased from 2 to 3, 4 or 5.
1.6 Prostate cancer prediction from protein mass spectra
Low resolution SELDI-TOF mass spectra of 63 controls (no evidence of cancer, with prostate-specific antigen (PSA) < 1) and 69 cases (prostate cancers: 26 with 4 < PSA < 10 and 43 with PSA > 10) have been used for the prostate cancer prediction study [46]. An additional 190 control samples with benign conditions (4 < PSA < 10) are available as well (see the website of the clinical proteomics program of the NCI [43]), in the dataset labelled "JNCI_Data_7302". However, these samples were not used in the two-class comparative performance analysis reported here. The proposed SCA-based method has been used to extract four sets of components, with the overall number of components M assumed to be 2, 3, 4 and 5. The best result has been achieved for M = 5, with a sensitivity of 97.6% and a specificity of 99%, but results of very similar quality have been obtained for several combinations of the parameters M, Δθ and λ (see Figure 4 and the Additional File 3). Table 3 presents the two best results achieved by the proposed SCA-based approach to component selection together with the results obtained by the competing methods reported in the cited references. The linear SVM classifier yielded the best results when compared against nonlinear SVM classifiers based on polynomial and RBF kernels. According to Table 3, a comparable (although slightly worse) result is reported in reference [47] only. The method [47] is proposed for the analysis of mass spectra for prostate cancer screening. The system is composed of three stages: feature selection using a statistical significance test, classification by radial basis function and probabilistic neural networks, and optimization of the results through receiver-operating-characteristic analysis. The method achieved a sensitivity of 97.1% and a specificity of 96.8%, but the cross-validation setting has not been described in detail. In [46], the training group has been used to discover a pattern that discriminated the cancer from the non-cancer group.
This pattern has then been used to classify an independent set of 38 patients with prostate cancer and 228 patients with benign conditions. The obtained specificity is low. The predictive matrix factorization method [1] yielded a significantly worse result than the method proposed here. In [45], a clustering based method for feature selection from mass spectrometry data is derived by combining k-means clustering and a genetic algorithm. Despite a three-fold cross-validation, the reported error was 28.97%. Figure 4 shows sensitivities and specificities estimated by 100 independent two-fold cross-validations using the linear SVM classifier on components selected by the method proposed here. For each M, the optimal values of the parameters λ and Δθ (obtained by cross-validation) have been used to obtain the results shown in Figure 4. Increasing the postulated number of components from 2 to 5 increased the accuracy from 97.4% to 98.3%. Thus, better accuracy is achieved with a smaller number of features (m/z ratios) contained in the selected components.
Figure 4. Prostate cancer prediction. Sensitivities (a) and specificities (b) (with standard deviations as error bars) estimated in prostate cancer prediction from protein expression levels using 100 independent two-fold cross-validations and the linear SVM classifier. Four sets of selected components were extracted by SCA-based factorization using LMMs (2a) and (2b) with control reference (c.r.) and disease reference (d.r.) samples, respectively, where the overall number of components M has been set to 2 (red bars), 3 (green bars), 4 (blue bars) and 5 (magenta bars). Optimal values of the parameters λ and Δθ were used for each M. Performance improvement is visible when the number of components is increased from 2 to 5.
1.7 Colon cancer prediction from gene expression profiles
Gene expression profiles of 40 colon cancer and 22 normal colon tissue samples, obtained by an Affymetrix oligonucleotide array [48], have also been used for validation and comparative performance analysis of the proposed feature extraction method. The gene expression profiles were downloaded from [49]. The original data produced by the oligonucleotide array contained more than 6500 genes, but only 2000 high-intensity genes were used for cluster analysis in [48] and are provided for download on the cited website. The proposed SCA-based approach to feature extraction/component selection has been used to extract four sets of components with up- and down-regulated genes, with the overall number of components M assumed to be 2, 3, 4 and 5. The linear SVM classifier has been applied to groups of the four sets of selected components extracted from gene expression levels for specific combinations of the parameters Δθ, λ and M. The best result in terms of sensitivity and specificity for each M has been selected and is shown in Figure 5. The complete list of results obtained by the linear SVM classifier is presented in the Additional File 4. An increased number of postulated components M did not decrease accuracy, but it yielded components selected for classification with a reduced number of genes. This is verified in Figure 6, which shows the component with up-regulated genes extracted from a cancerous sample for M = 2 and M = 4.
Figure 5. Colon cancer prediction. Sensitivities (a) and specificities (b) (with standard deviations as error bars) estimated in colon cancer prediction from gene expression levels using 100 independent two-fold cross-validations and the linear SVM classifier. Four sets of selected components were extracted by using LMMs (2a) and (2b) with control reference (c.r.) and disease reference (d.r.) samples, respectively, where the overall number of components M has been set to 2 (red bars), 3 (green bars), 4 (blue bars) and 5 (magenta bars). Optimal values of the parameters λ and Δθ were used for each M. Increasing the number of components M did not decrease prediction accuracy but did reduce the number of features (genes) in the components used for classification (see Figure 6).
Figure 6. Colon cancer feature vectors. Component containing up-regulated genes extracted from a cancerous sample with respect to a control reference sample using LMM (2a): a) assumed number of components M = 2; b) assumed number of components M = 4.
Conclusions
This work presents a feature extraction/component selection method based on an innovative additive linear mixture model of a sample (protein or gene expression levels represented by mass spectra or microarray data, respectively) and sparseness constrained factorization that operates on a sample- (experiment-) by-sample basis. This differs from existing methods, which factorize the complete dataset simultaneously. The sample model is comprised of a test sample and a reference sample representing the disease and/or control group. Each sample is decomposed into several components, selected automatically (their number is determined by cross-validation), without using label information, as disease specific, control specific and differentially not expressed. The automatic selection is based on mixing angles that are estimated from each sample directly. Hence, due to the locality of the decomposition, the strength of the expression of each feature can vary from sample to sample; the feature can nevertheless still be allocated to the same (disease or control specific) component in different samples. As opposed to that, feature allocation/selection algorithms that operate on a whole dataset simultaneously try to optimize a single threshold for the whole dataset. The selected components can be used for classification because label information is not used in the selection. Moreover, the disease specific component(s) can also be used for further biomarker related analysis. As opposed to existing matrix factorization methods, such a disease specific component can be obtained from one sample (experiment) only. By postulating one or more components with differentially not expressed features, the method yields less complex disease and control specific components composed of a smaller number of features with higher discriminative power. This has been demonstrated to improve prediction accuracy.
Moreover, decomposing a sample with one or more components with indifferent features performs (indirectly) sample adaptive preprocessing, related to the removal of features that do not vary significantly across the sample population. The proposed feature extraction/component selection method is demonstrated on real world proteomic datasets used for the prediction of ovarian and prostate cancers, as well as on a genomic dataset used for colon cancer prediction. Results obtained by 100 two-fold cross-validations compare favourably against most of the state-of-the-art methods cited in the literature and used for cancer prediction on the same datasets.
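As an illustration of the kind of per-sample, mixing-angle-based assignment described above, the following sketch splits the features of one test/reference sample pair into disease specific, control specific and neutral groups. The angle estimator and the symmetric threshold Δθ around 45° are simplifying assumptions for illustration and do not reproduce the authors' sparseness constrained factorization.

```python
import numpy as np

def select_components(ref, test, delta_theta_deg=5.0):
    """Assign each feature of a (reference, test) sample pair to a
    disease specific, control specific or neutral component, using the
    per-feature mixing angle of the 2-D point (ref_i, test_i).
    An angle near 0 deg means the feature is expressed mostly in the
    reference; near 90 deg, mostly in the test sample."""
    theta = np.degrees(np.arctan2(test, ref))
    hi = 45.0 + delta_theta_deg
    lo = 45.0 - delta_theta_deg
    disease = np.where(theta > hi, test, 0.0)  # stronger in the test sample
    control = np.where(theta < lo, ref, 0.0)   # stronger in the reference
    neutral = (theta >= lo) & (theta <= hi)    # not differentially expressed
    return disease, control, neutral
```

Because the thresholds are applied to angles computed from the sample pair itself, a feature whose absolute expression level varies across samples can still land in the same (disease or control specific) component in every sample, which is the locality property emphasized above.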
Authors' contributions
IK proposed the novel linear mixture model of the samples and the methodology for automatic selection of disease and control specific components extracted from the samples by means of sparse component analysis. He also performed the model validation and implemented the clustering phase of the sparse component analysis method. MF implemented the iterative thresholding based shrinkage algorithm for extraction of the components and performed cross-validation based component extraction and classification. All authors read and approved the final manuscript.
Acknowledgements
This work has been supported by the Ministry of Science, Education and Sports, Republic of Croatia through Grant 098-0982903-2558. Professor Vojislav Kecman's and Dr. Ivanka Jerić's help in proofreading the manuscript is gratefully acknowledged.
References

1. Henneges C, Laskov P, Darmawan E, Backhaus J, Kammerer B, Zell A: A factorization method for the classification of infrared spectra. BMC Bioinformatics 2010, 11:561.
2. Alfo M, Farcomeni A, Tardella L: A Three Component Latent Class Model for Robust Semiparametric Gene Discovery. Stat Appl in Genet and Mol Biol 2011, 10(1), Article 7.
3. Schachtner R, Lutter D, Knollmüller P, Tomé AM, Theis FJ, Schmitz G, Stetter M, Vilda PG, Lang EW: Knowledge-based gene expression classification via matrix factorization. Bioinformatics 2008, 24:1688-1697.
4. Liebermeister W: Linear modes of gene expression determined by independent component analysis. Bioinformatics 2002, 18:51-60.
5. Lutter D, Ugocsai P, Grandl M, Orso E, Theis F, Lang EW, Schmitz G: Analyzing M-CSF dependent monocyte/macrophage differentiation: Expression modes and meta-modes derived from an independent component analysis. BMC Bioinformatics 2008, 9:100.
6. Stadtlthanner K, Theis FJ, Lang EW, Tomé AM, Puntonet CG, Górriz JM: Hybridizing Sparse Component Analysis with Genetic Algorithms for Microarray Analysis. Neurocomputing 2008, 71:2356-2376.
7. Carmona-Saez P, Pascual-Marqui RD, Tirado F, Carazo JM, Pascual-Montano A: Biclustering of gene expression data by non-smooth non-negative matrix factorization. BMC Bioinformatics 2006, 7:78.
8. Lee SI, Batzoglou S: Application of independent component analysis to microarrays. Genome Biol 2003, 4:R76.
9. Girolami M, Breitling R: Biologically valid linear factor models of gene expression. Bioinformatics 2004, 20:3021-3033.
10. Brunet JP, Tamayo P, Golub TR, Mesirov JP: Metagenes and molecular pattern discovery using matrix factorization. Proc Natl Acad Sci USA 2004, 101:4164-4169.
11. Gao Y, Church G: Improving molecular cancer class discovery through sparse non-negative matrix factorization. Bioinformatics 2005, 21:3970-3975.
12. Kim H, Park H: Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis. Bioinformatics 2007, 23:1495-1502.
13. Li L, Umbach DM, Terry P, Taylor JA: Application of the GA/KNN method to SELDI proteomics data. Bioinformatics 2004, 20:1638-1640.
14. Yu JS, Ongarello S, Fiedler R, Chen XW, Toffolo G, Cobelli C, Trajanoski Z: Ovarian cancer identification based on dimensionality reduction for high-throughput mass spectrometry data. Bioinformatics 2005, 21:2200-2209.
15. Qiu P, Wang ZJ, Liu RKJ: Ensemble dependence model for classification and prediction of cancer and normal gene expression data. Bioinformatics 2005, 21:3114-3121.
16. Mischak H, Coon JJ, Novak J, Weissinger EM, Schanstra J, Dominiczak AF: Capillary electrophoresis-mass spectrometry as a powerful tool in biomarker discovery and clinical diagnosis: an update of recent developments.
17. Comon P, Jutten C: Handbook of Blind Source Separation: Independent Component Analysis and Applications. Academic Press; 2010.
18. Hyvärinen A, Karhunen J, Oja E: Independent Component Analysis. Wiley Interscience; 2001.
19. Cichocki A, Zdunek R, Phan AH, Amari SI: Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-way Data Analysis and Blind Source Separation. Chichester: John Wiley; 2009.
20. Hyvärinen A, Oja E: A fast fixed-point algorithm for independent component analysis. Neural Computation 1997, 9:1483-1492.
21. Decramer S, Gonzalez de Peredo A, Breuil B, Mischak H, Monsarrat B, Bascands JL, Schanstra JP: Urine in clinical proteomics. Mol Cell Proteomics 2008, 7:1850-1862.
22. Kopriva I, Jerić I: Blind separation of analytes in nuclear magnetic resonance spectroscopy and mass spectrometry: sparseness-based robust multicomponent analysis. Analytical Chemistry 2010, 82:1911-1920.
23. Kopriva I, Jerić I: Multi-component Analysis: Blind Extraction of Pure Components Mass Spectra using Sparse Component Analysis. Journal of Mass Spectrometry 2009, 44:1378-1388.
24. Hyvärinen A, Cristescu R, Oja E: A fast algorithm for estimating overcomplete ICA bases for image windows. In Proc Int Joint Conf on Neural Networks. Washington DC, USA; 1999:894-899.
25. Lewicki M, Sejnowski TJ: Learning overcomplete representations. Neural Comput 2000, 12:337-365.
26. Bofill P, Zibulevsky M: Underdetermined blind source separation using sparse representations. Signal Proc 2001, 81:2353-2362.
27. Georgiev P, Theis F, Cichocki A: Sparse component analysis and blind source separation of underdetermined mixtures. IEEE Trans Neural Net 2005, 16:992-996.
28. Li Y, Cichocki A, Amari S: Analysis of Sparse Representation and Blind Source Separation. Neural Comput 2004, 16:1193-1234.
29. Li Y, Amari S, Cichocki A, Ho DWC, Xie S: Underdetermined Blind Source Separation Based on Sparse Representation.
30. Cichocki A, Zdunek R, Amari SI: Hierarchical ALS Algorithms for Nonnegative Matrix Factorization and 3D Tensor Factorization.
31. Kopriva I, Cichocki A: Blind decomposition of low-dimensional multi-spectral image by sparse component analysis. J of Chemometrics 2009, 23:590-597.
32. Hoyer PO: Non-negative matrix factorization with sparseness constraints.
33. Reju VG, Koh SN, Soon IY: An algorithm for mixing matrix estimation in instantaneous blind source separation. Signal Proc 2009, 89:1762-1773.
34. Kim SG, Yoo CD: Underdetermined Blind Source Separation Based on Subspace Representation.
35. Naini FM, Mohimani GH, Babaie-Zadeh M, Jutten C: Estimating the mixing matrix in Sparse Component Analysis (SCA) based on partial k-dimensional subspace clustering. Neurocomputing 2008, 71:2330-2343.
36. Tibshirani R: Regression shrinkage and selection via the lasso.
37. Tropp JA, Wright SJ: Computational Methods for Sparse Solution of Linear Inverse Problems.
38. Beck A, Teboulle M: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J on Imag Sci 2009, 2:183-202.
39. Selected publications list of Professor Amir Beck [http://ie.technion.ac.il/Home/Users/becka.html]
40. Kecman V: Learning and Soft Computing: Support Vector Machines, Neural Networks and Fuzzy Logic Models. The MIT Press; 2001.
41. Hastie T, Tibshirani R, Friedman J: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 3rd edition. Springer; 2009:649-698.
42. Petricoin EF, Ardekani AM, Hitt BA, Levine PJ, Fusaro VA, Steinberg SM, Mills GB, Simone C, Fishman DA, Kohn EC, Liotta LA: Use of proteomic patterns in serum to identify ovarian cancer. The Lancet 2002, 359:572-577.
43. National Cancer Institute clinical proteomics program [http://home.ccr.cancer.gov/ncifdaproteomics/ppatterns.asp]
44. Assareh A, Volkert LG: Fuzzy rule based classifier fusion for protein mass spectra based ovarian cancer diagnosis. Proceedings of the 2009 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB'09) 2009, 193-199.
45. Yang P, Zhang Z, Zhou BB, Zomaya AY: A clustering based hybrid system for biomarker selection and sample classification of mass spectrometry data. Neurocomputing 2010, 73:2317-2331.
46. Petricoin EF, Ornstein DK, Paweletz CP, Ardekani A, Hackett PS, Hitt BA, Velassco A, Trucco C, Wiegand L, Wood K, Simone CB, Levine PJ, Linehan WM, Emmert-Buck MR, Steinberg SM, Kohn EC, Liotta LA: Serum proteomic patterns for detection of prostate cancer. J Natl Canc Institute 2002, 94:1576-1578.
47. Xu Q, Mohamed SS, Salama MMA, Kamel M: Mass spectrometry-based proteomic pattern analysis for prostate cancer detection using neural networks with statistical significance test-based feature selection. Proceedings of the 2009 IEEE Conference on Science and Technology for Humanity (TIC-STH) 2009, 837-842.
48. Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, Levine AJ: Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci USA 1999, 96:6745-6750.
49. Data pertaining to the article "Broad patterns of gene expression revealed by clustering of tumor and normal colon tissues probed by oligonucleotide arrays" [http://genomicspubs.princeton.edu/oncology/affydata/index.html]
50. Ambroise C, McLachlan GJ: Selection bias in gene extraction on the basis of microarray gene-expression data. Proc Natl Acad Sci USA 2002, 99:6562-6566.
51. Huang TM, Kecman V: Gene extraction for cancer diagnosis using support vector machines. Artificial Intelligence in Medicine 2005, 35:185-194.