Abstract
Background
External stimulations of cells by hormones, cytokines or growth factors activate signal transduction pathways that subsequently induce a rearrangement of cellular gene expression. The analysis of such changes is complicated, as they consist of multilayered temporal responses. While classical analyses based on clustering or gene set enrichment only partly reveal this information, matrix factorization techniques are well suited for a detailed temporal analysis. In signal processing, factorization techniques incorporating data properties like spatial and temporal correlation structure have shown to be robust and computationally efficient. However, such correlationbased methods have so far not be applied in bioinformatics, because large scale biological data rarely imply a natural order that allows the definition of a delayed correlation function.
Results
We therefore develop the concept of graphdecorrelation. We encode prior knowledge like transcriptional regulation, protein interactions or metabolic pathways in a weighted directed graph. By linking features along this underlying graph, we introduce a partial ordering of the features (e.g. genes) and are thus able to define a graphdelayed correlation function. Using this framework as constraint to the matrix factorization task allows us to set up the fast and robust graphdecorrelation algorithm (GraDe). To analyze alterations in the gene response in IL6 stimulated primary mouse hepatocytes, we performed a timecourse microarray experiment and applied GraDe. In contrast to standard techniques, the extracted timeresolved gene expression profiles showed that IL6 activates genes involved in cell cycle progression and cell division. Genes linked to metabolic and apoptotic processes are downregulated indicating that IL6 mediated priming renders hepatocytes more responsive towards cell proliferation and reduces expenditures for the energy metabolism.
Conclusions
GraDe provides a novel framework for the decomposition of largescale 'omics' data. We were able to show that including prior knowledge into the separation task leads to a much more structured and detailed separation of the timedependent responses upon IL6 stimulation compared to standard methods. A Matlab implementation of the GraDe algorithm is freely available at http://cmb.helmholtzmuenchen.de/grade webcite.
Background
With the availability of highthroughput 'omics' data, more and more methods from statistics and signal processing are applied in the field of bioinformatics [1]. Direct application of such methods to biological data sets is essentially complicated by three issues, namely (i) the largedimensionality of observed variables (e.g. transcripts or metabolites), (ii) the small number of independent experiments and (iii) the necessity to take into account prior information in the form of e.g. interaction networks or chemical reactions. While (i) may be tackled by targeted analysis, feature selection or efficient dimension reduction methods, the issue of low number of samples (experiments) may hinder the transfer of methods. For example, with cDNA microarrays, the number of genes (p) is usually much larger than the experiment size n (number of arrays). Quantitative data from experiments are often classified as 'smallnlargep' problems [2] and algorithms that are currently being developed are tailored for such kind of data. Detailed prior information is in general best handled by Bayesian methods [3], which are however not straightforward to formulate in smallnlargep problems.
Here, we focus on the unsupervised extraction of overlapping clusters in data sets exhibiting properties (iiii). If applied to gene expression profiles acquired by microarrays or metabolic profiles from mass spectrometry, we can interpret these clusters as jointly acting species (cellular processes). While partitioned clustering based on kmeans [4] or hierarchical clustering [5] has been successful in some domains and is often the initial tool of choice for data grouping, overlapping clusters are better described by fuzzy techniques [6] or linear models [7]. Focusing on the latter, we can essentially summarize these techniques as matrix factorization algorithms. Constraining the factorization using e.g. decorrelation, statistical independence or nonnegativity then leads to algorithms like principal component analysis (PCA), independent component analysis [8] and nonnegative matrix factorization [9], respectively. Although such methods are successfully applied in bioinformatics [1012], they partially run into issues (iiii) as described above. In particular, it is not clear how to include prior knowledge, which has been a quite successful strategy in other contexts [13]. A first step towards this direction is network component analysis (NCA) [14,15]. It integrates prior knowledge in form of a multipleinput motif to uncover hidden regulatory signals from the outputs of networked systems, a task also covered in [16]. Hence, it focuses on the estimation of single gene's expression profiles, not in a linear decomposition of a data set into overlapping clusters. NCA poses strict assumptions on the topology of this predefined network, which makes it hardly applicable to mammalian highthroughput 'omics' data. Moreover, feedbacks from the regulated species back to the regulators are treated only as 'closedloops', without explicitly modeling the feedback structure.
To overcome these constraints, this contribution provides a novel framework for the linear decomposition of data sets into expression profiles. We present a new matrix factorization method that is computationally efficient (i), able to deal with the low number of experiments (ii) and includes as much prior information as possible (iii). In order to achieve computational efficiency coupled with robust estimation, we use delayed correlations instead of higherorder statistics. In signal processing, this strategy has been shown to be advantageous [17,18] for two reasons: such methods use more information from the data without overfitting it and they can be formulated as second order techniques. This is crucial for the application to microarray data, since dimensionality tends to be high in this environment.
However, delayed correlations can usually not be computed in the case of biological highthroughput experiments such as in microarray samples. While timeresolved experiments may provide correlations, the number of temporal observations is commonly too small (<10) for the estimation of timedelayed correlations.
Hence, we instead pose factorization conditions along the set of genes or other biological variables. We link these variables using prior knowledge e.g. in the form of a transcription factor or proteinprotein interaction (PPI) network, metabolic pathways or via explicitly given models. Using this information enables us to define a graphdecorrelation algorithm that combines prior knowledge with sourceseparation techniques, for illustration see Figure 1. In case of gene expression analysis the input of GraDe are the expression data and an underlying regulatory network. After applying GraDe, we obtain two matrices, a mixing and a source matrix. We interpret the sources as the biological processes and the mixing coefficients as their timedependent activities. Hereby, the extracted sources group the genes' expression that can be explained by the underlying regulatory network, e.g. different responses of a cell to an external stimulus.
Figure 1. GraDe: Graphdecorrelation algorithm. In cells, various biological processes are taking place simultaneously. Each of these processes has its own characteristic gene expression pattern, but different processes may overlap. A cell's total gene expression is then the sum of the expression patterns of all active processes, weighted by their current activation level. The GraDe algorithm combines a matrix factorization approach with prior knowledge in form of an underlying regulatory network. The input of GraDe is the transcriptional expression data, where observations can be different conditions or a time points, and the underlying regulatory network (prior knowledge). GraDe decomposes the observed expression data into the underlying sources S and their mixing coefficients A. Analyzing timecourse microarray data, we interpret these sources as the biological processes and the mixing coefficients as their timedependent activities. Observations indicate their expression behavior either in the different conditions or timepoints and activity their activation strength. We further filter processrelated genes by taking only the genes with the strongest contribution in each process. Finally, we test for enrichment of cellular processes (GO) and biological pathways (KEGG).
The cytokine interleukin IL6 mediates the production of acute phase proteins by hepatocytes and promotes liver regeneration [19]. In order to unveil the multilayered temporal gene responses in these processes, we measure gene expression in IL6 stimulated mouse hepatocytes by a timecourse microarray experiment. Applying GraDe with a literature based gene regulatory network, we are able to infer associated biological processes as well as the dynamic behavior of IL6 related gene expression. In addition, we find that the estimated factors are robust against the high number of false positives contained in largescale biological databases.
Results and Discussion
The activation of gene regulatory processes upon external stimulations induces a rearrangement of cellular gene expression patterns. Matrix factorization techniques are currently explored in the analysis of such multilayered and overlapping temporal responses. In the following, we propose an algorithm that incorporates prior knowledge as a constraint to the factorization task (see Figure 1).
Algorithm: Matrix factorization incorporating prior knowledge
In signal processing, various matrix factorization techniques have been developed that employ intrinsic properties of data to decompose them into underlying sources [17,18,20]. These methods are based on delayed correlations that can be defined for data having a temporal or spatial structure. For instance, the timedelayed correlation matrix of a centered, widesense stationary multivariate random process x(t) is defined as
where E denotes expectation. Here, offdiagonal elements detect timeshifted correlations between different data dimensions. For τ = 0 this measure reduces to the common crosscorrelation. Given l features, e.g. genes, aggregated in a data matrix X, e.g. mRNA expression data, the crosscorrelation matrix can be easily estimated with the unbiased variance estimator:
However, the experimentally generated quantitative data sets we face in bioinformatics rarely imply a natural order like which allows defining a generic kind of delayed correlation. We therefore generalize this concept by introducing prior knowledge that links features (e.g genes) along a predefined underlying network. This network may be largescale, but can be also an explicitly given smallscale process. Moreover, integrated information may be of qualitative (e.g. interaction) as well as quantitative nature (e.g. interaction strength, reaction rates).
Graphdelayed correlation
We encode prior knowledge in a directed, weighted graph defined on vertices corresponding to our features. The edges ℰ are weighted with weights w: ℰ → ℛ. These are collected in a weight matrix W ∈ ℛ^{l×l}, where w_{ij }specifies the weight of edge i → j. Note that our weights may be negative, and G may contain selfloops. For any vertex , we denote by S(i) := {j(i, j) ∈ ℰ} the set of successors of i, by P(i) := {j(j, i) ∈ ℰ} its predecessors.
The graph G introduces a partial ordering on the l features. We use the weight matrix W as propagator for an activity pattern X ∈ ℛ^{l }of our features and define the Gshift x^{G }of x as the vector with components
Recursively, we define any positive shift x^{G}(τ ) (see Figure 2). For negative shifts we replace predecessors P(i) by successors S(i), which formally corresponds to a transposition of the weight matrix W. Using the convention of trivial weights for nonexisting edges of G, we can extend the above sum to all vertices. Gathering available m experiments (rows) into a data matrix X ∈ ℛ^{m × l }, we obtain the simple, convenient formulation of a Gshifted data set
Figure 2. Illustration of the Gshift. Illustration of the Gshift in the unweighted graph G shown in (a). We start with an initial node activity x depicted in (b). We use the graph as propagator for the time evolution of this pattern: after one positive shift we achieve the activity pattern x^{G}(1) in (c).
After mean removal, we may assume that each row of X is centered. Then, in analogy to the unbiased estimator for crosscorrelations in Equation 2, we define the graphdelayed (cross)correlation
Note that our definition includes the standard timedelayed correlation by shifting along the line graph 1 → 2 → ... → l  1 → l.
The graphdelayed correlation is only symmetric if the used graph shows this feature which is, for instance in regulatory networks, rarely the case. For our following derivations, a symmetric generalized correlation measure however will turn out to be very convenient. In the remainder of this work, we will therefore use the symmetrized graphdelayed correlation
Enforcing the symmetry property is strategy has been often applied in the case of temporally or spatially delayed correlations. It has also been demonstrated that symmetrization stabilizes the estimation of the crosscorrelations from data [8]. Moreover, it can be shown that asymptotically using either normal or symmetrized correlations end up giving the same eigenvectors [17].
Factorization model
The linear mixing model for the input data matrix X ∈ ℛ^{m×l }is given by
Here, the matrix of source contributions A ∈ ℛ^{m×n }(m ≥ n) is assumed to have full column rank. The sources S ∈ ℛ^{n×l}are uncorrelated, zeromean stationary processes with nonsingular covariance matrix. We allow for a noise term ε ∈ ℛ^{m×l}, which is modeled by a stationary, white zeromean process with variance σ^{2 }. We assume white unperturbed data (possibly after whitening transformation). In other words, we interpret each row of X as linear mixture of the n sources (rows of S), weighted by mixing coefficients stored in A. Without additional restrictions, this general linear blind sourceseparation problem is underdetermined.
Here, we assume that the sources have vanishing graphdelayed crosscorrelation with respect to some given graph G and all shifts τ. Formally, this means that is diagonal. We observe
Clearly, a full identification of A and S is not possible, because Equation (7) defines them only up to scaling and permutation of columns: Multiplication of a source by a constant scalar can be compensated by dividing the corresponding row of the mixing matrix by the scalar. Similarly, the factorization implies no natural order of the sources. We can take advantage of the scaling indeterminacy by requiring our sources to have unit variance, i.e. With this, as we assumed white data , we see that AA^{T }= I, i.e. A is orthogonal. Thus, the factorization in Equation (8) represents an eigenvalue decomposition of the symmetric matrix . If additionally we assume that has pairwise different eigenvalues, the spectral theorem guarantees that A  and with it S  is uniquely determined by X except for permutation. The reason why we focused on the symmetrized instead of the simple graphdelayed correlation matrix was precisely that we wanted to have a symmetric matrix, because then the eigenvalue decomposition is well defined and also simple to compute.
However, we have to be careful, because we cannot expect to be of full rank. Obviously, we require more features than obtained sources (l » m), hence in general rank(X) = m. If G contains an adequate amount of information, rank(W) is of order l and since l » m, rank is essentially determined by (the upper bound) m. Hence, when analyzing highthroughput biological data linked by underlying largescale networks, we can extract as many sources as observations are available.
The GraDe algorithm
Equation (8) also gives an indication of how to solve the matrix factorization task in our setting. The first step consists of whitening the nonoise term of the observed mixtures X. The whitening matrix can be easily estimated from X by diagonalization of the symmetric matrix , provided that the noise variance σ^{2 }is known or reasonably estimated. If more signals than sources are observed, dimension reduction can be performed in this step. Insignificant eigenvalues then allow estimation of the noise variance, compare [17]. Now, we may estimate the sources by diagonalization of the single, symmetric graphdelayed correlation matrix . Altogether, we subsume this procedure as GraDe algorithm. In summary, the input of GraDe is (i) a expression matrix X ∈ ℛ^{m×l }containing m experiments and l genes and (ii) a weight matrix W ∈ ℛ^{l×l }containing the prior knowledge. We obtain a mixing matrix A ∈ ℛ^{m×n }(m ≥ n) and a source matrix S ∈ ℛ^{n×l }. In the case of gene expression data analysis the sources can be interpreted as biological processes and the mixing coefficients as their timedependent activities. A Matlab implementation is freely available at http://cmb.helmholtzmuenchen.de/grade webcite.
Including prior knowledge into the sourceseparation task may introduce bias in the patterns that are predefined and, in turn, the analysis and results obtained. It is important to note that annotation of biological knowledge is biased and under permanent change. Therefore, when using gene regulatory networks as prior knowledge one has to keep in mind that this information is subject to annotation bias. Thus the density of connections in certain regions of the network might be higher due to the fact that these parts are better explored. In the case of classification problem, recent studies have shown that methods can be improved in terms of classification accuracy by including prior knowledge into the classification process [21]. These methods benefit from the fact that genes are not treated as independent. Hence, most of these methods are based on the hypothesis that genes in close proximity, which are connected to each other, should have similar expression profiles. The same assumption may be transferred to sourceseparation methods. Applying standard methods like ICA or PCA, implies the assumption that all data points, i.e. in our setting the expression levels of different genes are sampled i.i.d. from an underlying probability density. This assumption is obviously not fulfilled, since the genes' expression values are the readouts of different states of a complex dynamical system: Genes obey dynamics along a transcription factor network. Instead of ignoring the genes' dependencies, we here proposed to explicitly model them using prior knowledge given within a generegulatory network. Therefore, one of the key advantages of GraDe is to overcome the assumption of the independencies. Applying GraDe to timecourse expression data (see section Validation of the timedependent signals), we will show that including prior knowledge into the source separation task leads to an improvement compared to fullyblind methods like PCA. Finally, we believe that with increasing quality and amount of biological knowledge, methods incorporating prior knowledge will increase in performance as well.
Illustration of GraDe
In order to illustrate GraDe, we analyze two toy examples. We first focus on a bifan structure shown in Figure 3a and assume to have six genes from the timecourses of expression levels depicted in Figure 3b. For data generation, the system is simulated by ordinary differential equations:
Figure 3. Illustration of GraDe. For the bifan motif in a we take 6 genes (dots) from the simulated timecourses in b (for parameters see Additional file 1) and apply GraDe: c shows the eigenvalues of the decomposition in GraDe. In d we plot the timecourses of the extracted sources s_{1 ... }s_{6}, hence the curves are the columns of the mixing matrix. From c we see that only the first three sources are relevant, which are visualized as heatmap e. For our second example f we assume to know expressions in different conditions as shown in g. The factorization by GraDe is visualized in sub figures h to j.
where we model interactions by sigmoidal Hill functions [22]. In this case, one input x_{1 }is active until timepoint 10, when it is turned off and instead production of x_{2 }is switched on. Consequently, x_{3 }peaks at time 10, but also x_{4 }shows an early activation due to low expression of its inhibitor. Applying GraDe (with the known bifan topology, but without access to the underlying ODE system), we find that three sources are sufficient to explain the data (Figure 3b). From the extracted sources and their timecourses (shown in Figure 3e and 3d) we see that the strongest source s_{1 }represents the externally controlled inputs and the network topology: the source couples x_{1 }and x_{3}, and in opposite direction x_{2 }and x_{4}. Therefore, GraDe is able to recover the two processes. Source s_{2 }has the lowest contribution to the total expression values and is needed for finetuning the combined dynamics, as we obtain an early activation of x_{4 }due to low expression of its inhibitor. Consequently, the source s_{2 }is active at timepoints 2 and 4, i.e. immediately after the switching operations. Source s_{3 }again reflects the crossover inhibitions, accordingly its timecourse is at. This source groups the input of the network, which could be linked e.g. to pathway stimulation. For our second example we use the funnel structure in Figure 3f, where we defined the expression values for three different input conditions (Figure 3g). Eigenvalues and the factorization obtained by GraDe are visualized in Figure 3hj. Source s_{1 }again reflects the network topology, by grouping the cascade genes, while s_{2 }allows the reconstruction of the last condition. As we expect, GraDe are able to recover the two independent inputs. Applying GraDe to two different toy examples, we are able to show that GraDe is applicable both timecourse as well as conditional experiments. In both cases, GraDe identifies the different responses and inputs of the system.
Application: IL6 mediated responses in primary hepatocytes
In liver, the cytokine interleukin IL6 mediates two major responses. First, it induces hepatocytes to produce acute phase proteins upon infectionassociated in inflammation. These proteins include complement factors to destroy or inhibit growth of microbes. In addition, IL6 promotes liver regeneration and protects against liver injury [19]. IL6 regulates several cellular processes such as proliferation, differentiation and the synthesis of acute phase proteins [23]. Upon binding to its cell surface receptor, IL6 activates the receptor associated Janus tyrosine kinase (JAK) 1  signal transducer and activator of transcription (STAT) 3  signal transduction pathway. The latent transcription factor STAT3 is translocated to the nucleus after activation and subsequently alters gene expression.
To identify the biological responses to IL6 in a timeresolved manner, we stimulated primary mouse hepatocytes with 1 nM IL6 up to 4 hours and analyzed the changes in gene expression by microarray analysis. In a first approach, we extracted all genes that were significantly regulated compared to time point 0 h. In total, we obtained 121 genes and applied kmeans clustering to detect groups within this set. Based on this approach, we could not identify any timeresolved responses upon IL6 stimulation (see Additional file section Clustering of significantly regulated genes). Due to the small number of significantly regulated genes, we decided to employ a genomewide approach using GraDe to resolve the cellular responses upon IL6 in more detail.
GraDe discovers timedependent biological processes upon IL6 stimulation
Using GraDe, we linked all 5709 expressed genes along a gene regulatory network derived from the TRANSPATH database (see Methods). We obtained four graphdecorrelated gene expression sources (GES), which we labeled from 1 to 4 according to their decreasing eigenvalues (Figure 4b). We see that dimension reduction and with it noise level estimation were not possible in our case. The estimated mixing matrix is shown in Figure 4a. The matrix of source contributions contains positive and negative components. We partitioned a source into submodes that contain either the negative signals or the positive signals, respectively. We selected all genes in the positive submodes by choosing a threshold ≥2 as well as all genes in the negative submodes with a threshold ≤2, respectively. These sets were subsequently used for GO enrichment analysis using a pvalue < 0.05 after correction by False Discovery Rate.
Figure 4. Result of GraDe. This figure illustrates the decomposition of the timecourse microarray experiment on IL6 stimulated hepatocytes with GraDe. As underlying network we used interactions from the TRANSPATH database (see Methods). (a) shows the timecourses of the four extracted sources, centered to time point 0 h. The xaxis shows the measured timepoints and the yaxis the contribution of the mixing matrix. In (b), we plot the strength of the eigenvalues (EV) of the resulting sources. All four extracted sources have significant contributions.
Differentially expressed genes within GES 1 display an immediate strong increase in expression following IL6 stimulation. After peaking at one hour, expression decreases to elevated levels compared to untreated samples. Significantly enriched GOTerms within this GES correspond to responses triggered by external stimuli and in ammation (see Table 1, for a complete list of biological processes see Additional file 1). In liver, upon infection or injuryassociated in inflammation IL6 mediates production of acute phase proteins (APP) by hepatocytes as represented by the GOTerm "(acute) in ammatory response " (e.g Saa4, Fgg, Pai1 ). Angptl4 is a positive acute phase protein [24] showing a strong increase in expression during the first hour after stimulation followed by a decrease after two hours (see Figure S4). GraDe reconstruct the expression pattern by the mixing of the four different source timepatterns GES 1 to 4. We identify Angptl4 in GES 1 and 3 having a source contribution ≥2. The combination of both GESs showing perfectly the strong increase after IL6 (GES 1) and the induced decreased after 2 hours (GES 3). The GOTerm "(external) stimulus" includes genes of the JAKSTAT signaling pathway like STAT3 as well as several genes encoding for signaling components such as Hamp, Cepbd and Osmr. These entities represent regulatory processes like negative feedbacks as well as secondary signaling events. Genes with negative contribution in GES 1 were associated with metabolic processes like "Lserine biosynthesis" or "fructose metabolic processes". This is in line with the function of IL6 as a priming factor, mediating the conversion of quiescent hepatocytes from G0 to G1 phase of the cell cycle during liver regeneration [19]. It can be argued that downregulation of genes associated with metabolic processes is due to the transformation of differentiated metabolically active hepatocytes into proliferative cells. The downregulated metabolic functions at least partially take place in mitochondria. Accordingly, parts of the glycolysis pathway were downregulated in primary hepatocytes.
Table 1. Main biological processes in response to IL6
Additional file 1. Supplementary information. Additional file contains supplementary information.
Format: PDF Size: 698KB Download file
This file can be viewed with: Adobe Acrobat Reader
GES 2 shows a slight decrease after stimulation followed by a latephase increase in expression. We identify several biological processes associated with "cell cycle and division" within this GES. A representative gene of GES 2 is the cell cycle inhibitor Cdkn1b. Its reduction of expression corresponds to the induction of cell cycle progression and in particular to the transfer from G0 to G1. These characteristics are further supported by the negative contribution of Cdkn1b in GES 3. Analyzing genes with a positive contribution in GES 2 only, we found, in addition to involvement in early cell cycle events, genes showing an association with (programmed) cell death and apoptosis. It was already indicated that IL6 promotes liver regeneration and protects against liver injury by inducing antiapoptotic and survival genes [19,25]. GOTerms corresponding to genes found in GES 2 having a negative contribution are more heterogeneous. Within the top GOTerms we identified several biological functions associated with the IL6 stimulus. Based on the induction of the acute in ammatory response, coagulation factors were activated. Moreover, several genes associated with gene translation were found. In addition, genes associated with metabolic processes are represented by this GES.
The time course behavior of GES 3 shows a delayed activation subsequent to stimulation with IL6. We identified several GOTerms associated with "cell cycle" and "cell division" similar to GES 2. However, GES 3 includes mainly genes related to late events in the cell cycle, i.e. during G2 and M phase (e.g. Gmnn, Mcm2, Plk2 ). Wee1 as a main regulator of Cdc2 displays a negative contribution to GES 3, hence indicating Wee1 downregulation and subsequent progression through the G2M check point. In addition, we identify Ccnb2 a late cell cycle genes, which repression leads to cell cycle arrest in the G2 phase. The timecourse expression pattern, shows a strong increase after IL6 stimulation followed by a decrease after two hours (see Figure S5). We identify Ccnb2 in GES 1 and GES 3 perfectly reconstruct the strong increase after the stimulation and the inactivation after two hours. The IL6 induced priming phase is characterized by the activation of the latent transcription factor STAT3. This immediate response induces the expression of early responsive genes like the transcription factor AP1 [26] subsequently inducing a secondary gene response leading to transcription of cyclins AE, p53, and the cyclin dependent kinase P34cdc2 [27].
Applying KEGG pathway enrichment, we found the cell cycle, with DNA replication in particular, and p53 pathway enriched within this GES. Interestingly, IL6 stimulation alone is not sufficient to efficiently induce proliferation of primary mouse hepatocytes in vitro. Hence, despite the persistent reorganization of the induced gene expression profile and the induction of early cell cycle players such as cyclin A, additional stimuli may be necessary to initiate a strong proliferative response of primary mouse hepatocytes. GES 4 shows the lowest eigenvalue. It has a strong increase in expression following the IL6 stimulus. GOTerm enrichment reveals several biological processes found in GES 1  3 like coagulation, translation, acute phase, and response of the stimulus. Genes having a negative contribution in GES 4, indicating a decrease in expression after the stimulus, are again associated with metabolic processes. Both, GES 3 and 4 imply that hepatocytes stimulated with IL6 show affection for division causing a downregulation of genes associated with the metabolic processes.
Validation of the timedependent signals
In order to evaluate our findings, we compared the outcome of GraDe with standard methods. As there is no established matrix factorization technique that incorporates prior knowledge, we employed PCA [28], kmeans clustering [29] and FunCluster [30], a clustering method that incorporates Gene Ontology information into the clustering task.
To test the biological findings obtained by GraDe, we applied a similar approach as proposed by Teschendor et al. [12]. We first asked how well biological pathways can be mapped to the inferred submodes or clusters. For GraDe and PCA, we selected in each submode the genes having an absolute source contribution above 2 standarddeviations. The average number of selected genes in each submode ranges from 75 to 280. For kmeans clustering, we infer 8 clusters on a subset of the top 15% most variable genes to ensure that the average number of selected genes is comparable to GraDe and PCA.
To evaluate the mapping of pathways to submodes or clusters we applied the pathway enrichment index (PEI). For each submode or cluster we evaluated significantly enriched pathways by using a hypergeometric test (see Methods). The PEI is then defined as the fraction of significant pathways mapped to at least one submode or cluster. The PEI for each method is shown in Figure 5. We find that the PEI is higher for GraDe compared to PCA, kmeans clustering or FunCluster indicating that GraDe maps submodes closer to biological pathways.
Figure 5. Pathway enrichment. Result of the pathway enrichment analysis. For each method applied to our data set, we plotted the pathway enrichment index (PEI). This index gives the fraction of KEGG pathways found enriched in at least one submode or cluster (see Methods). GraDe obtained a much higher PEI than PCA, kmeans clustering or FunCluster. This indicates that sources obtained by GraDe map much closer to biological pathways.
In addition, we validated the timedependent responses upon IL6 stimulation in more detail by searching for enriched GOTerms. Applying PCA, we found that the first principle component (PC) contains 99% of data variance (see Figure S1). GOTerm enrichment analysis revealed that PC 1 contains genes linked to (blood) coagulation and hemostasis (see Additional file 1). A second major response after IL6 is the activation of cell cycle or cell division. We found an enrichment of these biological processes in PC 2 and PC 4. PC 2 shows a decreased timecourse behavior after the stimulation. Genes linked to cell cycle and corresponding pathways have a negative contribution in PC 2 indicating an increased timecourse expression pattern after IL6 stimulation. This finding is analogous to GraDe, where we find cellcycle pathways in GES 1 and 3 showing also an increasing expression pattern after the stimulation. With GraDe we identified several genes that are associated with metabolic processes showing a downregulation after stimulus. PCA covers these biological processes by two components PC 2 and PC 3, where PC 3 shows a strong increase and PC 2 a decrease of expression after the stimulus (see Figure 6a). The direct response of IL6 was found in PC 4, but we identified only acute in ammatory response. Moreover, PCA grouped cell cycle (negative mode) and the direct response (positive mode) into PC 4 and was not able to separate the cell cycle processes into the early (e.g. Cdkn1b) and late (e.g Mcm2 ) responses after IL6 stimulation.
Figure 6. Result of PCA, kmeans clustering and FunCluster. (a) illustrates the result of PCA for the timecourse data of IL6 stimulated hepatocytes. The xaxis corresponds to the measured timepoints and the yaxis gives the centered (to time point 0 h) contributions of the mixing matrix. The result of the kmeans clustering is shown in (b). The xaxis shows the measured timepoints and the yaxis shows the foldchange values of the centroids at that timepoints. (c) shows the result of FunCluster. The plot shows the mean expression of the different cluster and the bars indicates the standard deviation at a particular timepoint. The xaxis shows the measured timepoints and the yaxis shows the relative expression at that timepoints.
Focusing on the results of the kmeans clustering, we obtained an enrichment of cell cycle processes in cluster 3 (see Figure 6b). This cluster shows only a marginal increase in expression after the stimulus and therefore does not reflect the strong activation of cell cycle found by GraDe and PCA. Genes associated with metabolic processes are grouped in cluster 5, which has a constant expression level after IL6 stimulus. Hence, kmeans clustering failed to infer a cluster associated to the downregulation of metabolic processes upon IL6. Cluster 4 shows a characteristic timecourse pattern after IL6 stimulation, but we were not able to reveal any significant biological processes associated to IL6. Altogether, kmeans clustering neither identifies the direct response upon IL6 nor the separation between early and late cell cycle genes. Comparing the result of FunCluster, we also identify a set of coregulated genes associated with cell cycle (Cluster 3; see Figure 6c). Genes grouped in this cluster show an increase in expression after the stimulus. However, FunCluster was also not able to separate the early and late cell cycle processes, observed by GraDe. Genes associated with metabolic processes are grouped in cluster 5, showing a decreasing expression pattern after one hour of stimulation. Therefore, FunCluster also identifies the downregulation of metabolic processes indicating that IL6 reduces expenditures for the energy metabolism. However, FunCluster was not able to identify the primary response of IL6 mediating the production of acute phase proteins (APP) by hepatocytes. Moreover, FunCluster also did not find any significant processes related to the JAKSTAT related genes, such as Stat3, Hamp, Cepbd and Osmr, showing an increased expression pattern.
These results show that the decomposition obtained by GraDe provided much more detailed biological insights than PCA, kmeans clustering or FunCluster. PCA was able to identify three main biological processes upon IL6 stimulus. However, it failed to give a correct timeresolved pattern of these biological processes, whereas sources from GraDe reproduce the characteristic timecourse behavior of the IL6 response. Moreover, GraDe reveals a much more structured and timeresolved result, which allows assigning each source to a different main process.
Robustness analysis
Detailed knowledge about gene regulation is often not available and far from complete. Therefore, the quality of a largescale gene regulatory network is not perfect. In order to test the effect of network errors on the output of GraDe, we performed two robustness analyses. Starting with our TRANSPATH network, we generated randomized versions by either shuffling the network content or adding random information (see Methods). By shuffling edge information of the gene regulatory network between 0.1 and 100% of all original edges, we simulated a loss of information. To quantify robustness, we employed the Amariindex, which measures the deviation between two mixing matrices. We obtained significantly low Amariindices for up to 3% reshuffled edges within the gene regulatory networks (mean Amariindex = 3.83, p = 0:034), whereas a complete randomization of the network results in an Amariindex of 9.63 (see Figure 7a). This shows that the quality of the regulatory network has of course a strong influence on the output of the GraDe algorithm. It is obvious that GraDe depends on the regulatory network, and replacing gene interaction through random information will lead to loss of the signals.
Figure 7. Robustness analysis. Robustness analysis: We evaluated the robustness of GraDe against errors in the underlying graph. To this end, we compared the mixing matrix that we extracted with the TRANSPATH network with those obtained based on perturbed versions. For this comparison we use the Amari index (see Methods). The boxplots show Amariindices obtained with (a) a network rewiring approach and (b) when adding random information to the network. The xaxis shows the amount of information randomized (in %), the yaxis gives the obtained Amariindex. * indicates significant 95% quantiles compared to a random sampling (pvalue ≤ 0.05). We see that GraDe is robust against a reasonable amount of wrong information.
We ran a second robustness analysis by adding random information to the existing gene regulatory network. This is important because we expect largescale networks extracted from literature to contain many falsepositives. Significantly low Amariindices were obtained by adding up to 13% random information (mean Amariindex = 3.94, p = 0:046) to the network (see Figure 7b). This result shows that GraDe is able to detect the signals even after adding a large amount of probably wrong information to the network. The tolerance of the algorithm to the second randomization strategy is much higher, as here no correct information is destroyed. Overall, with both randomization procedures we were able to prove that GraDe is robust against a reasonable amount of both, false positives and missing information.
In addition, we analyzed the noise effect of gene expression data by randomly choosing between one and three replicates for each time point. We found significantly low Amariindices (mean Amariindex = 4.16 p = 0:026) by comparing the 95% quantile of the resulting Amariindex with a random sampling. Thus, GraDe is also robust against biological noise.
Conclusion
IL6 promotes liver regeneration and protects against liver injury. In order to understand these effects in a timeresolved manner, we performed a timecourse microarray experiment of IL6 stimulated primary mouse hepatocytes. Standard techniques applied to this data set only partly revealed temporal gene expression patterns following the stimulation. To resolve the interaction of IL6 and the corresponding cellular responses in more detail, we developed GraDe. It extracts overlapping clusters from largescale biological data by combining a matrix factorization approach with the integration of prior knowledge. Applying GraDe to our experiment, we identified the activation of acute phase proteins, which are known to be one of the primary response upon infection based in ammation. Moreover, we observed that IL6 activates cell cycle progression, as well as the downregulation of genes associated with metabolic processes and programmed cell death. Therefore, IL6 mediated priming renders hepatocytes more responsive towards cell proliferation and reduces expenditures for the energy metabolism.
Methods
IL6 stimulated mouse hepatocytes
RNA probes from primary mouse hepatocytes were assessed with the Bioanalyzer 2100 (Agilent) to ensure that 28S/18S rRNA ratios were in the range of 1.5 to 2.0 and concentrations were comparable between probes. For each time point, 4 g of total RNA were used for the hybridization procedure using the OneCycle Target Labeling Kit (Affymetrix). Fluorescence intensities were acquired with the GeneChip Scanner 3000 and the GCOS software (Affymetrix). GeneChip Mouse Genome 430 2.0 Arrays (Affymetrix) were used in the analysis comprising stimulations with 1 nM IL6 for 1 h, 2 h, 4 h and an unstimulated control (0 h) each performed in triplicates. As a probe level model (PLM) for microarray data an additivemultiplicative error model was used. Data processing was performed using the Limma toolbox [31] provided by Bioconductor [32]. The RMA approach was used for normalization and background correction. Probe sets were filtered out by the genefilter package. A gene was considered as expressed if the signal was above 100 (unlogged data) for at least one time point. Finally, we obtained a data set of 5709 genes. Significantly regulated genes compared to time point 0 h were determined by using the LIMMA (Linear Models for Microarray Data) method [33]. The Limma toolbox uses the moderated tstatistics to identify significant regulated genes. Moreover the moderated tstatistics is advisable for a small number of arrays [33,34]. A gene was determined as significant regulated if the pvalue was <0.05 after multiple testing correction by the BenjaminiHochberg procedure [35]. Raw data are available at GEO with accession number GSE21031.
Gene Regulatory network
In order to link genes along an underlying network we used the TRANSPATH database [36] that provides detailed knowledge of intracellular signaling information based on changes in transcription factor activity.
We searched for direct gene or protein interactions within the TRANSPATH database using the terms: transactivation, increase of abundance, expression, activation, DNA binding, increase of DNA binding, transrepression, decrease of abundance, decrease of DNA binding, and inhibition.
Principle component analysis
For principle component analysis (PCA) we performed an eigenvalue decomposition of the covariance matrix of the data set X. Thereby we obtained a decomposition into an orthonormal source matrix S and an orthogonal mixing matrix A. We applied PCA to the same set of expressed genes as GraDe and also inferred four sources. We defined for each component two submodes by grouping genes with a threshold ≥+2 standarddeviations and a second set of genes having a source weight of ≤2 standarddeviations.
kmeans clustering
In order to ensure a fair comparison of kmeans clustering with GraDe and PCA, we first applied a gene selection step to provide that all methods selected an approximately equal number of genes, as proposed in [12]. We ranked all expressed genes according to their expression variance across the timecourse and then selected the top 15% variable genes. Having the selected genes, clustering was then performed using kmeans clustering [29], where k was set to 8 in order to match the same number of submodes inferred by GraDe and PCA.
FunCluster
In addition to kMeans clustering, we also include a clustering method which incorporates Gene Ontology information into the clustering task. We use the FunCluster method, which performs functional analysis of gene expression data [30,37]. FunCluster detects coregulated biological processes through a specially designed clustering procedure involving biological annotations (GO and KEGG) and gene expression data. We apply the FunCluster implementation provided within the R environment [38] and using standard parameters.
Enrichment analysis
For gene sets grouped in sources obtained by GraDe and PCA or kmeans clusters we performed an enrichment analysis to determine significantly enriched biological processes and pathways. For biological processes we performed a Gene Ontology (GO) [39] term enrichment analysis, which was carried out with the R package GOstats [32]. For pathway enrichment analysis we used nonmetabolic pathways that are manually curated by the Kyoto Encyclopedia of Genes and Genomes (KEGG) [40]. Pathway enrichment was also evaluated with the GOstats package. To correct for multiple testing, we used the BenjaminiHochberg procedure [35] and called an association significant if the pvalue was less than 0.05. To evaluate the mapping of pathways to submodes or clusters we applied the pathway enrichment index (PEI), as proposed by [12]. For each submode or cluster we evaluated the significance of enrichment of a set of genes in a particular pathway by using a hypergeometric test. A pathway association was considered as significant if the pvalue was below 0.05 after multiple testing correction using the BenjaminiHochberg procedure. The PEI was then defined as the fraction of significant pathway mapped to at least one submode or cluster.
Robustness analysis
Robustness analysis was performed by two network randomizations. The gene regulatory network is interpreted as a weighted bipartite graph, i.e. a graph with two sets of nodes (regulators and regulated genes). Weighted edges indicate interactions either activating or inhibiting. First, we randomized existing edge information within the network between 0.1 and 100%. In each step we shuffled 10:000 times the corresponding amount of edges using a degreepreserving rewiring [41,42]. Applying GraDe with the resulting networks we obtained new factorizations. To compare the original and new results in a quantitative way, we used the Amariindex [43]. For each step we took the 95% quantile of the random sampling and calculated a pvalue by comparing this quantile to Amariindices obtained comparing normally distributed random separating matrices and the original mixing matrix. In a second randomization approach, we added 10:000 times new information (edges) between 0.1 and 100% of the original network content and calculated the 95% quantile of the resulting Amariindices. Again, the pvalue was calculated by comparing each quantile with a random sampling.
For analysis of robustness against noise we randomly chose between one and three replicates for each time point and ran GraDe. For each run, we calculated the Amariindex. Again, we compared the 95% quantile of the resulting distribution with a random sampling to obtain the corresponding pvalue.
Authors' contributions
FJT and UK conceptualized the study. AK, FB and FJT designed and implemented the algorithm, AK analyzed and interpreted the expression data. SB, MS and NG performed the experiment. AK, FB, SB, UK and FJT wrote the manuscript. All authors read and approved the final version of the manuscript.
Acknowledgements
We thank S. RoseJohn for generous donation of IL6 and D. Wittmann for stimulating remarks. This work was supported by the Federal Ministry of Education and Research (BMBF) project 'HepatoSys' and the MedSys initiative (project 'LungSys') and the Helmholtz Alliance on Systems Biology (projects 'CoReNe' and 'SBCancer'). FJT gratefully acknowledges financial support by the European Union within the ERC grant LatentCauses.
References

Tarca AL, Carey VJ, Chen Xw, Romero R, Draghici S: Machine learning and its applications to biology.
PLoS Comput Biol 2007, 3(6):e116. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Kosorok MR, Ma S: Marginal asymptotics for the "large p, small n" paradigm: With applications to microarray data.
The Annals of Statistics 2007, 35(4):14561486. Publisher Full Text

Gelman A, Carlin J, Stern H, Rubin D: Bayesian Data Analysis. New York: Chapman and Hall; 2004.

Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM: Systematic determination of genetic network architecture.
Nat Genet 1999, 22(3):2815. PubMed Abstract  Publisher Full Text

Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genomewide expression patterns.
Proc Natl Acad Sci USA 1998, 95(25):148638. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Gasch AP, Eisen MB: Exploring the conditional coregulation of yeast gene expression through fuzzy kmeans clustering.
Genome Biol 2002, 3(11):research0059.1research0059.22. BioMed Central Full Text

Kerr MK, Churchill GA: Experimental design for gene expression microarrays.
Biostatistics 2001, 2(2):183201. PubMed Abstract  Publisher Full Text

Hyvärinen A, Karhunen J, Oja E: Independent Component Analysis. John Wiley & Sons; 2001. Publisher Full Text

Lee DD, Seung HS: Learning the parts of objects by nonnegative matrix factorization.
Nature 1999, 401(6755):78891. PubMed Abstract  Publisher Full Text

Liebermeister W: Linear modes of gene expression determined by independent component analysis.
Bioinformatics 2002, 18:5160. PubMed Abstract  Publisher Full Text

Schachtner R, Lutter D, Knollmüller P, Tomé AM, Theis FJ, Schmitz G, Stetter M, Vilda PG, Lang EW: Knowledgebased Gene Expression Classification via Matrix Factorization.
Bioinformatics 2008, 24(15):16881697. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Teschendorff AE, Journee M, Absil PA, Sepulchre R, Caldas C: Elucidating the altered transcriptional programs in breast cancer using independent component analysis.
PLoS Comput Biol 2007, 3(8):e161. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP: Gene set enrichment analysis: a knowledgebased approach for interpreting genomewide expression profiles.
Proc Natl Acad Sci USA 2005, 102(43):1554550. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Liao JC, Boscolo R, Yang YL, Tran LM, Sabatti C, Roychowdhury VP: Network component analysis: reconstruction of regulatory signals in biological systems.
Proc Natl Acad Sci USA 2003, 100(26):155227. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Boscolo R, Sabatti C, Liao JC, Roychowdhury VP: A generalized framework for network component analysis.
IEEE/ACM Trans Comput Biol Bioinformatics 2005, 2(4):289301. Publisher Full Text

Blöchl F, Theis FJ: Estimating hidden influences in metabolic and gene regulatory networks.

Belouchrani A, AbedMeraim K, Cardoso JF, Moulines E: A blind source separation technique using secondorder statistics.
IEEE Trans Signal Proces 1997, 45(2):434444. Publisher Full Text

Tong L, Inouye Y, Soon VC, Huang YF: Indeterminacy and identifiability of blind identification.
IEEE Trans Circuits Sys 1991, 38:499509. Publisher Full Text

Journal of hepatology 2000, 32(1 Suppl):1931. PubMed Abstract  Publisher Full Text

Theis FJ, MeyerBäse A, Lang EW: Secondorder blind source separation based on multidimensional autocovariances. In Proc {ICA} 2004, Volume 3195 of LNCS, Granada. Spain: Springer; 2004:726733.

Johannes M, Brase JC, Frohlich· H, Gade S, Gehrmann M, Fälth M, Sültmann H, Bei barth T: Integration Of Pathway Knowledge Into A Reweighted Recursive Feature Elimination Approach For Risk Stratification Of Cancer Patients.
Bioinformatics 2010. PubMed Abstract  Publisher Full Text

Blöchl F, Wittmann DM, Theis FJ: Effective parameters determining the information flow in hierarchical biological systems.

Gauldie J, Richards C, Harnish D, Lansdorp P, Baumann H: Interferon beta 2/Bcell stimulatory factor type 2 shares identity with monocytederived hepatocytestimulating factor and regulates the major acute phase protein response in liver cells.
Proc Natl Acad Sci USA 1987, 84(20):72515. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Lu B, Moser A, Shigenaga JK, Grunfeld C, Feingold KR: The acute phase response stimulates the expression of angiopoietin like protein 4.
Biochem Biophys Res Commun 2010, 391(4):173741. PubMed Abstract  Publisher Full Text

Streetx KL, Luedde T, Manns M, Trautwein C: Interleukin 6 and liver regeneration.
Gut 2000, 47(2):309312. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Westwick J, Weitzel C, Minden A, Karin M, Brenner D: Tumor necrosis factor alpha stimulates AP1 activity through prolonged activation of the cJun kinase.
J Biol Chem 1994, 269(42):2639626401. PubMed Abstract  Publisher Full Text

Albrecht JH, Hansen LK: Cyclin D1 promotes mitogenindependent cell cycle progression in hepatocytes.
Cell Growth Differ 1999, 10(6):397404. PubMed Abstract  Publisher Full Text

Ringnér M: What is principal component analysis?
Nat Biotechnol 2008, 26(3):3034. PubMed Abstract  Publisher Full Text

Kaufman L, Rousseew PJ: Finding Groups in Data: An Introduction to Cluster Analysis. Wiley Series in Probaility and Statistics. John Wiley & Sons; 2005.

Henegar C, Cancello R, Rome S, Vidal H, Clement K, Zucker JD: Clustering biological annotations and gene expression data to identify putatively coregulated biological processes.
J Bioinform Comput Biol 2006, 4(4):83352. PubMed Abstract  Publisher Full Text

Smyth GK, Ritchie M, Thorne N, Wettenhall J: Limma: linear models for microarray data. In Bioinformatics and Computational Biology Solutions using R and Bioconductor. Springer; 2005:397420. Publisher Full Text

Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J: Bioconductor: open software development for computational biology and bioinformatics.
Genome Biol 2004, 5(10):R80. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Smyth GK: Linear models and empirical bayes methods for assessing differential expression in microarray experiments.
Stat Appl Genet Molec Biol 2004., 3
Article3

Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response.
Proc Natl Acad Sci USA 2001, 98(9):511621. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Benjamini Y, Hochberg Y: Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.

Krull M, Pistor S, Voss N, Kel A, Reuter I, Kronenberg D, Michael H, Schwarzer K, Potapov A, Choi C, KelMargoulis O, Wingender E: TRANSPATH: an information resource for storing and visualizing signaling pathways and their pathological aberrations.
Nucleic Acids Res 2006, (34 Database):D54651. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Cancello R, Henegar C, Viguerie N, Taleb S, Poitou C, Rouault C, Coupaye M, Pelloux V, Hugol D, Bouillot JL, Bouloumié A, Barbatelli G, Cinti S, Svensson PA, Barsh GS, Zucker JD, Basdevant A, Langin D, Clément K: Reduction of macrophage in filtration and chemoattractant gene expression changes in white adipose tissue of morbidly obese subjects after surgeryinduced weight loss.
Diabetes 2005, 54(8):227786. PubMed Abstract  Publisher Full Text

Team RDC: R: A Language and Environment for Statistical Computing.
R Foundation for Statistical Computing, Vienna, Austria. ISBN 2008., 3
900051070

Camon E, Magrane M, Barrell D, Lee V, Dimmer E, Maslen J, Binns D, Harte N, Lopez R, Apweiler R: The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology.
Nucleic Acids Res 2004, (32 Database):D2626. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Kanehisa M, Araki M, Goto S, Hattori M, Hirakawa M, Itoh M, Katayama T, Kawashima S, Okuda S, Tokimatsu T, Yamanishi Y: KEGG for linking genomes to life and the environment.
Nucleic Acids Res 2008, (36 Database):D4804. PubMed Abstract  PubMed Central Full Text

Maslov S, Sneppen K: Specificity and stability in topology of protein networks.
Science 2002, 296(5569):9103. PubMed Abstract  Publisher Full Text

Hartsperger ML, Blöchl F, Stümpflen V, Theis FJ: Structuring heterogeneous biological information using fuzzy clustering of kpartite graphs.
BMC Bioinformatics 2010, 11:522. PubMed Abstract  BioMed Central Full Text

Cichocki A, Amari S: Adaptive Blind Signal and Image Processing: Learning Algorithms and Applications.. New York: Wiley; 2002. Publisher Full Text