This article is part of the supplement: Italian Society of Bioinformatics (BITS): Annual Meeting 2013: Bioinformatics

Open Access Research

timeClip: pathway analysis for time course data without replicates

Paolo Martini1, Gabriele Sales1, Enrica Calura1, Stefano Cagnin1, Monica Chiogna2 and Chiara Romualdi1*

Author Affiliations

1 Department of Biology, University of Padova, Padova, Italy

2 Department of Statistics, University of Padova, Padova, Italy

For all author emails, please log on.

BMC Bioinformatics 2014, 15(Suppl 5):S3  doi:10.1186/1471-2105-15-S5-S3

The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1471-2105/15/S5/S3


Published:6 May 2014

© 2014 Martini et al.; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Abstract

Background

Time-course gene expression experiments are useful tools for exploring biological processes. In this type of experiments, gene expression changes are monitored along time. Unfortunately, replication of time series is still costly and usually long time course do not have replicates. Many approaches have been proposed to deal with this data structure, but none of them in the field of pathway analysis. Pathway analyses have acquired great relevance for helping the interpretation of gene expression data. Several methods have been proposed to this aim: from the classical enrichment to the more complex topological analysis that gains power from the topology of the pathway. None of them were devised to identify temporal variations in time course data.

Results

Here we present timeClip, a topology based pathway analysis specifically tailored to long time series without replicates. timeClip combines dimension reduction techniques and graph decomposition theory to explore and identify the portion of pathways that is most time-dependent. In the first step, timeClip selects the time-dependent pathways; in the second step, the most time dependent portions of these pathways are highlighted. We used timeClip on simulated data and on a benchmark dataset regarding mouse muscle regeneration model. Our approach shows good performance on different simulated settings. On the real dataset, we identify 76 time-dependent pathways, most of which known to be involved in the regeneration process. Focusing on the 'mTOR signaling pathway' we highlight the timing of key processes of the muscle regeneration: from the early pathway activation through growth factor signals to the late burst of protein production needed for the fiber regeneration.

Conclusions

timeClip represents a new improvement in the field of time-dependent pathway analysis. It allows to isolate and dissect pathways characterized by time-dependent components. Furthermore, using timeClip on a mouse muscle regeneration dataset we were able to characterize the process of muscle fiber regeneration with its correct timing.

Background

Time course gene expression experiments are widely used to study the dynamics of biological processes. Usually, the main goal of such experiments is to identify genes modulated along a biological process or after a system perturbation (such as drug treatments or genetic modifications). However, time course data are costly and usually long time series have few or no replicates. In this context a differentially expressed gene can be defined as a gene with the expression profile changing significantly along time and/or across multiple conditions. Several statistical models have been proposed to account for clusters and differential expression in the contest of time series with [1-18] and without replicates [10,19-21], but none of them were proposed in the context of pathway analysis. Pathway analysis has acquired great relevance in the last years especially for the ability to increase interpretability of gene expression results. Expression experiments typically provide lists of differentially expressed genes (DEGs) that represent the starting point for result interpretation. This step is not trivial and remains challenging for this type of analysis. The grouping of genes into functionally related entities (such as pathways) is of great help in the interpretation of the results. Several methods have been proposed to this aim, based on very different statistical tests and null hypotheses [22,23]. Broadly speaking, they can be divided into the classical enrichment analysis [24-28], working on gene lists selected through a gene-level test, and the novel global and multivariate approaches [29-37], that define a model for the whole gene set (see [22,38-40] for a comprehensive reviews and comparative analysis). The latter can be further divided into 'topological' and 'non-topological' methods according to their ability to gain power from the topology of the pathway [25,35,36,41-43].

A pathway is a complex structure comprising chemical compounds mediating interactions and different types of gene groups (e.g. protein complexes or gene families) that are usually represented as single nodes but whose measures are not available using gene expression data. However, after appropriate biologically-driven conversion [44,45], a biological pathway can be represented as a graph where genes and their interactions are, respectively, nodes and edges of the graph.

Taking advantage of the structure of the graph, Massa et al. [35] used Gaussian graphical model theory to test both differences in mean and in covariance matrices between two experimental conditions. In particular, graphical models are useful to decompose the overall graph (obtained from a pathway) into smaller components (cliques), that can be explored and tested in detail. Martini et al. [36] proposed an extension of this method, called CliPPER, based on a two-step empirical approach. In the first step, it selects pathways with covariance matrices and/or means significantly different between experimental conditions dealing with the p >> n case; in the second step, it identifies the sub-paths (called signal paths) most associated with the phenotype.

Pathway analysis is mainly tailored to two-groups comparisons and few efforts have been dedicated to the time course design. Here, we propose a modification of [36], called timeClip, to deal with long time course data without replicates. Specifically, timeClip combines principal component analysis, regression models and graph decomposition to explore temporal variations across and within pathways. Moreover, timeClip implements an easy and effective visualization of the dynamics of the pathways.

On simulated datasets, timeClip shows good performances in term of power, specificity and sensitivity. Using real data on mouse muscle regeneration [46], we obtain excellent results in agreement with the scientific literature.

Method

Pathway annotation

A critical step in the field of topology based pathway analyses is the availability and the quality of the pathway topology. Our group has recently developed graphite a Bioconductor package for the storage, interpretation and conversion of pathway topology to gene-only networks [44]. graphite discriminates between different types of biological gene groups and propagates gene connections through chemical compounds. Specifically, protein complexes are expanded into a clique (all proteins connected to the others), while the gene families are expanded without connections among them; see [44,45] for more details. The current version of graphite Bioconductor package is limited to human, so here we build a dedicated graphite package for mouse KEGG pathways. This package is available at http://romualdi.bio.unipd.it/wp-uploads/2013/10/graphite.mmusculus_0.99.2.tar.gz webcite.

timeClip: general approach

A pathway is composed by multiple genes so to reduce the dimension of a whole or of a portion of a pathway, we used principal component analysis. Then the first principal component is explored for temporal variation. A vast amount of techniques exist for analyzing regularly sampled time series. Unfortunately, the irregular sampling of the values (a common practice in biology) makes direct use of such estimation techniques impossible. To avoid the well known biases associated with the most common approach for irregularly sampled time series based on transforming unevenly-spaced data into equally spaced observations using some form of interpolation, here we propose to use a regression model combining a polynomial trend and a continuous-time Gaussian autoregressive process of order 1 (AR(1)). Then, timeClip resembles the two-steps approach of CliPPER. In the first step, the whole pathway is explored for its temporal variation. If the pathway is defined as time-dependent, in the second step, timeClip decomposes the pathway into a junction tree and highlights the portion mostly dependent on time. A general schema of the approach is summarized in Figure 1.

thumbnailFigure 1. timeClip. Global overview of timeClip approach.

Step 1: exploring the whole pathway

Let Xn × t be the normalized log transformed gene expression matrix with genes on the rows and experiments (equal to time points t) on the columns. Let <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/S5/S3/mathml/M20','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/S5/S3/mathml/M20">View MathML</a> the sub-matrix of genes belonging to pathway P. Pathway P has p genes. Then, on the transpose of XP, XP', we perform principal component analysis (PCA). We used both the classical (R package stats) and the robust (rrcov R package) version of PCA. Let <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/S5/S3/mathml/M21','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/S5/S3/mathml/M21">View MathML</a> be the scores matrix and <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/S5/S3/mathml/M22','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/S5/S3/mathml/M22">View MathML</a> the loadings matrix. We call <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/S5/S3/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/S5/S3/mathml/M1">View MathML</a> the p principal components. In this way, the first PCs summarize the temporal variation of the genes in pathway P (if present). Thus, from now on we will indicate <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/S5/S3/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/S5/S3/mathml/M2">View MathML</a> as <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/S5/S3/mathml/M3','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/S5/S3/mathml/M3">View MathML</a>. A similar approach was recently proposed by [15] (PCA-maSigFun). PCA-maSigFun uses principal component analysis to identify temporally-homogeneous groups of gene within the pathway.

Then, for irregularly sampled time series we assume that our irregularly sampled signal <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/S5/S3/mathml/M4','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/S5/S3/mathml/M4">View MathML</a> can be decomposed as Z(t) = p(t) + ∈(t), where p(t) is a deterministic function, hereafter called "trend", and (t) is the realization of a stationary stochastic process with mean zero. Extensive exploratory analysis suggests that a reasonable choice for the trend component is a polynomial of degree 2 in t, i.e.,

<a onClick="popup('http://www.biomedcentral.com/1471-2105/15/S5/S3/mathml/M5','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/S5/S3/mathml/M5">View MathML</a>

with β1 capturing existing temporal behaviors of <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/S5/S3/mathml/M6','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/S5/S3/mathml/M6">View MathML</a> and β2 correcting for potential non linearities.

Moreover, we assume that ∈(t) follows a continuous-time Gaussian autoregressive process of order 1. The model is fitted using generalized least squared (as implemented in nlme R package). The representative p - value of pathway P, pP, is then taken to be the p - value of the test of nullity of β1 (obtained by a t-test as implemented in the gls function of the nlme R package). Bonferroni correction is used to adjust p - values for multiple tests.

We evaluated the possibility to fit a polynomial regression not only on the first PC, but also on few additional <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/S5/S3/mathml/M7','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/S5/S3/mathml/M7">View MathML</a>, with i = 2, 3. However, we did not find significant improvements in the final list of significant time-dependent pathways (data not shown).

Step 2: decomposing the pathway

Pathways declared as time-dependent in step 1 are then moralized, triangulated and decomposed into a junction tree as described in [36].

Briefly, moralization inserts an undirected edge between two nodes that have a child in common and then eliminates directions on the edges; triangulation inserts edges in the moralized graph so that in the moralized graph all cycles of size ≥ 4 have chords, where a chord is defined as an edge connecting two non-adjacent nodes of a cycle. A clique in the triangulated graph is a complete subgraphs having all its vertices joined by an edge while a junction tree construction is a hyper-tree having cliques as nodes and satisfying the running intersection property according to which, for any cliques C1 and C2 in the tree, every clique on the path connecting C1 and C2 contains C1 C2 [36,47]. For a given graph there could be more than one junction tree. Here we force the root of the junction tree to be in agreement with the structure of the pathway.

A clique k of pathway P, noted as <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/S5/S3/mathml/M11','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/S5/S3/mathml/M11">View MathML</a> (with k = 1,..., K), is composed by a subset of genes in P, <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/S5/S3/mathml/M9','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/S5/S3/mathml/M9">View MathML</a>. Let <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/S5/S3/mathml/M10','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/S5/S3/mathml/M10">View MathML</a> be the sub-matrix of X corresponding to the genes of the clique <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/S5/S3/mathml/M11','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/S5/S3/mathml/M11">View MathML</a>. For each clique k of P we apply the same approach as described in step 1: PCA transformation and then a linear model with polynomial trend and autoregressive process of order 1 on the first PCs. The p - value of clique k in pathway P, <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/S5/S3/mathml/M12','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/S5/S3/mathml/M12">View MathML</a> is given by the p of the β1 of the polynomial regression. Finally, the best time-dependent paths within a pathway P, hereafter called <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/S5/S3/mathml/M13','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/S5/S3/mathml/M13">View MathML</a>, j = 1,..., J, are identified using the relevance measure as described in [36]. Briefly, a path is a chain of consecutive time-dependent cliques <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/S5/S3/mathml/M14','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/S5/S3/mathml/M14">View MathML</a> with gaps at most of size one. Then, for each path in the pathway a cumulative score is calculated along the path: lower the the p -- value of a clique in the path, higher the contribution to the score, in case of gap the score is penalized. The final score of a path is the maximum value reached by the score along the path. Then, the score is normalized for the path length; this quantity is called relevance [36].

As final results, for each time-dependent pathway, we report a list of relevant paths, ranked according to their relevance. Currently, step 2 is the most innovative feature of timeClip and, as far as we known, there are no existing tools using a similar strategy.

Simulated data

As some paths may be declared time-dependent by timeClip step 2 simply as a consequence of type I errors in timeClip step 1, we used a simulation to evaluate the percentage of false positives under the null hypothesis and to estimate the statistical power in different scenarios.

False positive rate estimation

Given a pathway P and its graph structure (G), for 1,000 runs we randomly generate a gene expression matrix Xn × t from a multivariate normal distribution with zero mean and variance ∑, with <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/S5/S3/mathml/M15','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/S5/S3/mathml/M15">View MathML</a> (where S+(G) is the set of symmetric positive definite matrices with null elements corresponding to the missing edges of G). In this case, gene expression profiles are time independent. Then, for each run we calculate pP (either for the case of irregularly and regularly sampled time points, see Section Step 1: exploring the whole pathway). Under this scenario, at the nominal level α = 0.05 we expect a number of rejections around 5%. We repeat the simulation for different values of n (n = 5, 10, 15, 20, 25, 30) and t (t = 5, 10, 15, 20, 30).

Power estimation

In order to be sure that the model were able to identify time-dependency coming from different models, we simulate data using polynomial models, autoregressive models of order 1 and a combination of both (polynomial models with autocorrelated errors). Then, the power is estimated for irregularly and regularly sampled time points.

Given a pathway P and its graph structure (G), for 1,000 runs we randomly generate a gene expression matrix X(n s) × t from a multivariate normal distribution with zero mean and variance ∑ with <a onClick="popup('http://www.biomedcentral.com/1471-2105/15/S5/S3/mathml/M15','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/S5/S3/mathml/M15">View MathML</a>. Then, the expression profiles of the remaining s genes, with s n are simulated to have different degree of time-dependency. Specifically, we use polynomial models (Equation 1), autoregressive models of order 1 (Equation 2, where * is a white noise) and the combination of both (Equation 3, where an AR(1)).

<a onClick="popup('http://www.biomedcentral.com/1471-2105/15/S5/S3/mathml/M17','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/S5/S3/mathml/M17">View MathML</a>

(1)

<a onClick="popup('http://www.biomedcentral.com/1471-2105/15/S5/S3/mathml/M18','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/S5/S3/mathml/M18">View MathML</a>

(2)

<a onClick="popup('http://www.biomedcentral.com/1471-2105/15/S5/S3/mathml/M19','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/15/S5/S3/mathml/M19">View MathML</a>

(3)

The coefficients α* are independently generated from a U(−5, 5), and φi are generated so as to achieve stationarity. In this way, we simulate expression profiles with different degrees of temporal variations. Then, for each run we calculate pP (see Section Step 1: exploring the whole pathway). Under this scenario, the number of rejection estimates the statistical power. We repeat the simulation for different combinations of φ, n, s and t.

Real data: muscle regeneration model

The benchmark dataset used [46] (GSE469) follows mouse muscle regeneration after intra-muscluar injection of cardiotoxin. Regeneration process is followed for 27 unevenly spaced time-points with only two technical replicates for each time-point. Expression data were produced using single channel Affimetrix microarrays. The probes in the platform were annotated with EntrezGene custom CDF (version 14) [48] and data was normalized using the robust multi array analysis (rma) and quantile normalization. Then, technical replicates were averaged to get one measure for every time-point.

Implementation and visualization: the wheel of time

timeClip is implemented as an R package available from the authors. The package allows to analyze equally and non-equally spaced time series according to the user setting. To get better insights into the temporal activation of the different portions of the pathway, we develop a new way of visualization using Cytoscape software [49] and Rcytoscape Bioconductor package. The visualization, called the wheel of time, allows visualizing pie charts inside network nodes. For each pathway, timeClip exports in Cytoscape the structure of the junction tree where each time-dependent clique has a pie chart that represents the time trend. Specifically, the pie is divided into as many slices as the number of time points in the dataset. Each slice in the pie is colored (from green to red) according to the scores of first principal component: the higher the value, the stronger the activation of a clique in a specific time point (red color) and viceversa (green).

Results and discussion

Many biological processes need to be followed and monitored along time. In these cases time course designs are ideals: higher the number of time points, finest the monitoring process. However, long time courses are often characterized by small or no replicates. Here, we present timeClip, a two-step approach to perform topological pathway analysis for time course gene expression data, specifically tailored to long time series without replicates (Figure 1). In the first step, we select pathways that show time dependency. In the second step, the selected pathways are decomposed into cliques and the time-dependent portions are isolated. In the next sections, we will show the performance of timeClip using simulated and real datasets.

Simulations results

Two simulation strategies have been considered. The first one was designed to estimate the number of false positives under the null hypothesis of no temporal variation, the second to estimate the statistical power (see section method for details).

Table 1 and Table S1 (Additional file 1) report the percentage of false positives obtained with different n and t for the irregularly and regularly sampled time points, respectively. The average false positive percentage for each t and n is always limited to ~4-5%, with the exception of small time series (t = 5) and equally spaced time points where it is slightly higher. Thus, we can conclude that, in general, for long time series we have an excellent control of type I error even with exceptionally low sample sizes.

Table 1. Simulation results - False positives rate with different pathway dimensions n and irregularly sampled time points t.

Additional file 1. Additional tables. This file contains additional tables mention on the text (pdf format).

Format: PDF Size: 93KB Download file

This file can be viewed with: Adobe Acrobat ReaderOpen Data

Table 2 and Table S2 (Additional file 1) report the number of true positives obtained with n = 30 and different t and s for equally and not-equally spaced time points respectively. Here, the genes with temporal variation are simulated using different models (if s is the number of time-dependent genes among the n of the pathway, we simulate s/3 with polynomial, s/3 with AR(1) and s/3 with the combination of both). As expected, the power increases with the increase of t and s: the longer the time course and the higher the number of time dependent genes s within the pathway, the higher the power.

Table 2. Simulation results - Power estimate in case of n = 30 and different time course length t and time dependent genes s.

Specifically, when the time course is short (t = 10 − 20) the maximum power reaches 60%, while with long time series t = 30 the power is above 80%. Moreover, it is worth noting that the increase of the time dependent genes does not affect significantly the power level. The greater impact that the number of time points has on statistical power with respect to the number of time-depending genes can be explained by the presence of two steps in our strategy: i) a data reduction step (with PCA on genes within pathways) and ii) a model-fitting step of the reduced variables on time points. PCA is an efficient method to detect variance components in the data. Thus, even in case of a small number of time-dependent genes, the first PC is able to capture the time trend when present. On the other hand, once the trend is captured, the goodness of fit of the regression model increases by increasing the number of time points. The use of robust PCA does not change the performance of the method substantially (data not shown).

Case study: muscle regeneration model

Step 1 results

In step 1 every pathway is explored for its temporal dependence. In the benchmark dataset, we have to deal with 27 not equally spaced times (14 of which are equally spaced).

Comparing step 1 results for equally and not equally spaced time-point we obtain an overlap of 70%. This high degree of overlap makes us confident about the reliability of our approach. We summarized the results in the heat map of Figure 2 (values reported in Additional file 2). The heat map is obtained using the scores of the first principal component of each time-dependent pathway. From the unsupervised cluster analysis, we can define 3 pathway groups characterized respectively by a 'very early','early-intermediate' and 'intermediate-late' activation. Pathways characterized by a very early activation like 'Malaria' and 'African trypanosomiasis' reflect the early activation of the inflammation processes deputed to clean injured fibers. These processes are carried-out by macrophages that have a central role in the 'Malaria' and 'Africa trypanosomiasis' pathways. Macrophages clean up injured fiber and release growth factors like vascular endothelial growth factor (VEGF) and hepatocyte growth factor (HGF) [50].

thumbnailFigure 2. Heat map of pathway PCs. Heat map colored according to the expression of the first PCs from green to red. According to the color pattern, pathways are divided in early, early-intermediate and late-intermediate. Time is measured in hours after treatment.

Additional file 2. Figure 2 Heat map values. This file contains the values used to create the heat map in Figure 2. In column 2 the pathways that are called significant also by PCA-maSigFun with an alpha <= 0.05 are marked with "*". In addition p - values and adjusted p - values (Bonferroni) for timeClip are show in col 3 and 4 (tab delimited format).

Format: TXT Size: 28KB Download fileOpen Data

In the early-intermediate pathway group, we can see the effects of the early signal secretion: in fact, the group contains pathways like 'mTOR signaling pathway', 'VEGF signaling pathway', 'Insulin signaling pathway' and other metabolic pathways like 'Ether lipid metabolism' and 'Citrate cycle (TCA cycle)'. Globally, these pathways indicate that the regeneration progress has begun.

'mTOR signaling pathway', probably the most important pathway in the muscle regeneration, on one side sustains VEGF signaling and on the other promotes protein production needed for clonal expansion of the myoblasts, their growth and fusion. In particular, mTOR integrates growth factor signaling with a variety of signals from nutrients (amino acids metabolism activate mTOR pathway) and cellular energy status [51]. The energy status of the cell is indeed monitored by those pathways involved in energy metabolism like 'carbohydrate digestion and adsorption', 'Citrate cycle (TCA cycle)' and 'Fatty acid metabolism'. These processes are very important in the regeneration process, in fact, it was demonstrated that glycolitic metabolism is restored after three days from myofibril formation [52].

Intermediate-late activation pathways mainly present pathways involved in inflammatory responses like 'B and T Cell receptor signaling pathway', 'Toll-like receptor signaling pathway', 'Adipocytokine signaling pathway' and 'Leukocyte transendothelial migration'. Recent discoveries reveal complex interactions between skeletal muscle and the immune system that regulate all phases of the muscle regeneration [50]. Moreover in this pathway group there is the 'Axon guidance' and 'Dopaminergic synapse' pathways that are involved in nervous impulse transduction. We can speculate that at the end of the regenerative processes nervous system can contact the restored contractile cells to ensure and maintain their functionality.

This contains also pathways involved in signaling transduction like 'HIF-1 signaling pathway'. HIF-1 has been recently demonstrated to be essential for skeletal muscle regeneration in mice [53]. In fact this pathway manages a plethora of signals and interface with pathways like mTOR signaling pathway, PI3K-Akt signaling pathway, MAPK signaling pathway, Citrate cycle (TCA scycle), Calcium signaling pathway, VEGF signaling pathway and Ubiquitin mediated proteolysis. Together with all these pathways, 'HIF-1 signaling pathway' finely tune the balance between oxygen consumption.

In step 1, we are able to see only the strongest signals and not always the pathway name alone reflects the activity of the pathway. To tackle the complexity of the pathway, timeClip step 2 deeply investigates the timing activation of different portion of the pathway.

Step 2 results

In the second step, we focused on the the Akt-mammalian target of rapamycin (mTOR) signaling pathway. It regulates a pletora of signals: cell growth, VEGF signaling pathway, autophagy and its action is related to other pathways known to be involved in the muscle regeneration like Insulin signaling pathway and MAPK signaling pathway [54].

The junction tree of mTOR signaling pathway (Figure 3A) starts with Igf1 (Insulin-like growth factor 1) as represented in the KEGG map (Figure 3B). Within mTOR signaling pathway we identified a total of 6 paths, ranked by their relevance score (Table 3).

thumbnailFigure 3. Activation of the mTOR signaling pathway. Panel A. Junction tree of the mTOR signaling pathway (using graphite R package and database KEGG). The top ranked time-dependent paths identified in timeClip step 2 are highlighted using the wheel of time visualization. Panel B. KEGG representation of mTOR signaling pathway. Genes are colored according to the paths in panel A. Panel C. Enlargement of the wheels of time representative of the main block of mTOR signaling pathway: from t0 to t27 (clock-wise) every slice of the pie is colored according to the value of the clique first PC (green means no activation; red means activation).

Table 3. mTOR signaling pathway: relevant paths identified by timeClip step 2

The most relevant of these paths goes from the 1st to the 21st clique and contains 16 cliques. The second and the third path share a big portion with the first one. This big portion goes from clique 1 to cliques 13 (blue nodes on the junction tree - Figure 3A) and contains genes like Igf1, Insulin, Mapk3, Mtor and Akt that globally represent the backbone of the pathway where the starting activating signal is regulated by Igf1. Then Pi3k, Mapk and Akt translate the signal and activate Mtor that organize the effectors. From the junction tree we can identify three different terminal effectors: the first, in pink, is the portion that brings to the VEGF signaling pathway. The second, in purple, is the regulation of autophagy and the third, in yellow, is the regulation of protein synthesis that is necessary for the skeletal muscle mass recovery during regenerating processes [55]. In the panel C of Figure 3, we summarized the timing of the 'mTOR signaling pathway' activation. With the wheel of time, we can see that the pathway backbone is activated in the early phases. The portion that brings to the VEGF signaling pathway is activated in the late phases. The effectors that bring to authophagy are switched off at the end of the regenerative precess while the activation of the protein synthesis begun from the early-intermediate phases and last till the end of the process.

Recently, as discussed before, it was demonstrated the involvement of HIF-1 in the skeletal muscle regeneration process [53]. We observed that the most relevant path of HIF-1 signaling pathway is 37 cliques long underlining its importance in this process. This path is activated by different growth factors (Igf, Ins, Egf) and signals are translated through Akt and mTOR towards HIF-1α/β. Hif-1α regulated many processes from the oxygen balance to apoptosis (See Additional file 3). Such downstream effectors confirm its importance in skeletal muscle regeneration in accordance with results obtained from [53].

Additional file 3. Activation of the HIF-1 signaling pathway. KEGG representation of HIF-1 signaling pathway. Genes of the 37 clique long path colored in cyan.

Format: PDF Size: 198KB Download file

This file can be viewed with: Adobe Acrobat ReaderOpen Data

Comparison with other methods

In this section we compare timeClip step 1 results with the methods proposed by [15]. Step 2, that is the most innovative feature of timeClip, cannot be compared to any existing tool. [15] proposed two different strategies. The first one, called maSigFun, considers individual genes as different observations of the expression profile of the pathway. The second approach PCA-maSigFun uses PCA to identify groups of genes showing different time-dependencies. maSigFun did not give any significant time-dependent pathway using our dataset describing skeletal muscle regeneration (p ≤ 0.05), while PCA-maSigFun returned 59 significant KEGG pathways (p ≤ 0.05). 26 out of 59 (44%) pathways are in common with timeClip step 1 results. Indeed, both the methods retrieve mTOR signaling pathway, however PCA-maSigFun did not call HIF signaling pathway as significant, although it seems to be closely related to the muscle regeneration [53]. Most of the PCA-maSigFun specific pathways (15 out of 33) referred to metabolic processes like Inositol phosphate methabolism, Pyruvate metabolism, Tyrosine metabolism, Glycerolipid metabolism. The remaining pathways are highly heterogeneous and comprise Acute myeloid leukemia, Bladder cancer, Melanoma, Pancreatic cancer.

Conclusions

Pathway analysis is a useful and widely used statistical approach to test groups of genes between two or more biological conditions. Although many efforts have been dedicated to implement novel gene set analysis in a multivariate and topological contexts, few of them deal with time course experiments. Time course experiments are used to monitor the dynamics of biological processes under physiological conditions or after perturbations.

In this context there is a clear trade-off between the number of time points and the number of replicates. In general, if the goal of the study is the identification of time-dependency, long time course are required at the expense of replicates; on the other hand, if the goal is the characterization of short term response a large number of replicates for each time point is required to increase statistical power. In general, there are few long time series datasets and in our opinion this is partly due to the experimental costs but also to the lack of effective methods to study and interpret results. Here, we present timeClip, an empirical two-step approach specifically tailored to long time course gene expression data without replicates. Using simulated data timeClip shows good performance in terms of controlling type I error and power. Furthermore, we successfully identify most of the key pathways involved in the early, middle and late phases of the skeletal muscle regeneration process. A visualization tool has also been implemented to tackle the dynamics of the transcriptome.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

CR define the concept proposed. PM developed the proposed methods and performed the analysis on gene expression data. GS and EC developed the computational infrastucture for pathway topology retrieval. CR and MC supervised the work and wrote the paper. SC participated in the biological discussion of the results. All Authors read and approved the final manuscript.

Acknowledgements

The authors acknowledge the CARIPARO Foundation (Project for Excellence 2012: 'Role of coding and non-coding RNA in chronic myeloproliferative neoplasms: from bioinformatics to translational research') and the CRIBI Center for high performance computing resources funded by the Regione Veneto (RISIB project SMUPR n. 4145). The authors want to thank the University of Padova [CPDR075919 and CPDA119031 to C.R.; CPDR070805 to G.S.] and the Italian Association for Cancer Research [AIRC fellowship n.14982 to EC] for support of this work.

Declarations

Pubblication of this article was founded by the University of Padova and CARIPARO Foundation.

This article has been published as part of BMC Bioinformatics Volume 15 Supplement 5, 2014: Italian Society of Bioinformatics (BITS): Annual Meeting 2013. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/15/S5

References

  1. Park T, Yi S, Lee S, Lee S, Yoo D, Ahn J, Lee Y: Statistical tests for identifying differentially expressed genes in time course microarray experiments.

    Bioinformatics 2003, 19:694-703. PubMed Abstract | Publisher Full Text OpenURL

  2. Smyth G: Limma: linear models for microarray data.

    Bioinformatics and computational biology solutions using R and Bioconductor 2005, 12837-12842. OpenURL

  3. Tai Y, Speed T: A multivariate empirical Bayes statistic for replicated microarray time course data.

    Ann Stat 2006, 34:2387-2412. Publisher Full Text OpenURL

  4. Yuan M, Kendziorski C: Hidden Markov Models for Microarray Time Course Data in Multiple Biological Conditions.

    J Am Stat Assoc 2006, 101(476):1323-1332. Publisher Full Text OpenURL

  5. Sun W, Wei Z: Multiple Testing for Pattern Identification, With Applications to Microarray Time-Course Experiments.

    J Am Stat Assocsss 2011, 106:73-88. Publisher Full Text OpenURL

  6. Ramsay J, Silverman B:

    Functional data analysis. 2005., 2005 OpenURL

  7. Coffey N, Hinde J: Analysing time-course microarray data using functional data analysis - A review.

    BMC Bioinf 2011, 10:23. OpenURL

  8. Xu X, Olson J, Zhao L: A regression-based method to identify differentially expressed genes in microarray time course studies and its application in an inducible Huntington's disease transgenic model.

    Human Mol Genet 2002, 11(17):1977-1985. Publisher Full Text OpenURL

  9. Bar-Joseph Z, Gerber G, Simon I, Gifford D, Jaakkola T: Comparing the continuous representation of time-series expression profiles to identify differentially expressed genes.

    Proc Nat Acad Sci USA 2003, 100:10146-10151. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  10. Storey J, Xiao W, Leek J, Tompkins R, Davis R: Significance analysis of time course microarray experiments.

    Proc National Acad Sci USA 2005, 102(36):12837-12842. Publisher Full Text OpenURL

  11. Hong F, Li H: Functional hierarchical models for identifying genes with different time-course expression profiles.

    Biometrics 2006, 62:534-544. PubMed Abstract | Publisher Full Text OpenURL

  12. Liu X, Yang M: Identifying temporally differentially expressed genes through functional principal component analysis.

    Biostatistics 2009, 10:667-679. PubMed Abstract | Publisher Full Text OpenURL

  13. Chen K, Wang J: Identifying differentially expressed genes for time-course microarray data through functional data analysis.

    Stat Biosci 2010, 2:95-119. Publisher Full Text OpenURL

  14. Ma P, Zhong W, Liu J: Identifying differentially expressed genes in time course microarray data.

    Stat Biosci 2009, 1:144-159. Publisher Full Text OpenURL

  15. Nueda M, Sebastian P, Tarazona S, Garcia-Garcia F, Dopazo J, Ferrer A, Conesa A: Functional assessment of time course microarray data. [http://www.biomedcentral.com/1471-2105/10/S6/S9] webcite

    BMC Bioinformatics 2009, 10(Suppl 6):S9. BioMed Central Full Text OpenURL

  16. Schliep A, Costa IG, Steinhoff C, Sch?nhuth A: Analyzing Gene Expression Time-Courses.

    IEEE/ACM Transactions on Computational Biology and Bioinformatics 2005, 2(3):179-193. Publisher Full Text OpenURL

  17. Ramoni MF, Sebastiani P, Kohane IS: Cluster analysis of gene expression dynamics.

    Proceedings of the National Academy of Sciences 2002, 99(14):9121-9126. Publisher Full Text OpenURL

  18. Son YS, Baek J: A modified correlation coefficient based similarity measure for clustering time-course gene expression data.

    Pattern Recogn Lett 2008, 29(3):232-242. Publisher Full Text OpenURL

  19. Han X, Sung W, Feng L: Identifying differentially expressed genes in time-course microarray experiment without replicate.

    J Bioinf Comput Biol 2007, 5:281-296. Publisher Full Text OpenURL

  20. Billups S, Neville M, Rudolph M, Porter W, Schedin P: Identifying significant temporal variation in time course microarray data without replicates. [http://www.biomedcentral.com/1471-2105/10/96] webcite

    BMC Bioinformatics 2009, 10:96. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  21. Wu S, Wu H: More powerful significant testing for time course gene expression data using functional principal component analysis approaches. [http://www.biomedcentral.com/1471-2105/14/6] webcite

    BMC Bioinformatics 2013, 14:6. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  22. Goeman JJ, Buhlmann P: Analyzing gene expression data in terms of gene sets: methodological issues.

    Bioinformatics 2007, 23(8):980-987. PubMed Abstract | Publisher Full Text OpenURL

  23. Dinu I, Potter JD, Mueller T, Liu Q, Adewale AJ, Jhangri GS, Einecke G, Famulski KS, Halloran P, Yasui Y: Gene-set analysis and reduction.

    Brief Bioinform 2008, 10:24-34. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  24. Draghici S, Khatri P, Tarca AL, Amin K, Done A, Voichita C, Georgescu C, Romero R: A systems biology approach for pathway level analysis.

    Genome Research 2007, 17(10):1537-1545. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  25. Tarca AL, Draghici S, Khatri P, Hassan SS, Mittal P, Kim Js, Kim CJ, Kusanovic JP, Romero R: A novel signaling pathway impact analysis.

    Bioinformatics 2009, 25:75-82. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  26. Hosack D, Dennis G, Sherman B, Lane H, Lempicki R: Identifying biological themes within lists of genes with EASE.

    Genome Biol 2003, 4:R70. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  27. Khatri P, Draghici S: Ontological analysis of gene expression data: current tools, limitations, and open problems.

    Bioinformatics 2005, 21:3587-3595. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  28. Vencio R, Shmulevich I: ProbCD: enrichment analysis accounting for categorization uncertainty.

    BMC Bioinformatics 2007, 8:383. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  29. Emmert-Streib F: The Chronic Fatigue Syndrome: A Comparative Pathway Analysis.

    Journal of Computational Biology 2007, 14(7):961-972. PubMed Abstract | Publisher Full Text OpenURL

  30. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP: Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles.

    Proceedings of the National Academy of Sciences of the United States of America 2005, 102(43):15545-15550. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  31. Mansmann U, Meister R: Testing Differential Gene Expression in Functional Groups. Goeman's Global Test versus an ANCOVA Approach.

    Methods of Inf Med 2005, 44:449-53. OpenURL

  32. Tsai CA, Chen JJ: Multivariate analysis of variance test for gene set analysis.

    Bioinformatics 2009, 25:897-903. PubMed Abstract | Publisher Full Text OpenURL

  33. Dinu I, Potter J, Mueller T, Liu Q, Adewale A, Jhangri G, Einecke G, Famulski K, Halloran P, Yasui Y: Improving gene set analysis of microarray data by SAM-GS.

    BMC Bioinformatics 2007, 8:242. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  34. Tian L, Greenberg SA, Kong SW, Altschuler J, Kohane IS, Park PJ: Discovering statistically significant pathways in expression profiling studies.

    Proceedings of the National Academy of Sciences of the United States of America 2005, 102(38):13544-13549. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  35. Massa MS, Chiogna M, Romualdi C: Gene set analysis exploiting the topology of a pathway.

    BMC Systems Biology 2010, 4:121. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  36. Martini P, Sales G, Massa MS, Chiogna M, Romualdi C: Along signal paths: an empirical gene set approach exploiting pathway topology. [http://nar.oxfordjournals.org/content/41/1/e19.abstract] webcite

    Nucleic Acids Research 2013, 41:e19. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  37. Goeman JJ, van de Geer SA, de Kort F, van Houwelingen HC: A global test for groups of genes: testing association with a clinical outcome. [http://bioinformatics.oxfordjournals.org/content/20/1/93.abstract] webcite

    Bioinformatics 2004, 20:93-99. PubMed Abstract | Publisher Full Text OpenURL

  38. Ackermann M, Strimmer K: A general modular framework for gene set enrichment analysis. [http://www.biomedcentral.com/1471-2105/10/47] webcite

    BMC Bioinformatics 2009, 10:47. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  39. Liu Q, Dinu I, Adewale A, Potter J, Yasui Y: Comparative evaluation of gene-set analysis methods.

    BMC Bioinformatics 2007, 8:431. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  40. Nam D, Kim SY: Gene-set approach for expression pattern analysis.

    Brief Bioinform 2008, 9:189-197. PubMed Abstract | Publisher Full Text OpenURL

  41. Laurent J, Pierre N, Dudoit S: Gains in Power from Structured Two-Sample Tests of Means on Graphs.

    Annals of Applied Statistics 2012, in press. OpenURL

  42. Antonov AV, Schmidt EE, Dietmann S, Krestyaninova M, Hermjakob H: R spider: a network-based analysis of gene lists by combining signaling and metabolic pathways from Reactome and KEGG databases.

    Nucleic Acids Research 2010, 38(suppl 2):W78-W83. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  43. Isci S, Ozturk C, Jones J, Otu HH: Pathway analysis of high-throughput biological data within a Bayesian network framework.

    Bioinformatics 2011, 27(12):1667-1674. PubMed Abstract | Publisher Full Text OpenURL

  44. Sales G, Calura E, Cavalieri D, Romualdi C: graphite - a Bioconductor package to convert pathway topology to gene network.

    BMC Bioinformatics 2012, 13:20. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  45. Sales G, Calura E, Martini P, Romualdi C: Graphite Web: web tool for gene set analysis exploiting pathway topology. [http://nar.oxfordjournals.org/content/early/2013/05/10/nar.gkt386.abstract] webcite

    Nucleic Acids Research 2013. OpenURL

  46. Zhao P, Iezzi S, Carver E, Dressman D, Gridley T, Sartorelli V, Hoffman EP: Slug Is a Novel Downstream Target of MyoD: TEMPORAL PROFILING IN MUSCLE REGENERATION. [http://www.jbc.org/content/277/33/30091.abstract] webcite

    Journal of Biological Chemistry 2002, 277(33):30091-30101. PubMed Abstract | Publisher Full Text OpenURL

  47. Lauritzen SL: Graphical models. Clarendon Press, Oxford; 1996. OpenURL

  48. Dai M, Wang P, Boyd AD, Kostov G, Athey B, Jones EG, Bunney WE, Myers RM, Speed TP, Akil H, Watson SJ, Meng F: Evolving gene/transcript definitions significantly alter the interpretation of GeneChip data. [http://nar.oxfordjournals.org/content/33/20/e175.abstract] webcite

    Nucleic Acids Research 2005, 33(20):e175. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  49. Smoot ME, Ono K, Ruscheinski J, Wang PL, Ideker T: Cytoscape 2.8: new features for data integration and network visualization.

    Bioinformatics 2011, 27(3):431-432. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  50. Tidball JG, Villalta SA: Regulatory interactions between muscle and the immune system during muscle regeneration.

    American Journal of Physiology-Regulatory, Integrative and Comparative Physiology 2010, 298(5):R1173-R1187. Publisher Full Text OpenURL

  51. Sancak Y, Peterson TR, Shaul YD, Lindquist RA, Thoreen CC, Bar-Peled L, Sabatini DM: The Rag GTPases bind raptor and mediate amino acid signaling to mTORC1.

    Science 2008, 320(5882):1496-1501. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  52. Sesodia S, Choksi RM, Nemeth PM: Nerve-dependent recovery of metabolic pathways in regenerating soleus muscles.

    Journal of Muscle Research & Cell Motility 1994, 15(5):573-581. PubMed Abstract | Publisher Full Text OpenURL

  53. Scheerer N, Dehne N, Stockmann C, Swoboda S, Baba HA, Neugebauer A, Johnson RS, Fandrey J: Myeloid Hypoxia-Inducible Factor-1α Is Essential for Skeletal Muscle Regeneration in Mice.

    The Journal of Immunology 2013, 191:407-414. PubMed Abstract | Publisher Full Text OpenURL

  54. Richard-Bulteau H, Serrurier B, Crassous B, Banzet S, Peinnequin A, Bigard X, Koulmann N: Recovery of skeletal muscle mass after extensive injury: positive effects of increased contractile activity.

    American Journal of Physiology-Cell Physiology 2008, 294(2):C467-C476. OpenURL

  55. Dickinson JM, Fry CS, Drummond MJ, Gundermann DM, Walker DK, Glynn EL, Timmerman KL, Dhanani S, Volpi E, Rasmussen BB: Mammalian target of rapamycin complex 1 activation is required for the stimulation of human skeletal muscle protein synthesis by essential amino acids.

    The Journal of nutrition 2011, 141(5):856-862. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL