Institute of Mathematics and Statistics, University of São Paulo, Rua do Matão, 1010, São Paulo 05508-090, Brazil

Center for Experimental Research, Albert Einstein Research and Education Institute, Av. Albert Einstein, 627 - São Paulo, 05652-000, Brazil

Human Genome Center, Institute of Medical Science, The University of Tokyo, 4-6-1 Shirokanedai, Minato-ku, Tokyo, 108-8639, Japan

Center of Mathematics, Computation and Cognition, Universidade Federal do ABC, Rua Santa Adélia, 166 - Santo André, 09210-170, Brazil

Abstract

Background

A common approach for time series gene expression data analysis includes the clustering of genes with similar expression patterns throughout time. Clustered gene expression profiles point to the joint contribution of groups of genes to a particular cellular process. However, since genes belong to intricate networks, other features, besides comparable expression patterns, should provide additional information for the identification of functionally similar genes.

Results

In this study we perform gene clustering through the identification of Granger causality between and within sets of time series gene expression data. Granger causality is based on the idea that the cause of an event cannot come after its consequence.

Conclusions

This kind of analysis can be used as a complementary approach for functional clustering, wherein genes would be clustered not solely based on their expression similarity but on their topological proximity built according to the intensity of Granger causality among them.

Background

Gene network analysis of complex datasets, such as DNA microarray results, aims to identify relevant structures that help the understanding of a certain phenotype or condition. These networks comprise hundreds to thousands of genes that may interact generating intricate structures. Consequently, pinpointing genes or sets of genes that play a crucial role becomes a complicated task.

Common analyses explore gene-gene level relationships and generate broad networks. Although this is a valuable approach, genes might interact more intensely with a few members of the network, and the identification of these so-called sub-networks should lead to a better comprehension of the entire regulatory process.

Several

**Regulatory networks.** **(a)** Functional clustering. Genes are clustered based on their topological proximity given by Granger causality. **(b)** Usual clustering. Genes are clustered based on the similarity between gene expression levels.

The concept of Granger causality

Materials and Methods

Granger causality for sets of time series

Granger causality identification is a potential approach for the detection of possible interactions in a data-driven framework couched in terms of temporal precedence. The main idea is that temporal precedence does not imply causality, but may help to identify causal relationships, since a cause never occurs after its effect.

A formal definition of Granger causality for sets of time series

Definition 1

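Suppose that ℑ_{t} is a set containing all relevant information available up to and including time point t. Let **X**_{t}, **X**^{i}_{t} and **X**^{j}_{t} be sets of time series, where **X**^{i}_{t} and **X**^{j}_{t} are disjoint subsets of **X**_{t}. Let **X**^{i}_{t}(h|ℑ_{t}) be the optimal (minimum mean squared error) h-step predictor of **X**^{i}_{t} from time point t, based on the information in ℑ_{t}, and let Ω_{X}(h|ℑ_{t}) denote the corresponding forecast mean squared error. The set **X**^{j}_{t} is said to Granger-cause the set **X**^{i}_{t} if Ω_{X}(h|ℑ_{t}) is smaller than the forecast mean squared error obtained when the past and present of **X**^{j}_{t} are removed from ℑ_{t}, for at least one h = 1, 2, ….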

For the linear case,

where

In order to simplify both notation and concepts, only the identification of Granger causality for sets of time series in an autoregressive process of order one is presented. Generalizations to higher orders are straightforward.
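As an illustration of the order-one case, the sketch below tests whether one single time series Granger-causes another by comparing two least-squares fits: one that uses only the past of the target, and one that also uses the past of the candidate cause. This is the textbook pairwise F statistic, not the set-based procedure of this paper; the function name `granger_f_stat` and the simulated coefficients are assumptions made for the example.

```python
import numpy as np

def granger_f_stat(x, y):
    """F statistic for 'y Granger-causes x' in an order-one autoregressive model."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    target = x[1:]                                   # x at time t
    ones = np.ones_like(target)
    restricted = np.column_stack([ones, x[:-1]])     # past of x only
    full = np.column_stack([ones, x[:-1], y[:-1]])   # past of x and past of y

    def rss(design):
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        return float(resid @ resid)

    rss_r, rss_f = rss(restricted), rss(full)
    dof = len(target) - full.shape[1]
    # One restriction is tested (the coefficient of the past of y).
    return (rss_r - rss_f) / (rss_f / dof)

# Hypothetical data: y drives x with a one-step delay, x does not drive y.
rng = np.random.default_rng(0)
n = 200
y = rng.normal(size=n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = 0.8 * y[t - 1] + 0.1 * rng.normal()

f_y_to_x = granger_f_stat(x, y)
f_x_to_y = granger_f_stat(y, x)
```

A large value of `f_y_to_x` together with a small value of `f_x_to_y` is what temporal precedence from `y` to `x` looks like in this simplified setting.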

Functional clustering in terms of Granger causality

There are numerous definitions for clusters in networks in the literature

A usual approach for network clustering when the structure of the graph is known is the spectral clustering proposed by

In order to overcome this limitation, we developed a framework to cluster genes by their topological proximity using the time series gene expression information. We developed concepts of distance and degree for sets of time series based on Granger causality, and combined them with a modified spectral clustering algorithm. The procedures are detailed below.

Functional clustering

Given a set of p time series and some notion of similarity w_{ij} ≥ 0 between all pairs of data points i and j, the data can be represented as a similarity graph. Each vertex v_{i} in this graph represents a time series gene expression. Two vertices are connected when the similarity w_{ij} between the corresponding time series is positive (the edge is weighted by w_{ij}). In other words, w_{ij} > 0 represents the existence of Granger causality between time series i and j, and w_{ij} = 0 represents Granger non-causality. The problem of clustering can now be reformulated using the similarity graph: we want to find a partition of the graph such that there is less Granger causality between different groups and more Granger causality within each group.

Let G = (V, E) be an undirected graph with vertex set V = {v_{1},…,v_{p}} (where each vertex represents one time series) and weighted edge set E. Each edge between two vertices v_{i} and v_{j} carries a non-negative weight w_{ij} ≥ 0. The weighted adjacency matrix of the graph is the matrix **W** = (w_{ij}); i, j = 1,…,p. If w_{ij} = 0, this means that the vertices v_{i} and v_{j} are not connected by an edge. As G is undirected, we require that w_{ij} = w_{ji}. Therefore, in terms of Granger causality, w_{ij} can be set as the distance between two time series.

Definition 2

Notice that

Moreover, notice that the CCA is the Pearson correlation after dimension reduction, therefore,

Another necessary concept is the degree of a time series (vertex v_{i}), which can be defined as

Definition 3

Notice that in-degree and out-degree represent the total information flow that “enters” and “leaves” the vertex v_{i}, respectively. Therefore, the degree of vertex v_{i} contains the total information flow passing through vertex v_{i}.
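To make the notions of in-degree, out-degree and degree concrete, the sketch below computes them from a small directed weight matrix. It assumes, purely for illustration, that a pairwise strength of Granger causality is already available for every ordered pair of series and that the degree is the sum of in-degree and out-degree; the matrix `flow` and its values are hypothetical, and the paper's own definition is stated in terms of sets of time series rather than pairwise sums.

```python
import numpy as np

# Hypothetical weights: flow[i, j] is the strength of Granger causality
# from time series i to time series j.
flow = np.array([
    [0.0, 0.6, 0.0],
    [0.1, 0.0, 0.7],
    [0.0, 0.0, 0.0],
])

out_degree = flow.sum(axis=1)      # information leaving each vertex
in_degree = flow.sum(axis=0)       # information entering each vertex
degree = in_degree + out_degree    # total flow passing through each vertex
```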

The concept of degree of a vertex v_{i} (a single time series) can be extended to a set of time series (a sub-network).

Definition 4

Now, by using the definitions of distance and degree for time series and sets of time series in terms of Granger causality, it is possible to develop a spectral clustering-based algorithm to identify sub-networks (sets of time series that are highly connected within sets and poorly connected between sets) in the regulatory networks. The algorithm, based on spectral clustering, is as follows.

**Input:** The p time series and the number k of sub-networks to construct.

**Step 1:** Let **W** be the (p × p) weighted adjacency matrix with entries w_{ij}.

**Step 2:** Compute the non-normalized (p × p) Laplacian matrix **L** as (Mohar, 1991)

**L** = **D** − **W**

where **D** is the (p × p) diagonal matrix with the degrees d_{1},…,d_{p} on the diagonal.

**Step 3:** Compute the first k eigenvectors {**e**_{1},…,**e**_{k}} (corresponding to the k smallest eigenvalues) of **L**.

**Step 4:** Let **U** ∈ ℝ^{p×k} be the matrix containing the vectors {**e**_{1},…,**e**_{k}} as columns.

**Step 5:** For i = 1,…,p, let **y**_{i} ∈ ℝ^{k} be the vector corresponding to the i-th row of **U**.

**Step 6:** Cluster the points (**y**_{i})_{i=1,…,p} ∈ ℝ^{k} with the k-means algorithm into clusters {**X**_{1},…,**X**_{k}}.

**Output:** Sub-networks {**X**_{1},…,**X**_{k}}.

Notice that this clustering approach does not infer the entire structure of the network.
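The steps of the algorithm can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the degree matrix is built from the row sums of **W** as a stand-in for the Granger-causality-based degree, the k-means routine is a plain Lloyd iteration with a single random start, and the names `kmeans` and `spectral_clustering` as well as the toy matrix `W` are introduced only for this example.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's k-means with one random initialization."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Keep the old center when a cluster receives no points.
        new_centers = np.array([
            points[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def spectral_clustering(W, k, seed=0):
    W = np.asarray(W, dtype=float)
    D = np.diag(W.sum(axis=1))      # Step 2: degrees (row sums, an assumption)
    L = D - W                       # non-normalized Laplacian
    _, eigvecs = np.linalg.eigh(L)  # eigenvalues returned in ascending order
    U = eigvecs[:, :k]              # Steps 3-4: k eigenvectors of smallest eigenvalues
    return kmeans(U, k, seed=seed)  # Steps 5-6: cluster the rows of U

# Hypothetical adjacency matrix: two tightly connected groups, weakly linked.
W = np.full((5, 5), 0.01)
W[:3, :3] = 1.0
W[3:, 3:] = 1.0
np.fill_diagonal(W, 0.0)

labels = spectral_clustering(W, k=2)
```

In practice k-means is usually restarted from several initial configurations and the best solution is kept; a single start is used here only to keep the sketch short.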

Estimation of the number of clusters

The method presented so far describes a framework for clustering genes (time series) using their topological proximity in terms of Granger causality.

Now, the challenge consists in determining the optimum number of sub-networks k.

In order to determine the most appropriate number of clusters in this specific context, we used a variant of the silhouette method

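Let us first define the cluster index s(i). Take any time series i in the data set and denote by **A** the sub-network to which it has been assigned. When sub-network **A** contains other time series apart from i, we can compute a(i), the dissimilarity between time series i and **A**. Let us now consider any sub-network **C** which is different from **A** and compute d(i, **C**), the dissimilarity between time series i and **C**. After computing d(i, **C**) for all sub-networks **C** ≠ **A**, we select the smallest of those numbers and denote it by b(i). The sub-network **B** for which this minimum value is attained (that is, d(i, **B**) = b(i)) is called the neighbor of time series i: if time series i could not be assigned to **A**, the best sub-network to belong to would be **B**. Therefore, b(i) depends on the availability of sub-networks other than **A**, thus it is necessary to assume that there is more than one sub-network.

After calculating a(i) and b(i), the cluster index s(i) is obtained as

s(i) = (b(i) − a(i)) / max{a(i), b(i)}.

Indeed, from the above definition we easily see that −1 ≤ s(i) ≤ 1 for each time series i.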
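The cluster index can be sketched as follows. The code uses the classical silhouette convention, in which the dissimilarity between a time series and a sub-network is the average pairwise dissimilarity; the variant used in this paper may instead rely on the Granger-causality-based distance between a time series and a set, so this is an assumption. The function name `cluster_index` and the toy dissimilarity matrix are hypothetical.

```python
import numpy as np

def cluster_index(dist, labels):
    """Silhouette-style index s(i) from a symmetric dissimilarity matrix.

    Requires at least two distinct labels.
    """
    dist = np.asarray(dist, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    s = np.zeros(len(labels))
    for i in range(len(labels)):
        own = labels == labels[i]
        own[i] = False
        if not own.any():
            continue                      # singleton cluster: s(i) = 0 by convention
        a = dist[i, own].mean()           # dissimilarity to its own sub-network
        b = min(dist[i, labels == c].mean()
                for c in clusters if c != labels[i])  # nearest other sub-network
        denom = max(a, b)
        s[i] = 0.0 if denom == 0 else (b - a) / denom
    return s

# Hypothetical example: four series located at 0, 1, 10 and 11 on a line.
positions = np.array([0.0, 1.0, 10.0, 11.0])
dist = np.abs(positions[:, None] - positions[None, :])

s_good = cluster_index(dist, [0, 0, 1, 1])  # natural grouping
s_bad = cluster_index(dist, [0, 1, 0, 1])   # mixed grouping
```

Averaging `s_good` gives a value close to one, while `s_bad` averages below zero, which is the behavior exploited when comparing candidate numbers of sub-networks.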

The average cluster index

Estimation of the number of clusters in biological data

In order to estimate the most appropriate number of sub-networks present in the data set, we estimate the average cluster index

**(a) The representation of the network with four clusters; (b) the network obtained by applying the proposed method with four clusters (k=4); (c) the representation of the network when the number of clusters is set to five; (d) the network obtained by applying the proposed method with five clusters (k=5).** The solid edges represent Granger causality. Notice that the structure of the “true” network (**(a)** and **(c)**) is not observed and can only be estimated.

**The optimum number of sub-networks is indicated by the breakpoint in the graph.** The breakpoint appears when the number of sub-networks is greater than the adequate number of sub-networks. The breakpoint selection criterion is based on two linear regressions that best fit the data.
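A breakpoint criterion based on two linear regressions can be sketched as below: every interior point is tried as the junction of two straight-line fits, and the candidate with the smallest total squared error is kept. The function name `find_breakpoint`, the choice of letting the candidate point belong to both segments, and the index values are assumptions made for the illustration, not values from this study.

```python
import numpy as np

def find_breakpoint(ks, scores):
    """Return the k at which two fitted lines best describe the curve."""
    ks = np.asarray(ks, dtype=float)
    scores = np.asarray(scores, dtype=float)

    def sse(x, y):
        slope, intercept = np.polyfit(x, y, 1)
        resid = y - (slope * x + intercept)
        return float(resid @ resid)

    best_k, best_err = None, np.inf
    # Each segment needs at least two points; the candidate belongs to both.
    for b in range(1, len(ks) - 1):
        err = sse(ks[: b + 1], scores[: b + 1]) + sse(ks[b:], scores[b:])
        if err < best_err:
            best_k, best_err = ks[b], err
    return int(best_k)

# Hypothetical average cluster index for k = 2,...,8 with a change of slope at k = 4.
ks = [2, 3, 4, 5, 6, 7, 8]
avg_index = [0.30, 0.45, 0.60, 0.50, 0.40, 0.30, 0.20]
k_hat = find_breakpoint(ks, avg_index)
```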

Network construction

The network connecting clusters is constructed following procedures previously described

The Granger causality between clusters is identified by:

where **X**_{t} minus the set

Then, test

H_{0} and H_{1} are the null and alternative hypotheses, respectively.

Simulations

Four sets of Monte Carlo simulations were carried out in order to evaluate the proposed approach under controlled conditions. The first scenario represents four sub-networks without Granger causality between them (Figure

**(a) Four independent sub-networks without Granger causality between them; (b) Four sub-networks in a cyclic graph; (c) Feedback loop between sub-networks A and B; (d) A network between sub-networks, where sub-network A only sends Granger causality and D only receives Granger causality.**

For each scenario, the time series length was set to 50, 75, 100 and 200 time points, and the number of repetitions was 1,000. The synthetic gene expression time series data in sub-networks A, B, C and D were generated by the equations below.

Simulation 1:

Simulation 2:

Simulation 3:

Simulation 4:

where ε_{i,t} ∼ N(0, **Σ**) with

and

for

Actual biological data

In order to illustrate an application of the proposed approach, a dataset collected by

In order to evaluate our proposed approach, we chose to analyze the same gene set examined in Figure

**The optimum number of sub-networks in the actual biological data is indicated by the breakpoint in the graph (the optimum number in this case is three).**

Results

Simulated data

In order to study the properties of the proposed functional clustering method and to check its consistency, we performed four simulations with distinct network characteristics in terms of structure and Granger causality.

Table

| Scenario | Time series length | 1 | 2 | 3 | 4 | 5 | 6 | Silhouette width |
|---|---|---|---|---|---|---|---|---|
| 1 | 50 | 0 | 0 | 48 | **700** | 252 | 0 | 0.502 (0.098) |
| 1 | 75 | 0 | 0 | 1 | **785** | 214 | 0 | 0.582 (0.054) |
| 1 | 100 | 0 | 0 | 3 | **805** | 192 | 0 | 0.610 (0.042) |
| 1 | 200 | 0 | 0 | 4 | **825** | 171 | 0 | 0.641 (0.034) |
| 2 | 50 | 0 | 0 | 65 | **713** | 222 | 0 | 0.479 (0.112) |
| 2 | 75 | 0 | 0 | 28 | **760** | 212 | 0 | 0.555 (0.071) |
| 2 | 100 | 0 | 0 | 9 | **834** | 157 | 0 | 0.587 (0.050) |
| 2 | 200 | 0 | 0 | 3 | **883** | 114 | 0 | 0.621 (0.029) |
| 3 | 50 | 0 | 0 | 63 | **666** | 271 | 0 | 0.461 (0.123) |
| 3 | 75 | 0 | 0 | 18 | **784** | 198 | 0 | 0.552 (0.078) |
| 3 | 100 | 0 | 0 | 8 | **851** | 141 | 0 | 0.586 (0.050) |
| 3 | 200 | 0 | 0 | 6 | **883** | 111 | 0 | 0.618 (0.031) |
| 4 | 50 | 0 | 0 | 53 | **686** | 261 | 0 | 0.465 (0.110) |
| 4 | 75 | 0 | 0 | 17 | **786** | 197 | 0 | 0.551 (0.075) |
| 4 | 100 | 0 | 0 | 11 | **815** | 174 | 0 | 0.581 (0.055) |
| 4 | 200 | 0 | 0 | 6 | **887** | 107 | 0 | 0.619 (0.033) |

Columns 1 to 6 give the number of clusters. Counts for the correct number of clusters are shown in bold. One standard deviation of the silhouette width calculated at the breakpoint is given in parentheses. For each scenario and each time series length, the number of repetitions was set to 1,000.

Table

| Scenario/Time series length | 50 | 75 | 100 | 200 |
|---|---|---|---|---|
| 1 | 78.8 | 96.0 | 98.9 | 99.9 |
| 2 | 72.9 | 91.2 | 95.8 | 99.2 |
| 3 | 71.6 | 90.6 | 95.2 | 99.7 |
| 4 | 68.9 | 88.7 | 93.7 | 99.1 |

Table

| Scenario | from/to | A | B | C | D |
|---|---|---|---|---|---|
| 1 | A | **100/100/100/100** | 6.7/6.3/5.2/5.4 | 8.9/6.0/5.0/5.3 | 4.8/5.7/5.4/4.5 |
| 1 | B | 6.9/7.1/5.5/6.8 | **99.9/100/100/100** | 7.8/6.2/6.3/4.6 | 5.6/6.9/4.9/5.6 |
| 1 | C | 7.6/5.9/6.5/5.6 | 6.9/7.7/4.7/5.1 | **100/100/100/100** | 4.9/5.4/5.7/5.8 |
| 1 | D | 6.2/5.3/5.1/4.7 | 5.3/5.2/5.3/5.7 | 7.0/5.2/5.2/5.6 | **100/100/100/100** |
| 2 | A | **100/100/100/100** | **28.9/59.8/80.4/99.7** | 8.0/6.4/6.8/5.2 | 6.4/6.6/5.0/5.0 |
| 2 | B | 5.4/5.3/5.5/4.6 | **100/100/100/100** | **29.6/60.9/82.1/99.9** | 6.4/6.3/5.7/5.7 |
| 2 | C | 7.5/5.4/6.7/4.5 | 8.8/6.6/6.6/6.3 | **100/100/100/100** | **23.0/50.4/71.2/99.1** |
| 2 | D | **17.6/35.5/51.2/95.4** | 6.5/4.2/3.4/5.0 | 12.5/10.4/7.5/5.0 | **100/100/100/100** |
| 3 | A | **100/100/100/100** | **29.6/61.9/82.1/100** | 7.8/7.3/4.5/5.0 | 7.4/6.8/4.6/5.2 |
| 3 | B | **28.5/53.0/78.0/99.9** | **100/100/100/100** | **31.8/61.1/82.9/99.9** | 7.0/7.1/6.2/4.7 |
| 3 | C | 8.4/6.9/6.4/5.6 | 7.6/7.8/7.3/5.2 | **99.9/100/100/100** | **25.5/46.8/70.6/99.3** |
| 3 | D | 6.8/5.6/5.8/5.0 | 5.5/4.5/5.7/4.3 | 13.9/8.2/6.1/5.4 | **100/100/100/100** |
| 4 | A | **100/100/100/100** | **25.1/52.6/75.8/99.6** | **22.9/41.8/59.5/96.0** | 6.8/5.8/5.2/4.7 |
| 4 | B | 6.7/5.9/5.7/5.9 | **100/100/100/100** | **28.6/58.4/81.9/100** | 7.9/6.0/6.1/5.2 |
| 4 | C | 9.3/8.8/6.1/6.2 | 8.8/6.2/6.3/4.5 | **100/100/100/100** | **26.5/53.2/75.4/99.2** |
| 4 | D | 5.4/5.8/5.1/4.7 | 5.8/5.0/4.2/5.2 | 14.9/11.9/7.9/5.4 | **100/100/100/100** |

The rows and columns represent the clusters A, B, C, and D. The rate of false positives was controlled at 5% (p-value < 0.05). The edges that actually exist in the network are shown in bold.

Biological data

By applying the method described in section Functional clustering to the biological dataset, the optimum number of sub-networks was identified as three. Notice in Figure

Once clusters were obtained, the cluster-cluster network (Figure

**The network obtained with three (k=3) sub-networks.** Solid arrows indicate significant Granger causality with p-value < 0.05 and the dashed arrow indicates Granger causality with p-value < 0.10. The circles represent the clusters.

In

Even though MCL-1 and P21 play important roles in cell survival, and BAI1 is transcriptionally regulated by P53, the analysis run here clustered them separately from the P53-containing cluster. This result suggests that, in the context of this dataset, their interaction is stronger with genes such as c-JUN, also functionally related to cell survival, proto-oncogene MET and tumor suppressor MASPIN, for instance. Also worth noticing is the interaction of this cluster with the two members of cluster 3: FGF5 and FOP. Like the other members of the FGF family grouped in cluster 2, FGF5 is involved in cell survival activities, while FOP was originally discovered as a fusion partner with FGFR1 in oncoproteins that give rise to stem cell myeloproliferative disorders. It would be interesting to identify specific details regarding the intensity and direction of the information flow within this cluster for a clearer understanding of their relationship in the context of cell cycle progression.

Discussions

Fujita

Krishna

A disadvantage of our method is that it cannot be applied to very large datasets. The larger the number of time series (genes), or the higher the order of the autoregressive process to be analyzed, the higher the chance of generating non-invertible covariance matrices in the calculation of distance (Definition 2) and degree (Definition 4) between clusters. We believe that this drawback can be overcome through sparse canonical correlation analysis.

We only analyzed the autoregressive process of order one because gene expression time series, possibly due to experimental limitations, are typically short. However, if one is interested in analyzing higher orders, one minus the maximum canonical correlation among all the tested autoregressive orders can be used as the distance measure between two time series.
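The higher-order distance suggested above can be sketched as follows. Canonical correlations are computed with the standard QR and singular value decomposition route; for each tested order q the target series at time t is compared with the q most recent past values of the other series, and the distance is one minus the largest correlation found. The sketch is deliberately simplified: it is unconditional (it does not partial out the remaining series) and it covers only one direction, so it illustrates the idea rather than Definition 2 itself. The names `largest_canonical_corr` and `lagged_distance` and the simulated data are assumptions.

```python
import numpy as np

def largest_canonical_corr(A, B):
    """Largest canonical correlation between the columns of A and of B."""
    A = A - A.mean(axis=0)
    B = B - B.mean(axis=0)
    Qa, _ = np.linalg.qr(A)   # orthonormal basis of the column space of A
    Qb, _ = np.linalg.qr(B)
    singular_values = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return float(min(singular_values[0], 1.0))

def lagged_distance(x, y, max_order=3):
    """One minus the largest canonical correlation between x at time t
    and the past of y, over autoregressive orders 1..max_order."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    best = 0.0
    for q in range(1, max_order + 1):
        target = x[q:, None]
        past = np.column_stack(
            [y[q - lag: len(y) - lag] for lag in range(1, q + 1)]
        )
        best = max(best, largest_canonical_corr(target, past))
    return 1.0 - best

# Hypothetical data: x follows y with a delay of two time points.
rng = np.random.default_rng(1)
n = 300
y = rng.normal(size=n)
x = np.zeros(n)
x[2:] = y[:-2] + 0.1 * rng.normal(size=n - 2)

d_order1 = lagged_distance(x, y, max_order=1)  # delay not covered: large distance
d_order3 = lagged_distance(x, y, max_order=3)  # delay covered: small distance
```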

The clustering algorithm used here is based on the well-known spectral clustering. Although results were satisfactory, other graph clustering methods may be used. The normalized cuts algorithm proposed by

Finally, which biological processes underlie the correlation between time series datasets remains a difficult question to answer. Studies suggest that correlated genes may belong to common pathways or present the same biological function. However, it is also known that methods based exclusively on correlation cannot reconstruct entire gene networks. Further studies in the field of systems biology might be able to answer this question in the future.

Conclusions

We propose a time series clustering approach based on Granger causality and a method to determine the number of clusters that best fit the data. This method consists of (1) the definition of degree and distance, usually used in graph theory but now generalized for time series data analysis in terms of Granger causality; (2) a clustering algorithm based on spectral clustering and (3) a criterion to determine the number of clusters. We demonstrate, by simulations, that our approach is consistent even when the number of genes is greater than the time series’ length.

We believe that this approach can be useful to understand how gene expression time series relate to each other, and therefore help in the functional interpretation of data.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

AF has made substantial contributions to the conception and design of the study, analysis and interpretation of data. KK, AGP and JRS contributed to the analysis and interpretation of mathematical results. PS contributed to the analysis and interpretation of biological data. AF and PS have been involved in drafting of the manuscript. SM directed the work. All authors read and approved the final manuscript.

Acknowledgements

The supercomputing resource was provided by Human Genome Center (Univ. of Tokyo). This work was supported by FAPESP and CNPq - Brazil and RIKEN - Japan.