A phase synchronization clustering algorithm for identifying interesting groups of genes from cell cycle expression data

Kim, Chang Sik; Bae, Cheol Soo; Tcha, Hong Joon

doi:10.1186/1471-2105-9-56

Research article
Open access
Published: 28 January 2008

A phase synchronization clustering algorithm for identifying interesting groups of genes from cell cycle expression data

Chang Sik Kim¹,
Cheol Soo Bae² &
Hong Joon Tcha³

BMC Bioinformatics volume 9, Article number: 56 (2008) Cite this article

4998 Accesses
11 Citations
Metrics details

Abstract

Background

The previous studies of genome-wide expression patterns show that a certain percentage of genes are cell cycle regulated. The expression data has been analyzed in a number of different ways to identify cell cycle dependent genes. In this study, we pose the hypothesis that cell cycle dependent genes are considered as oscillating systems with a rhythm, i.e. systems producing response signals with period and frequency. Therefore, we are motivated to apply the theory of multivariate phase synchronization for clustering cell cycle specific genome-wide expression data.

Results

We propose the strategy to find groups of genes according to the specific biological process by analyzing cell cycle specific gene expression data. To evaluate the propose method, we use the modified Kuramoto model, which is a phase governing equation that provides the long-term dynamics of globally coupled oscillators. With this equation, we simulate two groups of expression signals, and the simulated signals from each group shares their own common rhythm. Then, the simulated expression data are mixed with randomly generated expression data to be used as input data set to the algorithm. Using these simulated expression data, it is shown that the algorithm is able to identify expression signals that are involved in the same oscillating process. We also evaluate the method with yeast cell cycle expression data. It is shown that the output clusters by the proposed algorithm include genes, which are closely associated with each other by sharing significant Gene Ontology terms of biological process and/or having relatively many known biological interactions. Therefore, the evaluation analysis indicates that the method is able to identify expression signals according to the specific biological process. Our evaluation analysis also indicates that some portion of output by the proposed algorithm is not obtainable by the traditional clustering algorithm with Euclidean distance or linear correlation.

Conclusion

Based on the evaluation experiments, we draw the conclusion as follows: 1) Based on the theory of multivariate phase synchronization, it is feasible to find groups of genes, which have relevant biological interactions and/or significantly shared GO slim terms of biological process, using cell cycle specific gene expression signals. 2) Among all the output clusters by the proposed algorithm, the cluster with relatively large size has a tendency to include more known interactions than the one with relatively small size. 3) It is feasible to understand the cell cycle specific gene expression patterns as the phenomenon of collective synchronization. 4) The proposed algorithm is able to find prominent groups of genes, which are not obtainable by traditional clustering algorithm.

Background

Since Hereford et al. [1] first discovered yeast histone mRNAs oscillate during cell division cycle, several experimental studies have identified that many genes are expressed in a cell-cycle-specific manner. These studies have motivated the study of global extent of cycle-specific gene expression. To this end, there have been a number of studies using DNA microarrays to understand whole-genome expression patterns during cell division cycle [2–8]. A particular example is flagella biogenesis in Caulobactor, which has four distinct and dependent waves of transcription. Laub et al. [3] showed that 20% of Caulobactor genes are cell cycle regulated, their expression level consistently having peaks when they function. Another example is the study of yeast Saccharomyces cerevisiae [6], in which they also discovered that between 10 and 20% of yeast genes are periodically expressed during cell division. Therefore, it suffices to say that a certain percentage of genes may have the periodicity for its oscillatory activity throughout the cell division. These cell-cycle-specific oscillatory activities can be explained by a biological phenomenon in terms of efficiency and logical order. The cell only makes the enzyme when it is needed. If the enzymes were made all the time, the cell would be inefficient in an environment devoid of the substrates of the enzymes [9].

In this study, we are motivated to apply the theory of multivariate phase synchronization to cell-cycle-specific gene expression data. Synchronization is one of the most commonly present phenomena in various fields of science [10, 11]. Generally, we understand synchronization as a complete coincidence of the states between oscillating systems due to their interactions. Rosenblum et al. [12] show that the phase difference of two coupled oscillating systems is bounded while the amplitude is uncorrelated and irregular. There have been numerous applications in different areas such as cardiorespiratory interaction [13–15], brain activity of Parkinsonian patients [16], EEG measurements [17–20], ecology [21], and climate systems [22]. Because our interests of this study are cellular activity during cell cycle, our interested systems are the cell cycle specific genes. Based on the theory of phase synchronization, we pose a hypothesis that expression signals from two genes could be synchronized if these two genes are biologically interacting with each other. That is, two biologically interacting genes produce oscillating expression signals with a common rhythm. Therefore, we propose the phase synchronization as a measure to identify biologically relevant interactions using cell-cycle-specific gene repression data and the cell cycle specific genes are oscillating systems, which produce gene expressions with rhythms (periodicity).

In this study, we present the effort of applying the theory of multivariate phase synchronization to find groups of cell cyclic gene expression signals according to the specific biological process, which is based on the study of Allefeld and Kurths [17]. They present a method for the multivariate analysis of statistical phase synchronization phenomena in empirical data, which is based on the theory of synchronization cluster. The basic idea of their analysis is to consider the oscillating systems forming a cluster in which each one contributes to the cluster in different degree. The cluster consists of a common rhythm that is a mean oscillation for all oscillating systems inside the cluster. Based on their theory, we propose an algorithm named as Phase Synchronization Clustering (PSC) algorithm, which produce the clusters of cell cycle specific genes from genome expression data set, and the genes from the same cluster are expected to be involved in the specific biological process. The PSC algorithm is evaluated with synthetic data and cell cycle specific expression data of Saccharomyces cerevisiae from the study of Spellman et al. [6], in which they analyze gene expression levels in yeast cell cultures whose cell cycle has been synchronized by various methods.

Results and discussion

Case study 1: in silico experiments

The purpose of this experiment is to show how the proposed PSC algorithm is able to identify the signals that are expressed in the same specific process. In this study, it is assumed that a certain group of gene expression levels during cell cycle can be explained as the synchronization of large ensembles of oscillators, in which each element of the ensemble interacts with all others and is driven by the mean field that is formed by all elements, provided that every members from the group play a role for a certain biological process. The driving force, or the mean field, is not predetermined, but arises from interactions within the ensemble. This force determines whether the systems synchronize, but it itself depends on their oscillation. We use the modified Kuramoto model [23] as a phase governing equation that gives the long-term dynamics of globally coupled oscillators

\begin{matrix} \frac{d ϕ_{i}}{d t} = C \sum_{j = 1}^{K} \sin (ψ - ϕ_{j}), & i = 1, \dots, N, \end{matrix}

where the φ_is are the instantaneous phase, the ψ are the mean phase, and the positive constant C represents the coupling strength. It should be noted that the autonomous (or natural) frequency term is excluded from the original Kuramoto model for this model, and the mean phase ψ is roughly approximated by averaging the phases of all oscillators at current time point. The original Kuramoto model describes a large population of coupled limit-cycle oscillators [23]. With this modified model, it is assumed that the instantaneous rate of phase change is proportional to the mean sinusoidal coupling between the mean phase and each instantaneous phase. Given a set of initial conditions and a step size, we can simulate the instantaneous phase using the following for each gene i

ϕ_{i} (t + 1) = ϕ_{i} (t) + δ t \frac{d ϕ_{i}}{d t} .

With given initial random instantaneous phase signals, the expression signal can be simulated and converted into real signals asx_i(t) = real[A exp[(jφ_i(t))] = A cos(φ_i(t)),

where A is the instantaneous amplitude and is set to 1 for all signals. Then the simulated signals are updated by adding random noise from Gaussian distribution with mean μ = 0 and standard deviation ε.

To evaluate the PSC algorithm, we generate four sets of the expression signals with four different standard deviations of random Gaussian noise (i.e. noise level s) ε = 0.1, 0.2, 0.3, 0.4. For each set, we generate two groups of 100 signals. For first group, 20 measurements of signals are simulated with the coupling strength C = 3.0, and for second group, 20 measurements with the coupling strength C = 2.0. It suffices to say that these two groups are separately involved in their own oscillating process because each group has different coupling strength. It means that each group has different driving forces or mean field for their own signals. For each data set, we generate a group of random signals with same number of genes and measurements. This random group is combined with two other groups of simulated signals. Thus, in each data set, two-thirds of signals are simulated signals and one-third of signals are random signals. Then, we randomly shuffle the locations of all expression signals for each data set. We use four different values of cutoff (=0.9, 0.8, 0.7, 0.6) for these four data sets. Figure 1 shows all three groups of sorted expression signals without any addition of noises, and Figure 2 displays the change of the simulated signals of first group (C = 3.0) as the noise level increases from 0.1 to 0.4.

As an initial step, the algorithm creates a set of clusters of which the size is equal to the number of signals in the input data set. In this case, the algorithm creates 300 initial clusters, of which all sizes are equal to one. After the final step of the algorithm, the size of each cluster will be different depending on the values of cutoff and noise level ε. For each non-empty cluster, the signals from the group with simulated signals are counted and labeled as true positive (TP) for each group, and the signals from the group with random signals are also counted and labeled as false positive (FP).

In Figure 3, we explore the relationship between the size of clusters and the number of TP from the two groups of simulated signals with four different values of cutoff and noise level ε, and it is shown that all of them have linear relationship. It is also shown that the 1^st and 2^nd largest clusters contain the most TP signals among the output clusters. Hence, only these two largest output clusters are used as the output of the algorithm, and we explore the effect of the cutoff and noise level on the performance of the algorithm with these two clusters. We then systemically compare the sensitivity (percentage of correctly identified from input expression signals) and precision (percentage of TP expression signals among the output expression signals) for different cutoff and noise level s ε (Figure 4, 5, 6, 7) See Table 7 for the index of each figure. To test the variability of the results, we run the algorithm 20 times for each value of cutoff and noise level ε. Note that the simulated expression signals are different for each run due to random generations of initial phase signals and random noise addition, and the results are also expected to have certain degree of variability.

It is shown that the more noises are included in the data set, the less the sensitivity is obtained by the method (Figure 4, 5). On the other hand, the overall precision is almost constant (i.e. = 100%) as the noise level ε increases (Figure 6, 7), i.e. the almost 100% of the output signals are TP signals. It is shown that the sensitivity are approximately 82 – 96% with cutoff = 0.7 for all noise level s. If we assume that the noise level ε is ≤0.4, the cutoff values to obtain the sensitivity ≥82% for both groups should be 0.7. Based on this experiment, we conclude that the cutoff value ≥ 0.7 should be used for the analysis of yeast expression data to evaluate the PSC method, provided that the noise level in yeast data is ≤0.4. This could be reasonable assumption, because it is believed that the noise level = 0.4 is relatively large.

Case study 2: α factor-synchronized cell cycle gene expression data analysis

We evaluate the PSC algorithm with expression data from the study of Spellman et al. [6], in which they monitor genome-wide mRNA levels by using four different cell cycle synchronization techniques. We evaluate the PSC algorithm with the data sets by three yeast experiments (Alpha, Cdc15, and Cdc28), in which mRNA levels of 6,178 yeast ORFs are measured simultaneously over approximately two cell cycle periods. A fourth data set using elutriation-based experiment by Spellman et al. [6] is not used, since it only covers a single cell cycle and because most published methods were not applied to this data set. There are many missing values in Spellman et al. [6]'s data set – only 605 genes have no missing values in all three data sets. These missing values can lead to problems in the data analysis, because they obviously interfere with computation of any statistical test or clustering. The one of default ways to handle these missing values is to exclude all data points that have missing data in at least one of the selected genes. However, if missing data points are randomly distributed across the arrays, there could be very few "valid" data points left to be analyzed in the data sets, because of the abundance of missing values in our chosen data sets. Therefore, we replace the missing values for each gene by the mean expression levels of its gene. We perform mean imputation in the gene expression levels for the K = 4,201 genes, which have no more than one missing value in each expression data set. Then the expression profile is normalized to the standardized variable. Let's say we have an expression profile z(t), t = 1, 2, ..., n. If the expression profile has mean μ and variance σ², then the corresponding normalized expression value

x (t) = \frac{z (t) - μ}{σ}

has the mean 0 and variance 1.

It is clear that each expression data set contains artifacts, which would not occur in freely growing cells, due to the treatment of the cells for the cell cycle synchronization. For example, in Cdc15 and Cdc28 experiments, cells are released from arrest by an abrupt drop in temperature, which likely results in changes in expression of e.g. heat shock genes. To avoid the artifacts due to each cell cycle synchronization treatment, three expression data sets should be combined to uncover the "correct" sets of genes using the PSC algorithm as follows. The PSC algorithm mainly consists of the estimation of the strength between a system and the cluster, ρ_aCin Eq. (9). For the estimation of ρ_aC, the algorithm requires the strength values of the bivariate synchronization between all systems within the cluster of interest to us, ρ_a,bin Eq. (4). As a step for combining three data sets, we calculate the average ρ_a,bvalues from the three data sets as follows

ρ_{a, b} = \frac{1}{3} [ρ_{a, b}^{α} + ρ_{a, b}^{c d c 15} + ρ_{a, b}^{c d c 28}],

where ρ_a,b^αis the values of bivariate synchronization by alpha-factor data set, ρ_a,b^{cdc 15}by cdc15 data set, and ρ_a,b^{cdc 28}by cdc28 data set. It is noteworthy that this step could also reduce the noises in expression data due to the missing values.

The main purpose of the PSC algorithm is to find groups of genes according to the specific biological process using the cell-cycle-specific expression data. The previous in silico experiment provide the effectiveness of the PSC algorithm to reach this purpose. Based on the theory of multivariate phase synchronization [17], it is assumed that each gene from the same output cluster is closely associated by having relevant biological interactions and/or sharing significant Gene Ontology (GO) terms. That is, each cluster is related with a certain specific biological process. To evaluate the PSC algorithm, all of output clusters with different cutoff = 0.9, 0.8, 0.7, 0.6, 0.5 are analyzed with the GO Term Finder. It is a tool for searching significant GO terms, or parent GO terms, used to annotate genes in a given cluster and is available from Saccharomyces Genome Database [24]. The significant GO terms for each output clusters depending on cutoff are presented as Additional files 1, 2, 3, 4, 5 and the summary of the GO analysis is also provided in Table 1. Note that the p-value cutoff for significant GO terms is set to 0.05. It is found that significant number of clusters from the output have significant GO terms. It should be noted that genes within those significant clusters are evidently associated with certain specific biological processes depending on the GO terms of their own.

Table 1 The result from the analysis for significant GO terms according to the cutoff value.

Full size table

As another evaluation experiment, experimentally identified physical and genetic interactions between genes are mined from BioGRID database for each cluster from the output. Note that the BioGRID database is a freely accessible database of physical and genetic interactions available at [25]. The numbers of known interactions are presented for clusters only with significant GO terms in Table 2, 3, 4, 5, 6. Note that the clusters are sorted according the size of clusters for each output. It is noticed that the relatively large clusters tend to have many known biological interactions. It means that the genes within relatively large clusters are evidently interacting each other during cell division cycle by participating in the certain specific biological process depending on the significant GO terms of their own. Therefore, it suffices to say that the relatively large clusters are prominent clusters among the output clusters.

Table 2 The number of known biological interactions mined from BioGRID database [25] for each output cluster with cutoff = 0.9.

Full size table

Table 3 The number of known biological interactions mined from BioGRID database [25] for each output cluster with cutoff = 0.8.

Full size table

Table 4 The number of known biological interactions mined from BioGRID database [25] for each output cluster with cutoff = 0.7.

Full size table

Table 5 The number of known biological interactions mined from BioGRID database [25] for each output cluster with cutoff = 0.6.

Full size table

Table 6 The number of known biological interactions mined from BioGRID database [25] for each output cluster with cutoff = 0.5.

Full size table

The traditional clustering algorithms focus on relationships based on similar expression profiles, identifying cluster of genes whose expression signals simultaneously rise or fall with an assumption that genes with similar expression profiles have similar biological functions. For example, Spellman et al. [6] identify a large number of genes (~800) as giving a cell-cycle-specific patterns of gene expression by fitting the expression profile of given gene to a sine wave, which is used as a surrogate pattern of ideal cyclicity. Then, they use the hierarchical clustering algorithm to linearly correlate the expression profile for a given gene with the expression profile of other genes, which are considered to be confirmed as certain cell-cycle-regulated genes. To this end, they cluster genes into five cell cycle phases (G1, S, S/G2, G2/M, and M/G1). On the other hand, the PSC algorithm use the theory of multivariate phase synchronization, in which the mean phase coherence in Eq. 4 are used to find closely related genes that have relevant biological interactions and/or sharing significant GO terms. Here, the PSC algorithm deal with a special case of random variable that is defined on a circular scale, such that values whose difference is an integral multiple of a certain period (i.e. 2π) are regarded the same, and all values are wrapped into a single period. Note that the phase difference between expression profiles (or the phase of a expression profile) is an example of circular random variables φ_i(i = 1, 2,...). It is noteworthy that standard (or linear) statistical measures and moments like mean and variance are not applicable, because they yield different values if the period is added to or subtracted from some values, though the physical meaning of these changed values is the same. Based on the theory of phase synchronization, it is assumed that expression signals from two genes could be synchronized if these two genes are biologically interacting with each other. That is, two biologically interacting genes produce oscillating expression signals with a common rhythm. This phenomenon is explained in terms of coincidence of frequencies defined as "phase locking" [12]. With this theory, it is possible to measure the coupling strength between genes, which describes how strong the interaction is between genes.

To compare the capabilities of the PSC algorithm over the traditional clustering algorithm, we investigate whether genes from the output clusters are linearly correlated with each other as follows. Let's suppose that we have n number of genes in one of output clusters. Then, there are (n²-n)/2 number of all possible pairs of genes in the cluster. For each pair of genes, the linear correlation coefficients can be calculated for three expression data sets (i.e. alpha, cdc15, and cdc28), and the mean value of these three linear correlation coefficients is used as the "true" linear correlation coefficient for the given pair of genes. Note that the average values are used because of artifacts due to different cell cycle arrest treatments. Then, the mean linear correlation coefficient of all possible pairs can be obtained for each cluster, and the distribution of mean linear correlation coefficients for each output cluster with cutoff = 0.9, 0.8, 0.7, 0.6, 0.5 are presented in Figure 8. It is observed that the overall mean linear correlation coefficients are relatively low, and some of them are significantly low enough to be considered that their clusters are randomly created based on the linear correlation. That is, there is a significant portion of output clusters that are not obtainable by the traditional clustering algorithm. As an example, let's consider the 1^st cluster from the output clusters with cutoff = 0.7, which consists of genes associated with DNA metabolism process. From all possible pairs of genes in this cluster, we present two types of pairs: 1) similar expression profiles in Figure 9a and 2) time-shifted expression profiles in Figure 9b. It should be noted that all of these presented pairs in Figure 9 are identified as having known biological interactions between genes from the BioGRID database [25]. It is obvious that the time-shifted expression profiles are not obtainable by traditional clustering algorithm, because these profiles have significantly low linear correlation coefficients (<0.5). It is noteworthy that these time-shifted profiles are "similar" expression profiles that are constantly time-shifted from each other, and each gene is identified as having peak levels during different cell cycle phases according to the classification by Spellman et al. [6]. That is, these profiles are oscillating expression signals with a common rhythm during cell division process, which can be obtainable by PSC algorithm based on the theory of multivariate phase synchronization. As an another example, let's consider a relatively smaller cluster (41^st cluster from the output cluster with cutoff = 0.7) than 1^st cluster at this point. The 41^st cluster consisted of 11 genes, which are associated with translation process resulting in the formation of proteins. There are 23 known biological interactions identified from the BioGRID database [25], and these interactions are presented in Figure 10 (also see Table 9). It should be noted that none of genes in this cluster are identified as cell cycle regulated by Spellman et al. [6], and this example also includes expression profiles with significantly low linear correlation coefficient. This is another evidence that PSC algorithm is able to find prominent groups of genes, which are not obtainable by traditional clustering algorithm. It means that the PSC algorithm is able to find prominent groups of non-cell-cycle-regulated genes, which share significant GO terms and/or have relatively many known biological interaction from the BioGRID database [25]. There are more examples of such output clusters that have relative many known interactions and small (or zero) number of identified as cell-cycle-regulated genes by Spellman et al. [6]: e.g. 1) 3^rd, 9^th, 18^th, 41^st clusters with cutoff = 0.7 (Table 4). 2) 5^th, 19^th, 37^th, 50^th, 169^th clusters with cutoff = 0.6 (Table 5) 3) 1^st, 4^th, 6^th, 7^th, 11^th, 24^th clusters with cutoff = 0.5 (Table 6). It should be reminded that these clusters have relatively low linear correlation. Therefore, it suffices to say that the PSC algorithm has capabilities to find many prominent groups of genes that can not be obtained by the traditional clustering algorithms.

For clarification of the significant of cutoff value, the P-value from the distribution of strength of phase synchronization between each oscillator (gene) and the cluster is calculated. It is reasonable to assume that the complete understanding of cellular process during cell division cycle in whole genome scale is not available yet. However, it is well known that the cell division process is a single continuous process. The cell division process is mainly consisted of interphase (G1, S, and G2 phase) and division (M phase). During the S phase, the DNA in the nucleus is replicated, and the M phase includes two separate processes, i.e. mitosis and cytokinesis. The G1 phase is an interval phase between the end of M phase and the beginning of DNA synthesis, and the G2 phase is an interval phase separating the end of DNA synthesis from the beginning of M phase. It is assumed that genes during a certain cell cycle have relatively fewer interactions with genes during the other cell cycle phase. Therefore, the cell division process is a single continuous process and each process is "weakly" connected with other process in the downstream of cell cycle process. Based on this point of view, it is assumed that the whole cell division process is consisted of genes that create a single interacting network with heterogeneous connectivity distribution; thus, whole genomes are considered to estimate the P-value. In order to estimate a P-value for a given strength of phase synchronization between each gene and the cluster ρ_kC, a set of random expression signals is generated by shuffling the expression signals at different time points by interchanging the expression signal at time points 3 and 14. Using Eq. 9, the strength values of phase synchronization between each gene and the cluster ρ_kCare calculated and tabulated their distribution with combined expression set of alpha-factor, cdc15 and cdc28 (Figure 11). This distribution is an approximation of true negatives for input expression signals. By integration, we could estimate a P-value, which is defined as the probability of obtaining a ρ_kClarger than the cutoff from the random distribution: the smaller the P-value, the more significant the strength value ρ_kCand vice versa.

For further understanding of the significance of cutoff values, we examine two biological processes (i.e. mitotic cell cycle and protein amino acid glycosylation) in three output clusters: 11^th cluster (cutoff = 0.7), 1^st cluster (cutoff = 0.6), and 2^nd cluster (cutoff = 0.5), and the known physical or genetic interactions from BioGRID database [25] are visualized for the selected clusters in Figure 12. It is reasonable to assume that genes associated with mitotic cell cycle are more "tightly" interacting with each other than the ones associated with protein amino acid glycosylation. Figure 12 shows that genes with protein amino acid glycosylation have relatively fewer known interactions than the ones with mitotic cell cycle. Therefore, it suffices to say that genes with protein amino acid glycosylation are "weakly" connected to the genes with mitotic cell cycle, and the genes with protein amino acid glycosylation are combined with genes with mitotic cell cycle as cutoff decreases. It means that the size of cluster increases as cutoff decreases, and the PSC algorithm creates relatively small clusters with significantly prominent genes with relatively larger cutoff value and these clusters grows in size by combining other "weakly" interacting genes as cutoff decreases. Therefore, it can be concluded that the larger the cutoff, the more portions of prominent genes in the output clusters and vices versa.

Conclusion

PSC algorithm is mainly based on the theory of multivariate phase synchronization, and the phase synchronization could be understood as a common rhythm of oscillatory activities of systems due to their interactions with each other. We develop the strategy of identifying and categorizing cell cycle specific gene expressions according to the specific biological process, in which expression signals share a common rhythm during cell cycle. That is, PSC algorithm is efficient to find groups of genes that share same periodic variations of expression profiles, which is coincident with the length of the cell cycle. On the other hand, the traditional clustering algorithms search similar expression profiles with an assumption that genes with similar expression profiles have similar biological functions. Our evaluation analysis clearly indicates that PSC algorithm produces prominent clusters, which are not obtainable by traditional clustering algorithms.

Our evaluation analysis also shows that the PSC algorithm is able to find groups of gene, which are significantly associated with each other by sharing significant GO terms of biological process and/or relevant biological interactions. However, the algorithm does not have a capability to create a directed and weighted network of synchronization. Recently, Motter et al. [26] showed that the maximum synchronizability can be achieved when the network of synchronization is weighted and directed for a given degree distribution of heterogeneous connectivity. Therefore, the study for the analysis of cell cycle specific genome expression data could be further advanced by considering the directed and weighted network structure and addressing the effect that asymmetry has on the synchronizability of complex networks.

Based on the evaluation experiments, we draw the conclusion as follows: 1) Based on the theory of multivariate phase synchronization, it is feasible to find groups of genes, which have biological interactions and/or significantly shared GO slim terms of biological process, with cell cycle specific gene expression signals. 2) Among all the output clusters by PSC algorithm, the cluster with relatively larger size has a tendency to include more known interactions than the one with relatively smaller size. 3) It is feasible to understand the cell cycle specific gene expression patterns as the phenomenon of collective synchronization. 4) PSC algorithm is able to find prominent groups of genes, which are not obtainable by traditional clustering algorithms.

Methods

1) Fundamental mathematical concept: multivariate synchronization

The proposed algorithm builds on the concepts of analytic signal and phase synchronization. Hence, we first explain the basic idea of analytic signal and phase synchronization [12, 29]. Then we continue to describe the basic idea of synchronization in ensembles of oscillating systems. By "oscillating systems", we mean systems that produce the response signals with period and frequency. As a first step, we convert the gene expression signal x(t) into analytic signal x_a(t) using Hilbert transform (HT). The analytic signal of gene expression signal x(t) is defined byx_a(t) = x(t) + jx_h(t)

where j is the imaginary unit and x_h(t) is the HT of x(t)

x_{h} (t) = π^{- 1} \lim_{R \to \infty} \int_{- R}^{R} \frac{x (τ)}{t - τ} d τ .

(2)

From this equation, it is noticed that the HT of x(t) may be considered as the convolution of the x(t) and 1/πt. Due to the properties of convolution, the Fourier transform (FT) X_h(ς) of x_h(t) is the product of the FT of x(t) and 1/πt. For physically relevant Fourier frequencies ς > 0, X_h(ς) = -jX(ς). In other words, the HT can be considered by an ideal filter whose amplitude response is unity and phase response is a constant π/2 lag at all Fourier frequencies. The analytic signal can also be expressed in terms of complex polar coordinatesx_a(t) = A(t)exp(jφ(t)),

where $A (t) = | x_{a} (t) | = \sqrt{x^{2} (t) + x_{h}^{2} (t)}$ and φ(t) = arg{x_a(t)}. These two functions are respectively called the amplitude and instantaneous phase of the signal x(t). The basic idea of the analytic signal is that the negative frequency components of the FT (or spectrum) of x(t) s are superfluous, due to the hermitian symmetry of such a spectrum. These can be removed without any loss of information, if an analytic signal is used instead. But note that the removal of the negative frequencies will eliminate such spectral symmetry; the inverse FT of such a one-sided spectrum will give back a complex analytic signal.

In this study, we use the phase of the analytic signal x_a(t) to detect phase synchronization between oscillating systems; i.e. the phase synchronization can be defined as locking of the phases, while the amplitudes can be quite different. Using the methods of analytic signal, it can be shown that the interaction of nonidentical oscillators can lead to a perfect locking of their phases, whereas their amplitudes remain uncorrelated [12]. The strength of phase synchronization between two signals can be measured using the mean phase coherence [20] as follows

ρ_{a, b} = | \frac{1}{n} \sum_{t} \exp (j (ϕ_{a} (t) - ϕ_{b} (t))) | .

(4)

The values of ρ_a,bare confined between 0 (no synchrony) and 1 (perfect synchrony) and this value monotonically increases with the strength of phase synchronization [18].

For multivariate oscillating systems, we use the concept of synchronization in ensembles of oscillators, in which each component interacts with all others. This can be described as global coupling. Let's consider an ensemble of non-identical oscillators to understand the process of collective synchronization. From the previous section, it is understood that a pair of systems can be synchronized, and it is expected that synchronization can be extended to a whole population of systems, or at least to a large portion of it. Pikovsky et al. [11] explained how such coupling results in synchronization in the ensemble with a drawing as shown in Figure 13. With this figure, they have described the driving inputs that come to each system from all others by one input from the whole ensemble. It means that a common force operates each system and this force is proportional to the sum of outputs of all systems in the ensemble. This force can entrain many oscillating systems if their frequencies are close. In the case of global coupling this force is not predetermined, but comes from interaction within the ensemble. To explain qualitatively this force, we consider the study of Allefeld and Kurths [17]. They described the basic idea of multivariate synchronization analysis to understand the oscillating systems as constituting a cluster, in which each component system participates in different degree. The cluster consists of a common rhythm and it is described by the dynamics of a cluster phase Φ. For each measurement, the phase of the cluster is defined as a circular weighted mean of all phases inside the cluster,

Φ (t) = \arg \sum_{m} c_{m} \exp (j ϕ_{m} (t)),

(5)

where the participation index c_mcan be obtained as a function of the synchronization strength between a system and the cluster,

\begin{matrix} c_{m} = f (ρ_{m, C}) & with & ρ_{m, C} | \frac{1}{n} \sum_{t} \exp (j (ϕ_{m} (t) - Φ (t))) | \end{matrix} .

(6)

This participation index c_mmeasure both how close each system inside the cluster follow the common rhythm Φ as well as how much a system contribute to the cluster. In this definition, it is not clear which function f should be chosen for the relationship between the ρ_m,Cand the c_m. They modified the idea of synchronization cluster analysis in a way that is much more generally applicable. For the strength values of the bivariate synchronization ρ_a,bcan be defined as

\begin{matrix} ρ_{a, b} = | 〈 \exp (j (ϕ_{a} - ϕ_{b})) 〉 | \\ = | 〈 \exp (j (Δ ϕ_{a} - Δ ϕ_{b})) 〉 |, \end{matrix}

(7)

where Δφ_a= φ_a- Φ and Δφ_b= φ_b- Φ. They introduce the phase of the mean field to factorize the strength of phase synchronization as follow,

\begin{matrix} ρ_{a, b} = | 〈 \exp (j Δ ϕ_{a}) 〉 | | 〈 \exp (- j Δ ϕ_{b}) 〉 | \\ = ρ_{a, C} ρ_{b, C}, \end{matrix}

(8)

where a ≠ b (ρ_aa= 1). This factorization makes possible to estimate the strength of the synchronization between a system and the cluster, ρ_aC, thus this leads to the derivation of an iteration equation for estimating ρ_kCas follow

{ρ^{'}}_{k C} = \frac{1}{2} (ρ_{k C} + \frac{\sum_{i} F_{i k} ρ_{i C} ρ_{i k}}{\sum_{i} F_{i k} ρ_{i C}^{2}}),

(9)

with F_ik= 1/(1 - ρ_iC²ρ_ik²)² for i ≠ k and F_ik= 0 for i = k.

2) Phase synchronization clustering algorithm

Based on the concept of synchronization in ensembles of oscillating systems, we propose the strategy to make clusters of genes based on the theory of multivariate synchronization. There are 5 steps in this procedure. The descriptions for each step are listed as follow. Inputs to this method are the time series of expression data set and cutoff value for synchronization strength.

Step 1. Obtaining the phase vector φ_i

Let's say there are signals x_i(t) of the K systems i = 1, 2, ..., K with n number of observations t = 1, 2, ..., n of the stochastic process. In this step, the analytical signal can be approximated using Fast Fourier transform [27]. The output of this step is phase vector φ_i, which is defined as φ_i= {φ_i(1), φ_i(2), ..., φ_i(n)}, for 1 ≤ i ≤ K.

Step 2. Initialization of cluster array

First, an array for K number of clusters is produced. For each cluster cluster(i), a phase vector φ_iis stored for 1 ≤ i ≤ K. The output of this step is cluster array cluster(i), for 1 ≤ i ≤ K. The pseudo algorithm of this step is presented in APPENDIX A.

Step 3. Initial clustering

For each phase vector, this step finds how closely the phase vector follows the common rhythm for each cluster from the array. This can be measured by the synchronization strength between the phase vector and the cluster. Then the algorithm finds the cluster in which the phase vector has the highest value of the synchronization strength. If the synchronization strength between the phase vector and the selected cluster is greater or equals to the pre-defined cutoff value, this cluster is updated by appending the phase vector to the selected cluster. This procedure is repeated for the entire phase vectors. The output of this step is the updated cluster array. The pseudo algorithm of this step is presented in APPENDIX B.

Step 4. Filtering cluster

If the cluster contains no more than a system, this does not constitute as a cluster. Thus, the cluster is set to empty list. The pseudo algorithm of this step is presented in APPENDIX C.

Step 5. Combining clusters

Empty clusters are not considered in this step. For each non-empty cluster, the algorithm finds a cluster from the array such that these two clusters will have the most common rhythm when they are combined. If all of the synchronization strength between the combined cluster and each element are greater or equals to the cutoff value, these two clusters are combined. The pseudo algorithm is presented in APPENDIX D.

Appendix A: The pseudo algorithm for initialization of cluster array

Input: phase vectors, φ_ifor 1 ≤ i ≤ K.

Output: cluster array, cluster(i) for 1 ≤ i ≤ K.

for 1 ≤ i ≤ K

cluster(i) = {φ_i}

end

Appendix B: The pseudo algorithm for initial clustering

Input: cutoff and cluster array, cluster(i) for 1 ≤ i ≤ K.

Output: cluster array, cluster(i) for 1 ≤ i ≤ K.

for 1 ≤ i ≤ K

Initialize [SynStrength]^(j)with 0, for 1 ≤ j ≤ K

for 1 ≤ j ≤ K, i ≠ j

temp_list = {cluster(j), φ_i}

n = the size of temp_list

Compute ρ_mC using φ_mfrom temp_list, for 1 ≤ m ≤ n

[SynStrength]^(j)= ρ_nC

end

Find max_SynStrength = max{[SynStrength]^(j), 1 ≤ j ≤ K, i ≠ j}

If max_SynStrength ≥ cutoff

cluster(j) = {cluster(j), φ_i}

end

Appendix C: The pseudo algorithm for filtering cluster

Input: cluster array, cluster(i) for 1 ≤ i ≤ K.

Output: cluster array, cluster(i) for 1 ≤ i ≤ K.

for 1 ≤ i ≤ K

if the size of cluster(i) equals to 1

cluster(i) = {}

end

Appendix D: The pseudo algorithm for combining clusters

Input: cutoff and cluster arrays, cluster(i) for 1 ≤ i ≤ K.

Output: cluster array, cluster(i) for 1 ≤ i ≤ K.

for 1 ≤ i ≤ K

Initialize [SynStrength]^(j)with 0, for 1 ≤ j ≤ K

for 1 ≤ j ≤ K, i ≠ j

if cluster(i) and cluster(j) are not empty lists

temp_list = {cluster(j), cluster(i)}

n = the size of temp_list

Compute ρ_mC using φ_mfrom temp_list, for 1 ≤ m ≤ n

[SynStrength]^(j)= min{ρ_mC, 1 ≤ m ≤ n}

end

Find max_SynStrength = max{[SynStrength]^(j), 1 ≤ j ≤ K, i ≠ j}

If max_SynStrength ≥ cutoff

cluster(j) = {cluster(j), cluster(i)}

cluster(i) = {}

end

References

Hereford LM, Osley MA, Ludwig TRD, McLaughlin CS: Cell-cycle regulation of yeast histone mRNA. Cell 1981, 24: 367–375. 10.1016/0092-8674(81)90326-3
Article CAS PubMed Google Scholar
Cho RJ, Campbell MJ, Winzeler EA, Steinmetz L, Conway A, Wodicka L, Wolfsberg TG, Gabrielian AE, Landsman D, Lockhart DJ, Davis RW: A genome-wide transcriptional analysis of the mitotic cell cycle. Mol Cell 1998, 2: 65–73. 10.1016/S1097-2765(00)80114-8
Article CAS PubMed Google Scholar
Laub MT, McAdams HH, Feldblyum T, Fraser CM, Shapiro L: Global analysis of the genetic network controlling a bacterial cell cycle. Science 2000, 290: 2144–2148. 10.1126/science.290.5499.2144
Article CAS PubMed Google Scholar
Menges M, Murray JA: Synchronous Arabidopsis suspension cultures for analysis of cell-cycle gene activity. Plant J 2002, 30: 203–212. 10.1046/j.1365-313X.2002.01274.x
Article CAS PubMed Google Scholar
Menges M, Hennig L, Gruissem W, Murray JA: Cell cycle-regulated gene expression in Arabidopsis. J Biol Chem 2002, 277: 41987–42002. 10.1074/jbc.M207570200
Article CAS PubMed Google Scholar
Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B: Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell 1998, 9: 3273–3297.
Article PubMed Central CAS PubMed Google Scholar
van der Meijden CM, Lapointe DS, Luong MX, Peric-Hupkes D, Cho B, Stein JL, van Wijnen AJ, Stein GS: Gene profiling of cell cycle progression through S-phase reveals sequential expression of genes required for DNA replication and nucleosome assembly. Cancer Res 2002, 62: 3233–3243.
CAS PubMed Google Scholar
Whitfield ML, Sherlock G, Sadanha AJ, Murray JI, Ball CA, Alexander KE, Matese JC, Perou CM, Hurt MM, Brown PO, Botstein D: Identification of gene periodically expressed in the human cell cycle and their expression in tumors. Mol Biol Cell 2002, 13: 1977–2000. 10.1091/mbc.02-02-0030.
Article PubMed Central CAS PubMed Google Scholar
Cooper S, Shedden K: Microarray analysis of gene expression during the cell cycle. Cell & Chromosome 2003, 2: 1. 10.1186/1475-9268-2-1
Article Google Scholar
Glass L, Mackay MC: From clocks to chaos. New Jersey: Princeton University Press; 1998.
Google Scholar
Pikovsky A, Rosenblum M, Kurths J: Synchronization: A universal concept in nonlinear sciences. Cambridge: Cambridge University Press; 2001.
Chapter Google Scholar
Rosenblum MG, Pikovsky AS, Kurths J: Phase Synchronization of Chaotic Oscillators. Phys Rev Lett 1996, 76: 1804–1807. 10.1103/PhysRevLett.76.1804
Article CAS PubMed Google Scholar
Anishchenko VS, Balanov AG, Janson NB, Igosheva NB, Bordyugov GV: Entrainment between heart rate and weak noninvasive forcing. Int J Bifurcation and Chaos 2000, 10: 2339–2348.
Google Scholar
Schäfer C, Rosenblum MG, Kurths J, Abel HH: Heartbeat synchronized with ventilation. Nature 1998, 392: 239–240. 10.1038/32567
Article PubMed Google Scholar
Stefanovska A, Haken H, McClintock PVE, Hozic M, Bajrovic F, Ribaric S: Reversible transitions between synchronization states of the cardiorespiratory system. Phys Rev Lett 2000, 85: 4831–4834. 10.1103/PhysRevLett.85.4831
Article CAS PubMed Google Scholar
Tass P, Rosenblum MG, Weule J, Kurths J, Pikovsky A, Volkmann J, Schnitzler A, Freund HJ: Detection of n:m phase locking from noisy data: Application to magnetoencephalography. Phys Rev Lett 1998, 81: 3291–3294. 10.1103/PhysRevLett.81.3291
Article CAS Google Scholar
Allefeld C, Kurths J: An approach to multivariate phase synchronization analysis and its application to event-related potentials. Int J Bifurcation and Chaos 2004, 14: 417–426. 10.1142/S0218127404009521
Article Google Scholar
Bhattacharya J: Reduced degree of long-range phase synchrony in pathological human brain. Acta Neurobiol Exp 2001, 61: 309–318.
CAS Google Scholar
Jerger KK, Netoff TI, Francis JT, Sauer T, Pecora L, Weinstein SL, Schiff SJ: Early seizure detection. J Clin Neurophysiol 2001, 18: 259–268. 10.1097/00004691-200105000-00005
Article CAS PubMed Google Scholar
Mormann F, Lehnertz K, David P, Elger CE: Mean phase coherence as a measure for phase synchronization and its application to EEG of epilepsy patients. Physica D 2000, 144: 358–369. 10.1016/S0167-2789(00)00087-7
Article Google Scholar
Blasius B, Huppert A, Stone L: Complex dynamics and phase synchronization in spatially extended ecological systems. Nature 1999, 399: 354–359. 10.1038/20676
Article CAS PubMed Google Scholar
Lunkeit F: Synchronization experiments with an atmospheric global circulation model. Chaos 2001, 11: 47–51. 10.1063/1.1338127
Article PubMed Google Scholar
Strogatz SH: From Kuramoto to Crawford: exploring the onset of synchronization in populations of coupled oscillators. Physica D 2000, 143: 1–20. 10.1016/S0167-2789(00)00094-4
Article Google Scholar
Saccharomyces Genome Database[http://www.yeastgenome.org]
BioGRID: General Repository for Interaction Datasets[http://www.thebiogrid.org]
Motter AE, Zhou C, Kurths J: Network synchronization, diffusion, and the paradox of heterogeneity. Physical Review E 2005, 71: 016116. 10.1103/PhysRevE.71.016116
Article Google Scholar
Marple SL Jr: Computing the discrete-time "analytic" signal via FFT. IEEE trans Signal processing 1999, 47: 2600–2603. 10.1109/78.782222
Article Google Scholar

Download references

Acknowledgements

The authors thank Professor Choung Ik Kim from Kangwon National University for helpful discussion.

Author information

Authors and Affiliations

Institute of Animal Resources Research, Kangwon National University, Chuncheon, Republic of Korea
Chang Sik Kim
Department of Biomedical Engineering, Kwandong University, Kangnung, Republic of Korea
Cheol Soo Bae
Department of Information Technology, Kangwon National University, Chuncheon, Republic of Korea
Hong Joon Tcha

Authors

Chang Sik Kim
View author publications
You can also search for this author in PubMed Google Scholar
Cheol Soo Bae
View author publications
You can also search for this author in PubMed Google Scholar
Hong Joon Tcha
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Chang Sik Kim.

Additional information

Authors' contributions

CSK conceived this study, developed the proposed method, and performed the evaluation experiments. CSK, CSB, and HJT participated in the interpretation of the evaluation experiments. CSK, CSB and HJT drafted and finalized the manuscript. All authors approved the manuscript.

Table 7 The index for each figure depending on cutoff and noise level ε in Figure 3.

Full size table

Table 8 Systematic gene names, cell cycle phases according to the study of Spellman et al. [6], and linear correlation coefficients for each pair of genes in Figure 9.

Full size table

Table 9 Systematic gene names, and linear correlation coefficients for each pair of genes in Figure 10.

Full size table

Electronic supplementary material

12859_2007_2041_MOESM1_ESM.txt

Additional file 1: Tables of significant GO terms of biological process for the output clusters with cutoff = 0.9. This file includes 8 clusters from the PSC algorithm with cutoff = 0.9, which have significantly GO terms of biological process. These significant GO terms are obtained by a tool called GO Term Finder, which is available from Saccharomyces Genome Database [24]. (TXT 7 KB)

12859_2007_2041_MOESM2_ESM.txt

Additional file 2: Tables of significant GO terms of biological process for the output clusters with cutoff = 0.8. This file includes 57 clusters from the PSC algorithm with cutoff = 0.8, which have significantly GO terms of biological process. These significant GO terms are obtained by a tool called GO Term Finder, which is available from Saccharomyces Genome Database [24]. (TXT 42 KB)

12859_2007_2041_MOESM3_ESM.txt

Additional file 3: Tables of significant GO terms of biological process for the output clusters with cutoff = 0.7. This file includes 148 clusters from the PSC algorithm with cutoff = 0.7, which have significantly GO terms of biological process. These significant GO terms are obtained by a tool called GO Term Finder, which is available from Saccharomyces Genome Database [24]. (TXT 110 KB)

12859_2007_2041_MOESM4_ESM.txt

Additional file 4: Tables of significant GO terms of biological process for the output clusters with cutoff = 0.6. This file includes 151 clusters from the PSC algorithm with cutoff = 0.8, which have significantly GO terms of biological process. These significant GO terms are obtained by a tool called GO Term Finder, which is available from Saccharomyces Genome Database [24]. (TXT 124 KB)

12859_2007_2041_MOESM5_ESM.txt

Additional file 5: Tables of significant GO terms of biological process for the output clusters with cutoff = 0.5. This file includes 21 clusters from the PSC algorithm with cutoff = 0.8, which have significantly GO terms of biological process. These significant GO terms are obtained by a tool called GO Term Finder, which is available from Saccharomyces Genome Database [24]. (TXT 50 KB)

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Authors’ original file for figure 4

Authors’ original file for figure 5

Authors’ original file for figure 6

Authors’ original file for figure 7

Authors’ original file for figure 8

Authors’ original file for figure 9

Authors’ original file for figure 10

Authors’ original file for figure 11

Authors’ original file for figure 12

Authors’ original file for figure 13

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Kim, C.S., Bae, C.S. & Tcha, H.J. A phase synchronization clustering algorithm for identifying interesting groups of genes from cell cycle expression data. BMC Bioinformatics 9, 56 (2008). https://doi.org/10.1186/1471-2105-9-56

Download citation

Received: 26 November 2007
Accepted: 28 January 2008
Published: 28 January 2008
DOI: https://doi.org/10.1186/1471-2105-9-56

A phase synchronization clustering algorithm for identifying interesting groups of genes from cell cycle expression data

Abstract

Background

Results

Conclusion

Background

Results and discussion

Case study 1: in silico experiments

Case study 2: α factor-synchronized cell cycle gene expression data analysis

Conclusion

Methods

1) Fundamental mathematical concept: multivariate synchronization

2) Phase synchronization clustering algorithm

Step 1. Obtaining the phase vector φ i

Step 2. Initialization of cluster array

Step 3. Initial clustering

Step 4. Filtering cluster

Step 5. Combining clusters

Appendix A: The pseudo algorithm for initialization of cluster array

Appendix B: The pseudo algorithm for initial clustering

Appendix C: The pseudo algorithm for filtering cluster

Appendix D: The pseudo algorithm for combining clusters

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Authors' contributions

Electronic supplementary material

Authors’ original submitted files for images

Rights and permissions

About this article

Cite this article

Share this article

Keywords

BMC Bioinformatics

Contact us

Step 1. Obtaining the phase vector φ_i