Department of Mathematics, University of Queensland, Brisbane, QLD 4072, Australia

School of Medicine, Griffith Health Institute, Griffith University, Meadowbrook, QLD 4131, Australia

Abstract

Background

Time-course gene expression data such as yeast cell cycle data may be periodically expressed. The Fourier series approximations currently used to cluster such data have been found inadequate to model the complexity of time-course data, partly because they ignore the dependence between the expression measurements over time and the correlation among gene expression profiles. We further investigate the advantages and limitations of available models in the literature and propose a new mixture model with autoregressive random effects of the first order for the clustering of time-course gene-expression profiles. Simulations and real examples are given to demonstrate the usefulness of the proposed models.

Results

We illustrate the applicability of our new model using synthetic and real time-course datasets. We show that our model outperforms existing models, providing more reliable and robust clustering of time-course data. Our model provides superior results when gene profiles are correlated, and comparable results when the correlation between the gene profiles is weak. In the applications to real time-course data, relevant clusters of coregulated genes are obtained, which are supported by gene-function annotation databases.

Conclusions

Our new model under our extension of the EMMIX-WIRE procedure is more reliable and robust for clustering time-course data because it adopts a random effects model that allows for the correlation among observations at different time points. It postulates gene-specific random effects with an autocorrelation variance structure that models coregulation within the clusters. The developed R package is flexible in its specification of the random effects through user-input parameters that enable improved modelling and consequent clustering of time-course data.

Background

DNA microarray analysis has emerged as a leading technology to enhance our understanding of gene regulation and function in cellular mechanism controls on a genomic scale. This technology has advanced to unravel the genetic machinery of biological rhythms by collecting massive gene-expression data in a time course. Time-course gene expression data such as yeast cell cycle data

Various computational models have been developed for gene clustering based on cross-sectional microarray data

Finite mixture models

(1) there are no replications on any particular entity specifically identified as such and

(2) all the observations on the entities are independent of one another,

are violated, multivariate normal mixture models may not be adequate. For example, condition (2) will not hold for the clustering of gene profiles, since not all the genes are independently distributed, and condition (1) will generally not hold either as the gene profiles may be measured over time or on technical replicates. While this correlated structure can be incorporated into the normal mixture model by appropriate specification of the component-covariance matrices, it is difficult to fit the model under such specifications. For example, the M-step may not exist in closed form

Accordingly, Ng et al. developed the EMMIX-WIRE procedure (**EM**-based **MIX**ture analysis **Wi**th **R**andom **E**ffects) to handle the clustering of correlated data that may be replicated. They adopted a mixture of linear mixed models to specify the correlation structure between the variables and to allow for correlations among the observations. The procedure also enables covariate information to be incorporated into the clustering process

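Fourier series approximations have been used to model periodic gene expression, leading to the detection of periodic signals in various organisms including yeast and human cells. A Fourier series approximation of order $K$ to a periodic expression profile has the form

$$f_K(t) = a_0 + \sum_{k=1}^{K}\left[a_k \cos\left(\frac{2\pi k t}{\omega}\right) + b_k \sin\left(\frac{2\pi k t}{\omega}\right)\right], \qquad (1)$$

where $a_0$ is the average value of $f_K(t)$, $a_k$ and $b_k$ are the amplitude coefficients that determine the times at which the gene achieves peak and trough expression levels, respectively, and $\omega$ is the period of the signal of gene expression.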
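As an illustration of how a Fourier series approximation enters a clustering model as a set of regressors, the following minimal sketch builds one design row per time point and evaluates the approximation. It assumes the standard form with a constant term, cosine terms, and sine terms; the function names `fourier_design_row` and `fourier_value` and the example coefficients are hypothetical and not taken from the authors' R program.

```python
import math


def fourier_design_row(t, period, order):
    """Regressors [1, cos(2*pi*k*t/period) for k=1..order, sin(2*pi*k*t/period) for k=1..order]."""
    cosines = [math.cos(2.0 * math.pi * k * t / period) for k in range(1, order + 1)]
    sines = [math.sin(2.0 * math.pi * k * t / period) for k in range(1, order + 1)]
    return [1.0] + cosines + sines


def fourier_value(t, period, constant, cos_coefs, sin_coefs):
    """Evaluate constant + sum_k (cos_coefs[k-1] * cos + sin_coefs[k-1] * sin) at time t."""
    row = fourier_design_row(t, period, len(cos_coefs))
    coefs = [constant] + list(cos_coefs) + list(sin_coefs)
    return sum(c * x for c, x in zip(coefs, row))


# Hypothetical example: 24 time points, period 6, first-order expansion.
times = list(range(24))
design = [fourier_design_row(t, period=6, order=1) for t in times]
```

Each row has 2 × order + 1 entries, so a first-order expansion contributes three regression coefficients per cluster.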

The EMMIX-WIRE procedure of Ng et al.

The paper is organized as follows. We first present the extension of the EMMIX-WIRE model to incorporate AR(1) random effects, which are fitted under the EM framework. In the following section, we conduct a simulation study and analyse three real yeast cell datasets. The last section provides some discussion. The technical details of the derivations are provided in the Additional file

**Supplementary file bmcbioinf-supp-2012.pdf**


Methods

EMMIX-WIRE Model with AR(1) Random Effects

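We let $\mathbf{y}_j = (y_{j1}, \ldots, y_{jm})^{T}$ denote the expression profile of the $j$th gene measured at $m$ time points $(j = 1, \ldots, n)$. Conditional on membership of the $h$th component of the mixture, $\mathbf{y}_j$ is modelled as

$$\mathbf{y}_j = X\boldsymbol{\beta}_h + Z_1\mathbf{u}_{jh} + Z_2\mathbf{v}_h + \boldsymbol{\varepsilon}_{jh},$$

where $\boldsymbol{\beta}_h$ is a $(2K+1)$-dimensional vector containing the unknown parameters $a_0, a_1, \ldots, a_K, b_1, \ldots, b_K$; see (1), $\mathbf{u}_{jh} = (u_{jh1}, \ldots, u_{jhm})^{T}$ and $\mathbf{v}_h = (v_{h1}, \ldots, v_{hm})^{T}$ are the gene-specific and cluster-specific random effects, respectively, and $Z_1$ and $Z_2$ are the corresponding design matrices. We take $\boldsymbol{\varepsilon}_{jh}$ and $\mathbf{v}_h$ to be independent and normally distributed, and independent of $\mathbf{u}_{jh}$. To further account for the time dependent random gene effects, a first-order autoregressive correlation structure is adopted for the gene profiles, so that $\mathbf{u}_{jh}$ follows a zero-mean normal distribution with an AR(1) covariance matrix.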

The inverse of

and

where

The assumptions (2) and (3) imply that our new model assumes an autocorrelation covariance structure under which measurements at each time point have a larger variance compared to the model of Kim et al.
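To make the first-order autoregressive structure concrete, the sketch below builds an AR(1) correlation matrix with entries ρ^|s−t| and its well-known tridiagonal inverse, then checks that the two multiply to the identity. It uses the unscaled correlation form; any additional variance factor used in the model is left out, and the function names are hypothetical.

```python
def ar1_correlation(m, rho):
    """m x m matrix whose (s, t) entry is rho ** |s - t|."""
    return [[rho ** abs(s - t) for t in range(m)] for s in range(m)]


def ar1_precision(m, rho):
    """Closed-form tridiagonal inverse of ar1_correlation(m, rho), for m >= 2 and |rho| < 1."""
    if m < 2 or abs(rho) >= 1:
        raise ValueError("need m >= 2 and |rho| < 1")
    scale = 1.0 / (1.0 - rho ** 2)
    precision = [[0.0] * m for _ in range(m)]
    for s in range(m):
        # End points carry 1 on the diagonal, interior points carry 1 + rho^2.
        precision[s][s] = scale * (1.0 if s in (0, m - 1) else 1.0 + rho ** 2)
        if s + 1 < m:
            precision[s][s + 1] = precision[s + 1][s] = -rho * scale
    return precision


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


# Check on a small example: the product should be the 5 x 5 identity.
product = matmul(ar1_correlation(5, 0.6), ar1_precision(5, 0.6))
```

Because the inverse is tridiagonal, quadratic forms in the AR(1) random effects can be evaluated without a general matrix inversion, which is convenient inside an iterative fitting procedure.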

In the context of mixture models, we consider the

where _{
h
}is the component-pdf of the multivariate normal distribution with mean vector _{
h
}
_{
h
} and covariance matrix

The vector of unknown parameters is denoted by

Maximum likelihood via the EM algorithm

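In the EM framework adopted here, the observed data vector $\mathbf{y} = (\mathbf{y}_1^{T}, \mathbf{y}_2^{T}, \ldots, \mathbf{y}_n^{T})^{T}$ is augmented by the unobservable component labels $\mathbf{z}_1, \mathbf{z}_2, \ldots, \mathbf{z}_n$ of $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_n$, where $\mathbf{z}_j$ is the $g$-dimensional vector of zero-one indicator variables $z_{jh}$, with $z_{jh}$ equal to 1 if $\mathbf{y}_j$ comes from the $h$th component and 0 otherwise. The gene-specific and cluster-specific random effects are also treated as missing data. The complete-data log likelihood $\log L_c$ is the sum of four terms, $\log L_c = l_1 + l_2 + l_3 + l_4$, where

$$l_1 = \sum_{h=1}^{g}\sum_{j=1}^{n} z_{jh}\log \pi_h$$

is the logarithm of the probability of the component labels $z_{jh}$, with $\pi_h$ denoting the mixing proportion of the $h$th component; $l_2$ is the logarithm of the density function of the observed profiles given the gene-specific random effects, the cluster-specific random effects, and $z_{jh} = 1$; and $l_3$ and $l_4$ are the logarithms of the density functions of the gene-specific and cluster-specific random effects, respectively, given $z_{jh} = 1$.

To maximize the complete-data log likelihood $\log L_c$, the above decomposition implies that each of $l_1$, $l_2$, $l_3$, and $l_4$ can be maximized separately. The EM algorithm proceeds iteratively until the difference between successive values of the log likelihood is less than some specified threshold. All major derivations are given in the Additional file.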
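The iteration and stopping rule described above can be illustrated on a much simpler model. The sketch below runs EM for a univariate normal mixture without random effects, so it is not the extended EMMIX-WIRE model; it only shows the alternation of E- and M-steps and the threshold on successive log-likelihood values. The function name, the tolerance, and the toy data are assumptions made for illustration.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_normal_mixture(data, means, variances, weights, tol=1e-8, max_iter=500):
    """EM for a univariate normal mixture; stops when the log likelihood changes by less than tol."""
    means, variances, weights = list(means), list(variances), list(weights)
    history = []
    for _ in range(max_iter):
        # E-step: posterior probabilities of component membership for each observation.
        posteriors, loglik = [], 0.0
        for x in data:
            joint = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(joint)
            loglik += math.log(total)
            posteriors.append([p / total for p in joint])
        history.append(loglik)
        if len(history) > 1 and abs(history[-1] - history[-2]) < tol:
            break
        # M-step: weighted updates of mixing proportions, means, and variances.
        for h in range(len(weights)):
            n_h = sum(p[h] for p in posteriors)
            weights[h] = n_h / len(data)
            means[h] = sum(p[h] * x for p, x in zip(posteriors, data)) / n_h
            spread = sum(p[h] * (x - means[h]) ** 2 for p, x in zip(posteriors, data)) / n_h
            variances[h] = max(spread, 1e-6)  # small floor to avoid a degenerate component
    return weights, means, variances, history


# Toy data with two well-separated groups (hypothetical values).
data = [-2.1, -1.9, -2.0, -2.2, -1.8, 2.0, 2.1, 1.9, 2.2, 1.8]
weights, means, variances, history = em_normal_mixture(data, [-1.0, 1.0], [1.0, 1.0], [0.5, 0.5])
```

The recorded log-likelihood values should never decrease from one iteration to the next, which is a useful check when debugging an EM implementation.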

Results

Simulation study

To illustrate the performance of the proposed model, we present a simulation study based on synthetic time-course data. In the following simulation, we consider an autocorrelation dependence for the periodic expressions and compare our model to that of Kim et al. ^{2}corresponding to low and high autocorrelation among the periodic gene expressions. We also assume that ^{2} and ^{2}, respectively.

There are three clusters of genes. The periods for each cluster are 6, 10, and 16, respectively. There are 24 measurements at time points 0, 1, …, 23, and the first order Fourier expansion is adopted in the simulation models. Parameters and simulation results are listed in Tables
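A simulation of this kind can be sketched as follows. The code generates profiles at time points 0, 1, …, 23 for three clusters with periods 6, 10, and 16, using a first-order Fourier mean plus stationary AR(1) gene-level noise. It omits the cluster-specific random effects and the additional error term, and the coefficient values, cluster sizes, noise scale, and function names are illustrative assumptions rather than the settings used in the study.

```python
import math
import random


def mean_curve(times, period, a0, a1, b1):
    """First-order Fourier mean: a0 + a1*cos(2*pi*t/period) + b1*sin(2*pi*t/period)."""
    return [a0 + a1 * math.cos(2.0 * math.pi * t / period)
            + b1 * math.sin(2.0 * math.pi * t / period) for t in times]


def ar1_noise(n_points, rho, innovation_sd, rng):
    """Stationary AR(1) series: u[t] = rho * u[t-1] + e[t], with e[t] ~ N(0, innovation_sd^2)."""
    stationary_sd = innovation_sd / math.sqrt(1.0 - rho ** 2)
    series = [rng.gauss(0.0, stationary_sd)]
    for _ in range(n_points - 1):
        series.append(rho * series[-1] + rng.gauss(0.0, innovation_sd))
    return series


def simulate_cluster(n_genes, times, period, a0, a1, b1, rho, innovation_sd, rng):
    """Profiles sharing one mean curve, each with its own AR(1) deviation."""
    mu = mean_curve(times, period, a0, a1, b1)
    return [[m + u for m, u in zip(mu, ar1_noise(len(times), rho, innovation_sd, rng))]
            for _ in range(n_genes)]


rng = random.Random(0)
times = list(range(24))
# (period, a0, a1, b1): periods follow the text; the coefficients are illustrative.
clusters = [(6, 0.3, 0.03, 0.06), (10, 1.0, 1.0, 0.9), (16, 0.2, 0.02, 0.01)]
profiles, labels = [], []
for label, (period, a0, a1, b1) in enumerate(clusters):
    genes = simulate_cluster(20, times, period, a0, a1, b1,
                             rho=0.6, innovation_sd=0.5, rng=rng)
    profiles.extend(genes)
    labels.extend([label] * len(genes))
```

Fixing the seed of the random number generator makes each simulated dataset reproducible, which helps when two clustering models are compared on the same replicates.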

| **Parameters** | **First component: EM-W** | **First component: Kim** | **Second component: EM-W** | **Second component: Kim** | **Third component: EM-W** | **Third component: Kim** |
|---|---|---|---|---|---|---|
| … (…, 0.1, 0.315) | -0.002 (0.045) | 0.016 (0.052) | -0.009 (0.033) | -0.001 (0.029) | 0.011 (0.051) | -0.015 (0.051) |
| $a_0$ (0.3, 1, 0.2) | 0.002 (0.135) | 0.008 (0.137) | -0.006 (0.175) | -0.036 (0.186) | -0.003 (0.186) | -0.009 (0.182) |
| $a_1$ (0.03, 1, 0.02) | -0.001 (0.119) | -0.018 (0.124) | 0.024 (0.272) | 0.004 (0.160) | 0.004 (0.175) | -0.001 (0.152) |
| $b_1$ (0.06, 0.9, 0.01) | 0.009 (0.119) | -0.015 (0.132) | -0.164 (0.223) | 0.031 (0.160) | 0.027 (0.149) | 0.008 (0.183) |
| …$^{2}$ (0.5, 0.5, 0.5) | 0.055 (0.082) | 1.543 (1.547) | 0.089 (0.164) | 1.346 (1.349) | 0.110 (0.152) | 1.443 (1.446) |
| … (…, 0.6, 0.6) | -0.023 (0.036) | -0.395 (0.397) | -0.043 (0.082) | -0.372 (0.374) | -0.043 (0.058) | -0.392 (0.394) |
| …$^{2}$ (1.0, 1.0, 1.0) | 0.0171 (0.055) | | -0.017 (0.127) | | 0.011 (0.088) | |
| …$^{2}$ (0.4, 0.2, 0.3) | -0.112 (0.145) | | -0.091 (0.102) | | -0.118 (0.134) | |

| | EM-W: Mean (RMSE) SD | Kim: Mean (RMSE) SD | Proportion (EM-W is better) |
|---|---|---|---|
| Error rate | 0.036 (0.044) 0.026 | 0.099 (0.108) 0.044 | 986/1000 |
| Rand | 0.954 (0.056) 0.032 | 0.863 (0.149) 0.060 | 993/1000 |
| Adjusted Rand | 0.906 (0.113) 0.064 | 0.726 (0.299) 0.120 | 993/1000 |

| **Parameters** | **First component: EM-W** | **First component: Kim** | **Second component: EM-W** | **Second component: Kim** | **Third component: EM-W** | **Third component: Kim** |
|---|---|---|---|---|---|---|
| … (…, 0.1, 0.315) | -0.006 (0.061) | 0.035 (0.080) | -0.009 (0.047) | -0.002 (0.045) | 0.015 (0.070) | -0.033 (0.074) |
| $a_0$ (0.3, 1, 0.2) | 0.001 (0.137) | 0.018 (0.147) | -0.004 (0.173) | -0.069 (0.197) | -0.00 (0.186) | -0.014 (0.178) |
| $a_1$ (0.03, 1, 0.02) | 0.010 (0.162) | -0.062 (0.227) | 0.017 (0.388) | -0.031 (0.236) | 0.001 (0.230) | -0.002 (0.199) |
| $b_1$ (0.06, 0.9, 0.01) | 0.009 (0.124) | -0.042 (0.166) | -0.180 (0.235) | 0.073 (0.188) | 0.032 (0.163) | 0.009 (0.213) |
| …$^{2}$ (1.3, 1.3, 1.3) | -0.042 (0.097) | 1.671 (1.677) | -0.030 (0.223) | 1.449 (1.460) | 0.008 (0.153) | 1.549 (1.556) |
| … (…, 0.6, 0.6) | 0.009 (0.020) | -0.249 (0.251) | -0.001 (0.055) | -0.228 (0.235) | 0.002 (0.025) | -0.250 (0.252) |
| …$^{2}$ (1.0, 1.0, 1.0) | 0.131 (0.155) | | 0.121 (0.219) | | 0.141 (0.186) | |
| …$^{2}$ (0.4, 0.2, 0.3) | -0.151 (0.172) | | -0.124 (0.129) | | -0.160 (0.168) | |

| | EM-W: Mean (RMSE) SD | Kim: Mean (RMSE) SD | Proportion (EM-W is better) |
|---|---|---|---|
| Error rate | 0.094 (0.102) 0.039 | 0.184 (0.192) 0.053 | 988/1000 |
| Rand | 0.881 (0.129) 0.049 | 0.758 (0.252) 0.069 | 1000/1000 |
| Adjusted Rand | 0.760 (0.259) 0.097 | 0.518 (0.500) 0.133 | 1000/1000 |

| **Parameters** | **First component: EM-W** | **First component: Kim** | **Second component: EM-W** | **Second component: Kim** | **Third component: EM-W** | **Third component: Kim** |
|---|---|---|---|---|---|---|
| … (…, 0.1, 0.315) | 0.001 (0.009) | 0.008 (0.012) | -0.001 (0.008) | -0.003 (0.008) | -0.001 (0.010) | -0.005 (0.011) |
| $a_0$ (0.3, 1, 0.2) | 0.001 (0.017) | 0.008 (0.019) | -0.001 (0.018) | -0.018 (0.026) | 0.003 (0.016) | -0.014 (0.016) |
| $a_1$ (0.03, 1, 0.02) | -0.002 (0.049) | -0.023 (0.060) | -0.001 (0.059) | -0.005 (0.062) | 0.003 (0.049) | -0.006 (0.049) |
| $b_1$ (0.06, 0.9, 0.01) | -0.001 (0.026) | -0.014 (0.031) | 0.016 (0.033) | 0.019 (0.038) | 0.002 (0.032) | 0.004 (0.033) |
| …$^{2}$ (0.5, 0.5, 0.5) | 0.071 (0.081) | 1.162 (1.162) | 0.081 (0.119) | 1.158 (1.160) | 0.078 (0.090) | 1.159 (1.159) |
| … (…, 0.6, 0.6) | -0.032 (0.038) | -0.337 (0.337) | -0.037 (0.062) | -0.339 (0.340) | -0.036 (0.045) | -0.339 (0.340) |
| …$^{2}$ (1.0, 1.0, 1.0) | -0.059 (0.068) | | -0.069 (0.106) | | -0.064 (0.077) | |
| …$^{2}$ (0, 0, 0) | 0 (0.000) | | 0.001 (0.001) | | 0.000 (0.001) | |

| | EM-W: Mean (RMSE) SD | Kim: Mean (RMSE) SD | Proportion (EM-W is better) |
|---|---|---|---|
| Error rate | 0.078 (0.078) 0.008 | 0.081 (0.081) 0.009 | 738/1000 |
| Rand | 0.891 (0.110) 0.012 | 0.886 (0.115) 0.012 | 806/1000 |
| Adjusted Rand | 0.780 (0.222) 0.023 | 0.769 (0.232) 0.025 | 802/1000 |

Specifically, we first investigate the performance of our new extended EMMIX-WIRE model and that of Kim et al. … $a_0$, $a_1$, $b_1$, …$^{2}$, …$^{2}$ in the proposed model are approximately unbiased, except for …$^{2}$, which is slightly underestimated. In contrast, the model of Kim et al.

We now compare our model with that of Kim et al. ^{2} = 0), where gene expressions are independent. The results are presented in Tables

| **Parameters** | **First component: EM-W** | **First component: Kim** | **Second component: EM-W** | **Second component: Kim** | **Third component: EM-W** | **Third component: Kim** |
|---|---|---|---|---|---|---|
| … (…, 0.1, 0.315) | -0.001 (0.014) | 0.024 (0.029) | 0.002 (0.016) | -0.005 (0.017) | -0.001 (0.017) | -0.019 (0.026) |
| $a_0$ (0.3, 1, 0.2) | -0.001 (0.027) | 0.018 (0.035) | 0.003 (0.026) | -0.046 (0.053) | 0.000 (0.021) | -0.005 (0.021) |
| $a_1$ (0.03, 1, 0.02) | 0.001 (0.085) | -0.068 (0.146) | 0.005 (0.108) | -0.041 (0.127) | 0.001 (0.086) | 0.008 (0.085) |
| $b_1$ (0.06, 0.9, 0.01) | 0.003 (0.042) | -0.031 (0.063) | 0.005 (0.054) | 0.047 (0.072) | 0.002 (0.050) | 0.004 (0.054) |
| …$^{2}$ (1.3, 1.3, 1.3) | -0.059 (0.087) | 1.254 (1.254) | -0.076 (0.178) | 1.251 (1.257) | -0.052 (0.104) | 1.242 (1.243) |
| … (…, 0.6, 0.6) | 0.012 (0.019) | -0.198 (0.199) | -0.013 (0.039) | -0.201 (0.206) | 0.009 (0.023) | -0.203 (0.204) |
| …$^{2}$ (1.0, 1.0, 1.0) | 0.046 (0.070) | | 0.056 (0.145) | | 0.039 (0.084) | |
| …$^{2}$ (0, 0, 0) | 0.000 (0.000) | | 0.001 (0.001) | | 0.000 (0.000) | |

| | EM-W: Mean (RMSE) SD | Kim: Mean (RMSE) SD | Proportion (EM-W is better) |
|---|---|---|---|
| Error rate | 0.154 (0.154) 0.011 | 0.161 (0.162) 0.012 | 835/1000 |
| Rand | 0.796 (0.204) 0.014 | 0.783 (0.217) 0.016 | 912/1000 |
| Adjusted Rand | 0.590 (0.411) 0.028 | 0.566 (0.435) 0.031 | 896/1000 |

Lastly, we generate the data from the model of Kim et al.

| **Parameters** | **First component: EM-W** | **First component: Kim** | **Second component: EM-W** | **Second component: Kim** | **Third component: EM-W** | **Third component: Kim** |
|---|---|---|---|---|---|---|
| … (…, 0.1, 0.315) | -0.003 (0.004) | 0.000 (0.003) | -0.008 (0.023) | 0.001 (0.003) | 0.010 (0.024) | -0.000 (0.004) |
| $a_0$ (0.3, 1, 0.2) | 0.002 (0.013) | 0.000 (0.013) | 0.003 (0.010) | 0.001 (0.010) | 0.001 (0.010) | 0.001 (0.010) |
| $a_1$ (0.03, 1, 0.02) | 0.015 (0.041) | 0.001 (0.036) | -0.236 (0.333) | -0.002 (0.037) | 0.047 (0.073) | 0.003 (0.035) |
| $b_1$ (0.06, 0.9, 0.01) | 0.014 (0.026) | -0.000 (0.021) | -0.308 (0.345) | -0.001 (0.023) | 0.058 (0.067) | 0.001 (0.025) |
| …$^{2}$ (0.5, 0.5, 0.5) | -0.034 (0.036) | -0.000 (0.006) | -0.006 (0.027) | -0.001 (0.015) | -0.021 (0.025) | -0.000 (0.009) |
| … (…, 0.6, 0.6) | 0.020 (0.021) | -0.000 (0.007) | 0.013 (0.025) | -0.001 (0.017) | 0.023 (0.028) | -0.001 (0.009) |
| …$^{2}$ (0.0, 0.0, 0.0) | 0.025 (0.026) | | 0.014 (0.015) | | 0.022 (0.023) | |
| …$^{2}$ (0, 0, 0) | 0.000 (0.000) | | 0.045 (0.095) | | 0.042 (0.056) | |

| | EM-W: Mean (RMSE) SD | Kim: Mean (RMSE) SD | Proportion (EM-W is not worse) |
|---|---|---|---|
| Error rate | 0.018 (0.019) 0.006 | 0.016 (0.017) 0.004 | 422/1000 |
| Rand | 0.978 (0.023) 0.006 | 0.980 (0.021) 0.005 | 365/1000 |
| Adjusted Rand | 0.955 (0.046) 0.012 | 0.959 (0.042) 0.011 | 363/1000 |

| **Parameters** | **First component: EM-W** | **First component: Kim** | **Second component: EM-W** | **Second component: Kim** | **Third component: EM-W** | **Third component: Kim** |
|---|---|---|---|---|---|---|
| … (…, 0.1, 0.315) | -0.009 (0.013) | 0.001 (0.010) | -0.007 (0.012) | 0.005 (0.011) | 0.016 (0.020) | -0.001 (0.013) |
| $a_0$ (0.3, 1, 0.2) | -0.002 (0.023) | -0.000 (0.023) | 0.015 (0.024) | 0.001 (0.019) | 0.003 (0.016) | -0.000 (0.016) |
| $a_1$ (0.03, 1, 0.02) | -0.005 (0.071) | -0.001 (0.074) | 0.054 (0.0928) | -0.000 (0.083) | 0.003 (0.068) | 0.000 (0.064) |
| $b_1$ (0.06, 0.9, 0.01) | 0.015 (0.036) | -0.000 (0.036) | -0.131 (0.135) | 0.001 (0.045) | 0.020 (0.041) | 0.000 (0.043) |
| …$^{2}$ (1.3, 1.3, 1.3) | -0.195 (0.196) | -0.000 (0.016) | -0.185 (0.192) | -0.003 (0.049) | -0.186 (0.189) | -0.002 (0.025) |
| … (…, 0.6, 0.6) | 0.043 (0.043) | -0.000 (0.007) | 0.037 (0.042) | -0.002 (0.022) | 0.044 (0.045) | -0.001 (0.010) |
| …$^{2}$ (0.0, 0.0, 0.0) | 0.144 (0.145) | | 0.131 (0.133) | | 0.143 (0.144) | |
| …$^{2}$ (0, 0, 0) | 0.000 (0.000) | | 0.000 (0.001) | | 0.001 (0.001) | |

| | EM-W: Mean (RMSE) SD | Kim: Mean (RMSE) SD | Proportion (EM-W is not worse) |
|---|---|---|---|
| Error rate | 0.103 (0.104) 0.009 | 0.102 (0.103) 0.010 | 426/1000 |
| Rand | 0.864 (0.137) 0.012 | 0.866 (0.135) 0.012 | 360/1000 |
| Adjusted Rand | 0.725 (0.276) 0.025 | 0.729 (0.272) 0.025 | 352/1000 |

Our model again provides unbiased estimates for all parameters. In contrast to the model of Kim et al.
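The Rand and adjusted Rand indices reported in the tables compare an estimated partition with the true one by counting pairs of genes on which the two partitions agree. The following minimal sketch computes both from two label vectors using the standard pair-counting formulas; the function name `rand_indices` is hypothetical and the code is not taken from the authors' R program.

```python
from collections import Counter
from math import comb


def rand_indices(labels_true, labels_pred):
    """Return (Rand index, adjusted Rand index) for two labelings of the same items."""
    n = len(labels_true)
    if n != len(labels_pred) or n < 2:
        raise ValueError("need two labelings of the same length, with at least two items")
    pairs = comb(n, 2)
    # Pairs placed together by both partitions, by the true one, and by the estimated one.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    together_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    together_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    # Agreements = pairs together in both + pairs separated in both.
    rand = (pairs + 2 * together_both - together_true - together_pred) / pairs
    expected = together_true * together_pred / pairs
    maximum = 0.5 * (together_true + together_pred)
    if maximum == expected:
        adjusted = 1.0  # both partitions are trivial and identical in structure
    else:
        adjusted = (together_both - expected) / (maximum - expected)
    return rand, adjusted
```

Both indices are invariant to relabelling of the clusters, so no matching between estimated and true cluster labels is needed. The adjusted index corrects for chance agreement, which is why it is lower than the Rand index in every table above.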

Applications: Yeast cell cycle datasets

Yeast cell cycle dataset 1

The first example considers the yeast cell cycle data analysed recently by Wong et al.

In Table ^{2}in clusters 1 and 4 are large and are greater than the corresponding estimates of ^{2}, indicating coregulation in these two clusters. If we ignore such within-cluster coregulation, we will have Rand Indices similar to those for the model of Kim et al.

Clustering of gene expression profiles into four groups for the yeast dataset 1.


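| | **First cluster** | **Second cluster** | **Third cluster** | **Fourth cluster** |
|---|---|---|---|---|
| … | 0.104 | 0.054 | 0.118 | 0.724 |
| $a_1$ | -0.107 | 0.400 | -0.807 | 0.298 |
| $b_1$ | 1.009 | -0.119 | -0.053 | 0.079 |
| …$^{2}$ | 0.027 | 0.011 | 0.025 | 0.278 |
| …$^{2}$ | 0.174 | 0.417 | 0.443 | 0.307 |
| … | 0.278 | 0.717 | 0.435 | 0.053 |
| …$^{2}$ | 0.191 | 0.001 | 0.031 | 0.310 |
| … | 85 | 85 | 85 | 85 |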

Yeast cell cycle dataset 2

The second example is the subset of 384 genes from the yeast cell cycle data in Cho et al.

Each gene is assigned a “phase”, and we call each “phase” a “Main Group”. There are five “Main Groups” in this dataset, namely, early G1, late G1, S, G2, and M. We now compare and assess the cluster quality against this external criterion (the five phases). The raw data are log transformed and normalized by columns and rows. Figure ^{2} are all very small compared to the estimates of ^{2}.

Clustering of gene expression profiles into five groups for the yeast dataset 2.


| | **First cluster** | **Second cluster** | **Third cluster** | **Fourth cluster** | **Fifth cluster** |
|---|---|---|---|---|---|
| … | 0.238 | 0.290 | 0.151 | 0.165 | 0.157 |
| $a_1$ | 0.643 | -0.061 | -0.736 | -0.616 | 0.329 |
| $b_1$ | -0.062 | 1.019 | 0.285 | -0.772 | -1.001 |
| …$^{2}$ | 0.011 | 0.046 | 0.037 | 0.028 | 0.006 |
| …$^{2}$ | 0.498 | 0.296 | 0.470 | 0.309 | 0.244 |
| … | 0.503 | 0.269 | 0.364 | 0.379 | 0.550 |
| …$^{2}$ | 0.062 | 0.052 | 0.044 | 0.065 | 0.030 |
| … | 85 | 85 | 85 | 85 | 85 |

A complete Yeast dataset

With this third example, we demonstrate how the proposed method can be adopted to cluster a large number of yeast genes of which only a small proportion show periodicity. The original dataset consists of more than 6000 genes, where the yeast cells were sampled at 7 min intervals for 119 min, giving a total of 18 time points after synchronization

The new mixture model with AR(1) random effects and Fourier series approximations was fitted to the periodic gene expression data with the number of clusters

Clustering of gene expression profiles into twenty-one groups for the complete yeast dataset: (a) eight clusters of periodic genes; (b) thirteen clusters of non-periodic genes.


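| **cluster** | **G1** | **G2/M** | **M/G1** | **S** | **S/G2** |
|---|---|---|---|---|---|
| 1 | 1 | 40 | 0 | 1 | 42 |
| 2 | 98 | 0 | 19 | 0 | 1 |
| 3 | 24 | 24 | 31 | 3 | 2 |
| 4 | 16 | 1 | 0 | 20 | 13 |
| 5 | 0 | 7 | 30 | 0 | 0 |
| 6 | 72 | 1 | 3 | 3 | 1 |
| 7 | 0 | 51 | 1 | 0 | 2 |
| 8 | 12 | 34 | 8 | 20 | 31 |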

With reference to the findings by Spellman et al.

Discussion

We have presented a new mixture model with AR(1) random effects for the clustering of time-course gene expression profiles. Our new model involves three elements that play an important role in modelling time-course periodic expression data, namely, (a) the Fourier expansion, which models the periodic patterns; (b) the autocorrelation variance structure, which accounts for the autocorrelation among the observations at different time points; and (c) the cluster-specific random effects, which incorporate the coregulation within the clusters. In particular, the latter two elements, corresponding to the correlations between time points and between genes, are crucial for reliable and accurate clustering of time-course data. We have demonstrated in the simulation and real examples that the accuracy of clustering is improved if the autocorrelation among the time dependent gene expression profiles is accounted for along the time points; this is also demonstrated in Kim et al.

Simulated gene expression profiles for the three models.


As an additional empirical comparison, we applied a simple

For the purpose of comparison, the periods of the signal of gene expression are assumed to be known in the simulation study and applications to real data. In practice, there are several ways to estimate the periods for each cluster. One is a grid search over the vector $(\omega_1, \omega_2, \ldots, \omega_g)^{T}$ representing the component periods, where $\omega_h$ can take all possible values (grid points). For example, for the yeast cell cycle data, the possible periods are 60, 61, …, 90. Then for each fixed $(\omega_1, \omega_2, \ldots, \omega_g)^{T}$, we estimate the parameters as if the periods for each component were known. Finally, we compare the log likelihoods and choose the fit with the highest log likelihood as the final result. Since it is very slow if there are too many elements in
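The idea of choosing a period by comparing fits over a grid can be sketched in a simplified setting. The code below handles a single profile rather than a mixture: for each candidate period it fits a first-order Fourier curve by least squares and keeps the period with the smallest residual sum of squares, which under independent normal errors with a common variance corresponds to the highest profile log likelihood. The function names and the test signal are hypothetical, and the full procedure described above would instead refit the mixture model for each vector of component periods.

```python
import math


def design_row(t, period):
    angle = 2.0 * math.pi * t / period
    return [1.0, math.cos(angle), math.sin(angle)]


def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (aug[r][n] - tail) / aug[r][r]
    return solution


def rss_for_period(times, values, period):
    """Residual sum of squares of the least-squares first-order Fourier fit."""
    rows = [design_row(t, period) for t in times]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * v for r, v in zip(rows, values)) for i in range(3)]
    beta = solve(xtx, xty)
    return sum((v - sum(b * x for b, x in zip(beta, r))) ** 2 for r, v in zip(rows, values))


def best_period(times, values, candidates):
    return min(candidates, key=lambda p: rss_for_period(times, values, p))


# Sampling every 7 min up to 119 min gives 18 time points; the signal is a hypothetical example.
times = list(range(0, 120, 7))
values = [0.1 + math.cos(2.0 * math.pi * t / 85) + 0.5 * math.sin(2.0 * math.pi * t / 85)
          for t in times]
chosen = best_period(times, values, range(60, 91))
```

With g components and 31 candidate periods each, an exhaustive search over all period vectors requires 31^g model fits, which illustrates why the full grid quickly becomes slow.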

The proposed model is very flexible through the different specifications of design matrices or model options as originally available in Ng et al.

**Supplementary file for code and data supp2.zip**


Conclusions

Our new extended EMMIX-WIRE model is more reliable and robust for clustering time-course data because it postulates gene-specific random effects with an autocorrelation variance structure that models coregulation within the clusters. The developed R package is flexible in its specification of the random effects through user-input parameters that enable improved modelling and consequent clustering of time-course data.

Availability

An R-program is available on request from the corresponding author.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

All authors contributed to the production of the manuscript. SKN and GJM directed the research. KW wrote the R-program and analysed the simulated and real data. All authors read and approved the final manuscript.

Acknowledgements

This research was supported by a grant from the Australian Research Council.