Department of Chemistry, University of Warwick, Coventry, UK

Systems Biology Centre, University of Warwick, Coventry, UK

Abstract

Background

Post-genomic molecular biology has resulted in an explosion of data, providing measurements for large numbers of genes, proteins and metabolites. Time series experiments have become increasingly common, necessitating the development of novel analysis tools that capture the resulting data structure. Outlier measurements at one or more time points present a significant challenge, while potentially valuable replicate information is often ignored by existing techniques.

Results

We present a generative model-based Bayesian hierarchical clustering algorithm for microarray time series that employs Gaussian process regression to capture the structure of the data. By using a mixture model likelihood, our method permits a small proportion of the data to be modelled as outlier measurements, and adopts an empirical Bayes approach which uses replicate observations to inform a prior distribution of the noise variance. The method automatically learns the optimum number of clusters and can incorporate non-uniformly sampled time points. Using a wide variety of experimental data sets, we show that our algorithm consistently yields higher quality and more biologically meaningful clusters than current state-of-the-art methodologies. We highlight the importance of modelling outlier values by demonstrating that noisy genes can be grouped with other genes of similar biological function. We demonstrate the importance of including replicate information, which we find enables the discrimination of additional distinct expression profiles.

Conclusions

By incorporating outlier measurements and replicate values, this clustering algorithm for time series microarray data provides a step towards a better treatment of the noise inherent in measurements from high-throughput genomic technologies. Timeseries BHC is part of the R package 'BHC' (version 1.5), which can be downloaded from Bioconductor (version 2.9 and above) via

Background

Post-genomic molecular biology has resulted in an explosion of typically high dimensional, structured data from new technologies for transcriptomics, proteomics and metabolomics. Often this data measures readouts from large sets of genes, proteins or metabolites over a time course rather than at a single time point. Most biological time series aim to capture information about processes which vary over time, and temporal changes in the transcription program are often apparent

Grouping together genes which exhibit similar variations in expression over time can identify genes that are likely to be co-regulated by the same transcription factors

McLachlan

The Bayesian Hierarchical Clustering (BHC) algorithm

Measurement error is not the only source of noise to consider. Genes regulated by the same transcription factor(s) are unlikely to have identical expression profiles for the duration of the time series, which leads to inherent variation in the expression data of co-regulated genes. Liu

Methods

Bayesian Hierarchical Clustering

Agglomerative hierarchical clustering is a commonly used approach to group genes according to their expression levels. In this algorithm, each gene begins in its own cluster and at each stage the two most similar clusters are merged.
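As a point of reference for the standard (non-Bayesian) procedure described above, the following is a minimal sketch of agglomerative clustering. The function name `agglomerative_merges`, the use of Euclidean distance and the choice of average linkage are illustrative assumptions, not details taken from this article.

```python
import numpy as np

def agglomerative_merges(profiles, n_clusters=1):
    """Naive average-linkage agglomerative clustering (illustrative only).

    profiles: array of shape (n_genes, n_timepoints).
    Returns the final clusters as a list of lists of gene indices.
    """
    profiles = np.asarray(profiles, dtype=float)
    clusters = [[i] for i in range(len(profiles))]  # each gene starts alone
    # Pairwise Euclidean distances between all expression profiles.
    dists = np.linalg.norm(profiles[:, None, :] - profiles[None, :, :], axis=-1)
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average linkage: mean distance over all cross-cluster pairs.
                d = dists[np.ix_(clusters[a], clusters[b])].mean()
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the most similar pair
        del clusters[b]
    return clusters
```

Note that this sketch needs the number of clusters to be supplied by the user, which is the limitation that a model-based stopping rule is intended to remove.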

The BHC algorithm

The prior probability, $\pi_k$, that a given pair of clusters, $C_1$ and $C_2$, should be merged is defined by the DPM and is determined solely by the concentration hyperparameter for the DPM and the number of genes currently in each partition of the clustering (see Savage et al.). Bayes' rule is then used to find the posterior probability, $r_k$, that the pair of clusters should be merged.

where **y** = {

where $T_i$ and $T_j$ are previously merged clusters containing subsets of the data in **y**.

While $r_k$ is greater than 0.5, it is more likely that the data points contained in the clusters $C_1$ and $C_2$ were generated from the same underlying function than from different functions. When $r_k$ is less than 0.5 for all remaining pairs of clusters, the number of clusters and partition best described by the data has been found.
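The merge decision can be sketched numerically as follows. This assumes the standard BHC formulation, in which the evidence for a merged tree is a prior-weighted sum of the single-cluster marginal likelihood and the product of the two subtree evidences. The function and argument names are hypothetical and are not taken from the BHC package.

```python
import math

def merge_posterior(log_prior_merge, log_ml_merged, log_ml_tree_i, log_ml_tree_j):
    """Posterior probability that two clusters should be merged.

    log_prior_merge: log of the prior merge probability (0 < prior < 1).
    log_ml_merged:   log marginal likelihood of all data under one cluster.
    log_ml_tree_i/j: log evidences of the two previously merged subtrees.
    Returns (posterior merge probability, log evidence of the merged tree).
    """
    log_merged = log_prior_merge + log_ml_merged
    # log(1 - prior), computed from the log prior.
    log_not_merge = math.log1p(-math.exp(log_prior_merge))
    log_split = log_not_merge + log_ml_tree_i + log_ml_tree_j
    # Log-sum-exp of the two hypotheses gives the tree evidence.
    top = max(log_merged, log_split)
    log_tree = top + math.log(math.exp(log_merged - top) + math.exp(log_split - top))
    return math.exp(log_merged - log_tree), log_tree
```

Working in log space avoids underflow, since marginal likelihoods of whole clusters are typically extremely small numbers.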

Gaussian Process Regression

Gaussian process regression (GPR) is a non-linear regression method with several previous applications in the analysis of gene expression data
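To make the role of GPR concrete, the sketch below evaluates the log marginal likelihood of a zero-mean Gaussian process at a set of time points, which need not be uniformly spaced. It assumes the textbook squared exponential covariance with an added noise term on the diagonal; the names `se_covariance`, `signal_var`, `length_scale` and `noise_var` are illustrative and are not the identifiers used in the BHC package.

```python
import numpy as np

def se_covariance(t, signal_var, length_scale, noise_var):
    """Squared exponential covariance matrix with noise on the diagonal."""
    t = np.asarray(t, dtype=float)
    sq_diff = (t[:, None] - t[None, :]) ** 2
    K = signal_var * np.exp(-0.5 * sq_diff / length_scale**2)
    return K + noise_var * np.eye(len(t))  # Kronecker delta term

def gp_log_marginal_likelihood(y, K):
    """log N(y; 0, K), computed via a Cholesky factorisation."""
    y = np.asarray(y, dtype=float)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # K^{-1} y
    return (-0.5 * y @ alpha
            - np.log(np.diag(L)).sum()          # equals 0.5 * log|K|
            - 0.5 * len(y) * np.log(2 * np.pi))
```

In practice the hyperparameters would be chosen to maximise this quantity; the sketch only evaluates it for given values.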

In our GPR model a single observation at time point $t_i$ is represented as $y(t_i) = f(t_i) + \epsilon$, where $\epsilon$ is a noise term. The function **f** is drawn from an infinite dimensional Gaussian distribution, where the correlation structure between the points is determined by a covariance function, Σ, with hyperparameters, **θ**.

Let **
y
**= [

where **
y
**. We have implemented both the squared exponential and cubic spline covariance functions into BHC. The probability

Covariance Functions

The covariance function _{
SE
}, which is a widely-used choice for

where _{
ij
}is the Kronecker delta function and _{
i
}and _{
j
}are two time points for _{
SE
}the hyperparameter _{
SE
}increases and tends to unity, meaning these values of _{
CS
}, to facilitate comparison with the clustering method of Heard

where _{
i
},_{j}). _{
CS
}only has two hyperparameters,

Using replicate data to learn the noise hyperparameter

For each cluster, we learn the hyperparameters **θ**

The total noise variance,

where $\{y_{r,g,t}\}$ is the set of replicates for an observation.
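One simple way to turn replicates into a noise estimate is a pooled within-replicate variance, sketched below. This is a generic estimator shown for illustration under an assumed array layout (replicates × genes × time points); it is not claimed to be the exact expression used in this article, and the function name is hypothetical.

```python
import numpy as np

def replicate_summary(data):
    """Average replicates and estimate a pooled noise variance.

    data: array of shape (n_replicates, n_genes, n_timepoints).
    Returns (replicate-averaged profiles, pooled replicate variance).
    """
    data = np.asarray(data, dtype=float)
    means = data.mean(axis=0)  # one averaged profile per gene
    # Unbiased variance across replicates at each (gene, time point),
    # then pooled over all genes and time points.
    pooled_var = data.var(axis=0, ddof=1).mean()
    return means, pooled_var
```

If the replicates are independent, the variance of an average of R replicates is the pooled variance divided by R, which is the relevant scale when the averaged profiles are the quantities being clustered.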

It is these averages of the replicate values,

where

**Gamma prior on the total noise variance**. A Gamma prior is assumed for the hyperparameter

The hyperparameters, **
θ
**|

where ^{-1}
_{
j
}is a matrix of element-wise derivatives and '_{
j
}), is assumed, and therefore the corresponding partial gradients contain only the trace term above. If replicate information is not required to be included in BHC, a flat prior is also assumed for

Modelling outliers

We have so far considered the total noise in microarray measurements to have a Gaussian distribution. However, despite averaging replicate values, microarray data typically contain some outliers that are not well modelled by the Gaussian noise distribution used for the majority of the data.

Kuss **
y
**|

We simplify our notation to denote $f(t_n)$ as $f_n$. Following the reasoning in Kuss, we assume that there is a small probability, $a$, that an observation, $y_n$, was generated by an unknown likelihood function, $p_o$, producing outlier measurements, and a probability $b = 1 - a$ that $y_n$ is a regular value, which was generated by a Gaussian likelihood function, $p_r$. This mixture likelihood function is therefore:

The expression for the marginal likelihood then becomes:
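A form of the mixture likelihood and the corresponding marginal likelihood that is consistent with the description above is sketched here. The symbols $a$ and $b$ for the outlier and regular mixing probabilities, and the choice to write $p_o$ as a function of $y_n$ only, are assumptions made for this illustration.

```latex
\begin{align}
  p(y_n \mid f_n, \boldsymbol{\theta})
    &= b\, p_r(y_n \mid f_n, \boldsymbol{\theta}) + a\, p_o(y_n),
    \qquad a + b = 1, \\
  p(\mathbf{y} \mid \boldsymbol{\theta})
    &= \int \prod_{n=1}^{N}
       \bigl[\, b\, p_r(y_n \mid f_n, \boldsymbol{\theta}) + a\, p_o(y_n) \bigr]\,
       p(\mathbf{f} \mid \boldsymbol{\theta})\, d\mathbf{f}.
\end{align}
```

Expanding the product over the $N$ observations gives one term for every assignment of observations to the regular or outlier component.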

Multiplying out the likelihood function product would result in $2^N$ terms. In the case that $p_o$ is a conjugate distribution to $p_r$, evaluation of this integral would be analytically solvable, but computationally intractable for large numbers of observations. However, if the proportion of outlier measurements is small, this series can be approximated. Making the following simplifications to notation: $B_n = p_r(y_n \mid f_n, \boldsymbol{\theta})$ and

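The term with coefficient $b^N$, where $b$ is the probability that an observation is a regular value, represents the case where no observations are outliers. Terms with coefficient $ab^{N-1}$, where $a = 1 - b$, represent the case where exactly one observation is an outlier.

Terms with $a^2$ or higher order in their coefficients represent the case that two or more observations are outliers. Since $a$ is small, these terms are neglected.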

The likelihood function for the outlier terms, $A_n$, is modelled as the same constant function for all measurements,

When the $B_n$ represent Gaussian distributions, it follows that

where $\mathbf{y}_{-n}$ is the data without the $n$th observation and $\Sigma_{-n}$ is the corresponding covariance matrix.
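The first-order approximation described in this section can be sketched as follows: the marginal likelihood is approximated by the "no outliers" term plus the N "exactly one outlier" terms, each of which is a Gaussian marginal with one observation left out. The names `outlier_prob` and `outlier_density` (the constant outlier likelihood) are assumptions for this sketch, which requires at least two observations and an outlier probability strictly between 0 and 1.

```python
import numpy as np

def gaussian_logpdf(y, K):
    """log N(y; 0, K) for a 1-D array y and covariance matrix K."""
    L = np.linalg.cholesky(K)
    z = np.linalg.solve(L, y)
    return -0.5 * z @ z - np.log(np.diag(L)).sum() - 0.5 * len(y) * np.log(2 * np.pi)

def mixture_log_marginal(y, K, outlier_prob, outlier_density):
    """First-order approximation to the mixture marginal likelihood.

    Keeps the zero-outlier term and the N single-outlier terms;
    terms with two or more outliers are neglected.
    """
    y = np.asarray(y, dtype=float)
    K = np.asarray(K, dtype=float)
    n = len(y)
    a, b = outlier_prob, 1.0 - outlier_prob
    # No outliers: b^N * N(y; 0, K).
    terms = [n * np.log(b) + gaussian_logpdf(y, K)]
    for j in range(n):
        keep = np.arange(n) != j  # drop observation j, treated as the outlier
        terms.append(np.log(a) + (n - 1) * np.log(b) + np.log(outlier_density)
                     + gaussian_logpdf(y[keep], K[np.ix_(keep, keep)]))
    return np.logaddexp.reduce(terms)
```

The cost is one Gaussian evaluation per observation in addition to the standard term, which is in line with the longer run times reported for the mixture model likelihood.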

After optimisation of the hyperparameters for the covariance function, **
y
**|

Datasets

For the computational experiments we have used time series data sets from four published microarray studies, which we refer to as

The

Performance metrics

When comparing BHC to other clustering methods, we are interested in identifying which method produces the most biologically meaningful clusters, and therefore use the Biological Homogeneity Index (BHI)

The average Pearson correlation coefficient,

The BHI and average PCC both represent mean values of a large number of pairwise similarity comparisons. For BHI, we considered whether or not pairs of (annotated) genes that have been allocated to the same cluster share GO annotations. For each such pair of genes, we thereby obtained a 1 or 0, depending on whether or not the genes do (1) or do not (0) have the same annotation. The confidence intervals for the BHI scores provided in Table
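The pairwise scoring behind the BHI, as described above, can be sketched as follows. The sketch assumes that two genes "have the same annotation" when they share at least one GO term, and it averages over all same-cluster pairs of annotated genes; the function name and input layout are hypothetical.

```python
from itertools import combinations

def pairwise_homogeneity(cluster_of, annotations):
    """Mean 0/1 score over same-cluster pairs of annotated genes.

    cluster_of:  dict mapping gene -> cluster label.
    annotations: dict mapping gene -> set of GO terms (may be empty).
    """
    # Only genes with at least one annotation take part in the comparison.
    genes = [g for g in cluster_of if annotations.get(g)]
    scores = []
    for g1, g2 in combinations(genes, 2):
        if cluster_of[g1] == cluster_of[g2]:
            # 1 if the pair shares an annotation, 0 otherwise.
            scores.append(1 if annotations[g1] & annotations[g2] else 0)
    return sum(scores) / len(scores) if scores else float("nan")
```

Because the result is a mean of 0/1 outcomes over many pairs, resampling the pairs (for example by bootstrapping) is one natural way to attach an uncertainty estimate to it.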

Comparison of clustering methods using performance metrics

| Clustering method | S. cerevisiae 1: # clusts | S. cerevisiae 1: average PCC | S. cerevisiae 2: # clusts | S. cerevisiae 2: average PCC | H. sapiens: # clusts | H. sapiens: average PCC | E. coli: # clusts | E. coli: average PCC |
|---|---|---|---|---|---|---|---|---|
| BHC-SE | 13 | **0.68 ± 0.005** | 58 | **0.883 ± 0.003** | 6 | **0.75 ± 0.009** | 24 | **0.84 ± 0.003** |
| BHC-C | 9 | 0.66 ± 0.004 | 40 | 0.877 ± 0.002 | 2 | 0.55 ± 0.009 | 15 | 0.80 ± 0.003 |
| SC-linear | 7 | 0.60 ± 0.006 | 40 | 0.881 ± 0.002 | 4 | 0.69 ± 0.009 | 17 | 0.78 ± 0.004 |
| SC-cubic | 4 | 0.49 ± 0.005 | 22 | 0.852 ± 0.002 | 2 | 0.44 ± 0.010 | 8 | 0.67 ± 0.004 |
| HCL | 13* | 0.53 ± 0.009 | 58* | 0.881 ± 0.002 | 6* | 0.66 ± 0.016 | 24* | 0.68 ± 0.006 |
| SSClust | 13* | 0.60 ± 0.008 | 58* | 0.846 ± 0.003 | 6* | 0.69 ± 0.015 | 24* | 0.72 ± 0.010 |
| CAGED | 2 | 0.42 ± 0.042 | 6 | 0.606 ± 0.003 | 3 | 0.55 ± 0.020 | 2 | 0.47 ± 0.005 |
| MCLUST | 8 | 0.60 ± 0.004 | 30 | 0.858 ± 0.002 | 6 | **0.75 ± 0.011** | 11 | 0.73 ± 0.004 |
| Zhou | 13* | 0.60 ± 0.008 | 58* | 0.853 ± 0.004 | 6* | **0.75 ± 0.011** | 24* | 0.74 ± 0.006 |

| Clustering method | S. cerevisiae 1: # clusts | S. cerevisiae 1: BHI ± stdev | S. cerevisiae 2: # clusts | S. cerevisiae 2: BHI ± stdev | H. sapiens: # clusts | H. sapiens: BHI ± stdev | E. coli: # clusts | E. coli: BHI ± stdev |
|---|---|---|---|---|---|---|---|---|
| BHC-SE | 13 | 0.70 ± 0.07 | 58 | **0.57 ± 0.03** | 6 | 0.62 ± 0.06 | 24 | 0.46 ± 0.06 |
| BHC-C | 9 | **0.73 ± 0.11** | 40 | 0.55 ± 0.03 | 2 | **0.78 ± 0.05** | 15 | **0.47 ± 0.04** |
| SC-linear | 7 | 0.69 ± 0.10 | 40 | 0.55 ± 0.02 | 4 | 0.66 ± 0.07 | 17 | 0.35 ± 0.03 |
| SC-cubic | 4 | 0.64 ± 0.02 | 22 | 0.53 ± 0.01 | 2 | 0.70 ± 0.03 | 8 | 0.32 ± 0.02 |
| HCL | 13* | 0.50 ± 0.04 | 58* | 0.56 ± 0.04 | 6* | 0.52 ± 0.07 | 24* | 0.44 ± 0.07 |
| SSClust | 13* | 0.65 ± 0.03 | 58* | 0.56 ± 0.02 | 6* | 0.64 ± 0.05 | 24* | 0.36 ± 0.03 |
| CAGED | 2 | 0.64 ± 0.02 | 6 | 0.52 ± 0.02 | 3 | 0.68 ± 0.04 | 2 | 0.21 ± 0.01 |
| MCLUST | 8 | 0.69 ± 0.02 | 30 | 0.55 ± 0.02 | 6 | 0.61 ± 0.06 | 11 | 0.47 ± 0.04 |
| Zhou | 13* | 0.66 ± 0.03 | 58* | 0.54 ± 0.02 | 6* | 0.61 ± 0.06 | 24* | 0.43 ± 0.07 |

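| Clustering method | S. cerevisiae 1: # clusts | S. cerevisiae 1: log marginal likelihood | S. cerevisiae 2: # clusts | S. cerevisiae 2: log marginal likelihood | H. sapiens: # clusts | H. sapiens: log marginal likelihood | E. coli: # clusts | E. coli: log marginal likelihood |
|---|---|---|---|---|---|---|---|---|
| BHC-SE | 13 | **-3293** | 58 | **-3956** | 6 | **-633** | 24 | **-2497** |
| BHC-C | 9 | -3356 | 40 | -4294 | 2 | -734 | 15 | -2622 |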

Table 1 shows the average Pearson correlation Coefficient (** y**|

Over-represented GO annotations were found using the GOstat web-based interface

Results and Discussion

Comparison of BHC to other clustering methods

For each of the four data sets, we compared the BHC time series algorithm using squared exponential (BHC-SE) and cubic spline (BHC-C) covariances to the clustering methods of SplineCluster

Software is freely available for each method, and all but HCL estimate the number of clusters for a data set. However, the BIC score in SSClust generally continued to improve with an increasing number of clusters, suggesting overfitting. For the method of Zhou

**The clustering method of Zhou et al**. Further details for running the method of Zhou

Table

**GO annotation matrices**. Over-represented GO annotations,

At each stage, the BHC algorithm calculates the marginal likelihood of the tree structure for the data, **
y
**|

**Gene lists and cluster plots**. Gene lists and cluster eps files for the

**GO annotation matrix for S. cerevisiae 1 data set clustered using BHC with cubic spline covariance**. A large version of Figure 2, left panel.

**GO annotation matrix for S. cerevisiae 1 data set clustered using SplineCluster with linear splines**. A large version of Figure 2, right panel.

BHC clustering of simulated data sets

An advantage of the BHC algorithm is that it allows simulated data with realistic noise and expression profiles to be generated from the Gaussian processes inferred from the BHC clustering of real biological data.
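The idea of simulating a cluster from a Gaussian process can be sketched as below. For simplicity the sketch draws one shared latent profile from a zero-mean GP prior with a squared exponential covariance and then adds independent Gaussian noise for each gene; the article instead uses the Gaussian processes inferred from real data. All names and hyperparameter values are illustrative assumptions.

```python
import numpy as np

def simulate_cluster(times, n_genes, signal_var, length_scale, noise_var, seed=0):
    """Simulate one cluster: a shared GP profile plus per-gene noise.

    Returns (latent profile, array of shape (n_genes, n_timepoints)).
    """
    rng = np.random.default_rng(seed)
    t = np.asarray(times, dtype=float)
    K = signal_var * np.exp(-0.5 * (t[:, None] - t[None, :]) ** 2 / length_scale**2)
    K += 1e-9 * np.eye(len(t))  # small jitter for numerical stability
    latent = rng.multivariate_normal(np.zeros(len(t)), K)  # shared profile
    noise = rng.normal(0.0, np.sqrt(noise_var), size=(n_genes, len(t)))
    return latent, latent + noise
```

Repeating this for several clusters, each with its own profile and noise level, gives a data set for which the true number of clusters is known.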

To demonstrate that the BHC algorithm can find the correct number of clusters for a synthetic data set, we analysed simulated data sets with the same number of genes, time points and noise levels, which were generated from the 6 and 13 Gaussian processes inferred from the BHC-SE clustering of the

**H. sapiens simulated data**. Relative frequencies of the estimated number of clusters obtained when a variety of clustering algorithms (BHC-C, BHC-SE, SplineCluster with linear and cubic splines, MCLUST and SSClust) were applied to simulated data sets (due to slow running times, we only used 100 of the 1000 simulated data sets to obtain the SSClust results). For each clustering algorithm, we draw lines between relative frequency values to aid interpretability. Each simulated data set was generated from the 6 Gaussian processes obtained from the BHC-SE clustering of the

**S. cerevisiae 1 simulated data**. As Figure 3, except that simulated data sets were generated from the 13 Gaussian processes obtained from the BHC-SE clustering of the

BHC-SE finds the correct number of clusters for the simulated data generated from the 6 Gaussian processes in 80% of cases. For the simulated data generated from the 13 Gaussian processes, BHC-SE finds between 11 and 13 clusters in 89% of cases. For the

Modelling outlier measurements

We investigated the effect of using the mixture model likelihood. Figure

**Effect of a mixture model likelihood on noisy gene classification**. Using a mixture model likelihood allows BHC to model certain time points as outlier measurements for the genes shown, and assign the noisy gene to a cluster which is more coherent in its expression profiles and biological function. Outlier time points are time point 11 for

In the

In the

Inclusion of replicate information

We investigated the effect of including the replicate information. Figure

**Effect of including replicate information on noisy clusters**. Using replicate information can split a noisy cluster into smaller more biologically homogeneous clusters with distinct profiles. The examples shown use BHC-C for the

The standard BHC cluster from the

Including the replicate information for the

An unusually noisy cluster (bottom left Figure

Run time

Table

Run time

| Data set | BHC-SE | BHC-SE mixture model | Genes | Time points | Replicates |
|---|---|---|---|---|---|
| | 6 m 3 s | 38 m 49 s | 169 | 17 | N/A |
| | 24 m 8 s | 5 h 48 m | 440 | 15 | 2 |
| | 19 s | 49 s | 58 | 10 | 44 |
| | 7 m 6 s | 34 m 39 s | 200 | 13 | 6 |

Run times of data sets for BHC-SE and BHC-SE with a mixture model likelihood in hours (h), minutes (m) and seconds (s) on a 2.40 GHz Intel Xeon CPU. The run times for BHC-C were very similar to BHC-SE. Using replicate information did not increase the run times. Also shown are the number of genes, time points and replicates for each dataset.

Conclusions

We have presented an extension to the BHC algorithm

BHC facilitates the inclusion of replicate information, and our results clearly demonstrate an improvement in the ability to distinguish between distinct expression profiles when this replicate information is included. Microarray data typically contain outlier observations, which should not be treated together with the majority of observations. One unique aspect of the BHC algorithm presented in this paper is its ability to model these noisy outlier measurements using a mixture model likelihood. The result is that genes with a small number of noisy values, which would otherwise have been assigned to a noisy cluster, are assigned to a biologically relevant cluster, where the noisy gene shares biological functions with the other cluster members. This method provides a step towards a better treatment of the noise inherent in measurements from high-throughput post-genomic technologies.

Availability

Timeseries BHC is part of the R package 'BHC' (version 1.5), which can be downloaded from Bioconductor (version 2.9 and above) via

Authors' contributions

EJC and RSS wrote the clustering code; EJC and PDWK analysed the simulated data and performed bootstrapping; EJC performed the clustering analysis; RD optimised the C++ code and updated the BHC Bioconductor package; DLW designed and directed the research. All authors contributed ideas, participated in writing this article, and read and approved the final manuscript.

Acknowledgements and Funding

We thank Francesco Falciani and Gianni Dehò for providing the