Abstract
Background
Time course data from microarrays and highthroughput sequencing experiments require simple, computationally efficient and powerful statistical models to extract meaningful biological signal, and for tasks such as data fusion and clustering. Existing methodologies fail to capture either the temporal or replicated nature of the experiments, and often impose constraints on the data collection process, such as regularly spaced samples, or similar sampling schema across replications.
Results
We propose hierarchical Gaussian processes as a general model of gene expression timeseries, with application to a variety of problems. In particular, we illustrate the method’s capacity for missing data imputation, data fusion and clustering.The method can impute data which is missing both systematically and at random: in a holdout test on real data, performance is significantly better than commonly used imputation methods. The method’s ability to model inter and intracluster variance leads to more biologically meaningful clusters. The approach removes the necessity for evenly spaced samples, an advantage illustrated on a developmental Drosophila dataset with irregular replications.
Conclusion
The hierarchical Gaussian process model provides an excellent statistical basis for several geneexpression timeseries tasks. It has only a few additional parameters over a regular GP, has negligible additional complexity, is easily implemented and can be integrated into several existing algorithms. Our experiments were implemented in python, and are available from the authors’ website: http://staffwww.dcs.shef.ac.uk/people/J.Hensman/ webcite.
Background
Gene expression time course experiments have been used to investigate fundamental biological processes which are often dynamic in nature. For example, the cell cycle [1], cell signalling [2], circadian rhythms [3] and developmental processes [4] have been studied extensively using gene expression timeseries data.
Many computational approaches to timeseries analysis are not always well suited to gene expression data, where missing measurements are common and time points may not be spaced regularly. In many conventional timeseries models such as statespace models [5,6] there is no straightforward manner to deal with missing data, and time points must occur at regular intervals. Whilst gene expression experiments can be sampled regularly, such designs may not be optimal from a statistical or cost perspective. A method for modelling arbitrarily sampled time points may elicit more information from fewer samples, where time points are selected to capture pertinent temporal features.
Furthermore, existing timeseries models do not necessarily capture the structure of gene expression data. Many gene expression timeseries are performed with multiple biological replicates: the crude method of simply averaging the replicates may be discarding interesting information. It is also unclear what to do when the replicates are not sampled at the same times. There is a need for a temporal model which deals with the replicate structure.
Our proposed model is based upon two important ideas: Gaussian process (GP) regression allows for parsimonious temporal inference, whilst a hierarchical structure accounts for (temporally structured) covariance between biological replicates. Additional layers can be added to the hierarchy to model more structure in the data. For example, in a data fusion application, a layer of hierarchy can be used to account for differences between gene expression measurement platforms; or in a clustering application, a hierarchical layer can be added to account for temporal covariance of genes within a cluster.
GPs have been successfully applied to the analysis of gene expression time series by several authors [79]. There is little doubt that they provide a coherent and principled framework for regression: for an introduction see [10]. Our contribution is to propose hierarchical Gaussian processes to deal with structure in the data. We provide an introduction to the idea, deriving a novel covariance function which accounts for structure. The idea is simple to implement yet highly effective as we demonstrate on several problems. A hierarchical GP model could easily be integrated with existing GPbased applications, allowing them to properly account for replicate structure.
In a further contribution, we manipulate the marginal likelihood expression for the hierarchical GP model for the case where each part of the structure is sampled at the same time, leading to an expression with reduced computational complexity. This situation is most likely to occur during clustering of genes, which must all be measured simultaneously using high throughput methods.
Short time series are prevalent in geneexpression data sets [11]. Our GPbased model is well suited to short timeseries, and the behaviour of GPs can be set to mimic that of other temporal models (such as autoregressors) through the covariance function [10], though in this work we use a simple form for the covariance which assures smoothness of the underlying dynamics.
Unlike other timeseries based approaches, GPs are not restricted to data which has been sampled at evenly spaced time points. The model therefore removes any restriction on temporal sampling — it can be totally irregular and differ between replicates. This also allows our method to deal with both randomly and systematically missing data. We show how the model can be used for data fusion where the temporal sampling differs between experiments.
Related work
Hierarchical models are an important idea in Bayesian statistics [12], allowing information to be exchanged between related groups of data. The idea is that by performing inference on the structure as a whole, rather than on each part of the structure independently, inference is improved. GPs have been successfully used in models of gene expression timeseries before; for example for inferring transcriptional regulation [8], and to identify differential expression in timeseries [7,13]. A key contribution of this work is to combine hierarchical structures with GPs to provide a parsimonious and elegant method for dealing with replicated gene expression timeseries.
An alternative to our method was proposed by [14]. In this model, uncertainty is assumed in the time of data collection, and the timeshift in each replicate is estimated. In our model, the times are assumed correct whilst the shift is assumed to occur in the expression. Our model has significant computational advantages, since we can marginalise the shifts in expression analytically under the GP framework, whilst the method proposed in [14] required optimisation of a large number of variables (one for each observation). Further, our model is easily included in more complex GPbased models, such as the clustering application which we shall demonstrate. The estimation of timeshifts would be difficult to incorporate into a clustering method, especially considering the very large numbers of parameters which require optimising.
Clustering expression data while modelling withincluster variance is one of the primary applications of our model. Previously, [15] proposed a random effects model to account for variance between observations of genes and also within clusters of genes. Further, [16] and [17] explored clustering methods with hierarchical structures to model replicate variance. In these models, replicate variance was modelled as multivariate Gaussian around some genespecific mean, and the gene’s expression was considered multivariate Gaussian around a clusterspecific mean. This paper presents a similar but more powerful idea: we use a hierarchy of GPs to model genespecific and replicatespecific temporal covariance. We demonstrate that the introduction of a GP prior makes inference of clusters more viable by reducing the number of parameters required to model the data within a cluster, and we also provide a method for dramatically reducing the computational cost of evaluating clusters under our model. Previous methods for clustering temporal data (e.g. [18]) have not used the replicate structure in the model.
Methods
Background: Gaussian processes
Gaussian processes (GPs) have been used extensively in a variety of regression problems, and have been applied to gene expression timeseries by several authors [79]. We briefly introduce GP regression and introduce some notation: for an indepth introduction one may consult [10].
To perform regression using GPs, we adopt a Bayesian approach. Starting with a prior directly over functions, we update the distribution in light of observed data, moving to a posterior distribution. Using standard results for Gaussian distributions, regression involves only some simple linear algebra. The GP prior is fully specified by two functions, a mean function m(t) and a covariance function k(t,t^{′}), and is denoted
For practicality, a zeromean function is often assumed; throughout this work, the squaredexponential covariance function will be used: k(t,t^{′})=α exp{−γ(t−t^{′})^{2}}.
Our choice of covariance functions represents a prior belief that the underlying functions are smooth. Other covariance functions could be selected using a modelselection procedure [10]. The parameters of the covariance function (referred to as hyperparameters) control the amplitude (α) and relative lengthscale (γ) of the functions (see Figure 1 for an illustration). The form of the covariance function captures a very simple assumption about the function: that function points which are close to each other (t−t^{′} is small) are highly correlated, whilst points which are distant(t−t^{′} is large) are less correlated.
Figure 1. An illustration of a simple hierarchical GP. Top: the prior over the underlying function g_{n}(t), with the mean μ(t)=0 shown as a heavy solid line, and the shaded area representing the variance (amplitude) of the function, controlled by the hyperparameter α_{g}. A single sample from the prior is shown as a narrow line, and the lengthscale of the function, inversely controlled by the hyperparameter γ_{g}, is marked. Middle: three functions, representing three replicates are shown, along with samples conditioned on the sample shown in g_{n}(t). The three replicates follow the trend of g_{n}(t), but deviate independently by a small amount (variance α_{f}) with a short lengthscale, marked in the third replicate. Bottom: the covariance matrix used to generate the samples Σ_{n}. Note the blockwise relationship to the replicates.
Regression can be performed by using the marginal and conditional properties of multivariate Gaussian distributions. Supposing we have observations f of a function at times t, and wish to predict the values of that function at times t_{⋆}, which we denote f_{⋆}: the joint probability of f and f_{⋆} is given by
where the covariance matrix K_{t,t} has elements derived from the covariance function k(t,t^{′}), such that the (i,j)^{th} element of K_{t,t} is given by k(t[ i],t[ j]). Consistency of the GP means that it is not necessary to consider the values of the function where we do not have data: these values are trivially marginalised. To perform regression, the conditional property of the multivariate Gaussian gives:
In practice, we are presented with a measurement vector y which is a noise corrupted version of f. Assuming Gaussian noise^{a} it is possible to write , where β is the variance of the noise and I the appropriately sized identity matrix, and then marginalise the variable f. Equivalently, one can consider y to be observations of the Gaussian process y(t), whose mean function is the Gaussian process f(t), and covariance function is . This hierarchical structure is used later in this publication to build GP priors over replicates and clusters. Either interpretation gives a joint density:
and regression follows from the conditional property similarly to (3).
Gaussian process regression is a Bayesian method. We move from a prior over functions to a posterior, and a significant attraction of the method is that this occurs in closed form as (3). However we must still deal with hyperparameters of the covariance function. Here, we make use of the usual technique which is to optimise the hyperparameters using typeII maximum likelihood. That is, collecting the hyperparameters α,β and γ in to a vector θ, we use gradient methods to optimise p(yθ) with respect to θ. This is given by
which depends on θ through the covariance matrix K_{t,t}.
A hierarchy across replicates
Gene expression timeseries may be collected in multiple replicates, to account for biological variation. The idea is that there exists some common trend, present in all replicates, which we wish to identify, and the measurements made of each replicate vary due to biological differences as well as technical noise.
We shall use the notation y_{nr} to denote the vector of measurements of gene expression of the n^{th} gene, in the r^{th} biological replicate; these measurements were made at times which we collect into a vector t_{nr}. The data for the n^{th} gene is denoted , .
Our proposed methodology mimics the structure of the data, directly modelling underlying timeseries as well as the biological variation, and accounting for (uncorrelated) measurement noise. First consider a timeseries model of a single gene. To combine replicates of a particular gene’s timeseries, we use a Bayesian hierarchical approach: the underlying expression profile of the n^{th} gene g_{n}(t) is presumed to be drawn from a zeromean GP with covariance k_{g}(t,t^{′}), whilst the expression profile of a particular replicate f_{nr}(t) is drawn from a GP whose mean is g_{n}(t). Thus
Note that the two covariance functions k_{g} and k_{f} may in general be different: we have used the squared exponential function for both, with independent parameters. This simple model is illustrated in Figure 1, where the dependent nature of the functions is illustrated, as well as the effects of the hyperparameters.
The elegance of the hierarchical approach lies in its linearity: it is simple to show that two points on the function f_{nr}(t) are jointly Gaussian distributed with zero mean and covariance k_{g}(t,t^{′})+k_{f}(t,t^{′}). Furthermore, two points in separate replicates are jointly distributed with covariance k_{g}(t,t^{′}). Thus, given a set of N_{n} replicates of gene expression timeseries for a particular gene, , taken at different time points it is possible to write the likelihood:
where has been used to denote the concatenation of Y_{n}, , θ represents the parameters of the covariance functions k_{g} and k_{f}, and the block of Σ_{n} corresponding to is given by
In order to make inferences about the functions g_{n}(t) and f_{nr}(t), the covariances between the data Y and the functions are required. Using the superscripted ynr(t) to denote the element of y_{nr} observed at time t:
Inferences about functions can then be made using the standard methods described above, and hyperparameters of the covariance functions can be optimised.
Fitting a hierarchical model to a set of replicates can be used as a diagnostic tool. In particular, by examining the maximumlikelihood values of the covariance function parameters, we can assess how noisy the experiment is, and how similar the replicates are. Figure 2 shows three examples of hierarchical regression of the time course data described the Results section, for three genes (modelled independently), with each gene shown in one row in the figure. The leftmost panes in the figure show the inferred function for the gene, g_{n}(t), and subsequent panes show the inferred functions for each replicate, f_{nr}(t).
Figure 2. Hierarchical GP regression on the expression of three illustrative genes during Drosophila Melanogaster development, using eight replicates with different time sampling. Each row represents one gene. The leftmost panes show the inferred function g_{n}(t), subsequent panes represent the replicates. Solid lines show the mean of the predicted function, and the shaded area represents the 95% confidence interval. The parameters of the covariance function were optimised by maximum likelihood. Note that the replicates contain different numbers of data and different times.
These examples demonstrate different behaviours of the timeseries which are captured by the model. For the first gene, CG18135, the replicates are quite similar, and most deviation of the data from g_{n}(t) is attributed to noise. The model attributes 87% of the data’s variation to the underlying function g_{n}(t), and only 6% each to replicate variance and noise, i.e. the model ‘recognised’ the similarity of the replicates.
The gene represented by the middle row is AP47, and it can be seen that there is considerable replicate variance: although each replicate follows a similar pattern, the pattern is ‘amplified’ differently in the replicates. Here, the model attributes 60% of the data’s variance to the function g_{n}(t) and 34% to f_{nr}(t), with 6% to noise.
The gene represented by the bottom row of Figure 2 is OstStt3. Here, the variances of g_{n}(t), f_{nr}(t) and noise are 55%, 36% and 8% respectively. The model recognises the differences in the replicates, but uses a long length scale for f_{nr}(t). In this gene, the detailed pattern of the timeseries is captured entirely by g_{n}(t), and f_{nr}(t) is used to account for amplitude shifts between replicates. Note that these cannot be simply ‘normalised out’ because not all replicates cover the same temporal region. These genes were selected using a simple filtering procedure. The model was fitted independently to each gene on a microarray, and the genes were ranked according to the ratio of signal variance (a hyperparameter of k_{g}) and replicateplusnoise variance (hyperparameters from k_{f}).
Deeper hierarchies
In many cases, gene expression timeseries may have more structure than simply biological replicates. For example, we could incorporate previous studies in a hierarchical fashion. In general, suppose that there is some underlying function g_{n}(t) which models the general gene expression activity for the n^{th} gene. Subsequently, we define the functions e_{ni}(t) for each experiment which we want to model, and finally f_{nir}(t) for the r ^{th} replicate in the i ^{th} experiment.
With every layer of the hierarchy, we have introduced new parameters corresponding to the covariance function for that layer. Note that the hierarchy can be extended arbitrarily to represent the structure of the data. For example, we might want to model biological variation where the lineage is known, or to model interspecies variation, or to build a hierarchy which reflects the phylogenetic relationship between species.
An efficient model of clusters
Clustering of gene expression timeseries is often performed with a view to finding groups of coregulated or associated genes. The central assumption is that genes which are involved in the same biological processes will be expressed together: they share an underlying timeseries.
In order to model a group of genes as defined by a cluster, the hierarchical model is extended to a threelayer hierarchy across the cluster, individual genes and replicates.
All genes in the i^{th} cluster are presumed to share an underlying profile h_{i}(t), and subsequently each gene follows a profile g_{n}(t) and each replicate of that gene follows a profile f_{nr}(t). The mean of each level in the hierarchy is given by the level above, so the data Y_{i} in cluster i is modelled by:
If is the concatenation of all of the representing genes in the i^{th} cluster, noting that each is itself a concatenation of the biological replicates, then the marginal likelihood of the expression data in the i^{th} cluster, Y_{i} is given by
where the covariance matrix Σ_{i} is structured such that the block corresponding to the two genes n and n^{′} is given by
Note that the diagonal blocks of Σ_{i} are themselves blockstructured, reflecting the double hierarchy in the model.
The computational complexity of this model grows cubically as the size of the cluster increases, which is an undesirable property. To reduce the computational load, it is possible to exploit a known property of the data. In each array all genes are simultaneously measured, although we allow different times for each replicate. Denote the concatenation of the times in all replicates, define K_{h} as the covariance matrix formed by evaluating k_{h} on the grid of , and Σ_{n} a covariance matrix structured as (8), modeling the variance of a single gene. The marginal likelihood can then be written
where is the mean of the in the cluster, D is the length of , and N_{i} is the number of genes in the cluster (see appendix for a derivation). This expression has reduced the computational complexity of the model from to .
An example of this model is shown in Figure 3. The inferred function h(t), shown in the bottomleft pane has a single wide peak at around 15 hours; all of the functions g_{n}(t) (leftmost column) show a similar pattern, though the functions are each ‘distorted’ a little, with the width of the peak varying from gene to gene. Similarly, each replicate shows a similar pattern to the mean function for the corresponding gene, with smaller variations. The bottom row shows the predictive density for a new gene within the cluster.
Figure 3. A hierarchical model of expression across multiple genes within a cluster. Each row represents one gene (gene names are to the right) and each box within that row represents a biological replicate. Data are represented as black points, the shaded area represents 95% confidence interval and a solid line represents a posterior mean function. The leftmost box on each row shows the inferred function for each gene g_{n}(t), and the bottom row shows the mean function for the cluster h(t) (left) and the predicted function(s) for a hypothetical new gene.
Clustering
To use our model for clustering, the partitioning of genes into clusters needs to be inferred. Dunson [18] proposed a clustering scheme where a GP is used to model the function within a cluster, and a Dirichlet process prior is placed on the partitioning. This leads to a Gibbs sampling scheme where each Gibbs step involves removing a gene from the clustering and then stochastically reallocating it. Our model potentially improves on Dunson’s formulation since we consider a structured covariance across the genes and replicates (which was treated as iid noise by Dunson), though it would be possible to use the same Gibbs sampling scheme to infer the cluster partitions.
Heller and Ghahramani [19] showed that inference in a DP can be effectively approximated using an agglomerative clustering scheme dubbed Bayesian Hierarchical Clustering (BHC)^{b}. Cooke et al. [20] applied this hierarchical clustering scheme to gene expression timeseries with a GP prior, and we extend their work using a hierarchically structured GP to model the clusters, as well as the efficient computation of the marginal likelihood as per equation (15).
The algorithm is depicted in Algorithm 1, and works in a ‘bottomup’ fashion. Starting with each gene in an individual cluster, the clusters are merged by greedily selecting the merge which maximises the log marginal likelihood of the data (by summing the log marginal likelihood over all clusters). Once no more merges are available to improve the overall marginal likelihood, the hyperparameters are optimised, and the procedure is repeated with the new covariance function in an EM fashion.
Results and discussion
In a recent study of Drosophila development [21], gene expression was measured in eight replicates measured across six species at differing timepoints. 3695 genes (with orthologs across the species) were investigated using Agilent microarrays. Here we focus on Melanogaster development: time courses for typical genes are shown in Figure 2 and 3, with eight replicates at up to thirteen timepoints. Note in particular that no replicate contains all the timepoints: some replicates cover only the last few points, whilst some have broader coverage.
For all the missing data experiments, the covariance function hyperparameters were set to the maximum likelihood value using gradient basednumerical optimisation. Whilst we show that the hierarchical GP has better performance than the GP in all cases, it does not require any extra computation. All experiments took only a few minutes on a desktop PC.
Missing data imputation
The imputation of missing data is a straightforward method for validation of our model. In this Section, we remove data systematically, effectively removing entire microarrays from the experiment and predicting what was on them. Most missing data imputation methods cannot handle this type of missing data, highlighting an advantage of our method. This experiment also validates our assertion that it is important to include the replicate structure in modelling microarray timeseries, and that simply averaging the data on a timepoint basis is not satisfactory.
Algorithm 1 Cluster replicated gene expression timeseries
Whilst systematically missing data are not common in the laboratory, this test does examine the performance of our model in some potentially interesting applications. For example, we may wish to predict the future gene expression levels of a patient given the time series observed in other patients.
The results of imputing missing data are compared with the simple but oftused technique of averaging the replicates, using both the mean and median of the nonmissing replicates. The method is also compared with a simple GP model which does not account for replicate structure. We investigated the effectiveness of our algorithm using varying amounts of missing data, removing 1, 5, 10 and then 20 of the 56 microarrays at random. Each experiment was repeated 10 times with different randomisations; for each we computed the RMSE (root mean square error) averaged over all missing arrays and over all genes. The mean RMSE and two standard deviations as measured over the randomisations are shown in Table 1.
Table 1. RMSE for missing data imputation for differing numbers of randomly removed arrays
Table 1 shows that the hierarchical GP performs better at imputing the missing data in all examined cases. Although the Table shows only the average over randomisations, the HGP algorithm gave the lowest RMSE for every randomisation that we tried. The standard deviations in the Table generally decrease as the number of missing points increases. This reflects the degree to which the missing data imputation depends on which timepoints are missing, which may be due to the different temporal sampling schemes employed in the different replicates.
We note from Table 1 that our contribution of adding replicate structure to the GP methodology makes a significant difference to the results, since the standard GP offers only modest improvement over the simple averaging methods. We also note that the averaging methods are only possible where timepoints are duplicated between replicates, a restriction which the (H)GP methodology removes.
Randomly missing data imputation
Our proposed model is novel in the sense that it can impute entire missing arrays, as above. Most missing data algorithms assume randomly missing data and use correlation between genes for imputation. To compare our algorithm with those from the literature, we randomly removed 100 values from the Melanogaster dataset, and measured the error on imputation. For comparison, we also used two popular methods, Knearestneighbour (KNN) [22] and Bayesian principal component analysis (PCA) [23].
Gene expression experiments usually contain many types of effect aside from the one under study. In this case, the data includes crosssectional effects which arise from arrayspecific and samplespecific causes, and are not due to the underlying timeseries. These are treated as noise by our model, whereas PCA and KNN make no distinction between longitudinal and crosssectional variance and will freely impute these effects. This is illustrated in Figure 4, where crosssectional effects mean that the missing datum’s true value lies below that seen by averaging the replicates, or imputed by HGP. The HGP and KNN methods, being sensitive to these effects impute the true value well, despite it being inconsistent across replicates.
Figure 4. An example of randomly missing data imputation. Here, the leftmost pane shows the inferred mean function for the gene g_{n}(t), and the next three panes show three replications, in the last of which the missing data imputation occurs. Other replications are omitted for brevity. The final panel shows a zoom of the imputed region (boxed in the previous panel). Imputation by the HGP is marked by a circle on the mean function; replicate median is denoted by a square; PCA is denoted by a diamond and KNN by a triangle. The HGP imputation is closer to the replicable statistic, being less affected by crosssectional noise effects than the other methods.
To test the methods’ abilities to impute the replicable part of the signal, we tested the imputed values of the three methods against the median value for the missing timepoint, averaged across replicates. We measured the RMSE over the 100 imputations, and repeated the experiment 10 times with different randomisations. The mean RMSE (over randomisations) and two standard deviations are shown in the first column of Table 2.
Table 2. RMSE of missing data imputation for hierarchical Gaussian processes (HGP) principal component analysis (PCA) and Knearest neighbour (KNN)
Another way to investigate the ability of the model to deal with missing data is to examine the difference between the model as inferred with some missing points to that inferred with all the data: the results of doing so are shown in the second column of Table 2. Whilst this method may not give a completely fair reflection of the performance of the PCA and KNN methods, the small size of the errors on imputation imply that our model is relatively insensitive to missing data: because the model can borrow statistical strength from other replicates, small amounts of data missing at random make little difference to the model.
Data fusion
The data under investigation are sampled at twohour intervals. To improve our knowledge of the system, it is possible to perform data fusion with existing data sets. We demonstrate this with two previous studies: [4,24], which offer more tightly temporallyspaced data, but with fewer replications (three and one respectively).
We constructed a hierarchical model across the three experiments, and across replicates within the experiments. The data were considered on a genebygene basis, and the model was optimised for each gene. An example gene (Acer) is shown in Figure 5. The figure shows the inferred function for each replicate in each experiment, as well as the inferred mean function for each experiment (first column) and the inferred ‘toplevel’ function (inset) which underlies all the experiments.
Figure 5. Hierarchical GP regression on across three data sets, for the gene Acer. Each data set is represented by one row, and each replicate within a data set is represented by a single pane in that row. Shaded regions represent 95% confidence intervals. Inset: the fused timeseries. (Yscales removed for clarity but are consistent between plots).
Clustering
In order to investigate the usefulness of our model in a clustering task, we first selected 300 dynamically differentially expressed genes using a method similar to [7].
We computed the marginal likelihood using our hierarchical model and a simpler GP model without replicatespecific or genespecific variance. This model, simply fitting a GP to the lumped data is similar to the method proposed in [20], which represents the current stateoftheart. Cooke et al[20] compared their method to several other algorithms and concluded that the GP method allowed for the discovery of clusters in a more effective manner than nontemporal models. Here we show that this method can be improved further by considering the genewise and replicatewise intracluster structure of the data’s variance. For both models, we applied the EM algorithm with several restarts (varying the covariance function hyperparameters on each restart), selecting the solution with the highest loglikelihood. We used our own implementation for Cooke et al.’s method to ensure that results were comparable: i.e. that improvements in the method were due to the HGP structure, rather than specifics of the implementation. We are primarily concerned with the improvements that the HGP model can offer in explaining cluster variance, and this allows for a direct comparison.
For further comparison, we used the R package Mclust [25]. Mclust fits a range of Gaussian models of increasing complexity; in the first instance, we simply concatenated the replicates and Mclust struggled to fit some models in the 56 dimensional space. Subsequently, we provided Mclust with shorter vectors formed by averaging the replicates at each time point, which gave similar results. We include the validation for both methods. In both cases, we used all available covariance structures for Mclust, and let the package pick the best using its BIC (Bayesian Information Criterion) approach.
In order to validate the different clusterings, we use the biological homogeneity index (BHI) [26], which assesses the number of genes with common function within each cluster, assigning the entire clustering a score from 0 to 1 with larger values corresponding to more biologically homogeneous clusters. For biological annotation, we used the three gene ontology (GO) categories (data obtained using biomaRt [27]), which correspond to molecular function (MF), biological process (BP) and cellular component (CC). Computation of the BHI is then straightforward: it is the proportion of withincluster pairs that share at least one biological annotation. The BHIs and loglikelihoods for all the experiments are shown in Table 3. We note that other comparisons between the clusters and the Gene Ontology may be possible, for example the GO term overlap score [28,29], but we use the BHI here for ease of interpretability.
Table 3. Results of clustering 300 genes using the four proposed algorithms
From the Table, it can be seen that HGP method improves the biological homogeneity for all three GO categories. By directly comparing with the standard GP method, we have demonstrated that the improvement in clustering performance is not due simply to the clustering methodology or the GP correlations which give the GP method a small improvement over Mclust, but the hierarchical structure of intracluster variance which allows genes and replicates to differ in a temporallycorrelated fashion.
Conclusions
We have presented a method based on hierarchically structured GPs, which are a practical and flexible framework for modelling replicated timeseries. The framework has a wide range of applications, and can be extended for various data structures besides biological replications.
The method performed well in several tasks, including missing data imputation and clustering. We have shown that the method performs particularly well in missing data imputation, and that small amounts of missing randomly data have only a minor effect of the model. Biological validation through the BHI confirms the importance of modelling intracluster variance in a hierarchical fashion.
Above we showed how fitting the simplest of our proposed models can lead to a quantitative assessment of how biological replications are behaving, as well as illustrating how our method deals with different types of replicate variation. Of course, if the replicate variance is truly independent – e.g. if only technical variation is present – then we recover standard GP regression. In this case the hierarchical approach requires the inclusion of an extra parameter, but we find that the additional computational complexity is negligible.
A problem with standard GP regression is that the computational complexity grows cubically with the number of data. We have presented a method which exploits the necessary condition that all genes in a cluster are measured on the same timepoints in order to significantly reduce the computational complexity and make our clustering algorithm applicable to large data sets. We note that the complexity of the clustering algorithm scales poorly with the number of genes: the initial step of the algorithm must compare the likelihood of merging of each pair. To address this, randomised versions of the same algorithm can be adopted [30,31], and our hierarchical model and its efficient computation as (15) could be used with no modification.
Whilst we have demonstrated that our model is useful in several applications, we envisage a number of extensions. For example, our model could be used for data fusion of microarray data with highthroughput sequencing data. Or, the hierarchical structure could be used in models of pathway activity [32], which may include prior information about gene groupings from Gene Ontology.
Although we have only used simple GP models within our hierarchical structure, the idea can be applied to more complex GP models, such as those proposed to model gene interactions [9,33].
Endnotes
^{a} Other noise distributions are possible, but break the conjugacy of the model and thus complicate inference, see [10].
^{b} Note that hierarchical in this sense means a hierarchical partitioning of the genes, distinct from our Bayesian hierarchical model applied within the cluster.
Appendix
Efficient computation of a cluster likelihood
The expression for the marginal likelihood of a cluster of genes as given in equation (15) can be derived by considering the values of the underlying function h(t) at the time points t, which we denote h. The model (for a single cluster) can be written:
This consists of a prior for the latent variable h and a likelihood for the data associated with each gene in the cluster, conditioned on the latent variable. The objective here is to marginalise (integrateout) the latent variable to achieve a tractable expression. Expanding equation (16),
Some rearrangement and completing the square inside the exponent leads to
where we have defined for brevity and . The first and third lines of this expression can be moved outside the integral, and we recognise the Gaussian nature of . Substituting this and the expressions for and Λ back into (18) leads to the expression given in (15).
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
JH designed the studies, implemented the algorithms in python and drafted the manuscript. MR and NDL assisted in design and analysis of the experiments. All authors read and approved the final manuscript.
Acknowledgements
This work was funded by EU FP7KBBE Project Ref 289434, EraSysBio+ project SYNERGY (BBSRC award BB/I004769/2) and EU FP7 RADIANT (award no. 305626).
References

Spellman P, Sherlock G, Zhang M, Iyer V, Anders K, Eisen M, Brown P, Botstein D, Futcher B: Comprehensive identification of cell cycleregulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization.
Mol Biol Cell 1998, 9(12):3273. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Barenco M, Tomescu D, Brewer D, Callard R, Stark J, Hubank M: Ranked prediction of p53 targets using hidden variable dynamic modeling.
Genome Biol 2006, 7(3):R25. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Straume M: DNA microarray time series analysis: automated statistical assessment of circadian rhythms in gene expression patterning.
Methods Enzymol 2004, 383:149166. PubMed Abstract  Publisher Full Text

Tomancak P, Beaton A, Weiszmann R, Kwan E, Shu S, Lewis S, Richards S, Ashburner M, Hartenstein V, Celniker S, et al.: Systematic determination of patterns of gene expression during Drosophila embryogenesis.

Sanguinetti G, Lawrence N, Rattray M: Probabilistic inference of transcription factor concentrations and genespecific regulatory activities.
Bioinformatics 2006, 22(22):2775. PubMed Abstract  Publisher Full Text

Beal M, Falciani F, Ghahramani Z, Rangel C, Wild D: A Bayesian approach to reconstructing genetic regulatory networks with hidden factors.
Bioinformatics 2005, 21(3):349. PubMed Abstract  Publisher Full Text

Kalaitzis A, Lawrence N: A simple approach to ranking differentially expressed gene expression time courses through Gaussian process regression.
BMC Bioinformatics 2011, 12:180. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Gao P, Honkela A, Rattray M, Lawrence N: Gaussian process modelling of latent chemical species: applications to inferring transcription factor activities.
Bioinformatics 2008, 24(16):i70i75. PubMed Abstract  Publisher Full Text

Honkela A, Girardot C, Gustafson E, Liu Y, Furlong E, Lawrence N, Rattray M: Modelbased method for transcription factor target identification with limited data.
Proc Natl Acad Sci 2010, 107(17):7793. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Rasmussen C, Williams C: Gaussian Processes for Machine Learning. Cambridge, Massachusetts and London, England: MIT press; 2006.

Ernst J, Nau G, BarJoseph Z: Clustering short time series gene expression data.
Bioinformatics 2005, 21(suppl 1):i159. PubMed Abstract  Publisher Full Text

Gelman A, Carlin JB, Stern HS, Rubin DB: Bayesian Data Analysis. Boca Raton: CRC press; 2004.

Stegle O, Denby K, Cooke E, Wild D, Ghahramani Z, Borgwardt K: A robust Bayesian twosample test for detecting intervals of differential gene expression in microarray time series.
J Comput Biol 2010, 17(3):355367. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Liu Q, Lin K, Andersen B, Smyth P, Ihler A: Estimating replicate time shifts using Gaussian process regression.
Bioinformatics 2010, 26(6):770776. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Ng S, McLachlan G, Wang K, Jones LBT, Ng SW: A mixture model with randomeffects components for clustering correlated geneexpression profiles.
Bioinformatics 2006, 22(14):17451752. PubMed Abstract  Publisher Full Text

Medvedovic M, Yeung K, Bumgarner R: Bayesian mixture model based clustering of replicated microarray data.
Bioinformatics 2004, 20(8):1222. PubMed Abstract  Publisher Full Text

Lin K, Chudova D, Hatfield G, Smyth P, Andersen B: Identification of hair cycleassociated genes from timecourse gene expression profile data by using replicate variance.
Proc Natl Acad Sci USA 2004, 101(45):15955. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Dunson D: Nonparametric Bayes applications to biostatistics. In Bayesian Nonparametrics. Edited by Hjort L, Holmes C, Muller P, Walker S. Cambridge: Cambridge University Press; 2010.

Heller K, Ghahramani Z: Bayesian hierarchical clustering. In Proceedings of the 22nd International Conference on Machine Learning. ACM press; 2005:297304.

Cooke E, Savage R, Kirk P, Darkins R, Wild D: Bayesian hierarchical clustering for microarray time series data with replicates and outlier measurements.
BMC Bioinformatics 2011, 12:399. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Kalinka A, Varga K, Gerrard D, Preibisch S, Corcoran D, Jarrells J, Ohler U, Bergman C, Tomancak P: Gene expression divergence recapitulates the developmental hourglass model.
Nature 2010, 468(7325):811814. PubMed Abstract  Publisher Full Text

Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R, Botstein D, Altman R: Missing value estimation methods for DNA microarrays.
Bioinformatics 2001, 17(6):520. PubMed Abstract  Publisher Full Text

Oba S, Sato M, Takemasa I, Monden M, Matsubara K, Ishii S: A Bayesian missing value estimation method for gene expression profile data.
Bioinformatics 2003, 19(16):2088. PubMed Abstract  Publisher Full Text

Hooper S, Boué S, Krause R, Jensen L, Mason C, Ghanim M, White K, Furlong E, Bork P: Identification of tightly regulated groups of genes during Drosophila melanogaster embryogenesis.
Mol Syst Biol 2007, 3:72. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Fraley C, Raftery AE: MCLUST: Software for modelbased cluster analysis.
J Classif 1999, 16(2):297306. Publisher Full Text

Brock G, Pihur V, Datta S, Datta S: clValid: An R package for cluster validation.

Durinck S, Moreau Y, Kasprzyk A, Davis S, De Moor B, Brazma A, Huber W: BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis.
Bioinformatics 2005, 21(16):34393440. PubMed Abstract  Publisher Full Text

Mistry M, Pavlidis P: Gene Ontology term overlap as a measure of gene functional similarity.
BMC bioinformatics 2008, 9:327. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Kirk P, Griffin JE, Savage RS, Ghahramani Z, Wild DL: Bayesian correlated clustering to integrate multiple datasets.
Bioinformatics 2012, 28(24):32903297. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Heller K, Ghahramani Z: Randomized algorithms for fast Bayesian hierarchical clustering.
PASCAL Statistics and Optimization of Clustering Workshop 2005.

Darkins R, Cooke EJ, Ghahramani Z, Kirk PD, Wild DL, Savage RS: Accelerating Bayesian hierarchical clustering of time series data with a randomised algorithm.
PloS one 2013, 8(4):e59795. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Shi Y, Klustein M, Simon I, Mitchell T, BarJoseph Z: Continuous hidden process model for time series expression experiments.
Bioinformatics 2007, 23(13):i459i467. PubMed Abstract  Publisher Full Text

Lawrence N, Girolami M, Sanguinetti G, Rattray M: Learning and Inference in Computational Systems Biology. Cambridge: MIT press; 2010.