The Sheffield Institute for Translational Neuroscience, 385A Glossop Road, Sheffield, S10 2HQ, UK

Abstract

Background

The analysis of gene expression from time series underpins many biological studies. Two basic forms of analysis recur for data of this type: removing inactive (quiet) genes from the study and determining which genes are differentially expressed. Often these analysis stages are applied disregarding the fact that the data is drawn from a time series. In this paper we propose a simple model for accounting for the underlying temporal nature of the data based on a Gaussian process.

Results

We review Gaussian process (GP) regression for estimating the continuous trajectories underlying gene expression time-series. We present a simple approach which can be used to filter quiet genes or, for the case of time series in the form of expression ratios, to quantify differential expression. We assess via ROC curves the rankings produced by our regression framework and compare them to a recently proposed hierarchical Bayesian model for the analysis of gene expression time-series (BATS). We compare the methods on both simulated and experimental data, showing that the proposed approach considerably outperforms the current state of the art.

Conclusions

Gaussian processes offer an attractive trade-off between efficiency and usability for the analysis of microarray time series. The Gaussian process framework offers a natural way of handling biological replicates and missing values and provides confidence intervals along the estimated curves of gene expression. Therefore, we believe Gaussian processes should be a standard tool in the analysis of gene expression time series.

Background

Gene expression profiles give a snapshot of mRNA concentration levels as encoded by the genes of an organism under given experimental conditions. Early studies of this data often focused on a single point in time, which biologists assumed to be critical in the gene regulation process following the perturbation.

With the decreasing cost of gene expression microarrays, time series experiments have become commonplace, giving a far broader picture of the gene regulation process. Such time series are often irregularly sampled and may involve differing numbers of replicates at each time point.

Primary analysis of gene expression profiles is often dominated by methods targeted at

The analysis of gene expression microarray time-series has been a stepping stone to important problems in systems biology such as the genome-wide identification of direct targets of transcription factors

Testing for Expression

A primary stage of analysis is to characterize the activity of each gene in an experiment. Removing inactive or

Temporal information removed from the profile of gene Cyp1b1 in the experimental mouse data

**Temporal information removed from the profile of gene Cyp1b1 in the experimental mouse data**. **(a)** The centred profile of the gene. **(b)** The same profile with its timepoints randomised.

Failure to capture the signal in a profile, irrespective of the amount of embedded noise, may be partially due to

A recent significant contribution to the estimation and ranking of differential expression in time-series is the hierarchical Bayesian model BATS (cf. **Simulated data**).

Gene Expression Analysis with Gaussian Processes

In the context of expression trajectory estimation, a Gaussian process coupled with the

In a different context, Gaussian process priors have been used for modeling transcriptional regulation. For example in

Further details are given in the **Methods** section.

Results and Discussion

We apply standard Gaussian process (GP) regression and the Bayesian hierarchical model for the analysis of time-series (BATS) to two in-silico datasets, simulated by BATS and by GPs respectively, and to one experimental dataset coming from a study on primary mouse keratinocytes with an induced activation of the TRP63 transcription factor, for which a reverse-engineering algorithm (TSNI: time-series network identification) was developed to infer the direct targets of TRP63.

We assume that each gene expression profile can be categorized as either quiet or differentially expressed. We consider algorithms that provide a rank ordering of the profiles according to how likely each is to be non-quiet (or differentially expressed). Given ground truth, we can then evaluate the quality of such a ranking and compare different algorithms. We make use of receiver operating characteristic (ROC) curves for this assessment.

From the output of each model a ranking of differential expression is produced and assessed with ROC curves, to quantify how well each method performs with respect to each of the three ground truths (BATS-sampled, GP-sampled, TSNI-experimental). The BATS model can employ three different noise models, where the marginal distribution of the error is assumed to be either Gaussian, Student's-t, or double exponential.
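The ROC assessment of a ranking is straightforward to reproduce. The sketch below is our own illustration (not the software used in the paper): it computes the area under the ROC curve directly from a ranking via the rank-sum statistic, assuming untied scores.

```python
import numpy as np

def roc_auc(scores, labels):
    """AUC via the rank-sum (Mann-Whitney) statistic; assumes untied scores.
    `scores`: higher = more likely differentially expressed; `labels`: 1/0."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)   # rank 1 = lowest score
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# A perfect ranking gives AUC 1, a fully reversed one gives 0.
assert roc_auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]) == 1.0
assert roc_auc([0.1, 0.2, 0.8, 0.9], [1, 1, 0, 0]) == 0.0
```

The rank-sum form avoids explicitly sweeping thresholds: it equals the probability that a randomly chosen positive profile outranks a randomly chosen negative one.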

Simulated data

The first set of in-silico profiles is simulated by the BATS software.

BATS simulation

We reproduce one instantiation of the simulations performed in

The other 7400 non-differentially expressed profiles (labeled as "0" in the ground truth) are essentially zero functions with additive i.i.d. noise. The three simulated datasets are each corrupted with a different kind of i.i.d. noise: Gaussian N(0, σ²), Student's-t, and double exponential, respectively.

GP vs. BATS on simulated data

**GP vs. BATS on simulated data**. ROC curves for the GP and BATS methods on data simulated by BATS induced with **(a)** Gaussian noise, **(b)** Student's-t noise, **(c)** double exponential noise, and on **(d)** data simulated by Gaussian processes. Each panel depicts one ROC curve for the GP method and three for BATS, each using a different noise model indicated by the subscript in the legend ("G" for Gaussian, "T" for Student's-t, "DE" for double exponential).

GP simulation

In a similar setup, the second in-silico dataset consists of 8000 profiles sampled from Gaussian processes, with the same number of replicates and time-points, among which 600 were set up as differentially expressed. To generate a differentially expressed profile, each of the RBF-hyperparameters (cf. **Methods**) is sampled from a separate Gamma distribution. The three Gamma distributions are fitted to sets of their corresponding hyperparameters, which are observed for the true positive profiles under a near-zero FPR during the first test on BATS-generated profiles. In this way, we attempt to resemble the behaviour of the BATS-sampled profiles. The table below lists the parameters of these Gamma distributions.

Parameters of the Gamma distributions for sampling the RBF-hyperparameters.

**Sampling Gamma distribution Γ(a, b)**

| Sampled RBF-hyperparameter | a (scale) | b (shape) |
| --- | --- | --- |
| ℓ² (characteristic lengthscale) | 1.4 | 5.7 |
| σ_{f}² (signal variance) | 2.76 | 0.2 |
| σ_{n}² (noise variance) | 23 | 0.008 |

These are the parameters of the Gamma distributions from which we sample the RBF-hyperparameters. For example, the characteristic lengthscale is sampled from a Gamma with scale 1.4 and shape 5.7. The hyperparameters are then used in the RBF covariance function to sample/simulate a profile from the Gaussian process.

The other 7400 non-differentially expressed profiles are simply zero functions with additive white Gaussian noise, with variance equal to the sum of two samples from the Gamma distributions for the signal and noise variance hyperparameters.

Experimental data

We apply the standard GP regression framework and BATS to an experimental dataset coming from a study on primary mouse keratinocytes with an induced activation of the TRP63 transcription factor (GEO accession number [GEOdataset:GSE10562]), where a reverse-engineering algorithm (TSNI: time-series network identification) was developed to infer the direct targets of TRP63.

(genome.cshlp.org/content/suppl/2008/05/05/gr.073601.107.DC1/DellaGatta_SupTable1.xls) and used here as a ground truth.

We label the top 100 positions of the TSNI ranking as "1" in the ground truth, as they are the most likely to be direct targets of the TRP63 transcription factor and because the binding scores are densest along these first positions of the ranking (cf. the figure below).

Distribution of binding scores along the TSNI ranking

**Distribution of binding scores along the TSNI ranking**. By inspection, the distribution of the binding scores is mostly dense along the first 100 positions of the TSNI ranking.

GP vs. BATS on experimental data

**GP vs. BATS on experimental data**. ROC curves for the GP and BATS methods on the experimental data. **(a)** Ground truth consists of 22690 labels, among which only the 786 profiles chosen to be ranked by TSNI (based on the area under their curves) are labeled as "1", cf. **Experimental data**. **(b)** Same number of labels; here only the top 100 profiles ranked by TSNI are labeled as "1".

Discussion

On BATS-sampled data, the results are summarised in the ROC figure above (cf. **Conclusions**). Furthermore, there is a modeling bias in the underlying functions of the simulated profiles, which have only a small, finite degree of differentiability (the maximum degree of the Legendre polynomials is 6). This puts the GP at a disadvantage, as it models smooth, infinitely differentiable functions when its covariance function is the squared exponential.

On GP-sampled data, panel (d) of the ROC figure compares the GP ranking against BATS_{G}, BATS_{T} and BATS_{DE}.

Conclusions

We presented an approach to estimating the continuous trajectory of gene expression time-series from microarray data through Gaussian process regression.

The ranking scheme presented here is reminiscent of the work in

Future work

A natural next step would be to add a robust noise mechanism in our framework. In this regard, fine examples can be found in

Methods

As we mentioned earlier, analysing time-course microarray data by means of Gaussian process (GP) regression is not a new idea (cf. **Background**). In this section we review the methodology for estimating the continuous trajectory of a gene expression profile by GP regression and subsequently describe a likelihood-ratio approach to ranking the differential expression of the profile. The following content is based on the key components of GP theory as described in

The Gaussian process model

The idea is to treat trajectory estimation given the observations (gene expression time-series) as an interpolation problem on functions of one dimension. By assuming the observations have Gaussian-distributed noise, the computations for prediction become tractable and involve only standard linear algebra.

A finite parametric model

We begin the derivation of the GP regression model by defining a standard linear model on a fixed feature expansion of the inputs, e.g. φ(x) = (1, x, x²)^⊤, i.e. a line mapped to a quadratic curve,

y_{n} = **w**^⊤φ(x_{n}) + ε_{n},

where gene expression measurements in time **y** = {y_{n}}_{n = 1..N} are contaminated with white Gaussian noise and the inputs (of time) are mapped to a feature space **Φ** = {φ(x_{n})^⊤}_{n = 1..N}. Furthermore, if we assume the noise to be i.i.d. (independently and identically distributed) as a Gaussian with zero mean and variance σ²,

ε_{n} ~ N(0, σ²),

then the probability density of the observations given the inputs and parameters (the likelihood) factorises as

p(**y**|**x**, **w**) = ∏_{n = 1..N} N(y_{n}|**w**^⊤φ(x_{n}), σ²),

where N(·|μ, σ²) denotes a Gaussian density with mean μ and variance σ².

Introducing Bayesian methodology

Now turning to the Bayesian treatment, we place a prior over the weights **w** by specifying a zero-mean, isotropic Gaussian distribution,

p(**w**) = N(**w**|**0**, α**I**).

By integrating the product of the likelihood and the prior over the weights, we obtain the marginal distribution of the observations,

p(**y**|**x**) = ∫ p(**y**|**x**, **w**) p(**w**) d**w**,

which is jointly Gaussian. Hence the marginal is fully specified by its mean and covariance. By computing these two moments we find

E[**y**] = **0**,  cov(**y**) = α**ΦΦ**^⊤ + σ²**I** ≡ **K**_{y}.   (7)

Notice in eq. (7) how the structure of the covariance implies that choosing a different feature space Φ results in a different covariance matrix **K**_{y}. For **K**_{y} to be a valid covariance matrix, two conditions must hold:

• **Kolmogorov consistency**, which is satisfied when the entries of **K** are generated by a covariance function, K_{ij} = k(x_{i}, x_{j}), such that every matrix **K** so constructed is positive semi-definite (**y**^⊤**Ky** ≥ 0 for any **y**).

• **Exchangeability**, which is satisfied when the data are i.i.d. It means that the order in which the data become available has no impact on the marginal distribution of the observations.
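As a concrete check of these requirements, the sketch below builds the marginal covariance of eq. (7) for the quadratic feature map used above and verifies symmetry and positive semi-definiteness. The values of α and σ² are ours, chosen only for illustration.

```python
import numpy as np

def feature_matrix(x):
    """Stack phi(x_n)^T = (1, x_n, x_n^2) row-wise into the design matrix Phi."""
    x = np.asarray(x, dtype=float)
    return np.stack([np.ones_like(x), x, x ** 2], axis=1)

def marginal_covariance(x, alpha=1.0, sigma2=0.1):
    """Covariance of y after integrating out w ~ N(0, alpha * I):
    K_y = alpha * Phi Phi^T + sigma^2 * I, as in eq. (7)."""
    Phi = feature_matrix(x)
    return alpha * Phi @ Phi.T + sigma2 * np.eye(len(x))

x = np.array([0.0, 0.5, 1.0, 2.0])
K_y = marginal_covariance(x)

# A valid covariance matrix is symmetric and positive semi-definite.
assert np.allclose(K_y, K_y.T)
assert np.all(np.linalg.eigvalsh(K_y) > 0.0)  # strictly PD thanks to the noise term
```

Note that without the σ²**I** term the matrix would have rank at most three (the number of features), anticipating the degeneracy discussed for the SE kernel below.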

Definition of a Gaussian process

More formally, a Gaussian process is a collection of random variables, any finite subset of which, f(x_{1}), f(x_{2}), ..., f(x_{n}), has a joint Gaussian distribution for any choice of inputs x_{1}, x_{2}, ..., x_{n}.

If we remove the noise contribution σ²**I** from **K**_{y}, we obtain the noise-free covariance **K**_{f} of the latent function values, so that **K**_{y} = **K**_{f} + σ²**I**. The Gaussian process prior over the function values **f** is then

p(**f**|**x**) = N(**f**|**0**, **K**_{f}),   (9)

where the entries of **K**_{f} are computed by the covariance function, (**K**_{f})_{ij} = k(x_{i}, x_{j}).

The squared-exponential kernel

In this paper we only use the univariate version of the squared-exponential (SE) kernel. But before embarking on its analysis, the reader should be aware of the existing wide variety of kernel families, and potential combinations of them. A comprehensive review of the literature on covariance functions is found in [21, chap. 4].

Derivation and interpretation of the SE kernel

In the GP definition section we mentioned the possibility of an infinite-dimensional feature space. With a finite feature expansion, **K**_{f} can have at most as many non-zero eigenvalues as the number of parameters in the model; hence, once the number of observations exceeds the number of features, the matrix is degenerate (non-invertible). Ensuring a non-degenerate covariance therefore requires infinitely many features. By considering a feature space defined by radial basis functions placed densely along the input axis, and letting their number tend to infinity, one ends up with a smooth (infinitely differentiable) function on an infinite-dimensional space of (radial basis function) features. Taking the constant out front as a signal variance hyperparameter, the resulting covariance function is the squared exponential,

k(x_{i}, x_{j}) = σ_{f}² exp(−(x_{i} − x_{j})²/(2ℓ²)).   (14)

The SE is a stationary covariance function: each entry K_{ij} depends only on the distance between the inputs x_{i} and x_{j}. The hyperparameter ℓ² is the characteristic lengthscale, which governs how far apart two inputs x_{i} and x_{j} must be for their function values to become effectively uncorrelated, and the signal variance σ_{f}² governs the amount of variability that the underlying signal is assumed to explain.
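A minimal sketch of the SE covariance of eq. (14); the argument names (lengthscale2 = ℓ², signal_var = σ_{f}²) are ours.

```python
import numpy as np

def se_kernel(x1, x2, lengthscale2=1.0, signal_var=1.0):
    """k(x_i, x_j) = sigma_f^2 * exp(-(x_i - x_j)^2 / (2 * l^2)), eq. (14)."""
    d2 = (np.asarray(x1, float)[:, None] - np.asarray(x2, float)[None, :]) ** 2
    return signal_var * np.exp(-0.5 * d2 / lengthscale2)

x = np.linspace(0.0, 10.0, 5)   # 0, 2.5, 5, 7.5, 10
K = se_kernel(x, x, lengthscale2=4.0, signal_var=2.0)

# Stationarity: the diagonal (zero distance) equals the signal variance,
# and the covariance decays with the distance between inputs.
assert np.allclose(np.diag(K), 2.0)
assert K[0, 1] > K[0, 4]
```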

Gaussian process fit on expression profile of gene Cyp1b1 in the experimental mouse data

**Gaussian process fit on expression profile of gene Cyp1b1 in the experimental mouse data**. Figure 5: A GP fitted on the centred profile of the gene Cyp1b1 for different settings of the lengthscale ℓ². The blue crosses represent zero-mean hybridised gene expression in time (log2 ratios between treatment and control) and the shaded area indicates the point-wise mean plus/minus two times the standard deviation (95% confidence region). **(a)** Mean function is constant as ℓ² → ∞ (0 inverse lengthscale in eq. (14)) and all of the observed data variance is attributed to noise. **(b)** The lengthscale is manually set to a local-optimum large value (ℓ² = 30) and thus the mean function roughly fits the data-points; the observed data variance is attributed to signal and noise in roughly equal parts. **(c)** The lengthscale is manually set to a local-optimum small value (ℓ² = 15.6) and thus the mean function tightly fits the data-points with high certainty; the interpretation from the covariance function in this case is that the profile contains a minimal amount of noise and most of the observed data variance is attributed to the underlying signal. **(d)** The contour of the corresponding LML function plotted by an exhaustive search of ℓ² and SNR values. The two main local optima are indicated by the green dots; a third optimum that corresponds to the first panel appears almost flat in the contour and its vicinity encompasses the whole lengthscale axis for very small values of SNR (i.e. given that SNR ≈ 0, the lengthscale is trivial).

One can also combine covariance functions, as long as the result remains a valid covariance function; sums and products of valid kernels, for example, are again valid kernels.

Gaussian process prediction

To interpolate the trajectory of gene expression at non-sampled time-points, as illustrated in Figure 5, we seek the function value f_{*} at a new input (non-sampled time-point) x_{*}, given the knowledge of function estimates **f** at known time-points **x**. The joint distribution p(f_{*}, **f**) is Gaussian, hence the conditional distribution p(f_{*}|**f**) is also Gaussian. In this section we consider predictions using noisy observations; the noise is Gaussian too, so conditioning on the noisy observations again yields a Gaussian. By Bayes' rule,

p(f_{*}|**y**) = p(f_{*}, **y**) / p(**y**),

where the Gaussian process prior over the noisy observations is

p(**y**) = N(**y**|**0**, **K**_{f} + σ²**I**).

Predictive equations for GP regression

We start by defining the vector **k**_{*} of covariances between the new time-point x_{*} and each of the N known time-points, whose n^{th} entry is k(x_{*}, x_{n}). For every new time-point, a new vector **k**_{*} is appended to the combined covariance matrix **K**_{C},

**K**_{C} = [ **K**_{y}  **k**_{*} ; **k**_{*}^⊤  k(x_{*}, x_{*}) ],

where the dimension of **K**_{C} is incremented with every new **k**_{*} added. Finally, to derive the predictive distribution, we condition the joint Gaussian on the noisy observations, obtaining the predictive mean and variance

E[f_{*}] = **k**_{*}^⊤ **K**_{y}^{−1} **y**,

V[f_{*}] = k(x_{*}, x_{*}) − **k**_{*}^⊤ **K**_{y}^{−1} **k**_{*},

where **K**_{y} = **K**_{f} + σ²**I** and **K**_{f} = k(**x**, **x**). These equations generalise easily to the prediction of function values at multiple new time-points by augmenting **k**_{*} with more columns and k(**x**_{*}, **x**_{*}) with more components, one for each new time-point in **x**_{*}.
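The predictive equations can be sketched as follows, assuming the SE kernel; the variable names and toy data are our own illustration, not the paper's code.

```python
import numpy as np

def se_kernel(x1, x2, l2=1.0, sf2=1.0):
    d2 = (np.asarray(x1, float)[:, None] - np.asarray(x2, float)[None, :]) ** 2
    return sf2 * np.exp(-0.5 * d2 / l2)

def gp_predict(x, y, x_star, l2=1.0, sf2=1.0, sn2=0.1):
    """E[f_*] = k_*^T K_y^{-1} y,   V[f_*] = k(x_*, x_*) - k_*^T K_y^{-1} k_*."""
    K_y = se_kernel(x, x, l2, sf2) + sn2 * np.eye(len(x))   # K_f + sigma^2 I
    k_star = se_kernel(x, x_star, l2, sf2)                  # N x M cross-covariances
    mean = k_star.T @ np.linalg.solve(K_y, np.asarray(y, float))
    var = np.diag(se_kernel(x_star, x_star, l2, sf2)) - \
          np.sum(k_star * np.linalg.solve(K_y, k_star), axis=0)
    return mean, var

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.sin(x)                                  # noise-free toy observations
mean, var = gp_predict(x, y, np.array([1.5]), l2=1.0, sf2=1.0, sn2=1e-4)

assert abs(mean[0] - np.sin(1.5)) < 0.3        # interpolates between samples
assert var[0] < 1.0                            # uncertainty below the prior variance
```

Solving the linear system K_y^{−1}(·) once per right-hand side keeps the cost at O(N³) for the factorisation plus O(N²) per new time-point.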

Hyperparameter learning

Given the SE covariance function, one can learn the hyperparameters from the data by optimising the log-marginal likelihood function of the GP. In general, a non-parametric model such as the GP can employ a variety of kernel families whose hyperparameters can be adapted with respect to the underlying intensity and frequency of the local signal structure, and interpolate it in a probabilistic fashion (i.e. while quantifying the uncertainty of prediction). The SE kernel allows one to give intuitive interpretations of the adapted hyperparameters, especially for one-dimensional data such as a gene expression time-series; see Figure 5.

Optimising the marginal likelihood

In the context of GP models, the marginal likelihood results from the marginalisation over the function values **f**,

p(**y**|**x**, **θ**) = ∫ p(**y**|**f**) p(**f**|**x**) d**f**,

where the GP prior p(**f**|**x**) is given in eq. (9) and the likelihood is a factorised Gaussian, p(**y**|**f**) = ∏_{n = 1..N} N(y_{n}|f_{n}, σ²). The integral is tractable and yields the log-marginal likelihood (LML)

ln p(**y**|**x**, **θ**) = −(1/2)**y**^⊤**K**_{y}^{−1}**y** − (1/2)ln|**K**_{y}| − (N/2)ln(2π).   (25)

We notice that the marginal is explicitly conditioned on the hyperparameters **θ**, on which the covariance **K**_{f} (and hence **K**_{y}) depends.

We use gradient-based numerical optimisation of the LML with respect to the hyperparameters **θ**.
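A sketch of the LML computation of eq. (25), done through a Cholesky factorisation of **K**_{y} for numerical stability; this is our own illustration, not the paper's code, and the data and hyperparameter values are chosen only for the demonstration.

```python
import numpy as np

def se_kernel(x1, x2, l2, sf2):
    d2 = (np.asarray(x1, float)[:, None] - np.asarray(x2, float)[None, :]) ** 2
    return sf2 * np.exp(-0.5 * d2 / l2)

def log_marginal_likelihood(x, y, l2, sf2, sn2):
    """ln p(y|x, theta) = -1/2 y^T K_y^{-1} y - 1/2 ln|K_y| - N/2 ln(2 pi)."""
    N = len(x)
    K_y = se_kernel(x, x, l2, sf2) + sn2 * np.eye(N)
    L = np.linalg.cholesky(K_y)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, np.asarray(y, float)))
    # data-fit term          complexity penalty              normalising constant
    return -0.5 * y @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * N * np.log(2 * np.pi)

rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 15)
y = np.sin(x) + 0.1 * rng.normal(size=15)

# A smooth-signal configuration should score higher than an all-noise one
# on a profile with clear temporal structure.
lml_signal = log_marginal_likelihood(x, y, l2=2.0, sf2=0.5, sn2=0.01)
lml_noise = log_marginal_likelihood(x, y, l2=2.0, sf2=1e-8, sn2=float(np.var(y)))
assert lml_signal > lml_noise
```

Since ln|**K**_{y}| = 2 Σ_{i} ln L_{ii}, the determinant term comes for free once the Cholesky factor is available.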

Ranking with likelihood-ratios

Alternatively, one may choose to go "fully Bayesian" by placing a prior p(**θ**) on the hyperparameters based on some initial beliefs, such as the functions having large lengthscales, and optimise the marginal likelihood so that the optimum lengthscale tends to a large value, unless there is evidence to the contrary.

In the case where one is using different types of models (e.g. with different numbers of hyperparameters), a standard Bayesian way of comparing two such models is through Bayes factors,

K = p(**y**|M_{1}) / p(**y**|M_{2}),

where the models M_{1} and M_{2} are each scored by their marginal likelihood, with the hyperparameters integrated out.

In this paper we present a much simpler, but effective, approach to ranking the differential expression of a profile. Instead of integrating out the hyperparameters, we approximate the Bayes factor with a log-ratio of marginal likelihoods (cf. eq. (25)),

r = LML(**θ**_{signal}) − LML(**θ**_{noise}),

with each LML being a function of a different instantiation of **θ**. We still maintain two competing hypotheses: one in which the observed variance of the profile is explained purely as noise, and one in which a smooth underlying signal is present.
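The log-ratio ranking can be sketched as follows. The "quiet" configuration attributes all observed variance to noise, while the signal configurations are drawn from a small grid; the grid, the 90/10 variance split and the toy profiles are all illustrative assumptions of ours, not the paper's settings.

```python
import numpy as np

def se_kernel(x1, x2, l2, sf2):
    d2 = (np.asarray(x1, float)[:, None] - np.asarray(x2, float)[None, :]) ** 2
    return sf2 * np.exp(-0.5 * d2 / l2)

def lml(x, y, l2, sf2, sn2):
    N = len(x)
    L = np.linalg.cholesky(se_kernel(x, x, l2, sf2) + sn2 * np.eye(N))
    a = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * y @ a - np.sum(np.log(np.diag(L))) - 0.5 * N * np.log(2 * np.pi)

def de_score(x, y, l2_grid=(1.0, 5.0, 30.0)):
    """Best signal-model LML minus quiet-model LML (approximate log-Bayes factor)."""
    total = float(np.var(y)) + 1e-12
    quiet = lml(x, y, l2=1.0, sf2=1e-10, sn2=total)         # all variance = noise
    signal = max(lml(x, y, l2=l2, sf2=0.9 * total, sn2=0.1 * total)
                 for l2 in l2_grid)
    return signal - quiet

rng = np.random.default_rng(2)
x = np.linspace(0.0, 10.0, 12)
smooth = np.sin(x) + 0.1 * rng.normal(size=12)   # a differentially expressed profile
flat = 0.7 * rng.normal(size=12)                 # a quiet, noise-only profile

# Smooth profiles receive larger scores and therefore rank higher.
assert de_score(x, smooth) > de_score(x, flat)
```

Ranking all profiles by this score then feeds directly into the ROC assessment described in **Results and Discussion**.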

Local optima of the log-marginal likelihood (LML) function

These two configurations correspond to two points in the three-dimensional function that is the LML, both of which usually lie close to local-optimum solutions. This assumption can be verified empirically by exhaustively plotting the LML as a function of **θ** for a number of profiles; see Figure 5(d).

In most cases the LML (eq. (25)) is not convex. Multiple optima do not necessarily pose a threat here; depending on the data, and as long as they have similar function values, multiple optima present alternative interpretations of the observations. To alleviate the problem of spurious local-optimum solutions, however, we make the following observation: when we explicitly constrain how the total observed variance var(**y**) is shared between the signal and noise variance hyperparameters, the LML becomes a function of two quantities, the lengthscale ℓ² and one of signal-to-noise ratio (SNR).

Figure 5(d) depicts the LML function with respect to ℓ² and the SNR. It features two local optima: one for a small lengthscale and a high SNR, where the observed data are explained with a relatively complex function and a small noise variance, and one for a large lengthscale and a low SNR, where the data are explained by a simpler function with high noise variance. We also notice that the first optimum has a lower LML. This relates to the algebraic structure of the LML (eq. (25)); the first term (dot product) promotes data fitness and the second term (determinant) penalizes the complexity of the model [21, sec. 5.4]. Overall, the LML function of the Gaussian process offers a good fitness-complexity trade-off without the need for additional regularisation. Optionally, one can use multiple initialisation points focusing on different non-infinite lengthscales to deal with the multiple local optima along the lengthscale axis, and pick the best solution (max LML) to represent the differential-expression hypothesis.
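An exhaustive search over ℓ² and SNR values, with the observed variance shared between signal and noise, can be sketched as below; the grids, seed and toy data are illustrative choices of ours.

```python
import numpy as np

def se_kernel(x1, x2, l2, sf2):
    d2 = (np.asarray(x1, float)[:, None] - np.asarray(x2, float)[None, :]) ** 2
    return sf2 * np.exp(-0.5 * d2 / l2)

def lml(x, y, l2, sf2, sn2):
    N = len(x)
    L = np.linalg.cholesky(se_kernel(x, x, l2, sf2) + sn2 * np.eye(N))
    a = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * y @ a - np.sum(np.log(np.diag(L))) - 0.5 * N * np.log(2 * np.pi)

def best_config(x, y, l2_grid, snr_grid):
    """Return (max LML, l2, snr), with sf2 + sn2 constrained to equal var(y)."""
    total = float(np.var(y))
    candidates = []
    for l2 in l2_grid:
        for snr in snr_grid:
            sf2 = total * snr / (1.0 + snr)   # signal share of the variance
            sn2 = total / (1.0 + snr)         # noise share of the variance
            candidates.append((lml(x, y, l2, sf2, sn2), l2, snr))
    return max(candidates)

rng = np.random.default_rng(3)
x = np.linspace(0.0, 10.0, 12)
y = np.sin(x) + 0.1 * rng.normal(size=12)
score, l2_opt, snr_opt = best_config(x, y,
                                     l2_grid=(0.5, 2.0, 8.0, 32.0),
                                     snr_grid=(0.01, 0.1, 1.0, 10.0, 100.0))

# On a smooth profile the search favours a configuration with a high SNR.
assert snr_opt >= 1.0
```

A coarse grid like this plays the role of the multiple initialisation points mentioned above: the winning cell can seed a subsequent local optimisation of the LML.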

Source code

The source code for the GP regression framework is available as MATLAB code and as a port to the R statistical language.

Authors' contributions

AAK designed and implemented the computational analysis and ranking scheme presented here, assessed the various methods and drafted the manuscript. NDL pre-processed the experimental data and wrote the original Gaussian process toolkit for MATLAB and AAK rewrote it for the R statistical language. Both AAK and NDL participated in interpreting the results and revising the manuscript. All authors read and approved the final manuscript.

Acknowledgements

The authors would like to thank Diego di Bernardo for his useful feedback on the experimental data. Research was partially supported by an EPSRC Doctoral Training Award, the Department of Neuroscience, University of Sheffield and BBSRC (grant BB/H018123/2).