Department of Computational & Systems Biology, John Innes Centre, Norwich Research Park, Norwich NR4 7UH, UK
Department of Disease & Stress Biology, John Innes Centre, Norwich Research Park, Norwich NR4 7UH, UK
Abstract
Background
A first step in building a mathematical model of a biological system is often the analysis of the temporal behaviour of key quantities. Mathematical relationships between the time and frequency domain, such as Fourier Transforms and wavelets, are commonly used to extract information about the underlying signal from a given time series. This one-to-one mapping from time points to frequencies inherently assumes that both domains contain the complete knowledge of the system. However, for truncated, noisy time series with background trends this unique mapping breaks down and the question reduces to an inference problem of identifying the most probable frequencies.
Results
In this paper we build on the method of Bayesian Spectrum Analysis and demonstrate its advantages over conventional methods by applying it to a number of test cases, including two types of biological time series. Firstly, oscillations of calcium in plant root cells in response to microbial symbionts are non-stationary and noisy, posing challenges to data analysis. Secondly, circadian rhythms in gene expression measured over only two cycles highlight the problem of time series with limited length. The results show that the Bayesian frequency detection approach can provide useful results in specific areas where Fourier analysis can be uninformative or misleading. We demonstrate further benefits of the Bayesian approach for time series analysis, such as direct comparison of different hypotheses, inherent estimation of noise levels and parameter precision, and a flexible framework for modelling the data without preprocessing.
Conclusions
Modelling in systems biology often builds on the study of time-dependent phenomena. Fourier Transforms are a convenient tool for analysing the frequency domain of time series. However, there are well-known limitations of this method, such as the introduction of spurious frequencies when handling short and noisy time series, and the requirement for uniformly sampled data. Biological time series often deviate significantly from the requirements of optimality for Fourier transformation. In this paper we present an alternative approach based on Bayesian inference. We show the value of placing spectral analysis in the framework of Bayesian inference and demonstrate how model comparison can automate this procedure.
Background
Pattern recognition is central to many scientific disciplines and is often a first step in building a model that explains the data. In particular, the study of periodic phenomena and frequency detection has received much attention, leading to the well-established field of spectral analysis.
Biology is rich with (near) periodic behaviour, with sustained oscillations in the form of limit cycles playing important roles in many diverse phenomena such as glycolytic metabolism, circadian rhythms, mitotic cycles, cardiac rhythms, hormonal cycles, population dynamics and epidemiological cycles.
Wavelet Transforms
Bayesian inference provides another approach for analysing data (for an introduction to Bayesian analysis, see
There are several advantages of the Bayesian approach, including an inherent mechanism for estimating the accuracy of the result and all parameters, as well as the ability to compare different hypotheses directly. Focus is shifted to the question of interest by integrating out other parameters and noise levels. Initial knowledge of the system can be incorporated in the analysis and expressed in the prior probability distributions. There has been a recent flood of Bayesian papers with some convincing applications and promising developments in systems biology (see
In this paper, we describe the development, implementation and testing of Bayesian model development coupled with BSA and Nested Sampling, in a biological context. We present a comparison of this approach with the FT, applied to a number of simulated test cases and two types of biological time series that present challenges to accurate frequency detection. We first present some necessary background, upon which we build to develop our approach.
Bayesian inference
Data is rarely available in sufficient quantity and quality to allow for exact scientific deduction. Instead we are forced to infer models from incomplete knowledge. Bayesian inference is based on Bayes' Rule, which evaluates a hypothesis,
where
Bayesian Spectrum Analysis
Our presentation in this section follows closely that of Bretthorst
A general model for observed data sampled at
The signal function will usually be unknown and may be complicated, but can be approximated by a linear combination of
in which
Similarly, the background function,
where
Since a and b are not the main focus of the analysis, we will aim to integrate them out of the equations by marginalisation. Parameters that are treated in this manner are often referred to as nuisance parameters, which we denote here by
such that now
The likelihood function,
A noise model of this form ensures that the accuracy of the results is maximally conservative for a given noise power. We will later integrate over all possible noise levels to remove the dependence on
where
where
The goal of the analysis is to compute the posterior probability for frequencies in the data, i.e. to go from the joint probability distribution to a posterior probability of
To integrate over the σ and c values, priors must first be assigned to them. We chose uniform priors for c and
Using the general model, equation (5), assigning the priors, calculating the likelihood function, equation (8), and integrating out the amplitudes and noise parameters, the posterior probability distribution of
where
Results and Discussion
We employed the framework developed by Jaynes
Model comparison
After evaluating the probability of parameters in light of a certain hypothesis, it is important to question the validity of that hypothesis. Thus, the next step in Bayesian inference is to compare the probability of different hypotheses. The hypothesis is now a particular model of the signal,
Then two different models,
The probability of the data given our prior information,
Model development
It will often not be obvious which function to choose to model trends in the data, so it is advantageous to use basis functions and expand these to different orders, as in equation (4). Each expansion represents a different model,
This model ratio can be used to determine the number of background model functions for each time series. The posterior probability ratio is calculated between model
where
When the model ratio,
This model development approach used for the background functions above can also be used to decide on the number of underlying frequencies in the data. The model ratios of a time series containing one frequency (case A) and a time series containing two (case B) are presented in Table
Model ratios for number of frequencies in the data
Case | Models | Model ratio
A | 1 | 66936
B | 2 | 0
B | 2 | 686
Case A has only one frequency,
We point out that the proposed method stops once the current best model has been found but is not guaranteed to find the global maximum from a predefined set of models. The procedure is thus part of model development rather than model selection. If the set of hypotheses is known in advance then the posterior ratios over the full set should be used to find the best model.
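The difference between the stepwise procedure and a comparison over the full set of models can be illustrated with a minimal sketch. The function names and the log-evidence values below are hypothetical and chosen only for illustration; they are not results from our analysis. Equal prior probabilities are assumed for all models, so that the posterior ratio reduces to the ratio of evidences.

```python
def greedy_model_order(log_evidence):
    """Stepwise development: accept one more function while the ratio exceeds 1.

    log_evidence[m] is the log marginal likelihood of the model with m
    background functions (hypothetical values supplied by the caller).
    Equal model priors are assumed, so the log posterior ratio is simply
    the difference of log evidences.
    """
    m = 0
    while m + 1 < len(log_evidence):
        log_ratio = log_evidence[m + 1] - log_evidence[m]
        if log_ratio <= 0.0:  # ratio <= 1: the larger model is not favoured
            break
        m += 1
    return m


def best_model_order(log_evidence):
    """Exhaustive alternative: compare every model in a predefined set."""
    return max(range(len(log_evidence)), key=lambda m: log_evidence[m])


# Hypothetical log evidences with a local maximum at m = 1
# and the global maximum at m = 3.
example = [-120.0, -95.0, -97.5, -90.0, -93.0]
greedy = greedy_model_order(example)
best = best_model_order(example)
print(greedy, best)
```

In this constructed example the stepwise search stops at the first local maximum, whereas the exhaustive comparison finds the model with the highest evidence.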
Testing
We first show the power of the BSA approach on test cases using simulated data. In these tests, we sought to recover known input parameters from the simulated data, to validate the BSA approach. We employed sines and cosines as model functions (
Representative cases of noise levels and background trends are shown in Table
BSA and FFT results from simulated harmonic data with noise and background trends
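No. | | | | | | | | | | | sn
1 | 0.5 | 1 | | | 0.49 | 0.06 | 0.5 | 0 | 0.5 | 0.0002 | 70
2 | 0.5 | 10 | | | 0.49 | 0.20 | 0.5 | 0.0002 | 0.5 | 0.0004 | 6.5
3 | 0.5 | 40 | | | 0.49 | 0.54 | 0.5 | 0.0005 | 0.5 | 0.0011 | 1.9
4 | 0.5 | 10 | 10 | | 0.49 | 0.27 | 0.5 | 0 | 0.5 | 0.0003 | 4.2
5 | 0.5 | 10 | 40 | | 0.49 | 0.57 | 0.5 | 0.0002 | 0.5 | 0.0007 | 2.2
6 | 0.5 | 100 | 40 | | 0.49 | 0.89 | 0.5 | 0.0006 | 0.5 | 0.0020 | 0.7
7 | 0.3, 0.5 | 10 | 10 | | 0.29, 0.51 | 0.14 | 0.3, 0.5 | 0.0003 | 0.34 | 0.0832 | 1
8 | 0.5 | 10 | | | 0 | 0.15 | 0.5 | 0.0002 | 0.5 | 0.0002 | 110
9 | 0.5 | 10 | | | 0 | 0.19 | 0.5 | 0.0002 | 0.5 | 0.0002 | 90
10 | 0.5 | 10 | | | 0.02 | 0.24 | 0.5 | 0.0003 | 0.5 | 0.0002 | 35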
Each time series was generated with a sine function of angular frequency,
BSA has a clear advantage over FFT when the data is non-uniformly sampled. FFT requires uniform sampling, whilst BSA is less stringent and delivers the correct result with higher precision. Bretthorst also noted that non-uniform sampling removes aliases from the frequency domain, another significant advantage.
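As an illustration of frequency estimation from non-uniformly sampled data, the following minimal sketch evaluates a posterior for a single stationary frequency directly at irregular sample times. It uses the simplified Student-t form known from Bretthorst's work for one harmonic frequency and does not include background functions, so it is not the full model used in this paper. The function name, the simulated data, the random seed and the frequency grid are all assumptions made for this example, and the mean is subtracted only because the sketch contains no constant background term.

```python
import math
import random


def log_posterior_single_frequency(times, data, omega):
    """Unnormalised log posterior of one stationary angular frequency.

    Simplified Student-t form: [1 - 2 C(w) / (N * mean(d^2))]^((2 - N) / 2),
    where C(w) = (R^2 + I^2) / N is the periodogram evaluated directly at
    the (possibly non-uniform) sample times.
    """
    n = len(data)
    r = sum(d * math.cos(omega * t) for t, d in zip(times, data))
    i = sum(d * math.sin(omega * t) for t, d in zip(times, data))
    c = (r * r + i * i) / n
    mean_sq = sum(d * d for d in data) / n
    # Guard: for non-uniform times sine and cosine are only approximately
    # orthogonal, so clamp the argument to keep the logarithm defined.
    arg = max(1e-12, 1.0 - 2.0 * c / (n * mean_sq))
    return 0.5 * (2 - n) * math.log(arg)


# Simulated example (assumed values): 100 samples at random times in [0, 100],
# a unit-amplitude sine of angular frequency 0.5 plus unit-variance noise.
random.seed(1)
true_omega = 0.5
times = sorted(random.uniform(0.0, 100.0) for _ in range(100))
data = [math.sin(true_omega * t) + random.gauss(0.0, 1.0) for t in times]
mean = sum(data) / len(data)
data = [d - mean for d in data]  # no constant background term in this sketch

# Evaluate the posterior on a grid of angular frequencies from 0.1 to 1.0.
grid = [0.1 + 0.001 * k for k in range(901)]
omega_hat = max(grid, key=lambda w: log_posterior_single_frequency(times, data, w))
print(round(omega_hat, 3))
```

No interpolation onto a uniform grid is needed here, because the sums are taken over the actual sample times.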
Background trends
Additional file
Time series with background trend. Time series including a background trend, simulated from
Automated model development
Models
Model ratio
1.9459e06
1.0256e167
622.5
566.3
501.8
99.2
Posterior probability ratios of models including a different number of background functions,
Examples 8-10 in Table
Short time series
Additional file
Short time series. A: A short time series simulated from
High noise levels
BSA is also successful at handling high levels of noise, as highlighted in Examples 1-6 in Table
Effects of noise on precision. The effect of noise on
Multiple frequencies
Example 7 in Table
Higher harmonics. A: Time series with higher harmonic frequencies, simulated from
As another example, Additional file
Multiple frequencies. A: A time series containing two distinct frequencies, simulated from
Additional file
Multiple close frequencies with noise. A: A time series containing two close frequencies,
To develop BSA further, we used windowing of the time series to compute the posterior probability distribution of
Frequency changing over time. A: A time series with a changing frequency, simulated from
One sharp frequency change. A: A time series with a sharp change in frequency half way through the observed time frame, simulated from
Non-harmonic oscillations
BSA results for oscillations with a non-harmonic shape are superior to those of the FFT. This highlights an essential difference between the two methods, since biological data is often repetitive but shows a wide range of oscillatory patterns. To demonstrate this further, Figure
Near-periodic oscillations. A: A near-periodic time series simulated from a set of ODEs describing Ca^{2+} oscillations in animal cells
This highlights the difference between frequencies in the data and spike intervals. Inter-spike intervals (ISI) are a common way of characterizing spike data; however, multiple ISI need not correspond to multiple frequencies in the data. Of the four strong ISI shown here, both BSA and FFT identify only one as a regular period.
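A small constructed example may clarify why several ISI can arise from a single regular period. The spike train below is hypothetical: its intervals alternate between 60 s and 100 s, values chosen only for illustration. It therefore shows two distinct ISI, yet the pattern as a whole repeats with one period of 160 s.

```python
def inter_spike_intervals(spike_times):
    """Differences between consecutive spike times."""
    return [b - a for a, b in zip(spike_times, spike_times[1:])]


# Hypothetical spike train: intervals alternate between 60 s and 100 s.
pattern = [60.0, 100.0]
spike_times = [0.0]
for k in range(8):
    spike_times.append(spike_times[-1] + pattern[k % 2])

isi = inter_spike_intervals(spike_times)
distinct_isi = sorted(set(isi))
period = sum(pattern)  # the spike pattern repeats every 160 s

# Every second spike is separated by exactly one period.
repeats = all(
    abs((spike_times[k + 2] - spike_times[k]) - period) < 1e-9
    for k in range(len(spike_times) - 2)
)
print(distinct_isi, period, repeats)
```

A strictly periodic signal of this kind has spectral content at the fundamental frequency of the 160 s period and its harmonics, rather than separate components at the two individual intervals.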
Summary
From extensive test cases, we find that BSA delivers superior results in cases where the FFT assumptions are too constraining, most notably in the five cases above. BSA is a flexible method allowing the underlying hypothesis to be changed depending on the focus of the analysis, and to directly compare the validity of different hypotheses. It can handle non-uniformly sampled data and has no need for preprocessing procedures. These superior results come at a computational cost that ranged from tens to hundreds of seconds for the examples shown here.
Calcium spiking data
The first biological data set comes from intracellular signalling in plant-microbe interactions. Symbiotic bacteria induce calcium oscillations, called Ca^{2+} spiking, in legume root cells (for a review, see
The Ca^{2+} spiking data contain background trends caused by fluorescence bleaching and cell movements, which are assumed to be unrelated to the underlying signal in the cell. Therefore, accounting for the background functions plays a key role in the analysis. Example time series are shown in Figure
Example results of data from calcium spiking. A: Time series of Ca^{2+} oscillations measured in a
The FFT of the Ca^{2+} data results in a very broad periodogram, due to multiple frequencies and high noise levels (Figure
BSA on calcium data
Cell | BSA Period ± | BSA-NS Period ±
1 | 97.4 ± 0.23 | 97.3 ± 0.15
2 | 80.9 ± 0.63 | 75.2 ± 10.1
3 | 74.6 ± 0.19 | 74.6 ± 0.85
4 | 123.8 ± 0.16 | 124.2 ± 1.18
5 | 88.9 ± 0.22 | 123.9 ± 0.61
6 | 74.6 ± 0.21 | 113.7 ± 16.16
7 | 121.9 ± 0.22 | 146.1 ± 21.53
8 | 74.4 ± 0.92 | 75.2 ± 2.69
9 | 48.2 ± 0.3 | 64.5 ± 13.94
Analysed calcium oscillations in
Circadian data
The second biological data set shows gene expression of so-called clock genes. Many processes in plants follow a circadian rhythm (for reviews see e.g.
For these circadian rhythms, we chose to analyse RT-PCR data from four clock genes in two genotypes of
Example results of data from a clock gene's RNA levels. A: RNA levels of a clock gene,
BSA on circadian data
Gene | Genotype | BSA Period ± | BSA-NS Period ±
 | | 22.75 ± 0.18 | 22.58 ± 0.43
 | | 23.36 ± 0.20 | 23.26 ± 0.42
 | | 23.58 ± 0.15 | 23.67 ± 0.94
 | | 23.98 ± 0.16 | 24.23 ± 0.72
 | | 22.39 ± 0.14 | 22.54 ± 0.86
 | | 23.41 ± 0.16 | 23.61 ± 0.83
 | | 23.84 ± 0.16 | 23.82 ± 1.54
 | | 25.74 ± 0.19 | 24.03 ± 1.23
The BSA results of RNA levels of four so-called clock genes in
Conclusions
Bayesian inference offers a powerful way of analysing biological time series. Despite the undisputed value of Fourier theory, there are cases when the necessary requirements for its optimality for time series analysis are not met. This is a consequence of the underlying assumptions of a Fourier Transform, causing it to work optimally only for uniformly sampled, long, stationary, harmonic signals that have either no or white noise. In biology these requirements are rarely fulfilled, requiring preprocessing of the data, such as noise reduction and detrending techniques, with the risk of convoluting the signal and losing valuable information.
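The requirement for long time series can be made concrete with a short worked example of the period resolution of the discrete Fourier Transform. The sampling design below (two circadian cycles, i.e. 48 h, sampled every 4 h) is an assumption chosen for illustration and is not the design of the experiments analysed here; `dft_periods` is a hypothetical helper name. It relies only on the standard fact that the DFT of N samples with spacing dt has frequency bins at k / (N dt).

```python
def dft_periods(n_samples, dt):
    """Periods (in the units of dt) of the non-zero DFT bins up to Nyquist."""
    duration = n_samples * dt
    return [duration / k for k in range(1, n_samples // 2 + 1)]


# Assumed design: two circadian cycles (48 h) sampled every 4 h, so N = 12.
periods = dft_periods(n_samples=12, dt=4.0)
print(periods)
```

For this assumed design the bins neighbouring the 24 h period lie at 48 h and 16 h, so differences in period of an hour or two cannot be resolved from the raw DFT bins alone.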
By placing the problem of frequency extraction in the framework of Bayesian inference, the known and well-documented problems of Fourier analysis can be overcome. This approach also breaks the resolution and precision limitations inherent to the FFT by introducing a continuous probability distribution instead of the fixed number of points maintained by the discrete Fourier Transform. As we demonstrated here, BSA coupled with automated model development can give superior results to the FFT when faced with short, noisy time series, non-stationarity and non-harmonic signals. The suggested automated model development worked well in our hands but must be used with caution in practice as the approach is not guaranteed to find a global optimum in model space. Alternate models should be explored and compared using posterior probability ratios or approximations thereof. We found Nested Sampling
BSA calculates signal-to-noise ratios, provides parameter precision estimates, and can handle high noise levels as well as background trends and therefore has no need for preprocessing. More importantly, the Bayesian framework offers flexibility in the underlying model and enables direct comparison of hypotheses. The work presented here is merely a first step in this direction. We have employed conservative priors (uniform, Jeffreys, Gaussian) that make an analytical treatment tractable, but in some cases more information could warrant a different choice of prior that might require substantial alterations to our approach to handle the numerics of marginalisation.
There are many known examples in biology in which oscillations play a key role and methods for their detection will be of value, especially in cases where subtle differences are of importance and for short, noisy time series. In the presented examples, we demonstrated the improvements that can be gained from employing this approach. Although in these cases, the biological conclusions would not have changed, one can envision scenarios in which a higher accuracy in frequency detection may allow subtle changes to be detected, which may otherwise have been swamped by noise and less powerful techniques. We believe that the presented methodology offers an attractive alternative to other approaches and will be a useful addition to the toolbox of systems biologists.
Methods
All programming was done using Octave
FT
The DFT was computed using the
There are a number of sophisticated FT methods beyond the standard FFT, developed to avoid specific problems. For example, we also present results from the multi-taper method (MTM), short-time Fourier Transforms (STFT) and wavelet analysis. For the MTM, only the MTM spectrum is presented, but it should be noted that the Singular Spectrum Analysis - Multi-Taper Method (SSA-MTM) toolkit provides additional features such as significance levels of the frequencies, relative to the estimated noise levels
BSA
A flowchart of the BSA code is shown in Figure
BSA and automated background function determination. Flowchart of the automated model development procedure for BSA. We point out that the proposed method for detecting the best number of background functions may give rise to local rather than global solutions for complex background trends and/or poor choices of background basis functions.
The next step is to specify the frequency domain of interest. This domain is then sampled with a chosen interval, and the posterior probability is computed at each frequency. Since the
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
RJM and GEDO conceived the experiments. RJM and EG developed and implemented the code. EG performed all the tests and analyses. EG and RJM wrote the paper. All authors read and approved the final manuscript.
Acknowledgements
We thank Dr David Richards and Nick Pullen for critical reading of the manuscript and useful suggestions. We thank Dr Jongho Sun for the provision of calcium spiking data. Thanks are also due to four anonymous reviewers for their detailed, constructive criticism and insightful comments. We would like to thank the Free Software Foundation and all authors of software packages who generously make their tools freely available (LaTeX, gnuplot, emacs, Octave, gcc, and many others). EG acknowledges PhD funding from the John Innes Foundation. RJM and GEDO are grateful for support from the BBSRC.