Abstract
Background
Detection of periodically expressed genes from microarray data without use of known periodic and nonperiodic training examples is an important problem, e.g. for identifying genes regulated by the cellcycle in poorly characterised organisms. Commonly the investigator is only interested in genes expressed at a particular frequency that characterizes the process under study but this frequency is seldom exactly known. Previously proposed detector designs require access to labelled training examples and do not allow systematic incorporation of diffuse prior knowledge available about the period time.
Results
A learningfree Bayesian detector that does not rely on labelled training examples and allows incorporation of prior knowledge about the period time is introduced. It is shown to outperform two recently proposed alternative learningfree detectors on simulated data generated with models that are different from the one used for detector design. Results from applying the detector to mRNA expression time profiles from S. cerevisiae showsthat the genes detected as periodically expressed only contain a small fraction of the cellcycle genes inferred from mutant phenotype. For example, when the probability of false alarm was equal to 7%, only 12% of the cellcycle genes were detected. The genes detected as periodically expressed were found to have a statistically significant overrepresentation of known cellcycle regulated sequence motifs. One known sequence motif and 18 putative motifs, previously not associated with periodic expression, were also over represented.
Conclusion
In comparison with recently proposed alternative learningfree detectors for periodic gene expression, Bayesian inference allows systematic incorporation of diffuse a priori knowledge about, e.g. the period time. This results in relative performance improvements due to increased robustness against errors in the underlying assumptions. Results from applying the detector to mRNA expression time profiles from S. cerevisiae include several new findings that deserve further experimental studies.
Background
Several different algorithms for detection of periodically expressed genes in DNA microarray temporal profiles have been proposed [17]. Theoretical and algorithmic foundations for the detection algorithms include for example Fourier analysis [1,2], spline modelling [6], singlepulse models [5], and partial least squares classification [7]. One group of algorithms, including those in [1,3,5,6], use supervised learning methods [8] that exploit labelled expression profiles of genes known to be periodically expressed in the experiment to find other genes that also are periodic. This supervised learning approach precludes many potential applications where labelled training examples are not available e.g. for poorly characterised organisms.
In the subgroup of recently proposed learningfree algorithms which do not rely on supervised learning, prior knowledge in the form of a known angular frequency ω is presumed. For example, in [2] the power (amplitude) of frequency ω in the expression profile Fourier spectrum is used in creating a score for detection. However, since the period time usually is not exactly known, a novel method was proposed in [4] to resolve this problem by employing Fisher's gtest [8] for detection of periodic temporal profiles. This approach seems attractive but is designed to detect any periodic temporal profile, even if its period differs significantly from the period of the cellcycle. In other words, it does not use any prior knowledge about the period time.
The algorithm used in the work by Spellman et al. (1998) [1] partly utilises prior knowledge about the cellcycle period. This is achieved by averaging the power of Fourier spectrum frequencies across a discrete set of frequencies over a frequency interval that is thought to encompass the true frequency (corresponding to the period time as determined by auxiliary experimental techniques). The detector test statistic is modulated by the Pearson correlations between the present temporal profile and a set of labelled temporal profiles for genes known to be periodically expressed. Thus, the algorithm belongs to the subgroup of supervised methods.
The performance of various algorithms used for detection of cellcycle genes, i.e. genes regulated by and/or regulating the cellcycle, were recently compared by de Lichtenberg et al. (2005) [2]. Importantly, one should remember that not all genes, required for the cellcycle are periodically expressed. One example is cdc28 kinase whose activity is phase specific but regulated by the cyclins on the protein level [9]. Moreover, there are many periodically expressed genes that are not part of the cellcycle machinery. For example, the gene encoding the cell wall mannoprotein TIR1 (ORF id YER011W) is known to be periodically expressed (it is included in set B1 of de Lichtenberg et al. [2]). In the Gene Ontology [10] TIR1 is annotated to the biological process of stressresponse, but not to any annotation linked to the cell cycle. With this in mind, inferred periodic expression from temporal patterns of mRNA expression may be used to study to what degree cellcycle genes are periodically expressed, but it is not expected that a detector for periodic expression will recover all cellcycle genes.
Any detector for periodic transcription transforms the expression time profile into a real number, the test statistic, which is used to make a decision or to rank the genes. The sensitivity and specificity of the detector thus depends on the particular test statistic threshold employed and should therefore be evaluated for all relevant threshold settings in a Receiver Operating Characteristic (ROC) analysis. The ROC is usually displayed in the form of a curve that shows the probability of detection as a function of the probability of false alarm for all possible values of the threshold [11]. Due to the small number of labelled temporal gene expression profiles presently available, it is not possible to perform reliable evaluation of the performance of the detectors on real biological data. For budding yeast S. cerevisiae there are compilations of genes known to be periodically expressed during the cellcycle [2] but there are no compilations of genes known to be nonperiodic. Thus, using the few genes known to be periodically expressed, a rough estimate of the probability of detection can be obtained for any threshold value but there will be no estimates at all of the probability of false alarm.
In addition to the lack of labelled data for performance validation, another difficulty is that a gene may very well be periodically expressed in one experimental condition but not in another [12]. In conclusion, meaningful evaluations of detectors must be performed on simulated data. Since the exact distribution of the temporal profiles of genes that are periodically expressed in any experiment is not known, it is important to study the robustness of the method when it is applied to time profiles generated from models different from those assumed in the construction of the detector.
In this work we demonstrate how Bayesian inference [13,14] can use diffuse prior knowledge about the cellcycle period time to achieve improved detection of periodically expressed genes with a period close to that of the cellcycle, without help from any labelled training examples provided by a supervisor. In a recently proposed empirical Bayesian approach to detection of periodically expressed genes [3], Bayesian inference is used to compensate for undesirable phase desynchronization among the cells in the population during the cellcycle experiments. However, Bayesian inference is not used in the critical step of designing a likelihood ratio based optimal detector that maximizes the probability of detection P_{D }for a given choice of the probability of false alarm P_{FA}. The new Bayesian detector we introduce here calculates an approximation of the desired likelihood ratio and is based on diffuse but prior knowledge about the cellcycle period time that has not been employed in earlier approaches.
We compare the performance of the Bayesian detector introduced here to two stateoftheart algorithms for learningfree detection of periodically expressed genes. The first is a detector based on Fisher's gtest [4]. It is the only detector proposed for this purpose that is learningfree without requiring the period time to be exactly known. The second detector is based on a combination of two separate score values. It requires the period time to be exactly known and has performed well in a recently reported study [2].
Results
Evaluation on simulated data
Simulated periodic gene expressions were generated by adding random noise drawn from a normal distribution to fixed periodic temporal waveforms (sinus, sawtooth, square). Simulated nonperiodic gene expressions were generated by samples from another normal distribution. We do not believe this choice of waveforms to correspond to naturally occurring time profiles. However, using several different waveforms provides information about the robustness of the detectors against model errors. In addition to the periodic waveforms, we also simulated gene expression profiles that were periodic with a multiplicative amplitude attenuation similar to what has been observed in experimental data [1,15].
Comparisons of the Bayesian detector to Fisher's gtest [4,8] and a combination test proposed by de Lichtenberg et al. [2] were performed. In Fisher's gtest, the sampling distribution of the magnitude I(ω_{k}) of the strongest Fourier spectrum component ω_{k }in samples from a sequence of independent, identically distributed Gaussian distributions (white noise), is used in a classical hypothesis test. Specifically, the statistic
is calculated for each gene where N is the number of time points and ω_{k }= 2π k/N. The sampling distribution of this statistic for timeseries from a white noise Gaussian process has been determined analytically and is used in a hypothesis test.
In the approach by de Lichtenberg et al. [2], expression profiles are ranked by the combination test statistic
that depends on two pvalues obtained via resampling, P_{regulation }and P_{periodicity}. P_{regulation }is intended to reflect the probability that the standard deviation of a particular gene can be obtained by pure chance. P_{periodicity }is intended to reflect the chance of obtaining a random signal with more energy in the user defined frequency component (period time) of interest than the corresponding energy in the particular gene time profile of interest. P_{total }is defined as the product P_{periodicity}·P_{regulation}.
Figure 1 shows ROC curves for the three detectors studied using simulated data for three different waveforms (sinusoid, sawtooth and square). The data emulates typical microarray experiment sampling procedures with 20 samples spaced 5 mins apart encompassing two periods (100 min for 50 min period signal). In this case, we assumed that the estimated period time was 40 min and that the Bayesian and the combination detector were designed based on this information with the Bayesian detector also using a diffuse prior that reflects uncertainty about the period time estimate. The Bayesian detector used a Gaussian prior with standard deviation of 0.1 and a mean of 2π /T, where T is indicated in the figure legends. Clearly, the Bayesian detector outperformed both Fisher's gtest and the combination test in all instances and appears to be more robust than the other detectors since performance was better on all three waveforms. As can also been seen in Figure 1 all detectors have poor performance on the sawtooth waveform. This clearly demonstrates the limitations of detectors based on a single sinusoidal frequency. It should be noted that the signaltonoise ratio used in the simulations corresponds to a relatively high level of experimental noise and that all detectors performs much better when the noise level is decreased (data not shown).
Figure 1. ROC curves for simulated data with different wave forms. The Bayesian detector (BIC 40) with a Gaussian prior with mean 40 and standard deviation 0.1, the Fisher's gtest detector (Fisher), and the combination test detector (Lichtenberg) for the period time 40 min, were all applied to 1000 samples each from a set of simulated periodic signals and nonperiodic signals. The periodic signals had period 50 min and amplitude 1, sampled at 5 min intervals over two periods (100 min). Gaussian white noise with standard deviation 1.0 was added. The set of nonperiodic signals was formed by sampling a Gaussian distribution with standard deviation equal to the standard deviation of the periodic class and mean zero. Results are shown for three different waveforms, sawtooth (a), sinusoid (b) and square (c). As can be expected the detectors perform best on the sinusoidal waveform (b). The relatively larger robustness of the Bayesian detector is clearly revealed as it outperforms the other two detectors in all three cases. We also studied the effects of attenuation by multiplying each of the waveforms with an exponentially decreasing factor e^{αt }(d, e, f). The attenuation coefficient α of the exponential was chosen such that the amplitude at 100 min was 70% of that at 0 min. As is seen this modification has little effect on the performance of the detectors, but naturally its performance will degrade more with faster attenuation.
It has been observed that the time profiles of yeast gene expression in synchronized populations exhibit attenuation [1,15]. In order to study the effect of this attenuation, we also evaluated the result of including a multiplicative exponential term e^{αt }to the simulated time profiles from the periodic class. The attenutation coefficient α was chosen such that that the amplitude of expression at 100 min was 70% of that at 0 min. The attenuation coefficient α was selected to reflect the rate of attenuation that has been observed in microarray time profiles of periodically expressed genes in S. cerevisiae [1,15]. As seen in Figure 1, the Bayesian detector performs well compared to the other two detectors also on these generative models. The performances are of course expected to become worse for all detectors with growing values of α (faster attenuation).
Exact knowledge of the period time results in a sharp prior in the Bayesian detector (a Dirac delta function located at the correct angular velocity). As shown in Figure 2, in this case the combination test and the Bayesian detector have essentially the same performance. The ROC for the Bayesian detector when the standard deviation of the prior has been increased to 0.1 is also presented in Figure 2. Now the performance drops but is still not far from the performance obtained when the period time is exactly known.
Figure 2. Impact of certainty in prior information. Using the same settings as those used to generate Figure 1 we analysed the performance of the combination test and Bayesian detectors for the correct period time of 50 min. When the standard deviation of period time prior of the Bayesian detector was set to zero, the combination test detector (Lichtenberg) and the Bayesian detector (BIC Exact 50) perform almost identically. Also shown is the performance of the Bayesian detector when the standard deviation was changed to 0.1(BIC 50).
We also evaluated the performance of the Bayesian detector when the simulated noise was nongaussian (Laplace and uniform) and obtained in similar results (data not shown). Since the Bayesian detector performance is more robust against the exact waveform than the other detectors, it is expected to work better also on real data.
Periodically expressed genes in S. cerevisiae
The Bayesian detector was applied to three timecourse experiments where mRNA expression was measured using DNA microarrays during the cellcycle in S. cerevisiae [1]. In these experiments, yeast cells had been synchronized by a method based on either αfactor (estimated period 55–77 min), cdc15 (estimated period 60–80 min) or cdc28 (estimated period 80–100 min). The Bayesian prior for period time was chosen so that the reported interval of frequencies encompassed 70% of the probability mass in a Gaussian distribution, with its mean value equal to the average of the reported frequency interval. For illustrative purposes we show in Figure 3 the temporal profiles for the 300 ORFs with the largest support for periodicity in the αfactor experiments as well as the corresponding 300 genes with the least support. The time profiles have been sorted by the maximum a posteriori value of the phase angle.
Figure 3. Visualisation of time profiles from the αfactor experiment. a) Time profiles of the 300 ORFs with the largest support for periodicity. The experiment covers roughly two cell cycles and a global periodic pattern with two peaks is clearly observed. b) Time profiles of the 300 ORFs with the least support for periodicity. No time dependency may be discerned. The genes in each plot were ordered by the maximum a posteriori estimate of the phase angle θ.
Detection of cellcycle annotated genes
As discussed in the introduction, it is well known that some genes involved in the cellcycle regulation are periodically expressed as well as some that are known not be periodically expressed. To what extent genes involved in the cellcycle machinery are periodically expressed remains an open question. We addressed this question by detecting periodically expressed genes using the Bayesian detector and then analyzing how many cellcycle genes were detected.
The cellcycle genes were defined as the list of open reading frames annotated to the biological process "cellcycle" (GO:0007049) or one of its descendants in the Gene Ontology [10]. Out of the 6178 open reading frames (ORFs) on the microarray used, 290 were annotated to "cellcycle". Selection was stringent in that only ORFs annotated with the evidence code inferred from mutant phenotype (IMP) were used.
The detector test statistic (in Equation 8, Methods section) was calculated for each ORF in each of the experiments. By varying the detection threshold, a ROC curve for each synchronization method was calculated. These ROC curves show detection of cellcycle genes from expression data using our detector (Figure 4). As an example, for a high value of the detection threshold τ = 0.95 (corresponding to strong evidence for periodic expression) in the αfactor experiment, 452 ORFs were detected as periodic out of which 36 are annotated to cellcycle with IMP evidence code (all ORFs detected as periodic in each experiment is available in the supplement). This corresponds to P_{D }= 0.12 and P_{FA }= 0.07 (see Figure 4). This result shows that there is only strong evidence for periodic expression for about 12% of the cellcycle genes. However, the cellcycle genes are highly overrepresented among the genes detected as periodically expressed. The probability of finding 36 (or more) out of 290 ORFs with cellcycle IMP among 452 out of 6178 genes calculated by the hypergeometric sampling distribution is ~0.001. Similarly, finding 56 among 516 genes in the cdc28 experiment corresponds to a probability of ~1e9. For lists of the ORFs with τ = 0.95 in the αfactor, cdc28 and cdc15 experiments see 1, 2 and 3.
Additional File 1. Detected genes in the αfactor experiment List of the ORFs detected as periodic in the alphafactor experiment studied, i.e. the ORFs for which τ = 0.95. This is an ASCII file, each line containing the ORF identifier, terminated by newline.
Format: LST Size: 4KB Download file
Additional File 2. Detected genes in the cdc28 experiment As Additional file 1 but for the cdc28 experiment.
Format: LST Size: 4KB Download file
Additional File 3. Detected genes in the cdc15 experiment As Additional file 2 but for the cdc28 experiment.
Format: LST Size: 2KB Download file
Figure 4. Detection of ORFs having a "cellcycle" annotation. Receiver operator characteristic (ROC) curves showing the detection of ORFs annotated as "cellcycle" (IMP). The Bayesian detector is most successful in detecting "cell cycle" genes in the cdc28 experiment, whereas detection in the cdc15 seems to occur at random.
Detection of genes with at least one phasespecific sequence motif
We also addressed the issue of whether promoter regions of genes detected as periodically expressed are enriched with respect to cellcycle transcription factor binding motifs. For this we used the compilation of 37 known and 319 putative sequence motifs described in [16]. Out of the 37 known motifs, seven are known to be bound by phasespecific transcription factors: MCM1, ECB, MCB, SWI5, SFF, CCA and SCB [17,18]. The compilation encompasses 5650 ORFs in the S. cerevisiae genome, out of which 5592 are present in the microarray dataset. There are 4390 ORFs having at least one of the seven cellcycle motifs in their upstream region. The ROC curve reflecting the detection of this group is shown in Figure 5. For example, with τ = 0.95 (corresponding to P_{FA}≈ 0.05 and P_{D}≈ 0.1 for αfactor and cdc28 in Figure 5), the probability of finding at least as many genes with either one of the cellcycle motifs as we do among the ORFs detected as periodic is 1.6e6 (αfactor) and 7.3e8 (cdc28). In the cdc15 experiment, ORFs with cellcycle motifs in the upstream region are not detected more often than what could be expected at random (pvalue ~0.5). Note in Figure 5 that for αfactor and cdc28, only a relatively small fraction of the genes with known cellcycle motifs are detected as periodically expressed when the probability of false alarm is low. One explanation could be that there are additional layers of regulation that prevents the periodic expression of genes whose promoter contain cellcycle motifs, such as epigenetic regulation. Therefore we decided to determine if such effects were equally prevalent for all known phasespecific motifs when studied individually.
Figure 5. Detection of cellcycle motifs. Receiver operator characteristic (ROC) curves showing detection of ORFs having any of the "cellcycle" motifs, clearly detection rates are very low overall, indicating a low abundance of functional motifs in the compilation of sequence motifs of Hughes et al. [16] (see text).
Detection of genes containing a specific cellcycle motif
We investigated how often the seven known cellcycle motifs occurred in the upstream region of genes detected as periodically expressed. Results for τ = 0.95 are shown in Table 1. None of the cellcycle motifs were overrepresented in cdc15 and cellcycle motifs CCA and SWI5 were clearly not overrepresented in upstream regions of the ORFs detected in the cdc28 experiment. Furthermore, for αfactor and cdc28 we noted that the detector was most successful at detecting genes with the MCB motif. Figure 6 shows the ROC curves for detection of genes with the MCB motif using the three different synchronization methods. The expression of genes with the MCB motif thus appears to be less influenced by other types of regulation than just regulation through the MCB motif, at least in the αfactor and cdc28 experiments.
Figure 6. Detection of the MCB motif. Receiver operator characteristic (ROC) curves corresponding to detection of ORFs having the MCB motif which appears to be the motif with the largest number of functional sites (see text). Detection in the cdc15 experiment appears close to random though.
Discovering novel phasespecific motifs
Since the probability of detecting a gene with at least one known cellcycle motif is relatively low for a low rate of false alarm (Figure 5) we investigated whether this could be explained by the presence of other motifs, still unknown to have phasespecific expressions. We thus tested for overrepresented motifs in the sets of ORFs detected as periodic in each of the experiments. We found that among the 319 putative motifs, 18 motifs were significantly overrepresented among the genes detected in the αfactor experiment. The putative motif MCM1' was also overrepresented in the cdc28 experiment, see Table 3.
Rather surprisingly, we also found the LYS14 motif, bound by the LYS14 transcription factor and involved in regulation of genes of the lysine biosynthesis, to be significantly overrepresented among ORFs detected as periodic in the αfactor experiment.
Table 1. Detector performance and significance for individual motifs Detector performance and probability of overrepresentation of motifs for detection threshold τ = 0.95.
Table 3. Overrepresented putative motifs in αfactor experiment Only motifs significant at a Bonferroni corrected 0.05 level shown. * = Not significant. Italics indicates motif also significant in cdc28, see the text.
Discussion
A periodic gene expression during the cellcycle corresponds to a phasespecific expression that in turn indicates phasespecific regulation of expression. Therefore, detection of periodic expressions in cellcycle experiments has gained a lot of attention as a computational tool for improved understanding of cellcycle regulation. Often the investigator knows the cellcycle period time roughly, and if not, it could be estimated experimentally. However, such estimates will in general be uncertain due to the small size of the sample sets available. In the Bayesian framework presented here, this kind of diffuse prior information about the period time can be taken into account. As shown by means of the simulations, this results in a detector which is more robust than detectors that rely on an almost exact estimate of the cellcycle period time.
One should note that although the prior on the period time used in the simulations presented in Figure 1 is not very precise it results in superior performance in comparison with the Fisher's gtest which considers all possible periodicities. One explanation for the superior performance of the Bayesian detector on varying wave forms is due to a higher rate of false positive classifications using the gtest, since it is designed to detect all possible single periodicities. When the signals encountered contain multiple peaks in their Fourier spectra, the Bayesian detector still identifies them well, since the dominant term in the Fourier expansion is that of the desired period. However, the Fisher's gtest is designed for detection of one and only one sinusoidal signal with unknown period time. The test statistic used is the ratio of the maximum power in the Fourier spectrum to the sum of powers at all frequencies. Thus, for waveforms with the same fundamental frequency, the gtest statistic will decrease in magnitude if the Fourier spectra contain more peaks.
As expected from theory, the Bayesian detector also outperforms the combination test by de Lichtenberg et al. when the estimate of the period time deviates from the true value. This deterioration of performance might be explained by the decrease in the combination test output caused by the deviation between the estimated and true period time. When the period time estimate is correct, then the Bayesian detector and the combination test have essentially the same performance but having access to the correct period time is not a realistic possibility in most cases.
Although the present work considers detection of periodically expressed genes with a period close to that of the cellcycle, it is important to note that the suggested Bayesian approach can easily be adjusted for design of other detectors. For example, it would be straightforward to design detectors for genes with periodic expression at a particular phase of the cellcycle by applying a nonuniform prior on the phase angle. Furthermore, the phenomena of attenuating signals due to desynchronization of phase in the cell population under study may be incorporated in the model. This would be an interesting direction for future work. However, we noted that the decrease in performance of the detector on sinusoidal waveforms with slow attenuation was small for a low rate of false alarm. Nevertheless we investigated time profiles of genes from set B1 of de Lichtenberg et al. [2] that were not detected (for high values of the detection threshold) but could not find a predominant signal of periodicity with attenuation for any of those genes. This suggests that those genes would not have been detected even if the Bayesian detector would have been modified to take attenuation into account.
The present work shows that the Bayesian detector detects more genes necessary for regulation of the cellcycle than a random detector, but that the probability of detection is low for reasonable rates of false alarm. A trivial explanation is that some genes having a mutant cellcycle phenotype are not periodically expressed as they are needed throughout the entire cellcycle, or regulated otherwise. Nevertheless, our approach provides quantitative estimates as to what extent genes necessary for cellcycle regulation are periodically expressed. As an example, for a stringent condition on periodicity, 36 out of 290 cellcycle genes are detected as periodic in the αfactor experiment. In addition the reasonable but still relatively high level of false alarm rate may have the trivial explanation that many genes not required for the cellcycle are also periodically expressed.
It is interesting to note the lack of strong coupling between periodic expression and cellcycle transcription factor motifs in budding yeast. There are many genes having cellcycle motifs in their promoter regions that are not detected as periodically expressed. This observation can be explained by complexities of gene regulation, e.g. regulation of chromatin accessibility [15,19]. It could also be expected that a large set of genes are conditionally cellcycle regulated depending on the environment.
It has been argued that the synchronization methods used induce many side effects not related to the core functionality of the cellcycle [12], thus it is to be expected that different genes are periodically expressed in different experiments. Our analysis supports this notion, since we found that there are a number of genes that are clearly periodically expressed in only one experiment. Expression could also be regulated by combinations of transcription factors. One study supporting this idea has shown that genes sharing pairs of motifs show more similar expression patterns than when considering genes sharing single motifs [17].
It is also interesting to note that many genes are detected as periodically expressed while their promoter regions contain no known cellcycle regulated elements. Wolfsberg et al. (1999) used a set of genes determined to be periodically expressed from microarray experiments to search for novel cellcycle sequence motifs in yeast [20]. Consistent with the present study, the presence of most known cellcycle motifs were not well correlated with periodic expression and a similar pattern can be expected for novel motifs. However, a strong sequence signal for MCB and MCBlike motifs was found [20]. Our results also indicate that the MCB motif is the motif that is most strongly associated with periodic transcription. One interpretation is that genes regulated by the MCB motif have less complex regulation of expression than genes with other motifs. Similar observations were made by Cho et al. who used manual identification of periodically genes in cdc28 synchronized S. cerevisiae and studied the 500 bp upstream region of the periodically expressed genes [15]. They found that the presence of MCB and SCB motifs were associated with periodicity. However, a large number of the periodically expressed genes did not have MCB, SCB or any other of the known phase specific motifs [15].
We find that several putative sequence motifs [16] are significantly overrepresented in the genes detected as periodically expressed (see Table 3). For instance, genes involved in DNA metabolism, morphogenetic activities and organization of chromosome structure contain some of these motifs. Furthermore, the putative motifs SFF' and MCM1' are associated with the known cellcycle motifs SFF and MCM1 respectively. Thus, our findings support earlier notions that these putative motifs are important for periodic expression of cellcycle related genes.
In addition to our simulations and experiments on real data already described, the Bayesian detector has also been applied to the biological test sets described by de Lichtenberg et al. [2]. Since these test sets only contain positive examples, it is important to note that they do not provide any estimate at all of the probability of false alarm which is as important as the probability of detection in a ROC analysis. Therefore the results obtained using these test sets are available only as supplemental information [see 4, 5, 6].
Additional File 4. Results from the de Lichtenberg test sets, αfactor Results, experiment wise, for application to the B1, B2 and B3 test sets of de Lichtenberg et al, "Comparison of computational methods for the identification of cell cycle regulated genes", Bioinformatics (2005). The fraction of the test set detected vs the number of ORFs from the top of the ranking list of periodicity is plotted for each of the experiments. It is important to note when comparing results to those presented in by de Lichtenberg (ibid) that no period time fitted to the expression profiles of genes known (presumed) to be periodically expressed where used. Naturally such a fit will improve detection of those genes.
Format: GIF Size: 5KB Download file
Additional File 5. Results from the de Lichtenberg test sets, cdc28 As Additional file 4 but for the cdc28 experiment.
Format: GIF Size: 5KB Download file
Additional File 6. Results from the de Lichtenberg test sets, cdc15 As Additional file 4 but for the cdc15 experiment.
Format: GIF Size: 5KB Download file
Conclusion
Learningfree detectors that do not rely on a set of training examples are perhaps the only alternative when trying to identify the periodically expressed genes of poorly characterized organisms for which there is also limited knowledge available from closely related organisms. Our simulations indicate that systematic incorporation of a priori knowledge about the cellcycle period time using Bayesian inference may improve detection performance in comparison with two other recently proposed learningfree detectors.
When applying our detector to real data from Spellman et al. [1], we found that genes detected as periodically expressed only contain a small fraction of cellcycle genes inferred from mutant phenotype. We also found a statistical overrepresentation of cellcycle regulated sequence motifs among the genes detected as periodically expressed. Moreover, among the genes detected, we found 19 motifs that have not previously been implicated in regulation of periodic expression in budding yeast. Their roles can now be tested by direct experiments and improve our understanding of cisacting mechanisms underlying periodic transcription.
Methods
The detection problem
The Bayesian detector derived here should, after collection of a temporal profile consisting of samples y(t) from a time interval [a, b], determine whether the temporal profile contains a periodic signal with a period close to the cellcycle time, or not. We express this as the binary hypothesis test
H_{1 }: y_{t }= A_{0 }+ A_{1 }cos(ω t + θ) + u(t)
H_{0 }: y_{t }= B + v(t) (1)
where both u(t) and v(t) are white Gaussian stochastic processes (i.e. no time dependencies) with unknown standard deviations σ_{u }and σ_{v}, respectively. Model H_{1 }is regarded as a truncated Fourier series of the actual periodic signal. Adding higher order Fourier series terms as in [3] would be possible but it is expected that for the signals of interest to the investigator, the term containing the fundamental frequency which corresponds to the cellcycle period time will be much larger than the rest. Support for this assumption comes from the reported success of the detectors used in [1,2] and from the results reported here. Furthermore, in typical microarray assays of reported cellcycle experiments, a conventional principal component analysis (PCA) of the expression data covariance matrix shows that the variance in the data can be ascribed to sinusoidal periodic expression with a period time close to that of the cellcycle [21,22].
One should note that the terms u and v in Eq. (1) may contain both measurement noise and model errors. We would also like to stress that by employing a Gaussian distribution for u and v, the detector becomes sensitive only to the first and second order moments of the observed errors ε_{1}(t)=y_{t }A_{0}A_{1 }cos(ω_{o}t+θ) and ε_{0}(t) = y_{t } B respectively. This partly explains the robustness of our approach against different distributions of measurement noise and model errors. As long as the first and second order statistics are equal for different problems, the Bayesian test statistic will stay unchanged.
In the Bayesian framework the probabilities for model H_{1 }and H_{0 }respectively are obtained as
where i = 0 or i = 1. P(DH_{i}) is the likelihood and P(H_{i}) the prior probability of model H_{i}. The likelihood is obtained by marginalization of the unknown parameters. If model H_{i }is parametrized by ψ_{i }then
where P(ψ_{i }H_{i}) reflects the prior information about the parameters. Here we assume that the investigator has no prior information about the parameters except for the period time T = 2π /ω. Furthermore, we assume the parameters are independent. Hence the prior distribution factors with respect to the parameters; for model H_{0 }we have p(B, σ_{v}) = p(B)p(σ_{v}), and for H_{1 }we get p(A_{0}, A_{1}, ω, θ, σ_{u}) = p(A_{0})p(A_{1})p(ω)p(θ)p(σ_{u}). As described before, the prior p(ω) is chosen to reflect the knowledge the investigator has about the cellcycle period time T. If it is known for certain that T is in the interval [T_{1}, T_{2}], a uniform prior on this interval is suitable. In the present work we focus on the situation when the investigator knows the expected mean and variance of the prior distribution of ω and we therefore use a Gaussian prior distribution for ω (the maximum entropy distribution). The hyperparameters of p(ω), the mean μ_{ω }and standard deviation σ_{ω }must thus be chosen to reflect the prior knowledge of the investigator. As mentioned earlier, the prior for the cellcycle period time is chosen so that the reported interval of corresponding frequencies encompassed 70% of the probability mass in a Gaussian distribution, with its mean value equal to the average of the reported frequency interval
The priors for the remaining parameters are chosen to reflect the lack of information, thus (improper) uniform priors are used for B and A_{i}. The prior for θ is uniform on [0, π]. For the standard deviations σ_{i }we use Jeffrey's prior which is suitable due to its invariance to scaling of the variables involved [14]. Thus the resulting integrals (see Eq. 3) to be computed become
for model H_{1 }and
for model H_{0}. The last double integral is analytically tractable but the former integral over the five parameters is not [23]. To avoid time consuming numerical integration of that integral, we are using the standard Bayesian Information Criterion (BIC) approximation [13] which is defined as:
where superscript MAP indicates that the log likelihood should be evaluated at the maximum a posteriori (MAP) parameter setting. d is the number of parameters in model H_{i }and N the number of samples available. Now the log odds of model H_{1 }over H_{0, }log(P(H_{1}D)/P(H_{0}D)), may be approximated as BIC(H_{1}) BIC(H_{0})  τ where τ= P(H_{i})/P(H_{0}). Since the detector assigns a gene as periodic when this approximation is above some predefined threshold τ ', the detection rule is to assign a gene as periodic if BIC(H_{1}) BIC(H_{0})  τ > τ'. Equivalently, assign a gene as periodic if
g(D) = BIC(H_{1}) BIC(H_{0}) > τ (7)
where g(D) is the detector test statistic and τ now is a redefined arbitrary detector threshold. Also, note that the test statistic g(D) in (7) is an approximation of the loglikelihood ratio which should be used to maximize P_{D }for a given P_{FA }(the NeymanPearson criterion) [24]. The output of the detector is a function of the time (t) and gene expression (y) measurements. Finally, in order to obtain a test statistic that is bounded to the interval [0, 1], the value of g(D) was converted into a probability by the monotonous transformation
The detector was applied to the simulated and real data. Although the test statistic is an approximation of the log likelihood ratio, as the simulation results demonstrate, it is still powerful for detection. Choice of detection threshold is arbitrary and must be decided by the investigator. Classical pvalues for this test statistic may always be generated by means of resampling techniques but this is not a topic covered in this work.
Implementation
The detector was implemented in the R language statistical processing environment (http://www.rproject.org webcite) and is available upon request from the authors. The algorithm simply involves locating the MAP estimates for the models (the integrands of Eq. 4 and 5) and the evaluation of Eq 8. For H_{0}, an analytical solution is used. For H_{1 }a numerical approximation was found by optimisation using the NelderMead simplex method [25]. The initial conditions were chosen such that the standard deviation σ_{u }and A_{0 }were equal to the sample standard deviation and mean of the expression measurement. The mean of the prior distribution of ω was used as its initial value. The initial values of A_{1 }and θ were determined from the data by an analytical least squares fit of the model A_{1 }cos(ω_{o}t+θ) to the time series where ω_{o }is the initial value of ω.
Receiver operator characteristics
The Receiver Operator Characteristic (ROC) has been developed by the signal detection community to study how the probability of detection (P_{D}) varies with the probability of false alarm (P_{FA}) [24] and is also commonly used in medical decision making [11]. For each selected value of the decision threshold τ∈ [0, 1], we obtain different pairs of values(P_{FA}, P_{D}). A ROC graph displays a curve indicating these pairs as τ is changed continuously. Note that if the detector is inferring signal (hypothesis H_{1}) at random with probability P_{D}, the false alarms will occur at a rate P_{D }resulting in P_{FA }= P_{D}. Thus, a random detector will yield a ROC curve consisting of a diagonal line.
Authors' contributions
CRA and MGG derived the Bayesian detector. CRA implemented the software and performed the experiments. AI analysed the results and supervised the study together with MGG. All authors wrote, read, and approved the paper.
Table 2. Overrepresented known motifs Only motifs significant at a Bonferroni corrected 0.05 level shown. * = Not significant. Note that LYS14 is not a known cellcycle motif.
Acknowledgements
This work was supported by the Swedish Research Council, Wallenberg Consortium North, Carl Tryggers stiftelse (Stockholm), the Göran Gustafsson foundation (Stockholm), and the faculty of science and technology (Uppsala University).
We are grateful to Prof J Komorowski at the Linnaeus Center for Bioinformatics, Uppsala University for providing a stimulating research environment and valuable comments on the manuscript as well as to Dr T.R. Hvidsten for providing us with the compilation of sequence motifs. Finally we would like to thank the anonymous reviewers for their valuable comments.
References

Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B: Comprehensive identification of cell cycleregulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization.
Mol Biol Cell 1998, 9:32733297. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

de Lichtenberg U, Jensen LJ, Fausboll A, Jensen TS, Bork P, Brunak S: Comparison of computational methods for the identification of cell cycleregulated genes.
Bioinformatics 2005, 21:11641171. PubMed Abstract  Publisher Full Text

Lu X, Zhang W, Qin ZS, Kwast KE, Liu JS: Statistical resynchronization and Bayesian detection of periodically expressed genes.
Nucleic Acids Res 2004, 32:447455. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Wichert S, Fokianos K, Strimmer K: Identifying periodically expressed transcripts in microarray time series data.
Bioinformatics 2004, 20:520. PubMed Abstract  Publisher Full Text

Zhao LP, Prentice R, Breeden L: Statistical modeling of large microarray data sets to identify stimulusresponse profiles.
Proc Natl Acad Sci U S A 2001, 98:56315636. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Luan Y, Li H: Modelbased methods for identifying periodically expressed genes based on time course microarray gene expression data.
Bioinformatics 2004, 20:332339. PubMed Abstract  Publisher Full Text

Johansson D, Lindgren P, Berglund A: A multivariate approach applied to microarray data for identification of genes with cell cyclecoupled transcription.
Bioinformatics 2003, 19:467473. PubMed Abstract  Publisher Full Text

Arellano M, Moreno S: Regulation of CDK/cyclin complexes during the cell cycle.
Int J Biochem Cell Biol 1997, 29:559573. PubMed Abstract  Publisher Full Text

Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, IsselTarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium.
Nat Genet 2000, 25:2529. PubMed Abstract  Publisher Full Text

Zweig MH, Campbell G: Receiveroperating characteristic (ROC) plots: a fundamental evaluation tool in clinical medicine.
Clin Chem 1993, 39:561577. PubMed Abstract

Shedden K, Cooper S: Analysis of cellcycle gene expression in Saccharomyces cerevisiae using microarrays and multiple synchronization methods.
Nucleic Acids Res 2002, 30:29202929. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Gelman A: Bayesian data analysis. In Texts in statistical science. 2nd edition. Boca Raton, Fla., Chapman & Hall/CRC; 2004:xxv, 668.

Jaynes ET, Bretthorst GL: Probability theory : the logic of science. New York, Cambridge University Press; 2003:xxix, 727.

Cho RJ, Campbell MJ, Winzeler EA, Steinmetz L, Conway A, Wodicka L, Wolfsberg TG, Gabrielian AE, Landsman D, Lockhart DJ, Davis RW: A genomewide transcriptional analysis of the mitotic cell cycle.
Mol Cell 1998, 2:6573. PubMed Abstract  Publisher Full Text

Hughes JD, Estep PW, Tavazoie S, Church GM: Computational identification of cisregulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae.
J Mol Biol 2000, 296:12051214. PubMed Abstract  Publisher Full Text

Pilpel Y, Sudarsanam P, Church GM: Identifying regulatory networks by combinatorial analysis of promoter elements.
Nat Genet 2001, 29:153159. PubMed Abstract  Publisher Full Text

Breeden LL: Periodic transcription: a cycle within a cycle.
Curr Biol 2003, 13:R318. PubMed Abstract  Publisher Full Text

Lemon B, Tjian R: Orchestrated response: a symphony of transcription factors for gene control.
Genes Dev 2000, 14:25512569. PubMed Abstract  Publisher Full Text

Wolfsberg TG, Gabrielian AE, Campbell MJ, Cho RJ, Spouge JL, Landsman D: Candidate regulatory sequence elements for cell cycledependent transcription in Saccharomyces cerevisiae.
Genome Res 1999, 9:775792. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Holter NS, Maritan A, Cieplak M, Fedoroff NV, Banavar JR: Dynamic modeling of gene expression data.
Proc Natl Acad Sci U S A 2001, 98:16931698. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Holter NS, Mitra M, Maritan A, Cieplak M, Banavar JR, Fedoroff NV: Fundamental patterns underlying gene expression profiles: simplicity from complexity.
Proc Natl Acad Sci U S A 2000, 97:84098414. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Bretthorst GL: Bayesian spectrum analysis and parameter estimation. New York, SpringerVerlag; 1988:xii, 209.

Van Trees HL: Detection, estimation, and modulation theory. New York,, Wiley; 1968:3 v..

Nelder JA, Mead R: A simplex algorithm for function minimization.